Constraint grammar—it is a natural language processing formalism with great two distinctions: it routinely scores amongst the highest in tasks such as partofspeech tagging and wordsense disambiguation, with Fscores at around 99%; and it has made some of the most dubious choices in programming language syntax in history. Though its specification has changed tremendously since CG1, it is nontheless a grammar formalism which sees a lot of usage. One natural question to ask of any grammar formalism is “how expressive is it?”
Over the weekend, inariksit visited me, and we decided to find out.
It’s not immediately obvious how to even approach this question, as constraint grammar doesn’t generate strings per se. It simply constrains existing, ambiguous strings. We took the following approach: we view a constraint grammar as a formal language \(\mathcal{L}\), generated over an alphabet \(\Sigma\). We generate the strings in our language by passing maximally ambiguous strings of every length to the grammar. With maximally ambiguous, I mean those strings where each position contains the entire alphabet, so \(\langle \Sigma \rangle_n\). A constraint grammar is said to accept a string \(w\) of length \(n\) if, when we pass \(\langle \Sigma \rangle_n\) as an input to the CG, \(w\) is one of the possible interpretations of its output.^{1}
The specification of CG3 mentions tags such as EXTERNAL
, which
passes information to an external command. So constraint grammar is
obviously Turing complete. However, that’s a little bit boring, so
let’s see what we can say about the expressiveness of the absolute
core of constraint grammar: REMOVE
with sections. If we leave out
sections, there is no recursion, and therefore the language will be
strictly finite and boring. If we leave out REMOVE
then there is no
way to restrict strings, so we’d only have the languages
\(\Sigma^*\) for any \(\Sigma\).
There are a few concessions we will allow ourselves. If we had MAP
,
ADD
, or any such other command, we would have a way to store
information. In this strict fragment, all we have is the current set
of symbol assignments. Therefore, we will allow ourselves a second
alphabet \(\Sigma\prime\) of hidden symbols –
i.e. symbols that we are not allowed to pass to the output. In
addition, we update our definition of \(\mathcal{L}\) to state that
we pass in \(\langle \Sigma \cup \Sigma\prime
\rangle_n\). This is not strictly necessary in CG3, as we could
use APPEND
to add these hidden characters, but we would like to stay
as faithful to our fragment as possible.
One last hurdle is that constraint grammar has no notion of
failure. The worst that can happen is that a grammar changes
nothing. Worse so, if there is only one reading left, the REMOVE
command will have no effect. So one more concession we make is that we
allow ourselves to use the REMCOHORT
command – which removes an
entire “cohort”, or “position” in our terminology – for the sole
purpose of deleting the entire string if it is not accepted.
From here on out, when we say ‘CG3’, we are referring to this fragment of constraint grammar.
CG3 is not regular; the language \(a^nb^n\)
In this section we show that CG3, restricted to sections and REMOVE
is not regular. We show this by implementing a grammar for the
counting language \(a^nb^n\).
The first thing we do is to try and detect the edges of the string. CG3
has “magical” constants for this, called >>>
and <<<
for the left
and right edge, respectively. However, we cannot use those. Instead,
we define them ourselves using two hidden variables, which we also
call >>>
and <<<
. We do this as follows:
Initially, all positions will be labeled with both >>>
and
<<<
. These above rules check whether there is any position
preceding or succeeding the current position, and if so, delete >>>
or <<<
. As a result, the first position will be the only one tagged
>>>
, and the last the only one tagged <<<
.^{2}^{3}
Next, we note that all strings in the language \(a^nb^n\) are of
even length, and that every even length corresponds to exactly one
string. Therefore, we must reject all strings of uneven length. We
assume two more hidden symbols, EVEN
and ODD
. We can use these to
label whether a position is even or odd: we know the first position is
odd, so we delete EVEN
; we know that positions following odd
positions must be even, so we delete ODD
; and we know that positions
following even positions are ODD
, so we delete EVEN
…^{4}
It’s exactly this “marking as even by deleting odd” that makes it a bit of a confusing read, so if you’d like to play around with an example, my full code with examples is available here, and vislcg3 is available here.
Anyway, after performing this labelling, we can check if the last position is even, and if so, delete all positions:^{5}
Now that we are certain that we only accept evenlength strings, it is safe to say that the first symbol must be an \(a\), and the last must be a \(b\):^{6}
And now it’s only a matter of slowly growing these \(a\)s and
\(b\)s until they meet. We do this as follows: in each pass, we mark
the position after the last definite \(a\) as a candidate for
\(a\) (written OPT_A
), and do likewise for the last position
before the first definite \(b\). Then we mark each candidate
\(a\) and \(b\) as definite, and we continue:^{7}
The grammar described so far exactly expresses the language \(a^nb^n\).^{5} Since this language is not regular, we can conclude that constraint grammar is not regular.
CG3 is not contextfree; the language \(a^nb^nc^n\)
In this section we show that CG3, restricted to sections and REMOVE
is not contextfree. We show this by implementing a grammar for the
counting language \(a^nb^nc^n\).
The language \(a^nb^nc^n\) has us divide strings whose length is a
multiple of three into three even chunks. The first part of this is
obviously to find the bounds of the input string, as before, and make
sure that it has a length divisible by three. We can trivially extend
our previous approach – now abandoning “even” and “odd” in favour of
X1
, X2
and X3
:
Note that, somewhat counterintuitively, REMOVE X1_OR_X2
^{8}
removes both X1
and X2
, but 0 X1_OR_X2
matches if the current
position still has either option.
Now that we can be sure that our string is of some length \(3n\), we
can proceed to divide it into three equal chunks. One good way to do
this, is to start by finding the middle. This is exactly what we did
in our grammar for \(a^nb^n\). Below we implement the same, but now
without using SELECT
, as using this would erase all other
tags. For this, we assume four new hidden symbols FST
, SND
– for
first and second half – and OPT_*
varieties:
Once we’ve divided the word in half, it becomes fairly easy to point out the middle. Below, we mark the first position as \(a\), the last position as \(c\) and the middle position as \(b\):^{9}
And finally, we grow \(a\) and \(b\), and \(b\) and \(c\) towards one another as we did before. Note that we have to let \(a\) and \(c\) grow twice every time we grow \(b\), because \(b\) is growing in two directions at the same time:
The grammar described so far exactly expresses the language \(a^nb^nc^n\). Since this language is not contextfree, we can conclude that constraint grammar is not contextfree.
Beyond ContextFree
It seems pretty obvious that a language formalism whose only construct has the power to observe all of its surrounding context ends up being at least contextsensitive. I could continue. It is still fairly straightforward to generate the language \(a^nb^nc^nd^n\) – divide into half, and divide halves into half – and using similar strategies, you can keep on constructing CGs which compute the counting language \(\sigma_1^n\cdots \sigma_k^n\) for any \(k\) as long as you can come up with new strategies for prime numbers.. but this won’t do us a whole lot of good – at least, it won’t help us escape the class of contextsensitive languages.
So for now, let’s leave it at this. I’m a little bored of programming CG at any rate. If you want to have a go, my full code and examples are available here, and vislcg3 is available here.

An obvious downside to this approach is that for finite languages (as well as infinite ones, but duh) the CG will never stop generating the language, as we have to feed it \(\langle \Sigma \rangle_n\) for every \(n\). But we’re playing fast and loose here, so what gives. ↩

CG3’s magic constants are just outside of the string, whereas ours are right at the edge of the string. Therefore, all indices using magic constants are moved by one. ↩

While we use
BEFORESECTIONS
andAFTERSECTIONS
throughout this post, their usage is not strictly necessary. The grammar also works if everything is executed under a singleSECTION
. ↩ 
Note that
LINK
is a conjunction, but one in which indices in the second argument are interpreted from the perspective of the position matched in the first. ↩ 
We have chosen to describe faillure by outputting the empty string. If we would have been more careful, we could have added a dedicated symbol for failure. However, under our current definitions we compare languages minus the empty string. ↩ ↩^{2}

We can use
SELECT
, since it is equivalent to callingREMOVE
with the complement – i.e. remove everything but its argument. ↩ 
Note that
SELECT A
has no effect if \(a\) is not a valid option, and thatREMOVE A
has no effect if \(a\) is the only remaining option. ↩ 
When we write
X_OR_Y
, this means that we have defined a “set” asSET X1_OR_X2 = X1 OR X2;
. The reason for this is that CG3 does not allow the inline use of set primitives. ↩ 
If the string has an evennumbered length, we in fact mark the middle two positions as \(b\). ↩