Linguistics 484 - Grammars Ling484 Notes A. C. Brett
Department of Linguistics
University of Victoria
Clearihue C139
Last updated: 9 March 2003

Phrase-Structure Grammars and the Chomsky Hierarchy

This note presents of on overview of the Chomsky hierarchy of phrase-structure grammars and of the languages they generate. The information covered here serves principally to provide context for the context-free grammars and languages. The presentation consists of brief observations on the following topics:

Phrase-Structure Grammars. A phrase-structure grammar G is defined as an ordered quadruple of the form

G = VN, VT, S, P
where the entries are identified as follows: The sets VN and VT are not empty, but they are normally disjoint so that their intersection is empty; that is, VN VT = . The vocabularies VT and VN are sometimes identified as the terminal alphabet and the nonterminal alphabet, respectively. The symbols of VN also may be referred to as variables, metasymbols, or intermediate symbols, while the symbols of VT may be called object symbols. A phrase-structure grammar itself also may be known as a generative grammar.

Productions. Members of the production set P of G have the general form

a b
where a (VT VN)+ and b (VT VN)*, subject to the following conditions:

Sentences. Derivations are undertaken by beginning with the starting symbol S of a grammar G and applying productions from P. A derivation terminates when a sentential form is obtained that consists solely of words from the vocabulary VT of terminal symbols. A string s V*T is said to be a sentence generated by G if and only if

S * s

Languages. The language generated by G, denoted L(G), where L(G) V*T, is the set of all strings of terminal symbols derivable from S. Thus,

L(G) = {s | s V*T, S * s}

Equivalent Grammars. Two phrase-structure grammars Ga and Gb with the same terminal vocabulary VT, but with production sets Pa and Pb, respectively, where Pa Pb, could generate different languages; however, they also may generate the same language. Different grammars which generate the same language are said to be equivalent; that is, two grammars Ga and Gb are equivalent if and only if

L(Ga) = L(Gb)

Chomsky Hierarchy. Phrase-structure grammars may be classified according to the form of their productions. The conventional classification scheme is known as the Chomsky Hierarchy. According to this scheme, there four types of phrase-structure grammar. The first of these, identified as the Type 0 grammars, consists of those grammars which have no restrictions on their productions other than the conditions noted above. The other three classes of grammar, identified as the Type 1 through Type 3 grammars, are defined by increasingly restrictive conditions on their productions, with the Type 3 grammars being those that are subject to the most stringent conditions. The grammars in the Chomsky hierarchy are also identified by name, and these names together with the forms of their productions are shown in the following table:

Type Name Productions
Unrestricted a b with a (VT VN)+ and b (VT VN)*
Context-Sensitive a1Aa2 a1b a2 with A VN and a1, a2 (VT VN)* and b (VT VN)+
Context-Free A b with A VN and b (VT VN)*
Regular, Finite A bB or A b with A, B VN and b V*T

Hierarchy of Languages. A language over a vocabulary VT is said to be of Type i if it can be generated by a Type i Grammar. For exampe, a Type 2 Language, also known as a Context-Free Language, is generated by a Type 2 Grammar, that is, by a Context-Free Grammar. It is possible to speak of the family of languages over VT generated by the grammars of Type i, and identify this family or class of languages by writing Li. For example, the family of languages generated by the context-free, or Type 2 grammars may be denoted by L2. The families of languages generated by the grammars in the Chomsky hierarchy form an inclusive hierarchy that may be represented as

L3 L2 L1 L0
Thus, for example, the context-free languages are included in the family of context-sensitive languages, and the context-free languages include the regular or finite languages. The inclusion is proper so that, for example, there are context-free languages that are not regular or finite.

l-Productions. A l-Production is a rewriting rule of the form

A l
with its right-hand side being the null string l. Note that the condition on the right-hand sides of context-free rules that they be strings in (VT VN)* includes the possibility that the production set might contain l-productions. It can be demonstrated that l-productions are required only if the language generated by the grammar includes the null string as a sentence. Hence, an equivalent grammar can be constructed wherein all of the l-productions have been eliminated, and replaced by the two rules
S' l
S' S
The new symbol S' is added to the nonterminal vocabulary of the grammar and the starting symbol of the resulting grammar is replaced by S'. The production S' l is then applied only to derive the sentence consisting of the null string.

l-Free Grammars. A language that does not include the null string as a sentence is called l-free and a grammar that generates a l-free language is said to be l-free itself. Note that the form of the context-sensitive productions presented above means that they generate only l-free languages. A context-sensitive language can include the null string as a sentence. In such cases, the rule S l is added to the production set of the generating grammar; but, S cannot appear in the right-hand side of any of the other productions.

Normal Form Grammars. A Normal Form for a given type or class of grammar is a grammar with specified additional conditions imposed upon its productions, but which is equivalent to the grammars in the given type. For example, the productions of a Type 2, or context-free grammar have the general form

A b with A VN and b (VT VN)*
The right-hand sides of such rules may consist of strings that include a mixture of terminal and non-terminal symbols. Thus, the rule
NP the N
with NP, N VN and 'the' VT is an acceptable context-free production. Normally, we would replace such productions with rules of the form
DET the
with the symbol DET being added to the nonterminal vocabulary VN if it not already included therein. The right-hand sides of the rules then consist either of a single terminal symbol, or of a string of non-terminals. Hence, all the rules of the context-free grammar have either one or the other of the following two general forms:
A a
B b
with A, B VN, a VT, and b V*N. The productions of any context-free grammar can be replaced with rules of the foregoing form, with additonal symbols being added to the vocabulary VN of non-terminal symbols as required. The resulting grammar generates the same language as the original, and hence, is equivalent to the original context-free grammar.

Chomsky Normal Form. A l-free context-free grammar is said to be in Chomsky Normal Form if each of its productions has one of the following forms:

A a
with A, B, C, D VN and a VT. If the original grammar is not l-free, then the production set of the equivalent Chomsky Normal Form grammar includes the production S l. Because the right-hand sides of rules of the second variety always consist of exactly two symbols, Chomsky Normal Form grammars are sometimes called binary grammars.

If the productions of the original grammar are already in the normal form described above, namely A a B b with A, B VN, a VT, and b V*N, then an equivalent Chomsky Normal Form grammar can be constructed by replacing those rules of the original grammar wherein the right-hand side string b consists of more than two nonterminal symbols. For example, a rewriting rule such as

is eliminated and two rules such as the following are included in the production set of the normal form grammar:
Chain rules such as
wherein the right-hand side consists of a single nonterminal symbol must also be eliminated by, in these cases, treating collective nouns as comprising NPs and intransitive verbs as VPs.

In the foregoinging examples, the nonterminal symbols in the normal form grammar are already in the nonterminal vocabulary of the original grammar. In general, however, the nonterminal vocabulary of the Chomsky Normal Form grammar will include new symbols, not included in the nonterminal vocabulary of the original grammar.

Self-Embedding Grammars. A grammar is said to be self-embedding if there is a symbol A VN and a derivation such that

A + bAg where b,g (VT VN)+
For example, a context-free grammar with the rules
is self-embedding because of the derivation
which might be included in the generation of a sentence such as
'a cat the dog saw ran.'
The clause 'the dog saw' is sometimes described as being centre-embedded in the matrix sentence 'a cat ran.'

Type 3 Grammars. A context-free grammar that is not self-embedding generates a Type 3 language. Thus, it is the self-embedding property of some context-free grammars which enables them to generate languages that are not Type 3. The rules of a Type 3 grammar, that is, a finite or regular grammar, all have one or the other of the following two forms

A bB, or
A b
with A, B VN and b V*T. Consequently, although a Type 3 rule might be recursive, with the form
A bA
and could yield right-embedded constituents, Type 3 rules cannot produce a centre-embedding because there can be no string to the right of the noterminal symbol A.

A Type 3 grammar is a Type 2 grammar because the left-hand side of each Type 3 production consists of a single nonterminal symbol. Furthermore, for any non-self-embedding Type 2 grammar there is an equivalent Type 3 grammar; that is, provided it is not self-embedding, a context-free grammar can be converted to a Type 3 grammar.

Type 3 Normal Forms. Every Type 3 language, that is, every finite or regular language, can be generated by a grammar all the rules of which have one or the other of the following forms:

A aB with A,B VN and a VT
A l with A VN
In a variant of this normal form Type 3 grammar, rules of the form
A a where a VT
can be substituted in place of those with the form A l.

A rule such as

S dogs like NP
with S,NP VN and the string 'dogs like' V*T is a legitimate Type 3 production. To obtain normal form rules, a new symbol such as VPp can be added to the nonterminal vocabulary of the original Type 3 grammar, and the following rules can be added to its production set:
S dogs VPp
VPp like NP
with S,NP,VPp VN and 'dogs','like' VT. The original rule S dogs like NP is eliminated. If this rule happens to be the only one in the original grammar that is not in the normal form, then its replacement with the two rules cited above, and the addition of the new symbol VPp to the nonterminal vocabulary will yield a Type 3 normal form grammar. In other words, if G where
G = VN, VT, S, P
denotes the original Type 3 grammar, then the normal form Type 3 grammar G' can be defined as
G' = V'N, VT, S, P'
V'N = VN {VPp}
P' = P - {S dogs like NP} {S dogs VPp, VPp like NP}

Type 3 Grammar Parse Tree

Type 3 Grammar Parse Tree. The accompanying figure shows the structure generated by a Type 3 normal form grammar for the sentence 'dogs like them.' Observe that, because all the rules of the grammar are either of the form A l, or of the form A aB, such as in this example S dogs VPp and VPp like NP, every node of the tree, except for the lowest one labelled H here, has two branches. The left-hand branch of the two connects the node to a word, while the right-hand branch connects it to another node. Thus, the parser tree is a strictly right-branching structure, with the last branch ending in a node connected to the null string.

Note that, according to the alternative Type 3 normal form, all the rules have one or the other of the two forms A a or A aB. Thus, application of the alternative normal form to this example, would see the rule H l would be eliminated, and the rule NP them H replaced by NP them. The upper part of the tree generated by this grammar will then be the same as that shown in the figure; but the lowest node of the new tree will be labelled NP, and its single branch will connect to the word 'them.'

Linguistics 484 Notes Index Top of Page