Linguistics 484 - Grammars Ling484 Notes A. C. Brett
Department of Linguistics
University of Victoria
Clearihue C139
Last updated: 19 January 2005

Syntactic Structure Representations

This section is intended to provide a brief overview of some of the topics that will be covered during the course. The material is presented in the context of examining several different, but equivalent representations of sentence structure. The sentence analysis model discussed here is based in part on the constituent theory of sentence structure. Therefore, the section begins with a review of some of the intuitive notions that give rise to the constituent theory of sentence analysis and to the concept of a phrase-structure grammar. Other topics discussed include the following:

The constituent-structure theory of sentence analysis derives from the perception that the words of a sentence seem to combine naturally into recognisable units. A simple active declarative sentence, for example, appears to consist of three components: a subject, a verb, and an object, where the subject is a unit which includes words that identify the author or agent of an action; the verb identifies this action, and the object consists of words that identify the target, result, or theme of the action.

While the subject and object can each consist of a single word, that being a noun which names someone or something, both constituents often include other words, such as adjectives, that describe, explain, or elaborate whatever the noun names. It is the resulting collection of words that is then recognised as comprising the unit which functions as the subject or object of the sentence.

Furthermore, the object and the verb together are seen as making up another, more comprehensive sentence constituent that is identified as the predicate. A sentence is therefore perceived as being composed of two major functional units, namely, its subject and predicate. The composition or structure of a sentence is consequently regarded as a hierarchy formed as successive collections of words are combined into progressively more comprehensive or inclusive constituents.

These perceptions of sentence structure give rise the intuition that constituents such as the subject and predicate somehow play a role in the processing a sentence by the cognitive system. Experimental evidence can, in fact, be cited that lends credence to claims for the psychological reality of these sentence constituents and which corroborates the intuition that they are realised in cognitive processes. (See, for example, Fodor et al., 1974.) It is consequently supposed that the cognitive system does undertake processes whereby combinations of words are identified as comprising the functional units of a sentence. These cognitive processes entail the recognition that there are several different kinds or varieties of word.

Lexical Categories. Although some kinds of word are recognised as sharing characteristics, they are also identified by the cognitive system as differing in terms of the functions they perform in the sentence. The nouns and adjectives, for example, share the characteristic that they both serve to identify things, but they are recognised to play different roles in a sentence, with a noun naming something and an adjective describing it. Both varieties of word are identified as differing from others such as the verbs that are associated with actions, and adverbs that elaborate upon actions represented by verbs or that amplify descriptions conveyed by adjectives. The cognitive system obviously does not explicitly acknowledge these word classes as such, but operates on the basis of the different qualities of the words comprising the classes.

The distinctions and similarities among varieties of word form the basis of the familiar convention for classifying individual words in terms of their part of speech. According to this convention, parts of speech, or lexical categories as they are also known, are identified by labels such as V for the verb and Adv for the adverb categories, with the N and the A or Adj labels being associated with the noun and adjective lexical categories, respectively. These labels of course play no role in cognitive processes; the cognitive system is not conceived of as a symbol processor. Rather, the labels form the basis of a system for representing and describing the analysis of sentences.

Syntactic Categories. The system for classifying words into lexical categories therefore extends to the classification of sentence constituents into syntactic categories. These categories are usually described as phrases, and their characteristics or properties derive from the properties or characteristics of the words that make them up. The predominant properties that characterise each particular variety of phrase, and which establish the role it plays in a sentence, are determined by the properties of the principal or head word that it includes.

The head word of the subject or object of a sentence, for example, is a noun so that these constituents are classified as noun phrases and are identified by the NP label. Since the verb is regarded as the head of the predicate, this consitutuent is classified as a verb phrase and is identified by the VP label. A sentence can thus be represented as consisting of noun phrase and verb phrase syntactic categories. The sentence itself is classified as a syntactic category with the label S.

A formalisation of the foregoing classification system, and of the intuitions regarding sentence structure that form the basis of this system, is called a grammar. Most grammars incorporate in some manner the notion that sentences can be analyzed in terms of syntactic categories such as the noun and verb phrases. There are exceptions, such as the Word Grammar, wherein the role of syntactic categories is played by relationships among words or word to word dependencies. (See the Word Grammar Home Page and the references cited therein.) The relationships and interword dependencies of Word Grammar therefore represent the word combination aspect syntactic categories that is inherent in the constituent-structure theory of sentence analysis. The grammars that are generally regarded as the prototypical formalisations of this theory, however, and which are still most frequently applied in cognitive models of sentence processing, are variants of the phrase-structure grammars.

Rewriting Rules. Phrase-structure grammars are often characterised as rule-based because the classification of words and the formation of phrases is described by a collection of rules. For example, the classification of words such as 'dogs,' 'like,' and 'nice' into the noun, verb, and adjective lexical categories, respectively, may be represented by lexical rules of the form N dogs, V like, and A nice. The arrow is often read as 'is replaced by' or 'rewrites as' when it is regarded as denoting a symbol replacement operator in the application of a phrase-structure grammar to produce sentences. Hence, rules of this form are usually called rewriting rules.

The arrow symbol may also be read as 'includes' or 'is the parent of' when the rule is understood to express a constituency relation on the categories for which the symbols in a rule serve as labels. For example, a noun phrase category may be regarded as including or being the parent of adjective and noun categories, and this fact may be represented by a phrase-structure rule of the form NP A N.

The blank between the right-hand side symbols A and N, which may be read as 'and,' can be treated as representing a concatenation of the symbols, or as a combination or composition of the categories to the right of the arrow to produce the category on the left in the rule. The left-hand side category may be referred to as the 'parent' or 'mother' of the right-hand side 'children' or 'daughters' categories, with each of these categories being described as the 'sibling' or 'sister' of the other.

A syntactic category may have only one daughter, as in the rule NP N, wherein the noun phrase is the parent of a noun category; but in rules with more than one category label on their right-hand side, the ordering of the labels determines the order of the corresponding categories in the sentence. For example, the rule VP V NP describes the verb phrase category as the parent of verb and noun phrase daughters with the verb preceding its noun phrase sister. Similarly, according to the phrase-structure rule S NP VP that specifies the daughters of the sentence category, the verb phrase category follows its noun phrase sibling.

Phrase-Structure Grammar. A collection of lexical and phrase-structure rules such as those illustrated above comprise one of the four components of a conventional phrase-structure grammar. A phrase-structure grammar, G, is defined by the entries in an ordered quadruple of the form

G = <VN, VT, S, P>
where the entries are identified as follows: The sets VN and VT are not empty, but they are normally disjoint so that their intersection is empty.

Derivation of Sentences. Sentences can be produced or derived on the basis of G by beginning with the starting symbol and applying rewriting rules from the production set P. The grammar G is thus treated as a rewriting system whereby the elements of its nonterminal vocabulary VN are treated as symbols that can be rewritten or replaced by applying rules from P. This rewriting process continues by applying rules with left-hand sides that match the nonterminal symbols produced by previous replacements. The derivation stops when no further rules can be applied because all of the nonterminal symbols have been replaced by words from the terminal vocabulary VT.

Note, however, that it is not considered that the cognitive system produces sentences by a rewriting or symbol replacement process. It is assumed only that the cognitive processes of sentence production are in some sense equivalent to the rewriting procedures. The cognitive system performs operations that can be described in their effect by the rewriting or replacement operations represented above.

The following sequence or string of words, for example

nice dogs like cats
can be derived if
VN = {S, NP, VP, A, N, V}
VT = {cats, dogs, like, nice}
and P includes the following productions:
N dogs
N cats
V like
A nice
The derivation begins by replacing S with the symbols NP and VP by applying the rule S NP VP. Each of the resulting symbols, namely NP and VP, can then be rewritten by applying the rules NP A N and VP V NP. This rewriting or replacement process continues until the string of words from VT is produced.

Context-Free Language. The resulting sequence or string of words from VT is said to comprise a sentence that is derived or generated by the grammar. The set of all sentences that can be generated by a grammar G is called the language of G and is identified as L(G). Because all the rules in the grammar illustrated here have a single left-hand side symbol, and the rules can be applied whenever there is a match for this symbol, without regard for whatever other symbols might be adjacent to it, a grammar such as G is described as context-free. The language L(G) generated by a context-free grammar G is said to be a context-free language.

Tree Diagram

Tree Diagram. A phrase-structure grammar G can also produce a structure for a sentence that it generates. If G is context-free, this structure can have the form of a tree with its root node corresponding to the starting symbol of the grammar. Other nodes correspond to, and are labelled by symbols from the nonterminal vocabulary VN. The leaves of the tree correspond to the words of the sentence and are therefore identified by elements of the terminal vocabulary VT. A tree can be constructed for a sentence by a process analoguous to the rewriting procedure just described; but rather than replacing a nonterminal symbol, a rule attaches branches to it, with these branches ending in nodes that are labelled by the right-hand side symbols of the rule. The operation of applying rules then consists of tree substitution, rather than replacement. The substitution procedure continues as long as there are branches that end in nodes labelled with symbols from VN, and it stops when all the branches end in leaves consisting of those words from VT that comprise the sentence.

The resulting tree is usually represented upside down relative to the normal orientation of a tree, with its root node at the top and the leaves, corresponding to the words of the sentence, at the bottom. A tree constucted in the course of producing the sentence 'nice dogs like cats' is shown in the figure. A tree such as this represents an analysis of the sentence and may be described variously as an analysis tree, a phrase marker, or a parse tree, or is called simply an analysis or a parse for the sentence. Note that is not normally thought that the cognitive system develops a representation of a sentence that has the form of a tree. It is usually assumed only that the cognitive system develops relationships among sentence constituents that can be described or represented by a tree structure. The mental representation is thus equivalent in some sense to the tree diagram.

Bracketed String. The phrase or constituent structure of a sentence can also be represented in the form of a bracketed string such as the following:

[S[NP[Anice] [Ndogs]] [VP[Vlike] [NP[Ncats]]]]
The development of this representation can be described in terms of a bottom-up analysis of the sentence, that is, an analysis which begins with the words of the sentence, rather than with the starting symbol of the grammar. The process begins by matching the right-hand sides of lexical rules such as A nice and N dogs to the words of the sentence. This matching is represented by inserting brackets into the sentence that carry subscripts corresponding to the lexical category of the individual words. Thus, we can write

 [Anice] [Ndogs] [Vlike] [Ncats]
The bracketed words are then grouped according to the syntactic categories they comprise by introducing brackets that corresponding to the syntactic rewriting rules. Thus, inserting brackets corresponding to the applications of the rules NP A N and NP N yields

 [NP[Anice] [Ndogs]] [Vlike] [NP[Ncats]]
Applying the rule VP V NP and introducing the correponding bracketing to group the words making up the verb phrase then produces

 [NP[Anice] [Ndogs]] [VP[Vlike] [NP[Ncats]]]
Finally, the bracketing that corresponds to application of the rule S NP VP yields the bracketed string originally cited above.

Equivalence of Representations. The equivalence of the bracketed string and the tree diagram representations of the analysis of the sentence can be established by devising a procedure, a well-defined, step by step process that converts one of the representations into the other. The bracketed string can be converted into a tree by starting with the bracketed words. For each word, the brackets are converted into a tree branch that runs from to word to a node labelled with the label of the left bracket. (Each word actually can be called a leaf of the tree; so this branch might be more appropriately called a petiole.) The brackets that enclose sequences of words, and which correspond to syntactic categories, can then be converted into the branches of the tree that connect the lexical category nodes to nodes corresponding to, and identified by the syntactic category labels on the brackets. The other syntactic category brackets are transformed into branches that connect nodes labelled with the category labels on the brackets. This process continues until the outer-most brackets are encountered and the root node is attached to the tree by branches connecting it to the nodes at the next level down in the tree. Another procedure can be devised that converts a tree into a bracketed string. Hence, the two representations are equivalent.

Prolog List. The bracketed string itself can be converted into the following related representation

[S [NP [A nice] [N dogs]] [VP [V like] [NP [N cats]]]]
by "promoting" the labels on the brackets so that they lie at the same level in the line as the words. The resulting representation can be converted into what amounts to a Prolog list by inserting commas to produce the following:

[S, [NP, [A, nice], [N, dogs]], [VP, [V, like], [NP, [N, cats]]]]
The foregoing representations corresponds to the structure often produced by Prolog programs that cause the Listener to function as a parser, a procedure that yields a structure for a string of words that it accepts as a sentence. The category labels, of course, will be represented with lowercase letters so that they correspond to Prolog atoms. The Prolog list therefore appears as follows:

[s, [np, [a, nice], [n, dogs]], [vp, [v, like], [np, [n, cats]]]]

In a list representation of a sentence structure, each of the lexical and syntactic category labels corresponds to the head of a list. For example, the label S is the head of a list with a tail consisting of the following two terms that are also lists:

 [NP, [A, nice], [N, dogs]]
 [VP, [V, like], [NP, [N, cats]]]
But, the label NP is the head of a list the tail of which consists of the following two lists:

 [A, nice]
 [N, dogs]
and the VP label is the head of a list with the following lists in its tail:

 [V, like]
 [NP, [N, cats]]]
Thus, the structure of the sentence is represented in the form of a sequence of nested lists, that is, lists with elements that are in turn lists. The head terms of each of these lists is always a category label. Except for those lists in which the head is a lexical category label and the tail is an atom representing a word, the tails of all the other lists consist of one or more lists. For consistency, therefore, the words are sometimes included as the elements of lists such as in the following:

 [A, [nice]]
 [N, [dogs]]
 [V, [like]]
 [N, [cats]]
so that the list representation of the structure of the sentence displayed by a Prolog program might have the form

[S, [NP, [A, [nice]], [N, [dogs]]], [VP, [V, [like]], [NP, [N, [cats]]]]]
but with the category labels in lowercase as follows:

[s, [np, [a, [nice]], [n, [dogs]]], [vp, [v, [like]], [np, [n, [cats]]]]]
because the labels correspond to atoms.

Indented Vertical List. Because the list representation is difficult to read, particularly when the structure of a sentence is complicated, some Prolog programs might also display the structure vertically as in the following representation, rather than horizontally:

wherein dashes (and sometimes blanks) are used to indicate the level of nesting of the category labels, and the words, in the original list. The number of dashes (or blanks) preceding a label or word also indicates to the level of the corresponding node in a tree diagram representation of the sentence structure.

The vertical representation is equivalent to the Prolog list structure. The equivalence of the two representations can be established by devising a procedure that, in effect, keeps track of the number of brackets to the left of each label or word in the list representation. The number of brackets (or blanks) preceding a category label or word in the vertical representation is equal to the sum of the left brackets minus the sum of the right brackets that precede the label or word in the list. For example, there are four left brackets to the left of the word 'nice' in the list so that there are four dashes preceding it in the vertical representation. There are three dashes preceding the label N in the vertical representation because there are four left brackets preceding the word 'nice' in the list, less the two right brackets that follow 'nice,' to give two, plus the left bracket immediately preceding the label N in the list, to yield three dashes in the vertical representation.

Character Tree. Some Prolog programs produce representations of the structure of sentences that appear as follows:

      NP          VP
   -------     -------
   A     N     V     NP
   |     |     |     |
  nice  dogs  like   N
because they are relatively easy to produce in the Listener output log, but still communicate the information contained in a conventional tree diagram. The equivalence of this "character" representation to a graphical tree representation is evident. The branches of the tree connecting a parent node to its children have simply been replaced with a horizontal line of dashes beneath the parent node label, but above the node labels of the children, extending from the leftmost label to the rightmost of the labels of the children.

Transition Network

Transition Network. A tree diagram representation of the structure of a sentence is also equivalent to the Transition Network diagram representation. Such diagrams consist of a collection of arcs, the curved line segments in this case with an arrow head to indicate a direction. The arcs join nodes which correspond to the beginning and end of the sentence, and to the spaces between the words. The nodes are usually identified with numbers that serve to keep track of the progress of the analysis of the sentence. The nodes correspond to the states of an abstract computer. The arcs joining the nodes are marked with category labels. These labels represent the lexical or syntactic category identified by the abstract computer as it executes a transition corresponding to one of the arcs in the diagram. (See Woods(1970).)

The abstract computer does not normally display Transition Network diagrams for the sentences is recognises. Although sentence structure can in principle be represented as a Transition Network, such diagrams actually depict the operations the machine performs to recognise sentences.

For example, as the abstract machine processes the first word of the sentence, 'nice,' it executes a transition that spans the word, moving from the state labelled '0' to the state labelled '1' in the direction identified by the arrow head on the corresponding arc, during which it identifies the word as an adjective, in the category labelled A. Similarly, as it processes the next word, 'dogs,' the abstract machine identifies it as a noun, with the lexical category label N. Thus, the machine executes transitions that correspond to the two rewriting rules A nice and N dogs. Having identified the categories of the first two words, the machine can now execute a transition that spans both words and which corresponds to the rewriting rule NP A N.

All of the transitions of the abstract machine therefore correspond to rewriting rules. The correspondence between transitions, and the arcs representing them, can be established as follows. Each arc corresponds to the left-hand side of a rewriting rule, and is identified with the label of the left-hand side category of the rule. The arcs immediately below the arc in the diagram correspond to the right-hand side of the same rewriting rule, and are identified with the right-hand side labels in this rule. For example, the topmost arc in the diagram here is identified with the S label, and corresponds to the left-hand side of the rewriting rule S NP VP. The arcs immediately below are identified with the NP and VP labels corresponding to the right-hand side categories of this rule.

The transcription of the rewriting rules of a context-free phrase-structure grammar as described above yields a Transition Network Grammar. The operations of the abstract computer that works on the basis of a Transition Network Grammar can be emulated using a programming language such as Prolog. The resulting program processes sentences in top-down, depth-first fashion; that is, the program starts with an attempt to execute the transition represented by the arc identified with the S label. To complete this transition, however, the machine must first execute the transitions corresponding to the arcs immediately below the S arc in the diagram. Thus, it must traverse the NP arc; but, in order to execute this transition, it must first traverse the A arc, and then the N arc. In other words, it must identify the lexical categories of the first two words. Having done so, it can then traverse the NP arc. In order to execute the VP transition, however, the machine must first execute the V and NP transitions represented below the VP arc in the diagram. Once these arcs have been traversed, the machine can traverse the VP arc, and hence complete the S transition, thereby recognising the sentence.

The processing outlined above can be compared with the operation of the Listener as it recognises a sentence on the basis of a Prolog program that implements a Definite Clause Grammar. For example, the rewriting rule S NP VP can be translated into a Prolog rule with a head structure corresponding to the sentence category on the left-hand side of the rewriting rule and with goals in its body corresponding to the noun and verb phrase categories on the right-hand side. In order that this rule be proved during a sentence recognition, the goals in the body of the rule must first be proved. The proving of the Prolog rule thus corresponds to the condition on the transitions in the Transition Network Grammar that the arcs immediately below an arc in the diagram must be traversed both the arc itself can be traversed.

In fact, an early study of Prolog as a programming language for natural language processing demonstrated the equivalence of a form of Prolog implementation of Definite Clause Grammar to Transition Network Grammars. (See Pereira and Warren(1980).) At the time this study was undertaken, Transition Network Grammars were among the most important natural language processing strategies then available. Their importance has declined with the development of other, more effecient and convenient analysis procedures. Transition Network Grammars can nonetheless still serve to introduce significant computational concepts. They also can be interpreted in terms of psychological models of sentence processing. (See Fodor and Frazier(1980).) For example, the transitions of the abstract machine representing a Transition Network Grammar can be viewed as modelling the changes in mental state that take place in the course of processing a sentence.


Chart. Sentence analysis programs based on Transition Network Grammars have largely been replaced by Chart Parsers. (See Kay(1980).) These procedures employ a data structure called a chart to store intermediate results obtained in the course of analysing a sentence. Most chart-based parsers, albeit not all of them, work on the basis of a context-free phrase-structure grammar, and the information recorded in the chart includes the rewriting rules that can be applied, and the location of their application within the sentence at each stage of its analysis.

The application location of a rewriting rule is represented by numbers that correspond to the positions of the words in the sentence. These numbers mark the end points of an edge. The edge is identified by the label of the left-hand side category the rewriting rule. For example, in the diagram reproduced here, the edge corresponding to the application of the rule VP V NP is labelled VP and consists of the horizontal line immediately under this label, together with the two vertical lines connected to the little circles around the position numbers 2 and 4 that mark the beginning and end, respectively, of the verb phrase 'like cats.'

A Chart Parser stores all the different rules that can be applied at each stage in the analysis of a sentence. Thus, if a sentence is ambiguous, and permits more than one analysis, a Chart Parser will identify all of the alternative analyses, and will produce structural representations of each. For example, in the sentence

radio broadcasts pay
radio can be classified as a noun, an adjective, and a verb, while both broadcasts and pay can be classified as nouns and verbs. Thus, when the first word is analysed, edges corresponding to the three categories for radio are recorded in the chart. An edge representing the identification of radio as comprising a noun phrase is also stored.

When broadcasts is analysed, two further edges are stored corresponding to its lexical categories. Other edges, corresponding to the second word being identified as a verb phrase, and to the two words radio broadcasts being identified as comprising both a noun phrase, and as a sentence, are also recorded. These words or sequences of words that can be identified as legitimate syntactic constituents or phrases are called well-formed substrings. Hence, a chart is sometimes called a Well-Formed Substring Table.

Although it is unlikely the cognitive system employs a data structure such as a Well-Formed Substring Table to store intermediate results, chart parsers can be viewed as models for those psychological models based on the conjecture that human language analysis is a parallel process, that is, alternative analyses are pursued concurrently.

Chart parsers also store edges corresponding to rules that might be applied when more words are encountered during an analysis. For example, when the word radio is analysed, but the word broadcasts has not yet been encountered, the parser will record an edge corresponding to the rule NP A N because, having classified radio as an adjective, it " anticipates" that the next word will be a noun.

The structural representations displayed by a Chart Parser usually consist of conventional tree diagrams or the list form of the trees. Although charts such as that shown here can be used to represent sentence structures, the chart diagram actually serves as a depiction of the elements of the data structure employed to construct the parse tree. Since procedures can be devised that convert a chart to a tree, and transform a tree into a chart diagram, the two representations are equivalent.

Box Diagram

Box Diagram. Another sentence structure representation that is not normally displayed by a parser, but which is equivalent to the other representations discussed above is the box diagram. As the accompanying figure illustrates, a box diagram consists of a collection of nested rectangles, each of which is identified by a category label.

The equivalence of the nested boxes in the diagram to the nested lists of the list representation of sentence structure is particularly easy to establish. The label of a given box corresponds to the head of a list, and the box or boxes inside this box correspond to the terms comprising the tail of the list. For example, the NP box that includes the A and the N boxes corresponds to the list [NP, [A, nice], [N, dogs]]. The box label NP is the head of the list while the A and the N boxes inside the NP box correspond to the lists [A, nice] and [N, dogs], respectively, comprising the tail of the list with the head NP. Thus, the box diagram can in fact be regarded as a pictorial representation of the corresponding nested list. In this context, notice that, if the words were each enclosed in their own boxes, then the equivalent list would have the form [NP, [A, [nice]], [N, [dogs]]].

The box diagram can also be viewed as a graphical representation of the interpretation of rewriting rules as inclusion relations. For example, the rule S NP VP can be read as a statement that the sentence category includes noun and verb phrase categories. This relation of inclusion on the categories is depicted in the diagram by the NP- and VP-labelled boxes being included inside the S-labelled box. Similarly, the A- and N-labelled boxes are included in the NP-labelled box to represent the relationship of inclusion expressed by the rewriting rule NP A N. The box diagram can therefore be seen as depicting a model of the mental processes of sentence analysis based on the conjecture the cognitive system recognises successive sequences of words as comprising progressively more comprehensive and inclusive constituents.

The box diagram can be viewed as a representation of the structure of a sentence in terms of a hierarchy of sets, that is, a set of sets of sets, and so on. The category labels identify the sets involved. Note, however, that the duplicate labels N and NP identify different sets. For example, in the diagram shown here, the two boxes labelled A and N inside the first NP box identify sets that each consist of one member, namely, the words 'nice' and 'dogs,' respectively. Since it is clear from the context that we are dealing with words, we can drop the apostrophes, and the italic font, so that we can write A = {nice} and N = {dogs}.

The first NP label thus identifies a set consisting of these two sets, that is, NP = {A, N}, or NP = {{nice}, {dogs}}. The VP label then identifies a set that can be written out as VP = {V, NP} = {{like}, {{cats}}} because V = {like} and the second NP- and N-labelled sets consist of NP = {N} = {{cats}}. Consequently, the S label identifies the set S ={{{nice}, {dogs}}, {{like}, {{cats}}}}.

Linguistics 484 Notes Index Top of Page