| Ling484 Notes |
A. C. Brett acbrett@uvic.ca
Department of Linguistics University of Victoria Clearihue C139 |
Set theory is the mathematics of classes. Sets are classes. The notion of class is so fundamental to thought that we cannot hope to define it in more general terms. We can say that a class is any aggregate, any collection, any combination of objects of any sort; if it helps, well and good. But even this will be less help than hindrance unless we keep clearly in mind that the aggregating or collecting or combining here is to connote no actual displacement of the objects ... In short, a class may be thought of as an aggregate or collection or combination of objects just so long as 'aggregate' or 'collection' or 'combination' is understood strictly in the sense of 'class'. Quine (1969).
A set is a collection of things. The term collection, in fact, as well as such terms as class, category, family, aggregate, combination, and ensemble may be used synonymously with the term set. This page reviews some of the concepts and notation from the theory of sets that are applied to sentence analysis. The notation discussed here serves principally as a language whereby sentence analysis can described. The page begins with some illustrations of this notation, and the concepts for which it stands. The application of this notation to sets of words and to sentence analysis is then illustrated.
Specifying a set by listing its members. The specification of the membership of a set completely characterises the set. A set may be specified by listing its members, or by specifying particular properties or attributes the members must have in order that they be considered members of the set. For example, the set of things on my desk may be specified by listing these things as follows:
{mug,pencil,paper,dictionary,eraser,lamp,elbow,dust}By convention, such lists are enclosed in braces, { and }, sometimes called curly brackets, and the individual members or elements of the set are separated by commas. The left-hand brace may be read as "the set consisting of", with the commas separating the elements being read as " and".
A letter may be used to name or represent a set, as
D = {mug,pencil,paper,dictionary,eraser,lamp,elbow,dust}where the equal sign may be read "represents" or " identifies" or "denotes", and is understood to have the meaning that the letter on the left may be substituted for the set specification on the right wherever this might be convenient. Combinations of characters, including subscripts and superscripts, also may be used to name, identify, or label sets.
The order in which elements are listed is not important. Thus, we can represent the set D by listing its elements as follows:
D = {elbow,mug,lamp,paper,dictionary,dust,pencil,eraser}which is the same set D as was specified previously. Also, one or more elements may be named more than once without changing the set that is specified. For example
D = {elbow,elbow,mug,lamp,paper,pencil,dictionary,dust,pencil,eraser,pencil}specifies the same set as that identified as D, above, notwithstanding the element elbow has been cited twice, and pencil three times.
Note that neither D itself, nor the list of its elements is the set: D denotes or identifies the set, and the list merely denotes or identifies its members. Furthermore, the words in D identify objects. If we wanted to refer to the words themselves, we would enclose them in apostrophes, also know as single quote marks, as in the following example
Dw = {'mug','pencil','paper','dictionary','eraser','lamp','elbow','dust'}where Dw is a different set from D. The apostrophes may be omitted, in the interests of readability and to facilitate typing, but only so long as it is clear from the context that the elements of the set are words, and not the things identified by the words.
Specifying a set by identifying a property of its members. The foregoing examples illustrate the specification of sets by listing their members. A set also may be specified by citing an attribute or property comprising a criterion on the basis of which things are collected together as a set. For example, the set of things on my desk may be specified as
D = {d | d a thing on my desk}wherein the "dummy variable," d in this case, denotes a "typical element" of the set being specified. The vertical stroke, which is sometimes replaced by a colon, is read as " such that". The attribute or property that d must satisfy is given following the vertical stroke. Note that, in this example, both the specification by attribute and by a list of members have been used to specify the same set D.
Identifying a member of a set. The statement that my elbow is a member or element of the set D may be written
elbow Î Dwherein the character Î stands for the membership relation. It is read as " is an element of", " is a member of", or "belongs to".
The symbol Ï denotes the negation of Î , and is read " is not an element of", " is not a member of", or "does not belong to". Thus, it would be used in a statement such as
computer Ï Dto say that my computer is not on my desk. (My computer is on a separate table beside my desk.)
Subset of a set. It is possible to talk about subsets of sets. For example, suppose we want to consider only those elements of D that are used in writing and that someone might describe as writing tools or materials. Then, we might specify this set of writing tools or materials as follows:
M = {m | m Î D, m used in writing}On the basis of this specification, the set M of writing materials might alternatively be identified by listing its elements as follows:
M = {paper,pencil}The set M can be described as a subset of D because every element of M is an element of the set D. This fact can be represented by writing
M Í Dwhere the symbol Í can be read as " is a subset of" or " is contained by". We can then also write
D Ê Mwhich can be read as a statement that "D is a superset of M," or "D contains M." In general, given two sets X and Y, X is a subset of Y, and we can write X Í Y, if and only if every element of X is also a member of Y.
Note that the symbols Í and Ê denote a relationship among sets, or a relation on sets. The symbol Î , however, denotes a relation between elements and sets. For example, we can write
paper,pencil Î Dto indicate that the individual things, pencil and paper, are elements of the set D, but we write
{paper,pencil} Í Dto indicate that the set consisting of the elements pencil and paper is a subset of D. As idicated above, we have used M to identify this set made up of the two the elements, pencil and paper. To illustrate the difference between Î and Í , we can write
paper Î {paper} Í Mto say that paper is an element of the set {paper} which is a subset of M. The set {paper}, incidentally, since it consists of a single element, is called a singleton set.
Proper subset of a set. Since the set D contains elements that are not members of M, we can say the M is a proper subset of D and write
M Ì Dwhere the symbol Ì may be read as " is a proper subset". We can also write
D É Mwhere É may be read as "properly contains". In general, given two sets X and Y, X is a proper subset of Y, and we can write X Ì Y, if and only if every element of X is also a member of Y, but there is an element of Y that is not an element of X.
Equality of sets. The horizontal line added beneath É and Ì , which yields the symbols Ê and Í , respectively, indicates the possibility that two sets might be equal. For example, we can write
D Ê Dthat is, D contains itself. In general, given two sets identified as X and Y, if X Í Y and Y Í X, or in other words, if every element of X is a member of Y and every element of Y is a member of X, then we can say that X equals Y and write
X = Ymeaning that X and Y must designate the same set.
Equivalence of sets. It is also possible to speak of equivalent sets. Two sets are said to be equivalent if and only if they contain the same number of elements. For example, the set D of things on my desk, and the set Dw of words identifying the things on my desk, are equivalent because we can match each word in Dw with a thing in D. We can state this relation between the two sets by writing
Dw º Dwhere º can be read as " is equivalent to". Recall, however, that Dw and D are different sets, and notwithstanding they are equivalent, they are not equal. Hence, we can write
Dw ¹ DSince D includes elements, such as eraser and dictionary, which are not members of M, we can also write
M ¹ D
The specification of the set M is rather vague and open to alternative interpretations. Someone might, for example, consider that the eraser should be included, along with the pencil and paper, because it is obviously used in writing. Consequently, this person will interpret the specification as actually meaning the set Me where
Me = {eraser,pencil,paper}with he subscript e being appended to distinguish it from the original set M. Clearly, M Í Me because pencil,paper Î M and pencil,paper Î Me, since eraser Î Me, but eraser Ï M, we could also write M Ì Me.
Suppose another person thinks that the erase should not be included, but she claims she cannot write without a dictionary. Hence, she construes the specification of materials used in writing as meaning the set
Md = {dictionary,paper,pencil}where the subscript d serves to distinguish the set Md from the other subsets Me and M of D. Since each of the two sets Md and Me includes an element that is not included in the other, we can write
Md ¹ MeIntersection of sets. As is the case for Me, Md also includes M as a proper subset, that is, Md É M. Thus, the two individuals agree in their interpretations that, among the elements in D, paper and pencil can be considered as things used in writing. The two sets, Md and Me can consequently be said to intersect. The intersection of two sets is the set consisting of those elements that are shared by, or included in both sets. In the case of two sets Md and Me, we can specify that their intersection is the set M by writing
M = Md Ç Mewhere the symbol Ç can be read as " intersect", although the foregoing line would normally be read as "M is the intersection of Md and Me".
In general, the intersection of any two sets X and Y is the set of all elements that are members of both X and Y. The set intersection operation is commutative, that is
X Ç Y = Y Ç Xfor any sets X and Y. Set intersection is also associative so that
(X Ç Y) Ç Z = X Ç (Y Ç Z)for any sets X, Y, and Z. The parentheses indicate that the intersection operation in the expression they enclose is performed first. Thus, performing the intersection X Ç Y, and then taking the intersection of the resulting set with Z, yields the same result as performing the intersection Y Ç Z first, then taking the intersection of the resulting set with X.
The null set. If two sets share no elements in common, their intersection is the empty or null set, which is usually identified by the symbol Æ, or by writing {}. For example, suppose we consider the subset
Mc = {dust,elbow,mug,lamp}of those items in D that no one in this illustration thinks of as things that can be used in writing, and we compare it with the set M = {paper,pencil} of things upon which everyone agrees are used in writing. These two sets share no elements in common. Hence, their intersection is the null set, that is
M Ç Mc = ÆSets that have no elements in common, so that their intersection is empty, are said to be disjoint.
The null set is a subset of any set so that, for example, we can write
Æ Ì M and Æ Ì McAlso, the intersection of any set X with the null set yields the null set, that is
X Ç Æ = ÆThus, the null set is analogous to the number 0, and taking the intersection of a any set with the null set is something like multiplying any number by 0.
Union of sets. In identifying the set of those elements of D that are used in writing, one of the two individuals included the dictionary, in addition to the paper and pencil, while the other included the eraser. Thus, the four items, the paper, pencil, eraser, and dictionary, were selected by one or the other of the two individuals as things used in writing. We might use Mu to name this set and identify its elements as follows
Mu = {paper,pencil,eraser,dictionary}The set Mu can be called the union of Md and Me, and we can specify the relationship among the three sets by writing
Mu = Md È Mewhere the symbol È can be read as " union", while the equation can be read as stating that "Mu is the union of Md and Me".
In general, the union of any two sets X and Y is the set of all elements that are members of either X or Y. Like the set intersection operation, set union is commutative, that is
X È Y = Y È Xfor any sets X and Y. Set union is also associative so that
(X È Y) È Z = Y È (X È Z)for any sets X, Y, and Z.
Difference of sets. The individuals selecting the subsets Me and Md of D agree that the remaining elements, namely, those comprising the subset Mc consisting of the mug, lamp, elbow, and dust, cannot be considered as things used in writing. These things comprise a set consisting of those elements of D with the elements in the set Mu excluded or removed. We can identify this set as the difference between D and Mu, or as the complement of Mu relative to D. If the set Mc denotes this difference, or relative complement set, we can define it as follows:
Mc = D - Muwhere the minus sign can be construed as meaning the removal from the set D of all those elements that are also members of Mu.
Because Mu Í D, the difference operation entails that all the elements of Mu are removed from D. Thus, the intersection of Mc and Mu is the empty set so that we can write
Mc Ç Mu = Æ
The interpretation of the difference operation also means that we can obtain the complement of a set such as Md relative to another set such as Me. In this case, the removal from Me of those elements that are also members of Md leaves the singleton set {eraser}.
Me - Md = {eraser}Notwithstanding dictionary is an element of Md, only those elements shared by Md and Me, namely, the set M = {paper,pencil}, can be removed from Me. Thus, only those elements in the intersection of the two sets are removed from Me. The complement of Me relative to Md consequently yields the set {dictionary} so that we can write
Md - Me = {dictionary}
Sets of sets. A set can also contain other sets as elements. For example, we can define a set Mde as consisting of the two sets Md and Me, that is
Mde = {Md, Me}Note that this set has two elements. Note especially that Mde is most certainly not the union of Md and Me. This fact is evident from examination of the following specification of Mde wherein Md and Me have been replaced by the lists of elements comprising the sets that they name:
Mde = {{dictionary,paper,pencil}, {eraser,paper,pencil}}
Sets of words. The concepts and notation discussed above can be applied to sets of words. For example, consider the following sentence:
s = 'they stand by the bow and see her bow near the stand'Note that the lowercase Greek letter, sigma in this case, is used here to identify the sentence, following the convention the lowercase letters from the Greek alphabet are used to designate sequences of words. We might then specify a set W consisting of the words in s by writing
W = {w | w a word in s}or we might specify this set by listing its elements as follows:
W = {'they','stand','by','the','bow','and','see','her','near'}Notice that, although the words 'stand', 'the', and 'bow' each occur twice in the sentence s, they appear only once in the set W.
Subsets of words. We can identify several subsets of W. For example, we might consider the following subsets:
N = {w | w Î W, w a noun in s}Assuming these specifications are sufficiently unambiguous, and the concepts "noun" and "verb" are clear, then the sets N and V consist of the following elements:
V = {w | w Î W, w a verb in s}
N = {'stand','bow'}Both of the sets N and V are proper subsets of W so that we can write
V = {'stand','bow','see'}
N Ì W and V Ì W
As a noun in s, 'bow' for example can identify the front end of a ship, a device for launching arrows, or a bending in general or of the head or body in a gesture of greeting or respect. As a verb, 'bow' might correspond to the action of bending in general or of the head or body to perform a gesture of greeting or respect. Since the two sets share elements in common, their intersection is not empty. In fact,
N Ç V = NWe can also look at the union and the difference of the sets.
N È V = V
V - N = {'see'}
As the definitions of the sets N and V illustrate, the identification of such sets corresponds to the classification of the words in s according to their part of speech or lexical category. We can also specify other subsets of W on the same basis as follows:
PRO = {'they','her'}where the set names PRO, DET, P, and CONJ correspond to the labels PRO, DET, P, and CONJ for the pronoun, determiner, preposition, and conjunction lexical categories, respectively. Observe that set names have been indicated above using a slanted or italic font, while the category labels are written in a what is usually called a roman type face.
DET = {'the'}
P = {'near','by'}
CONJ = {'and'}
Finite, infinite, denumerable, and countable sets. The sets illustrated above can all be described as finite; that is, they are either empty, or they consist of n distinct elements, where n denotes a whole number, usually called an integer, which is greater than zero. Thus, finite sets are equivalent to the set consisting of the first n positive integers, which is sometimes represented by writing
In = {1, 2, 3, ¼, n}Notwithstanding there is no order among the elements of a set, the ellipsis, '¼', is used as an abbreviation in this case to indicate the numbers between 3 and n. The equivalence of an arbitrary set X to In is established by matching each element of In with an element of X. This matching up of the elements of the two sets is obviously just the counting process, but for a finite set, the counting stops at some number n, that may nonetheless be very large, rather than continuing indefinitely.
A set that is not finite is said to be infinite. There can be sets that are equivalent to the set of all positive integers, I+, which may be represented by writing
I+ = {1, 2, 3, ¼}where the ellipsis in this case indicates that the numbers continue indefinitely, without bound, so that I+ is itself an infinite set. Thus, sets equivalent to I+ are infinite, but because the elements of such sets are matched with the elements of I+, these sets are described as denumerable, and a set that is either finite or denumerable is said to be countable. (Incidentally, there are sets, such as the fractions in the interval between 0 and 1, that are infinite, but are not denumerable.)
Products of sets.
The lexicons or vocabularies of natural languages are considered
to constitute finite sets, and sentences can, in principle, be analysed
in terms of subsets of these finite sets of words.
Sets, however, include no relationship of order among their elements,
and in languages such as English, the order the words in a sentence
is important.
To represent this order, an order relation must be added to the
relations on sets, such as union and subset, described above.
Although there are other ways of introducing an order relation on
sets of words, the method most commonly employed is based upon the
product of sets.
Ordered pairs.
To illustrate the concept of a product of sets, we consider the
set W of the words that make up the sentence
s and the set C, consisting of lexical category labels cited
above, including the lables N and V for the noun and verb categories,
respectively, where
The set W × C is called variously the Cartesian
product, the cross product, or simply the product of C
and W.
As shown above, the elements of W × C consist of ordered
pairs with the form áw,cñ, wherein the entries,
namely, w and c, are separated by a comma and the pair is enclosed
in angled brackets, á and ñ.
The relative order of the entries in these pairs is
determined by the positions of the sets W and
C relative to the × sign.
Since W appears to the left of the ×, its elements
appear as the first entry of each ordered pair.
The elements of C appear in the second entry of each pair
áw,cñ because the set C is to the right of the
× sign.
Ordered pairs and rewriting rules.
The product set W × C includes all pairings of the
elements of the two sets C and W.
But in applying the product set to a sentence analysis, we select a
subset of its elements according to the conditions that the second entry
of each pair corresponds to the lexical category label of the word in the
first entry.
Thus, we select those pairs that correspond to rewriting rules
of the form
Then, if we were analysing a sentence such as
Notice that the apostrophes used previously to distinguish words from
the things they identify have been omitted because it is clear from
the context that the lowercase letters on the right-hand sides of
the rewriting rules and in the first entry of the ordered pairs
are words.
The ordered pair notation can be viewed as an alternative way of
representating rewriting rules, or depending upon your point of
view, the rewriting rule format might more appropriately be seen
as an alternative to the ordered pair notation.
Mappings and partial mappings.
The set F Í W × C can be
called a mapping or function that maps from the
set of words W into the set of category labels C.
The set C is usually called the range of F, while
W is called the domain of F.
F represents an association of the elements of its domain
W with particular elements of its range C.
Actually, in this example, F might be more appropriately
identified as a partial function or partial mapping
because it is defined only for a subset of W.
Functions such as F would normally be defined for every
element of their domain set; that is, for every w Î W,
there would be a pair áw,cñ Î F.
A function meeting this condition might be called a
total function or a complete function to distinguish
it from a partial function.
Normally, if it is described simply as a function, then the
function is assumed to be total.
The partial function F enables us, in principle, to associate
each word in the sentence b with a lexical category label.
To complete the analysis of b, however, we require functions,
or partial functions, that map the lexical category labels into
syntactic category labels NP and VP, for example.
Thus, we need a partial function, or functions, with elements
corresponding to rewiting rules such as NP ® DET N and
VP ® V NP.
Notice that the right-hand sides of these particular rules include
two category labels for a total, including the left-hand label,
of three symbols.
Hence, the elements of the function we require must consist of
ordered triples.
Ordered triples and rewriting rules.
We can redefine the set C of category labels to include
the elements NP, VP, and S, as well as the lexical labels, so that
Although C2 × C = C3,
the representation of the rule S ® NP VP as
ááNP, VPñ,
Sñ makes it clear that the two symbols on the right-hand side
of the rule constitute an ordered pair.
Thus, in addition to the condition that a simple English sentence consists of an NP
and a VP (and not two NPs or two VPs),
the pair áNP, VPñ can be seen as
a statement that the subject NP of a simple English sentence precedes the predicate VP.
Similarly, the pair áV, NPñ says
that the object NP follows the verb in the VP, and
áDET, Nñ states that the DET
precedes the N in an NP.
The foregoing conditions thus restrict the elements of C3 to those
listed in G.
But since the subject noun phrase of the sentence b consists
of a pronoun, we also require an ordered pair
áPRO, NPñ Î
C2 corresponding to the rewriting rule
NP ® PRO.
Hence, we define a partial mapping H Í C2
as follows:
The ordered pair in H, together with the ordered triples
comprising the partial mapping G Í C3,
and the ordered pairs making up the partial mapping
F Í W × C permit us to define
the set
Ordered n-tuples and strings.
Before the elements of R can be applied to analyse b,
however, it must be recognised that b is itself an
ordered quadruple; that is, it is an element of W4.
We can therefore show b as
Sequences of words such as b, regardless whether they are
written as a list of words or are represented as an n-tuple,
are called strings.
In this example, b is said to be a string over W.
Strings over a vocabulary.
A set such as W is generally described as a vocabulary.
Sequences of words such as 'the bow' and 'they see', or even
single words such as 'see' that make up b are described as
substrings of b.
The string b is just one instance of all the different
strings over the vocabulary W.
For example, if we consider only that subset of W consisting
of the four words in b, we can devise the following
strings over the vocabulary W:
Length of a string.
The subscript numbers here serve only to distinguish the foregoing
strings over W from b.
The subscripts on b1, b2, and so on, however, also
happen to correspond to the length of the string in each case.
The length of a string is, naturally enough, defined to be the
number of words it contains.
The length of an arbitrary string a over W is denoted
as | a | where
The upper bound n on the length a string, as signified
by the condition i £ n, indicates that we
normally deal only with finite strings; that is, for any
string a over a vocabulary W, although a might
be very long, it is always possible to identify an integer n
such that | a | £ n.
Null string.
That the length of a string might be zero, as signified by the
condition 0 £ i, includes the possibility that a
might be the string consisting of no words, that is, it is the
null string.
The null string might be represented by writing áñ or
W0, but the symbols l or e, the
lowercase Greek letters lambda or espsilon, respectively, are usually
used to denote it.
The null string is analoguous to the null set; however, they are
not the same entities.
Note that {l} is not the null set; rather, it is the
set consisting of the null string.
Concatenation of strings.
Strings can be concatenated.
For example, the string áthe, bowñ can be concatenated
to the string áthey, seeñ to obtain the string
áthey, see, the, bowñ.
The concatenation of strings might be described as a combining of two
strings by, in effect, adding one of the strings onto the end of the
other (so long as it is understood that " adding" in this context
is not an arithmetic operation).
In terms of the apostrophes and blanks notation, the concatenation
of two strings can be represented by writing one of the strings after
so that, for example, the concatenation of 'they see' and 'the bow'
yields the string 'they see the bow'.
Concatenation is obviously not commutative, unlike the union or
intersection of sets, because the string 'the bow they see', obtained
by concatenating 'they see' to 'the bow', is not the string
'they see the bow'.
Concatenation, however, is associative.
For example, concatenating 'the bow' to 'see' to produce 'see the bow',
and then catcatenating this string to 'they' yields 'they see the bow'.
The same result is obtained by first concatenating 'see' to 'they',
and then concatenating 'the bow'.
In general, the result of concatenating any strings g and
d over a vocabulary such as W is written as
gd where
Finite concatenative closure.
Strings over a vocabulary such as W can, in principle, be
of any length, provided they are finite.
The set of all finite strings over W is denoted by
W*, where the superscript asterisk is called the
Kleene star.
This set might be represented by writing
The concept of finite closure under concatenation in the domain
of strings is analogous to the concept of closure under the
operation of addition in the domain of integers.
The integers are said to be closed under addition because adding
two of them together yields another integer, not some
other kind of object.
Thus, adding two integers, say i and j, produces the
integer i+j which is finite because it is possible to identify
another integer n such that i+j £ n.
Closure without the null string.
The set W* includes the null string l.
The set W+ is just W* but without the null
string; that is,
The concept of a concatenative closure is not restricted to just
sets or vocabularies of words such as W.
One can also consider the closure of the set or vocabulary
of category labels such as C, defined above as
The analysis of the string b yields strings that are elements
of he sets C1, C2, and C3.
These sets are subsets of the finite concatenative closure C*
of the vocabulary of category labels C.
Hence, all of the foregoing strings are elements of C*.
The closure C+ of C excluding the null string
can also be defined
Notice that the string áPRO, V, DET, Nñ
is the result of mapping all the words of b to their
lexical categories.
This element of C4 is thus the last in a sequence
of strings which includes the following:
Partial mappings from words to category labels, and among category labels,
can, in principle, be applied anywhere in a string.
The only condition is that a substring of the string be an element of the
domain of the mapping.
Thus, the following strings might be
produced in the course of analysis of b:
Derivations and sentential forms.
If a string such as b Î W* can be reduced to a string
consisting of just the sentence category label S Î C,
then b can be identified as a sentence.
The intermediate strings such as those shown above that are produced
during the course of an analysis are called sentential forms.
The sequence of sentential forms produced during the analysis of
b which ends in áSñ can be called a
derivation.
The derivation relation.
Note that the term "derivation" is normally applied to the
sequence of sentential forms that begin with áSñ
and which ends with a sentence such as b.
For example, the string áNP, V, NPñ can be derived
from the string áNP, VPñ, and we can write
Notice the difference between the double-shafted derivation arrow and
the single-shafted rewriting arrow.
The string on the right-hand side of the Þ derivation
arrow is the string that results from applying a rewriting rule to
the string to the left of the Þ.
The string on the left-hand side includes a substring, consisting
of the single category label VP in this example, that matches the
left-hand side of the rewriting rule VP ® V, NP.
The label VP is replaced by the string V NP on the right-hand
side of the rewriting rule to produce the string
áNP, V, NPñ on the right of the Þ
derivation arrow.
Thus, the strings on either side of the derivation arrow are just
strings before and after the application of a rewriting rule.
In general, suppose that a sentential form consisting of the string
ág1, A, g2ñ, where
g1,g2 Î (C ÈW)* and A Î C,
is produced at some point in a derivation.
Then, if there is a rewriting rule A ®a, where
a Î (C ÈW)*, application of this rule will
derive a sentential form consisting of the string
ág1, a, g2ñ.
Closure of the derivation relation.
The symbol Þ* denotes the finite closure of the
derivation relation, and means that the string on its right-hand
side can be derived from a string on the left-hand side by the
application of zero rewriting rules, or by at most a finite number
of applications of rewriting rules.
Thus, we can write
Note that Þ* includes the application of no rewriting
rules so that we can actually write something such as
The notion of a derivation has been introduced above in the context
of applying rewriting rules, although we have continued to represent
strings using the ordered n-tuple notation.
In the context of partial mappings, rather than rewriting rules,
derivations correspond to reductions; that is, strings of words
are reduced to a string consisting of the sentence category label.
Of course, a string s Î W+ can be reduced to
S Î C if and only if s is a sentence,
and we can write
Terminal and non-terminal vocabularies.
The particular sets C and W of category labels
and words, respectively, have been used to provide concrete examples
for concepts and notation discussed above.
Generally, the names S (the uppercase Greek letter sigma) or
VT is used to identify the vocabulary, or a subset of the
vocabulary of a language.
The subscript T is appended to V in
VT because the words comprising the vocabulary are often
called terminal symbols.
The term "terminal" is used because strings consisting exclusively
of words are the end result of a derivation.
The term "symbol" is used to include both the words and the category
labels that are processed during a derivation.
The category labels are identified as "non-terminal symbols" or as
variables, and
they comprise the non-terminal vocabulary, VN.
(When the vocabulary of words or terminal symbols is identified
using S, then vocabulary of non-terminal symbols or
variables is sometimes identified simply as V.)
The two vocabularies are normally disjoint, that is,
Generative grammar.
A set of partial mappings or rewriting rules such as those
illustrated here, together with the terminal and non-terminal
vocabularies VT and VN, respectively,
comprise a grammar.
Such grammars are also called generative grammars because
sentences can be thought of as being generated from the
sentence category label by applying the rewriting rules.
These grammars are also called phrase-structure grammars
or constituent-structure grammars because they are based
upon the notion of sentence constituents or phrases such as
the noun and verb phrases.
A generative or phrase-structure grammar G is defined
by an ordered quadruple
Language generated by a grammar.
A string s Î VT* that can be derived from the
sentence category label S Î VN by the rewriting rules
in P, or which can be reduced to a string
áSñ consisting of only the sentence category label
by the corresponding partial mappings, is said to be a sentence
generated by G.
The set of all such strings is described as the language
generated by G, denoted as as L(G), where
C = {PRO, DET, P, CONJ, N, V}
We then define the product of W and C as follows:
W × C =
{áw,cñ | w
Î W, c Î C}
which consists of all the ordered pairs of elements of the form
áw,cñ for every w Î W and c Î C
where the first entry in each pair is a word from the set W
and the second entry is a category label from the set C.
c ® w
where w Î W and
c Î C, with W being the set
of words and C the set of lexical category labels as above.
The foregoing rewriting rule can be recast as the following statement:
w Î c
which just says that the word w is in the lexical category c.
For example, the rule
PRO ® they
is a statement that
they Î PRO
b = 'they see the bow'
made up of words from W, we might choose a subset of
W × C according to the following identifications:
in order to assign lexical category labels to the words of b.
Thus, we will choose the following subset F of
W × C:
PRO ® they áthey, PROñ V ® see ásee, Vñ DET ® the áthe, DETñ N ® bow ábow, Nñ
F = {áthey, PROñ,
ásee, Vñ,
áthe, DETñ,
ábow, Nñ}
where F is itself a subset of the set
{ áw, cñ |
áw, cñ
Î W × C,
w Î c } .
with F being obtained by restricting w to the words in b.
C = {S, NP, VP, PRO, DET, P, CONJ, N, V}
where
S = {NP, VP},
The ordered triples we require are then elements of the set
NP = {DET, N, PRO}, and
VP = {V, NP} .
C × C × C =
{áa,b,cñ |
a,b,c Î C}
This product set may also be represented by writing C3,
where the superscripted number 3 denotes the number of factors,
each consisting of the set C, in the cross product.
The particular elements of interest in C3, and the
rewriting rules to which they correspond, are as follows:
We can therefore obtain the following partial mapping G
Í
C3:
áNP, VP, Sñ
S ® NP VP áDET, N, NPñ NP ® DET N áV, NP, VPñ VP ® V NP
G = {áNP, VP, Sñ
áDET, N, NPñ
áV, NP, VPñ}
by first imposing the conditions
NP, VP Î, S
Additional conditions, however, must be applied.
These conditions follow from the observation that
the ordered triples comprising G constitute a partial mapping
from C × C into C.
Thus, the domain of this partial function can be viewed as a
set of ordered pairs.
In fact, the ordered triples identified above can actually be
written as ordered pairs, but with the first entry being itself
an ordered pair.
For example, áNP, VP, Sñ
can be represented as
ááNP, VPñ,
Sñ which can be regarded
as an element of the set C2 × C.
V, NP Î VP, and
DET, N Î NP .
H = {áPRO, NPñ}
consisting of one element.
R = F È G È H
of ordered n-tuples corresponding to the rewriting rules
that might be applied to the analysis of the sentence b.
b =
áthey, see, the, bowñ Î W4
Writing b as an ordered quadruple just makes explict what
is implicit in the representation of b as a sequence of
words separated with blanks; that is, the
sequence is an ordered listing of the words.
Thus, when we write
b = 'they see the bow'
we actually mean that the four words constitute an ordered
quadruple.
The two representations are actually not all that different.
If we replace the apostrophes with angled brackets, and replace
the blanks separating the words with commas, we can convert
the ordered list into the ordered quadruple form.
b1 = áseeñ Î W1
where in the case of b1 we have extended the angled bracket
notation to include strings consisting of just one word, with
W1 being just W.
b2 = áthey, seeñ Î W2
b3 = ásee, the, bowñ Î W3
b4 = áthey, the, bow, seeñ Î W4
| a | = i if and only if
a Î Wi for 0 £ i £ n
The vertical strokes on either side of a can be read as
"the length of".
The notation 0 £ i £ n means that the length
of the string a is an integer i between zero and
n, where n denotes some, perhaps very large integer.
The symbol £ can thus be read as " is less than or equal to".
gd Î Wi+j if and only if
g Î Wi and
d Î Wj
Thus, if | g | = i and | d | =
j, then | gd | = i+j (where
g and d are the lowercase Greek letters gamma and
delta, respectively).
Two strings such as g and d are equal (that is, they
are the same string) if and only if they are the same length and
each of their entries are equal (are the same).
Hence, we can write
g = d if and only if gk = dk for
1 £ k £ n = | g | =
| d |
Because concatenation is not commutative,
gd = dg if and only if g = d;
The concatenation of the null string with any string yields just
the string so that we can write
gd ¹ dg otherwise
gl = lg = g
for any string g.
Thus, concatenation of the null string with any string is something
like the union of the null set with any set, or the addition
of 0 to any number.
W* = {W0 ÈW1 ÈW2 ȼÈWn}
where, as indicated above, W0 can stand for the null
string, W1 is the set of all single-word strings,
W2 is the set of all two-word strings, and so on up to
the set Wn of all strings of n words, where n
does not denote any particular integer, but rather indicates that,
regardless how long they might be, the strings in W* are
nonetheless finite.
W* is called the finite concatenative closure, or
simply the closure of W because it is considered that the
concatenation of any two stings in W* is also a string
in W*, that is
gd Î W* if and only if
g, d Î W*
For example,
'they stand' Î W* and
'and' Î W* means that
'they stand and' Î W* and
'see her near' Î W* means that
'they stand and see her near' Î W* and
'the stand by the bow' Î W* means that
'they stand and see her near the stand by the bow' Î W*
W+ = W* - {l}
The superscipted plus sign on W+ may be thought of
in terms of an analogy with the set of positive integers without 0.
C = {S, NP, VP, PRO, DET, P, CONJ, N, V}
For example, application of the partial mapping F defined above
that associates the words in the string b, where
b =
áthey, see, the, bowñ Î W4
with their lexical categories, yields the string
áPRO, V, DET, Nñ Î C4
Partial mappings and sentence analysis.
Application of other partial mappings from the set R defined
above, and which correspond to other rewriting rules, can produce the
following strings in the course of an analysis of b:
áPRO, V, NPñ Î C3
Note that a mapping can be applied only if a substring is in the
domain of the mapping.
For example, the substring áV, NPñ of the first of
the strings listed above is in the domain of the partial mapping
that includes the element áV, NP, Sñ.
Hence, the substring can be mapped to áVPñ, which is
a substring of the second of the foregoing strings.
If, on the other hand, a string such as b4 =
áthey, the, bow, seeñ were being analysed, the string
áPRO, NP, Vñ would be produced in the course of its
analysis.
Since the substring áNP, Vñ is not in the domain of
a mapping in R, the analysis could proceed no further.
áPRO, VPñ Î C2
áNP, VPñ Î C2
áSñ Î C1
C+ = C* - {l}
áPRO, see, the, bowñ Î C1×W3
where each of these strings is produced as the words in b are
mapped to their category labels one word at a time.
The sets C1×W3, C2×W2
and C3×W1 are actually all subsets the set
(C ÈW)4, which includes strings beginning with one,
two, or three entries from C, as well as strings with either words
from W or category labels from C in each entry.
áPRO, V, the, bowñ Î C2×W2
áPRO, V, DET, bowñ Î C3×W1
áthey, see, NPñ Î (C ÈW)3
The foregoing examples illustrate that the strings produced during an
analysis can be treated as elements of the finite concatenative
closure of the union of the sets of words and category labels,
(C ÈW)*.
The concatenative closure (C ÈW)+ excluding the null
string l is defined as (C ÈW)* - {l}.
áthey, VPñ Î (C ÈW)2
áSñ Î (C ÈW)1
áNP, VPñÞ áNP, V, NPñ
by applying the rewriting rule
VP ® V, NP
The double-shafted arrow symbol Þ dnotes the
derivation relation on strings which can be read as "derives".
S Þ* áthey, see, the, bowñ
In general, if
S Þ* s, where
s Î W*
then s is a sentence.
Thus, of all the strings in W*, that is, strings consisting
exclusively of words, only those that can be derived from the sentence
category label S Î C are sentences.
áNP, VPñÞ* áNP, VPñ
The derivation arrow Þ+, however, specifies that at least
one rewriting rule is required; that is, the application of zero rules
is excluded.
Hence, it would normally be the case that
S Þ+ s, where s Î W+
Athough it is possible in principle to have a sentence consisting
of the null string, and there could be a rewriting rule
S ® l, sentences usually consist of one or more
words and at least one rewriting rule is required to derive them
from S.
sÜ+ S, where s Î W+
or more generally
sÜ* S, where s Î W*
wherein we have reused the double-shafted arrow notation, but have
written the arrows pointing leftward rather than to the right.
The application of the partial mapping
áV, NP, VPñ
whereby the substring áV, NPñ is mapped to the
substring áVPñ, can reduce the string
áNP, V, NPñ to the string áNP, VPñ
so that we can write
áNP, V, NPñÜ áNP, VPñ
VT ÇVN = Æ
In terms of these two vocabularies, rewriting rules often have
the form
B ® b, where b Î VT,
B Î VN
that is, the uppercase Roman letters A and B stand for category labels,
such as NP and N, and are elements of the non-terminal vocabulary
VN; b stands for a word in the vocabulary of terminal
symbols VT; and a in this illustration stands
for a string of non-terminal symbols in the VN*,
the closure of the non-terminal vocabulary.
Thus, if A were to stand for the category label NP, then a
might stand for the string áDET, Nñ, which is
usually written as DET N, as in the rewriting rule NP ®
DET N.
Elements of the partial mappings that correspond to the foregoing
rewriting rules appear as follows:
A ® a, where a Î VN*,
A Î VN
áb, Bñ, where b Î VT,
B Î VN
áa, Añ, where a Î VN*,
A Î VN
G = áVN, VT, S, Pñ
where VN and VT denote the non-terminal
and terminal vocabularies, respectively.
The sentence category label, S Î VN, sometimes called
the starting or root symbol of the grammar, is included
as one of the elements of the quadruple.
P identifies
the set of rewriting rules, which are sometimes called
productions because sentences can be thought of as being
produced by their application.
(The set of rewriting rules or partial mappings or functions may
also be identified by the names F or R.)
L(G) = {s | s Î VT*,
S Þ*s}
or
L(G) = {s | s Î VT*,
sÜ*áSñ}
with L(G) Í VT*.
| Linguistics 484 | Notes Index | Top of Page |