Linguistics 482 - Computational Linguistics
Fall 2000
Laura Proctor
lproctor@uvic.ca
Department of Linguistics
University of Victoria
(250) 721-8282
Last updated September 6, 2000

INTRODUCTION

According to Crystal (1985:63), computational linguistics is

A branch of linguistics in which computational techniques and concepts are applied to the elucidation of linguistic and phonetic problems. Several research areas have developed, including speech synthesis, speech recognition, automatic translation, the making of concordances, the testing of grammars, and the many areas where statistical counts and analyses are required (e.g., in literary textual studies). [David Crystal (1985). A Dictionary of Linguistics and Phonetics. 2nd edn. Basil Blackwell.]

Some workers in the area appear to prefer restricting application of the appellation computational linguistics to those activities which I presume Crystal includes under "the testing of grammars." For example, Hirst (1987) seems to indicate that computational linguistics is strictly "... about applying computers as tools in research in theoretical linguistics, or ... linguistic theory." [Graeme Hirst (1987). "Review of Computers in Linguistics by Christopher S. Butler." Computational Linguistics 13, 335-336.]

Hirst also appears to describe what Crystal includes under "the making of concordances" and "areas where statistical counts and analyses are required (e.g., in literary textual studies)" as "the field that has come to be known as literary and linguistic computing -- the application of computers in the language-oriented humanities." Consequently, I conclude that he excludes such activities from the domain of computational linguistics.

Although I agree with the greater part of Hirst's comments regarding Butler's book [Christopher S. Butler (1985). Computers in Linguistics. Basil Blackwell.], I take exception to his very restricted use of the term computational linguistics. My view of what activities are included within the purview of the academic domain called computational linguistics is more nearly reflected in Crystal's definition. Thus, I consider computational linguistics to be that subdiscipline of linguistics (and perhaps of computer science and engineering, and even of psychology and philosophy, among other disciplines) dealing with the application of computational techniques and concepts to the study of natural language.

Course Objectives

This course is intended as an introduction to computational concepts and methods as these are applied to the study of natural language. The principal objectives of the course are to give students sufficient background to employ these concepts and methods in their other studies, and to prepare students for further study in computational linguistics. Neither previous knowledge of computing nor experience with its linguistic applications is assumed. For those with some previous experience in these areas, however, the concepts, techniques, and applications presented should be sufficiently interesting and challenging that the course will be worth the time they must devote to it.

Course Content

The following three major topics are employed as a framework or context for the other topics treated in this course:

  1. The InterNet
  2. DOS/Windows: The Microsoft Operating System & Interface
  3. Prolog: Programming in Logic

The last-named, Prolog, is a programming language, while the second, DOS, is an operating system underlying the graphical user interface (GUI) Microsoft Windows. The topic of operating systems, including what they are and why they are required, will be discussed. Related matters that will be discussed include the distinction between hardware and software, and the differences between the low-level machine language of a computer and the high-level programming languages such as Prolog which human beings normally employ to communicate instructions and other information to a computer. The translation or interpretation of human-oriented high-level language programs into the machine language instructions which a computer can "understand" and execute will also be discussed.

Another topic to be treated is the encoding of character graphics such as letters of an alphabet into the binary representations with which a computer works. These binary representations are customarily written as sequences of zeros and ones, with these two characters being called binary digits, or bits. The term "customarily" is significant here because any other pair of characters, such as "+" and "-" for example, would serve just as well to denote the two binary digits. The computer itself, of course, does not "see" these digits as either "0" and "1", or "X" and "O" for that matter; what it "senses" is a change in the flow of electric current, and we simply use two different characters to stand for two states of current flow. While a computer is capable only of sensing changes in electric current flow, what we usually want it to "see", and to work upon, are sequences of letters. To enable the machine to "see" our letters, we employ conventions whereby given sequences of binary digits, representing current flow changes, correspond to particular letters and other characters such as numerals and punctuation marks. Such correspondence schemes are arbitrary, and hence there are several such conventions in common use. The most prevalent at present is ASCII, pronounced "askey", the American Standard Code for Information Interchange; but there are other, sometimes competing conventions, and there are variations, elaborations, and extensions of the ASCII convention.
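The correspondence between characters and binary digits described above can be made concrete with a short sketch. The example is written in Python rather than Prolog purely for illustration, and the function name to_bits is hypothetical. It assumes the basic 7-bit ASCII convention and lets the caller choose which pair of characters stands for the two binary digits.

```python
# Illustrative sketch only: to_bits is a hypothetical helper, not course material.

def to_bits(text, zero="0", one="1"):
    """Return the 7-bit ASCII pattern for each character of text.

    zero and one are the two symbols used to write the binary digits;
    any distinct pair serves equally well.
    """
    symbols = str.maketrans({"0": zero, "1": one})
    patterns = []
    for ch in text:
        code = ord(ch)                  # numeric code, e.g. 65 for "A"
        if code > 127:
            raise ValueError(f"{ch!r} is outside 7-bit ASCII")
        bits = format(code, "07b")      # seven binary digits, zero-padded
        patterns.append(bits.translate(symbols))
    return patterns


if __name__ == "__main__":
    print(to_bits("Hi"))                      # conventional 0/1 notation
    print(to_bits("Hi", zero="-", one="+"))   # same patterns, different symbols
```

Both calls describe the same two bit patterns; only the notation on the page differs, which is the sense in which the choice of "0" and "1" is merely customary.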

Normally, we need not concern ourselves with the correspondence between letters and the current flow changes we denote using binary digits: the conversion is performed automatically by electronic circuits so that when we press a key on a computer keyboard, the corresponding sequence of electrical changes is transmitted to the computer. The reverse conversion is also performed automatically when information originating from a computer is displayed on a screen or is printed. This automatic conversion process is entirely adequate so long as we are content to type and display information transcribed employing variations and standard extensions of the Roman alphabet. As soon as we must exceed these limitations, however, we face problems, and linguists are among those who must contend with these problems. Hence, issues of this nature, namely the encoding of nonstandard character sets such as the IPA, will be discussed.
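The difficulty with characters outside the standard set can be seen in a small sketch. It is written in Python for illustration, and the helper names are hypothetical. It also assumes Unicode with the UTF-8 encoding as the alternative convention, which is a choice made here for the example and not one named in this outline. The sample word is an IPA transcription of English "ship".

```python
# Illustrative sketch only: helper names and the use of UTF-8 are assumptions.

def is_ascii_encodable(text):
    """Report whether every character of text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def utf8_bytes(text):
    """Return, for each character, the list of byte values UTF-8 assigns it."""
    return [list(ch.encode("utf-8")) for ch in text]


if __name__ == "__main__":
    ipa_word = "\u0283\u026ap"   # esh, small capital I, p: IPA for "ship"
    print(is_ascii_encodable("ship"))    # ordinary Roman letters
    print(is_ascii_encodable(ipa_word))  # IPA symbols have no ASCII codes
    print(utf8_bytes(ipa_word))          # IPA symbols need more than one byte each
```

The Roman letter "p" keeps its single ASCII byte, while each of the two IPA symbols requires a two-byte sequence under this convention.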

Other topics to be introduced that relate to computers and computing in general include operating system commands and utilities (such as are required for the editing, copying, and printing of files); computer components (such as CPU, RAM, ROM, disks, controllers, terminals) and computer architecture; sequential and parallel (and "connectionist") processing; and symbolic versus neural computing.

The InterNet is a large collection of communications networks which connects almost all universities and research institutions in the world and permits members of these organisations to exchange information. The most commonly used medium for this information exchange is electronic mail, usually known as e-mail, whereby individuals send messages to one another, whether the communicants are in the same room or on opposite sides of the world. As part of this course, you will use e-mail. There are several InterNet facilities or services which are based upon and employ simple e-mail. Among the most frequently employed of these are lists, which are used to broadcast submissions, or postings, from individuals to all subscribers of the list. You will be subscribing to a list devoted to the exchange of information among linguists, and you will be retrieving information from the database of past submissions to this list. In addition to e-mail and lists, other InterNet services such as telnet, FTP (File Transfer Protocol), and WWW (World Wide Web) will be introduced.

The following topics will be discussed in the context of the Prolog programming language:

The foregoing list is not exhaustive, and other, related topics will be introduced. Nor will the topics cited necessarily be covered in the order in which they are listed, and neither will they receive equal attention in terms of the time devoted to them. Some topics will receive much greater attention because of their particular significance to computational linguistics. Especially important concepts will be reviewed, reiterated in varying contexts, and elaborated upon throughout the course. The relevance of the topics covered to the study of natural language will be discussed, and linguistic applications and examples will be introduced. For example, in the context of the InterNet, text formatting and encoding tools such as SGML (Standard Generalised Markup Language), HTML (HyperText Markup Language), and TEI (the Text Encoding Initiative) will be discussed. In the context of Prolog, grammar formalisms such as DCGs (Definite Clause Grammars) and UGs (Unification Grammars) will be introduced, and the organisation, encoding, and storage of lexicons will be discussed.