Wednesday, December 26, 2007

Symbols, Tokens, and Terminals

intro
prev: Languages and Recognition

Here we are going to turn from the metaphorical to more precise, but abstract definitions of our subject matter. This is what mathematicians (and by extension computer scientists) do.

To recognize if what we are looking at is a language, it needs to be composed of parts. That is things which we can see over and over again and they always are accepted as being the same. We call those parts symbols.

The first thing we need to define are the symbols. We want the symbols to be abstract, so we simply say that there exists a set of them. We call this set Σ (sigma). In Egyptian hieroglyphs, the set of symbols would be the pictures that they drew: eyes, owls, cats, etc. If we are reading English, one letter at a time, they would be the letters of the alphabet: a, b, c, etc. However, we also might read English a word at a time, and in that case the symbols could be words like: See, Dick, run.

The point is that we get to choose a set of symbols that we are going to operate on. The theory we are working with doesn't care what the symbols are. It just says that they must exist.

There are two key attributes to symbols, which make symbols atomic. The first key thing about symbols is they must stand for a complete mark by themselves. When you look at a symbol it must be sufficient to stand-alone.

In this way, the field is somewhat like chemistry. In chemistry, one has different kinds of atoms, Carbon (C), Hydrogen (H), Oxygen (O), and so forth. This is the set of symbols for chemistry. And, for the most part, chemistry doesn't care about how the atoms are made up the neutrons and protons in the nucleus or the electrons in their orbits and even less so about the quarks and gluons making up the various sub-atomic particles. The symbols, the atoms, are sufficient to know that H2O is water. You do not need to know that hydrogen has one proton and oxygen has eight to write that equation. There are no protons or neutrons directly in the equation. They are embedded in the atoms. The carbon, hydrogen, and oxygen atoms are sufficient for the grammar.

And, so it is with parsing. Once we select a symbol set we treat those symbols as complete entities.

If we are describing our grammar at the character level, say ASCII for parsing English text, or Unicode for parsing text from a wider variety of languages, we don't care about the bits or bytes within a character. We treat the characters as complete entities. Therefore, if we are parsing in terms of Unicode characters, we don't need to know that in the UTF-8 representation of Unicode, some characters are 1 byte long, others are 2, and still others are 3 bytes long. We treat the characters as units and ignore their representation as bytes.

The same thing happens if we make our symbol set be words and treat words as complete units. We don't actually care that the word "See" has three characters in it. We treat the word "See" as an atomic symbol. Thus, the sentence "See Dick run." might have 3 symbols if we are parsing solely at the word level, instead of 13 at the character level.

The other part of the atomic symbols is that they need to be indivisible at the grammar level. We don't look at the constituent parts and see if they form another symbol.

The word "See" in the sentence is a word by itself. That makes is a fine symbol. If we are talking about English words, we could also have the word "Seed" in our grammar. However, if "See" is a symbol, its appearance inside this other word is irrelevant. The word "See" is not a part of the word "Seed". The word "Seed" is atomic in itself and "See" is not part of that word, although the three characters that make up "See" are. However, because symbols are indivisible, we cannot extract "See" from "Seed". They are simply two distinct symbols, that just happen to share some characters when we represent them in English using the alphabet.

There are two words commonly used in parsing to describe these symbols: tokens and terminals. These two words are essentially interchangeable in parsing terminology. However, they do emphasize the distinct aspects of the concept.

The word token emphasizes the treatment of a symbol as a specific thing, like a subway token or a token of affection, that are used for their purposes only in their intact state. If you melt a subway token down and attempt to use it, it is no longer a subway token, and won't get you through the turnstile. Thus, parsing people usually use the word token when they want to describe the symbols of a grammar as atomic clusters of smaller units, as in words made of characters. This is similar to the effect that "See" is a word in English and the constituent characters "S", "e", and "e" are not by themselves treated as words as part of the word "See". Thus, "See" is a token from this point of view. And "S" is not a token, because it does not stand-alone.

The word terminal is related to how symbols are used in a grammar. Grammars are made of rules (often called productions) that give definitions. Eventually, the definitions come down to atomic entries, ones that do not have definitions themselves. These definitions terminate (end) the chain of definitions. Thus, they are terminal definitions and hence they are called terminals.

So, from these two viewpoints, one can see that tokens emphasize the point that the interesting symbols may have constituent parts that are outside of the interest of this particular grammar but as terminals they are atomic. To return to our chemistry example, the tokens for carbon, hydrogen, and oxygen are atomic (terminal) for the reactions we might wish to describe. The fact that there are sub-atomic levels (protons and neutrons which are further sub-divided down to quarks and gluons) is not relevant for that grammar. The reactions described by chemistry do not pull a neutron out of one atom and stick it in another. Operating at that level is a whole different set of rules and thus a different grammar.

No comments: