Wednesday, December 26, 2007

Symbols, Tokens, and Terminals

intro
prev: Language and Recognition

Here we are going to turn from the metaphorical to more precise, though abstract, definitions of our subject matter. This is what mathematicians (and by extension computer scientists) do.

To recognize whether what we are looking at is a language, it needs to be composed of parts: things which we can see over and over again and which are always accepted as being the same. We call those parts symbols.

The first thing we need to define is the symbols. We want the symbols to be abstract, so we simply say that there exists a set of them. We call this set Σ (sigma). In Egyptian hieroglyphs, the set of symbols would be the pictures that they drew: eyes, owls, cats, etc. If we are reading English, one letter at a time, they would be the letters of the alphabet: a, b, c, etc. However, we also might read English a word at a time, and in that case the symbols could be words like: See, Dick, run.

The point is that we get to choose a set of symbols that we are going to operate on. The theory we are working with doesn't care what the symbols are. It just says that they must exist.
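To make that concrete, here is a minimal sketch in Python (the particular sets are made up for illustration) of two different symbol sets we might choose for reading English:

    # Two possible choices of symbol set (Sigma) for reading English.
    # The theory doesn't care which we pick; it only requires that a
    # set of symbols exist.
    SIGMA_LETTERS = set("abcdefghijklmnopqrstuvwxyz")  # one letter at a time
    SIGMA_WORDS = {"See", "Dick", "run"}               # one word at a time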

There are two key attributes of symbols, which together make symbols atomic. The first is that a symbol must stand as a complete mark by itself. When you look at a symbol, it must be sufficient to stand alone.

In this way, the field is somewhat like chemistry. In chemistry, one has different kinds of atoms: Carbon (C), Hydrogen (H), Oxygen (O), and so forth. This is the set of symbols for chemistry. And, for the most part, chemistry doesn't care about how the atoms are made up: the neutrons and protons in the nucleus or the electrons in their orbits, and even less so about the quarks and gluons making up the various sub-atomic particles. The symbols, the atoms, are sufficient to know that H2O is water. You do not need to know that hydrogen has one proton and oxygen has eight to write that equation. There are no protons or neutrons directly in the equation. They are embedded in the atoms. The carbon, hydrogen, and oxygen atoms are sufficient for the grammar.

And, so it is with parsing. Once we select a symbol set we treat those symbols as complete entities.

If we are describing our grammar at the character level, say ASCII for parsing English text, or Unicode for parsing text from a wider variety of languages, we don't care about the bits or bytes within a character. We treat the characters as complete entities. Therefore, if we are parsing in terms of Unicode characters, we don't need to know that in the UTF-8 representation of Unicode, some characters are 1 byte long, others are 2 or 3, and still others are 4 bytes long. We treat the characters as units and ignore their representation as bytes.
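A minimal sketch in Python (the word is made up for illustration) shows the distinction:

    # The character é is a single symbol to us, even though its UTF-8
    # encoding takes two bytes.
    text = "café"
    print(len(text))                  # 4 characters (symbols)
    print(len(text.encode("utf-8")))  # 5 bytes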

The same thing happens if we make our symbol set be words and treat words as complete units. We don't actually care that the word "See" has three characters in it. We treat the word "See" as an atomic symbol. Thus, the sentence "See Dick run." might have 3 symbols if we are parsing solely at the word level, instead of 13 at the character level.
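Counting both ways in Python (a minimal sketch; how to handle the period is a choice we make, not something the theory dictates):

    sentence = "See Dick run."
    chars = list(sentence)                # 13 symbols at the character level
    words = sentence.rstrip(".").split()  # ['See', 'Dick', 'run']
    print(len(chars), len(words))         # prints: 13 3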

The other attribute of atomic symbols is that they need to be indivisible at the grammar level. We don't look at their constituent parts to see if they form another symbol.

The word "See" in the sentence is a word by itself. That makes is a fine symbol. If we are talking about English words, we could also have the word "Seed" in our grammar. However, if "See" is a symbol, its appearance inside this other word is irrelevant. The word "See" is not a part of the word "Seed". The word "Seed" is atomic in itself and "See" is not part of that word, although the three characters that make up "See" are. However, because symbols are indivisible, we cannot extract "See" from "Seed". They are simply two distinct symbols, that just happen to share some characters when we represent them in English using the alphabet.

There are two words commonly used in parsing to describe these symbols: tokens and terminals. These two words are essentially interchangeable in parsing terminology. However, they do emphasize the distinct aspects of the concept.

The word token emphasizes the treatment of a symbol as a specific thing, like a subway token or a token of affection, which serves its purpose only in its intact state. If you melt a subway token down and attempt to use it, it is no longer a subway token, and it won't get you through the turnstile. Thus, parsing people usually use the word token when they want to describe the symbols of a grammar as atomic clusters of smaller units, as in words made of characters. This parallels the fact that "See" is a word in English, while the constituent characters "S", "e", and "e" are not themselves treated as words within the word "See". Thus, "See" is a token from this point of view, and "S" is not a token, because it does not stand alone.

The word terminal is related to how symbols are used in a grammar. Grammars are made of rules (often called productions) that give definitions. Eventually, the definitions come down to atomic entries, ones that do not have definitions themselves. These definitions terminate (end) the chain of definitions. Thus, they are terminal definitions and hence they are called terminals.
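A minimal sketch in Python (the grammar itself is made up for illustration): if we write a grammar as a mapping from rule names to their definitions, the terminals are exactly the symbols that never appear as a rule name.

    # Rule names (the keys) have definitions; symbols that are never
    # defined terminate the chains of definitions -- the terminals.
    grammar = {
        "sentence": [["subject", "verb", "."]],
        "subject":  [["Dick"], ["Jane"]],
        "verb":     [["runs"], ["sees"]],
    }
    symbols = {sym for rules in grammar.values()
               for rule in rules for sym in rule}
    terminals = symbols - grammar.keys()
    print(terminals)  # {'Dick', 'Jane', 'runs', 'sees', '.'}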

So, from these two viewpoints, one can see that tokens emphasize the point that the interesting symbols may have constituent parts that are outside the interest of this particular grammar, but as terminals they are atomic. To return to our chemistry example, the tokens for carbon, hydrogen, and oxygen are atomic (terminal) for the reactions we might wish to describe. The fact that there are sub-atomic levels (protons and neutrons, which are further sub-divided down to quarks and gluons) is not relevant for that grammar. The reactions described by chemistry do not pull a neutron out of one atom and stick it in another. Operating at that level is a whole different set of rules and thus a different grammar.

Language and Recognition

prev: intro

Many people, when they are only partially occupied, will doodle on a piece of paper, drawing pictures or complex shapes and maybe even writing down notes in words of things which pass through their mind. Such doodles are generally not considered a language. There is very little intent to convey meaning, except perhaps in the notes. In contrast, the Egyptian and Mayan pyramids are also covered in pictures, and these are not doodles. They are writing that attempts to convey very important meanings. These pictures are a language.

So, how does one distinguish the doodles on a pad from the language on the pyramids? First, the language attempts to contain and convey meaning. Now, since most of us read neither Egyptian nor Mayan, we cannot extract the meaning of the writing on the pyramids, but it is still there.

But, even if we cannot extract the meaning from the writing on the pyramids, we can still distinguish Egyptian hieroglyphs from Mayan ones. They simply look different.

This brings us to our first task, recognition. Recognition is simply the act of saying whether something is written in the language or not. Thus, if we look at a set of symbols on a pad, we can recognize whether they are Egyptian hieroglyphs or not and give a yes/no answer to that question.
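As a minimal sketch in Python (the symbol set is made up, and real recognition involves far more than checking individual symbols), recognition is a function from writing to a yes/no answer:

    # The crudest recognizer: say "yes" only if every symbol in the
    # writing is drawn from the chosen symbol set.
    SIGMA = {"See", "Dick", "run", "."}

    def recognize(symbols):
        return all(sym in SIGMA for sym in symbols)

    print(recognize(["See", "Dick", "run", "."]))  # True  -- our language
    print(recognize(["Voir", "Dick", "courir"]))   # False -- not ours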

An analogous process can be applied to the bits stored in a computer. These bits can be organized into symbols. The symbols can then be examined to see if they look like a language, and if they are recognized as such, we can say "yes", they appear to be writing and not just doodles.

If the bits in a computer are writing in a language, we can then attempt to extract the meaning that the writing conveys. This is our grand goal. However, we start with the simplest question first: do we recognize the symbols as language?

next: Languages Are Defined By A Grammar

Introduction to a "blook" about pattern matching

This blog was really intended to be a book in disguise. Thus, you might call it a "blook".

The intent is to write down in very simple terms what I know about pattern matching. I want to make this subject accessible to just about anyone. And, the fact is that it should be. The fundamentals of the subject are trivial.

Of course, the implications, interactions, special cases, and so on are very subtle and sometimes non-intuitive. Thus, it is kind of like playing "Go" or solving Sudoku puzzles: easy to learn, difficult to master. However, even the difficulties of the subject are actually simple, as long as one approaches them correctly and doesn't take inappropriate short-cuts. So, if I succeed in that, you, my reader, can grasp the topic and it will all seem simple to you. Moreover, even in those cases where you know it is complex and you must tread carefully, you will have a clear way of taking each incremental step.

I initially found writing a book as a blog difficult, because blogs tend to be structured so that one starts reading from the most recent postings first, while books have a different order, where one starts at the beginning, which is generally not the most recent section. In fact, I put this project down for over two years because of that. However, realizing that one can achieve the book order by simply laying down the links one wants, I have returned to this project. If you follow the "next" links at the bottom of each section, you can read what I intended the book to look like.

At the same time, I actually have need for a more normal blog on pattern matching. Thus, you will also find, outside of the linked sections, random blog entries on the topic. Over time, if they fit within the scope of the book, I will join them in the links at the appropriate places.

However, if you are trying to read this like a book, here is your first link.

Turn of the [21st] Century Pattern Matching