Tuesday, August 11, 2015

The Alphabet

Now, we will get a bit more general. By text, we don’t literally mean text. It can be any sequence we can treat as being composed of letters. And, again, by letters, we don’t necessarily mean letters; any sequence of “things” will do (things often called characters, especially when dealing with text that includes non-alphabetic characters, e.g. numbers, punctuation symbols, etc.).

The only requirements on the things we will use as characters are that we can uniquely and quickly recognize them and that there is a finite list of them, known in advance. Thus, bits in a computer memory will do: there are two distinct values, “0” and “1” (or high and low, or current flowing and grounded). We can also use the four chemical bases in DNA: adenine, cytosine, guanine, and thymine, or “a”, “c”, “g”, and “t”. Again, they are distinct and the list is finite. Packets sent across a network are treated as being composed of bytes (or octets), and while a byte can take on any of 256 values, those values are still distinct and the list finite, so bytes also form a suitable alphabet.
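To make this concrete, here is a small sketch (in Python, with made-up names) of treating an alphabet as nothing more than a finite set of distinct symbols:

```python
# An alphabet is just a finite set of distinct, quickly recognizable symbols.
DNA_ALPHABET = {"a", "c", "g", "t"}
BINARY_ALPHABET = {"0", "1"}

def is_text_over(alphabet, sequence):
    # A sequence counts as "text" over the alphabet if every one
    # of its characters is drawn from that finite set.
    return all(ch in alphabet for ch in sequence)
```

So, for example, "gattaca" is text over the DNA alphabet, while "hello" is not.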

The alphabet is also often called the character set as it is the set of all the characters that the text might contain. In theoretical texts, the alphabet is often represented by the uppercase Greek letter sigma (Σ), particularly in formulas. When you see a Σ, you should mentally think “any character will do here”.

If you are reading this, there is a good chance you learned an alphabet at some point in your life. If you know more than one language, you probably know that there isn’t just one alphabet; some languages have extra letters. However, every language has a specific, fixed set of letters.

This fixed character set is the alphabet of the language, and it is the basis of the language. Each language has such a basis, an alphabet.

This is even true of pictographic languages, where pictures stand for whole words. There is a limited (although in some cases large) set of symbols one has to learn to make in order to write. Similarly, if one speaks a language, there is a limited set of sounds one has to learn to make. The same is true of the languages we want to pattern match: there is a fixed set of characters that serves as the basis of those languages as well.

In contrast, the set of integers would not make an acceptable alphabet, as there isn’t a finite number of them. On the other hand, representing the integers as decimal numerals is fine, because there are only 10 distinct digits, “0” through “9”, and thus a legitimate alphabet. That is something to be aware of: subtle differences in how a problem is worded can affect it in powerful ways, in this case changing a problem we couldn’t address into one we can.
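A sketch of the reworded problem (Python, illustrative names): the integers themselves are not an alphabet, but their decimal spellings are ordinary strings over a 10-character alphabet:

```python
DIGITS = set("0123456789")

def is_decimal_numeral(text):
    # The integers are infinite, but each integer's decimal spelling
    # is an ordinary string over this 10-character alphabet.
    return len(text) > 0 and all(ch in DIGITS for ch in text)

print(is_decimal_numeral(str(2015)))  # True: str() gives the decimal spelling
```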

Similarly, most image processing is not a pattern matching problem that we will address. Trying to identify whether a picture contains a bird does not involve a fixed alphabet of simple to recognize characters. In fact, recognizing the parts of a picture is the hard part. Once, we have a picture divided into parts and if those parts could be represented by words like "wing", "eye", "head", "body" it might be possible to turn it into a pattern matching problem we can address. However, in the process, you have skipped the hard part of the problem.

Next, in general, we also don’t care how the letters of the alphabet are written. Thus, it is perfectly valid to consider “a” and “a” and “a” all to be the same letter. It is also valid to consider them 3 distinct letters. The only requirement is that whether those letters are the same or different must be known in advance and must be treated consistently. Thus, not only can subtle differences have big effects on the problem; there are also large differences that can be swept under the rug as irrelevant.
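One common way to make such a decision in advance and apply it consistently is to normalize every character before comparing, as in this small Python sketch (the name is illustrative):

```python
def same_letter(a, b):
    # Decide in advance that upper- and lower-case forms are the
    # same letter, and apply that decision consistently by folding
    # both characters before comparing them.
    return a.casefold() == b.casefold()
```

Dropping the folding step gives the other valid choice, where "A" and "a" are 3... er, 2 distinct letters; the key is that one choice is made up front and used everywhere.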

So, again, a pattern matching problem begins with some text that can be broken down into distinct letters; we look at those letters one-by-one and determine whether they match the pattern or not. That is all there is to it. It is a very simple problem, and that is why it is useful in many places.

To reiterate: the basis of a pattern match problem is an alphabet, which is a set of characters. We take the characters of the alphabet and form sequences of them by concatenating them. By concatenating, we merely mean writing them in some order. These concatenated sequences we call strings.

We match strings by seeing if they have the same characters in the same order as the pattern does. The string we are trying to match is generally called the input text or subject.

To distinguish the characters in the pattern from the letters in the input text, we will call the characters in the pattern the items. The items in the pattern are the things which tell us which characters in the input text will match. We call them items, because they aren’t always the exact character to match, but sometimes, just a requirement on the character, such as the input character must be a digit.
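One way to picture items is as requirements (predicates) on characters rather than exact characters. A small Python sketch, with an invented pattern, might look like this:

```python
# A pattern is a sequence of items; each item is a requirement on
# one input character, not necessarily one exact character.
pattern = [
    lambda ch: ch == "x",     # this item demands exactly the letter "x"
    lambda ch: ch.isdigit(),  # this item accepts any digit
    lambda ch: ch.isdigit(),
]

def matches(pattern, text):
    # Every character must satisfy its item, with none left over.
    return len(text) == len(pattern) and all(
        item(ch) for item, ch in zip(pattern, text))
```

With this pattern, "x42" matches but "xab" does not.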

Exercises and Questions


Here are some questions to consider in deciding if you understand what a pattern matching problem is (or can be). The yes-no questions also deserve a why or why-not answer. Moreover, for those questions where you think the answer is “no” and pattern matching doesn’t apply, consider what would need to be done to allow pattern matching to apply. Similarly, if the answer is yes, how could some subtlety be introduced that would make the answer no?
  1. Are the characters that represent Roman Numerals, e.g. I, II, III, IV, V, X, XV, XX, a suitable alphabet?

  2. Does the fact that DNA is often coiled in a helical spiral affect its interpretation as text?

  3. Are chemical formulas generally treatable as text?

  4. Are mathematical formulas generally treatable as text?

  5. Are weather patterns generally treatable as text?

  6. Could the sequences of left and right turns to solve a maze be a problem that could be expressed as a pattern matching problem?

  7. Can the squares on a Monopoly® board be treated as an alphabet? Could the sequence of play of a Monopoly game be treated as a pattern matching problem?

  8. Can text that has been encrypted be dealt with by pattern matching?

  9. Is spoken speech suitable for a pattern matching problem?

prev: The Simple Basics of Pattern Matching
next: Pattern Matching Is Easy

Monday, July 27, 2015

Literals and Quoting

Because the pattern is the key part of a pattern matching problem, it is important to talk about the notation used to represent the pattern. Notation simply means the way we will write the problem out, or more importantly how we will specify the problem as input to a tool, the pattern matching compiler, such that the tool can generate a program (or configure some hardware) that will process the input text and tell us whether we have a match or not. Understanding the notation is a key point of emphasis.

For the single fixed string problem, we need a way of representing the fixed string. In particular, we need a notation for literals.

As mentioned before, if we allow characters to represent themselves and simply list the characters in order to give us the sequence, then we have a very simple notation for fixed string literals. This notation is used quite commonly in tools that only allow regular expressions. Thus, whether you pass a pattern to the libPCRE library, write a regular expression to search with in Emacs, or write a lexer in LEX, each regular expression starts with literals where each character represents itself. Thus, we simply write the five characters h, e, l, l, and o for the literal string “hello”, as shown below:

hello
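Python’s re module, which borrows this convention, can illustrate: inside the pattern, each ordinary character simply represents itself, so the unquoted hello is a literal:

```python
import re

# In a regular expression, an ordinary character represents itself,
# so the five characters of the (unquoted) pattern form a literal.
print(bool(re.search("hello", "she said hello to us")))  # True
print(bool(re.search("hello", "goodbye")))               # False
```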

For more complicated uses, one wants to separate literal strings from names of patterns. In that case, it is useful to have a way of distinguishing a literal string. Fortunately, the convention of quoting a string applies nicely. This notation is used in Yacc++ and many programming languages. When one wants to write a literal string, one encloses the string in quotes, as shown below:

"hello"

The idea of quoting is important. It allows us to separate that which we want to take literally from that which we mean symbolically. In most natural languages, one quotes text that refers to something someone else said, and the quotes mean we are reporting what they said literally. The reason one quotes the literal text (what someone else said) and not what we are saying ourselves is that we say a lot more text than we quote from other people. Thus, it makes sense to quote their text, because it means a lot fewer quotation marks.

In computer languages, we often have to choose which items we quote. And in many [simple] pattern languages there are a lot more literals than references.

In fact, at the level of simple strings, we have no references, only literals. Thus, it makes more sense to reverse the quoting convention. That’s why in most regular expression languages, simple literal text is not quoted. Thus, if you are writing a regular expression for LEX, or PCRE, or Emacs, you do not (generally) quote literal strings. Again, in these usages, by not quoting literal strings we are saving ourselves a lot of quotation characters.

However, in higher level languages where there are many references and literal strings are less frequent, it makes sense to quote the literal strings. Thus, in a programming language like C, or Emacs Lisp, or in a parser generator like Yacc++, one does quote literal strings.

Of course, there are a variety of middle grounds and variations on the theme. For example, in many regular expression uses, one is allowed to enclose a string in quotation marks; LEX allows that. Moreover, in LEX, if the regular expression includes whitespace, one must quote the regular expression, because whitespace is otherwise considered to terminate the regular expression. Thus,

hello return 1;

and

"hello" return 1;

are the same to LEX: in both, hello is a string literal (and the quotation characters, when present, are ignored).

However, if you wanted spaces in the literal string, as below, you need the quotes.

"h e l l o" return 1;

Note: in each of these cases, the part after the whitespace (the “return 1;”) is an action, which allows LEX to do something more than just recognize the text (in this case, solve a lookup problem, which is something we’ll define shortly).
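The flavor of LEX’s pattern-action pairing can be sketched in ordinary Python (this is an illustration of the idea, not LEX itself; the rule table and names are invented):

```python
import re

# Each rule pairs a pattern with an action, in the spirit of a LEX
# specification; a match triggers the paired action.
rules = [
    ("hello",     lambda: 1),  # like:  hello        return 1;
    ("h e l l o", lambda: 2),  # like: "h e l l o"   return 2;  (spaces kept)
]

def scan(text):
    for pattern, action in rules:
        # re.escape keeps every pattern character literal.
        if re.fullmatch(re.escape(pattern), text):
            return action()
    return None
```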

Exercises and Questions


These questions are more practical and attempt to get you to see where you have seen quoting conventions in any programming you’ve done so far and to evaluate their differences.

  1. Does your favorite programming language have a notation for string literals? If so, what is it? Does it use quotation marks? Which ones? Does it use different quotation marks to mean different things?

  2. How about your web page design software or language? Same questions.

  3. How about your favorite text editor?

  4. Where are string literals used in a text editor?

  5. Some languages use different opening and closing quotation marks. What are the advantages and disadvantages of doing so?

  6. In recent versions of C dialects, two consecutive quoted strings are joined together as if they were one string literal. What are the advantages and disadvantages of doing that?

  7. What are the advantages and disadvantages for splitting the regular expression from the action in LEX?

prev: More on the Simplest Pattern Matching Problem

More on the Simplest Pattern Matching Problem

Now that we have the general pattern matching problem defined (we are looking at some text character-by-character), we are going to describe that problem more rigorously by building up from the simplest versions of the problem to more complicated ones. That’s right: we are taking this one general problem and turning it into lots of different problems. In the process, we will build up our vocabulary of ways to describe the problem, introduce notations to express the more complicated problems, and develop different techniques that can actually solve these problems.

We have already looked at the simplest problem, the single fixed string recognition problem. That’s the problem where we are given a piece of text that might match a word, and the (single) word we want to determine whether it matches, and we want back a simple yes or no answer. Is the text the word we are asking about or not?

The name of the problem includes two special terms: “fixed string” and “recognition”.

Fixed string means the problem is dealing with a word that has only one spelling, and we write that spelling out as part of the problem. Another name for such words is literals, which means we literally write the characters of the word to represent the word. In our case, we had a single fixed string; that is, we are looking for only one word. We know what the word is because its characters are literally spelled out one-by-one as part of the problem, and none of the characters can vary from that spelling if we are to have a match. Any pattern where the items of the pattern spell out the characters to match exactly (and literally) is called a fixed string pattern.

Recognition means we are simply looking at the text and trying to recognize if it matches the pattern. We are not doing more complicated analysis. We simply want a yes or no, match or not match, answer. We will eventually get to more complicated problems where we want much more than a yes-or-no answer, but for a recognition problem that is all we want.

Implicit in this is the fact that we are looking at the whole text, and only if the whole text exactly matches the word in question do we return a yes. So, although we didn’t mention it above, this is also a “match” problem.

Match means we want the entire text to match the pattern and to exactly correspond character by character to the pattern. If there are characters left over in the text that we didn’t line up to the pattern, then the answer is no and the text is not a match. Similarly, if there are characters in the pattern that weren’t lined up to the text, the answer is also not a match.
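Python’s re module makes this distinction convenient to see: re.fullmatch demands that the whole text line up with the pattern, while re.search is satisfied by a piece of it:

```python
import re

# "Match" means the entire text must correspond to the pattern;
# leftover characters on either side make the answer no.
print(bool(re.fullmatch("hello", "hello")))   # True
print(bool(re.fullmatch("hello", "hello!")))  # False: the "!" is left over
print(bool(re.search("hello", "hello!")))     # True: search tolerates leftovers
```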

Now, this may seem like a lot of detail to describe a simple problem, but it gives us the start of a rich vocabulary that allows us to precisely distinguish the different pattern matching problems separating the simple ones from those that are complex.

prev: Pattern Matching Is Easy
next: Literals and Quoting

Pattern Matching Is Easy

Looking at the simplest problem, one might be tempted to think pattern matching is easy. The more cynical might assume that it is just deceptively easy and there are real complexities lurking. At some level both points of view are right.

Pattern matching is intrinsically easy. Even as we build up more and more complicated notions of what kinds of patterns we wish to match, these notions will all be relatively simple steps. Moreover, we will generally be able to solve them with essentially the same simple problem-solving methods, with a few extensions.

Pattern matching is also deceptively easy. There are definite places where one’s intuition leads one astray, places where, after making three right turns, one finds oneself unavoidably going left. However, it isn’t that counter-intuitive aspect that is the real issue. One can learn the places where common sense fails and 1 + 1 ≠ 2.

Pattern matching being easy is part of the problem itself.

Along with the sense that much of pattern matching is well-known and well-understood and little is left to discover is the fact that it is fairly easy to write a pattern matching solution. Many people have. Don’t be surprised when you learn that you can implement some of the solutions mentioned here in a few hours. Moreover, it is easy to implement a solution that incorporates some novel aspect. Again, many people have. In fact, in my basement there is a room full of papers, each showing one novel aspect or another that someone implemented and never carried forward.

The hard part of writing a pattern matching solution is the engineering part, the user-interface part, the scaling part, the combinatorial complexity part, the keeping simple things simple, but not too simple part. There are lots of little choices to make in a solution that impact its performance or the ability of some user to describe the pattern they want. Be aware that even for a one-time throw-away solution, you may have to re-write and change parts as you become aware of the real needs and the real issues. And, if you are trying to make something that you aren’t going to throw away, you need to be all the more conscious of what you choose and why. Because of that, you will find many points in this blog where the trade-offs between different solutions are mentioned.

Along those lines, you will find me particularly intolerant of people who hand-write a parser or implement their own DFA interpreter without considering using a well-proven tool to do that job. The problem isn’t that their solution won’t work. Pattern matching is easy, and it surely will work.

What a solution like that won’t do is scale. It will be hard for it to survive multiple rounds of enhancement without either being entirely re-written or becoming a nest of convoluted hacks. It also won’t easily be made parallel or otherwise optimized, as our optimization technology works well on stylized code and high-level abstractions, but not so well when given an arbitrary program that may have assumptions that are hard to prove or transform. Long ago I heard a great quote, which I paraphrase, about a compiler versus an assembly language programmer that makes this point well: “compiler generated code can generally reach about 75% of the performance of a hand-optimized assembly program, but the compiler will continue to reach that rate even as the program is modified and all the careful register assignments in the hand-optimized program have been rendered incorrect”. This is particularly true in pattern matching. You can quickly write a program that exploits the ideas contained in this blog. It is much harder and more tedious, and thus best left to a tireless program, to line up all the details of those ideas in the right order to express an approximation to the optimal solution for a non-trivial set of patterns. This is why, even if you have what seems to be an easy pattern matching problem, you want to use a well-written tool to solve it, so that as the problem becomes more complex the solution continues to evolve with it, rather than becoming a rat’s nest of ad hoc patches.

Just as it is easy to implement a pattern matching solution, it is usually easy to write a pattern itself. In fact, it needs to be easy to do that. Otherwise, pattern matching is useless. However, that easiness gets some people carried away.

It is far too easy to write a pattern that over-specifies the problem. As a friend of mine once said (again slightly paraphrased), “how often does a hacker get some sophisticated attack written, but mess up and get the quotes wrong, and does one really want to fail to detect those almost-attacks?” To address that, you will find points where this blog tries to indicate that a simpler solution might cover the same space. It’s one of those curious contradictions that a sophisticated, subtle, and pedantically correct solution may be “worse” than a simpler solution that only approximates the right answer, but does so in a way that works better for the user.

prev: The Alphabet
next: More on the Simplest Pattern Matching Problem

The Simple Basics of Pattern Matching

To understand pattern matching, we need to understand what kind of problems it is designed to address. We will start with the very simplest problem. If someone sends us a text message with the letters “h” “e” “l” “l” “o”, can we recognize that as the word “hello”?

Now there are certainly many ways to envision that as a complicated problem, but we are trying to keep it simple. Did we see the letters “h” “e” “l” “l” “o” together and in the right order? To put it more bluntly, was the first letter an “h” and the second letter an “e” and the third and fourth letters both “l”s and the fifth and final letter an “o”? That is all we are trying to ask.

Notice how the question was asked by specifying an exact step-by-step set of sub-questions, one for each letter of the desired answer. That explains what pattern matching is all about. We have some text and we look at it letter-by-letter and if all the letters are correct we have found a match.

Every pattern matching problem is a variation on that. We have some text (or something we treat like text). The text is composed of letters (or again something we will treat like letters). We look at each letter and see if it matches the pattern. And, if the letters match, we have a match for the pattern.

This also gives us a first bound on the complexity of the problem. In the case of finding a match, we are limited by the length of the string that matches: it must take at least one comparison for each letter. Thus, for the matching case, we have a linear lower bound on the problem. That is, as the text being matched gets longer, the time to solve the problem gets longer at a linear rate. One more letter, one more comparison. As long as we can perform each comparison in a constant amount of time, the total time to perform the match with this method is a linear multiple of that constant, the multiplier being the number of letters we need to match.
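The letter-by-letter procedure, with its one-comparison-per-letter cost, can be sketched in Python (the names are illustrative):

```python
def match_fixed_string(pattern, text):
    # One comparison per letter: the work grows linearly with the
    # length of the text being matched.
    comparisons = 0
    if len(text) != len(pattern):
        return False, comparisons
    for wanted, got in zip(pattern, text):
        comparisons += 1
        if wanted != got:
            return False, comparisons
    return True, comparisons

print(match_fixed_string("hello", "hello"))  # (True, 5): one comparison per letter
```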

Exercises and Questions


Here are some questions to consider in deciding if you understand the time bounds on simple pattern matching.
  1. Why is the lower bound for a pattern matching problem at least linear in the number of characters in the text that constitutes the match? What can happen if one of those comparisons isn’t performed?

  2. How about the mis-match case? Is the lower bound different for finding out that the text doesn't match the pattern? Can we find ways to detect a mis-match more quickly?

  3. What about the worst case bounds for the mis-match problem? How much text must we look at to determine that the pattern doesn't match if we get unlucky?

prev: Turn of the [21st] Century Pattern Matching (Introduction)
next: The Alphabet

Turn of the [21st] Century Pattern Matching

This blog is going to look at our knowledge of pattern matching as if it were Victorian scientific knowledge.

At the turn of the 20th century, it was a commonly held belief that all major scientific principles had been discovered and there was little left but filling in the pieces. That was before Gödel's incompleteness theorem and Heisenberg's uncertainty principle. Now that hubris seems quaint.

That quaintness is exemplified by Mr. Banks in Mary Poppins, a villain whose redemption is gained by learning to fly a kite and feed the birds. However, one is assured that Mr. Banks took his beliefs very seriously before that.

It is quite possible that the future will look back on our knowledge of pattern matching the same quaint way. However, it is possible to capture what we know about pattern matching and give a summary of our current knowledge. Moreover, that knowledge has a coherence that should make it understandable.

We may not know all there is to know about pattern matching, but what we do know, makes sense, even when that sense seems counter-intuitive at first.

This blog will attempt to capture that coherence. In particular it will attempt to formulate a consistent set of theories that explain practical results. Note the emphasis on practice here. A certain amount of theoretical rigor will be abandoned to simplify matters and make the insights accessible. We do not have a well enough advanced theory to explain all the important facets.

Thus, in some cases, we will have to wave our hands and suggest what we believe to be true and not state it with certainty. Similarly, there are known theoretical results that are not relevant to practice which we will not discuss. The goal is to have this blog be readable by an interested student with only rudimentary mathematical and programming language skills. Moreover, that student should be able to understand it once they have learned the relevant vocabulary. Most importantly, an attempt will be made to define that vocabulary in simple non-technical terms.

In fact, don't worry if you can't understand the next paragraphs yet. You can follow the links to see definitions of the terms, but don't worry if those definitions aren't helpful yet. The next paragraphs give an outline of where we are going, without trying to explain any of the terms in it. In fact, explaining those terms is an important aspect of this blog.

However, some of the technical vocabulary will be borrowed from other fields and used without much explanation when doing so would take us far away from our core subject. That is to help show the connections between what we are covering and things that people may already know. Try simply reading past the troublesome word or phrase. Hopefully, that will only render some minor details incomprehensible. If not, please point out the things which need more background material.

Attempting to explain pattern matching theory in a practical way is why this blog will talk about pattern matching rather than automata theory. By pattern matching, we mean the things we can express with regular expressions (including extensions, such as back-references, which aren't traditionally addressed) and with context free grammars. Far less emphasis will be placed upon Turing Machines. The practitioner rarely knowingly constructs a traditional Turing Machine to solve a problem. Instead, they are more likely to augment a simpler machine with extra operations which give the machine Turing equivalence, and to ignore the ability of the machine to express problems which do not halt.

In fact, this blog will mostly concentrate on solutions that run in linear or polynomial time, and it will try to find places where sub-linear solutions or approximations are possible, or where the polynomial can be made at worst quadratic, and better yet where we can bound that quadratic performance to a subset of the problem. That is the practical space. That is the space where problems can be scaled up without exhausting resources.

It is not that other problems are unimportant, but more that we don't have practical solutions for them. Note we will allow exponential solutions when they are the only ones known, or when in actual use, one can avoid the worst case behavior (or detect it and give up).

We will also try to cover the causes of exponential behavior, so that the wise practitioner can avoid those pitfalls.

This blog will essentially build up several things in parallel.
  1. A vocabulary for describing pattern matching problems.

  2. A hierarchy of more and more complex pattern matching problems.

  3. Diagrams and notations to represent those problems.

  4. Algorithms to solve those problems.

For the theoretically inclined, although most of what will be explained is common knowledge, some of the results here have never been published although I have hinted at them in some of my comp.compilers newsgroup postings. In particular, you will find that along the way, I will discuss little variants of the common form and show how those variants simplify certain practical problems.

Hopefully, you will find the presentation rigorous enough that one will be able to see that the results are valid even if not presented in a precise mathematical formulation. It is not my goal to prove my results, but simply to establish them as plausible and understandable, with the emphasis on understandable.

A note on pacing and presentation.

The material here is presented very slowly and often in small chunks, although those chunks are often littered with many details, so that one might be tempted to skip ahead. Please do so. Many of the details may not be relevant to the understanding that you need. They are just presented in the place where I thought they belonged.

Moreover, the details are presented to give the reader the sense of how even simple things in pattern matching have many even simpler parts that need to be lined up together to make a particular solution work correctly.

prev: Introduction to a "blook" about pattern matching
next: The Simple Basics of Pattern Matching

Sunday, June 20, 2010

Languages Are Defined By A Grammar

prev: Languages and Recognition

To be a language, there must be rules that distinguish what is "in" that language from what is not in the language. If we can distinguish Egyptian hieroglyphs from random doodles, there must be a set of rules that tell us when the pictures are hieroglyphs. Those rules are the grammar.

Technically, the grammar is said to define the language. That is, the rules of the grammar say precisely what is in the language and what is not. If the symbols follow the rules laid down by the grammar, they are in the language. If they do not follow the rules, they are not in the language.

Remember that deciding if a set of symbols is in a language is called recognition. Therefore, by applying the rules of the grammar to the set of symbols, one can decide if the symbols are recognized by the grammar, and thus are in the language.

Any set of rules which allows you to decide if something is in a language or not is a grammar. However, we are going to concentrate on rules that follow a specific pattern. These rules will be called a formal grammar.
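As a tiny illustration (in Python, with an invented name), here is a set of rules that decides membership in the language of binary numerals: a string is in the language exactly when it is non-empty and every symbol is “0” or “1”:

```python
def recognize_binary_numeral(symbols):
    # The grammar's rules: at least one symbol, and every symbol
    # must be either "0" or "1". Following the rules puts the
    # string in the language; breaking them leaves it out.
    return len(symbols) > 0 and all(s in ("0", "1") for s in symbols)
```

So "1010" is recognized and is in the language, while "10a1" is not.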

There are rules about what is a formal grammar. We will learn those rules. Since the rules of a formal grammar allow one to distinguish what is a formal grammar from what is not, there is a language of formal grammars and a grammar of formal grammars.

By the way, the rules of what is a formal grammar are quite simple, and many people who deal with this topic quickly expound the rules at this point and assume that a reader will get the subtleties embodied therein. However, I am going to avoid that and carefully develop the parts of a formal grammar, so that you can see how the parts come together and build complexity.

You can liken this to the way you learned numbers. First one learns to count to ten, then one learns to add, then to count to one hundred, subtract, add two-digit numbers, subtract two-digit numbers, count to a thousand, multiply, recognize numbers by place-value, divide evenly, use fractions, and so on. All despite the fact that the whole of real number arithmetic can be expressed by some simple rules--rules which one only understands by the building up of the parts. I will try to approach this topic the same way.

next: Symbols