Monday, July 27, 2015

Turn of the [21st] Century Pattern Matching

This blog is going to look at our knowledge of pattern matching like it was Victorian scientific knowledge.

At the turn of the 20th century, it was a commonly held belief that all major scientific principles had been discovered and there was little left but filling in the pieces. That was before Gödel's incompleteness theorem and Heisenberg's uncertainty principle. Now that hubris seems quaint.

That quaintness is exemplified by Mr. Banks in Mary Poppins, a villain whose redemption is gained by learning to fly a kite and feed the birds. However, one is assured that Mr. Banks took his beliefs very seriously before that.

It is quite possible that the future will look back on our knowledge of pattern matching the same quaint way. However, it is possible to capture what we know about pattern matching and give a summary of our current knowledge. Moreover, that knowledge has a coherence that should make it understandable.

We may not know all there is to know about pattern matching, but what we do know, makes sense, even when that sense seems counter-intuitive at first.

This blog will attempt to capture that coherence. In particular it will attempt to formulate a consistent set of theories that explain practical results. Note the emphasis on practice here. A certain amount of theoretical rigor will be abandoned to simplify matters and make the insights accessible. We do not have a well enough advanced theory to explain all the important facets.

Thus, in some cases, we will have to wave our hands and suggest what we believe to be true and not state it with certainty. Similarly, there are known theoretical results that are not relevant to practice which we will not discuss. The goal is to have this blog be readable by an interested student with only rudimentary mathematical and programming language skills. Moreover, that student should be able to understand it once they have learned the relevant vocabulary. Most importantly, an attempt will be made to define that vocabulary in simple non-technical terms.

In fact, don't worry if you can't understand the next paragraphs yet. You can follow the links to see definitions of the terms, but don't worry if those definitions aren't helpful yet. The next paragraphs give an outline of where we are going, without trying to explain any of the terms in it. In fact, explaining those terms is an important aspect of this blog.

However, some of the technical vocabulary will be borrowed from other fields and used without much explanation when doing so would take us far aware from our core subject. That is to help show the connections between what we are covering and things that people may already know. Try simply reading past the troublesome word or phrase. Hopefully, that will only render some minor details incomprehensible. If not, please point out the things which need more background material.

Attempting to explain pattern matching theory in a practical way is why this blog will talk about pattern matching rather than automata theory. By pattern matching, we mean the things we can express with regular expressions, including the extensions such as back-references which aren't traditionally addressed and with context free grammars. Far less emphasis will be placed upon Turing Machines. The practitioner rarely knowingly constructs a traditional Turing Machine to solve a problem. Instead they are more likely to augment a simpler machine with extra operations which give the machine Turing equivalence and ignore the ability of machine to express problems which do not halt.

In fact, this blog will most concentrate on solutions that are solvable in linear or polynomial time and try to find places where sub-linear solutions or approximations are possible or the polynomial can be made at worst quadratic and better yet where we can bound that quadratic performance to a subset of the problem. That is the practical space. That is the space where problems can be scaled up without exhausting resources.

It is not that other problems are unimportant, but more that we don't have practical solutions for them. Note we will allow exponential solutions when they are the only ones known, or when in actual use, one can avoid the worst case behavior (or detect it and give up).

We will also try to cover the causes of exponential behavior, so that the wise practitioner can avoid those pitfalls.

This blog will essentially build up several things in parallel.
  1. A vocabulary for describing patterns matching problems.

  2. A hierarchy of more and more complex pattern matching problems.

  3. Diagrams and notations to represent those problems.

  4. Algorithms to solve those problems.

For the theoretically inclined, although most of what will be explained is common knowledge, some of the results here have never been published although I have hinted at them in some of my comp.compilers newsgroup postings. In particular, you will find that along the way, I will discuss little variants of the common form and show how those variants simplify certain practical problems.

Hopefully, you will find the presentation rigorous enough that one will be able to see that the results are valid even if not presented in a precise mathematical formulation. It is not my goal to prove my results, but simply to establish them as plausible and understandable, with the emphasis on understandable.

A note on pacing and presentation.

The material here is presented very slowly and often in small chunks, although those chunks are often littered with many details, so that one might be tempted to skip ahead. Please do so. Many of the details may not be relevant to the understanding that you need. They are just presented in the place where I thought they belonged.

Moreover, the details are presented to give the reader the sense of how even simple things in pattern matching have many even simpler parts that need to be lined up together to make a particular solution work correctly.

prev: Introduction to a "blook" about pattern matching
next: The Simple Basics of Pattern-Matching

No comments: