Monday, July 27, 2015

Pattern Matching is Easy

Looking at the simplest problem, one might be tempted to think pattern matching is easy. The more cynical might assume that it is just deceptively easy and there are real complexities lurking. At some level both points of view are right.

Pattern matching is intrinsically easy. Even as we build up more and more complicated notions of what kinds of patterns we wish to match. These notions will all be relatively simple steps. Moreover, we will generally be able to sole them with essentially the same simple problem solving methods with a few extensions.

Pattern matching is also deceptively easy. There are definite places where ones intuition leads one astray. Places where after making three right turns, one finds ourselves unavoidably going left. However, it isn’t that counter-intuitive aspect that is the real issue. One can learn the places where common sense fails one and 1 + 1 ≠ 2.

Pattern matching being easy is part of the problem itself.

Along with the sense that much of pattern matching is well-known and well-understood and little is left to discover is the fact that it is fairly easy to write a pattern matching solution. Many people have. Don’t be surprised when you learn that you can implement some of the solutions mentioned here in a few hours. Moreover, it is easy to implement a solution that incorporates some novel aspect. Again, many people have. In fact, there is a room full of papers showing one novel aspect or another implemented by someone and never carried forward in my basement.

The hard part of writing a pattern matching solution is the engineering part, the user-interface part, the scaling part, the combinatorial complexity part, the keeping simple things simple, but not too simple part. There are lots of little choices to make in a solution that impact its performance or the ability of some user to describe the pattern they want. Be aware that even for a one-time throw-away solution, you may have to re-write and change parts as you become aware of the real needs and the real issues. And, if you are trying to make something that you aren’t going to throw away, you need to be all the more conscious of what you choose and why. Because of that, you will find many points in this blog where the trade-offs between different solutions are mentioned.

Along those lines, you will find me particularly intolerant of people who write a hand-written parser or implement their own DFA interpreter without considering using a well-proven tool to do that job. The problem isn’t that their solution won’t work. Pattern matching is easy and it surely will work.

What a solution like that won’t do is scale. It will be hard for it to survive multiple rounds of enhancement without either being entirely re-written or becoming a nest of convoluted hacks. It also won’t easily be made parallel or otherwise optimized, as our optimization technology works well on stylized code and high-level abstractions, but not so well when given an arbitrary program that may have assumptions that are hard to prove or transform. Long ago I heard a great quote, which I paraphrase, about a compiler versus an assembly language programmer that makes this point well, “compiler generated code can generally reach about 75% the performance of a hand-optimized assembly program, but the compiler will continue to reach that rate even as the program is modified and all the careful register assignments in the hand optimized program have been rendered incorrect”. This is particularly true in pattern matching. You can quickly write a program that exploits the ideas contained in this blog. It is much harder and more tedious, thus best left to a tireless program, to line up all the details of the ideas in the right order to express an approximation to the optimal solution for a non-trivial set of patterns. This is why even if you have what seems to be an easy pattern matching problem, you want to use a well written tool to solve that problem, so that as the problem becomes more complex the solution continues to evolve with it, rather than becoming a rats nest of ad hack patches.

Just as it is easy to implement a pattern matching solution, it is usually easy to write a pattern itself. In fact, it needs to be easy to do that. Otherwise, pattern matching is useless. However, that easiness gets some people carried away.

It is far too easy to write a pattern that over-specifies the problem. As a friend of mine once said (again slightly paraphrased), “how often does a hacker gets some sophisticated attack written, but they mess up and get the quotes wrong, and does one really want to not detect those almost attacks?” To address that, you will find points where this blog tries to indicate that a simpler solution might cover the same space. It's one of those curious contradictions that a sophisticated, subtle, and pedantically correct solution may be “worse” than a simpler solution that only approximates the right answer, but does it in a way that works better for the user.

prev: The Alphabet next: More on the Simplest Pattern Matching Problem

No comments: