Monday, July 27, 2015

The Simple Basics of Pattern Matching

To understand pattern matching, we need to understand what kind of problems it is designed to address. We will start with the very simplest problem. If someone sends us a text message with the letters “h” “e” “l” “l” “o”, can we recognize that as the word “hello”?

Now there are certainly many ways to envision that as a complicated problem, but we are trying to keep it simple. Did we see the letters “h” “e” “l” “l” “o” together and in the right order? To put it more bluntly, was the first letter an “h” and the second letter an “e” and the third and fourth letters both “l”s and the fifth and final letter an “o”? That is all we are trying to ask.

Notice how the question was asked by specifying an exact step-by-step set of sub-questions, one for each letter of the desired answer. That explains what pattern matching is all about. We have some text and we look at it letter-by-letter and if all the letters are correct we have found a match.

Every pattern matching problem is a variation on that. We have some text (or something we treat like text). The text is composed of letters (or again something we will treat like letters). We look at each letter and see if it matches the pattern. And, if the letters match, we have a match for the pattern.

This also gives us a first bound on the complexity of the problem. In the case of finding a match, we are limited by the length of the string that matches. It must take at least one comparison for each letter. Thus for the matching case, we have a lower case linear bound on the problem. That is as the text being matches gets longer, the time to solve problem gets longer at a linear rate. One more letter, one more comparison. As long as we can perform each comparison in a constant amount of time, the total time to perform the match with this method is a linear multiple of that constant, the multiplier being the number of letters we need to match.

Exercises and Questions


Here are some questions to consider in deciding if you understand the time bounds on simple pattern matching.
  1. Why is the lower bound for a pattern matching problem at least linear in the number of characters in the text that constitutes the match? What can happen if one of those comparisons isn’t performed?

  2. How about the mis-match case? Is the lower bound for a pattern matching problem for finding out if the text doesn't match the patterns different? Can we find ways to detect a mis-match more quickly?

  3. What about the worst case bounds for the mis-match problem? How much text must we look at to determine that the pattern doesn't match if we get unlucky?

prev: Turn of the [21st] Century Pattern Matching (Introduction) next: The Alphabet

No comments: