The only requirements on the things we will use as characters is that we can uniquely and quickly recognize them and that there is a finite list of them we know in advance. Thus, bits in a computer memory will do. There are two distinct values “0” and “1” (or high and low or current flowing and grounded). We can also use the four chemical bases in DNA: adenine, cytosine, guanine, and thymine or “a”, “c”, “g”, and “t”. Again, they are distinct and the list is finite. Packets sent across network traffic are treated as being composed of bytes (or octets), and while a byte can take on any of 256 values, those values are still unique and the list finite, so that is a suitable alphabet.
The alphabet is also often called the character set as it is the set of all the characters that the text might contain. In theoretical texts, the alphabet is often represented by the uppercase Greek letter sigma (Σ), particularly in formulas. When you see a Σ, you should mentally think “any character will do here”.
If you are reading this, there is a good chance you learned the "alphabet" at some point in your life. If you know more than one language, you probably know that there isn't just one alphabet. Some languages have extra letters. However, all languages have specific letters in their alphabet.
This fixed character set is the alphabet of the language and the basis of the language. Each language has a basis, an alphabet.
This is even true of pictographic languages, where we draw pictures of words. Even pictographic languages have an alphabet. There is a limited (although in some cases large) set of symbols one has to learn how to make to be able to write. Similarly, if one speaks a language, there is a limited set of sounds one has to learn to make. The same is true of languages we want to pattern match, there is a fixed set of characters that one has to be able to represent as the basis of those languages also.
In contrast, the set of integer numbers would not make an acceptable alphabet, as there aren’t a finite number of them. On the other hand, representing the integers as decimal numbers is okay, because we have only 10 distinct digits “0” through “9” and thus a legitimate alphabet. That is something to be aware of, subtle differences in how the problem is worded can affect it in powerful ways, in this case changing a problem we couldn’t address into one we can.
Similarly, most image processing is not a pattern matching problem that we will address. Trying to identify whether a picture contains a bird does not involve a fixed alphabet of simple to recognize characters. In fact, recognizing the parts of a picture is the hard part. Once, we have a picture divided into parts and if those parts could be represented by words like "wing", "eye", "head", "body" it might be possible to turn it into a pattern matching problem we can address. However, in the process, you have skipped the hard part of the problem.
Next, in general, we also don’t care how the letters of the alphabet are written. Thus, it is perfectly valid to consider “a” and “a” and “a” all to be the same letter. It is also valid to consider them 3 distinct letters. The only requirement is that whether those letters are the same or different must be known in advance and must be treated consistently. Thus, not only can subtle differences have big effects on the problem, there are also large differences that can be swept under the rug as irrelevant.
So, again a pattern matching problem begins with some text that can be broken down into distinct letters and we can look at those letters one-by-one and determine whether they match the pattern or not. That is all there is to it. It is a very simple problem and thus it is useful in many places.
To reiterate: the basis of a pattern match problem is an alphabet, which is a set of characters. We take the characters of the alphabet and form sequences of them by concatenating them. By concatenating, we merely mean writing them in some order. These concatenated sequences we call strings.
We match strings by seeing if they have the same characters in the same order as the pattern does. The string we are trying to match is generally called the input text or subject.
To distinguish the characters in the pattern from the letters in the input text, we will call the characters in the pattern the items. The items in the pattern are the things which tell us which characters in the input text will match. We call them items, because they aren’t always the exact character to match, but sometimes, just a requirement on the character, such as the input character must be a digit.
Exercises and Questions
Here are some questions to consider in deciding if you understand what a pattern matching problem is (or can be). The yes-no questions also deserve a why or why-not answer. Moreover, for those questions where you think the answer is “no” and pattern matching doesn’t apply, consider what would be needed to be done to allow pattern matching to apply? Similarly, if the answer is yes, how could some subtlety be introduced that would make the answer no?
- Are the characters that represent Roman Numerals, e.g. I, II, III, IV, V, X, XV, XX, a suitable alphabet?
- Does the fact that DNA is often coiled in a helical spiral affect its interpretation as text?
- Are chemical formulas generally treatable as text?
- Are mathematical formulas generally treatable as text?
- Are weather patterns generally treatable as text?
- Could the sequences of left and right turns to solve a maze be a problem that could be expressed as a pattern matching problem?
- Can the squares on a Monopoly® board be treated as an alphabet? Could the sequence of play of a Monopoly game be treated as a pattern matching problem?
- Can text that has been encrypted be dealt with by pattern matching?
- Is spoken speech suitable for a pattern matching problem?
prev: The Simple Basics of Pattern Matchine
next: Pattern Matching Is Easy