Monday, July 27, 2015

Literals and Quoting

Because the pattern is the key part of a pattern matching problem, it is important to talk about the notation used to represent the pattern. Notation simply means the way we write the problem out or, more importantly, how we specify the problem as input to a tool, the pattern matching compiler. The tool can then generate a program (or configure some hardware) that will process the input text and tell us whether we have a match or not. Understanding the notation is therefore a key point of emphasis.

For the single fixed string problem, we need a way of representing the fixed string. In particular, we need a notation for literals.

As mentioned before, if we allow characters to represent themselves and simply list the characters in order to give us the sequence, then we have a very simple notation for fixed string literals. This notation is used quite commonly in tools that only allow regular expressions. Thus, if you pass a pattern to the PCRE library, write a regular expression to search with in Emacs, or write a lexer in LEX, each regular expression starts with literals where each character represents itself. Thus, we simply write the five characters h, e, l, l, and o for the literal string “hello”, as shown below:

hello
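As a minimal sketch of the same idea in a general-purpose language, the Python snippet below uses the standard re module with a pattern made only of characters that represent themselves. The helper name has_match is an assumption made for illustration. Note that the quotation marks here belong to Python, the host language, and are not part of the pattern itself.

```python
import re

# The pattern is just the five characters h, e, l, l, o; each one matches itself.
pattern = re.compile("hello")

def has_match(text):
    """Return True if the literal pattern occurs anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None

print(has_match("say hello to the world"))  # True
print(has_match("help"))                    # False
```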

For more complicated uses, one wants to separate literal strings from names of patterns. In that case, it is useful to have a way of distinguishing a literal string. Fortunately, the convention of quoting a string applies nicely. This notation is used in Yacc++ and many programming languages. When one wants to write a literal string, one encloses the string in quotes, as shown below:

"hello"

The idea of quoting is important. It allows us to separate what we want taken literally from what we mean symbolically. In most natural languages, one quotes text that reports what someone else said, to signal that we are repeating their words literally. We quote the other person's words rather than our own because we write far more text of our own than we quote from other people. Thus, it makes sense to quote their text, because it takes a lot fewer quotation marks.

In computer languages, we often have to choose which items we quote. And in many [simple] pattern languages there are a lot more literals than references.

In fact, at the level of simple strings, we have no references, only literals. Thus, it makes more sense to reverse the quoting convention. That’s why in most regular expression languages, simple literal text is not quoted. Thus, if you are writing a regular expression for LEX, or PCRE, or Emacs, you do not (generally) quote literal strings. Again, in these usages, by not quoting literal strings we are saving ourselves a lot of quotation characters.

However, in higher level languages where there are many references and literal strings are less frequent, it makes sense to quote the literal strings. Thus, in a programming language like C, or Emacs Lisp, or in a parser generator like Yacc++, one does quote literal strings.
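A small illustration of why quoting matters once references are common, using Python as a stand-in for such a higher level language. The variable name hello is chosen purely for the example: unquoted, it is a reference that gets looked up; quoted, it is literal text.

```python
# hello is a name (a reference) bound to some other string value.
hello = "goodbye"

# The unquoted word is interpreted symbolically and looked up.
print(hello)    # prints goodbye

# The quoted word is taken literally, character for character.
print("hello")  # prints hello
```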

Of course, there are a variety of middle grounds and variations on the theme. In many regular expression uses, one is allowed to enclose a string in quotation marks; LEX, for example, allows that. Moreover, in LEX, if the regular expression includes whitespace, one must quote the regular expression, because otherwise the whitespace is considered to terminate the regular expression. Thus,

hello return 1;

and

"hello" return 1;

are the same to LEX, where hello is a string literal (and the quotation characters are ignored).

However, if you wanted spaces in the literal string, as below, you need the quotes.

"h e l l o" return 1;

Note: in each of these cases the part after the whitespace (the “return 1;”) is an action, and that allows LEX to do something more than just recognize the text (in this case, solve a lookup problem, which is something we’ll define shortly).
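To make the whitespace rule concrete, here is a rough Python sketch of how a rule line could be split into its pattern and its action. This is not how LEX is implemented; split_rule is a hypothetical helper that assumes a non-empty line and handles only plain unquoted patterns and double-quoted patterns without escape sequences.

```python
def split_rule(line):
    """Split a simplified LEX-style rule into (pattern_text, action).

    Hypothetical helper: assumes a non-empty line, and supports only
    unquoted patterns and double-quoted patterns without escapes.
    """
    if line.startswith('"'):
        close = line.index('"', 1)   # position of the closing quotation mark
        pattern = line[1:close]      # the quotation marks themselves are dropped
        rest = line[close + 1:]
    else:
        parts = line.split(None, 1)  # the first run of whitespace ends the pattern
        pattern = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
    return pattern, rest.strip()

print(split_rule('hello return 1;'))        # ('hello', 'return 1;')
print(split_rule('"hello" return 1;'))      # ('hello', 'return 1;')
print(split_rule('"h e l l o" return 1;'))  # ('h e l l o', 'return 1;')
```

Without the quotes, the third rule would be cut at the first space: the pattern would be just h, and the rest of the line would be treated as the action.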

Exercises and Questions


These questions are more practical and attempt to get you to see where you have seen quoting conventions in any programming you’ve done so far and to evaluate their differences.

  1. Does your favorite programming language have a notation for string literals? If so, what is it? Does it use quotation marks? Which ones? Does it use different quotation marks to mean different things?

  2. How about your web page design software or language? Same questions.

  3. How about your favorite text editor?

  4. Where are string literals used in a text editor?

  5. Some languages use different opening and closing quotation marks. What are the advantages and disadvantages of doing so?

  6. In recent C dialects, two consecutive quoted strings are joined together as if they were one string literal. What are the advantages and disadvantages of doing that?

  7. What are the advantages and disadvantages of splitting the regular expression from the action in LEX?
