GRAMMARS One mechanism we use to define a language. Grammars have two kinds of symbols: terminals and non-terminals. The terminals are simply the characters in the alphabet over which the language is defined. The non-terminals are characters used in the grammar but are not part of the alphabet. Traditionally, we tend to use capital letters for the non-terminals and lowercase letters (and other symbols) for the terminals. We always have at least one non-terminal: S, the start symbol. We always have at least one terminal, because the alphabet is never empty. We often use the Greek letter epsilon to refer to the empty string (lower case epsilon, and somewhat stylized). In TextPad, I usually denote the empty string eps. A grammar is a set of production rules. Each production rule contains a string on the left and a string on the right (and an arrow in between). What constitutes valid strings on the left and right depends on what sort of language we're trying to define. The arrow points to the right. At least one production rule contains S (and only S) on the left side side of the arrow. Here's an example of a "classic" context-free grammar. 1: S->(S) 2: S->SS 3: S->eps This defines balanced parentheses. If starting from the string "S" and iteratively replacing a substring matching the left side of a production with the string on the right side of the production rule, you eventually get to a string with only terminals, that string is said to be part of the language generated by the grammar. S SS (rule 2) SSS (rule 2) (S)SS (rule 1) (S)(S)S (rule 1) (S)(S)(S) (rule 1) ()(S)(S) (rule 3) ()()(S) (rule 3) ()()() (rule 3) ()()() contains no non-terminals, and so it is part of the language generated by the grammar. Starting with regular languages / regular grammars / Type 3 languages. The left side of each production rule contains exactly one nonterminal. The right side of each production contains a string terminals preceded by AT MOST ONE nonterminal. -------- prefix vs. postfix notation. Polish notation. Reverse Polish notation. 5 + 3 x 5 = ? Is it 40? Is it 20? My TI-1000 calculator says it's 40. My TI-35 calculator says it's 20. This is dumb. The notation is ambiguous. You need parentheses to clear this up. He used prefix notation. x + 5 3 5 : This is unambiguously 40. + 5 x 3 5 : This is unambiguously 20. When the Hewlett-Packard calculator, they adopted what became known as Reverse Polish notation: the operator follows the numbers. 5 3 + 5 x : 40 5 3 5 x + : 20 These two notations, prefix and postfix appear to be equivalent. They take the same of inputs, the same number of operators; it's just that in postfix the operator comes last and in prefix the operator comes first. However, postfix is significantly easier to implement: An entered number goes on the stack. An operator pops two numbers computes the result, pushes the result onto the stack. The operator is then forgotten. But in prefix, you have to store the operator because the operator's arguments may come a fair bit later in the process. + + + + + 1 2 3 4 5 6 ******** Along the same lines, putting the nonterminals in front allows for easier algorithms than putting the nonterminals at the end. S->Sb S->Sab S->eps This is a regular grammar and it generates a regular language. What are examples of strings generated by this grammar? b , bbbbb , abababababab , ababab, abbbbbbb , eps , S Sab Sbab bab What's the SHORTEST string NOT in the language? a . How would we, in plain English, describe the language? All strings of a's and b's such that every a is immediately followed by a b. ***** S -> R S -> S0 R -> T1 T -> T0 T -> eps All strings of 0's and 1's containing EXACTLY one 1. Most people don't use regular grammars to express Type 3 languages. The most common way to do this is with "regular expressions". Regular expressions, at their most basic, employ three operations: concatenation, union, and Kleene closure. 0*10* The star represents Kleene closure. This means 0 or more copies of what the star references are concatenated together. Zero or more 0's followed by 1 followed by zero or more 0's. (b | ab)* Zero or more times I grab EITHER b or ab and paste them together. ( (a | eps) b ) * Concatenation: two strings (or sets of strings) are pasted together simply writing them one after the other adjacently. AB is set of strings formed by grabbing one string from A and appending to it one string from B. Union: A | B is the set of strings contained in either A or B (or both). Kleene Closure: A* is the set of strings by taking zero or more strings from A and appending them together. abc: As a regular expression, it's really not the string abc; it's actually a set containing exactly one string: abc. We also use for clarity eps to refer to the set containing only the empty string, and phi to represent the empty set. (a | b)* = (a*b*)* ==> Both of these are the set of all strings over a and b. Now, in some languages with regexp capability, of course, you have other operations. A+ = AA* (one or more instances of A). Not all regular expressions are obvious: ((0|1)*0) | eps : Binary numbers divisible by 2 ((0|1)*00) | 0 | eps : Binary numbers divisible by 4 (0|1)*10 : Binary numbers whose remainder is 2 when divided by 4 : Binary numbers divisible by 3