Grammars and Languages:

An alphabet is a nonempty set of symbols called characters. A word is a finite string (or sequence) of characters from that alphabet. A language over an alphabet is a set of words over that alphabet.

If my alphabet is {0,1}, then a language might be {01, 10, 100, 0}. A language might also be {x | x, read as a binary number, is divisible by 3} = {0, 00, 000, ..., 11, 011, 0011, ..., 110, 0110, 00110, ...}.

One of the most basic questions of syntax is this: is your language such that a computer can unequivocally say yes or no as to whether a certain specific string is part of the language?

Type 3: Regular languages
Type 2: Context-Free
Type 1: Context-Sensitive
Type 0: R.E. (Unrestricted)

We like to pretend that most programming languages are Context-Free, but this is not true. Most programming languages are Context-Sensitive. These descriptions (context-free, context-sensitive) derive from the grammars that generate them.

A grammar is a set of production rules. In addition to the alphabet (sometimes called terminals), you also have a nonempty set of symbols not in the alphabet (sometimes called nonterminals). One of these nonterminals is called S (or start). A production rule has two strings, one on the left side of an arrow (or some similar separator) and one on the right side. These strings can be any strings over the union of terminals and nonterminals.

Alphabet: {a, b}
Nonterminals: {S, T}
S -> aS
S -> T
T -> bT
T -> b

A set of production rules is called a grammar, and a grammar defines a language as follows: starting with the string S, you make a finite series of substitutions, replacing a substring that matches the left side of a production rule with the corresponding right side. If, after a finite number of such substitutions, you have a string with no nonterminals, that string is said to be in the language.

S => aS => aaS => aaaS => aaaT => aaabT => aaabb

Ergo, aaabb is in the language given by the grammar.

Regular grammars: The left side consists of exactly one nonterminal. The right side has either zero or one nonterminals; if there is a nonterminal, it is the last character of the string. Regular grammars result in languages decidable by a computer with finite memory. The grammar above generates strings consisting of zero or more a's followed by one or more b's. As a regular expression: a*bb*, or equivalently a*b+.

Context-Free grammars: The left side consists of exactly one nonterminal. The right side can be any string of terminals and nonterminals.

Alphabet: {(, )}
Nonterminals: {S}
S -> (S)
S -> SS
S -> ()

S => (S) => (SS) => ((S)S) => (((S))S) => (((()))S) => (((()))())

In English, what does the above grammar generate? Nonempty strings of balanced parentheses. If I added a production rule such as S -> epsilon, where epsilon represents the empty string (some books use lambda for that), then my language would also include the empty string.

Languages generated by context-free grammars are decidable by a computer, provided the computer has access to an infinite stack or something analogous. For balanced parentheses, a counter is enough: go through the string left to right, add 1 to a counter for each (, and subtract 1 for each ). If the counter is zero at the end and never went negative along the way, the parentheses are balanced. Balanced parentheses are a classic example of a language that is NOT regular but IS context-free.
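Here is a minimal sketch in C of the counter check just described. The balanced() function name and the assumption that the input contains only parentheses are mine, not part of the notes.

#include <stdio.h>

/* Counter check for balanced parentheses, as described above:
 * +1 for each '(', -1 for each ')'; the string is balanced iff the
 * counter never goes negative and ends at zero. */
static int balanced(const char *s) {
    int depth = 0;
    for (; *s; s++) {
        if (*s == '(')      depth++;
        else if (*s == ')') depth--;
        else                return 0;   /* assumption: only parens in the input */
        if (depth < 0)      return 0;   /* a ')' with no matching '(' */
    }
    return depth == 0;
}

int main(void) {
    printf("%d\n", balanced("(((()))())"));  /* 1: the string derived above */
    printf("%d\n", balanced("())("));        /* 0: counter goes negative */
    return 0;
}

Note that a single unbounded counter is exactly the "something analogous" to a stack that a finite-memory machine does not have.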
Context-Sensitive grammars: The left side of a production can be any string containing at least one nonterminal. The right side can be anything, as long as the right side is no shorter than the left side. (With one caveat: the production rule S -> epsilon is allowed, since that would be the only way to let the empty string be generated by the grammar.) For example, these are legal context-sensitive rules:

S -> abc
aSb -> aabbc

Context-sensitive languages are decidable by a Turing Machine. However, the grammars are messy and difficult to understand, so nobody likes them.

The classic example of a language that is context-sensitive but not context-free is {xx | x is any string}, i.e., the language containing ABAB, CATCAT, FF, and so on. Another classic example is a^n b^n c^n, a.k.a. a bunch of a's followed by a bunch of b's followed by a bunch of c's, with the same number of a's, b's, and c's.

Most programming languages are not context-free because of the xx problem.

int x = 5;
printf(y);

I know this does not compile because y was never declared. But checking whether the use of a variable matches the declaration of a variable is essentially trying to solve the xx problem, which cannot be done with a context-free grammar. When we write a compiler for this, we write code that manually checks whether a used variable was actually ever declared.

Unrestricted grammars (or recursively enumerable): The left side of a production must contain at least one nonterminal. The right side can contain absolutely anything. The idea behind r.e. grammars is that a YES answer can always be found, but a NO answer might not be. In other words, if you were to write a program that tried to decide whether a certain string was part of the language, your program could be written to guarantee a YES if the string was part of the language, but it might infinite loop when a string is not part of the language.
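As an illustrative sketch (not from the notes), here is the standard way such a "YES is guaranteed, NO might loop" recognizer can be built: breadth-first enumeration of everything derivable from S, answering YES when the target string shows up. The rule tables, the derives() function, and the choice of the a*b+ grammar from earlier are all assumptions made just to have something concrete to run; the fixed MAXLEN/MAXQ capacities are artificial so the demo terminates, whereas a true semidecision procedure would have no such caps and could run forever on a string not in the language.

#include <stdio.h>
#include <string.h>

#define MAXLEN 64       /* artificial cap on sentential-form length */
#define MAXQ   100000   /* artificial cap on the work queue */

/* Example grammar (assumption for illustration): S -> aS | T, T -> bT | b.
 * The same search works for any grammar, including unrestricted ones. */
static const char *lhs[] = { "S",  "S", "T",  "T" };
static const char *rhs[] = { "aS", "T", "bT", "b" };
static const int NRULES = 4;

static char queue[MAXQ][MAXLEN];
static int head = 0, tail = 0;

static void enqueue(const char *s) {
    if (tail < MAXQ && strlen(s) < MAXLEN)
        strcpy(queue[tail++], s);
}

/* Return 1 if `target` is derivable from S (YES is always reachable in
 * principle; returning 0 is only possible because of the caps above). */
static int derives(const char *target) {
    enqueue("S");
    while (head < tail) {
        const char *cur = queue[head++];
        if (strcmp(cur, target) == 0)
            return 1;
        /* Apply every rule at every position where its left side occurs. */
        for (int r = 0; r < NRULES; r++) {
            size_t ll = strlen(lhs[r]);
            for (const char *p = strstr(cur, lhs[r]); p; p = strstr(p + 1, lhs[r])) {
                char next[MAXLEN];
                size_t pre = (size_t)(p - cur);
                if (pre + strlen(rhs[r]) + strlen(p + ll) >= MAXLEN)
                    continue;                /* would exceed the demo's cap */
                memcpy(next, cur, pre);
                next[pre] = '\0';
                strcat(next, rhs[r]);        /* prefix + right side + suffix */
                strcat(next, p + ll);
                enqueue(next);
            }
        }
    }
    return 0;
}

int main(void) {
    printf("aaabb: %s\n", derives("aaabb") ? "YES" : "no answer");
    return 0;
}

The design point is the asymmetry: every string that is in the language is reached after finitely many derivation steps, so the search eventually says YES, but without the artificial caps there is no general way to know when to give up and say NO.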