Context Sensitive Languages: Different definitions that boil down to the same thing.

The definition I learned: valid production rules are those with at least one nonterminal on the left side, and the right side is no shorter than the left side. (There is a problem with this; see below.)

In Chomsky's definition, all production rules are of the form

X T Y ==> X Z Y

where T is a nonterminal, X, Y, and Z are strings of terminals and nonterminals, X and Y may be empty, and Z specifically is NOT the empty string.

This is why we have the terms "context-free" and "context-sensitive." In a context-free grammar, the left side is always just one nonterminal. So if you have T ==> X, you can always replace T with X while you are expanding the start symbol S into a valid string of the language. With

S ==> (S) | SS | empty

you can ALWAYS replace S with one of the productions on the right. But in a context-sensitive grammar, with X T Y ==> X Z Y, you can't ALWAYS replace T with Z, because the rule is only valid when T sits between X and Y. So T only expands to Z in the context of X and Y. Hence it is "context-sensitive." With a rule T ==> Z, T can ALWAYS be replaced with Z, irrespective of context, which is why such grammars are called "context-free."

Under NEITHER of these definitions can you generate the empty string. So, strictly speaking, the empty string is not part of any context-sensitive language.

Is it always true that every context-free language is also context-sensitive? Yes, provided that the empty string is not part of that context-free language. Here is a way around this: in either definition of context-sensitive languages, add two provisions: the start symbol is not allowed to appear on the right side of any production, and S ==> empty is explicitly allowed as a production rule, even though it violates either strict definition.

Here's an example: {a^n b^n c^n}, the language of strings with an equal number of a's, b's, and c's, with the a's coming before the b's and the b's coming before the c's.

S ==> empty
S ==> T          //<-- T is the "real" start symbol, after we allow the empty string in the language
T ==> a T B C    //<-- keep applying this rule until you get as many a's as you want
T ==> a B C      //<-- then this one finishes off the T
C B ==> B C      //<-- moves the B's and C's around
a B ==> a b
b B ==> b b
b C ==> b c
c C ==> c c

Here is a derivation of aaaaabbbbbccccc:

S
T
aTBC
aaTBCBC
aaaTBCBCBC
aaaaTBCBCBCBC
aaaaaBCBCBCBCBC
aaaaaBBCCBCBCBC
aaaaaBBCBCCBCBC
aaaaaBBBCCCBCBC
aaaaaBBBCCCBBCC
aaaaaBBBCCBCBCC
aaaaaBBBCBCCBCC
aaaaaBBBBCCCBCC
aaaaaBBBBCCBCCC
aaaaaBBBBCBCCCC
aaaaaBBBBBCCCCC
aaaaabBBBBCCCCC
aaaaabbBBBCCCCC
aaaaabbbBBCCCCC
aaaaabbbbBCCCCC
aaaaabbbbbCCCCC
aaaaabbbbbcCCCC
aaaaabbbbbccCCC
aaaaabbbbbcccCC
aaaaabbbbbccccC
aaaaabbbbbccccc
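There is nothing magic about that derivation; you can grind it out mechanically. Here is a minimal sketch in Python (the table RULES, the function name derive, and the rule ordering are just choices of mine that happen to work, not anything official): the two T rules build a^n followed by n copies of BC, then repeatedly applying the first rule that matches sorts the B's in front of the C's and lowercases everything.

# Productions of the grammar above, minus the S and T rules, as find-and-replace pairs.
RULES = [
    ("CB", "BC"),   # C B ==> B C
    ("aB", "ab"),   # a B ==> a b
    ("bB", "bb"),   # b B ==> b b
    ("bC", "bc"),   # b C ==> b c
    ("cC", "cc"),   # c C ==> c c
]

def derive(n):
    """Print a derivation of a^n b^n c^n (n >= 1) and return the final string."""
    s = "S"
    print(s)
    s = "T"                            # S ==> T
    print(s)
    for _ in range(n - 1):             # T ==> a T B C, applied n-1 times
        s = s.replace("T", "aTBC", 1)
        print(s)
    s = s.replace("T", "aBC", 1)       # T ==> a B C finishes off the T
    print(s)
    while any(lhs in s for lhs, _ in RULES):
        for lhs, rhs in RULES:         # apply the first rule that matches, leftmost occurrence
            if lhs in s:
                s = s.replace(lhs, rhs, 1)
                print(s)
                break
    return s

assert derive(5) == "aaaaabbbbbccccc"

Because C B ==> B C is tried first, all the B's get sorted in front of the C's before any lowercasing starts, so the loop always finishes with nothing but terminals.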
Now, parsing: given a context-free grammar or a context-sensitive grammar (or, for that matter, a regular grammar), how can you tell whether a certain string is generated by the grammar?

aaaaaabbbbbbcccccc: is this generated by the above grammar? (Answer: yes.) But how would you decide this? Suppose you are writing a parser and need to be able to tell.

BASIC:
10 PRINT "Please enter your name: ";
20 INPUT N$
30 IF N$="Donald Trump" GOTO 60
40 PRINT "That's nice, ";N$
50 GOTO 70
60 PRINT "You go to hell. You go to hell and you die."
70 END

In particular, how could you program a computer to give you a firm yes or no: is this string generated by the grammar? And the answer is: work backwards. Take aabbcc. Here are the rules again, numbered:

1. S ==> empty
2. S ==> T
3. T ==> a T B C
4. T ==> a B C
5. C B ==> B C
6. a B ==> a b
7. b B ==> b b
8. b C ==> b c
9. c C ==> c c

Find every place in the string where the right side of some rule appears, replace it with that rule's left side, and put each new string on a queue, skipping duplicates. Then pull strings off the queue and do the same thing to them. Here is the scratch work for aabbcc (an X marks a string once it has been checked off the queue):

aaBbcc X   aabBcc X   aabbCc X   aabbcC X
aaBbCc X   aaBbcC X   aaBBcc X   aabBcC X
aaBbCc X   aabBCc X   aabbCC X   aaBbCC X
aaBBcC X   aaBBCc X   aabCBc X   aabBCC X
aaBCBc X   aabCBC X   aaCBBc X   aaBCBC X
aabCCB X   aTBc X     aTBC X     aaCBBC X
aaBCCB     aSBc X     T X        aTCB
aaCBCB     S   <-- DING DING DING

If we end up with just S, the string is in the language. If our queue eventually runs dry without ever getting to S, then our string is not in the language. For example, ab: ab X, aB X, the queue is empty, so ab is not in the language.

Well, what if the program runs forever: we never get to S, but the queue never runs dry. What then? Answer: THIS CANNOT HAPPEN. Context-sensitive production rules are no shorter on the right than on the left, so every time we pull a string off the queue and replace an occurrence of some rule's right side with its left side, the string gets shorter or stays the same length. Our strings never grow. Now, if we keep replacing strings with strings of the same length, why can't we do that forever, never emptying the queue and never getting a shorter string? Because there aren't infinitely many strings of length n over a finite alphabet. Since we are not adding duplicates to the queue, we will eventually exhaust all the strings of a given length (or at least all of them we're ever going to hit) and either empty the queue or get down to shorter strings. This is horribly inefficient, but it will determine whether a certain string is accepted by a CSG.
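Here is the same backwards procedure as a sketch in Python (the name csg_accepts and the rule table are just my choices): keep a queue of candidate strings, pull one off, replace every occurrence of a rule's right side with the matching left side, and enqueue anything we haven't seen before.

from collections import deque

RULES = [
    ("S", "T"),      # 2. S ==> T
    ("T", "aTBC"),   # 3. T ==> a T B C
    ("T", "aBC"),    # 4. T ==> a B C
    ("CB", "BC"),    # 5. C B ==> B C
    ("aB", "ab"),    # 6. a B ==> a b
    ("bB", "bb"),    # 7. b B ==> b b
    ("bC", "bc"),    # 8. b C ==> b c
    ("cC", "cc"),    # 9. c C ==> c c
]

def csg_accepts(s):
    """Work backwards from s toward S, breadth-first, never enqueueing a duplicate."""
    if s == "":
        return True                      # 1. S ==> empty is the one special case
    queue = deque([s])
    seen = {s}
    while queue:                         # guaranteed to terminate; see the argument above
        form = queue.popleft()
        if form == "S":
            return True                  # DING DING DING
        for lhs, rhs in RULES:
            i = form.find(rhs)
            while i != -1:               # reduce every occurrence of this right side
                reduced = form[:i] + lhs + form[i + len(rhs):]
                if reduced not in seen:
                    seen.add(reduced)
                    queue.append(reduced)
                i = form.find(rhs, i + 1)
    return False                         # the queue ran dry without ever reaching S

print(csg_accepts("aabbcc"))               # True
print(csg_accepts("aaaaaabbbbbbcccccc"))   # True: answers the question asked above
print(csg_accepts("ab"))                   # False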
REAL compilers can't be so trial-and-error. They have to parse quickly, meaningfully, and unambiguously.

if (test) if (othertest) stuff else otherstuff

This is called the dangling else problem. Which if does that else go with? In C-like languages, the else goes with the most recent if that it could go with. A CFG might look like:

ifstmt: if ( expr ) stmt
ifstmt: if ( expr ) stmt else stmt

but that grammar is ambiguous. Two easy ways to disambiguate the language:
a) require an else for every if
b) require braces around every stmt

Or split statements into matching and unmatched ifs:

matchingif: if ( expr ) matchingif else stmt
matchingif: stmt
unmatchedif: if ( expr ) matchingif

This isn't quite right. I'll have to think about this. The standard textbook version splits every statement into matched and unmatched (otherstmt here stands for any statement that is not an if):

stmt: matchingif
stmt: unmatchedif
matchingif: if ( expr ) matchingif else matchingif
matchingif: otherstmt
unmatchedif: if ( expr ) stmt
unmatchedif: if ( expr ) matchingif else unmatchedif

Anyway, this notion of CSG parsing, inefficient but guaranteed to give us an answer, leads us to understand why CSGs were given that first rule in the first place. The rule that says the right side is no shorter than the left is there SPECIFICALLY to ensure that we can ALWAYS give a definite YES or NO to the question of whether a specific string is generated by a specific grammar.

Now, let's say we remove the rule that says the right side cannot be shorter than the left. So now we just have a bunch of rules: the left side must contain a nonterminal in there somewhere, and the right side can be anything. This is called an unrestricted grammar, and it's the most powerful grammar there is. (But with great power, there must also come great responsibility.)

Unrestricted grammars lead us to Turing Machines, and to two classifications of languages: recursive and recursively enumerable (r.e.).

Recursive: a Turing Machine can be built that will give a definitive YES or NO to the question of whether a specific string is part of the language for which the Turing Machine was built.

R.E.: a TM can be built that will give a definitive YES if a specific string is in the language, and will infinite-loop if the string is not.

The Turing Machine: designed by Alan Turing as a mathematical model of what the "ideal computer" could compute. The TM is actually more powerful than any real-life computer, but it's still the theoretical model. (TMs have infinite memory. Real computers don't.)

Church's thesis: no real-life computer will ever solve a problem beyond the scope of the TM.
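Going back to the recursive vs. r.e. distinction: here is a sketch in Python (the name semi_decides and the little demo are just illustrations of mine) of the naive thing you can do with an unrestricted grammar: search forward from the start symbol, breadth-first, for the target string. If the string is in the language, this eventually answers YES; if it is not, nothing ever tells the search to give up, and it can run forever. That is r.e. behavior.

from collections import deque

def semi_decides(rules, start, target):
    """Forward breadth-first search over the sentential forms of a grammar.
    Returns True if target is ever derived; may run forever if it never is,
    because with unrestricted rules there is no bound on how long a form can
    get before it shrinks back down."""
    queue = deque([start])
    seen = {start}
    while queue:
        form = queue.popleft()
        if form == target:
            return True                  # definitive YES
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i != -1:               # rewrite every occurrence of this left side
                new = form[:i] + rhs + form[i + len(lhs):]
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False                         # only happens if the reachable set of forms is finite

# Demo with the a^n b^n c^n grammar from earlier, run forward:
FORWARD = [("S", ""), ("S", "T"), ("T", "aTBC"), ("T", "aBC"), ("CB", "BC"),
           ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]
print(semi_decides(FORWARD, "S", "aabbcc"))   # True
# semi_decides(FORWARD, "S", "ab") would never return: the search just keeps growing.

For a context-sensitive grammar you could safely throw away any form longer than the target (forms never shrink under CSG rules), which makes the search finite and turns this into a decider, just like the backwards version earlier. With unrestricted rules there is no such cutoff, and that is exactly the gap between recursive and r.e.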