LANGUAGE PARSING

Given a grammar, and a string, how can you tell whether the string is part of the language defined by the grammar?

Generally speaking, if you want a "good" compiler, you can't just use any old grammar; you have to use one that satisfies certain requirements (e.g. an LALR grammar or an LR(1) grammar) that make parsing efficient, so that you can quickly decide whether a string is accepted by the grammar.

You can always build a parser (a membership test) for a CFG.  You can always build one for a CSG too.  However, not necessarily an efficient one.  For an unrestricted grammar, you can't always build one.

The two major forms of parsing are "top down" and "bottom up".

S->e
S->SS
S->(S)
S->[S]
S->{S}

Why do I not like this grammar?  It is ambiguous.

S->ST
S->T
T->e      <== left-recursive grammar: whenever the symbol on the left
T->(S)        of the arrow recurs directly on the right side (S->ST),
T->[S]        it is the leftmost symbol there.
T->{S}

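In the first grammar the culprit is S->SS: from SS you can get SS S (by expanding the first S) or S SS (by expanding the second S), which is the same string SSS generated two different ways.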

It's a lot easier to parse a string if the grammar is not ambiguous.

S->TS
S->T
T->e     <== right-recursive grammar
T->(S)
T->[S]
T->{S}

In "top down" parsing, left-recursive grammars are "almost" verboten.
In "bottom up" parsing, left-recursive grammars are the easiest to handle.
Right-recursive grammars, while not impossible, can come with hazards.
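
To see why top-down parsers get along with the right-recursive grammar above, here is a minimal recursive-descent sketch of it. The function names (parse_S, parse_T, balanced) are made up for illustration, and T->e is handled by simply consuming nothing when no opening bracket is next. Each call to parse_S consumes a T before it recurses; with the left-recursive rule S->ST, parse_S would have to call itself first, at the same position, and never return.

```python
# Hypothetical helper names; a sketch of top-down parsing, not a full parser.
PAIRS = {"(": ")", "[": "]", "{": "}"}

def parse_S(text, pos):
    """S -> T S | T. Returns the position just past the parsed S."""
    pos = parse_T(text, pos)
    # Recurse (S -> T S) only when another opening bracket follows.
    if pos < len(text) and text[pos] in PAIRS:
        return parse_S(text, pos)
    return pos

def parse_T(text, pos):
    """T -> (S) | [S] | {S} | e. Returns the position just past the parsed T."""
    if pos < len(text) and text[pos] in PAIRS:
        closer = PAIRS[text[pos]]
        pos = parse_S(text, pos + 1)
        if pos >= len(text) or text[pos] != closer:
            raise SyntaxError(f"expected {closer!r} at position {pos}")
        return pos + 1
    return pos  # T -> e: consume nothing

def balanced(text):
    """True when the whole string is derivable from S."""
    try:
        return parse_S(text, 0) == len(text)
    except SyntaxError:
        return False
```

For example, balanced("([]{})") holds, while balanced("(]") and balanced("(()") do not.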

Define a language with order of operations, such that exponentiation comes
first and binds to the right, * / % come next, binding to the left,
and + - come last, also binding to the left.

3 ^ 3 ^ 3 would then mean 3 ^ 27, rather than 27 ^ 3.

(In Pascal, following a pointer:  P is the pointer and P^ is the stuff being pointed to.  But P@ would also work because not all keyboards had the ^.  Likewise any [ ] could be replaced with (. .) and any { } could be replaced with (* *), and this was done at the preprocessor level so that X = A (.I] was perfectly legal.)

In BASIC, exponentiation was left associative; in JavaScript (where the operator is written **), it's right associative.

S -> S + A
S -> S - A
S -> A
A -> A * F
A -> A / F
A -> A % F
A -> F
F -> B ^ F
F -> B
B -> n
B -> (S)

My naked eye tells me this properly accepts arithmetic expressions
with the "normal" precedences.

This grammar is "mostly" left-recursive, but F->B^F is a right-recursive rule.  Normally in bottom up parsing, things are better if you use left recursion, but a right-recursive rule here and there probably won't cause you an issue.
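
As a sanity check on the precedences, here is a small hand-written evaluator that follows the grammar rule by rule. It is only a sketch: the names (tokenize, Parser, evaluate) are invented for illustration, n is assumed to be a non-negative integer, and / is assumed to mean integer division. The left-recursive rules become loops (so the operators group to the left), while F -> B ^ F stays a genuine recursive call (so ^ groups to the right).

```python
import re

def tokenize(text):
    # Runs of digits are numbers; any other non-space character is its own token.
    return re.findall(r"\d+|\S", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_S(self):
        # S -> S + A | S - A | A   (loop = left associative)
        value = self.parse_A()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.parse_A()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_A(self):
        # A -> A * F | A / F | A % F | F   (loop = left associative)
        value = self.parse_F()
        while self.peek() in ("*", "/", "%"):
            op = self.take()
            rhs = self.parse_F()
            if op == "*":
                value = value * rhs
            elif op == "/":
                value = value // rhs  # assumption: integer division
            else:
                value = value % rhs
        return value

    def parse_F(self):
        # F -> B ^ F | B   (recursion on the right = right associative)
        base = self.parse_B()
        if self.peek() == "^":
            self.take()
            return base ** self.parse_F()
        return base

    def parse_B(self):
        # B -> n | (S)
        tok = self.take()
        if tok == "(":
            value = self.parse_S()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.parse_S()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this sketch, evaluate("3 ^ 3 ^ 3") equals 3 ** 27, and evaluate("10 - 4 - 3") groups as (10 - 4) - 3.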

Here's another example from "real" languages:

STMT_LIST -> STMT
          -> STMT_LIST STMT

STMT -> { STMT_LIST }
     -> ASSIGNMENT_STMT
     -> if ( EXPR ) STMT
     -> if ( EXPR ) STMT else STMT
     -> while ( EXPR ) STMT

This grammar is ambiguous, and the ambiguity arises in statements of the form

if ( EXPR ) if ( EXPR ) STMT else STMT

Which if gets the else?  By convention, the most recent if gets the else.
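
The "most recent if" rule falls out naturally when a parser grabs an else as soon as it sees one. Below is a minimal sketch under heavy assumptions: the input is an already well-formed token list, each condition is a single token between ( and ), any token other than if stands for a simple statement, and the function name parse_stmt is made up for illustration.

```python
def parse_stmt(tokens, pos=0):
    """Parse one STMT from a simplified token list; return (tree, next_pos)."""
    if tokens[pos] == "if":
        # if ( EXPR ) STMT [else STMT], with EXPR assumed to be one token
        cond = tokens[pos + 2]
        then_branch, pos = parse_stmt(tokens, pos + 4)
        else_branch = None
        # Greedy choice: the innermost if that is still open claims the else.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    # Any other token stands in for a simple statement.
    return tokens[pos], pos + 1

tree, _ = parse_stmt("if ( a ) if ( b ) x else y".split())
```

Here the resulting tree has the else attached to the inner if (the one testing b), and the outer if has no else branch.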

If { } were mandatory in ifs and whiles, this ambiguity would go away.  Some languages (Swift, I think) mandate the braces.

Anyway, "bottom up" parsing, which starts with the string and endeavours to work back to the start symbol, is the most common form of parsing these days.  It won't work "easily" for all grammars, but will for (what's called) LR(1) grammars (and other kinds of grammars as well).

You move through the string from left to right, applying the grammar rules in reverse as they come up.  If more than one rule can be applied, you can look ahead to the next symbol coming up and see if that will disambiguate the choice.  If it can, great; if it can't, it's not an LR(1) grammar.
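
Here is a rough sketch of that idea for a cut-down piece of the arithmetic grammar (S -> S + A | A, A -> A * F | F, F -> n). It is not a real table-driven LR(1) parser: the decisions are hard-coded by hand, the function name is made up, and the token n stands for any number. The interesting case is when A is on top of the stack: the one symbol of lookahead decides between shifting a * and reducing to S.

```python
def shift_reduce_accepts(tokens):
    """Recognize S -> S + A | A,  A -> A * F | F,  F -> n  bottom up."""
    stack = []
    rest = list(tokens)
    while True:
        lookahead = rest[0] if rest else None  # one symbol of lookahead
        top = stack[-1] if stack else None
        if top == "n":
            stack[-1:] = ["F"]                 # reduce by F -> n
        elif top == "F":
            if stack[-3:] == ["A", "*", "F"]:
                stack[-3:] = ["A"]             # reduce by A -> A * F
            else:
                stack[-1:] = ["A"]             # reduce by A -> F
        elif top == "A" and lookahead != "*":
            # A "*" next would mean keep shifting; anything else means reduce.
            if stack[-3:] == ["S", "+", "A"]:
                stack[-3:] = ["S"]             # reduce by S -> S + A
            else:
                stack[-1:] = ["S"]             # reduce by S -> A
        elif lookahead is None:
            # Input used up: accept only if everything reduced to the start symbol.
            return stack == ["S"]
        else:
            stack.append(rest.pop(0))          # shift
```

For instance, shift_reduce_accepts("n + n * n".split()) succeeds, and the * is reduced before the + because the lookahead held the reduction to S back.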

DO 100 I = 1,10  vs. DO 100 I = 1.10
The first is a loop; the second is an assignment statement (classic Fortran ignores blanks, so it assigns 1.10 to a variable named DO100I).  Fortran can't resolve this with one symbol of lookahead.

Anyway, because different types of parsers require different types of grammars, we have these things called "normal forms."  We now all know what a context-free grammar is, but sometimes we want them in a specific format.

Chomsky Normal Form.  Every production in a CNF grammar is of the form
A -> BC   or    A -> a   .

Every production rewrites a nonterminal as either two nonterminals or a single token (a symbol in the alphabet).  Any context-free grammar that does not generate the empty string can be put into this format.  (If you want the empty string, we allow the rule S->e, provided that S does not appear on the right-hand side of any rule.)