Grammars for programming languages - Mikhail Barash - Medium

By Mikhail Barash

After Chomsky introduced phrase structure grammars in 1950s, this model quickly became a standard for definition of syntax of programming languages. Grammar rules of the form

Aα

convey the idea of defining a construct A as a sequence of strings represented by α. Indeed, consider a rule of the form

VariableDecl -> "var" ident ";"

This rule states that whenever there is a keyword var followed by an identifier that, in its turn, is followed by a semicolon, then this sequence of strings represents a variable declaration.

Though being surprisingly simple, this mechanism allows defining syntax of modern programming languages (in a form of BNF notation that is equivalent to context-free grammars) and is still used as a lingua franca when describing syntax.

Soon after context-free grammars had been introduced, it became apparent that their expressive power is limited. In 1962, Floyd has shown that Algol is not a context-free language and thus cannot be defined by a context-free grammar. Floyd considered the following Algol program, in which the only identifier is symbol x repeated n times.

For this program to be valid, number n should be the same in all its three occurrences. Under certain encoding, such programs can be represented as strings of the form a...a b...b c...c , where symbols a , b and c occur the same number of times (for example, aabbccis fine, but aabbbccc is not because b and c occur three times, whereas a only occurs twice). These strings form a well-known non-context-free language.

Nevertheless, context-free grammars are capable of defining the structure of such programs: indeed, as required by Algol’s syntax:

  • statements are separated by semicolons,
  • there is an identifier in the left-hand side of the assignment operator,
  • sequence of statements is frame with correctly spelled keywords beginend, and so on.

Below is an example of a program that satisfies all these conditions and will be considered correct despite containing undeclared identifiers.

begin
real x; xx := xxx;

end

Context-free grammars define structural syntax of languages, but cannot express any facts about their static semantics (what is called context conditions). It is impossible to specify in a context-free grammar that every identifier should be declared before use, or that the number of formal parameters and actual arguments in a function call should agree.

Context-sensitive grammars (initially proposed by Chomsky to describe syntax of natural languages) are powerful enough to define these conditions, but they are equivalent to a special class of Turing machines, which means that no efficient parsing algorithm exists for them.

A LEGO implementation of a Turing machine. Image credit: https://filmfreeway.com/legoturingmachine

Subsequent research was done in two main directions:

  • to embed “checking actions” into rules of a context-free grammar,
  • to come up with an entirely new grammatical model.

The first approach led to attribute grammars and syntax-directed translation. In a grammar, every rule is associated with semantic actions, which, for example, may use symbol tables to look up whether a certain variable has been previously declared.

VariableDeclaration ::=
"var" ident {symbolTable.safeAdd($1);} ";"

In this example, method safeAdd could check whether an identifier with the same name has been declared (the name of the “current” identifier is referred to by $1), and, if so, raise an exception. Otherwise, the new identifier is added to the symbol table. Such essentially ad hoc techniques are still used in industry-level compiler compilers.

Example of semantic actions (code within curly braces) in lex and yacc. Image credit: https://github.com/faustinoaq/vscode-lex-flex-yacc-bison

The other direction of research was to develop a reasonable (that is, efficiently parsable) model, which would be able to specify the desired context conditions. Of many such attempts, two-level grammars by van Wijngaarden received some attention: they were used to formally specify syntax and static semantics of Algol 68. Unfortunately, two-level grammars soon turned out to be Turing-complete: this makes impossible any practical parsing algorithms for them.

Fragment of van Wijngaarden’s grammar for loop clauses of Algol 68.

Parsing expression grammars by Ford introduce ordering of rules in a context-free grammar: the choices are tried in order and the first one to succeed is used to parsing the input string. This is useful in disambiguating constructs like if-then-else statements.

IfStatement :
"if" Expr "then" Statements "else" Statements /
"if" Expr "then" Statements

This rule matches a conditional statement in a way that the optional else clause always binds to the innermost if. Without priority of rules, the rule would lead to dangling else ambiguity.

Parsing expression grammars also provide predicates for Boolean conjunction (&) and negation (!). The following rules define nested comments in Pascal (example taken from this book).

Comment : "(*" CommentedText* "*)"
CommentedText : Comment / ! "*)" .

A comment consists of an opening token (*, some commented text, and a closing token *). This construct is non-trivial to recognize because the commented text may contain symbols * and ), but not the sequence *). Moreover, comments may again contain comments: indeed,

(* a := b; (* nested comment *) *)

is a correct statement.

This construct can be recognized by the given rules: the idea is that CommentedText is either a complete comment or any symbol (expressed as . in parsing expression grammars) provided that it does not start with the string *) (this is expressed by the negation predicate ! "*)"). Similarly to negation, conjunction in parsing expression grammars is used as a mechanism of lookahead.

BulletList : &BulletSymbol List
List : Item* !BulletSymbol

These rules define the structure of bullet lists in Markdown. Rule for BulletList first verifies whether the first element of a list starts with a bullet symbol and only then it tries to recognize the string according to the rule for List. Elements of the list cannot contain the bullet symbol (this is expressed by !BulletSymbol in the second rule).