In the last post in the Explanck
series, we built a tokenizer. After creating a tokenizer, the next step is to build a parser, but before we do that, we
need to understand some theory behind what a parser generates: **abstract syntax trees**.

In this post, we will equip ourselves with this theoretical foundation.

# Abstract Syntax Trees ๐

An abstract syntax tree is a tree representation of code. For example, consider an expression like `6 / 2 * 4`

. We can
represent it as the following tree:

Leetcode faithfuls will have caught on to the fact that we can
use post-order traversal to generate our value. First, we
evaluate `6 / 2`

to get `3`

, and then we evaluate `3 * 4`

to get `12`

. Smooth.

However, we can also represent the expression as this tree:

And if we traverse it in a post-order fashion, we evaluate `2 * 4`

first to get `8`

. Then we evaluate `6 / 8`

to
get `0.75`

.

One expression can generate at least two ASTs if there are no rules. And you see that the two ASTs generate two different values, which is problematic. Remember our income statement example from the previous post?

The rules we must define to prevent this from happening are called the [formal] grammar of the language. Sometimes, even when there are rules, generating more than one AST may still be possible. A grammar with rules that allow that is called an ambiguous grammar. In explanck, we will avoid them like the plague.

So, let’s talk about grammars.

# Grammars ๐

A formal grammar is a set of rules that defines what is valid within a language. In particular, we care about a type of
formal grammar called context-free grammar (CFG). Each rule in a CFG is
of the form *head := body*. *head* is the rule’s name, and *body* describes the symbol(s) the rule generates.

Symbols come in two forms: **terminals** and **non-terminals**. If a symbol within a body refers to another rule’s
head (or name), then that symbol is a **non-terminal** - this means that the symbol can be converted to other symbols
using a rule. A symbol is a **terminal** if no rule exists that can convert it to another symbol.

We all know about grammar from human languages, like English, so let’s use that to understand our context.

Let’s consider simple present tenses in English. For example,
intuitively, we may know that “I read” is a valid statement while “read I” is not. To be methodical about it, we can
define a context-free grammar, using the *head := body* format, for simple present tenses like so:

```
simplePresentTense := subject verb (noun)+
subject := "I" | "He"
verb := ("read" | "eat" | "go") (s | es)+
noun := "books" | meat
```

Based on our introduction, we can immediately tell that *simplePresentTense*, *subject*, *verb*, and *noun* are rule
heads (or rule names); this also means that **they are non-terminals because rules exist which can convert them to other
symbols**. So, for clarity, terminal symbols are represented using quotation marks and non-terminal symbols are
represented without them.

The grammar definition also tells us the following:

- We can generate a
*simplePresentTense*by combining a*subject*, a*verb*and zero or one noun - We can replace the
*subject*non-terminal with terminals*“I”*or*“He”* - We can replace the
*verb*non-terminal with terminals*“read”*,*“sleep”*or*“go”*- Also, we can combine each of these terminals with
*“s”*or*“es”*where appropriate, meaning we can also generate*“reads”*,*“sleeps”*, or*“goes”*

- Also, we can combine each of these terminals with
- We can replace the
*noun*non-terminal with terminals*“books”*or*“meat”*

We can make some valid statements such as “He reads books” and “I eat meat” with this understanding. However, statements
such as “books read he” and “I meat eat” are invalid because they don’t follow the grammar’s order. **So, apart from a
grammar telling us what terminals - English words, in this case - are valid, they also tell us how we must order them.**

In explanck, a terminal is always a scanner-generated token (i.e. a number or an operator). And a non-terminal is a rule name that describes how we can order one or more numbers and operators to form a valid expression. With this in mind, let’s say we defined a grammar for explanck like so

```
expression := literal | binary | grouping
literal := NUMBER
grouping := "(" expression ")"
binary := expression operator expression
operator := "+" | "-" | "/" | " * " | "^"
```

We’ve learned how to interpret grammars from the *simple present tense* example, so what does this grammar above tell
us?

- When we get an
**expression**in explanck, it is either a literal, binary or grouping expression - A
**literal**expression is a number (i.e. any valid number in math). E.g.*1, 30, and 56.34* - A
**grouping**expression is an expression wrapped in parentheses. E.g.*(2 + 3)*, where*2 + 3*is the expression - A
**binary**expression is of the form*left_operand operator right_operand*. E.g.*2 + 3*, where*2*is the left operand,*+*is the operator, and*3*is the right operand - The
**operator**in a binary expression can be one of*+, -, /, * and ^*

One last thing. If you look closely at explanck’s grammar, you’ll notice a recursive quality: the *grouping* and *
binary* rules reference the *expression* rule, and the *expression* rule references them. This cyclic relationship makes
it possible for this grammar to handle math expressions with an arbitrary nesting level like `(45 * 5) - 2 * 1`

. Here’s
a visual representation:

Grand! Our grammar looks sufficient to handle our needs on explanck. But that’s the thing with looks: they can deceive.

# The Threat of Ambiguous Grammars ๐

I mentioned in the [first section] that ambiguous grammars make generating more than one AST possible. With the current
grammar definition for explanck, can we generate more than one AST for a given math expression? Let’s test with our
example from the last post: *120 + 12.3 * 53 - 9*.

Without trying too hard, I notice we can generate the following as valid ASTs from our current grammar - we use *[]* to
highlight the order of executions for clarity:

[120 + 12.3] * [53 - 9] = 5821.2

120 + [12.3 * [53 - 9]] = 661.2

120 + [12.3 * 53] - 9 = 762.9

[[120 + 12.3] * 53] - 9 = 7002.9

If you’re interested, drawing the ASTs for the last two interpretations above is an excellent exercise.

Our examples show that ambiguity in explanck’s grammar makes operator precedence unclear. In its current state, we can resolve addition before multiplication, which can significantly change the result of an expression. However, to make a reliable evaluator, precedence must be clear, so let’s enforce that.

To enforce precedence, we must first define it. We do that in the table below, going from lowest to highest.

Name | Explanation | Example(s) |
---|---|---|

expression | any type of expression | 7 * 3 - (4 + 2 ^ 3) |

term | expressions with + or - | 4 + 2 or 4 - 2 |

factor | expressions with * or / | 7 * 3 or 7 / 3 |

power | expressions with ^ | 2 ^ 3 |

primary | literal or grouping expressions | 4 or (4 + 2 ^ 3) |

Given these categorizations, we can now modify our grammar to this:

```
expression -> term
term -> factor (("+" | "-") factor)*
factor -> power ((" * " | "/") power)*
power -> primary ("^" primary)*
primary -> NUMBER | "(" expression ")"
Note: * means zero or more
```

The first thing we’ve done is break down our old *binary* rule into different rules that uniquely represent the
expressions we’ve identified in the table above.

The second thing we’ve done is to specify precedence in our grammar. We ensure that each rule will only match
expressions at their precedence level or higher. If we try to generate an AST for *120 + 12.3 * 53 - 9* again, we will
find that we can only generate one of the variations - *(120 + (12.3 * 53)) - 9* - which is the correct one.

Don’t fret if this doesn’t make sense right now, take my word for it. We’ll dive deeper into how this works in the next post as we bring our parser to life.

We’re mostly done here, but before we go, let’s define some data structures to represent the different types of expressions for explanck. Then, we will use these data structures to generate an expression’s abstract syntax tree when we parse.

First, we define a base expression class:

```
abstract class Expr {}
```

Then define a `Literal`

class to store the value of a NUMBER expression, a `Grouping`

class for expressions surrounded
by parentheses, and a `Binary`

class for expressions with two operands and an operator. All these classes extend `Expr`

:

```
// For Term, Factor & Power expression
class Binary extends Expr {
final Expr left;
final Token operator;
final Expr right;
}
// To store literal value for NUMBER
class Literal extends Expr {
final Object value;
}
// To store expression within a () pair
class Grouping extends Expr {
final Expr expression;
}
```

**One last note:** Polymorphism tells us that `Binary`

, `Literal`

, `Grouping`

objects are also `Expr`

objects. We’ll
capitalize on this later.

Thanks for sticking this out; I hope you learned something. Check out the code for the expression classes on GitHub. In the next post, we will implement a parser!