Evaluating expressions programmatically is unassumingly tricky and in this series we will learn what it takes to do it
well by building a program of our own called Explanck ^{1}^{2}.

I hope you enjoy reading this and find the journey as rewarding as I did!

# Prologue ๐

Let’s consider the gross profit margin formula: `((Revenue - COGS) / Revenue) * 100`

^{3}. Imagine this formula without
the parentheses. We’d have `Revenue - COGS / Revenue * 100`

. These two formulas give wildly different values.

In `((Revenue - COGS) / Revenue) * 100`

, we intuitively^{4} know to calculate the expressions within the parentheses
first. So `Revenue - COGS`

(a.k.a gross profit) is evaluated first. Then `Gross Profit / Revenue`

is evaluated next, and
that result is multiplied by `100`

to give the margin.

Conversely, in `Revenue - COGS / Revenue * 100`

, `COGS / Revenue`

is evaluated first.Then we multiply the result of
that division by `100`

. Finally, the result from the last step is subtracted from `Revenue`

.

These two supposed variations of the same formula highlight the importance of operator precedence. In the first
formula,`()`

takes the highest precedence. The operators that take the highest precedence in the second formula are `*`

and `/`

. But the second formula tells us something else too: if we have operators of the same precedence, evaluate them
from left to right. So, `/`

is evaluated first. Can you imagine the disaster this formula could unleash in companies'
income statements if used without the correct precedence?

The rules surrounding expressions and precedence have intrigued me for a while, and I’ve often wondered how one might build a program that reliably evaluated expressions. Last year, I read an article that described how we might do this by converting an Infix Notation to a Reverse Polish Notation (RPN) and then evaluating the RPN using a stack - that was a fun learning experience. But, I left wondering if there was another approach we could take to do this. Can we solve this without converting to RPN?

This year, I found an answer in a book titled *Crafting Interpreters by Robert Nystrom*^{5}: “Yes, we can, but we’ll
have to convert to something else”. And this post (and next few posts) will expand this answer into a program.

# Problem Statement ๐

We start with a simple problem. Evaluate this expression: `120 + 12.3 * 53 - 9`

. As good mathematicians, we know that we
need to multiply `12.3`

by `53`

first to give `651.9`

. Then add `651.9`

to `120`

to give `771.9`

. Then subtract `771.9`

from `9`

to give `762.9`

. We’ll stick with this expression throughout this series and our goal is to build a program
that can evaluate it for us.

For our context, an expression is a valid and finite combination of operands and operators e.g. 1 + 2. And evaluating an
expression means applying the operators to the operands according to some rules to produce a **single** value e.g. 1 + 2
= 3.

Building Explanck is similar to building a language like JavaScript because, as you’ll see, we implement all the phases
of building an interpreted language: `Scan (or Tokenize) -> Parse -> Intrepet`

. But while JavaScript can be used to
build programs, Explanck only understands mathematical expressions^{2}.

In the scanning (or tokenization) phase, we’ll convert our expression into a series of tokens. Then, in the parsing phase, we’ll build something called an abstract syntax tree from our tokens. Finally, we will traverse (and evaluate) this tree to get our final value in the interpret phase.

In this post, we’ll focus on the first phase: **Tokenization**.

# Tokenization ๐

When our program receives an expression as input, it receives a string - a stream of characters. Strings don’t mean
anything to us - they cannot be evaluated as they are - so in this phase we will convert input strings into meaningful
units called **tokens** that represent operands or operators. Here are the token types we care about for Explanck:

```
enum TokenType {
// Operators
LEFT_PAREN, RIGHT_PAREN,
PLUS, MINUS,
STAR, SLASH,
POWER,
// Operand
NUMBER,
}
```

And we will store tokens in a Token class:

```
class Token {
final TokenType type;
// The value of the token
final Object literal;
// Constructor
Token(TokenType type, Object literal){}
}
```

So, given an expression like `120 + 12.3 * 53 - 9`

, the tokens are:

```
[NUMBER 120, PLUS, NUMBER 12.3, STAR, NUMBER 53, MINUS, NUMBER 9]
```

Notice that the expression contains whitespaces, but we don’t care about it, so we ignore it. We do the same thing for every other invalid character (we might decide to throw errors later). The number of invalid characters we can get is infinite, but we know which characters are valid, so we define rules for them. Our pseudocode is

- Traverse each character in the expression string
- If a valid character sequence is found (that is, a sequence that matches one of our defined token types), create a token from it and add it to the token list

For operator token types, the characters we care about include `(, ), +, -, *, /, ^`

. These are all one character tokens
that we can match easily, and when we find them, we add them to a list of tokens.

```
class Scanner {
private final String source;
private final List<Token> tokens = new ArrayList<>();
private int start = 0;
private int current = 0;
List<Token> scanTokens(){
// continue loop until we reach the end of the source string.
while (!isAtEnd()){
char c = source.getCharAt(current);
current++;
switch(c) {
case '(': add(TokenType.LEFT_PAREN); break;
case ')': add(TokenType.RIGHT_PAREN); break;
case '*': add(TokenType.STAR); break;
case '/': add(TokenType.SLASH); break;
case '+': add(TokenType.PLUS); break;
case '-': add(TokenType.MINUS); break;
case '^': add(TokenType.POWER); break;
}
}
return tokens;
}
// adds a token to tokens
private void addToken(TokenType type){}
}
```

The logic to create a number token is a little more interesting, so let’s expound on that. Numbers aren’t always single
characters. From our previous example, `120`

is our first number token. How do we extract it? Here’s our pseudocode:

- Traverse each character in the expression string
- If the current character is a digit, keep looking ahead until we get to a character that is not a digit.
- When we find a number that isn’t a digit (or reach the end of the string), we extract all the digit characters as one whole number.
- We create a token from that number and add it to the token list

So we get something like

```
// in Scanner.java
List<Token> scanTokens(){
while(!isAtEnd()){
//> omitting previously written code <
// We need this to extract substrings of 1+ digits
start = current;
switch (c) {
default:
// If the current char is a digit, we call addNumberToken to keep looking ahead for more digits.
if (isDigit(c)) addNumberToken();
}
}
}
// keeps looking ahead until we don't see a digit again
private void addNumberToken(){
while (isDigit(source.charAt(current)) && !isAtEnd()) {
current++;
}
// after all digits have been found, extract the number
// Remember we do `start = current` at the top. This substring extraction is why it is important
Double number = Double.parseDouble(source.substring(start, current));
addToken(TokenType.NUMBER, number);
}
// an overloaded method to add tokens with a given literal value
private void addToken(TokenType type, Object literal){}
```

This logic works well for `120`

, but consider the following number from our expression: `12.3`

. Based on the current
code, we will extract `12`

and `3`

as two separate numbers. We don’t want that, so how do we extend our logic?

- Instead of arbitrarily stopping when we stop seeing digits, we check if the next character is a
`.`

, - If we find
`.`

, it is potentially a decimal point. And we confirm this by checking if there’s a number after the`.`

. - As we implement this logic, there are a few edge-cases to cater for
`.123`

is invalid because there are no digits before`.`

`123.12.126`

is not a valid NUMBER token because there are two decimal points. The code should extract`123.12`

and`126`

separately, completely ignoring the second`.`

`123.`

will not work because there’s no number after the decimal. What gets extracted is`123`

. The decimal gets ignored.

- Then, we extract all the characters that make up our number, create a token and add it to the token list

```
private void addNumberToken(){
//...after first while
if (peek(0) == '.' && source.isDigit(peek(1))) {
current++;
while (isDigit(source.charAt(current)) && !isAtEnd())
current++;
}
// extract number, create a token and add it to the list.
}
// a helper method to lookahead by an arbitrary number of steps
private char peek(int stepsForward) {
int pos = current + stepsForward;
if (pos >= source.length()) return `\0`;
return source.charAt(pos);
}
```

That’s it. That is all it takes to extract tokens for our needs (minus error handling). So, now we can do:

```
String exp = "120 + 12.3 * 53 - 9";
List<Token> tokens = new Scanner(exp).scanTokens();
sout(tokens) // [NUMBER 120, PLUS, NUMBER 12.3, STAR, NUMBER 53, MINUS, NUMBER 9]
```

You can find the code for the scanner on GitHub. In the next post, we tackle some preliminary theory to enable us to implement our next phase - parsing!

I initially called explanck a programming language because we build it like we would build an interpreted language. This didn’t seem “right” though since the only valid syntax are math expression - where are variables, loops, classes etc? On that note, we’ll just refer to explanck as a

**program**that evaluates expressions. ↩︎When I still played around with calling explanck a language, I initially wanted to call it

**explang**, as in expression language, but I googled the name and saw that many people had used it for different projects. So, while I pondered what to name it, a friend who was with me suggested**explanck**because**plang**reminded him of**planck**, as in**Planck’s constant**. The Planck’s constant is one of the smallest constants used in Physics, and I reckoned that**explanck**will be one of the smallest expressions of a programming language you will ever find. It seemed to fit! The moral of this story is that naming is hard; if you have a cool name for anything, hold it tight! ↩︎COGS is Cost of Goods Sold. ↩︎

Our “intuition” was formed through those math classes where we learned about operator precedence ;) ↩︎

Many ideas I use in this series are based on things I learned while reading Crafting Interpreters. It’s a good read if you’re interested in learning about languages. ↩︎