Explanck: An expression evaluator (Part 1)

Mar 29, 2022 · 1739 words · 9 minute read

Evaluating expressions programmatically is unassumingly tricky and in this series we will learn what it takes to do it well by building a program of our own called Explanck ¹².

I hope you enjoy reading this and find the journey as rewarding as I did!

Prologue 🔗

Let’s consider the gross profit margin formula: ((Revenue - COGS) / Revenue) * 100³. Imagine this formula without the parentheses. We’d have Revenue - COGS / Revenue * 100. These two formulas give wildly different values.

In ((Revenue - COGS) / Revenue) * 100, we intuitively⁴ know to calculate the expressions within the parentheses first. So Revenue - COGS (a.k.a gross profit) is evaluated first. Then Gross Profit / Revenue is evaluated next, and that result is multiplied by 100 to give the margin.

Conversely, in Revenue - COGS / Revenue * 100, COGS / Revenue is evaluated first.Then we multiply the result of that division by 100. Finally, the result from the last step is subtracted from Revenue.

These two supposed variations of the same formula highlight the importance of operator precedence. In the first formula,() takes the highest precedence. The operators that take the highest precedence in the second formula are * and /. But the second formula tells us something else too: if we have operators of the same precedence, evaluate them from left to right. So, / is evaluated first. Can you imagine the disaster this formula could unleash in companies' income statements if used without the correct precedence?

The rules surrounding expressions and precedence have intrigued me for a while, and I’ve often wondered how one might build a program that reliably evaluated expressions. Last year, I read an article that described how we might do this by converting an Infix Notation to a Reverse Polish Notation (RPN) and then evaluating the RPN using a stack - that was a fun learning experience. But, I left wondering if there was another approach we could take to do this. Can we solve this without converting to RPN?

This year, I found an answer in a book titled Crafting Interpreters by Robert Nystrom⁵: “Yes, we can, but we’ll have to convert to something else”. And this post (and next few posts) will expand this answer into a program.

Problem Statement 🔗

We start with a simple problem. Evaluate this expression: 120 + 12.3 * 53 - 9. As good mathematicians, we know that we need to multiply 12.3 by 53 first to give 651.9. Then add 651.9 to 120 to give 771.9. Then subtract 771.9 from 9 to give 762.9. We’ll stick with this expression throughout this series and our goal is to build a program that can evaluate it for us.

For our context, an expression is a valid and finite combination of operands and operators e.g. 1 + 2. And evaluating an expression means applying the operators to the operands according to some rules to produce a single value e.g. 1 + 2 = 3.

Building Explanck is similar to building a language like JavaScript because, as you’ll see, we implement all the phases of building an interpreted language: Scan (or Tokenize) -> Parse -> Intrepet. But while JavaScript can be used to build programs, Explanck only understands mathematical expressions².

In the scanning (or tokenization) phase, we’ll convert our expression into a series of tokens. Then, in the parsing phase, we’ll build something called an abstract syntax tree from our tokens. Finally, we will traverse (and evaluate) this tree to get our final value in the interpret phase.

In this post, we’ll focus on the first phase: Tokenization.

Tokenization 🔗

When our program receives an expression as input, it receives a string - a stream of characters. Strings don’t mean anything to us - they cannot be evaluated as they are - so in this phase we will convert input strings into meaningful units called tokens that represent operands or operators. Here are the token types we care about for Explanck:

enum TokenType {
	// Operators
	LEFT_PAREN, RIGHT_PAREN,
	PLUS, MINUS, 
	STAR, SLASH,
	POWER,
	
	// Operand
	NUMBER,
}

And we will store tokens in a Token class:

class Token {
	final TokenType type;
	// The value of the token
	final Object literal;
	
	// Constructor
	Token(TokenType type, Object literal){}
}

So, given an expression like 120 + 12.3 * 53 - 9, the tokens are:

[NUMBER 120, PLUS, NUMBER 12.3, STAR, NUMBER 53, MINUS, NUMBER 9]

Notice that the expression contains whitespaces, but we don’t care about it, so we ignore it. We do the same thing for every other invalid character (we might decide to throw errors later). The number of invalid characters we can get is infinite, but we know which characters are valid, so we define rules for them. Our pseudocode is

Traverse each character in the expression string
If a valid character sequence is found (that is, a sequence that matches one of our defined token types), create a token from it and add it to the token list

For operator token types, the characters we care about include (, ), +, -, *, /, ^. These are all one character tokens that we can match easily, and when we find them, we add them to a list of tokens.

class Scanner {
	private final String source;
	private final List<Token> tokens = new ArrayList<>();
	private int start = 0;
	private int current = 0;

	List<Token> scanTokens(){
		// continue loop until we reach the end of the source string.
		while (!isAtEnd()){
			char c = source.getCharAt(current);
			current++;

			switch(c) {
				case '(': add(TokenType.LEFT_PAREN); break;
				case ')': add(TokenType.RIGHT_PAREN); break;  
				case '*': add(TokenType.STAR); break;  
				case '/': add(TokenType.SLASH); break;  
				case '+': add(TokenType.PLUS); break;  
				case '-': add(TokenType.MINUS); break;  
				case '^': add(TokenType.POWER); break;
			}
		}
		
		return tokens;
	}
	
	// adds a token to tokens
	private void addToken(TokenType type){}
}

The logic to create a number token is a little more interesting, so let’s expound on that. Numbers aren’t always single characters. From our previous example, 120 is our first number token. How do we extract it? Here’s our pseudocode:

Traverse each character in the expression string
If the current character is a digit, keep looking ahead until we get to a character that is not a digit.
When we find a number that isn’t a digit (or reach the end of the string), we extract all the digit characters as one whole number.
We create a token from that number and add it to the token list

So we get something like

// in Scanner.java
List<Token> scanTokens(){	
	while(!isAtEnd()){
		//> omitting previously written code <
		
		// We need this to extract substrings of 1+ digits
		start = current;
		
		switch (c) {
			default:
				// If the current char is a digit, we call addNumberToken to keep looking ahead for more digits.
				if (isDigit(c)) addNumberToken();
		}
	}
}

// keeps looking ahead until we don't see a digit again
private void addNumberToken(){
	
	while (isDigit(source.charAt(current)) && !isAtEnd()) {
		current++;
	}
	
	// after all digits have been found, extract the number
	// Remember we do `start = current` at the top. This substring extraction is why it is important
	Double number = Double.parseDouble(source.substring(start, current));
	addToken(TokenType.NUMBER, number);
	
}

// an overloaded method to add tokens with a given literal value
private void addToken(TokenType type, Object literal){}

This logic works well for 120, but consider the following number from our expression: 12.3. Based on the current code, we will extract 12 and 3 as two separate numbers. We don’t want that, so how do we extend our logic?

Instead of arbitrarily stopping when we stop seeing digits, we check if the next character is a .,
If we find ., it is potentially a decimal point. And we confirm this by checking if there’s a number after the ..
As we implement this logic, there are a few edge-cases to cater for
- .123 is invalid because there are no digits before .
- 123.12.126 is not a valid NUMBER token because there are two decimal points. The code should extract 123.12 and 126 separately, completely ignoring the second .
- 123. will not work because there’s no number after the decimal. What gets extracted is 123. The decimal gets ignored.
Then, we extract all the characters that make up our number, create a token and add it to the token list

private void addNumberToken(){
	//...after first while
	
	if (peek(0) == '.' && source.isDigit(peek(1))) {
		current++;
		while (isDigit(source.charAt(current)) && !isAtEnd())
			current++;
	}
	
	// extract number, create a token and add it to the list.
}

// a helper method to lookahead by an arbitrary number of steps
private char peek(int stepsForward) {
	int pos = current + stepsForward;
	if (pos >= source.length()) return `\0`;
	return source.charAt(pos);
}

That’s it. That is all it takes to extract tokens for our needs (minus error handling). So, now we can do:

String exp = "120 + 12.3 * 53 - 9";  
List<Token> tokens = new Scanner(exp).scanTokens();  
sout(tokens) // [NUMBER 120, PLUS, NUMBER 12.3, STAR, NUMBER 53, MINUS, NUMBER 9]

You can find the code for the scanner on GitHub. In the next post, we tackle some preliminary theory to enable us to implement our next phase - parsing!

I initially called explanck a programming language because we build it like we would build an interpreted language. This didn’t seem “right” though since the only valid syntax are math expression - where are variables, loops, classes etc? On that note, we’ll just refer to explanck as a program that evaluates expressions. ↩︎
When I still played around with calling explanck a language, I initially wanted to call it explang, as in expression language, but I googled the name and saw that many people had used it for different projects. So, while I pondered what to name it, a friend who was with me suggested explanck because plang reminded him of planck, as in Planck’s constant. The Planck’s constant is one of the smallest constants used in Physics, and I reckoned that explanck will be one of the smallest expressions of a programming language you will ever find. It seemed to fit! The moral of this story is that naming is hard; if you have a cool name for anything, hold it tight! ↩︎
COGS is Cost of Goods Sold. ↩︎
Our “intuition” was formed through those math classes where we learned about operator precedence ;) ↩︎
Many ideas I use in this series are based on things I learned while reading Crafting Interpreters. It’s a good read if you’re interested in learning about languages. ↩︎