So You Want to Build a Language VM - Part 19 - Starting on Palladium

Begins building the higher level language and compiler

Introduction

If you’ve been following the development of Iridium, you know how heavily it uses Nom to parse the assembly language. I hope you liked it, because we’re going to be using Nom for this as well. =)

In this tutorial, we’re going to start creating a language called Palladium that will compile down to the assembly code we’ve been using. Before we get started, please remember…​

I have no idea what I’m doing

I read the LLVM Kaleidoscope Tutorial and that’s pretty much it. If you see something horrific I’ve done, have suggestions, etc, please let me know.

Flow

This language complements the Iridium VM and is tightly coupled to it. A diagram of the flow looks something like this:

+------------------+      +------------------+      +------------------+
| Palladium        |  ->  | Iridium ASM      |  ->  | Iridium VM       |
+------------------+      +------------------+      +------------------+

Goals

We’re going to start off easy for this tutorial. Our goal is to write a compiler that will transform this:

1 + 2

into this:

LOAD $0 #1
LOAD $1 #2
ADD $0 $1 $2

Parsers

Here are the parsers we'll need:

  1. Operator (+,-,/,*)

  2. Integer

  3. Expression

  4. Program

Let’s get started!

Step 1: New Project

  1. Run cargo new palladium

  2. In Cargo.toml, add nom = "^4.0" to the dependencies
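
The post doesn't show src/main.rs, but for nom's macros (named!, ws!, do_parse! and friends) to be usable in the modules we're about to create, the crate root has to pull them in with #[macro_use] and declare each module. Here's a minimal sketch, assuming a 2015-edition binary crate (on newer editions the use paths in later files would need crate:: prefixes):

// src/main.rs (sketch) -- wire up nom's macros and the modules created in the steps below
#[macro_use]
extern crate nom;

pub mod tokens;
pub mod operator_parsers;
pub mod operand_parsers;
pub mod expression_parsers;
pub mod program_parsers;
pub mod visitor;

fn main() {
    println!("Palladium compiler");
}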

Step 2: Tokens

In src/ create a file called tokens.rs and put this in it:

#[derive(Debug, PartialEq)]
pub enum Token {
    AdditionOperator,
    SubtractionOperator,
    MultiplicationOperator,
    DivisionOperator,
    Integer{ value: i64 },
    Expression{ left: Box<Token>, op: Box<Token>, right: Box<Token> },
    Program{ expressions: Vec<Token> }
}

These are the tokens Nom will be spewing out as it chews through our code.

Important
Note how we have to use some form of indirection with Expression and Program; because these variants can contain other Tokens, the compiler can't calculate a finite size for the type (the recursion would make it infinite). Using Box or Vec solves the problem: both store their contents behind a pointer, and a pointer has a known size.
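
As a quick illustration (this isn't from the original code), here's how the tree for 1 + 2 gets built by hand. It could live in a tests module inside src/tokens.rs:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_build_expression_by_hand() {
        // Box<Token> is just a pointer, so Token has a known size even though
        // an Expression can contain other Expressions.
        let one_plus_two = Token::Expression {
            left: Box::new(Token::Integer { value: 1 }),
            op: Box::new(Token::AdditionOperator),
            right: Box::new(Token::Integer { value: 2 }),
        };
        println!("{:#?}", one_plus_two);
    }
}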

Step 3: Operator Parsers

Now make a file called src/operator_parsers.rs. We’ll need 5 parsers here, one for each operator, and then one that uses alt! to check for each of them. Here’s the addition operator parser:

use nom::types::CompleteStr;

use tokens::Token;

named!(addition_operator<CompleteStr, Token>,
    ws!(
        do_parse!(
            tag!("+") >>
            (
                Token::AdditionOperator
            )
        )
    )
);

The rest are nearly identical, so I won’t show them here. A parser that checks for each of those four is a bit trickier:

named!(pub operator<CompleteStr, Token>,
    ws!(
        alt!(
            addition_operator |
            subtraction_operator |
            multiplication_operator |
            division_operator
        )
    )
);
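
A quick test for this module (a sketch; it assumes you've written the other three operator parsers with the obvious names):

#[cfg(test)]
mod tests {
    use super::*;
    use tokens::Token;
    use nom::types::CompleteStr;

    #[test]
    fn test_parse_operator() {
        // Each symbol should map to its matching Token variant.
        let result = operator(CompleteStr("+"));
        assert_eq!(result.is_ok(), true);
        let (_, token) = result.unwrap();
        assert_eq!(token, Token::AdditionOperator);
    }
}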

Step 4: Operand Parsers

This is similar to the above. Right now, we only have integers as operands, but as we continue, we’ll add in more.

Make src/operand_parsers.rs, and in it put:

use nom::types::CompleteStr;
use nom::digit;

use tokens::Token;

/// Parses an integer
named!(integer<CompleteStr, Token>,
    ws!(
        do_parse!(
            sign: opt!(tag!("-")) >>
            reg_num: digit >>
            (
                {
                    let mut tmp = String::from("");
                    if sign.is_some() {
                        tmp.push_str("-");
                    }
                    tmp.push_str(&reg_num.to_string());
                    let converted = tmp.parse::<i64>().unwrap();
                    Token::Integer{ value: converted }
                }
            )
        )
    )
);

Note the sign: opt!(tag!("-")) line. This will let us handle 64-bit positive or negative integers.
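
One thing the post glosses over: the expression parser in Step 5 imports an operand parser from this file rather than calling integer directly. A minimal version of that wrapper (assuming we'll broaden it with alt! once there are more operand types) is just:

/// Parses an operand. For now an operand can only be an integer, but this
/// wrapper is where we'll alt! in more kinds later.
named!(pub operand<CompleteStr, Token>,
    call!(integer)
);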

Statements and Expressions

Before moving on, let’s cover what these two things are and how they are used in a programming language context.

Expression

An expression is something that produces a new value. This can be a combination of functions, operators, and other values; 5 on its own is an expression, for example, and so is 1 + 2. Another way to think about it is that expressions are something.

Statement

Statements do not produce a value. For example, x = 5 is a statement: it binds the value 5 to x, but the statement as a whole doesn't produce any new value. You can think of statements as doing something.
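
In Rust terms (purely illustrative; the function name is just for the example):

fn statements_and_expressions() {
    // `1 + 2` is an expression: it evaluates to the value 3 and can be
    // nested inside larger expressions.
    let sum = 1 + 2;
    // The `let` line as a whole is a statement: it does something (binds
    // `sum`) but does not itself produce a value.
    println!("{}", sum);
}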

Tests

Oh, right, tests for the operand parsers:

#[cfg(test)]
mod tests {
    use super::*;
    use tokens::Token;
    use nom::types::CompleteStr;

    #[test]
    fn test_parse_integer() {
        let test_integers = vec!["0", "-1", "1"];
        for o in test_integers {
            let parsed_o = o.parse::<i64>().unwrap();
            let result = integer(CompleteStr(o));
            assert_eq!(result.is_ok(), true);
            let (_, token) = result.unwrap();
            assert_eq!(token, Token::Integer{value: parsed_o});
        }
    }
}

I tried to be a bit more clever and not write three different test cases for 0, -1, and 1.

Step 5: Expression Parser

We have parsers for operands and operators. Since 1 + 2 is a combination of operands and an operator that produces a value (3), that means it's an expression. Thus, we need an expression parser. Make src/expression_parsers.rs and in it put:

use nom::types::CompleteStr;

use tokens::Token;
use operand_parsers::operand;
use operator_parsers::operator;

named!(pub expression<CompleteStr, Token>,
    ws!(
        do_parse!(
            left: operand >>
            op: operator >>
            right: operand >>
            (
                Token::Expression{
                    left: Box::new(left),
                    right: Box::new(right),
                    op: Box::new(op)
                }
            )
        )
    )
);
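
And a test for it, following the same pattern as the other test modules (a sketch):

#[cfg(test)]
mod tests {
    use super::*;
    use tokens::Token;
    use nom::types::CompleteStr;

    #[test]
    fn test_parse_expression() {
        let result = expression(CompleteStr("3+4"));
        assert_eq!(result.is_ok(), true);
        let (_, token) = result.unwrap();
        assert_eq!(
            token,
            Token::Expression {
                left: Box::new(Token::Integer { value: 3 }),
                op: Box::new(Token::AdditionOperator),
                right: Box::new(Token::Integer { value: 4 }),
            }
        );
    }
}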

Step 6: Program Parser

This is the top-most parser, Program! Make src/program_parsers.rs and in it put:

use nom::types::CompleteStr;

use expression_parsers::*;
use tokens::Token;

named!(pub program<CompleteStr, Token>,
    ws!(
        do_parse!(
            expressions: many1!(expression) >>
            (
                Token::Program {
                    expressions: expressions
                }
            )
        )
    )
);

#[cfg(test)]
mod tests {
    use super::*;
    use nom::types::CompleteStr;
    #[test]
    fn test_parse_program() {
        let test_program = CompleteStr("1+2");
        let result = program(test_program);
        assert_eq!(result.is_ok(), true);
    }
}

Let's print out the result from test_parse_program and have a look at the tree it builds.
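
One way to do this (not shown in the original) is to add a temporary debug print at the end of the test:

    // inside test_parse_program, after the assert
    let (_, tree) = result.unwrap();
    println!("{:#?}", tree);

Running cargo test test_parse_program -- --nocapture then prints: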

Program {
    expressions: [
        Expression {
            left: Integer {
                value: 1
            },
            op: AdditionOperator,
            right: Integer {
                value: 2
            }
        }
    ]
}

Notice the tree-like structure? This is an AST, or Abstract Syntax Tree. I'm not going to go into exhaustive detail explaining what they are here, since there are better explanations out there than I could give.

A quick summary is that our Nom parser transforms our source text into a data structure. It's called an abstract syntax tree because the parser discards information such as comments and whitespace; a concrete syntax tree would preserve all of that. Why use ASTs? So we can do…

Step 7: Code Generation!

(For some reason, this is my favorite part of writing a programming language.) Now that we have a tree representing our program, we want to output the appropriate Iridium assembly, which Iridium can then assemble into executable bytecode.

Visitor Pattern

One way to do this is called the Visitor pattern. Think of it as writing a program that visits each node in our tree, examines it, and outputs the IASM code. This pattern is useful when you need to do something across a collection of different objects.

To do this, let’s use Traits! Make src/visitor.rs. This part gets a bit complicated, so we’ll do it in segments. First, put this in the file:

use tokens::Token;

pub trait Visitor {
    fn visit_token(&mut self, node: &Token);
}

Here we define a Trait called Visitor. There’s one function, visit_token, and it accepts a mutable reference to self and a reference to a Token. Because we used enums in tokens.rs, we cannot define the trait like this:

pub trait Visitor {
    fn visit_addition_operator(&mut self, node: &Token::AdditionOperator);
}

Enum variants are not types. We could use this method if we had implemented each enum variant as a struct instead. I have a serious addiction to Rust’s enums, though.
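
For comparison, the struct-per-variant approach would look roughly like this (just a sketch, including the trait name; we're not going to use it):

// Sketch only: one struct per node type instead of one enum.
pub struct AdditionOperator;
pub struct Integer {
    pub value: i64,
}

// With a real type for each node, the trait can have one method per node kind.
pub trait NodeVisitor {
    fn visit_addition_operator(&mut self, node: &AdditionOperator);
    fn visit_integer(&mut self, node: &Integer);
}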

Next, put in this:

pub struct Compiler;

impl Visitor for Compiler {
    fn visit_token(&mut self, node: &Token) {
        match node {
            &Token::AdditionOperator => {},
            &Token::SubtractionOperator => {},
            &Token::MultiplicationOperator => {},
            &Token::DivisionOperator => {},
            &Token::Integer{ value } => {},
            &Token::Expression{ ref left, ref op, ref right } => {},
            &Token::Program{ ref expressions } => {}
        }
    }
}

We declare an empty struct called Compiler, and then we implement the Visitor Trait for it.

Important
The match section is how we deal with enum variants not being types. The Compiler will visit each node, each of which is of type `Token`, and then it will use matching to determine what further functions to call.

To illustrate this, I’ve added some debug prints to the various match arms:

fn visit_token(&mut self, node: &Token) {
    match node {
        &Token::AdditionOperator => {
            println!("Visiting AdditionOperator");
        },
        &Token::SubtractionOperator => {},
        &Token::MultiplicationOperator => {},
        &Token::DivisionOperator => {},
        &Token::Integer{ value } => {
            println!("Visited Integer with value of: {:#?}", value);
        },
        &Token::Expression{ ref left, ref op, ref right } => {
            println!("Visiting an expression");
            self.visit_token(left);
            self.visit_token(op);
            self.visit_token(right);
            println!("Done visiting expression");
        },
        &Token::Program{ ref expressions } => {
            println!("Visiting program");
            for expression in expressions {
                self.visit_token(expression);
            }
            println!("Done visiting program");
        }
    }
}

We’ll remove those later, but for now, it will help illustrate how this works.

Tests

Here’s a simple test to put in visitor.rs:

#[cfg(test)]
mod tests {
    use super::*;
    use tokens::Token;
    use nom::types::CompleteStr;
    use program_parsers::program;

    fn generate_test_program() -> Token {
        let source = CompleteStr("1+2");
        let (_, tree) = program(source).unwrap();
        tree
    }

    #[test]
    fn test_visit_addition_token() {
        let mut compiler = Compiler{};
        let test_program = generate_test_program();
        compiler.visit_token(&test_program);
    }
}

And if we run the test, we get:

$ cargo test test_visit_addition_token -- --nocapture
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running target/debug/deps/palladium-2f91df5ee61c7967

running 1 test
Visiting program
Visiting an expression
Visited Integer with value of: 1
Visiting AdditionOperator
Visited Integer with value of: 2
Done visiting expression
Done visiting program
test visitor::tests::test_visit_addition_token ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 8 filtered out

Pretty cool, huh? We’re almost there!

Tokens…​COMPILE!

Because our VM is register-based, we have to worry about something called register allocation. Our compiler has to track which registers are in use, and which ones are not. In the case of our example program, 1+2, we need three registers. Why three? Consider the steps needed:

  1. Load 1 into register 0

  2. Load 2 into register 1

  3. Add the first two registers

  4. Store the result in a third register, register 2

  5. Registers 0 and 1 are free again for use

With that in mind, let’s change our Compiler struct a bit. Well, a lot:

#[derive(Default)]
pub struct Compiler{
    free_registers: Vec<u8>,
    used_registers: Vec<u8>,
    assembly: Vec<String>
}

impl Compiler {
    pub fn new() -> Compiler {
        let mut free_registers = vec![];
        // Iridium has 32 registers (0 through 31), so mark all of them as free.
        for i in 0..32 {
            free_registers.push(i);
        }
        free_registers.reverse();
        Compiler{
            free_registers: free_registers,
            used_registers: vec![],
            assembly: vec![]
        }
    }
}

We added three new attributes: a vector to store registers that are free to use, one to store registers currently in use, and a vector of Strings to hold the final output.

Note
We reverse the free registers vector so it starts allocating from 0. Not that it matters, it just annoys me for some reason.

Ready for the final version of the visit_token function? Here we go…​

impl Visitor for Compiler {
    fn visit_token(&mut self, node: &Token) {
        match node {
            &Token::AdditionOperator => {
                // TODO: Need to clean this up. Remove the unwraps.
                let result_register = self.free_registers.pop().unwrap();
                // used_registers is a stack, so the first pop is actually the
                // right operand's register; for addition the order doesn't matter.
                let left_register = self.used_registers.pop().unwrap();
                let right_register = self.used_registers.pop().unwrap();
                let line = format!("ADD ${} ${} ${}", left_register, right_register, result_register);
                self.assembly.push(line);
                self.free_registers.push(left_register);
                self.free_registers.push(right_register);

            },
            &Token::SubtractionOperator => {},
            &Token::MultiplicationOperator => {},
            &Token::DivisionOperator => {},
            &Token::Integer{ value } => {
                let next_register = self.free_registers.pop().unwrap();
                let line = format!("LOAD ${} #{}", next_register, value);
                self.used_registers.push(next_register);
                self.assembly.push(line);
            },
            &Token::Expression{ ref left, ref op, ref right } => {
                self.visit_token(left);
                self.visit_token(right);
                self.visit_token(op);
            },
            &Token::Program{ ref expressions } => {
                for expression in expressions {
                    self.visit_token(expression);
                }
            }
        }
    }
}

  1. A Program consists of Expressions, so we iterate over each Expression.

  2. For each Expression, visit the left, right and op side.

  3. When we visit an Integer, pop a free register off to store it

  4. When we visit an AdditionOperator, pop a free register to store the result

  5. Pop the two most recently added used registers

  6. Generate the IASM

  7. Add the registers previously used to store the left and right values to the free list

Important
This will not work for more complex expressions, but we’ll deal with that in a later tutorial.
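
One small housekeeping item: since Compiler now has fields, the earlier test can no longer build it with Compiler{}. Switching it to Compiler::new() and printing the generated assembly looks something like this (a sketch):

    #[test]
    fn test_visit_addition_token() {
        // Compiler now has fields, so build it with new() instead of Compiler{}.
        let mut compiler = Compiler::new();
        let test_program = generate_test_program();
        compiler.visit_token(&test_program);
        // Dump the generated Iridium assembly so we can see it with --nocapture.
        println!("{:#?}", compiler.assembly);
    }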

Now if we run the test, the output gives us:

[
    "LOAD $0 #1",
    "LOAD $1 #2",
    "ADD $1 $0 $2"
]

Even cooler, eh?

Let’s start up the Iridium REPL and input these three lines:

$ iridium
Welcome to Iridium! Let's be productive!
>>> load $0 #1
>>> load $1 #2
>>> add $1 $0 $2
>>> .registers
Listing registers and all contents:
[
    1,
    2,
    3,
    0,
    0,
    <snip>
]
End of Register Listing

Success!

End

Phew, that was a long post. We can make this seamless by adding an assembler to the Compiler struct, which we'll do in a later tutorial. I'm done for now, though. =) Palladium is on GitLab.

See you next tutorial!

