So You Want to Build a Language VM - Part 08 - Assembler: The Beginning

Starts on an assembler

August 13, 2018

Instructions…Assemble!

We could torture ourselves by writing all our programs in hex, and if that’s your thing, this section is technically optional.

Technically. Anywho, what’s an assembler? It’s the program that can turn this:

LOAD $1 #10

into:

00 01 00 0A.

It also has a few other responsibilities:

Handling labels
Calculating constants
Optimizations
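To make the LOAD $1 #10 translation concrete, here is a minimal sketch of the encoding step for that one instruction. The layout (one opcode byte, one register byte, then a 16-bit operand with the high byte first) and the LOAD opcode value of 0 are read off the example bytes 00 01 00 0A. The name `encode_load` is made up for illustration and is not part of the VM.

```rust
/// Hypothetical helper: encodes `LOAD $register #value` as four bytes.
/// Assumes the layout implied by the example: opcode, register, then a
/// 16-bit operand with the high byte first.
fn encode_load(register: u8, value: u16) -> [u8; 4] {
    const LOAD_OPCODE: u8 = 0x00; // assumed from the leading 00 in the example
    let [high, low] = value.to_be_bytes();
    [LOAD_OPCODE, register, high, low]
}
```

Under those assumptions, `encode_load(1, 10)` gives the bytes 00 01 00 0A.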

Lexing

A Lexer is a program that takes in a stream of text, checks it against a bunch of rules, and emits a stream of tokens.

Lexemes and Tokens

Lexing produces lexemes, which are something like "units of meaning" in a sentence. In our example LOAD $1 #10, the lexemes would be: LOAD, $1, and #10.

These lexemes are combined with an id or name into a token. So in our example, our tokens are: <opcode, 0>, <register, 1>, <number, 10>
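If you are curious what that looks like without any tooling, here is a rough sketch of a hand-rolled lexer that turns the example line into those three tokens. The `Tok` enum and the `lex` function are illustrative names only, and mapping LOAD to 0 is an assumption taken from the `<opcode, 0>` token above.

```rust
/// Token kinds from the example; the names are illustrative only.
#[derive(Debug, PartialEq)]
enum Tok {
    Opcode(u8),
    Register(u8),
    Number(i32),
}

/// Hypothetical hand-rolled lexer: splits on whitespace and classifies
/// each piece by how it starts. Returns None on anything unexpected.
fn lex(line: &str) -> Option<Vec<Tok>> {
    line.split_whitespace()
        .map(|lexeme| {
            if let Some(digits) = lexeme.strip_prefix('$') {
                digits.parse::<u8>().ok().map(Tok::Register)
            } else if let Some(digits) = lexeme.strip_prefix('#') {
                digits.parse::<i32>().ok().map(Tok::Number)
            } else if lexeme == "LOAD" {
                Some(Tok::Opcode(0)) // LOAD is assumed to be opcode 0
            } else {
                None
            }
        })
        .collect()
}
```

Collecting into `Option<Vec<Tok>>` means a single bad lexeme makes the whole line fail, which is about the simplest error handling you can get away with.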

Grammar

So let’s define some things about our assembly language:

A Program is composed of Instructions
An Instruction is composed of:
- An Opcode
- A Register
- An IntegerOperand
- A Newline
An Opcode is composed of:
- One or more Letters in a row
- A Space
A Register is composed of:
- The symbol $
- A Number
- A Space
An IntegerOperand is composed of:
- The symbol #
- A Number
A Number is composed of:
- One or more of the symbols 0-9
A Newline is composed of:
- The symbol \ followed by the symbol n

This is called a *grammar*. It has the rules of our language so far, and we’ll expand it as we go through this section.
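One way to read a grammar like this is as a set of small functions, one per rule. The sketch below does that in plain Rust with no libraries: each function returns the unconsumed rest of the input when its rule matches and None otherwise. All function names here are made up for illustration, and the Newline rule is taken to mean the newline character.

```rust
// Each function mirrors one grammar rule. All names are illustrative.

/// One or more leading characters accepted by `pred`.
fn one_or_more(input: &str, pred: impl Fn(char) -> bool) -> Option<&str> {
    let end = input.find(|c: char| !pred(c)).unwrap_or(input.len());
    if end == 0 { None } else { Some(&input[end..]) }
}

/// Number: one or more of the symbols 0-9.
fn number(input: &str) -> Option<&str> {
    one_or_more(input, |c| c.is_ascii_digit())
}

/// Opcode: one or more letters, then a space.
fn opcode(input: &str) -> Option<&str> {
    one_or_more(input, |c| c.is_ascii_alphabetic())?.strip_prefix(' ')
}

/// Register: `$`, a Number, then a space.
fn register(input: &str) -> Option<&str> {
    number(input.strip_prefix('$')?)?.strip_prefix(' ')
}

/// IntegerOperand: `#`, then a Number.
fn integer_operand(input: &str) -> Option<&str> {
    number(input.strip_prefix('#')?)
}

/// Instruction: Opcode, Register, IntegerOperand, Newline.
fn instruction(input: &str) -> Option<&str> {
    let rest = opcode(input)?;
    let rest = register(rest)?;
    let rest = integer_operand(rest)?;
    rest.strip_prefix('\n')
}
```

Notice how `instruction` is built entirely out of the smaller rules. This sketch only says yes or no; it does not produce tokens.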

Important

To delve further into lexing and grammars, search for context-free grammars and Backus-Naur form.

Back to Lexing

So how do we get a lexer?

Two options:

We could write our own Lexer. This is not super difficult, and everyone should do it at least once.
We could use a tool like lex.

I gave serious consideration to writing our own, but I want the focus of these tutorials to be the VM. There are a ton of tutorials on writing lexers, and I might add one at a later date. For now, we’re going to use the Rust parser combinator library nom. This tool makes it very, very easy to handle our lexing and parsing needs.

Omnomnomnomnom

To start gobbling bits, we have to do the following:

Add nom as a dependency
Add the nom crate to main.rs
Create a module directory for the assembler
Add the assembler module to main.rs
Create a Token enum
Create rules for nom

Nom Dependency

In your Cargo.toml file, add nom as a dependency:

[dependencies]
nom = "^4.0"

Files

Create a new directory under src called assembler, and add a mod.rs file to it
In main.rs, add pub mod assembler; to the top
In main.rs, add #[macro_use] to the top, with extern crate nom; on the line below it
Create opcode_parsers.rs in src/assembler

Token Enum

nom needs something to emit, so we need to create a Token enum. In src/assembler/mod.rs, put:

use instruction::Opcode;

#[derive(Debug, PartialEq)]
pub enum Token {
    Op{code: Opcode},
}

Right now, we’re only going to teach our parser to recognize an Opcode. Whenever it finds one, it will create Token::Op{code: Opcode}.

Note	Yes, it’s a trifle weird to have Token::Op contain `code` which is an `Opcode`. Originally, the definition was `Token::Opcode{code: Opcode}` but that was confusing due to the duplication of the word `Opcode` and using it in two different spots. I hope this makes it clearer.

Rules: The Basics

OK, time to write our first rule. I’ll try to explain enough about nom as we go, but for a deep-dive into it, peruse their GitHub. The basic idea is that we are going to use nom’s macros to write a bunch of rules that, when applied, can parse a file containing valid iridium assembler code. These parsers can build on each other; this permits creating complex rules composed of simpler rules. We’ll see how this is useful in a bit.

Rules: Opcode

Earlier in the post, we defined an Opcode as:

An Opcode is composed of:
- One or more Letters in a row
- A Space

Let’s take this one line at a time. In src/assembler/opcode_parsers.rs:

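use nom::types::CompleteStr;

use assembler::Token;
use instruction::Opcode;

The first line loads in the nom type CompleteStr. We’ll be passing complete strings to our rules, not streaming data. The other two bring in the Token and Opcode enums that our rule refers to.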

named!(opcode_load<CompleteStr, Token>,

This uses the named! macro in the nom crate to define a function called opcode_load. It will take a CompleteStr and return a Token.

  do_parse!(
      tag!("load") >> (Token::Op{code: Opcode::LOAD})
  )
);

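The critical part here is tag!("load") >> (Token::Op{code: Opcode::LOAD}). It looks for the string "load" at the start of the input we give it, and if it finds it, it returns the Token::Op variant wrapping Opcode::LOAD. The match is case-sensitive, so an uppercase LOAD will not match this rule.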

Note	The `>> (Token::Op{code: Opcode::LOAD})` part is from the `do_parse!` macro. It lets us chain parsers and pass the results to parsers further downstream. We’ll see how this works in more detail in a bit.
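If the macros feel like magic, it may help to see roughly the same behaviour written by hand. This is only a sketch of the idea, not what nom expands to: the `Opcode` and `Token` enums are local stand-ins so the snippet compiles on its own, `opcode_load_by_hand` is a made-up name, and nom’s real error type carries far more detail than `()`.

```rust
// Stand-ins for the tutorial's types so the sketch compiles on its own.
#[derive(Debug, PartialEq)]
enum Opcode {
    LOAD,
}

#[derive(Debug, PartialEq)]
enum Token {
    Op { code: Opcode },
}

/// Roughly what `opcode_load` does, written without nom: match the literal
/// "load" at the start of the input, then hand back the unconsumed rest
/// together with the token.
fn opcode_load_by_hand(input: &str) -> Result<(&str, Token), ()> {
    match input.strip_prefix("load") {
        Some(rest) => Ok((rest, Token::Op { code: Opcode::LOAD })),
        None => Err(()),
    }
}
```

The `(rest, token)` pair is the important shape here: each parser returns what it did not consume, so the next parser in the chain knows where to pick up.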

Tests

Time to write a test for our parser! In opcode_parsers.rs, put:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_opcode_load() {
        // First tests that the opcode is detected and parsed correctly
        let result = opcode_load(CompleteStr("load"));
        assert_eq!(result.is_ok(), true);
        let (rest, token) = result.unwrap();
        assert_eq!(token, Token::Op{code: Opcode::LOAD});
        assert_eq!(rest, CompleteStr(""));

        // Tests that an invalid opcode isn't recognized
        let result = opcode_load(CompleteStr("aold"));
        assert_eq!(result.is_ok(), false);
    }
}

Yay, we have a function that can recognize one opcode!

Important

Holy shit, I just figured out how to do callouts in asciidoc! Muhahahahaha!

Rules: Registers

Now on to registers. First, let’s create the Register variant of our Token enum in src/assembler/mod.rs:

#[derive(Debug, PartialEq)]
pub enum Token {
    Op{code: Opcode},
    Register{reg_num: u8}
}

Next, a parser for a register. We’ll put this in src/assembler/register_parsers.rs. In our assembly language, registers take the form $0: a dollar sign followed by a number >= 0. Our function for this looks like:

use nom::types::CompleteStr;
use nom::digit;

use assembler::Token;

named!(register <CompleteStr, Token>, // <1>
    ws!( // <2>
        do_parse!( // <3>
            tag!("$") >> // <4>
            reg_num: digit >> // <5>
            ( // <6>
                Token::Register{ // <7>
                  reg_num: reg_num.parse::<u8>().unwrap() // <8>
                } // <9>
            ) // <10>
        )
    )
);

<1> We create a function named register that accepts a CompleteStr and returns either the remaining CompleteStr and a Token, or an Error
<2> We use the ws! macro, which tells it to consume any whitespace on either side of our register. This lets us write LOAD $0 with extra spaces or tabs around the register, not just a single space
<3> We use the do_parse! macro to chain parsers
<4> We use tag! to look for $, then hand whatever input is left…
<5> …to the function digit, and save what it matches in a variable called reg_num. nom provides the function digit, which recognizes one or more 0-9 characters
<6> Create the Token enum with the appropriate info and return
<7> Start creation of the Token. We want the Register variant
<8> Parse the digits into a u8, unwrap the result, and store it in the reg_num field
<9> Close the Token struct
<10> Close the tuple result the macro will return

Tests

And now a test for it…

  #[test]
  fn test_parse_register() {
      let result = register(CompleteStr("$0"));
      assert_eq!(result.is_ok(), true);
      let result = register(CompleteStr("0"));
      assert_eq!(result.is_ok(), false);
      let result = register(CompleteStr("$a"));
      assert_eq!(result.is_ok(), false);
  }

You could add even more error cases, such as "$", depending on how test-happy you are.
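One case worth knowing about is a register number that is too big for a u8, such as $256. The digits match fine, but parsing them into a u8 fails, so the unwrap() in the parser would panic instead of reporting a parse error. Here is a small sketch in plain std Rust of a conversion that does not panic. The name `parse_register_number` is hypothetical, and wiring it back into the nom rule is left out.

```rust
/// Hypothetical helper: converts the digits after `$` into a register
/// number, returning None when the value does not fit in a u8.
fn parse_register_number(digits: &str) -> Option<u8> {
    digits.parse::<u8>().ok()
}
```

How you surface that None is a design choice. For a tutorial VM, panicking on absurd register numbers may be acceptable.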

Rules: Integer Operands

And finally, integer operands! Create the IntegerOperand variant of our Token enum in src/assembler/mod.rs:

#[derive(Debug, PartialEq)]
pub enum Token {
    Op{code: Opcode},
    Register{reg_num: u8},
    IntegerOperand{value: i32},
}

Note	Yes, the i32 leaves room for negative numbers, even though the parser below only accepts digits and our `LOAD` instruction can only load 16 bits. This is for future expansion.
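Since the token holds an i32 but LOAD only has 16 bits to work with, something eventually has to check that the value fits. A minimal sketch of such a check follows. It assumes the 16 bits are treated as unsigned, which the text above does not spell out, and `operand_to_u16` is a made-up name.

```rust
use std::convert::TryFrom;

/// Hypothetical check: returns the operand as a u16 if it fits in the
/// 16 bits LOAD can carry, or None if it is negative or too large.
/// Assumes the operand is stored as an unsigned 16-bit value.
fn operand_to_u16(value: i32) -> Option<u16> {
    u16::try_from(value).ok()
}
```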

Next, make the file src/assembler/operand_parsers.rs, in which we’ll put the last parser we have to write: one to recognize an IntegerOperand. We said those are composed of # followed by digits. In operand_parsers.rs, put:

use nom::types::CompleteStr;
use nom::digit;

use assembler::Token;

/// Parser for integer numbers, which we preface with `#` in our assembly language:
/// #100
named!(integer_operand<CompleteStr, Token>,
    ws!(
        do_parse!(
            tag!("#") >>
            value: digit >>
            (
                Token::IntegerOperand{value: value.parse::<i32>().unwrap()}
            )
        )
    )
);

Tests

Guess what this is? A test!

#[test]
fn test_parse_integer_operand() {
    // Test a valid integer operand
    let result = integer_operand(CompleteStr("#10"));
    assert_eq!(result.is_ok(), true);
    let (rest, value) = result.unwrap();
    assert_eq!(rest, CompleteStr(""));
    assert_eq!(value, Token::IntegerOperand{value: 10});

    // Test an invalid one (missing the #)
    let result = integer_operand(CompleteStr("10"));
    assert_eq!(result.is_ok(), false);
}

Wrapping it up in `mod.rs`

Now in src/assembler/mod.rs, export the three modules we made by adding:

pub mod opcode_parsers;
pub mod operand_parsers;
pub mod register_parsers;

to the top.

End

Phew, this was a longer post, so I’m going to stop here. Next, we’ll go over how to combine these parsers into ones that can parse entire instructions, and ultimately entire programs.

