So You Want to Build a Language VM - Part 08 - Assembler: The Beginning
Starts on an assembler
Instructions…Assemble!
We could torture ourselves by writing all our programs in hex, and if that’s your thing, this section is technically optional.
Technically. Anywho, what’s an assembler? It’s the program that can turn this:
LOAD $1 #10
into:
00 01 00 0A
It also has a few other responsibilities:
* Handling labels
* Calculating constants
* Optimizations
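To make the LOAD example above concrete, here is a minimal sketch of the encoding step in plain Rust. The byte layout (opcode, register, then a 16-bit operand with the high byte first) is inferred from the `00 01 00 0A` output above, and the names `LOAD`, `encode_load` and `to_hex` are hypothetical, not part of the VM's code.

```rust
/// Assumption: the example output starts with 0x00, so this sketch
/// treats LOAD as opcode 0.
const LOAD: u8 = 0x00;

/// Encodes `LOAD $register #value` as four bytes:
/// opcode, register number, then the 16-bit operand, high byte first.
fn encode_load(register: u8, value: u16) -> [u8; 4] {
    let [high, low] = value.to_be_bytes();
    [LOAD, register, high, low]
}

/// Formats bytes as space-separated uppercase hex pairs.
fn to_hex(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling `to_hex(&encode_load(1, 10))` should give back the same four hex pairs as the example.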
Lexing
A Lexer is a program that takes in a stream of text, checks it against a bunch of rules, and emits a stream of tokens.
Lexemes and Tokens
Lexing produces lexemes, which are something like "units of meaning" in a sentence. In our example LOAD $1 #10, the lexemes would be: LOAD, $, 1, #, 1, 0.
These lexemes are combined with an id or name into a token. So in our example, our tokens are:
<opcode, 0>, <register, 1>, <number, 10>
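As a rough illustration of that text-to-tokens step, here is a hand-rolled sketch in plain Rust (no nom yet). `SketchToken` and `lex_line` are hypothetical names for this example only, and it assumes lexemes are separated by whitespace and that LOAD maps to opcode 0, as in the token list above.

```rust
/// Hypothetical token type for this sketch; the real `Token` enum comes later.
#[derive(Debug, PartialEq)]
enum SketchToken {
    Opcode(u8),
    Register(u8),
    Number(i32),
}

/// Splits a line on whitespace and classifies each piece.
/// Returns `None` as soon as a piece matches no rule.
fn lex_line(line: &str) -> Option<Vec<SketchToken>> {
    line.split_whitespace()
        .map(|lexeme| {
            if let Some(digits) = lexeme.strip_prefix('$') {
                // `$` followed by a number is a register
                digits.parse::<u8>().ok().map(SketchToken::Register)
            } else if let Some(digits) = lexeme.strip_prefix('#') {
                // `#` followed by a number is an integer operand
                digits.parse::<i32>().ok().map(SketchToken::Number)
            } else if lexeme.eq_ignore_ascii_case("load") {
                // Assumption: LOAD is opcode 0, matching <opcode, 0> above
                Some(SketchToken::Opcode(0))
            } else {
                None
            }
        })
        .collect()
}
```

Collecting an iterator of `Option`s into an `Option<Vec<_>>` means a single unrecognized piece makes the whole line fail, which is about the simplest error handling a lexer can have.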
Grammar
So let’s define some things about our assembly language:
* A `Program` is composed of `Instructions`
* An `Instruction` is composed of:
  * An `Opcode`
  * A `Register`
  * An `IntegerOperand`
  * A `Newline`
* An `Opcode` is composed of:
  * One or more `Letters` in a row
  * A `Space`
* A `Register` is composed of:
  * The symbol `$`
  * A `Number`
  * A `Space`
* An `IntegerOperand` is composed of:
  * The symbol `#`
  * A `Number`
* A `Number` is composed of:
  * The symbols `0-9`
* A `Newline` is composed of:
  * The symbol `\` followed by the symbol `n`
This is called a *grammar*. It has the rules of our language so far, and we’ll expand it as we go through this section.
Important | To delve further into lexing and grammars, google context-free grammars and Backus-Naur form. |
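One way to read the grammar is as a set of small functions, one per rule, where each function consumes its piece of the input and hands back whatever is left. The sketch below does that in plain Rust. All function names are hypothetical, and two things are assumptions of this sketch rather than part of the grammar: `Letters` is taken to mean ASCII letters, and an empty input counts as a valid (empty) `Program`.

```rust
/// Number: one or more of the symbols 0-9. Returns the unconsumed rest.
fn number(input: &str) -> Option<&str> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 { None } else { Some(&input[end..]) }
}

/// Space: a single ' ' character.
fn space(input: &str) -> Option<&str> {
    input.strip_prefix(' ')
}

/// Opcode: one or more letters in a row, then a Space.
fn opcode(input: &str) -> Option<&str> {
    let end = input
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(input.len());
    if end == 0 { None } else { space(&input[end..]) }
}

/// Register: the symbol '$', a Number, then a Space.
fn register(input: &str) -> Option<&str> {
    space(number(input.strip_prefix('$')?)?)
}

/// IntegerOperand: the symbol '#', then a Number.
fn integer_operand(input: &str) -> Option<&str> {
    number(input.strip_prefix('#')?)
}

/// Instruction: Opcode, Register, IntegerOperand, Newline.
fn instruction(input: &str) -> Option<&str> {
    let rest = opcode(input)?;
    let rest = register(rest)?;
    let rest = integer_operand(rest)?;
    rest.strip_prefix('\n')
}

/// Program: instructions back to back with nothing left over.
fn is_program(mut input: &str) -> bool {
    while !input.is_empty() {
        match instruction(input) {
            Some(rest) => input = rest,
            None => return false,
        }
    }
    true
}
```

This only answers "does the text match the grammar?" and throws away the values it reads. A real lexer or parser would also return tokens along the way.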
Back to Lexing
So how do we get a lexer?
Two options:
* We could write our own Lexer. This is not super difficult, and everyone should do it at least once.
* We could use a tool like `lex`.
I gave serious consideration to writing our own, but I want the focus of these tutorials to be the VM. There are a ton of tutorials on writing lexers, and I might add one at a later date. For now, we’re going to use the Rust parser combinator library nom. This tool makes it very, very easy to handle our lexing and parsing needs.
Omnomnomnomnom
To start gobbling bits, we have to do the following:
* Add `nom` as a dependency
* Add the `nom` crate to `main.rs`
* Create a module directory for the assembler
* Add the assembler module to `main.rs`
* Create a `Token` enum
* Create rules for `nom`
Nom Dependency
In your Cargo.toml file, add nom as a dependency:
[dependencies]
nom = "^4.0"
Files
* Create a new directory under `src` called `assembler`, and add a `mod.rs` file to it
* In `main.rs`, add `pub mod assembler;` to the top
* In `main.rs`, add `#[macro_use]` to the top
* In `main.rs`, add `extern crate nom;` on the line below
* Create `opcode_parsers.rs` in `src/assembler`
Token Enum
nom needs something to emit, so we need to create a Token enum. In src/assembler/mod.rs, put:
use instruction::Opcode;

#[derive(Debug, PartialEq)]
pub enum Token {
    Op{code: Opcode},
}
Right now, we’re only going to teach our parser to recognize an Opcode. Whenever it finds one, it will create Token::Op{code: Opcode}.
Note | Yes, it’s a trifle weird to have Token::Op contain code which is an Opcode. Originally, the definition was Token::Opcode{code: Opcode} but that was confusing due to the duplication of the word Opcode and using it in two different spots. I hope this makes it clearer. |
Rules: The Basics
OK, time to write our first rule. I’ll try to explain enough about nom as we go, but for a deep-dive into it, peruse their GitHub. The basic idea is that we are going to use nom’s macros to write a bunch of rules that, when applied, can parse a file containing valid iridium assembler code. These parsers can build on each other; this permits creating complex rules composed of simpler rules. We’ll see how this is useful in a bit.
Rules: Opcode
Earlier in the post, we defined an Opcode as:
An `Opcode` is composed of:
* One or more `Letters` in a row
* A `Space`
Let’s take this one line at a time. In src/assembler/opcode_parsers.rs:
use nom::types::CompleteStr;
This imports the nom type CompleteStr. We’ll be passing complete strings to our rules, not streaming data.
named!(opcode_load<CompleteStr, Token>,
This uses the named! macro in the nom crate to define a function called opcode_load. It will take a CompleteStr and return a Token.
    do_parse!(
        tag!("load") >> (Token::Op{code: Opcode::LOAD})
    )
);
Token and Opcode also need to be in scope in this file, presumably via use assembler::Token; and use instruction::Opcode;, mirroring the imports used elsewhere in this post.
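The critical part here is tag!("load") >> (Token::Op{code: Opcode::LOAD}). It looks for the string "load" at the start of the input we give it, and if it finds it, it returns the Token::Op variant wrapping Opcode::LOAD. Note that tag! is case-sensitive, so the uppercase LOAD from the earlier example would not match this rule as written.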
Note | The >> (Token::Op{code: Opcode::LOAD}) part is from the do_parse! macro. It lets us chain parsers and pass the results to parsers further downstream. We’ll see how this works in more detail in a bit. |
Tests
Time to write a test for our parser! In opcode_parsers.rs, put:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_opcode_load() {
        // First tests that the opcode is detected and parsed correctly
        let result = opcode_load(CompleteStr("load"));
        assert_eq!(result.is_ok(), true);
        let (rest, token) = result.unwrap();
        assert_eq!(token, Token::Op{code: Opcode::LOAD});
        assert_eq!(rest, CompleteStr(""));

        // Tests that an invalid opcode isn't recognized
        let result = opcode_load(CompleteStr("aold"));
        assert_eq!(result.is_ok(), false);
    }
}
Yay, we have a function that can recognize one opcode!
Important | Holy shit, I just figured out how to do callouts in asciidoc! Muhahahahaha! |
Rules: Registers
Now on to registers. First, let’s create the Register variant of our Token enum in src/assembler/mod.rs:
#[derive(Debug, PartialEq)]
pub enum Token {
    Op{code: Opcode},
    Register{reg_num: u8}
}
Next, a parser for a register. We’ll put this in src/assembler/register_parsers.rs. In our assembly language, they take the form of $0; a dollar sign followed by a number >= 0. Our function for this looks like:
use nom::types::CompleteStr;
use nom::digit;
use assembler::Token;

named!(register <CompleteStr, Token>, <1>
    ws!( <2>
        do_parse!( <3>
            tag!("$") >> <4>
            reg_num: digit >> <5>
            ( <6>
                Token::Register{ <7>
                    reg_num: reg_num.parse::<u8>().unwrap() <8>
                } <9>
            ) <10>
        )
    )
);
<1> We create a function named `register` that accepts a `CompleteStr` and returns either the remaining `CompleteStr` along with a `Token`, or an `Error`
<2> We use the `ws!` macro, which tells it to consume any whitespace on either side of our register. This lets us write variants with extra spaces between tokens, such as `LOAD    $0`, in addition to `LOAD $0`
<3> We use the `do_parse!` macro to chain parsers
<4> We use `tag!` to look for `$`, and pass whatever input remains after it…
<5> …to the function `digit`, and save the result in a variable called `reg_num`. nom provides the function `digit`, which recognizes one or more 0-9 characters
<6> Create the `Token` enum with the appropriate info and return it
<7> Start creation of the `Token`. We want the `Register` variant
<8> Attempt to unwrap and store the result of parsing the digits into a `u8`
<9> Close the `Token` struct
<10> Close the tuple result the macro will return
Tests
And now a test for it…
#[test]
fn test_parse_register() {
    let result = register(CompleteStr("$0"));
    assert_eq!(result.is_ok(), true);
    let result = register(CompleteStr("0"));
    assert_eq!(result.is_ok(), false);
    let result = register(CompleteStr("$a"));
    assert_eq!(result.is_ok(), false);
}
You could add even more error cases, such as "$", depending on how test-happy you are.
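One more error case worth thinking about: the register parser above calls `unwrap()` on `parse::<u8>()`, so as far as I can tell a register like `$256` would panic rather than return an error, because 256 does not fit in a `u8`. The sketch below isolates just that conversion step in plain Rust. `register_number` is a hypothetical helper, not something the parser above uses.

```rust
/// Converts the digits of a register (the part after `$`) into a `u8`.
/// Returns `None` instead of panicking when the text is not a valid u8,
/// for example when the number is larger than 255.
fn register_number(digits: &str) -> Option<u8> {
    digits.parse::<u8>().ok()
}
```

Returning an `Option` here leaves the decision about what to do with an out-of-range register to the caller, instead of crashing the assembler.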
Rules: Integer Operands
And finally, integer operands! Create the IntegerOperand variant of our Token enum in src/assembler/mod.rs:
#[derive(Debug, PartialEq)]
pub enum Token {
    Op{code: Opcode},
    Register{reg_num: u8},
    IntegerOperand{value: i32},
}
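Note | Yes, the IntegerOperand variant can technically hold negative numbers, since we parse the digits into an i32. (The parser itself only accepts digits after the #, so a leading minus sign will not parse yet.) Our LOAD instruction can only load 16 bits, though. This is for future expansion. |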
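Since the token holds an `i32` but LOAD only has room for 16 bits, something eventually has to check the range. Here is a minimal sketch of one way to do that in plain Rust, assuming the operand is stored as an unsigned 16-bit value with the high byte first. `operand_to_bytes` is a hypothetical name and none of this is in the assembler yet.

```rust
use std::convert::TryFrom;

/// Converts an operand into the two bytes a LOAD instruction can carry.
/// Returns `None` for negative values and for values above 65535.
/// Assumption: the operand is unsigned and stored high byte first.
fn operand_to_bytes(value: i32) -> Option<[u8; 2]> {
    u16::try_from(value).ok().map(u16::to_be_bytes)
}
```

`u16::try_from` rejects anything outside 0 to 65535, so the check and the conversion happen in one step.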
Next, make the file src/assembler/operand_parsers.rs, in which we’ll put the last parser we have to write: one to recognize an IntegerOperand. We said those are composed of # followed by digits. In operand_parsers.rs, put:
use nom::types::CompleteStr;
use nom::digit;
use assembler::Token;

/// Parser for integer numbers, which we preface with `#` in our assembly language:
/// #100
named!(integer_operand<CompleteStr, Token>,
    ws!(
        do_parse!(
            tag!("#") >>
            reg_num: digit >>
            (
                // The variant is IntegerOperand, matching the Token enum and the test
                Token::IntegerOperand{value: reg_num.parse::<i32>().unwrap()}
            )
        )
    )
);
Tests
Guess what this is? A test!
#[test]
fn test_parse_integer_operand() {
    // Test a valid integer operand
    let result = integer_operand(CompleteStr("#10"));
    assert_eq!(result.is_ok(), true);
    let (rest, value) = result.unwrap();
    assert_eq!(rest, CompleteStr(""));
    assert_eq!(value, Token::IntegerOperand{value: 10});

    // Test an invalid one (missing the #)
    let result = integer_operand(CompleteStr("10"));
    assert_eq!(result.is_ok(), false);
}
Wrapping it up in mod.rs
Now in src/assembler/mod.rs, export the three modules we made by adding:
pub mod opcode_parsers;
pub mod operand_parsers;
pub mod register_parsers;
End
Phew, this was a longer post, so I’m going to stop here. Next, we’ll go over how to combine these parsers into ones that can parse entire instructions, and ultimately entire programs.