So You Want to Build a Language VM - Part 09 - Assembler 2: Cruise Control
Adds more functionality to our assembler
Megazord…ACTIVATE!!!
We’ve written basic parsers. Now we can take a step up the abstraction ladder and create a parser that combines some of our smaller parsers. Right now, we can recognize one opcode, registers, and integer operands. We can group these into an AssemblerInstruction.
In src/assembler/instruction_parsers.rs, put:
// CompleteStr is the input type used by the parsers and tests below
use nom::types::CompleteStr;
use assembler::Token;
use assembler::opcode_parsers::*;
use assembler::operand_parsers::integer_operand;
use assembler::register_parsers::register;
#[derive(Debug, PartialEq)]
pub struct AssemblerInstruction {
    opcode: Token,
    operand1: Option<Token>,
    operand2: Option<Token>,
    operand3: Option<Token>,
}
And now, the parser for the instruction itself…
/// Handles instructions of the following form:
/// LOAD $0 #100
named!(pub instruction_one<CompleteStr, AssemblerInstruction>,
    do_parse!(
        o: opcode_load >>
        r: register >>
        i: integer_operand >>
        (
            AssemblerInstruction{
                opcode: o,
                operand1: Some(r),
                operand2: Some(i),
                operand3: None
            }
        )
    )
);
See how we are using the parsers we defined? opcode, register and integer. Collectively, they make up one AssemblerInstruction. We leave the operand fields as Option, to allow greater flexibility.
Note | You may be wondering about the |
Tests
Now a test…put this at the bottom of src/assembler/instruction_parsers.rs:
#[cfg(test)]
mod tests {
    use super::*;
    use instruction::Opcode;
    #[test]
    fn test_parse_instruction_form_one() {
        let result = instruction_one(CompleteStr("load $0 #100\n"));
        assert_eq!(
            result,
            Ok((
                CompleteStr(""),
                AssemblerInstruction {
                    opcode: Token::Op { code: Opcode::LOAD },
                    operand1: Some(Token::Register { reg_num: 0 }),
                    operand2: Some(Token::IntegerOperand { value: 100 }),
                    operand3: None
                }
            ))
        );
    }
}
Up to Program
And now our final parser, the Program parser. A Program consists of Instructions. Make src/assembler/program_parsers.rs, and in it put:
use nom::types::CompleteStr;
use assembler::instruction_parsers::{AssemblerInstruction, instruction_one};
#[derive(Debug, PartialEq)]
pub struct Program {
    instructions: Vec<AssemblerInstruction>
}
named!(pub program<CompleteStr, Program>,
    do_parse!(
        instructions: many1!(instruction_one) >>
        (
            Program {
                instructions: instructions
            }
        )
    )
);
We now have a struct that contains a vector of assembler instructions. The next step is to give each AssemblerInstruction the ability to write itself out as a Vec<u8>. Then we just have to iterate through the instructions vec and we're done!
But first…
ANOTHER TEST!
#[test]
fn test_parse_program() {
    let result = program(CompleteStr("load $0 #100\n"));
    assert_eq!(result.is_ok(), true);
    let (leftover, p) = result.unwrap();
    assert_eq!(leftover, CompleteStr(""));
    assert_eq!(
        1,
        p.instructions.len()
    );
    // TODO: Figure out an ergonomic way to test the AssemblerInstruction returned
}
The instructions field of p (which is a Program struct) is private. I’m not sure if it is better to make it public, make an accessor function, or what. Let’s revisit this later.
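One possible answer to that question is a read-only accessor that hands out a slice, so the field stays private while callers can still inspect it. This is only a sketch: the AssemblerInstruction here is a stand-in with a single hypothetical `name` field, and the `instructions()` method name is an assumption, not something the project defines.

```rust
// Stand-in for the real AssemblerInstruction; the `name` field is hypothetical.
#[derive(Debug, PartialEq)]
pub struct AssemblerInstruction {
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub struct Program {
    // Still private: only code in this module can build or mutate it.
    instructions: Vec<AssemblerInstruction>,
}

impl Program {
    /// Read-only view of the parsed instructions.
    pub fn instructions(&self) -> &[AssemblerInstruction] {
        &self.instructions
    }
}
```

With something like this, code outside the module could call `p.instructions().len()` or index into the slice without being able to change the vector.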
Getting at the Bits
We need each AssemblerInstruction to have a function we can call to get a Vec<u8>. Let’s head over to instruction_parsers.rs and add one.
impl AssemblerInstruction {
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut results = vec![];
        // The first byte of every instruction is the opcode
        match self.opcode {
            Token::Op { code } => match code {
                _ => {
                    results.push(code as u8);
                }
            },
            _ => {
                println!("Non-opcode found in opcode field");
                std::process::exit(1);
            }
        };
        // Append the bytes of every operand that is present
        for operand in vec![&self.operand1, &self.operand2, &self.operand3] {
            match operand {
                Some(t) => AssemblerInstruction::extract_operand(t, &mut results),
                None => {}
            }
        }
        return results;
    }
}
This is where implementing impl From<u8> for Opcode over in src/instruction.rs pays off. If you derive Copy and Clone on the Opcode enum, then we can convert any opcode into its integer with code as u8. All this function does is write the opcode byte to a vector, then use a helper function to extract the operands for any of the operand fields that are not None.
That helper function also goes in impl AssemblerInstruction and looks like:
fn extract_operand(t: &Token, results: &mut Vec<u8>) {
    match t {
        Token::Register { reg_num } => {
            results.push(*reg_num);
        }
        Token::IntegerOperand { value } => {
            let converted = *value as u16;
            let byte1 = converted;
            let byte2 = converted >> 8;
            results.push(byte2 as u8);
            results.push(byte1 as u8);
        }
        _ => {
            println!("Opcode found in operand field");
            std::process::exit(1);
        }
    };
}
I thought for sure borrowck was going to chastise me, but passing the results vector around worked like I thought it would.
What extract_operand does is check the operand type, convert it to bytes, and then stuff them in the results vector.
Note | You may wonder why we push byte2 (the upper 8 bits) before byte1 (the lower 8 bits). This is because they need to be in the proper order according to our big endian/little endian rule. |
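To make the byte order concrete, here is a small standalone sketch of the same split outside the assembler. The function names are made up for illustration, and the `i32` input type is an assumption about what the integer operand holds. The standard library's `u16::to_be_bytes` should produce the same pair as the manual shift.

```rust
/// Splits a value into two bytes, most significant byte first,
/// using the same truncating cast and shift as extract_operand.
pub fn split_big_endian(value: i32) -> [u8; 2] {
    let converted = value as u16;
    let low = converted as u8; // lower 8 bits
    let high = (converted >> 8) as u8; // upper 8 bits
    [high, low]
}

/// The same split, done by the standard library.
pub fn split_with_std(value: i32) -> [u8; 2] {
    (value as u16).to_be_bytes()
}
```

For example, 100 fits in one byte, so it becomes [0, 100], while 500 (0x01F4) becomes [1, 244].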
Back to the Program
Let’s go back to program_parsers.rs and add a function to convert the entire vector of AssemblerInstruction to bytes:
impl Program {
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut program = vec![];
        for instruction in &self.instructions {
            program.append(&mut instruction.to_bytes());
        }
        program
    }
}
And a test…
#[test]
fn test_program_to_bytes() {
    let result = program(CompleteStr("load $0 #100\n"));
    assert_eq!(result.is_ok(), true);
    let (_, program) = result.unwrap();
    let bytecode = program.to_bytes();
    assert_eq!(bytecode.len(), 4);
    println!("{:?}", bytecode);
}
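The test above only checks the length. To see which four bytes one might expect, here is a standalone sketch that mirrors the encoding with simplified stand-in types. It does not use the real parsers, and the opcode numbering (LOAD = 0, HLT = 1) is an assumption for this example that may not match your Opcode enum.

```rust
// Simplified stand-ins for the tutorial's types.
// The discriminant values are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Opcode {
    LOAD = 0,
    HLT = 1,
}

#[derive(Debug, PartialEq)]
pub enum Token {
    Op { code: Opcode },
    Register { reg_num: u8 },
    IntegerOperand { value: i32 },
}

/// Encodes tokens in order: one byte for an opcode or a register,
/// two big-endian bytes for an integer operand.
pub fn encode(tokens: &[Token]) -> Vec<u8> {
    let mut results = Vec::new();
    for token in tokens {
        match token {
            Token::Op { code } => results.push(*code as u8),
            Token::Register { reg_num } => results.push(*reg_num),
            Token::IntegerOperand { value } => {
                results.extend_from_slice(&(*value as u16).to_be_bytes());
            }
        }
    }
    results
}

/// The tokens that `load $0 #100` would be parsed into.
pub fn load_example() -> Vec<Token> {
    vec![
        Token::Op { code: Opcode::LOAD },
        Token::Register { reg_num: 0 },
        Token::IntegerOperand { value: 100 },
    ]
}
```

Under those assumptions, `encode(&load_example())` yields one opcode byte, one register byte, and two operand bytes, which is where the length of 4 in the test comes from.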
Modifying the REPL
Almost done! Right now, our REPL still speaks hex. Head over to src/repl/mod.rs and in the catch-all match arm of the function run, put:
_ => {
    let parsed_program = program(CompleteStr(buffer));
    if parsed_program.is_err() {
        println!("Unable to parse input");
        continue;
    }
    let (_, result) = parsed_program.unwrap();
    let bytecode = result.to_bytes();
    // TODO: Make a function to let us add bytes to the VM
    for byte in bytecode {
        self.vm.add_byte(byte);
    }
    self.vm.run_once();
}
Now cargo run and type in load $0 #100:
Welcome to Iridium! Lets be productive!
>>> load $0 #100
>>> .registers
Listing registers and all contents:
[
    100,
    0,
    <snip>
]
End of Register Listing
A Wild Bug Appears!
Try entering LOAD $0 #100. You should get:
>>> LOAD $0 #100
Unable to parse input
>>>
Our assembler is case-sensitive! I’m going to leave it as an exercise for the reader to figure out how to fix it. If you get stuck, you can check out the code in GitLab.
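If you want a hint before peeking at the repository, one possible approach (not necessarily the one used in the project's code) is to compare opcode names without regard to ASCII case. The sketch below uses a cut-down Opcode enum and a hypothetical `opcode_from_name` helper; neither is taken from the real parser.

```rust
// Cut-down enum for illustration; IGL stands in for "unrecognized opcode".
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Opcode {
    LOAD,
    IGL,
}

/// Maps an opcode name to an Opcode, ignoring ASCII case.
pub fn opcode_from_name(name: &str) -> Opcode {
    if name.eq_ignore_ascii_case("load") {
        Opcode::LOAD
    } else {
        Opcode::IGL
    }
}
```

Lowercasing the whole line before handing it to the parser would likely work too, at the cost of lowercasing everything else on the line as well.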
Hex code
At this point, we could delete the parse_hex function, or we can leave it in case someone’s idea of a good time on a Friday night is to code in hex. Some options on what to do with it are:
The REPL could try both and go with whichever parser doesn’t return an Error
The REPL could look for input prefaced with 0x and use parse_hex for that input
We could add a command to our REPL to let it switch input modes. In one, it accepts hex. In the other, assembly code.
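As a rough illustration of the second option, the REPL could inspect the input for a 0x prefix before choosing a parser. The sketch below only does the routing: the `InputKind` enum and `classify` function are hypothetical names, and the actual calls to parse_hex and the program parser are left out.

```rust
#[derive(Debug, PartialEq)]
pub enum InputKind<'a> {
    /// Input that started with 0x; holds the text after the prefix.
    Hex(&'a str),
    /// Anything else is treated as assembly source.
    Assembly(&'a str),
}

/// Decides which parser a line of REPL input should be sent to.
pub fn classify(input: &str) -> InputKind<'_> {
    let trimmed = input.trim();
    match trimmed.strip_prefix("0x") {
        Some(rest) => InputKind::Hex(rest.trim_start()),
        None => InputKind::Assembly(trimmed),
    }
}
```

The REPL's catch-all arm could then match on the result and hand the `Hex` payload to parse_hex and the `Assembly` payload to the program parser.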
End
Yay, we now have a basic, but functional, assembler. Next, we’ll teach our assembler how to recognize more opcodes and instruction forms, and how to provide helpful hints to the user when they type something incorrectly. See you then!