So You Want to Build a Language VM - Part 09 - Assembler 2: Cruise Control

Adds more functionality to our assembler

Megazord…​ACTIVATE!!!

We’ve written basic parsers. Now we can take a step up on the abstraction ladder and create a parser that combines some of our smaller parsers. Right now, we can recognize one opcode, registers and integer operands. We can group these into an AssemblerInstruction. In src/assembler/instruction_parsers.rs, put:

use nom::types::CompleteStr;

use assembler::Token;
use assembler::opcode_parsers::*;
use assembler::operand_parsers::integer_operand;
use assembler::register_parsers::register;

#[derive(Debug, PartialEq)]
pub struct AssemblerInstruction {
    opcode: Token,
    operand1: Option<Token>,
    operand2: Option<Token>,
    operand3: Option<Token>,
}

And now, the parser for the instruction itself…​

/// Handles instructions of the following form:
/// LOAD $0 #100
named!(pub instruction_one<CompleteStr, AssemblerInstruction>,
    do_parse!(
        o: opcode_load >>
        r: register >>
        i: integer_operand >>
        (
            AssemblerInstruction{
                opcode: o,
                operand1: Some(r),
                operand2: Some(i),
                operand3: None
            }
        )
    )
);

See how we are using the parsers we defined? The opcode, register, and integer operand parsers collectively make up one AssemblerInstruction. We leave the operand fields as Options to allow greater flexibility: an instruction form with fewer operands simply leaves the unused fields as None.
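If the nom macros hide too much of what is going on, here is a rough, dependency-free sketch of the same idea: small parsers that each recognize one piece, and a combined parser that runs them in sequence. This is not how nom works internally, and the function names (parse_opcode, parse_register, parse_integer, parse_instruction_one) and the cut-down Opcode and Token types are made up for illustration.

```rust
// Simplified stand-ins for the tutorial's types (assumptions, not the real ones).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Opcode {
    LOAD,
}

#[derive(Debug, PartialEq)]
enum Token {
    Op { code: Opcode },
    Register { reg_num: u8 },
    IntegerOperand { value: i32 },
}

#[derive(Debug, PartialEq)]
struct AssemblerInstruction {
    opcode: Token,
    operand1: Option<Token>,
    operand2: Option<Token>,
    operand3: Option<Token>,
}

// Each small parser recognizes one piece and returns None on a mismatch.
fn parse_opcode(word: &str) -> Option<Token> {
    if word == "load" {
        Some(Token::Op { code: Opcode::LOAD })
    } else {
        None
    }
}

fn parse_register(word: &str) -> Option<Token> {
    // Registers look like $0, $1, ...
    let reg_num = word.strip_prefix('$')?.parse::<u8>().ok()?;
    Some(Token::Register { reg_num })
}

fn parse_integer(word: &str) -> Option<Token> {
    // Integer operands look like #100
    let value = word.strip_prefix('#')?.parse::<i32>().ok()?;
    Some(Token::IntegerOperand { value })
}

// The combined parser runs the small ones in order and fails if any of them fails.
fn parse_instruction_one(line: &str) -> Option<AssemblerInstruction> {
    let mut words = line.split_whitespace();
    let opcode = parse_opcode(words.next()?)?;
    let register = parse_register(words.next()?)?;
    let integer = parse_integer(words.next()?)?;
    Some(AssemblerInstruction {
        opcode,
        operand1: Some(register),
        operand2: Some(integer),
        operand3: None, // this form has no third operand
    })
}
```

Calling parse_instruction_one("load $0 #100") fills operand1 and operand2 and leaves operand3 as None, while swapping the register and the integer makes the whole thing return None.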

Note

You may be wondering about the pub in front of the parser name, such as: pub instruction_one. This makes the function generated by the nom macro public so we can access it from other modules. Our Program parser will need to access the instruction_one parser from its module.

Tests

Now a test. Put this at the bottom of src/assembler/instruction_parsers.rs:

#[cfg(test)]
mod tests {
    use super::*;
    use instruction::Opcode;

    #[test]
    fn test_parse_instruction_form_one() {
        let result = instruction_one(CompleteStr("load $0 #100\n"));
        assert_eq!(
            result,
            Ok((
                CompleteStr(""),
                AssemblerInstruction {
                    opcode: Token::Op { code: Opcode::LOAD },
                    operand1: Some(Token::Register { reg_num: 0 }),
                    operand2: Some(Token::IntegerOperand { value: 100 }),
                    operand3: None
                }
            ))
        );
    }
}

Up to Program

And now our final parser, the Program parser. A Program consists of Instructions. Make src/assembler/program_parsers.rs (and remember to declare both new modules in src/assembler/mod.rs), and in it put:

use nom::types::CompleteStr;

use assembler::instruction_parsers::{AssemblerInstruction, instruction_one};

#[derive(Debug, PartialEq)]
pub struct Program {
    instructions: Vec<AssemblerInstruction>
}

named!(pub program<CompleteStr, Program>,
    do_parse!(
        instructions: many1!(instruction_one) >>
        (
            Program {
                instructions: instructions
            }
        )
    )
);

We now have a struct that contains a vector of assembler instructions. The next step is to give each AssemblerInstruction the ability to write itself out as a Vec<u8>. Then we just have to iterate through the instructions vec and we’re done!

But first…​

ANOTHER TEST!

#[test]
fn test_parse_program() {
    let result = program(CompleteStr("load $0 #100\n"));
    assert_eq!(result.is_ok(), true);
    let (leftover, p) = result.unwrap();
    assert_eq!(leftover, CompleteStr(""));
    assert_eq!(
        1,
        p.instructions.len()
    );
    // TODO: Figure out an ergonomic way to test the AssemblerInstruction returned
}

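The instructions field of p (which is a Program struct) is private. The test can still read it as long as it lives in a tests module inside program_parsers.rs, because child modules can see their parent’s private fields, but code in other modules cannot. I’m not sure if it is better to make the field public, make an accessor function, or what. Let’s revisit this later.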
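For what it’s worth, the accessor option could look something like the sketch below. The instructions() method is hypothetical (it is not in the tutorial’s code), and AssemblerInstruction is reduced to an empty stand-in so the example compiles on its own.

```rust
// Empty stand-in for the tutorial's AssemblerInstruction.
#[derive(Debug, PartialEq)]
pub struct AssemblerInstruction;

#[derive(Debug, PartialEq)]
pub struct Program {
    // Stays private, so other modules cannot modify it directly.
    instructions: Vec<AssemblerInstruction>,
}

impl Program {
    /// Hypothetical accessor: a read-only view of the parsed instructions.
    pub fn instructions(&self) -> &[AssemblerInstruction] {
        &self.instructions
    }
}
```

Returning a slice rather than the Vec itself keeps callers from pushing or removing instructions, which is one argument for an accessor over simply making the field pub.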

Getting at the Bits

We need each AssemblerInstruction to have a function we can call to get a Vec<u8>. Let’s head over to instruction_parsers.rs and add one.

impl AssemblerInstruction {
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut results = vec![];
        // The first byte of every instruction is its opcode.
        match self.opcode {
            Token::Op { code } => results.push(code as u8),
            _ => {
                println!("Non-opcode found in opcode field");
                std::process::exit(1);
            }
        };

        // Append the bytes of every operand that is present.
        for operand in vec![&self.operand1, &self.operand2, &self.operand3] {
            match operand {
                Some(t) => AssemblerInstruction::extract_operand(t, &mut results),
                None => {}
            }
        }

        results
    }

    // extract_operand also goes inside this impl block
}

This is where having Opcode as a plain enum over in src/instruction.rs pays off. Our impl From<u8> for Opcode handles decoding a byte into an opcode; going the other way needs no extra code. If you derive Copy and Clone on the Opcode enum, then we can convert any opcode into its integer with code as u8. All this function does is write the opcode byte to a vector, then use a helper function to extract the operands for any of the operand fields that are not None.
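Here is a small standalone sketch of the two directions. The Opcode enum below is cut down and its variant order is an assumption; the real one in src/instruction.rs has its own set of variants. The encode function is a made-up name for the code as u8 cast.

```rust
// Cut-down Opcode with an assumed variant order (illustration only).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Opcode {
    LOAD, // discriminant 0
    ADD,  // discriminant 1
    HLT,  // discriminant 2
    IGL,  // discriminant 3, stands for "illegal"
}

// Decoding direction: byte -> Opcode, which is what the VM needs.
impl From<u8> for Opcode {
    fn from(v: u8) -> Self {
        match v {
            0 => Opcode::LOAD,
            1 => Opcode::ADD,
            2 => Opcode::HLT,
            _ => Opcode::IGL,
        }
    }
}

// Encoding direction: Opcode -> byte, which is what the assembler needs.
// The cast yields the enum's discriminant.
fn encode(code: Opcode) -> u8 {
    code as u8
}
```

The thing to keep an eye on is that the discriminants produced by the cast line up with the numbers the From<u8> impl matches on; otherwise the VM would decode a different opcode than the assembler wrote.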

That helper function also goes in impl AssemblerInstruction and looks like:

fn extract_operand(t: &Token, results: &mut Vec<u8>) {
    match t {
        Token::Register { reg_num } => {
            results.push(*reg_num);
        }
        Token::IntegerOperand { value } => {
            let converted = *value as u16;
            let byte1 = converted;
            let byte2 = converted >> 8;
            results.push(byte2 as u8);
            results.push(byte1 as u8);
        }
        _ => {
            println!("Opcode found in operand field");
            std::process::exit(1);
        }
    };
}

I thought for sure borrowck was going to chastise me, but passing the results vector around as a mutable reference worked like I thought it would.

extract_operand checks the operand type, converts the operand to bytes, and then stuffs them into the results vector.

Note

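You may wonder why we order them this way:

results.push(byte2 as u8);
results.push(byte1 as u8);

and not:

results.push(byte1 as u8);
results.push(byte2 as u8);

byte2 holds the high 8 bits of the operand and byte1 holds the low 8 bits. Pushing byte2 first stores the most significant byte first, which is big-endian order, and that is the order our VM expects when it reads a 16-bit operand.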
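To make the byte order concrete, here is a standalone sketch that splits a 16-bit value the same way extract_operand does and then puts it back together. The function names to_big_endian and from_big_endian are made up for this example.

```rust
/// Split a 16-bit value into [high byte, low byte].
fn to_big_endian(value: u16) -> [u8; 2] {
    let low = value as u8; // the cast keeps only the low 8 bits
    let high = (value >> 8) as u8; // shift the high 8 bits down first
    [high, low]
}

/// Rebuild the value the way a big-endian reader would.
fn from_big_endian(bytes: [u8; 2]) -> u16 {
    ((bytes[0] as u16) << 8) | bytes[1] as u16
}
```

For example, 500 is 0x01F4, so it becomes the bytes [1, 244]. Reading those same two bytes in the opposite order would give 62465 instead of 500. The standard library’s u16::to_be_bytes produces the same two bytes, if you would rather not do the shifting by hand.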

Back to the Program

Let’s go back to program_parsers.rs and add a function to convert the entire vector of AssemblerInstruction to bytes:

impl Program {
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut program = vec![];
        for instruction in &self.instructions {
            program.append(&mut instruction.to_bytes());
        }
        program
    }
}
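If you prefer iterators over an explicit loop, the same concatenation can be written with flat_map. This is just a stylistic alternative, sketched with stand-in types: the Instruction struct here simply carries its already-encoded bytes, which is not how the real AssemblerInstruction works.

```rust
// Stand-in instruction that already knows its encoded bytes.
struct Instruction {
    bytes: Vec<u8>,
}

impl Instruction {
    fn to_bytes(&self) -> Vec<u8> {
        self.bytes.clone()
    }
}

struct Program {
    instructions: Vec<Instruction>,
}

impl Program {
    // Encode each instruction and flatten the results into one Vec<u8>.
    fn to_bytes(&self) -> Vec<u8> {
        self.instructions
            .iter()
            .flat_map(|instruction| instruction.to_bytes())
            .collect()
    }
}
```

Both versions produce the instructions’ bytes back to back, in program order.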

And a test…​

#[test]
fn test_program_to_bytes() {
    let result = program(CompleteStr("load $0 #100\n"));
    assert_eq!(result.is_ok(), true);
    let (_, program) = result.unwrap();
    let bytecode = program.to_bytes();
    assert_eq!(bytecode.len(), 4);
    println!("{:?}", bytecode);
}

Modifying the REPL

Almost done! Right now, our REPL still speaks hex. Head over to src/repl/mod.rs, import the program parser and CompleteStr at the top of the file, and in the catch-all match arm of the function run, put:

_ => {
    let parsed_program = program(CompleteStr(buffer));
    if parsed_program.is_err() {
        println!("Unable to parse input");
        continue;
    }
    let (_, result) = parsed_program.unwrap();
    let bytecode = result.to_bytes();
    // TODO: Make a function to let us add bytes to the VM
    for byte in bytecode {
        self.vm.add_byte(byte);
    }
    self.vm.run_once();
}

Now, if you do cargo run and type in load $0 #100, you should see:

Welcome to Iridium! Lets be productive!
>>> load $0 #100
>>> .registers
Listing registers and all contents:
[
    100,
    0,
    <snip>
]
End of Register Listing
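About that TODO in the REPL arm: a function that adds a whole slice of bytes to the VM could be as small as the sketch below. The VM here is a cut-down stand-in with only a program buffer, and add_bytes is a hypothetical name, so treat this as one possible shape rather than the final API.

```rust
// Cut-down stand-in for the VM: only the program buffer is modeled.
pub struct VM {
    pub program: Vec<u8>,
}

impl VM {
    pub fn new() -> VM {
        VM { program: vec![] }
    }

    /// Append a single byte, like the add_byte the REPL already calls.
    pub fn add_byte(&mut self, byte: u8) {
        self.program.push(byte);
    }

    /// Hypothetical helper: append a whole slice of bytecode at once.
    pub fn add_bytes(&mut self, bytes: &[u8]) {
        self.program.extend_from_slice(bytes);
    }
}
```

With something like this in place, the for loop in the REPL would collapse into a single call that passes &bytecode.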

A Wild Bug Appears!

Try entering LOAD $0 #100. You should get:

>>> LOAD $0 #100
Unable to parse input
>>>

Our assembler is case-sensitive! I’m going to leave it as an exercise for the reader to figure out how to fix it. If you get stuck, you can check out the code in GitLab.
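If you want a nudge, here are two possible directions, sketched with the standard library only. Both function names are made up. The first lowercases the whole line before it reaches the parser, which is only safe while our language has nothing case-sensitive in it. The second compares a single word against an opcode name without caring about case.

```rust
/// Option A: lowercase the whole input before handing it to the parser.
/// Assumes nothing in the input is case-sensitive.
fn normalize_case(input: &str) -> String {
    input.to_ascii_lowercase()
}

/// Option B: compare one word against an opcode name, ignoring ASCII case.
fn matches_opcode(word: &str, opcode: &str) -> bool {
    word.eq_ignore_ascii_case(opcode)
}
```

nom also has a tag_no_case! macro that may be the more direct fix inside the opcode parser itself; check the documentation for the nom version you are using.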

Hex code

At this point, we could delete the parse_hex function, or we can leave it in case someone’s idea of a good time on a Friday night is to code in hex. Some options on what to do with it are:

  1. The REPL could try both and go with whichever parser doesn’t return an Error

  2. The REPL could look for input prefaced with 0x and use parse_hex for that input

  3. We could add a command to our REPL to let it switch input modes. In one, it accepts hex. In the other, assembly code.
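As a rough illustration of option 2, the REPL could classify each line before deciding which parser to call. Everything in this sketch is hypothetical: the Input enum, the classify function, and parse_hex_bytes, which only stands in for the REPL’s parse_hex and assumes space-separated hex bytes.

```rust
#[derive(Debug, PartialEq)]
enum Input<'a> {
    Hex(&'a str),
    Assembly(&'a str),
}

/// Decide which parser a line should go to, based on a leading 0x.
fn classify(line: &str) -> Input<'_> {
    let trimmed = line.trim();
    match trimmed.strip_prefix("0x") {
        Some(rest) => Input::Hex(rest.trim_start()),
        None => Input::Assembly(trimmed),
    }
}

/// Stand-in for parse_hex: turns "00 01 03 E8" into bytes.
fn parse_hex_bytes(text: &str) -> Result<Vec<u8>, std::num::ParseIntError> {
    text.split_whitespace()
        .map(|byte| u8::from_str_radix(byte, 16))
        .collect()
}
```

A line such as 0x 00 01 03 E8 would be routed to the hex parser, and anything else would go to the assembler.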

End

Yay, we now have a basic, but functional, assembler. Next, we’ll teach our assembler how to recognize more opcodes and instruction forms, and how to provide helpful hints to the user when they type something incorrectly. See you then!

