So You Want to Build a Language VM - Part 13 - Labels

Adds in labels to our VM

August 26, 2018

Intro

Hey everyone! I know I promised labels in the previous part, but as I was writing it, I realized there are some prerequisites we can knock out first. Completing them will make implementing Labels and Directives much easier.

New Tokens

OK, first up! To handle Labels and Directives, we need to add three new Token types over in src/assembler/mod.rs:

#[derive(Debug, PartialEq)]
pub enum Token {
    Op { code: Opcode },
    Register { reg_num: u8 },
    IntegerOperand { value: i32 },
    LabelDeclaration { name: String },
    LabelUsage { name: String },
    Directive { name: String }
}

Don’t worry about what those are, yet.

Tweak AssemblerInstruction

This is an easy one! Head over to instruction_parsers.rs and let’s add the new fields to our AssemblerInstruction:

#[derive(Debug, PartialEq)]
pub struct AssemblerInstruction {
    pub opcode: Option<Token>,
    pub label: Option<Token>,
    pub directive: Option<Token>,
    pub operand1: Option<Token>,
    pub operand2: Option<Token>,
    pub operand3: Option<Token>,
}

First thing to note is that we’ve made everything optional. This is because we now can start an instruction with a Directive OR an Opcode. The second thing is the addition of the label and directive fields. We’ll need these later on.

Instruction Forms Everywhere!

Right now, we are writing one parser per instruction form. One for <opcode>, one for <opcode> <operand> and so on. We should probably optimize it a bit, so let’s combine all those parsers into one.

The first step is to include registers as operands. Open up src/assembler/operand_parsers.rs and add the following:

use assembler::register_parsers::register;

and then to the operand parser, add it like so:

named!(pub operand<CompleteStr, Token>,
    alt!(
        integer_operand |
        register
    )
);

Next, head over to instruction_parsers.rs and add this parser:

named!(instruction_combined<CompleteStr, AssemblerInstruction>,
    do_parse!(
        l: opt!(label_declaration) >>
        o: opcode >>
        o1: opt!(operand) >>
        o2: opt!(operand) >>
        o3: opt!(operand) >>
        (
            AssemblerInstruction{
                opcode: Some(o),
                label: l,
                directive: None,
                operand1: o1,
                operand2: o2,
                operand3: o3,
            }
        )
    )
);

And now, open directive_parsers.rs and replace all the macros in there with these:

named!(directive_declaration<CompleteStr, Token>,
    do_parse!(
        tag!(".") >>
        name: alpha1 >>
        (
            Token::Directive{name: name.to_string()}
        )
    )
);

named!(directive_combined<CompleteStr, AssemblerInstruction>,
    ws!(
        do_parse!(
            // directive_declaration already consumes the leading ".",
            // so we do not match a second one here
            name: directive_declaration >>
            o1: opt!(operand) >>
            o2: opt!(operand) >>
            o3: opt!(operand) >>
            (
                AssemblerInstruction{
                    opcode: None,
                    directive: Some(name),
                    label: None,
                    operand1: o1,
                    operand2: o2,
                    operand3: o3,
                }
            )
        )
    )
);

/// Will try to parse out any of the Directive forms
named!(pub directive<CompleteStr, AssemblerInstruction>,
    do_parse!(
        ins: alt!(
            directive_combined
        ) >>
        (
            ins
        )
    )
);

This mirrors the structure of the instruction parsers. Next, change the instruction parser to look like this:

/// Will try to parse out any of the Instruction forms
named!(pub instruction<CompleteStr, AssemblerInstruction>,
    do_parse!(
        ins: alt!(
            instruction_combined |
            directive
        ) >>
        (
            ins
        )
    )
);

Summary

The end result of all these changes is that we can now accept more forms, such as:

<directive>
<opcode>
<directive> <operand>
<directive> <operand> <operand>
<directive> <operand> <operand> <operand>
<opcode> <operand>
<opcode> <operand> <operand>
<opcode> <operand> <operand> <operand>

And we reduced the number of parsers needed. Make sure cargo test still passes, and then on to…

Labels

I should probably explain what a label even is. =)

What is a Label?

In assembly, labels give a logical name to a specific instruction that you can later reference. For example:

test1: LOAD $0 #100

You can then use @test1 as an operand for certain instructions, such as jump targets:

DJMP @test1

This will require a new file. Make src/assembler/label_parsers.rs, and in it put these two parsers:

use nom::types::CompleteStr;
use nom::{alphanumeric, multispace};

use assembler::Token;

/// Looks for a user-defined label, such as `label1:`
named!(pub label_declaration<CompleteStr, Token>,
    ws!(
        do_parse!(
            name: alphanumeric >>
            tag!(":") >>
            opt!(multispace) >>
            (
                Token::LabelDeclaration{name: name.to_string()}
            )
        )
    )
);

/// Looks for the usage of a user-defined label, such as `@label1`
named!(pub label_usage<CompleteStr, Token>,
    ws!(
        do_parse!(
            tag!("@") >>
            name: alphanumeric >>
            opt!(multispace) >>
            (
                Token::LabelUsage{name: name.to_string()}
            )
        )
    )
);

These will let us spot the declaration (some_label:) of a label and its usage (@some_label).
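
If the macros feel a bit opaque, here is a rough standard-library-only sketch of the same two checks. It is not the nom code above: `LabelToken`, `is_label_name`, and `classify_label` are names made up for this illustration, and it assumes a label name is made of ASCII letters and digits only.

```rust
/// Hypothetical stand-in for the two label variants of the article's `Token` enum.
#[derive(Debug, PartialEq)]
pub enum LabelToken {
    Declaration { name: String },
    Usage { name: String },
}

/// Returns true for a non-empty name made only of ASCII letters and digits.
/// (Assumption for this sketch: no underscores or other punctuation.)
fn is_label_name(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric())
}

/// Classifies one whitespace-delimited word as a declaration (`name:`),
/// a usage (`@name`), or neither.
pub fn classify_label(word: &str) -> Option<LabelToken> {
    let word = word.trim();
    if let Some(name) = word.strip_suffix(':') {
        if is_label_name(name) {
            return Some(LabelToken::Declaration { name: name.to_string() });
        }
    }
    if let Some(name) = word.strip_prefix('@') {
        if is_label_name(name) {
            return Some(LabelToken::Usage { name: name.to_string() });
        }
    }
    None
}
```

Under that assumption, `test:` is a declaration, `@test` is a usage, and a bare `test` is neither, which matches what the tests for the real parsers check.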

Tests

#[test]
fn test_parse_label_declaration() {
    let result = label_declaration(CompleteStr("test:"));
    assert_eq!(result.is_ok(), true);
    let (_, token) = result.unwrap();
    assert_eq!(token, Token::LabelDeclaration { name: "test".to_string() });
    let result = label_declaration(CompleteStr("test"));
    assert_eq!(result.is_ok(), false);
}

#[test]
fn test_parse_label_usage() {
    let result = label_usage(CompleteStr("@test"));
    assert_eq!(result.is_ok(), true);
    let (_, token) = result.unwrap();
    assert_eq!(token, Token::LabelUsage { name: "test".to_string() });
    let result = label_usage(CompleteStr("test"));
    assert_eq!(result.is_ok(), false);
}

A Slight Digression

Before we move on, we need to talk a bit more about how assemblers work.

Passes

Assemblers operate in one or more "passes". That is, they read through the written code, do something, and repeat until done. Our assembler is going to be a two-pass assembler. There are no hard and fast rules on what must be done in which pass. A pass might identify all variables, or perform an optimization.

Why do a two-pass assembler? Well, it has to do with the forward reference problem.

Using Labels

Even though we can now parse labels, we can’t actually use them for anything. They don’t get turned into bytes and written out to our bytecode. They are for use during the assembly phase.

Consider what would happen if we tried to run this code:

JMP @target
target: HLT

We are trying to use the label as an operand before we create it. This is sometimes called the forward reference problem, and we’ll solve it by doing two passes.
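
To make the idea concrete, here is a minimal sketch of two passes over plain text. It is not the assembler we will build: the function names are hypothetical, and it assumes every non-empty line holds exactly one instruction, that each instruction is 4 bytes wide, and that a resolved label can simply be written as a `#offset` operand.

```rust
use std::collections::HashMap;

/// Assumed fixed instruction width in bytes (an assumption for this sketch).
const INSTRUCTION_WIDTH: u32 = 4;

/// Pass one: record the byte offset of every line that starts with `name:`.
pub fn collect_labels(source: &str) -> HashMap<String, u32> {
    let mut labels = HashMap::new();
    let mut offset = 0;
    for line in source.lines().map(str::trim).filter(|l| !l.is_empty()) {
        if let Some(first) = line.split_whitespace().next() {
            if let Some(name) = first.strip_suffix(':') {
                labels.insert(name.to_string(), offset);
            }
        }
        // Every non-empty line is assumed to be one instruction
        offset += INSTRUCTION_WIDTH;
    }
    labels
}

/// Pass two: replace each `@name` operand with the offset found in pass one.
/// Returns an error naming the first label that was never declared.
pub fn resolve_labels(
    source: &str,
    labels: &HashMap<String, u32>,
) -> Result<Vec<String>, String> {
    let mut out = Vec::new();
    for line in source.lines().map(str::trim).filter(|l| !l.is_empty()) {
        let mut words = Vec::new();
        for word in line.split_whitespace() {
            if word.ends_with(':') {
                continue; // declarations produce no output of their own
            }
            match word.strip_prefix('@') {
                Some(name) => {
                    let offset = labels
                        .get(name)
                        .ok_or_else(|| format!("unknown label: {}", name))?;
                    words.push(format!("#{}", offset));
                }
                None => words.push(word.to_string()),
            }
        }
        out.push(words.join(" "));
    }
    Ok(out)
}
```

Because the first pass has already seen `target:` by the time the second pass reaches `JMP @target`, the forward reference is no longer a problem.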

Storing Symbols

Another issue is where we store the values of each symbol (of which a label is one type).

The answer to this one is something called a Symbol Table. This is a data structure that the assembler maintains while going through the code. It stores metadata about the code, such as the byte offset each symbol refers to.

Our Symbol Table will look like this:

Symbol Name	Symbol Type	Byte Offset
some_label	Label	12
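
As a rough preview, a table with those three columns could be modeled like the sketch below. The type and method names are placeholders chosen for this illustration, and a `Vec` with a linear search is just the simplest thing that works, not a design we are committing to.

```rust
/// Kinds of symbol. Only labels have been introduced so far.
#[derive(Debug, Clone, PartialEq)]
pub enum SymbolType {
    Label,
}

/// One row of the table: name, type, and byte offset.
#[derive(Debug, Clone, PartialEq)]
pub struct Symbol {
    pub name: String,
    pub symbol_type: SymbolType,
    pub offset: u32,
}

/// Hypothetical symbol table backed by a plain Vec.
#[derive(Debug, Default)]
pub struct SymbolTable {
    symbols: Vec<Symbol>,
}

impl SymbolTable {
    pub fn new() -> Self {
        SymbolTable { symbols: Vec::new() }
    }

    /// Adds a row to the table.
    pub fn add_symbol(&mut self, symbol: Symbol) {
        self.symbols.push(symbol);
    }

    /// Looks up the byte offset recorded for `name`, if any.
    pub fn symbol_value(&self, name: &str) -> Option<u32> {
        self.symbols.iter().find(|s| s.name == name).map(|s| s.offset)
    }
}
```

Adding the row from the table above and then calling `symbol_value("some_label")` gives back `Some(12)`; asking for a name that was never added gives `None`.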

But first…

Before we start on passes, labels, and symbols, let’s add two new abilities to our REPL:

A command to clear the program vector
The ability to read from a file

The first one I will leave up to you. The second one requires adding a new match arm for .load_file:

".load_file" => {
    print!("Please enter the path to the file you wish to load: ");
    io::stdout().flush().expect("Unable to flush stdout");
    let mut tmp = String::new();
    stdin.read_line(&mut tmp).expect("Unable to read line from user");
    // Strip the trailing newline before treating the input as a path
    let filename = Path::new(tmp.trim());
    let mut f = File::open(filename).expect("File not found");
    let mut contents = String::new();
    f.read_to_string(&mut contents).expect("There was an error reading from the file");
    let program = match program(CompleteStr(&contents)) {
        // Rust's pattern matching is pretty powerful and can even be nested
        Ok((_remainder, program)) => program,
        Err(e) => {
            println!("Unable to parse input: {:?}", e);
            continue;
        }
    };
    // Vec::append takes a &mut Vec and moves its elements over
    self.vm.program.append(&mut program.to_bytes());
}

This match arm is similar to the one that handles typed-in instructions, except it reads the code from a file before handing it over to the parser. Later, the match on program, which both .load_file and the regular input-handling arm need, can go into a common function.
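
One thing to notice is that the `expect` calls will panic the whole REPL on a bad path. If you would rather report the error and keep going, the file reading could be pulled into a small helper that returns a `Result`. This is only a sketch, and `read_source_file` is a name invented here, not something from the project:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Reads a whole source file into a String.
/// The raw input is trimmed first, since a line read from stdin
/// still carries its trailing newline.
pub fn read_source_file(raw_path: &str) -> io::Result<String> {
    let path = Path::new(raw_path.trim());
    let mut contents = String::new();
    File::open(path)?.read_to_string(&mut contents)?;
    Ok(contents)
}
```

The match arm could then `match` on the returned `Result`, print the error, and `continue` in the same way it already does for parse failures.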

End

I think that’s enough for this article. In the next one, we’ll continue working on our assembler and build a symbol table. The code is in GitLab if you need it. See you later!
