So You Want to Build a Language VM - Part 13 - Labels
Adds in labels to our VM
Intro
Hey everyone! I know I promised labels in the previous part, but as I was writing it, I realized there are some prerequisites we can knock out first. Completing them will make implementing Labels and Directives much easier.
New Tokens
OK, first up! To handle Labels and Directives, we need to add three new Token types over in src/assembler/mod.rs:
#[derive(Debug, PartialEq)]
pub enum Token {
Op { code: Opcode },
Register { reg_num: u8 },
IntegerOperand { value: i32 },
LabelDeclaration { name: String },
LabelUsage { name: String },
Directive { name: String }
}
Don’t worry about what those are yet.
Tweak AssemblerInstruction
This is an easy one! Head over to instruction_parsers.rs and let’s add the new fields to our AssemblerInstruction:
#[derive(Debug, PartialEq)]
pub struct AssemblerInstruction {
pub opcode: Option<Token>,
pub label: Option<Token>,
pub directive: Option<Token>,
pub operand1: Option<Token>,
pub operand2: Option<Token>,
pub operand3: Option<Token>,
}
The first thing to note is that we’ve made everything optional. This is because an instruction can now start with a Directive OR an Opcode. The second thing is the addition of the label and directive fields. We’ll need these later on.
Instruction Forms Everywhere!
Right now, we are writing one parser per instruction form: one for <opcode>, one for <opcode> <operand>, and so on. That gets repetitive as the number of forms grows, so let’s combine all those parsers into one.
The first step is to include registers as operands. Open up src/assembler/operand_parsers.rs and add the following:
use assembler::register_parsers::register;
Then, in the operand parser, add it like so:
named!(pub operand<CompleteStr, Token>,
alt!(
integer_operand |
register
)
);
Next, head over to instruction_parsers.rs and add this parser:
named!(instruction_combined<CompleteStr, AssemblerInstruction>,
do_parse!(
l: opt!(label_declaration) >>
o: opcode >>
o1: opt!(operand) >>
o2: opt!(operand) >>
o3: opt!(operand) >>
(
AssemblerInstruction{
opcode: Some(o),
label: l,
directive: None,
operand1: o1,
operand2: o2,
operand3: o3,
}
)
)
);
And now, we need to do the same for directives. Open up directive_parsers.rs and replace all the macros in there with these:
named!(directive_declaration<CompleteStr, Token>,
do_parse!(
tag!(".") >>
name: alpha1 >>
(
Token::Directive{name: name.to_string()}
)
)
);
named!(directive_combined<CompleteStr, AssemblerInstruction>,
ws!(
do_parse!(
// directive_declaration already consumes the leading "."
name: directive_declaration >>
o1: opt!(operand) >>
o2: opt!(operand) >>
o3: opt!(operand) >>
(
AssemblerInstruction{
opcode: None,
directive: Some(name),
label: None,
operand1: o1,
operand2: o2,
operand3: o3,
}
)
)
)
);
/// Will try to parse out any of the Directive forms
named!(pub directive<CompleteStr, AssemblerInstruction>,
do_parse!(
ins: alt!(
directive_combined
) >>
(
ins
)
)
);
Finally, update the instruction parser to look like this:
/// Will try to parse out any of the Instruction forms
named!(pub instruction<CompleteStr, AssemblerInstruction>,
do_parse!(
ins: alt!(
instruction_combined |
directive
) >>
(
ins
)
)
);
Summary
The end result of all these changes is that we can now accept more forms, such as:
<directive>
<opcode>
<directive> <operand>
<directive> <operand> <operand>
<directive> <operand> <operand> <operand>
<opcode> <operand>
<opcode> <operand> <operand>
<opcode> <operand> <operand> <operand>
And we reduced the number of parsers needed. Make sure cargo test still passes, and then on to…
Labels
I should probably explain what a label even is. =)
What is a Label?
In assembly, labels give a logical name to a specific instruction that you can later reference. For example:
test1: LOAD $0 #100
You can then use @test1 as an operand for certain instructions, such as jump targets:
DJMP @test1
This will require a new file. Make src/assembler/label_parsers.rs, and in it put these two parsers:
use nom::types::CompleteStr;
use nom::{alphanumeric, multispace};
use assembler::Token;
/// Looks for a user-defined label, such as `label1:`
named!(pub label_declaration<CompleteStr, Token>,
ws!(
do_parse!(
name: alphanumeric >>
tag!(":") >>
opt!(multispace) >>
(
Token::LabelDeclaration{name: name.to_string()}
)
)
)
);
/// Looks for a usage of a user-defined label, such as `@label1`
named!(pub label_usage<CompleteStr, Token>,
ws!(
do_parse!(
tag!("@") >>
name: alphanumeric >>
opt!(multispace) >>
(
Token::LabelUsage{name: name.to_string()}
)
)
)
);
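These will let us spot the declaration of a label (label1:) and its usage (@label1). Note that the alphanumeric parser only accepts letters and digits, so label names cannot contain underscores yet.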
Tests
#[test]
fn test_parse_label_declaration() {
let result = label_declaration(CompleteStr("test:"));
assert_eq!(result.is_ok(), true);
let (_, token) = result.unwrap();
assert_eq!(token, Token::LabelDeclaration { name: "test".to_string() });
let result = label_declaration(CompleteStr("test"));
assert_eq!(result.is_ok(), false);
}
#[test]
fn test_parse_label_usage() {
let result = label_usage(CompleteStr("@test"));
assert_eq!(result.is_ok(), true);
let (_, token) = result.unwrap();
assert_eq!(token, Token::LabelUsage { name: "test".to_string() });
let result = label_usage(CompleteStr("test"));
assert_eq!(result.is_ok(), false);
}
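If it helps to see the accepted syntax without nom in the way, here is a rough std-only sketch of the same two checks. The helper names (is_label_name, label_declaration_name, label_usage_name) are made up for illustration and are not part of the assembler. It assumes names are ASCII letters and digits only, and unlike the nom parsers it looks at one already-isolated token rather than consuming a prefix of the input.

```rust
/// True when `name` is non-empty and made only of ASCII letters and digits.
/// (Assumption: this mirrors what the `alphanumeric` parser accepts.)
fn is_label_name(name: &str) -> bool {
    !name.is_empty() && name.chars().all(|c| c.is_ascii_alphanumeric())
}

/// Hypothetical helper: returns the name if `token` is a declaration
/// such as `label1:`.
fn label_declaration_name(token: &str) -> Option<&str> {
    let name = token.trim().strip_suffix(':')?;
    if is_label_name(name) {
        Some(name)
    } else {
        None
    }
}

/// Hypothetical helper: returns the name if `token` is a usage
/// such as `@label1`.
fn label_usage_name(token: &str) -> Option<&str> {
    let name = token.trim().strip_prefix('@')?;
    if is_label_name(name) {
        Some(name)
    } else {
        None
    }
}
```

Both helpers return None for a bare name like test, which matches the failure cases in the tests above.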
A Slight Digression
Before we move on, we need to talk a bit more about how assemblers work.
Passes
Assemblers operate in one or more "passes". That is, they read through the written code, do something, and then repeat until done. Our assembler is going to be a two-pass assembler. There are no hard-and-fast rules about what must be done in which pass. A pass might identify all variables, or perform an optimization.
Why do a two-pass assembler? Well, it has to do with the forward reference problem.
Using Labels
Even though we can now parse labels, we can’t actually use them for anything. They don’t get turned into bytes and written out to our bytecode. They are for use during the assembly phase.
Consider what would happen if we tried to run this code:
JMP @target
target: HLT
We are trying to use the label as an operand before we create it. This is sometimes called the forward reference problem, and we’ll solve it by doing two passes.
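To make the two-pass idea concrete, here is a minimal sketch that works on plain text lines instead of our real parser output. Everything in it is an assumption for illustration: the function names collect_labels and resolve_labels, the rule that every line holds exactly one instruction, the fixed 4-byte instruction width, and the choice to rewrite @name into an integer operand like #4.

```rust
use std::collections::HashMap;

/// Assumed fixed instruction width, in bytes, for this sketch.
const INSTRUCTION_BYTES: u32 = 4;

/// Pass 1: record the byte offset of every `name:` declaration.
/// Assumes each line holds exactly one instruction.
fn collect_labels(lines: &[&str]) -> HashMap<String, u32> {
    let mut offsets = HashMap::new();
    for (index, line) in lines.iter().enumerate() {
        if let Some(first) = line.split_whitespace().next() {
            if let Some(name) = first.strip_suffix(':') {
                offsets.insert(name.to_string(), index as u32 * INSTRUCTION_BYTES);
            }
        }
    }
    offsets
}

/// Pass 2: drop declarations and rewrite each `@name` operand into the
/// offset found in pass 1. An unknown label name is returned as the error.
fn resolve_labels(
    lines: &[&str],
    offsets: &HashMap<String, u32>,
) -> Result<Vec<String>, String> {
    let mut resolved = Vec::new();
    for line in lines {
        let mut words = Vec::new();
        for word in line.split_whitespace() {
            if word.ends_with(':') {
                // Declarations only matter in pass 1; they emit nothing.
                continue;
            }
            match word.strip_prefix('@') {
                Some(name) => match offsets.get(name) {
                    Some(offset) => words.push(format!("#{}", offset)),
                    None => return Err(name.to_string()),
                },
                None => words.push(word.to_string()),
            }
        }
        resolved.push(words.join(" "));
    }
    Ok(resolved)
}
```

With the two-line program above, pass 1 records target at byte offset 4, so pass 2 can turn the jump operand into #4 even though the label is declared after it is used.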
Storing Symbols
Another issue: where do we store the value of each symbol (of which a label is one type)?
The answer is something called a Symbol Table. This is a data structure that the assembler maintains while going through the code. It stores metadata about the code, such as the byte offset each symbol refers to.
Our Symbol Table will look like this:
| Symbol Name | Symbol Type | Byte Offset |
|---|---|---|
| some_label | Label | 12 |
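As a rough preview, the table above could be modeled with something like the following. This is only a sketch: the type and method names (SymbolType, Symbol, SymbolTable, add_symbol, symbol_offset) are placeholders I picked for illustration, and backing the table with a plain Vec is an assumption that keeps things simple rather than a final design.

```rust
/// The kinds of symbol this sketch knows about; only labels for now.
#[derive(Debug, Clone, PartialEq)]
enum SymbolType {
    Label,
}

/// One row of the table: a name, its type, and the byte offset it refers to.
#[derive(Debug, Clone, PartialEq)]
struct Symbol {
    name: String,
    symbol_type: SymbolType,
    offset: u32,
}

/// Hypothetical symbol table backed by a plain Vec.
#[derive(Debug, Default)]
struct SymbolTable {
    symbols: Vec<Symbol>,
}

impl SymbolTable {
    fn new() -> Self {
        SymbolTable::default()
    }

    /// Adds a symbol. Returns false (and changes nothing) if the name
    /// has already been declared.
    fn add_symbol(&mut self, symbol: Symbol) -> bool {
        if self.symbols.iter().any(|s| s.name == symbol.name) {
            return false;
        }
        self.symbols.push(symbol);
        true
    }

    /// Looks up the byte offset recorded for `name`, if any.
    fn symbol_offset(&self, name: &str) -> Option<u32> {
        self.symbols
            .iter()
            .find(|s| s.name == name)
            .map(|s| s.offset)
    }
}
```

Rejecting duplicate names is a design choice in this sketch; it means a label declared twice keeps its first offset instead of silently moving.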
But first…
Before we start on passes, labels, and symbols, let’s add two new abilities to our REPL:
A command to clear the program vector
The ability to read from a file
The first one I will leave up to you. The second one requires adding a new match arm for .load_file:
".load_file" => {
print!("Please enter the path to the file you wish to load: ");
io::stdout().flush().expect("Unable to flush stdout");
let mut tmp = String::new();
stdin.read_line(&mut tmp).expect("Unable to read line from user");
let tmp = tmp.trim();
let filename = Path::new(tmp);
let mut f = File::open(filename).expect("File not found");
let mut contents = String::new();
f.read_to_string(&mut contents).expect("There was an error reading from the file");
let program = match program(CompleteStr(&contents)) {
// Rust's pattern matching is pretty powerful and can even be nested
Ok((_remainder, program)) => {
program
},
Err(e) => {
println!("Unable to parse input: {:?}", e);
continue;
}
};
// Vec::append takes a mutable reference to the Vec it drains
self.vm.program.append(&mut program.to_bytes());
}
This match arm is similar to the one that handles ordinary input, except that it reads code from a file and then hands it over to the parser. Later, the match on program in both .load_file and the default arm can go into a common function.
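One thing to be aware of: the expect calls in that arm will end the whole REPL if the path is wrong. Here is a small sketch of a helper that returns an io::Result instead, so the caller can print the error and continue. The name read_source_file is hypothetical, and this only covers the file-reading half; the parsing step stays as shown above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: trims the path the user typed (including the
/// trailing newline from read_line) and reads the whole file into a String.
fn read_source_file(raw_path: &str) -> io::Result<String> {
    let path = Path::new(raw_path.trim());
    fs::read_to_string(path)
}
```

In the match arm, the result could be handled with a match that prints the error and uses continue on failure, in the same way the parser error is handled.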
End
I think that’s enough for this article. In the next one, we’ll continue working on our assembler and build a symbol table. The code is in GitLab if you need it. See you later!