So You Want to Build a Language VM - Part 16 - String Constants And More

Two-pass assembler, using and printing string constants, and other loose ends

Home Stretch

This is going to be a longer post. In it, we’ll wrap up our two-pass assembler, add a PRTS opcode for printing, and tie up some other loose ends. At the end, we should have an interpreter with a good amount of functionality and a simple assembly language. We are, of course, far from done with the project. I want to work on a different tutorial series for a bit, then we’ll continue with Iridium. === Quick Change Add the following dependencies in your Cargo.toml. We’ll need them later:

log = "0.4"
env_logger = "0.5.13"
byteorder = "1"

Without further ado…​

String Constants

These are strings that you define in code. For example, if your program needed to print "Hello" a lot, you don’t want to a copy of "Hello" for each usage. Instead, we can use something quite similar to variables. To understand this, we need to talk about a special section of the bytecode.

Read-Only Section

When our assembler writes out its final blob of bytecode that our VM executes, it has a specific structure that is:

<header>
<read-only data>
<executable data>

The read-only data is a special section of the bytecode where we store constants. A common example is strings. Let’s say we declare a string constant like this:

hello: .asciiz 'Hello'
  1. hello: is a label

  2. .asciiz declares the type of the constant

  3. 'Hello' is the constant itself.

When our assembler sees a line like this, it reads the bytes that make up Hello into the read-only section of memory and adds a 0. Since the RO section is just another Vec<u8>, it would look like:

[72, 101, 108, 108, 111, 0]
Note
The above numbers are in base 10. 72 == H, 101 == e, 108 = l, 111 = 0 in UTF-8.

Now let’s say we want another string constant, like cat, which is [99, 97, 116]. If we write this assembly:

hello: .asciiz 'Hello'
cat: .asciiz 'cat'

Then our read-only section would look like:

[72, 101, 108, 108, 111, 0, 99, 97, 116, 0]

See how we stick all the constants together in a giant vector? This answers the question of how we store them, but how do we retrieve them when the program is executing, and comes across an instruction like this:

PRTS @hello

During the first pass of our assembler, it did the following:

  1. It found hello: and verified it had no entry in the symbol table

  2. It saw the directive was .asciiz, which means to treat the next operand as a null-terminated string

  3. It parsed out Hello, removed the single-quotes

  4. Here’s the secret: it recorded the offset in the symbol table of where this string constant begins

  5. To retrieve a string constant, we do a lookup in the symbol table, and start reading bytes from there until we hit a 0

  6. The PRTS instruction knows to look in the read-only section

This is how all null-terminated string systems (that I am aware of) work. A second way to deal with string constants is (shockingly) non-null-terminated strings. You can store the length of the constant along with it, or the end of the constant. These also work fine, but I find them more annoying to work with. Hence, we are using null terminated strings. =)

There are a couple other rules for string constants:

  1. The user must declare them in the .data segment

  2. All strings will be UTF-8 by default, terminated by a 0

  3. The format for declaring a string constant is: my_string: .asciiz '<string>'

  4. In the .code segment, developers can use @my_string as an operand

Note
Why single quotes, you ask? Because getting nom to parse a single " was, er, making me not happy. At some point, I’ll figure out how to accept either ' or ".
Note
Why the z at the end of .asciiz, you ask? In MIPS assembly, which we are shamelessly copying, there’s a second directive, .ascii. This declares a non-null-terminated string. Since MIPS assembly is our guide, more or less, we’re following its conventions.

MOAR PARSERS!

Head over to src/assembler/operand_parsers.rs, and let’s write a parser for strings:

named!(irstring<CompleteStr, Token>,
    do_parse!(
        tag!("'") >>
        content: take_until!("'") >>
        tag!("'") >>
        (
            Token::IrString{ name: content.to_string() }
        )
    )
);

Note
Why is it called IrString, you ask? To not conflict with Rust keywords. Ir stands for Iridium, so IridiumString.

We also need to add irstring to the list of sub-parsers in the operand parser, like so:

named!(pub operand<CompleteStr, Token>,
    alt!(
        integer_operand |
        label_usage |
        register |
        irstring
    )
);

And a Test

#[test]
fn test_parse_string_operand() {
    let result = irstring(CompleteStr("'This is a test'"));
    assert_eq!(result.is_ok(), true);
}

Directives

Remember, we are using a directive, not an opcode. The directive is of this form: label: .asciiz 'String content'. A quick peek at our directive_combined parser reveals that we are not allowing for an optional label:

named!(directive_combined<CompleteStr, AssemblerInstruction>,
    ws!(
        do_parse!(
            tag!(".") >>
            name: directive_declaration >>
            o1: opt!(operand) >>
            o2: opt!(operand) >>
            o3: opt!(operand) >>
            (
                AssemblerInstruction{
                    opcode: None,
                    directive: Some(name),
                    label: None,
                    operand1: o1,
                    operand2: o2,
                    operand3: o3,
                }
            )
        )
    )
);

Important
There’s a bug in directive_combined. Note how we have tag!(".") and then directive_declaration, which also has a tag!("."). The net effect is that we are looking for ... Let’s remove this one.

We can add in a quick opt! like so:

use assembler::label_parsers::label_declaration;

named!(directive_combined<CompleteStr, AssemblerInstruction>,
    ws!(
        do_parse!(
            l: opt!(label_declaration) >>
            name: directive_declaration >>
            o1: opt!(operand) >>
            o2: opt!(operand) >>
            o3: opt!(operand) >>
            (
                AssemblerInstruction{
                    opcode: None,
                    directive: Some(name),
                    label: l,
                    operand1: o1,
                    operand2: o2,
                    operand3: o3,
                }
            )
        )
    )
);

And a test for it…​

#[test]
fn test_string_directive() {
    let result = directive_combined(CompleteStr("test: .asciiz 'Hello'"));
    assert_eq!(result.is_ok(), true);
    let (_, directive) = result.unwrap();

    // Yes, this is the what the result should be
    let correct_instruction =
        AssemblerInstruction {
            opcode: None,
            label: Some(
                Token::LabelDeclaration {
                    name: "test".to_string()
                }),
            directive: Some(
                Token::Directive {
                    name: "asciiz".to_string()
                }),
            operand1: Some(Token::IrString { name: "Hello".to_string() }),
            operand2: None,
            operand3: None };

    assert_eq!(directive, correct_instruction);
}

FINALLY

OK, I think we are ready to try a test program with a string declaration in it now. Something simple, like:

.data
hello: .asciiz 'Hello everyone!'
.code
hlt

Wait! I have a great idea! Let’s put that in as a test in program_parsers! =)

#[test]
fn test_complete_program() {
    let test_program = CompleteStr(".data\nhello: .asciiz 'Hello everyone!'\n.code\nhlt");
    let result = program(test_program);
    assert_eq!(result.is_ok(), true);
}

Assembler

The parser is parsing, now we need to teach the assembler what to do with a string constant. The assembler is going to maintain a Vec<u8>. When it finds a string constant, it converts each character to bytes and appends to the read-only data. When there are no more characters, it will add in 0 (which is the numerical zero, not the ASCII or UTF-8 string "0"!) to the end. It will also record the offset at which that constant begins and place it in the symbol table.

Important
Even though we are working with UTF-8, you should get into the habit of thinking of UTF encodings as taking 1 or more bytes. After all, there is UTF-16 and UTF-32.

Two-Pass Assembler

I think it will be easier if I just show you the complete two-pass assembler and go over it. Here’s the declaration:

#[derive(Debug, Default)]
#[derive(Debug, Default)]
pub struct Assembler {
    /// Tracks which phase the assember is in
    phase: AssemblerPhase,
    /// Symbol table for constants and variables
    pub symbols: SymbolTable,
    /// The read-only data section constants are put in
    pub ro: Vec<u8>,
    /// The compiled bytecode generated from the assembly instructions
    pub bytecode: Vec<u8>,
    /// Tracks the current offset of the read-only section
    ro_offset: u32,
    /// A list of all the sections we've seen in the code
    sections: Vec<AssemblerSection>,
    /// The current section the assembler is in
    current_section: Option<AssemblerSection>,
    /// The current instruction the assembler is converting to bytecode
    current_instruction: u32,
    /// Any errors we find along the way. At the end, we'll present them to the user.
    errors: Vec<AssemblerError>
}

I’m not certain we’ll need to keep ro_offset and sections.

impl block

Let’s start going through the functions we implement for our assembler one by one. I’ll do a top-down approach, where we start with the highest-level function and follow it through to the smaller functions.

assemble()

This is the public function most things would call if they want to turn a raw string into bytecode. The raw parameter is the entire program in string format. It returns either a Vec<u8> or a Vec<AssemblerError>.

Important
As you read through this code, you might wonder why we aren’t returning many values. Its because as the assembler parses through the code, it is manipulating its own internal data structures, such as adding data to ro. I did it this way because it seemed to work better with Rust’s ownership system versus trying to pass references all around.
pub fn assemble(&mut self, raw: &str) -> Result<Vec<u8>, Vec<AssemblerError>> {
    // Runs the raw program through our `nom` parser
    match program(CompleteStr(raw)) {
        // If there were no parsing errors, we now have a `Vec<AssemblyInstructions>` to process.
        // `remainder` _should_ be "".
        // TODO: Add a check for `remainder`, make sure it is "".
        Ok((_remainder, program)) => {
            // First get the header so we can smush it into the bytecode later
            let mut assembled_program = self.write_pie_header();

            // Start processing the AssembledInstructions. This is the first pass of our two-pass assembler.
            // We pass a read-only reference down to another function.
            self.process_first_phase(&program);

            // If we accumulated any errors in the first pass, return them and don't try to do the second pass
            if !self.errors.is_empty() {
                // TODO: Can we avoid a clone here?
                return Err(self.errors.clone());
            };

            // Make sure that we have at least one data section and one code section
            if self.sections.len() != 2 {
                // TODO: Detail out which one(s) are missing
                println!("Did not find at least two sections.");
                self.errors.push(AssemblerError::InsufficientSections);
                // TODO: Can we avoid a clone here?
                return Err(self.errors.clone());
            }
            // Run the second pass, which translates opcodes and associated operands into the bytecode
            let mut body = self.process_second_phase(&program);

            // Merge the header with the populated body vector
            assembled_program.append(&mut body);
            Ok(assembled_program)
        },
        // If there were parsing errors, bad syntax, etc, this arm is run
        Err(e) => {
            println!("There was an error parsing the code: {:?}", e);
            Err(vec![AssemblerError::ParseError{ error: e.to_string() }])
        }
    }
}

process_first_phase()

This function has two jobs:

  1. Label declarations (e.g., name:)

  2. Directives (e.g., .data)

In the first phase, we care the most about finding all the label declarations to put them in the symbol table, and making sure we everything is in a segment. I’ve commented this function heavily, so read through it.

Important
For handling errors, I chose to accumulate them into a list on the Assembler and process as much as possible. An alternative strategy would be to bail out at the first error. I chose to do it this way because I thought it might be good for the user to be able to see all the errors at once, instead of having to do a tedious fix-test-fix-repeat cycle.
/// Runs the first pass of the two-pass assembling process. It looks for labels and puts them in the symbol table
fn process_first_phase(&mut self, p: &Program) {
    // Iterate over every instruction, even though in the first phase we care about labels and directives but nothing else
    for i in &p.instructions {
        if i.is_label() {
            // TODO: Factor this out into another function? Put it in `process_label_declaration`?
            if self.current_section.is_some() {
                // If we have hit a segment header already (e.g., `.code`) then we are ok
                self.process_label_declaration(&i);
            } else {
                // If we have *not* hit a segment header yet, then we have a label outside of a segment, which is not allowed
                self.errors.push(AssemblerError::NoSegmentDeclarationFound{instruction: self.current_instruction});
            }
        }

        if i.is_directive() {
            self.process_directive(i);
        }
        // This is used to keep track of which instruction we hit an error on
        // TODO: Do we really need to track this?
        self.current_instruction += 1;
    }
    // Once we're done with this function, set the phase to second
    self.phase = AssemblerPhase::Second;
}

process_label_directive()

We call this function when the assembler finds a directive that is a label declaration:

/// Handles the declaration of a label such as:
/// hello: .asciiz 'Hello'
fn process_label_declaration(&mut self, i: &AssemblerInstruction) {
    // Check if the label is None or String
    let name = match i.get_label_name() {
        Some(name) => { name },
        None => {
            self.errors.push(AssemblerError::StringConstantDeclaredWithoutLabel{instruction: self.current_instruction});
            return;
        }
    };

    // Check if label is already in use (has an entry in the symbol table)
    // TODO: Is there a cleaner way to do this?
    if self.symbols.has_symbol(&name) {
        self.errors.push(AssemblerError::SymbolAlreadyDeclared);
        return;
    }

    // If we make it here, it isn't a symbol we've seen before, so stick it in the table
    let symbol = Symbol::new(name, SymbolType::Label);
    self.symbols.add_symbol(symbol);
}

process_second_phase()

This begins the second pass of the assembler, and it does quite a bit of work:

/// Runs the second pass of the assembler
fn process_second_phase(&mut self, p: &Program) -> Vec<u8> {
    // Restart the counting of instructions
    self.current_instruction = 0;
    // We're going to put the bytecode meant to be executed in a separate Vec so we can do some post-processing and then merge it with the header and read-only sections
    // Examples could be optimizations, additional checks, whatever
    let mut program = vec![];
    // Same as in first pass, except in the second pass we care about opcodes and directives
    for i in &p.instructions {
        if i.is_opcode() {
            // Opcodes know how to properly transform themselves into 32-bits, so we can just call `to_bytes` and append to our program
            let mut bytes = i.to_bytes(&self.symbols);
            program.append(&mut bytes);
        }
        if i.is_directive() {
            // In this phase, we can have directives but of different types than we care about in the first pass. The Directive itself can check which pass the Assembler
            // is in and decide what to do about it
            self.process_directive(i);
        }
        self.current_instruction += 1
    }
    program
}

process_directive()

fn process_directive(&mut self, i: &AssemblerInstruction) { // First let’s make sure we have a parseable name let directive_name = match i.get_directive_name() { Some(name) ⇒ { name }, None ⇒ { println!("Directive has an invalid name: {:?}", i); return; } };

    // Now check if there were any operands.
    if i.has_operands() {
        // If it _does_ have operands, we need to figure out which directive it was
        match directive_name.as_ref() {
            // If this is the operand, we're declaring a null terminated string
            "asciiz" => {
                self.handle_asciiz(i);
            }
            _ => {
                self.errors.push(AssemblerError::UnknownDirectiveFound{ directive: directive_name.clone() });
                return;
            }
        }
    } else {
        // If there were not any operands, (e.g., `.code`), then we know it is a section header
        self.process_section_header(&directive_name);
    }
}

process_section_header()

This little function just processes any new section declarations.

/// Handles a declaration of a section header, such as:
/// .code
fn process_section_header(&mut self, header_name: &str) {
    let new_section: AssemblerSection = header_name.into();
    // Only specific section names are allowed
    if new_section == AssemblerSection::Unknown {
        println!("Found an section header that is unknown: {:#?}", header_name);
        return;
    }
    // TODO: Check if we really need to keep a list of all sections seen
    self.sections.push(new_section.clone());
    self.current_section = Some(new_section);
}

handle_asciiz()

This function is called to handle declarations of string constants. Its heavily commented, so read carefully.

/// Handles a declaration of a null-terminated string:
/// hello: .asciiz 'Hello!'
fn handle_asciiz(&mut self, i: &AssemblerInstruction) {
    // Being a constant declaration, this is only meaningful in the first pass
    if self.phase != AssemblerPhase::First { return; }

    // In this case, operand1 will have the entire string we need to read in to RO memory
    match i.get_string_constant() {
        Some(s) => {
            match i.get_label_name() {
                Some(name) => { self.symbols.set_symbol_offset(&name, self.ro_offset); }
                None => {
                    // This would be someone typing:
                    // .asciiz 'Hello'
                    println!("Found a string constant with no associated label!");
                    return;
                }
            };
            // We'll read the string into the read-only section byte-by-byte
            for byte in s.as_bytes() {
                self.ro.push(*byte);
                self.ro_offset += 1;
            }
            // This is the null termination bit we are using to indicate a string has ended
            self.ro.push(0);
            self.ro_offset += 1;
        }
        None => {
            // This just means someone typed `.asciiz` for some reason
            println!("String constant following an .asciiz was empty");
        }
    }
}

PRTS code

Now that we have stuff stored in the read-only section, it’d be handy to be able to read it. So let’s add a new opcode with this implementation:

Opcode::PRTS => {
    // PRTS takes one operand, either a starting index in the read-only section of the bytecode
    // or a symbol (in the form of @symbol_name), which will look up the offset in the symbol table.
    // This instruction then reads each byte and prints it, until it comes to a 0x00 byte, which indicates
    // termination of the string
    let starting_offset = self.next_16_bits() as usize;
    let mut ending_offset = starting_offset;
    let slice = self.ro_data.as_slice();
    // TODO: Find a better way to do this. Maybe we can store the byte length and not null terminate? Or some form of caching where we
    // go through the entire ro_data on VM startup and find every string and its ending byte location?
    while slice[ending_offset] != 0 {
        ending_offset += 1;
    }
    let result = std::str::from_utf8(&slice[starting_offset..ending_offset]);
    match result {
        Ok(s) => { print!("{}", s); }
        Err(e) => { println!("Error decoding string for prts instruction: {:#?}", e) }
    };
}

AssemblerError

You probably noticed that we are using a custom error type. I started using those for this part of the tutorial, because the project has grown large and complex enough that I think we need it. Rather than pasting the entire file contents, I am going to point you to the file in the repo, since article is almost at 3k words. =)

Note
Using the AssemblerError will require changing some of the match arms from Some(whatever) to Ok(whatever).

symbols.rs

You’ll also notice that I split all the symbol table code out into its own file, src/assembly/symbols.rs. You can see it here: https://gitlab.com/subnetzero/iridium/blob/master/src/assembler/symbols.rs

End

There were some other tweaks and changes I made that I didn’t cover in this tutorial. If your version isn’t working for some reason, I suggest pulling a fresh copy from the repo. The code as it should be up through this tutorial is tagged as 0.0.16 in GitLab.

This is a good stopping point in the series for a few weeks. There’s another tutorial series I want to start on. I have littered the Iridium codebase with // TODO: <stuff> notes, and I’ll try to open tickets in the GitLab issue tracker for each of them. If you’d like to contribute anything (code, spelling/grammar corrections, more rusty ways of doing things, anything at all), please feel free to do a MR and ping me for help or questions if needed.

Another reason I am taking a short break from this project is that I need to think about what to do next. We could continue with the language aspect and build a higher-level, Python-esque language on top of what we have now. Or we could focus on the VM itself, and start adding in features like parallelism and clustering. If you have thoughts/ideas about that, feel free to post them or e-mail me.

Hope you enjoyed this series and are looking forward to the next one!


If you need some assistance with any of the topics in the tutorials, or just devops and application development in general, we offer consulting services. Check it out over here or click Services along the top.