So You Want to Build a Language VM - Part 12 - Strings
Adds in strings to our VM
What are Strings?
This may shock you, but they are a bit more complicated than they might seem. Since a computer only cares about bytes, it has no concept of the letter s, the character !, or any other letter. These have meaning only to us humans. But we want our users to be able to give input and read output without having to do it all in hex. The solution is to use some sort of character encoding, which maps each character to a number.
You’ll hear two common encodings mentioned these days: ASCII and UTF. I’m not going to go into an exhaustive history of them; for that, check out this article for ASCII and this article for UTF-8. I will cover enough for us to put in support for strings, though.
ASCII
This is the older encoding. ASCII itself is a 7-bit code that defines 128 characters; the various 8-bit "extended ASCII" variants bring that up to 256. Most of the time, we use the first 128. This page has some good info on it.
UTF-8
With the limited number of characters in ASCII, representing Japanese kanji, Klingon, and other writing systems that do not use the Latin alphabet is difficult. UTF-8 gets around this by being variable-width: a UTF-8 character can use 1 to 4 bytes.
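To make the 1-to-4-byte claim concrete, here is a small sketch that reports how many bytes each character of a string takes up in UTF-8. The function name utf8_widths is made up for this post; char::len_utf8 is from the Rust standard library.

```rust
/// Pairs every character in `s` with the number of bytes it needs in UTF-8.
/// `utf8_widths` is a hypothetical helper name, not part of the VM.
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}
```

Feeding it a string that mixes a plain Latin letter, an accented letter, a kanji, and an emoji shows all four widths side by side.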
Which to use…
UTF-8! ASCII will be around for a long time, but new tech should use UTF-8 for its strings.
What Stored Strings Look Like
Now that we have a heap, we can start storing strings there. For now, think of our heap as an array that can grow forever. If we have an empty heap, the beginning might look like this:
[0, 0, 0, 0, 0, 0, 0, 0]
This heap can hold 8 bytes of data. Let’s say we want to store the string Hi in our heap. In UTF-8, H is the number 72 and i is the number 105. If we store that in our heap, it would look like:
[72, 105, 0, 0, 0, 0, 0, 0]
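Here is a rough sketch of that store operation in Rust. It assumes the heap is just a mutable byte slice; store_str is a name I made up for illustration and is not the VM's actual API.

```rust
/// Copies the UTF-8 bytes of `s` into `heap`, starting at `offset`.
/// Returns the offset just past the last byte written, or `None` if the
/// string does not fit. (Hypothetical helper, for illustration only.)
fn store_str(heap: &mut [u8], offset: usize, s: &str) -> Option<usize> {
    let bytes = s.as_bytes();
    let end = offset.checked_add(bytes.len())?;
    if end > heap.len() {
        return None;
    }
    heap[offset..end].copy_from_slice(bytes);
    Some(end)
}
```

Because a Rust &str is already UTF-8, as_bytes hands us exactly the numbers we want to put in the heap, with no conversion step.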
Remember how we said UTF-8 is a variable-width system and can take 1-4 bytes? How do we know that the letters H and i take 1 byte each? It depends on the leading byte:
0xxx xxxx A single-byte US-ASCII code (from the first 128 characters)
110x xxxx One more byte follows
1110 xxxx Two more bytes follow
1111 0xxx Three more bytes follow
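The table above translates fairly directly into bit masks. This is only a sketch, with a function name of my own choosing (utf8_sequence_len); it looks at the leading byte alone and does not validate the continuation bytes that follow.

```rust
/// Given the first byte of a UTF-8 sequence, returns how many bytes the
/// whole character occupies. Returns `None` for a continuation byte
/// (10xx xxxx) or any other byte that cannot start a character.
fn utf8_sequence_len(lead: u8) -> Option<usize> {
    match lead {
        b if b & 0b1000_0000 == 0b0000_0000 => Some(1), // 0xxx xxxx
        b if b & 0b1110_0000 == 0b1100_0000 => Some(2), // 110x xxxx
        b if b & 0b1111_0000 == 0b1110_0000 => Some(3), // 1110 xxxx
        b if b & 0b1111_1000 == 0b1111_0000 => Some(4), // 1111 0xxx
        _ => None,
    }
}
```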
An even better question is, how do we know when the string ends? Our strings will be null-terminated. This means that when we start reading a string, we consume bytes until we hit a 0. In UTF-8, a 0 byte never appears as part of any other character, so we can use it to denote when a string ends.
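Reading a null-terminated string back out might look something like this. Again, the heap is assumed to be a plain byte slice and read_cstr is a hypothetical name, not something the VM defines yet.

```rust
/// Reads a null-terminated UTF-8 string from `heap`, beginning at `start`.
/// Returns `None` if `start` is out of range, if no 0 byte is found, or
/// if the bytes before the terminator are not valid UTF-8.
fn read_cstr(heap: &[u8], start: usize) -> Option<&str> {
    let tail = heap.get(start..)?;
    // Consume bytes until we hit the terminating 0.
    let len = tail.iter().position(|&b| b == 0)?;
    std::str::from_utf8(&tail[..len]).ok()
}
```

Returning None when there is no terminator is a design choice on my part; the alternative of reading to the end of the heap would hide bugs where the 0 was never written.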
String Constants
If we know what a string will be at compile-time, we can put it directly in the assembly code. Handling user input is trickier and something we’ll tackle later. I’m afraid I will have to go over a few new concepts first, though.
Assembly Sections (or Segments)
So far, we’ve been writing assembly top to bottom, putting any instruction anywhere. Real assembly programs have multiple sections: the text section and the data section.
Note | An assembly section is sometimes referred to as a segment. |
Data Section
This is the part of the program where we store constants.
Text Section
This section is sometimes referred to as the code section as well. It holds the actual instructions.
ELF
In computing, the Executable and Linkable Format (ELF, formerly named Extensible Linking Format), is a common standard file format for executable files, object code, shared libraries, and core dumps.
https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
We need our own ELF, so to speak. That is, we need to define what format the bytecode our assembler outputs will follow.
Note | We could probably use the ELF format, but that didn’t occur to me when I was originally writing this part of the VM. |
Name
For now, I’m going to refer to the format as…uh…the PIE format. Please feel free to suggest other names. =)
Header
Let’s copy ELF for this part. The 64-bit ELF header is 64 bytes long, so we will reserve the first 64 bytes too. This will be the PIE Header.
Note | I like pie. |
Our header will follow this format:
4 bytes with a magic number to identify it as a PIE header
The 5th byte will tell us the version of the PIE format (always 1 for now)
That’s all we’ll encode in the header for now, though we may add bytes later. What should our magic number be? Let’s go with: [45, 50, 49, 45]. This is EPIE in ASCII, written using hexadecimal notation (0x45, 0x50, 0x49, 0x45).
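As a sketch of what writing and checking that header could look like: the constant and function names below (PIE_MAGIC, build_header, parse_version) are placeholders of mine, and I am taking the magic number as the hexadecimal values described above.

```rust
const PIE_MAGIC: [u8; 4] = [0x45, 0x50, 0x49, 0x45]; // "EPIE" in ASCII
const PIE_HEADER_LEN: usize = 64;
const PIE_VERSION: u8 = 1;

/// Builds a 64-byte header: magic number, version byte, then zero padding.
fn build_header() -> [u8; PIE_HEADER_LEN] {
    let mut header = [0u8; PIE_HEADER_LEN];
    header[..4].copy_from_slice(&PIE_MAGIC);
    header[4] = PIE_VERSION;
    header
}

/// Returns the version byte if `bytes` is long enough to hold a header
/// and starts with the magic number, otherwise `None`.
fn parse_version(bytes: &[u8]) -> Option<u8> {
    if bytes.len() < PIE_HEADER_LEN || !bytes.starts_with(&PIE_MAGIC) {
        return None;
    }
    Some(bytes[4])
}
```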
Data Section
The data section comes after the header and the few bytes that record where the code section starts (see Distinguishing Sections). This is the section where things like string constants will be stored.
Code Section
All the instructions will be here.
Distinguishing Sections
How do we know at which byte the data section begins? Or the code section? Simple! We encode the starting byte of the code section right after the PIE header.
Let’s say this is the header, the first 64 bytes of our program. (I replaced writing a bunch of 0s with the …)
[45, 50, 49, 45, 1, 0 ... 0]
The next eight bytes will contain the byte at which the code section starts. For example:
[0, 0, 0, 0, 0, 0, 0, 200]
We now know the following:
Bytes 0-63 are the PIE header
Bytes 64-71 contain the byte offset at which the code section starts
Bytes 72-199 contain the data section
Bytes 200-end contain the code section
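Here is one way the VM could carve a program up using that layout. I am assuming the eight offset bytes are big-endian, which matches the [0, 0, 0, 0, 0, 0, 0, 200] example; split_sections and the two constants are names I invented for this sketch.

```rust
const HEADER_LEN: usize = 64; // size of the header
const OFFSET_LEN: usize = 8; // bytes holding the code section's start

/// Splits a program into (data section, code section).
/// Returns `None` if the program is too short or the stored offset
/// points somewhere that makes no sense.
fn split_sections(program: &[u8]) -> Option<(&[u8], &[u8])> {
    let data_start = HEADER_LEN + OFFSET_LEN;
    let mut raw = [0u8; OFFSET_LEN];
    raw.copy_from_slice(program.get(HEADER_LEN..data_start)?);
    // Assumption: the offset is stored big-endian.
    let code_start = u64::from_be_bytes(raw);
    if code_start < data_start as u64 || code_start > program.len() as u64 {
        return None;
    }
    let code_start = code_start as usize;
    Some((&program[data_start..code_start], &program[code_start..]))
}
```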
Assembler Directives
The next new concept I’m going to introduce is directives. These are instructions to the assembler to do something. Here’s how you create a string constant in MIPS assembly:
my_string: .asciiz "Hello world"
Ignore the my_string: part for one second; we’ll talk about it next. The directive is .asciiz, which in MIPS means create a null-terminated string. .ascii creates one that is not null-terminated. There are many, many, many more directives in Intel x86 assembly.
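The difference between the two directives boils down to one trailing byte. A minimal sketch, assuming our assembler ends up supporting both spellings (encode_string_directive is a made-up name):

```rust
/// Turns the text of a string directive into the bytes that would be
/// stored. `.asciiz` appends a terminating 0; `.ascii` does not.
/// Any other directive yields `None`.
fn encode_string_directive(directive: &str, text: &str) -> Option<Vec<u8>> {
    let mut bytes = text.as_bytes().to_vec();
    match directive {
        ".asciiz" => bytes.push(0),
        ".ascii" => {}
        _ => return None,
    }
    Some(bytes)
}
```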
If you are curious, you can check out these links:
The important thing to note is that sections are declared using directives. This means our programs will start to be a bit more complex and look like this:
.data
<constants here>
.code
<instructions here>
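To give a feel for how an assembler might handle this layout, here is a small sketch that tags each source line with the section it belongs to. It is not the parser we will actually build; Section and split_by_section are hypothetical names, and it does nothing smarter than compare trimmed lines against the two directives.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Section {
    Data,
    Code,
}

/// Pairs every non-empty source line with the section directive that
/// precedes it. A line that shows up before any directive is an error.
fn split_by_section(source: &str) -> Result<Vec<(Section, String)>, String> {
    let mut current = None;
    let mut out = Vec::new();
    for line in source.lines().map(str::trim).filter(|l| !l.is_empty()) {
        match line {
            ".data" => current = Some(Section::Data),
            ".code" => current = Some(Section::Code),
            _ => match current {
                Some(section) => out.push((section, line.to_string())),
                None => return Err(format!("line outside any section: {}", line)),
            },
        }
    }
    Ok(out)
}
```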
Assembler Labels
And now, the last new thing to learn! Assembler labels! These let you label a constant or instruction and refer to it by that label elsewhere in your code. In our language, we’ll define a label as:
A sequence of alphanumeric characters
They must start the line
Terminated by a :
Labels can be referred to in the code section by prefacing them with an @
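Those rules are simple enough to sketch without a parser library. The function names here are invented for illustration. One assumption to flag loudly: I also accept underscores in label names, since names like my_str need them, even though the rules above only mention alphanumeric characters.

```rust
/// Characters allowed in a label name.
/// Assumption: underscores are allowed in addition to alphanumerics.
fn is_label_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// If `line` starts with a label declaration such as `my_str:`, returns
/// the label name and whatever follows it on the line.
fn parse_label_declaration(line: &str) -> Option<(&str, &str)> {
    let (name, rest) = line.split_once(':')?;
    // The label must start the line, so leading whitespace is rejected
    // here because a space is not a label character.
    if name.is_empty() || !name.chars().all(is_label_char) {
        return None;
    }
    Some((name, rest.trim_start()))
}

/// If `operand` is a label usage such as `@my_str`, returns the name.
fn parse_label_usage(operand: &str) -> Option<&str> {
    let name = operand.strip_prefix('@')?;
    if name.is_empty() || !name.chars().all(is_label_char) {
        return None;
    }
    Some(name)
}
```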
A small example program that will work after we implement these might look like:
.data
my_str: .asciiz "Hello everyone"
.code
prt @my_str
Note | Yes, we’ll also be coding a new instruction, PRT |
End
Lots of new concepts, so I’ll end this here. In the next part, we’ll implement labels!