Since tokens are the leaves of the AST, there are a lot of them and they can take a lot of space. To save memory, it is a good idea to store only a file location instead of a full token. Whenever token information is needed, just lex again from that file location to get the full token. This only works for languages with a context-free lexical syntax, of course (not entirely sure "context-free" is the right term here, but you get what I mean).
Storing row/column in file location data is wasteful - just a file offset should be enough. Whenever the row/column coordinates are needed (normally only in user messages) they can be quickly recomputed.
In effect, parsed tokens can be stored as just an offset - a 4 or 8 byte integer.
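For illustration, here's a minimal sketch (my own, with hypothetical names) of that recomputation: scan the source up to the stored offset, counting newlines. It's O(n), but it only runs when producing a user-facing message, so the cost doesn't matter.

fn line_col(src: &str, offset: usize) -> (usize, usize) {
    // Recompute 1-based (line, column) from a byte offset by
    // rescanning the source up to `offset`.
    let (mut line, mut col) = (1, 1);
    for (i, ch) in src.char_indices() {
        if i >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}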
This sounds like premature optimization to me (I know, that phrase gets used a lot lately). As with all optimizations, it probably makes sense not to do it and to keep your code highly readable until you've profiled issues at this level.
And my focus in writing a post like this is definitely to be as explicit as is reasonable, so it can serve as the best educational reference possible.
I also haven't seen real-world compilers do what you describe (thinking of V8, CPython, and other mainstream ones), but I'll keep an eye out for it now.
Is this really the right place to optimize? Having the relevant data on hand in a tight array is generally much faster than having to fetch and recompute it from somewhere else.
In general it is of course better to store an index to the data where it's needed rather than a copy of the data.
I've seen other experienced people do this, such as Jonathan Blow (Jai) and Per Vognsen (bitwise), I believe. I recall measuring a large space overhead from tokens myself when working on my toy compiler.
Token data will not typically be needed in many places: the token kind (integer-literal, plus-operator, open-paren, etc.) is needed at exactly one point in the parsing phase, and possibly in the type-checking phase, though you could have a separate "literal expression" kind for that. Binary payloads (string bytes, floating-point values, etc.) are needed in the constant phase for literal tokens. I wouldn't expect a little optimization here to be a difficult tradeoff - neither with regard to speed nor code complexity.
Update: here is bitwise's code, and it looks very good to me at a glance: https://github.com/pervognsen/bitwise/blob/master/ion/lex.c . Thinking about this yields interesting insights, I believe - the token struct approach is an illustration of how "object-oriented" approaches can go wrong. The concept of a token exists merely at the syntactical level, and the token kind controls the code flow of a recursive descent parser, but I don't see any good reason to store a token as a tagged union. Punctuation tokens are completely ephemeral, and constant data, literal kinds, operator kinds, etc. all belong in very distinct places in the data store that bear no resemblance to "token objects".
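To make that concrete, here is a hedged sketch of what "very distinct places in the data store" could look like - a struct-of-arrays layout of my own, with made-up names, not bitwise's actual code:

#[derive(Clone, Copy, PartialEq)]
enum TokenKind { IntLiteral, StrLiteral, Plus, OpenParen /* ... */ }

#[derive(Default)]
struct TokenStream {
    // Parallel arrays instead of a tagged-union Token struct.
    kinds: Vec<TokenKind>,
    offsets: Vec<u32>, // byte offset into the source file
    payload: Vec<u32>, // index into a side pool; unused for punctuation
    // Side pools: literal payloads live here, not in "token objects".
    int_pool: Vec<i64>,
    str_pool: Vec<String>,
}

impl TokenStream {
    fn push_int(&mut self, offset: u32, value: i64) {
        self.kinds.push(TokenKind::IntLiteral);
        self.offsets.push(offset);
        self.payload.push(self.int_pool.len() as u32);
        self.int_pool.push(value);
    }
}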
It might be worth storing this data separately to improve cache usage, but my experience is that on any modern machine the amount of space taken by parsed code / ASTs is small enough not to care about, unless you do something like C++, where you "reparse the world" for every tiny file (due to header includes).
Maybe it's only necessary for serious compilers. Let's say 10 tokens per line on average and 40 bytes per token (a naive token struct could easily be hundreds of bytes), then we're in the region of 400 to a few thousand bytes per line. Compiling 100K lines would not be unheard of - that's already on the order of 40 MB of tokens - and for benchmarks you want to push into the millions of lines, where tokens alone reach gigabytes and can fill a computer's memory. If there is an easy-to-make and effective optimization, I'm all for it :-)
"expect" is like the word "assert". You express which invariant should always be true and therefore there should be no failure. The use of `.expect("Could not read file")` is actually wrong, because it can fail.
use std::fs::File;
use std::io::Write;
let mut file = File::create("valuable_data.txt").unwrap();
file.write_all(b"important message").expect("failed to write message");
The use of .expect("Could not read file") is consistent with Rust's documentation.
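For what it's worth, the current std docs for Result::expect also recommend (if I recall correctly) phrasing the message as the expectation that was violated ("should ...") rather than as an error description, roughly like:

// Message states the expectation, not the failure:
let config = std::fs::read_to_string("config.toml")
    .expect("config.toml should be present in the working directory");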
And I agree with OP that .expect() is poorly named. I wouldn't raise it as an issue on the work of others who just use Rust and are stuck with it, though.
Working on tokenization and parsing, I've had two "lights clicking on" moments that I think every dev working on a PL implementation should have:
- Tokens are the leaves of your syntax trees
- File locations are relative, not absolute
It's easier to build a parser that doesn't buy into these things, but it's way harder to build tooling and good error messaging if you don't.
Relative to the preceding token in the stream: the number of lines and columns from the end of the last token (or from its beginning, but to me that's conceptually confusing). Take a look at the Language Server Protocol specification for semantic tokens, which uses relative positions, as a reference.
Using an absolute line/column/index to indicate the location of a token in a file means you need to retokenize a large amount of the file whenever there's an edit. In a streaming parser that responds to live edits, storing the tokens with relative file locations allows you to update only the neighbors on insertion or deletion (and also allows for bulk insert/delete). Absolute file locations can be recovered by scanning the token stream - which is why it's important for the tokens to form the leaves of the tree: it guarantees you can always recover the file locations.
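A minimal sketch of that encoding (my own, loosely modeled on LSP semantic-token deltas; the names are hypothetical):

struct RelPos {
    delta_line: u32, // lines since the previous token
    delta_col: u32,  // if delta_line == 0, columns since the previous
                     // token; otherwise a column on the new line
}

// Recover absolute positions with a single prefix scan. After an
// edit, only the deltas next to the edit need fixing; every token
// after them is untouched.
fn to_absolute(tokens: &[RelPos]) -> Vec<(u32, u32)> {
    let (mut line, mut col) = (0, 0);
    tokens
        .iter()
        .map(|t| {
            if t.delta_line == 0 {
                col += t.delta_col;
            } else {
                line += t.delta_line;
                col = t.delta_col;
            }
            (line, col)
        })
        .collect()
}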
Then you can pull some tricks like the Roslyn compiler does, which is to store things like comments, diagnostics, and white space as "trivia" associated with the tokens and very quickly recover the text that created the token stream/AST nodes. That's invaluable for tooling.
I agree, that's the right way to do it (it's not required for a batch compiler, though, because there are no edits). Text ropes work that way as well: contents are stored in a tree, and absolute coordinates are computed from relative coordinates using an associative operation (a monoid).
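A sketch of that monoid, in my own formulation: measure each chunk of text as (full lines, columns on the last line); concatenation combines measures associatively, so a rope can cache subtree sums and map relative to absolute coordinates in O(log n).

#[derive(Clone, Copy, Default)]
struct Size {
    lines: u32,    // newlines in the chunk
    last_col: u32, // columns after the final newline
}

// Associative combine with Size::default() as identity - a monoid.
fn combine(a: Size, b: Size) -> Size {
    if b.lines == 0 {
        // `b` continues `a`'s last line: columns add up.
        Size { lines: a.lines, last_col: a.last_col + b.last_col }
    } else {
        // `b` starts fresh lines: its last column stands alone.
        Size { lines: a.lines + b.lines, last_col: b.last_col }
    }
}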
Always happy to read about compilers, Rust, or VMs. One quick and cheap trick that can help your VM, especially when you have a lot of arithmetic: explicitly represent the top of the stack in a local variable. E.g.:
// `top` is an explicit local holding the value that would otherwise
// sit on top of the data stack, so only the second operand is popped.
Instruction::Add => {
    let left = data.pop().unwrap();
    top = left + top;
    pc += 1;
}
Instruction::Subtract => {
    let left = data.pop().unwrap();
    top = left - top;
    pc += 1;
}
Instruction::LessThan => {
    let left = data.pop().unwrap();
    top = if left < top { 1 } else { 0 };
    pc += 1;
}
No questions (yet) but I found this a really good post. I'm a Rust newbie and I've coincidentally just started working through 'Crafting Interpreters' using Rust as my implementation language. Some of your code has helped clarify things in my mind.
My actual long-term goal is to build a VM in Rust and then use this to re-do the Make A Lisp project [0]. I completed this a couple of years ago using C# but felt vaguely uneasy that I was using C# to do the heavy lifting associated with garbage collection, etc.
FORTRAN has had it since 1957. But Pascal and C purged "evil computed GOTO" and only offered non-computed goto. Then Java etc. purged non-computed goto.
Not as a language feature. You can set up your code in a way that causes LLVM to emit output identical to what it would for computed gotos, but there's no guarantee of it.
You have to call the function. This adds a new stack frame, which is then torn down when the function returns. This means you're constantly going in and out, in and out...
I published a benchmark of various techniques of VM implementation today. It was conducted in C++-compliant C, but the results should be applicable to Rust. Just calling a function per instruction will always be the slowest approach, due to the work the processor has to do for each call.
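For context, here is a rough sketch (my own minimal encoding, not the benchmark's actual code) of the two baselines being compared: one indirect call per instruction versus a plain loop with a match.

struct Vm { pc: usize, acc: i64 }

// Call-per-instruction dispatch: every opcode pays call/return overhead.
type Op = fn(&mut Vm);
fn op_inc(vm: &mut Vm) { vm.acc += 1; vm.pc += 1; }
fn op_dec(vm: &mut Vm) { vm.acc -= 1; vm.pc += 1; }

fn run_calls(vm: &mut Vm, code: &[Op]) {
    while vm.pc < code.len() {
        code[vm.pc](vm); // indirect call + new stack frame each time
    }
}

// Switch/match dispatch: one branch in a tight loop, no call frames.
#[derive(Clone, Copy)]
enum Instr { Inc, Dec }

fn run_match(vm: &mut Vm, code: &[Instr]) {
    while vm.pc < code.len() {
        match code[vm.pc] {
            Instr::Inc => vm.acc += 1,
            Instr::Dec => vm.acc -= 1,
        }
        vm.pc += 1;
    }
}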
This has been a pet interest of mine for decades. Did you run any of this under perf and check where the time is going? This may be too simple (too regular) to show it, but the biggest issue with all of these is that the branch predictor doesn't have enough context to make good predictions, and you end up mispredicting a lot.
This paper https://www.cs.toronto.edu/~matz/dissertation/matzDissertati... (I found the reference in https://coredumped.dev/2021/10/21/building-an-emacs-lisp-vm-...) suggests an interesting option that isn't portable but is a lot less work than a full JIT: generate just calls and conditional branches. This was enough for them to basically fix all the branch mispredictions. I have yet to try it, but maybe someone else will feel inspired. The pain I ran into when trying is that a naively mmap'ed region will be very far from the code you are trying to call, so you won't be able to use simple instructions. The fix is to try to allocate it explicitly closer.
PS: Another technique I have used in Rust (the idea was from Cloudflare) is to generate closures and have each closure tail-call the next. It ends up very similar to your continuation passing, but at least in my benchmarks it was _slightly_ worse than the naive dispatch.
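A hedged sketch of that closure-generation technique as I understand it (my own minimal version with made-up names; note that Rust does not guarantee tail-call elimination, so the "tail calls" below can still grow the stack on long chains):

use std::rc::Rc;

struct State { acc: i64 }
type Thunk = Rc<dyn Fn(&mut State)>;

// "Compile" a straight-line program (here: a list of increments)
// into a chain of closures, each doing its work and then invoking
// the next closure directly instead of returning to a dispatch loop.
fn compile(instrs: &[i64]) -> Thunk {
    let mut next: Thunk = Rc::new(|_s: &mut State| {}); // halt
    for &n in instrs.iter().rev() {
        let prev = next;
        next = Rc::new(move |s: &mut State| {
            s.acc += n;  // this instruction's work
            (*prev)(s);  // "tail call" into the next closure
        });
    }
    next
}

Calling the closure returned by compile(&[1, 2, 3]) on a fresh State runs the whole chain and leaves acc == 6.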
I would expect that the branch predictor has issues, especially on a chip that's nearly 10 years old. However, I controlled for this as best I could. The if statements that control the VM's flow are the same for each test. The only real difference is that the switch statement has to branch to jump to the next virtual operation, while all the others use unconditional jumps to do that.