My perspective on side projects has evolved over time. When I was younger I would generally devote my time to implementing project ideas. This was fine for smaller projects, but larger ones were cumbersome (the "finishing" problem). The birth of my first child threw another huge wrench into my free time and my ability to do "work after work" (after a day of work and child-rearing, even the mental energy was lacking).

Somewhat organically, my side development has shifted into a more meditative exercise. While the day job necessarily forces consideration of prioritization, deadlines, and other concerns, my side development is free to be "pure" with respect to some goal.

These days I take small problem spaces and implement them _again_ and _again_, from scratch, trying to get a finer understanding of the problem each time. The code is not on GitHub or publicly shared - it's not really meant for outside consumption.

For example, I did several re-implementations (from scratch) of a tokenizer for a toy programming language with typical syntax, with the goal of optimizing the core tokenizer loop while still handling full Unicode input. To illustrate why I find this sort of exercise valuable, I'll expand on this example in detail.

Over the various implementations I gained a few key insights that I carry forward in my future implementation work:

1. State machines with O(1) dispatch using arrays-of-edges for transitions seem appealing at first, but are in fact a poor optimization choice. The approach assumes an even probability distribution across all transitions, while the real distribution is highly skewed. A hand-rolled approach that carries the current tokenizer context implicitly in the code location performs far better. The final design had a top-level `nextToken()` routine which checks the first character and then uses a series of conditionals to branch into subroutines for parsing individual token kinds.
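
To make that shape concrete, here is a minimal sketch in Rust (the names and the tiny token set are invented for illustration; this is not the real code):

    #[derive(Debug, PartialEq)]
    enum Token { Ident, Number, Punct(u8), Eof }

    struct Tokenizer<'a> { text: &'a [u8], pos: usize }

    impl<'a> Tokenizer<'a> {
        // Top-level dispatch: peek at the first byte, then branch into a
        // dedicated subroutine per token kind. The tokenizer "state" lives
        // implicitly in the code location, not in a transition table.
        fn next_token(&mut self) -> Token {
            self.skip_whitespace();
            match self.peek() {
                None => Token::Eof,
                Some(b) if b.is_ascii_alphabetic() || b == b'_' => self.lex_ident(),
                Some(b) if b.is_ascii_digit() => self.lex_number(),
                Some(b) => { self.pos += 1; Token::Punct(b) }
            }
        }

        fn peek(&self) -> Option<u8> { self.text.get(self.pos).copied() }

        fn skip_whitespace(&mut self) {
            while matches!(self.peek(), Some(b) if b.is_ascii_whitespace()) { self.pos += 1; }
        }

        fn lex_ident(&mut self) -> Token {
            while matches!(self.peek(), Some(b) if b.is_ascii_alphanumeric() || b == b'_') { self.pos += 1; }
            Token::Ident
        }

        fn lex_number(&mut self) -> Token {
            while matches!(self.peek(), Some(b) if b.is_ascii_digit()) { self.pos += 1; }
            Token::Number
        }
    }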

2. Rediscovered the well-known trick of using a sentinel character in the text to eliminate the "test-for-eof" branch in the inner `nextChar()` function, which is the main workhorse of any tokenizer.
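
In sketch form (assuming we own the buffer and can append one extra byte; NUL works as the sentinel because it matches no token class):

    // Copy the source and append a NUL sentinel: a valid index that
    // fails every character-class test a scanning loop performs.
    fn with_sentinel(src: &str) -> Vec<u8> {
        let mut buf = src.as_bytes().to_vec();
        buf.push(0);
        buf
    }

    struct Cursor<'a> { buf: &'a [u8], pos: usize }

    impl<'a> Cursor<'a> {
        // No explicit "are we at EOF?" branch: scanning loops simply fall
        // out when they hit the 0 byte. (Safe Rust still bounds-checks the
        // index; the win is removing the tokenizer's own EOF logic.)
        #[inline]
        fn next_char(&mut self) -> u8 {
            let b = self.buf[self.pos];
            self.pos += 1;
            b
        }
    }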

3. Pushing the parsing of full Unicode entirely out of the fast path by leveraging the fact that the first byte of a multi-byte UTF-8 sequence will fail any test for an ASCII character or range. This led to a design where, instead of a single `nextChar() -> Unichar` interface, we split the methods. The fast-path method is `nextAsciiChar() -> MaybeAscii`, which blindly returns a type-wrapped `u8` value. The value is then sent through the series of fast-path checks in the main control flow. If those checks fail, two methods handle the slow path: `unreadAsciiChar(MaybeAscii)`, which can simply do a blind decrement on the current text cursor, and `nextFullChar(MaybeAscii) -> Unichar`, which reads the full Unicode character without unreading the first byte.
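
Roughly that interface, sketched in Rust (snake_case versions of the names above; validation and error handling are elided):

    // A blindly-read byte: ASCII if < 0x80, otherwise the lead byte of a
    // multi-byte UTF-8 sequence, which fails every ASCII class test.
    #[derive(Clone, Copy)]
    struct MaybeAscii(u8);

    struct Chars<'a> { buf: &'a [u8], pos: usize }

    impl<'a> Chars<'a> {
        // Fast path: return the next raw byte with no decoding at all.
        #[inline]
        fn next_ascii_char(&mut self) -> MaybeAscii {
            let b = self.buf[self.pos];
            self.pos += 1;
            MaybeAscii(b)
        }

        // Slow path, part 1: put the byte back (a blind decrement).
        #[inline]
        fn unread_ascii_char(&mut self, _c: MaybeAscii) {
            self.pos -= 1;
        }

        // Slow path, part 2: finish decoding the multi-byte scalar whose
        // lead byte we already consumed (sequence-length check only; a
        // real decoder would validate continuation bytes).
        fn next_full_char(&mut self, lead: MaybeAscii) -> char {
            let len = match lead.0 {
                b if b >= 0xF0 => 4,
                b if b >= 0xE0 => 3,
                _ => 2,
            };
            let start = self.pos - 1;
            self.pos = start + len;
            std::str::from_utf8(&self.buf[start..self.pos])
                .ok()
                .and_then(|s| s.chars().next())
                .unwrap_or(std::char::REPLACEMENT_CHARACTER)
        }
    }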

4. Realizing that it's better to work on a temporary copy of the current cursor during a single invocation of `nextToken()`. The memory write at every `nextChar()` can then be eliminated and replaced with a single write when a token has been successfully parsed (or an error reported). A smart compiler can sometimes eliminate in-place cursor updates, but given that we're hand-rolling the tokenizer (see point 1), the code gets large enough, and contains enough loops at various points (e.g. to parse numbers and identifiers), that we cannot expect the compiler to both inline everything and eliminate every spurious write-back of the current cursor position. This prompted a modification of the design: the methods are implemented not directly on the tokenizer but on a temporary `TokenParser` value type, effectively a `(&Tokenizer, *u8)`, which is moved around by value through the control flow and written back and destroyed when a token is parsed (or an error occurs).
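
A sketch of that value-type shape (simplified; the identifier scan stands in for real token parsing, and the buffer is assumed to carry the sentinel from point 2):

    struct Tokenizer { buf: Vec<u8>, pos: usize }

    // Borrows the buffer but owns a copy of the cursor: the hot loop
    // mutates a local value (likely kept in a register) rather than a
    // field the compiler must conservatively write back to memory.
    struct TokenParser<'a> { buf: &'a [u8], pos: usize }

    impl<'a> TokenParser<'a> {
        #[inline]
        fn peek(&self) -> u8 {
            self.buf[self.pos] // relies on the trailing sentinel byte
        }
    }

    impl Tokenizer {
        // Scans one identifier-like run and returns its length.
        fn next_token(&mut self) -> usize {
            let mut p = TokenParser { buf: &self.buf, pos: self.pos };
            let start = p.pos;
            while p.peek().is_ascii_alphanumeric() {
                p.pos += 1;
            }
            self.pos = p.pos; // the single write-back per token (or error)
            p.pos - start
        }
    }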

5. Optimizing the parsing of keywords (which show up as identifiers) was interesting. The trick I used here was to keep track of the cumulative `xor` of the low 4 bits of each identifier byte. At the end of parsing an identifier, this cumulative value is fed to an explicitly coded dispatch which switches directly to the subset of keywords matching that (admittedly terrible) 4-bit hash value. A better hasher did not justify its own computation cost.
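
Sketched with a hypothetical three-keyword language (the keywords, and therefore the hash values, are invented for illustration):

    // Cumulative xor of the low 4 bits of each byte: a deliberately cheap
    // 4-bit "hash" that can be folded into the identifier scan loop.
    const fn ident_hash(s: &[u8]) -> u8 {
        let mut h = 0u8;
        let mut i = 0;
        while i < s.len() {
            h ^= s[i] & 0x0F;
            i += 1;
        }
        h
    }

    const H_IF: u8 = ident_hash(b"if");   // 0x0F
    const H_FN: u8 = ident_hash(b"fn");   // 0x08
    const H_LET: u8 = ident_hash(b"let"); // 0x0D

    #[derive(Debug, PartialEq)]
    enum Kw { If, Fn, Let, NotAKeyword }

    // Dispatch on the hash first; each arm compares against only the
    // keywords sharing that hash (just one each here - a collision would
    // simply put several comparisons in one arm).
    fn classify(ident: &[u8]) -> Kw {
        match ident_hash(ident) {
            H_IF if ident == b"if" => Kw::If,
            H_FN if ident == b"fn" => Kw::Fn,
            H_LET if ident == b"let" => Kw::Let,
            _ => Kw::NotAKeyword,
        }
    }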

6. Reordering all of the sequences of conditionals using a statistical analysis of the character frequency distribution at each decision point, measured over typical source.
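
In miniature (the frequencies in the comments are invented; the point is that the measured distribution, not logical grouping, dictates the test order):

    // Most common character classes first, as measured over a corpus of
    // typical source; rare characters fall through to the end.
    fn classify_first_byte(b: u8) -> &'static str {
        if b.is_ascii_alphabetic() || b == b'_' { return "ident"; }  // e.g. ~55%
        if b == b' ' || b == b'\n' || b == b'\t' { return "space"; } // e.g. ~25%
        if b.is_ascii_digit() { return "number"; }                   // e.g. ~10%
        if matches!(b, b'(' | b')' | b'{' | b'}' | b';' | b',') { return "punct"; }
        "other"
    }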

That whole process took about a year of casual work. I didn't drive myself to complete it - I simply let the problem sit in my mind and percolate, trying new ideas and implementation strategies as time allowed. The insights I derived from the exercise are something I consider very valuable.

A tokenizer is a simple thing, conceptually. It's something a sufficiently intelligent first-year CS student should be able to whip up, and something I myself have done for various pragmatic reasons several times. But the meditative exercise of "take a small thing, make it faster, then make it faster, then make it faster still, then faster yet" has provided me with real insight into all the nooks and crannies of the problem space. Something that seemed trivial at first yielded more and more depth the more seriously I analyzed it.

This approach to spending your "side-time" is not appropriate for everyone. If you want to ship stuff, and that's your motivation (a perfectly reasonable and fine motivation), this does nothing for you. If you have a lot of extra time and the energy to complete large projects independently of your employment, that might well be a better choice.

But for those of you who find yourselves in the same position as me: not enough time for big projects, not enough interest in small toy projects... perhaps treating your side development as a meditative exercise on a focused problem will work for you.

For me personally, it brings a lot of gratification, because I value the understanding I've gained far more than the litter of toy projects I produced in my younger days.



