One of the moments where I really started to feel like I was 'seeing the matrix' was when I was working on a regex engine to try to make my compiler faster (it didn't, but that's another story). The asymptotically fast way to approach regex processing actually involves writing a parser for the regex itself, so in order to write a fast compiler, you end up writing another small compiler to process the regexes that will process the actual programs you write. And if your regexes get complex enough, you really should write a proper parser for them too. This is where you realize that it's parsers all the way down.
When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.
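To make that concrete, here is a toy sketch of the idea in Python (made-up opcodes, only literal characters and a postfix '?' supported, nothing like a production engine): the regex is "compiled" to a flat list of tiny instructions, and a little VM runs them.

    # Toy regex "compiler" + VM: handles only literal characters and a
    # postfix '?', with made-up opcodes (CHAR, SPLIT, MATCH).
    def compile_regex(pattern):
        prog = []
        for ch in pattern:
            if ch == '?':
                last = prog.pop()
                prog.append(('SPLIT', 2))   # either skip the next instruction...
                prog.append(last)           # ...or run the optional CHAR
            else:
                prog.append(('CHAR', ch))
        prog.append(('MATCH',))
        return prog

    def run(prog, text, pc=0, pos=0):
        op = prog[pc]
        if op[0] == 'MATCH':
            return pos == len(text)
        if op[0] == 'CHAR':
            ok = pos < len(text) and text[pos] == op[1]
            return ok and run(prog, text, pc + 1, pos + 1)
        if op[0] == 'SPLIT':                # try both branches (backtracking)
            return run(prog, text, pc + 1, pos) or run(prog, text, pc + op[1], pos)

    print(run(compile_regex('ab?c'), 'ac'), run(compile_regex('ab?c'), 'abc'))  # True True

Turn the compiled list on its side and it really does read like a tiny assembly program.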
I mean, the whole point of regexes (not to be confused with PCREs) is that any given regex is isomorphic to some canonical finite state machine. It is, specifically speaking, a tiny description of an FSM over the alphabet of ASCII characters (or whatever charset you're using).
Interestingly, regexes/FSMs are (IIRC) the most powerful class of machines for which equivalence is decidable. So if you give me any two regexes, I can tell you if they match on all the same strings, but this is not true for any more powerful grammar.
Isn't it possible to decide the equivalence of deterministic pushdown automata? Wouldn't DPDAs be considered more powerful than FSMs due to the addition of a stack?
Wikipedia [1] shows there's a paper called "The equivalence problem for deterministic pushdown automata is decidable" that won the Gödel Prize in 2002. I haven't read the paper nor do I currently have access to it though.
Presumably that means "this is not true for any more powerful class of grammars".
Any particular pair of grammars from that class. I mean, if you showed me a yacc grammar for Pascal and another one for Java, I am sure I could prove they were inequivalent.
Trickier: if you gave someone clever a hand-written Pascal parser and that same yacc grammar, she could probably prove equivalence with some effort. But there is no guarantee that this is always possible.
> I mean if you showed me a yacc grammar for Pascal and another one for Java,
This is why I said "any", and not "all". In most practical cases, you can tell if two grammars are different, but it's easy to construct cases where you can't.
While that is true for strict regexes (regular languages / FSMs), it's hardly true for any contemporary implementation of regexes. They have backreferences and other constructs that make them more powerful than a regular language, and thereby make it impossible to decide equivalence in the general case.
PCREs are Perl Compatible Regular Expressions, they have backreferences and other constructs. Almost all languages that have Regular Expressions have similar constructs. "Computer Science" Regular Expressions are more limited and implement a "regular language" as defined in https://en.wikipedia.org/wiki/Chomsky_hierarchy. This gives some guarantees, which go out the window with PCREs.
> When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.
A regular expression by definition describes a regular language, the least powerful class in the Chomsky hierarchy, realizable by a finite state automaton.
You can get something close to a practical equivalent of a Turing-complete language, because the tape of a real Turing machine, contrary to the idealized theory, is limited by resources. CPUs are finite state automata, yet we use them to solve problems of sufficiently low complexity in NP space.
I am not an expert, please correct me if I am wrong.
The first step in compilation is lexing -- converting a character stream to a stream of semantic "tokens", where a token might be "a number literal" or "the 'while' keyword" or a single character token like "{". This process is usually done via regexes.
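For illustration, a throwaway Python sketch of that style of lexer (the token names and patterns here are invented for a toy language, not taken from any real compiler):

    import re

    # Hypothetical token set; order matters because the first alternative wins.
    TOKEN_SPEC = [
        ('WHILE',  r'while\b'),
        ('NUMBER', r'\d+'),
        ('IDENT',  r'[A-Za-z_]\w*'),
        ('LBRACE', r'\{'),
        ('RBRACE', r'\}'),
        ('SKIP',   r'\s+'),
    ]
    MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

    def lex(source):
        # Anything the master regex doesn't match is silently skipped here;
        # a real lexer would report an error with a line/column.
        for m in MASTER.finditer(source):
            if m.lastgroup != 'SKIP':
                yield (m.lastgroup, m.group())

    print(list(lex('while { 42 }')))
    # [('WHILE', 'while'), ('LBRACE', '{'), ('NUMBER', '42'), ('RBRACE', '}')]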
It is rare for lexers to use actual regex engines, in my experience: it's impractical if you want to do proper error reporting unless you write your own regex engine, at which point people tend to opt for hand-writing lexer rules instead.
It's not unheard of, and I have the impression it's getting more common, but still only a small proportion of the compilers I've seen over the years explicitly use regexps.
Most tools like that reduce the regexes to DFAs or similar, but most production compilers I've seen don't use such tools because they're a pain to do proper error reporting with.
Yea, we're not talking about compilers, we're talking about lexers. Regex doesn't even make a little bit of sense in a compiler world. What lexer are you working with?
There are other uses, certainly, like parsing various configuration formats, but those tend to rely either on tool-generated lexers (which usually compile down to DFAs) or on handwritten lexers that may or may not use regexes; there too it's fairly uncommon to see much regex use.
I custom write all of my lexers for exactly the reasons I outlined above.
According to textbooks, lexers are the bottleneck because they are the only part that has to literally look at every byte of the source (I am including hashing as part of the lexing process).
I am not sure if this is true any more. It probably depends on the language.
Even as CPUs grow faster, code bases grow bigger, so the number-of-bytes argument is still important. On the other hand, heavy optimisations and difficult languages like C++ will shift the bottlenecks to later stages.
I think I read that too, but I am not sure it is actually true. Please correct me if I am wrong, but according to the Rust developers, code generation seems to be the bottleneck. That's why they are working on incremental compilation to improve compilation times. They detect changes in the input files by comparing the AST to the old version. So parsing is always necessary, but later stages can be cached. That seems to confirm that parsing is not the bottleneck, at least not for Rust.
OTOH that probably also depends on your use case; JS, for example, needs to get parsed at every page load. Some JS VMs only parse the source code lazily: at first the parser only detects function boundaries, and only if a function is really executed does it get parsed completely.
> I think I read that too, but I am not sure that this is actually true. Please correct me if I am wrong, according to the Rust developers code generation seems to be the bottleneck.
You're both right. The problem with C++ is how '#include' works: it just pastes in the content of all included files, so there's far more overhead on the lexing side.
Rust's "equivalent" of '#include' is 'use', which doesn't have this problem, because it doesn't concatenate files.
You can make codegen faster by doing less work and producing less sophisticated code. You can't not look at every byte. It's the limiting factor in fastest possible compilation, not what takes the most time in most production compilers.
I suspect that modern languages try to fix some of this by making the syntax unambiguous, which means you only need to tokenize each file exactly once. Compare with older languages, where if you change a header/module/etc. you need to reparse the whole shebang.
Possibly Rust also moves a bunch of the grunt work into the code generator. Whereas in C/C++, by the time you're generating code all your type information is set in stone, in Rust a bunch of stuff possibly isn't resolved yet.
Rust uses LLVM, a heavyweight backend with fearsomely thorough optimisation.
It might be that for non-optimised output, the parser becomes the bottleneck again. Or it might be that LLVM just imposes overhead even for the unoptimised case, that pays for itself in the optimised case.
Depending on how source programs are split up into files, parsing can be easily parallelized (think one thread per file), while other compilation tasks are harder due to interdependencies. E.g. semantic analysis requires building type representations, global namespaces, etc. Code generation is (usually) parallelizable as well, but there are a couple very serial steps in the middle, too.
My experience is that parsers for source languages can reach into the 1-10 MB/s range, and depending on how complex the IRs and transformations are after that, code generation is usually around 0.5-5 MB/s. The stuff in the middle (dealing with IRs) is harder to measure in terms of bytes.
Perhaps in an ancient one-pass C compiler. Modern compilers build many trees, usually one for each pass. Later stages do expensive operations on sets (even without optimization one has to do basic register allocation), so I'd say the lexing stage is entirely negligible.
C++ compilers (at least on most current code bases that don't use modules) still need to lex and parse headers, though -- so actual executable code is a small fraction of what they chew through. Most compilations are also debug builds which don't optimize (though creating debug information can also be slow; the really slow step is usually linking).
> But I can't imagine a lexer would ever be the performance bottleneck in a compiler.
one rather trivial way to observe the effects of avoiding building a source file is to use ccache. ccache avoids the recompilation (even if you do a make clean, or some such), and it is not uncommon to observe speedups of a factor of 5 or so.
however, once you have crossed that barrier, you hit the linking wall. which is where you would end up spending a large portion of time. gold (https://en.wikipedia.org/wiki/Gold_(linker)) optimizes that i.e. supports incremental linking, but unfortunately, i haven't had any experience in large code-bases where it is being used.
Lexer generators typically allow describing rules with regular expressions, but through Thompson's construction, subset construction, and DFA minimization, you can end up with a blazing fast/optimized DFA that processes your input at maximum speed.
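To give a feel for what the generated matcher boils down to, here is a hand-written toy DFA in Python (not the output of any actual tool) whose hot loop is just one table lookup per input character:

    # Toy minimized DFA for [0-9]+(\.[0-9]+)? -- integers and simple decimals.
    # States: 0 = start, 1 = integer part, 2 = just saw '.', 3 = fraction part.
    ACCEPTING = {1, 3}

    def step(state, ch):
        if ch.isdigit():
            return {0: 1, 1: 1, 2: 3, 3: 3}[state]
        if ch == '.' and state == 1:
            return 2
        return None  # dead state: no transition

    def matches(text):
        state = 0
        for ch in text:
            state = step(state, ch)
            if state is None:
                return False
        return state in ACCEPTING

    print(matches('3.14'), matches('42'), matches('3.'))  # True True False

A generated lexer does the same thing with a dense transition table indexed by (state, byte), which is why the inner loop can run at essentially memory speed.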
* Languages (bytecode interpreter, AST interpreter, parser/lexer for a simple language, simple JIT, understanding instruction scheduling)
* DSP programming (writing programs for fast, branchless math)
* Comfort with binary and binary formats (start with packfiles .zip/.tar, move onto reverse engineering simple formats for e.g. games)
* Understanding the difference between RAM and address spaces (e.g. understanding virtual memory, mmap, memory-mapped IO, dynamic linking, the VDSO, page faulting, shared memory)
* Device drivers (easier on Linux, understanding userspace/kernel interaction, ioctls, how hardware and registers work, how to read spec sheets and hardware manuals)
* Graphics (modern software rasterizer that's not scanline-based, understanding 3D and projective transforms, GPU programming and shaders, basic lighting (reflection and illumination) models, what "GPU memory" is, what scanout is, how full scenes are accumulated all along the stack)
I could go heavily into depth for any one of these. Ask me questions if you're interested! They're all fun and I always have more to learn.
Also, the longer you get into any given track, the more you realize they all connect in the end.
2D renderers? Do you mean rendering 2D shapes? In that case, scanline is probably the most efficient thing you can do, but most renderers are not scanline. The only modern 2D scanline renderer I know of is in the Flash Player code.
Yes, 2D shapes. I'm writing vector drawing software and I use AGG as the backend (which uses scanlines). In the long term it could be a good thing to use or implement a more modern backend.
Modern, fast 2D rasterization is a really hard problem, one which I'm currently writing a long-form article on. The classic Red Book stencil trick + Loop-Blinn from 2005 is the current state of the art, I think. I'd love to do more research there.
In terms of a software rasterizer, which can often be faster, it's just a matter of elbow grease and engineering. Use SIMD. Implement tons of special cases.
Write a simple polygon plotter to get you started -- just fill a triangle with white with simple edge functions. Then work your way up to arbitrary polygons. And then look into the theory behind antialiasing and such.
I learned by reverse engineering DSP code in another platform which was used for obfuscation. I had a chat with a friend and he suggested the TI DSP starter kits. I'm not familiar with any of them, though.
A basic game engine! I was exposed to so many concepts over time, building on a code base I understood from the ground up.
It was the first time my code (c++ no less!) felt completely deterministic, that I understood every piece of data down to the byte level, heap/stack/gpu locality at any point of execution, and especially the lifecycle and memory management.
If anyone is interested in creating an indie/hobby game engine, I'd recommend the following:
- game loop with independent render FPS and a constant simulation tick delta (otherwise it runs non-deterministically due to floating point; see the sketch after this list)
- "entities, components, systems" architecture (ECS)
- data-driven content (load level and game object data from JSON or similar, and programmatically build your scene)
- basic event or messaging system
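A rough Python sketch of that first point, the fixed-timestep loop (update/render/is_running are hypothetical callbacks you'd supply; the accumulator is the key part):

    import time

    TICK = 1.0 / 60.0   # constant simulation delta keeps the simulation deterministic

    def game_loop(update, render, is_running):
        previous = time.perf_counter()
        accumulator = 0.0
        while is_running():
            now = time.perf_counter()
            accumulator += now - previous
            previous = now
            # Step the simulation in fixed increments, independent of render FPS.
            while accumulator >= TICK:
                update(TICK)
                accumulator -= TICK
            # Render as often as we can; pass the leftover fraction for interpolation.
            render(accumulator / TICK)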
Honestly, I can't think of a field that covers as wide a range of CS topics at the depth I've seen in lower level game development.
If you focus on understanding and implementing the popular patterns and algorithms recommended by the indie community, it's not the most daunting of projects. There's so much great information, and a few good books that break down and walk through a working implementation you can build on once you understand it from the ground up.
SFML is a C++ library that handles window management, input, sound, etc...
For reference, a friend and I went through the book chapter by chapter a couple years ago. He was new to programming, and I'd made a few simple games and read scattered info in the past. The book had a good pace and filled in a lot of the gaps I had.
This book is a lot deeper, but I found it a great reference. I probably spent more time reading this book and going back over my engine code to refactor based on what I learned (rather than using it as a direct example implementation).
At that stage, there were a few good articles online that were focused on the ECS pattern. Here's a couple, but try to find something that matches your language, there's plenty of info out there.
I really recommend a raytracer, especially to anyone interested in graphics. It's straightforward, powerful, infinitely expandable with optional features, and opens up a ton of discussion about performance, code complexity, and general organization. Plus it's fun, in an instant gratification kind of way.
Every now and then I get interested in demoscene programming. I've never even been able to get a triangle to render on the screen - except with something like XNA.
Do you think there's any value in going back to say, DOS based VGA programming? People in #osdev thought I was a bit strange for wanting to write a bootable kernel that only put pixels on the screen, but I really enjoy the idea of starting with plotting pixels, then moving on to more complex effects.
Are you more interested in actually writing to PC VGA registers, or just the idea of writing a pixel-by-pixel renderer? There are bindings for SDL and SFML in a lot of languages, and the web technology option mentioned by the sibling comment.
If you like the idea of "DOS-based VGA programming" (taken literally), you could also find a DOS compiler or assembler and run it in DosBox. All of IBM's hardware-level documentation can still be found online: http://www.mcamafia.de/pdf/pdfref.htm
Or with different desktop GUI toolkits for different languages: C++ w/Qt, C++ w/ wxWidgets, wxPython, Perl/Tk, Python +Tkinter, Delphi, Lazarus, or many other lang/toolkit combos. Even many BASICs have that ability built-in (from the early days of personal computers).
The value of DOS based VGA programming is that, although hard things are damn near impossible, easy things are easy. https://www.mail-archive.com/kragen-hacks@canonical.org/msg0... is a 64-byte .COM file I wrote in assembly that draws circles on the screen; you can get to mode 13h with two instructions, and then you have a linear framebuffer with 64000 bytes, one byte per pixel, at your disposal located at A0000h.
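(If you just want to play with that addressing without touching DOS, the offset math is trivial to mimic with a plain byte array; this Python sketch only simulates the layout, it doesn't touch real VGA hardware.)

    # Simulated mode 13h framebuffer: 320x200, one byte per pixel, linear layout.
    WIDTH, HEIGHT = 320, 200
    framebuffer = bytearray(WIDTH * HEIGHT)   # stands in for the 64000 bytes at A000:0000

    def put_pixel(x, y, color):
        framebuffer[y * WIDTH + x] = color    # same math as poking A0000h + y*320 + x

    put_pixel(160, 100, 4)                    # color 4 is red in the default VGA palette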
So there are some awesome things about this:
- It's really easy to get started. You can literally take the code on that page, assemble it, verify that you get the right hexadecimal output, and run it in dosbox.
- Just about anything you do in the segment you point at the framebuffer will make something appear on the screen and not crash your program. So your bugs will be harder to debug, but they will also be cool.
- For example, running a pointer off the bottom of the screen puts you back on top of the screen instead of in the middle of your stack or heap or something. There are 4 lines of pixels that aren't shown, but often you don't care.
- The pixels are more or less square.
- Since each pixel is one byte, pixels correspond nicely to addressable memory locations.
- The first few colors in the default VGA palette are primary colors.
- Once you draw something, you can get awesome effects by changing the palette. Color-cycling, for example. This was really cool when actually redrawing the screen sixty times a second was computationally infeasible.
There are also a lot of shitty things about it:
- 320×200 is not very much resolution. Nothing will look sharp, except possibly the corners of your pixels.
- Palettes make it easy to draw something, and they even provide a kind of high-level interface that lets you kind of search and replace colors, in a way. (It's really more like macro-expanding your colors I guess.) But they make gradients and alpha-blending really hard to achieve, unless you stick to grayscale or sepia or something. TrueColor video modes are much better for those.
As for triangles, I'm sure you can figure out how to do them, but keep in mind that geometric algorithms easily turn into a rat's-nest of special cases. I've written polygon-filling routines a few times, and they can easily have the same algorithmic complexity as an entire ray tracer, which looks a hell of a lot cooler.
having written one as a slightly larger python thingy, i can fully attest to that.
now that it is kind of done, i want to make it faster :) for example, having a home-grown vector_3d class kind of sucks (performance wise). it might be better to have vector_3d be actually based on, say, numpy.array ? once that is done, and i start seeing some real improvement, it might be possible to go the other route as well i.e. write the hot-spots in c++/c, and interface that with python.
or go with lua all the way ? or maybe try a hybrid approach (which would allow you to see how embeddable the language really is)
possibilities are endless, and as you so rightly said, gratification is instantaneous :)
Simply replacing your own vector_3d class with numpy.array() won't actually speed you up that much as the overhead of creating all the tiny 3 element arrays kills you (I think I got only a 2x speed up from going from a pure python vector_3d to numpy arrays). Numpy is optimised for the creation of a small number of big arrays.
The massive enormous speed ups come from creating massive arrays and implicitly working on them in parallel. So instead of iterating through every pixel and creating an origin and direction vector for each one you create a Total_number_of_pixels X 3 numpy array and pass that to your ray tracing function. Due to the way numpy broadcasts arrays the amount of rewriting you need to do is incredibly minimal and the speed ups over pure python are astronomical.
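A rough sketch of what that batched style looks like (the pinhole-camera setup and sphere here are invented, just to show the broadcasting):

    import numpy as np

    # Batched ray generation: one (H*W, 3) array of directions instead of a
    # Python loop creating H*W little 3-vectors.
    H, W = 480, 640
    xs, ys = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    directions = np.stack([xs, ys, np.full_like(xs, -1.0)], axis=-1).reshape(-1, 3)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Broadcasting in action: intersect every ray with one sphere at once.
    center, radius = np.array([0.0, 0.0, -3.0]), 1.0
    origin = np.zeros(3)
    oc = origin - center                 # (3,) broadcasts against (N, 3)
    b = 2.0 * directions @ oc            # (N,)
    c = oc @ oc - radius * radius        # scalar
    hit = b * b - 4.0 * c >= 0.0         # (N,) boolean mask, no Python loop
    print(hit.sum(), "of", hit.size, "rays hit the sphere")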
> The massive enormous speed ups come from creating massive arrays and implicitly working on them in parallel. So instead of iterating through every pixel and creating an origin and direction vector for each one you create a Total_number_of_pixels X 3 numpy array and pass that to your ray tracing function. Due to the way numpy broadcasts arrays the amount of rewriting you need to do is incredibly minimal and the speed ups over pure python are astronomical.
thank you for your insight! this is very useful indeed.
Just to give an idea of the speed up from going from processing individual arrays to batch processing - an Image I was rendering went from 2 hours to 20 seconds (and that was with further obvious speed ups left on the table as well).
Oh, and make sure you're using 64-bit version of python as you'll be slinging multi-gigabyte arrays around all over the place.
I haven't found a good introductory yet complete resource about building a complete database. I'm on the hunt for one (I'm also building a relational language).
While not exactly what you're looking for, The Definitive Guide to SQLite takes you from writing a SELECT statement all the way to an overview of SQLite internals. O'Reilly also published a booklet called Inside SQLite that goes a bit deeper into the subject. I suggest SQLite because the source code is superb (seriously, some of the most readable, most logically organized and best commented C code you'll ever see) and it's a fairly small codebase with a huge community.
And then there is Database Systems: The Complete Book, where the second half of the book offers in-depth coverage of the implementation of database systems.
This is one area that looks more impenetrable. You can find good info about compilers and even on building "computers" from scratch, but databases are black magic.
Especially because RDBMSs look like they're all or nothing. I wonder if it's possible to pick "only" the transactional/storage layer and build the query/optimizer/API on top myself. I would be happy if it were possible to build on top of something like SQLite; unfortunately, I don't know C at all (I work with F#, C#, Delphi, Python).
On the other hand, knowing the basic steps could be good enough...
Same category: probably not a great idea to use it and claim it's better than what's out there (it won't be), but boy howdy will it teach you about distributed systems and pitfalls that will absolutely help you when you are working on one at your day job.
Seconded. One of my plethora of hobby projects has been an interpreter or compiler (it changes from month to month) for Common Lisp, for an esoteric platform. It's a huge amount of fun, and I have learnt so much about software from it.
Hmm, a rasterizer was just a starting point for me on the way to understanding "how a browser works".
Having just a rasterizer is like having a naked VM without a bytecode compiler.
So I decided to add HTML/CSS engines to it.
But if you have HTML/CSS, then why not add scripting to it? So I added a VM executing bytecodes, a GC, and a compiler producing those bytecodes. Having a VM, I thought it would be cool to have built-in persistence - you need to persist UI state somehow, right? So it got an integrated NoSQL database on board.
Reading from a raw filesystem dump was both interesting and rewarding as it works on a problem domain that exists in reality, not just on some simplified playground. If you target something as primitive as FAT, the challenge isn't terribly high.
A screen-oriented text editor with undo and search/replace. Can you make it handle a 10MB file without "simple" operations having annoying delays? How about 100MB? 1GB? 100GB? With no line breaks in the file?
With both emacs and vim you can certainly edit a 500MB file. I once concatenated all the sources of the then-recent Linux kernel 4.6.3. From what I remember it was all .c, .h and .S files. The resulting file is 542MB and 19'726'498 lines. I did it to test the text editor I am working on.
Some stats:
$ time vim -c 'edit kernel-total' -c q
real 0m8.162s
user 0m7.687s
sys 0m0.398s
$ vim kernel-total ^Z
$ ps aux | awk "/kernel-total$/ || NR == 1"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hawski 10467 2.7 17.1 805120 670988 pts/2 T 17:41 0:07 vim kernel-total
$ time emacs kernel-total -f kill-emacs
real 0m7.155s
user 0m6.869s
sys 0m0.237s
$ emacs kernel-total ^Z
$ ps aux | awk "/kernel-total$/ || NR == 1"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hawski 10825 43.8 14.8 857484 581152 pts/2 T 17:47 0:07 emacs kernel-total
With vim editing is quite smooth and with emacs it feels laggy. I only edited a bit, like entering a new line here and there, moving a few pages down, going to the end of the file and to the top again. Saving is sluggish.
I don't think so. Can a text editor do search and replace on a 100 GB file without a delay? You could construct an index, but that's not going to happen instantaneously either.
> Can a text editor do search and replace on a 100 GB file without a delay?
I meant that things like simply scrolling through a few pages or inserting/deleting a single character in the middle of the file shouldn't cause noticeable delays. Doing an operation that spans the entire file is different.
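One classic way to keep single-character edits cheap no matter how big the file is, is a gap buffer (ropes and piece tables are the other usual suspects). A toy Python sketch of the idea:

    class GapBuffer:
        """Toy gap buffer: inserts at the cursor are O(1) (until the gap fills),
        regardless of file size; only moving the cursor copies bytes."""
        def __init__(self, text=b'', gap=1 << 16):
            self.buf = bytearray(text) + bytearray(gap)
            self.gap_start = len(text)
            self.gap_end = len(self.buf)

        def move_cursor(self, pos):
            # Slide the gap so it sits at `pos`; cost is proportional to the
            # distance moved, not to the file size.
            if pos < self.gap_start:
                chunk = self.buf[pos:self.gap_start]
                self.buf[self.gap_end - len(chunk):self.gap_end] = chunk
                self.gap_end -= len(chunk)
                self.gap_start = pos
            elif pos > self.gap_start:
                n = pos - self.gap_start
                chunk = self.buf[self.gap_end:self.gap_end + n]
                self.buf[self.gap_start:self.gap_start + n] = chunk
                self.gap_start += n
                self.gap_end += n

        def insert(self, byte):
            self.buf[self.gap_start] = byte
            self.gap_start += 1   # (a real editor would grow the gap when it fills up)

        def text(self):
            return bytes(self.buf[:self.gap_start] + self.buf[self.gap_end:])

    gb = GapBuffer(b'hello world')
    gb.move_cursor(5)
    gb.insert(ord(','))
    print(gb.text())  # b'hello, world'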
Because if you're familiar with conventional real register architectures, the stack VM forces you to rethink it, in a much simpler way. It's a useful simplifying abstraction.
There used to be a lot of interest in, and articles about, stack machines (real ones, not VMs) back in the days of mags like BYTE and PC Mag. Good reading.
Actually, implementing your own crypto for real is not so crazy. The best algorithms are surprisingly simple, and easy to make side-channel resistant. The only real difficulty I have so far is with modulo arithmetic on big numbers (for elliptic curves and one time authentication).
With proper test vectors and some code review, crypto is in fact quite easy. More dangers lie in the use of crypto, especially when the API is flexible or complex. And of course good old exploitable bugs.
No. Never roll your own crypto for production. If you only know one thing about security as a programmer, then it has to be this.
But you are right that crypto algorithms aren't specifically hard to implement compared with, say, computer graphics, image processing, gui programming or whatever.
But! The big difference is that crypto is attacked by other smart people. The bugs and design flaws in your scientific computing library are not hunted down by intelligent agents to purposefully break it. If people attacked your Paint clone to the same extent they attack computer security programs, then it would become just as hard to write them correctly as it is with crypto.
There are so many kinds of ways to get it wrong that beginners don't even know about. It's an "unknown unknowns" situation.
Have you seen the size of the reference implementations for chacha20, blake2b, or poly1305? Those are tiny, a couple hundred lines for all three! There is very little room for errors to sneak in. Between test vectors, compiler warnings, and Valgrind, we can be pretty sure all bugs have been shaken out. Add in good old code review and the safety of this stuff is pretty much guaranteed.
Of course, I would never make the mistake of rolling my own OpenSSL clone (too big, bad API, some bad primitives), or even my own AES implementation (slow or prone to timing attacks).
Of course, I don't make the mistake of going blind. I read the literature before making any choice. If I don't know why some construction is considered safe, I don't use it. Due diligence first. And I certainly don't implement the first primitive that my search engine gets its hands on. I go for the most vetted alternatives.
Even experts wouldn't do this on their own. It takes teams and a lot of time. A lot of thought can go into a few hundred lines of code. It needs math oriented people, it needs OS-oriented people, people who are experienced with exploiting systems.
As I said, it's good to do it for fun and to learn, but there's no reason to do it for production. It's almost always a case of ignorance and overconfidence.
The problem is that in computer security defense is a lot harder than offense. If you make just one mistake, it can often lead to a compromised system.
Anyway, there's a lot of discussion on this topic around security.stackexchange, quora, reddit etc. The consensus is that it's a terrible idea.
Did you miss the part where I mentioned code review? And the one where I said I would never invent my own primitives?
> A lot of thought can go into a few hundred lines of code.
I can attest to that. I can also confirm that most such thought went into the design of the primitive itself. Most implementations in pure C are pretty straightforward, almost naive. Seriously, take a look.
> It needs math oriented people,
Mostly to design the primitives, and perform modulo arithmetic on huge numbers. That's about it.
> it needs OS-oriented people,
Only if you're writing the random number generator of an OS kernel. For the rest, portable C code is enough. You don't even need heap allocation.
> people who are experienced with exploiting systems.
Not quite. You need people who are able to write correct code, thus avoiding undefined behaviour. By the way this applies to everything, not just crypto.
> there's no reason to do it for production.
Dependency management. It's not a very good reason, but if you only use one or two primitives, importing libsodium may not be worth it.
> The problem is that in computer security defense is a lot harder than offense. If you make just one mistake, it can often lead to a compromised system.
It doesn't have to be the crypto library's fault. A single buffer overrun anywhere, and you risk giving your keys to the attacker, if it didn't already serve up the whole server on a silver platter. You won't be secure just because you use OpenSSL or libsodium as intended. Even if your crypto library is bug free and immune to side-channel attacks.
The greater risk is not crypto code. It's user code. Implementing your own crypto code is not going to increase the risk by much —or at all, if you're sane enough to turn on warnings, use Valgrind/Purify, write tests, and have your code reviewed.
> Anyway, there's a lot of discussion on this topic around security.stackexchange, quora, reddit etc. The consensus is that it's a terrible idea.
With all due respect, fuck consensus. Read the expert's writings. Daniel J. Bernstein in particular has written some very good stuff on why he started to write the NaCl library. Or on how he designed his primitives to be simple, and as such easily analysable by his peers. Or on how easily one can avoid timing attacks (hint: don't use AES).
I'm not saying it's impossible to do. It depends on how "deep" you want to go. "Rolling your own crypto" can have different meanings. It can mean designing a new hash algorithm or encryption algorithm, or opening up a crypto book and reading some pseudocodes and formulas and then implementing them, or it can mean something more high level.
The higher you go the less error-prone it becomes.
Anyway, if you think you can realistically estimate the risks involved and you have the skills to do it, then perhaps you can. You'd definitely be in the top 0.1% of developers or even better. For the vast, vast majority (the rest of us), it's not worth doing. It's very likely that we'd get something wrong in an area we don't even know we should pay attention to. Again, unknown unknowns. When I read about bugs and exploits in security code, I always realize how difficult it is to get it right.
By the way, why do you think that the consensus is the way it is? Is it spread by security and crypto developers so that there's less competition?
Agreed on the depth argument. It seems we misunderstood each other. I shall write an article on "Rolling your Own Crypto" shortly, and lay out the possibilities and the safety implications. I'm no expert, but I believe my opinion is well informed.
> Anyway, if you think you can realistically estimate the risks involved and you have the skills to do it, then perhaps you can. You'd definitely be in the top 0.1% of developers or even better.
I am definitely not in the top 0.1%. However, for the goal of implementing existing crypto primitives, for which test vectors are available, I'm confident I can implement the primitives I'm interested in correctly (zero bug, immune to timing attacks). My only roadblock for now is modulo arithmetic of big numbers. For this, I am currently content with ripping off existing implementations. However, to ensure that my implementations are indeed bug-free, I will need external review.
And again, the current best primitives tend to be simpler than most.
About the unknown unknowns… well, side channel attacks are an open area of research right now, so I won't pretend I can be immune to any of those, except timing attacks. (Those are surprisingly easy to prevent: just don't let any variable-time operation depend on a secret input. For symmetric encryption, this means no branch and no array indexing that depends on either the message or the secret key. Some primitives make this easier than others.)
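A concrete everyday instance of that rule is comparing a received MAC against the expected one: an early-exit '==' leaks how many leading bytes matched through timing. Python's standard library ships a constant-time comparison, so a small illustration of the pattern (the point is the pattern, not the language; real crypto code would be C):

    import hmac, hashlib

    key = b'secret key'
    message = b'some message'
    expected = hmac.new(key, message, hashlib.sha256).digest()

    def verify(received_mac):
        # BAD: `==` can return at the first mismatching byte, so the response
        # time depends on secret data.
        # return received_mac == expected
        # GOOD: constant-time comparison; time depends only on the length.
        return hmac.compare_digest(received_mac, expected)

    print(verify(expected), verify(b'\x00' * 32))  # True False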
> By the way, why do you think that the consensus is the way it is?
Because a blanket "never invent/implement/mess-with your own crypto" is easier to spread, and safer than anything more accurate: any subtlety can and will be misinterpreted by some fool, who will then happily implement easily broken crypto. My upcoming article on the subject will indeed start by "don't do it". I'll have to introduce the subtleties very carefully, lest I encourage some idiot to screw up.
Come to think of it, I probably deserve the downvotes I got, even though I stand by what I wrote: with crypto, partial knowledge tends to be dangerously unwieldy. Many missteps are as silent as they are deadly (sometimes literally so: see Tor, or WiFi enabled pacemakers).
The problem is that fools don't know they are fools. Most programmers don't know about timing attacks even. Some don't properly understand buffer attacks etc.
There's a phase in developers' lives when they are overconfident. Having learned a language or two, they can reasonably implement things they need, they can use databases, create guis etc. At this point they might conjure up some "clever" way to store passwords or encrypt data by original schemes invented by them. This is always wrong.
And to reiterate: people often don't know that they don't have the necessary skills.
OMG, memories... at one point in the Pleistocene era, I was tasked with writing a manual for what was then Informix 4GL[1], and to get comfortable with the language, I spent a week writing a user forum app, with topics and message threads, messages stored as "blobs" in the SQL DB. Tried to get my co-workers in the pubs group to use it. They thought it was cute but all said hahaha no.
By far the project that taught me the most was writing an inliner for a toy language we developed as part of a course at uni.
Most people don't really realise that the more complex an inliner gets, the more it resembles a rabid animal you can just barely restrain on a leash.
My prof said he had laughed at my "how hard can it be?" attitude. I worked my ass off and ended up being asked whether he could use the inliner in a language often mentioned on HN today :)
The inliner was superseded by a much better one in 2003 written by a much smarter person than I, though.
An emulator for a real chip, like a 68000. I wrote one for a Prime minicomputer (Prime died in the early 90's). Telnet to em.prirun.com on port 8001 to try it out!
The advantage of emulating a real system/chip is that if you can find old software for it, you avoid the step of having to invent your own programs: real, running programs already exist.
I love the author's meta-idea of refusing to accept that unfamiliar things are black boxes full of magic that can't be touched.
A great example of this mindset is the guy who bought a mainframe. [1]
Refuse to be placed in a silo. Work your way up and down the stack and you'll be much better placed to solve problems and learn from the patterns that repeat at all levels.
Everything is made from smaller components. Understand each of those components better and you'll understand the entire system better.
Sometimes, you can use the errors that reach the end user to tell which component has the issue. For instance, if a web site gives a 502, the problem is likely with the load balancer or with the web server behind it being down or unreachable. A 404 usually means the request reached the web server but the resource wasn't found there, a 500 typically points at the application code blowing up (often while talking to the database), and a 400 points back at a malformed request from the client.
> Everything is made from smaller components. Understand each of those components better and you'll understand the entire system better.
This. No matter how specialised you are (or want to be), always strive to have at least a basic understanding of the full stack and everything else your work touches through a couple of levels of indirection (including the wetware: in commercial contexts, a good understanding of your client's business helps even if you aren't anywhere close to being client-facing). It will help you produce much more useful/optimal output, and it makes you a lot more helpful when your colleagues/compatriots/partners/whoever hit a technical problem. Heck, at the logical extreme, a little cross-discipline understanding could even lead you to discover a better method of doing X that strips out the need for Y altogether, revolutionising how we do Z.
Of course don't go overboard unless you are truly a genius... Trying to keep up with everything in detail is a sure-fire route to mental burn-out!
Two approaches are severely underused in the software world:
1) Domain-specific languages (DSLs)
2) Virtual machines (or just explicit state machines more generally)
What I mean is, a lot of problems could be solved cleanly, elegantly, more safely, and more powerfully by using one (or both) of the above. The problem is that when people think DSL or VM, they think big (Scheme or the JVM) instead of thinking small (printf). A DSL or VM doesn't need to be complex; it could be incredibly simple but still be immensely more powerful than coding a solution directly in an existing language using its constructs and APIs.
Case in point: the BSD hexdump(1) utility. POSIX defines the od(1) utility for formatting binary data as text, and it takes a long list of complex command-line arguments. The hexdump utility, by contrast, uses a simple DSL to specify how to format output. hexdump can implement almost every conceivable output format of od and then some using its DSL. The DSL is basically printf format specifiers combined with looping declarations.
I got bored one day and decided to implement hexdump as a library (i.e. "one hexdump to rule them all"), with a thin command-line wrapper that emulates the BSD utility version. Unlike BSD hexdump(1) or POSIX od(1), which implement everything in C in the typical manner, I decided to translate the hexdump DSL into bytecode for a simple virtual machine.
The end result was that my implementation was about the same size as either of those, but 1) could be built as a shared library, command-line utility, or Lua module, 2) is more performant (formats almost 30% faster for the common outputs, thanks to a couple of obvious, easy, single-line optimizations the approach opened up) than either of the others, and 3) is arguably easier to read and hack on.
Granted, my little hexdump utility doesn't have much value. I still tend to rewrite a simple dumper in a couple dozen lines of code for different projects (I'm big on avoiding dependencies), and not many other people use it. But I really liked the experience and the end result. I've used simple DSLs, VMs, and especially explicit state machines many times before and after, but this one was one of the largest and most satisfying.
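To make the "think small" point concrete, here is the flavor of that approach in miniature, in Python (a made-up little dump DSL, nothing to do with the real hexdump grammar): compile the format string to a flat list of ops once, then run the ops over the input.

    # Tiny made-up dump DSL: 'x' = next byte as two hex digits, 'c' = printable
    # char (or '.'), ' ' = literal space, 'n' = newline. Applied once per byte.
    def compile_fmt(fmt):
        ops = []
        for ch in fmt:
            if ch == 'x':
                ops.append(lambda b, out: out.append('%02x' % b))
            elif ch == 'c':
                ops.append(lambda b, out: out.append(chr(b) if 32 <= b < 127 else '.'))
            elif ch == ' ':
                ops.append(lambda b, out: out.append(' '))
            elif ch == 'n':
                ops.append(lambda b, out: out.append('\n'))
        return ops            # the "bytecode": a flat list of ops, compiled once

    def dump(data, ops):
        out = []
        for b in data:        # interpret the compiled ops for every input byte
            for op in ops:
                op(b, out)
        return ''.join(out)

    print(dump(b'Hi!', compile_fmt('x c n')), end='')

The point is only that the format language is parsed once up front, and the hot loop then just walks a flat list of trivial operations.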
The only more complex VM I've written was for an asynchronous I/O SPF C library, but that one is more difficult to explain and justify, though I will if pressed.
1. Every function is a tiny VM already. Every macro is a layer on top of a "compiler" to let you redesign a language. LISP gives much more power, precisely because every program is its own DSL, and all power of the language within that DSL is available. http://www.paulgraham.com/avg.html
The most powerful construct I've seen is a combination of these two: Clojure/async is a macro which transforms any program into state machines. I think that kind of power gives you a 20-30x advantage in business-type applications. In fact I've seen C++ programmers spend a decade discovering similar things by themselves. I strongly believe everyone should know at least a little about LISP.
I whole-heartedly agree with your points. The ability to transform programs programmatically is extremely powerful. But, FWIW, I tend to see that as a form of generic programming.
When I think DSL I think regular expressions (especially PCREs) or SQL, where the syntax is tailor-made for the specific task at hand.
The problem with in-program transformations is that you're still largely bound to the syntax of the host language. That's particularly the case with s-expressions. That binding has amazing benefits (e.g. when auto-transforming programs into resumable state machines) but also puts constraints on the language syntax you can employ. That's not a problem when you're dealing with experienced engineers and devising technical solutions, but people tend to stay away from DSLs generally because they fear the burden imposed on others (including their future selves) of having to learn a new language, how to use it, and how it integrates within a larger framework. You minimize that burden and make DSLs more cost-effective by maximizing the expressiveness of the DSL in the context of its purpose, and minimize switching costs by making the delineation between the host language and the DSL clear and obvious.
So, for example, in concrete terms you'd generally implement arithmetic expressions using infix notation. You can sorta implement infix notation in LISP, but you still have the s-expression baggage (such as it is; it's obviously not baggage from the expert's standpoint), which among other things makes it difficult to know where the host language ends and the DSL begins.
Lua had a head start on most popular languages with its LPeg PEG parser, which makes it trivial to parse custom DSLs into ASTs. For all the limitations of PEGs[1], they're just amazingly powerful. But to drive home my previous point, while expert LPeg users use the Snowball-inspired pattern where you instantiate PEG terms as Lua objects and build grammars using Lua's arithmetic operators (with operator overloading "a * b" means concatenation instead of multiplication, and "a^1" means repeat 1 or more times), newbies tend to prefer the specialized syntax which uses a notation similar to the original PEG paper and to typical regular expression syntax. (LPeg includes a small module to parse and compile that notation.)
Converting the AST into usable code is another matter, and that's an area where LISP shines for sure. And now that I think about it, the best of both worlds might be a PEG parser in LISP. But I'm completely ashamed to say I haven't used LISP or LISP derivatives.
[1] Yes, implementing left-recursion can be taxing using a PEG. But the syntax is so intuitive and powerful that the alternatives just can't replace it. Regular expressions are limited, too, but PEGs (or anything more sophisticated) will never replace them, because their expressiveness is just too powerful and cost-effective within certain domains. Instead we supplement regular expressions rather than replace them; and if PEGs continue catching on I expect PEGs to be supplemented rather than replaced.
(The Ragel stuff is for parsing the SPF records and is a rather small detail. I always intended to remove it--it's a dependency that turns alot of people off--but never got around to it. But, FWIW, Ragel is awesome when you don't mind the build-time dependency.)
Basically what happened was that when I set out to write an async SPF library I found myself constantly refactoring the internal API and devising ever more clever hacks for implementing continuations in C. The biggest problems were
1) It had to support asynchronous I/O, and specifically non-blocking I/O.
2) I explicitly DID NOT want to use callbacks because one design goal of the library was that it be easy to integrate into other code and frameworks, regardless of event loop and even without an event loop. I had recently implemented a non-blocking DNS library after years of using and hacking on ADNS and c-ares taught me to hate callbacks in C.
Instead, you repeatedly call spf_check() until you get something other than EAGAIN. When you get EAGAIN, you use spf_pollfd(), spf_events(), and spf_timeout() for the polling parameters. (The file descriptor can change dynamically, e.g. if the DNS query had to switch to TCP from UDP. The events can be POLLIN or POLLOUT, though usually POLLIN, and the timeout is necessary because DNS send/recv timeouts can be much smaller than the timeout for the whole SPF check operation.)
3) SPF is recursive--policies can include other policies, which can include other policies. So you're already dealing with a logical stack situation, but you can't rely on the language's runtime stack. Also, for each rule there can be an iteration over DNS records with intermediate lookups, and it's especially ugly in C to pause and resume loops with inner function calls.
4) I had never implemented SPF before. I kept running into areas where my approach turned out to be deficient as I became more familiar with the spec. And because of constraints 1-3, this was very costly in time and effort. Were I to implement an SPF library again I probably wouldn't use a VM; instead I'd go back to using a basic state machine, function jump table, and a simpler state stack. But at the time it made sense because it was a more flexible and conceptually simpler approach given the unknowns. Plus, it was an excuse to try out a concept I had long had, which was implement a VM for non-blocking I/O operations.
So basically what it does is decompose the terms in an SPF policy to logical opcodes. So the "ip4:1.2.3.4" term composes into an opcode which pushes a typed address (1.2.3.4) onto the stack and the opcode (OP_IP4) for matching the address at the top of the stack to the address of the connecting host (which is part of the context of an instantiated SPF processor object).
The term "mx:foo.bar", for example, requires querying MX records for foo.bar, iterating over each host and recursively querying the A or AAAA records, and then matching the IP address like above. The "mx:" term is first emitted like the "ip:" term--push domain string following my OP_MX. But when the OP_MX opcode is executed it dynamically generates more bytecode to do the query. I don't remember why I did it this way--probably because it seemed simpler/easier at the time.
It works well, but objectively is over-engineered. At the very least the VM is _too_ general. Nonetheless, AFAIK it's the most comprehensive async I/O SPF library in C that doesn't use callbacks. It only passes ~80% of the SPF test suite, but that's mostly because of the policy parser--IIRC the ABNF grammar for SPF is broken because the proof-of-concept was implemented using a naive regular expression to chop up the string (the ABNF translation of the pattern is ambiguous); and I never bothered or needed to handle some of the corner cases necessary to match the behavior of, e.g., libspf2. But my implementation handles many millions (if not billions) of queries every day as it's used in several commercial and cloud products.
Thanks for responding. Great explanation! I agree that adhering to the SPF standard has diminishing returns as support is added for the more esoteric macros. Support for every single macro in the standard is almost certainly unnecessary.
I've seen a simple recursive descent parser to process SPF records, but never considered it could be done with a VM or a stack-based machine. Even with the DNS Lookup Limit, it appears you made the right choice to avoid the call stack for a fully async I/O interface.
Your coding style is simple but since I have never studied a VM before, I have trouble telling what's going on. You've inspired me to learn more!
The project that affected my thinking the most was a bytecode interpreter[1].
I've had use for that knowledge, nearly fifteen years later - most of the interesting learnings about building one have been about the inner loop.
The way you build a good interpreter is upside-down in tech - the system which is simpler often works faster than anything more complicated.
Because of working on that, then writing my final paper about the JVM, contributing to Perl6/Parrot and then moving onto working on the PHP bytecode with APC, my career went down a particular funnel (still with the JVM now, but a logical level above it).
Building interpreters makes you an under-techtitect, if that's a word. It creates systems from the inner loop outwards rather than leaving the innards of the system for someone else to build - it produces a sort of double-vision between the details and the actual goals of the user.
Interesting. One thing that I still haven't solved yet is the "break" and "continue" statements inside loops. For a break statement, it seems like it would just be the same as a JMP with an address as the operand, but there would need to be some sort of registration of "the VM is in the loop now, and the break address is X", and continue would also be a JMP with an address pointing to the top of the loop's code.
I haven't implemented those in my system yet, and also have no idea how Python or PHP does it.
Is PHP's VM a stack based one? I do read the Zend/ directory of PHP's source, but it is really hard to follow and there is virtually no documentation on the VM
You can implement those as part of a linking phase during bytecode generation: emit a placeholder value (e.g. 0) for the jump address and when you've finished compiling the block go back and fill-in the placeholder with the correct address/offset.
That's relatively easy when implementing an assembler for your opcodes. Just keep track of symbolic labels and their associated jump points (as a simple array or linked list) and process (i.e. finalize or "link") the jump points when the label address becomes known. My "assemblers" often have constructs like:
    L0      ; define symbolic label 0 at this offset
    ...
    J1      ; unconditional jump to label 1 (target patched later)
    ...
    J0      ; unconditional jump to label 0
    ...
    L1      ; define symbolic label 1
where L? registers a symbolic jump destination (i.e. x.label[0].offset = label_offset) and J? emits an unconditional jump and registers a link request (i.e. push(x.label[1].from, jump_opcode_offset)). When a block is finished all the offsets are known; you just process things like
    for (i = 0; i < x.nlabel; i++) {
        /* patch every recorded jump site that targets label i */
        for (j = 0; j < x.label[i].nfrom; j++) {
            patch_in_offset(x.label[i].from[j], x.label[i].offset);
        }
    }
Knowing when to emit symbolic label and jump instructions from the AST is a little more involved, but no more than analyzing the AST for anything else.
Supporting computed gotos would require much more bookkeeping, I'd imagine, and I'm not surprised few languages support that construct. Or maybe not... I haven't really thought it through.
One cool thing about this whole exercise is that it helps to demonstrate why generating some intermediate representation can be easier (conceptually and mechanically) than directly generating runnable code in a single pass. It seems more complex but it really makes things easier.
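For what it's worth, here's a toy Python version of that placeholder-then-patch scheme for 'break' (made-up opcodes, not any real VM's):

    def compile_while(condition_code, body_code):
        """Toy emitter for `while cond: body` with BREAK support: breaks emit a
        JMP with a placeholder target that gets backpatched at the loop's end."""
        code = []
        break_sites = []                      # offsets of JMPs waiting for a target

        loop_top = len(code)
        code.extend(condition_code)
        code.append(['JMP_IF_FALSE', None])   # exits the loop; patched below
        exit_jump = len(code) - 1

        for op in body_code:
            if op == 'BREAK':
                code.append(['JMP', None])    # placeholder target
                break_sites.append(len(code) - 1)
            else:
                code.append(op)
        code.append(['JMP', loop_top])        # continue = jump back to the test

        loop_end = len(code)                  # now we know where "after the loop" is
        code[exit_jump][1] = loop_end
        for site in break_sites:
            code[site][1] = loop_end          # backpatch every break
        return code

    print(compile_while([['LOAD_FLAG']], [['DO_WORK'], 'BREAK']))
    # [['LOAD_FLAG'], ['JMP_IF_FALSE', 5], ['DO_WORK'], ['JMP', 5], ['JMP', 0]]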
Computed gotos defer the search for the label from compile time to run-time. So you need to hash the targets, and then jump to it.
It all depends how your interpreter PC (program counter) works. Some have just opcode offsets (like a real PC, e.g. %eip -- Lua, Ruby, ...); some have opcode addresses (Perl, Lisp, ...).
With offsets you can encode jumps relative, which is a huge advantage. With addresses you have to remain absolute, which means the data needs a full pointer (cache), and you cannot move the code around for optimizations afterwards.
With offsets you have a cache-friendly op-array, with addresses you have a fullblown tree (AST) or linked list, with its cache-unfriendly pointer chasing overhead.
But with full addresses you can avoid the big switch-loop overhead in an interpreter; you just pass around the next pointer instead of incrementing the PC. But then it gets interesting how to keep track of the state of the stack depth.
I saw the matrix after I first implemented a virtual machine. I recommend everyone does it because it will teach you a lot about how code is executed and transformed from the syntax to the actual assembly/bytecode. A stack based virtual machine is so simple it takes a lot of thinking to understand how they work. (or maybe I'm just not that smart).
It's interesting that he implemented function calls via a jump. In my VM a function is just mapped to a name (variable), so functions are first class. When the VM gets to a CALL instruction, it loads the bytecode from the hash table (via a lookup of the name).
Since this is a procedural language where statements can be executed outside of a function, implementing the functions as a jump would be difficult because there would need to be multiple jumps between the function definition and statements that aren't in a function.
I really wish my CS program had a compilers class, but unfortunately they don't, so I had to learn everything on my own.
I think I agree. I really wish that I'd been walked through The Right Way (or just A Right Way) to write a compiler and bytecode interpreter by a professor. Oh well, it's fun to learn on my own!
Nice post! I really enjoy playing around with things like this. It's amazing how little is needed to make a language/interpreter capable of doing virtually anything, even if not elegantly or safely. As long as you can perform calculations, jump around, and implement some kind of stack your language can do just about anything.
I recently threw something together sort of like this, just for fun (I like your interpreter's name better though): https://github.com/briansteffens/bemu
It's crazy how much these little projects can clarify your understanding of concepts that seem more complicated or magical than they really are.
What's even cooler, is that after building this VM (for the C0 language) as a freshman, you can come back as a junior/senior and write a compiler for that language in 15-411. It's a very cool way of going full circle.
This reminds me a little bit of my computer architecture class. We started at logic gates in a simulator[1], and worked our way up from there to flip-flops and adders, memory chips, a simple ALU, and eventually a whole 8-bit CPU in the simulator. I want to think that we were even writing assembly for it, loading the programs into the simulated memory, and executing it. It was a great way to get a sense of how everything works, and I think it's when C-style pointers really clicked for me.
I've done something similar, 21 years ago: a C interpreter targeting a virtual machine. The runtime had a dynamic equivalent of libffi to call into native code and use existing native libraries. I added extensions to run code blocks in threads so that the dining philosophers solution was very elegant. Back in the day, not having libffi meant generating assembly on the fly for SPARC, MIPS, PA-RISC, i386. Fun times. That C interpreter was used to extend a CAD package.
I wrote a VM for the 6502 for fun and it was one of most interesting and satisfying projects I've ever made in my free time.
It is very close to a bytecode interpreter, only that it comes with a specification that is actually the opcode list for the MOS 6502 (and a few details you need to take into account when implementing that CPU).
Besides, there are cross-compilers that allow you to generate 6502 code from C for your specific VM (see cc65).
I did this in C#. It was a lunch time project at work a couple of years ago. It was fun. I still want to do a V2 and remove all of the shortcuts I put in because I didn't want to write code for the stack and stuff like that. At the end of the day, my solution was spookily similar to this - the 32bit instructions - well, yeah, I was the same! It was just simpler. I did have a few general purpose registers (V1, V2 and V3 I think) and I did have routines to handle bytes, words and such like. So stuff like this (as a random example I pulled from the source):
I'm thinking a lot of the complexity of writing a compiler stems from the usage of inappropriate tools. I.e. I would rather kill myself than write a lexer in C (without yacc / bison), but using parser combinators it's a rather trivial task.
Similarly, annotating, transforming, folding, pattern matching on, CPS transforming etc. the produced AST is pretty trivial in a language that supports these constructs. And again, a nightmare in C.
That leaves codegen, but using the right abstractions it turns into a very manageable task as well.
Here's a compiler written in Haskell for LLVM [0].
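For a taste of why combinators make this feel trivial, a deliberately minimal Python sketch (nothing like a real library such as Parsec, but the same shape: a parser is a function from text to a (value, rest) pair or None):

    import re

    def regex(pattern):
        def parse(text):
            m = re.match(pattern, text)
            return (m.group(), text[m.end():]) if m else None
        return parse

    def seq(*parsers):                        # run parsers one after another
        def parse(text):
            values = []
            for p in parsers:
                result = p(text)
                if result is None:
                    return None
                value, text = result
                values.append(value)
            return values, text
        return parse

    def many(parser):                         # zero or more repetitions
        def parse(text):
            values = []
            while (result := parser(text)) is not None:
                value, text = result
                values.append(value)
            return values, text
        return parse

    # Toy grammar: a comma-separated list of integers, e.g. "1, 2, 3".
    number = regex(r'\s*\d+')
    item   = seq(regex(r'\s*,'), number)
    ints   = seq(number, many(item))

    print(ints('1, 2, 3'))
    # (['1', [[',', ' 2'], [',', ' 3']]], '')  -- raw parse tree plus leftover input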
I did something similar in the distant past: I wrote a subset-of-C compiler (functions, standard types, pointers) targeting an imaginary assembler, and then a bytecode interpreter for it. It was awesome fun, but I also got so into it that my - then - girlfriend started to question my commitment to the relationship. So be careful, this is a really interesting thing to do :)
Yes I wrote a parser/compiler and interpreter for a custom domain specific language and it had a similar effect on my career. Lots of fun!
Okay I guess technically I used a parser generator that I then modified to build an AST and convert it into assembly-like code that fed the interpreter.
Bill Gates also started with an interpreter (a BASIC interpreter). Many parts of early Windows applications were developed in p-code, and Visual Basic was an important part of Microsoft's success.
I like implementing emulators, because the toolchain and architecture specification are all there already. You get to implement what is basically a little embedded CPU.
I'd add reading and implementing a protocol from RFC - it's a great way to start thinking about design, especially if you read the original RFCs and work forward through the revisions and see what was kept vs deprecated.