Since this is SQLite SQL, I copied it into a Gist (to give it CORS headers) and now you can open it directly in Datasette Lite (Datasette in WebAssembly running entirely in the browser):
Inspiring. I would love for the author's plan (pasted below) to get applied widely across key information sources:
Instead, I hatched a plan:
1. Collect sources, and encode the raw data in a machine-readable form
2. Study those sources, and encode my understanding as assertions, sanity-checks and validations of that data
3. Synthesise that data according to my understanding, and verify it against the sources available
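To make the plan concrete, here's a minimal SQLite sketch of what steps 1 and 2 might look like; the table and column names are my own invention, not necessarily what the author used:

    -- Step 1: encode the raw data in machine-readable form
    -- (official 6502 opcodes take between 2 and 7 cycles)
    CREATE TABLE opcode (
        value INTEGER PRIMARY KEY CHECK (value BETWEEN 0 AND 255),
        mnemonic TEXT NOT NULL,
        mode TEXT NOT NULL,
        cycles INTEGER NOT NULL CHECK (cycles BETWEEN 2 AND 7)
    );

    -- Step 2: encode understanding as an assertion, e.g. "every
    -- immediate-mode instruction takes exactly 2 cycles"; any rows
    -- returned by this sanity-check indicate a data error
    SELECT value, mnemonic FROM opcode
    WHERE mode = 'immediate' AND cycles <> 2;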
Thanks for reposting this here.
I think I work in a similar way to this, but I've not written it down as a "manifesto" before like Simon has.
I'm going to be thinking about this as the Discovery Manifesto.... or maybe the autodidactaliser.
I've found it useful to re-express foreign knowledge in a familiar setting. The magic is possibly that the learning seems fun because the familiar tool is fairly easy to use and the new information is expressed in a familiar way.
It's also good because it's not cheap dopamine, like watching YouTube videos... I watched a video on the 6502... do I know how it works now?? YouTube videos have their place, but not at the expense of doing more in-depth thinking.
From the comments, it even seems to account for some instructions needing an extra cycle when crossing page boundaries, nice! This seems pretty comprehensive then.
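If the database does model that, one plausible shape (the column name here is my guess, not necessarily what the project uses) would be a per-opcode flag that a cycle calculator can consult:

    -- Hypothetical: list opcodes that cost one extra cycle when
    -- the effective address crosses a page boundary
    SELECT mnemonic, mode, cycles
    FROM opcode
    WHERE page_cross_penalty = 1;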
Great idea, this could be valuable for people who write their own assembler.
It's the first time I've seen an instruction set expressed as a relational database, which I would imagine is a very portable way to describe a machine. Perhaps it would be worth collecting other machine specs in that same format and then creating a portable assembler that uses the specific DBs.
Table-driven assemblers (and disassemblers) have been a thing for a long time, especially for more obscure or embedded architectures. Reverse-engineering and analysis tools have traditionally done the same, but with additional semantic information for each instruction. A quick search for table-driven compilers turns up papers going back to the mid-twentieth century.
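With a database like this one, the instruction-selection core of such an assembler collapses into a lookup. A sketch against the hypothetical schema above (the real table/column names may differ):

    -- Assemble "LDA #$42": find the opcode byte for the
    -- mnemonic/addressing-mode pair, then emit it before the operand
    SELECT value
    FROM opcode
    WHERE mnemonic = 'LDA' AND mode = 'immediate';
    -- should return 169 (0xA9), so the assembler emits A9 42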
As far as I know, most information about the 6502 instruction set comes in two forms:
- emulators/simulators/FPGA code
- books, data sheets, OCR'd PDFs of books and data sheets, text files copy/pasted from PDFs or retyped from books and data sheets
Code is likely to be heavily tested, but it makes extracting high-level information about the instruction set very difficult.
Data is easy to analyse and synthesise, but since it's described in prose there's no easy way to test or validate it: if somebody in 1984 made a typo saying a particular instruction took 3 cycles instead of 2, and that error was copy/pasted and made its way into half the "6502 instruction set" websites online, how would you know? How would you detect it?
Using SQL to enforce constraints and validation gives me confidence that there aren't a bunch of typos and copy/paste errors in this data. In addition, being able to express special cases like "read-modify-write instructions applying to the accumulator do not pay the three cycle penalty" in code rather than in prose makes it more likely they will be applied correctly. Lastly, since the result is an SQL database, it can be pretty easily formatted to resemble any book or data sheet you like for simplified visual verification against book/data sheet sources.
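As a sketch of how such a special case might read in SQL (my own illustration against the hypothetical schema above, not the actual code from the project):

    -- Assertion: accumulator-mode read-modify-write instructions
    -- (ASL A, LSR A, ROL A, ROR A) skip the read/modify/write
    -- penalty and take only 2 cycles; rows returned = data errors
    SELECT mnemonic, cycles FROM opcode
    WHERE mnemonic IN ('ASL', 'LSR', 'ROL', 'ROR')
      AND mode = 'accumulator'
      AND cycles <> 2;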
The ability to ship SQL views that join multiple tables as part of the schema is pretty cool, and something you can't come close to replicating with CSVs.
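For instance, something along these lines (assuming a separate addressing_mode table, which may not match the project's actual schema):

    -- A view shipped inside the .sql file; anyone who loads the
    -- schema gets the join for free, unlike with a pile of CSVs
    CREATE VIEW opcode_detail AS
    SELECT o.value, o.mnemonic, o.mode, m.operand_bytes, o.cycles
    FROM opcode AS o
    JOIN addressing_mode AS m ON m.name = o.mode;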
I like how it looks, and I appreciate the effort that went into it, but other than using it to export a table in just the right format that you will then embed into your assembler directly, I'm not sure how to use it.
And I use opcode references [1] very often (sometimes daily, depending on the project). I even wrote my own disassemblers. But I mostly use opcode references for manual cross checking, so maybe I'm not a target of this project?
Something I realized years ago: the 6502 instruction set is small enough that it can be (almost) entirely implemented as a memory lookup, with no actual logic computation needing to occur.
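To illustrate that idea in the same SQL setting (a toy sketch with invented names, ignoring decimal mode):

    -- Precompute every possible binary-mode ADC once; "executing"
    -- the instruction is then a single table lookup
    CREATE TABLE adc_result (
        a INTEGER,
        b INTEGER,
        carry_in INTEGER,
        result INTEGER,
        carry_out INTEGER,
        PRIMARY KEY (a, b, carry_in)
    );
    -- 256 * 256 * 2 = 131,072 rows cover the whole operation:
    SELECT result, carry_out FROM adc_result
    WHERE a = 57 AND b = 212 AND carry_in = 0;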
This is a very nice presentation, the .sql file contains a lot of notes about the sourcing for the data. I could imagine adding test vectors to the database as well.
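A natural shape for that (purely illustrative; the column names are mine) would be a table of before/after machine states keyed by opcode:

    CREATE TABLE test_vector (
        opcode INTEGER NOT NULL REFERENCES opcode(value),
        a_before INTEGER, flags_before INTEGER,
        a_after INTEGER, flags_after INTEGER,
        cycles_taken INTEGER
    );
    -- an emulator harness could replay each row and diff the results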
In terms of bytes that the original CPU officially recognised as instructions, it was more like ~150 (working from old memories, I may be off by one or few there). Some of the other ~106 did something unofficially, and a number were valid instructions on later versions of the design.
Those ~150 were grouped into 56 instructions, many with multiple addressing modes (so "load A immediate", "load A direct", "load A indexed", etc. were different opcodes but considered the same instruction).
Because register use was far from orthogonal (one accumulator, two index registers, and a flags register), instructions for them were considered distinct (LDA, LDX, and LDY for load, for instance), where in other instruction sets (for chips with multiple general-purpose registers) they might be considered the same instruction affecting a different register. Considering them the same instruction wouldn't have reduced the opcode count though, just the instruction-group count.
(Apologies for failing to keep my inner pedant properly inner!)
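For what it's worth, counts like these are one query away once the data is in a database like this (using the hypothetical opcode table from the sketches above):

    SELECT COUNT(*) AS opcodes,
           COUNT(DISTINCT mnemonic) AS instructions
    FROM opcode;
    -- the official set should give 151 and 56 respectively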
In a reductionist sense this is "SQL is Turing complete", which has long been known. The voyage of discovery aside, the joy is in the execution and efficiency. I'd be delighted if I'd done this.
https://lite.datasette.io/?sql=https://gist.github.com/simon...
Here are the 65 hardcoded opcodes: https://lite.datasette.io/?sql=https://gist.github.com/simon...
And the 64 instructions: https://lite.datasette.io/?sql=https://gist.github.com/simon...