The 6502 instruction set as a database (gitlab.com/screwtapello)
127 points by orgonon 9 months ago | 30 comments



Since this is SQLite SQL, I copied it into a Gist (to give it CORS headers) and now you can open it directly in Datasette Lite (Datasette in WebAssembly running entirely in the browser):

https://lite.datasette.io/?sql=https://gist.github.com/simon...

Here are the 65 hardcoded opcodes: https://lite.datasette.io/?sql=https://gist.github.com/simon...

And the 64 instructions: https://lite.datasette.io/?sql=https://gist.github.com/simon...


I love your tool. If you want a slightly more structured version of Gists that speaks Parquet, try my own tool, csvbase :). I pasted this data in as https://csvbase.com/calpaterson/opcodes-6502 and you can get Parquet by adding .parquet to that (https://csvbase.com/calpaterson/opcodes-6502.parquet)

Here's datasette lite running off that file:

https://lite.datasette.io/?parquet=https%3A%2F%2Fcsvbase.com...

Really nice when webby stuff works together


Inspiring. I would love for the author's plan (pasted below) to get applied widely across key information sources:

    Instead, I hatched a plan:

    1. Collect sources, and encode the raw data in a machine-readable form
    2. Study those sources, and encode my understanding as assertions, 
       sanity-checks and validations of that data
    3. Synthesise that data according to my understanding, and verify it
       against the sources available


Thanks for reposting this here. I think I work in a similar way to this, but I've not written it down as a "manifesto" before like Simon has.

I'm going to be thinking about this as the Discovery Manifesto.... or maybe the autodidactaliser.

I've found it useful to reexpress foreign knowledge in a familiar setting. The magic is possibly that the learning seems fun because the familiar tool is fairly easy to use and the new information is expressed in the familiar way.

It's also good because it's not cheap dopamine, like skimming YouTube videos: "I watched a video on the 6502, so I know how it works now?" YouTube videos do have their place, but not at the expense of more in-depth thinking.


The last step should be: 4. Get a real 6502 CPU and verify that the data is correct.


Likely useful: https://github.com/mist64/perfect6502 and http://www.visual6502.org/. These are transistor-level simulations based on die shots. From these you can derive the cycle times of each instruction with 100% confidence.


Up next: 6502 emulator in a single SQL query


:-)


From the comments, it even seems to account for some instructions needing an extra cycle when crossing page boundaries, nice! This seems pretty comprehensive then.
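For anyone unfamiliar with the page-crossing rule, here is a minimal Python sketch of it. The function names are made up for illustration and are not taken from the database; the only facts assumed are that pages are 256 bytes and that LDA absolute,X takes 4 cycles plus 1 when indexing crosses a page.

```python
def crosses_page(base: int, index: int) -> bool:
    """Return True if base + index lands on a different 256-byte page than base."""
    effective = (base + index) & 0xFFFF  # addresses wrap within 16 bits
    return (base & 0xFF00) != (effective & 0xFF00)


def lda_absolute_x_cycles(base: int, x: int) -> int:
    """Cycle count for LDA absolute,X: 4, plus 1 on a page crossing."""
    return 4 + (1 if crosses_page(base, x) else 0)


same_page = lda_absolute_x_cycles(0x20F0, 0x0F)   # effective address $20FF
next_page = lda_absolute_x_cycles(0x20F0, 0x10)   # effective address $2100
```

Here `same_page` is 4 and `next_page` is 5, since only the second access leaves page $20.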


Great idea, this could be valuable for people who write their own assembler.

It's the first time I've seen an instruction set as a relational database, which I would imagine is a very portable way to describe a machine. It might be worth collecting other machine specs in the same format and then creating a portable assembler that uses the specific DBs.


Table-driven assemblers (and disassemblers) have been a thing for a long time, especially for more obscure/embedded architectures. Reverse-engineering/analysis tools likewise have traditionally done the same, but with additional semantic information for each instruction. A quick search for table-driven compilers reveals some mid-century papers.
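To illustrate what "table-driven" means here, a rough Python sketch of a disassembler whose only knowledge of the 6502 is a lookup table. The table is a tiny hand-picked subset of documented opcodes, and the `OPCODES` layout and `disassemble` name are my own assumptions, not anything from the linked project.

```python
# opcode byte -> (mnemonic, operand format, total instruction length in bytes)
OPCODES = {
    0xA9: ("LDA", "#${:02X}", 2),  # immediate
    0xA5: ("LDA", "${:02X}", 2),   # zero page
    0xAD: ("LDA", "${:04X}", 3),   # absolute
    0x8D: ("STA", "${:04X}", 3),   # absolute
    0x4C: ("JMP", "${:04X}", 3),   # absolute
    0xE8: ("INX", "", 1),          # implied
    0xEA: ("NOP", "", 1),          # implied
    0x60: ("RTS", "", 1),          # implied
}


def disassemble(code: bytes) -> list[str]:
    """Decode bytes using only the table; unknown bytes become .byte lines."""
    lines = []
    pc = 0
    while pc < len(code):
        entry = OPCODES.get(code[pc])
        if entry is None:
            lines.append(f".byte ${code[pc]:02X}")
            pc += 1
            continue
        mnemonic, fmt, length = entry
        # Operands are little-endian; an empty slice decodes to 0 and is unused.
        operand = int.from_bytes(code[pc + 1 : pc + length], "little")
        lines.append(f"{mnemonic} {fmt.format(operand)}".strip())
        pc += length
    return lines


listing = disassemble(bytes([0xA9, 0x01, 0x8D, 0x00, 0x02, 0xE8, 0x60]))
```

Swapping in a different table is all it would take to target another CPU with a similar encoding, which is the appeal. This sketch does not guard against an operand truncated at the end of the input.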


Ghidra uses SLEIGH for this purpose https://fossies.org/linux/ghidra/GhidraDocs/languages/html/s...

> A Language for Rapid Processor Specification

From a SLEIGH description, the assembler, disassembler, and even decompiler can be synthesized.

It's a DSL not a database schema, but fundamentally it's the same idea.

Here's their definition of the 6502: https://github.com/NationalSecurityAgency/ghidra/blob/cae919...


Some other instruction sets in some JSON: https://github.com/asmjit/asmjit/tree/master/db


Not sure what the advantage is, relative to the tables as CSV files.


As far as I know, most information about the 6502 instruction set comes in two forms:

- emulators/simulators/FPGA code

- books, data sheets, OCR'd PDFs of books and data sheets, text files copy/pasted from PDFs or retyped from books and data sheets

Code is likely to be heavily tested, but it makes extracting high-level information about the instruction set very difficult.

Data is easy to analyse and synthesise, but since it's described in prose there's no easy way to test or validate it - if somebody in 1984 made a typo that a particular instruction took 3 cycles instead of 2, and that error was copy/pasted and made its way into half the "6502 instruction set" websites online, how would you know? How would you detect it?

Using SQL to enforce constraints and validation gives me confidence that there aren't a bunch of typos and copy/paste errors in this data. In addition, being able to express special cases like "read-modify-write instructions applying to the accumulator do not pay the three cycle penalty" in code rather than in prose makes it more likely they will be applied correctly. Lastly, since the result is an SQL database, it can be pretty easily formatted to resemble any book or data sheet you like for simplified visual verification against book/data sheet sources.
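As a rough idea of what that kind of validation can look like, here is a small sketch using Python's built-in sqlite3 module. The schema, column names and ranges below are assumptions made for illustration (they are not the project's actual schema); the point is only that the database itself rejects implausible rows.

```python
import sqlite3

# Hypothetical schema: every constraint encodes a belief about the data.
SCHEMA = """
CREATE TABLE instructions (
    mnemonic TEXT PRIMARY KEY CHECK (length(mnemonic) = 3)
);
CREATE TABLE opcodes (
    opcode   INTEGER PRIMARY KEY CHECK (opcode BETWEEN 0 AND 255),
    mnemonic TEXT NOT NULL REFERENCES instructions (mnemonic),
    size     INTEGER NOT NULL CHECK (size BETWEEN 1 AND 3),
    cycles   INTEGER NOT NULL CHECK (cycles BETWEEN 2 AND 7)
);
"""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO instructions VALUES ('LDA')")
    return conn


def try_insert(conn, opcode, mnemonic, size, cycles) -> bool:
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO opcodes VALUES (?, ?, ?, ?)",
            (opcode, mnemonic, size, cycles),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
results = [
    try_insert(conn, 0xA9, "LDA", 2, 2),   # plausible row
    try_insert(conn, 0xAD, "LDA", 3, 40),  # typo in the cycle count
    try_insert(conn, 0xA5, "LDQ", 2, 3),   # mnemonic that does not exist
    try_insert(conn, 0xA9, "LDA", 2, 2),   # duplicate opcode byte
]
```

Only the first insert succeeds; the other three fail on the CHECK, the foreign key and the primary key respectively.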


The ability to ship SQL views that join multiple tables as part of the schema is pretty cool, and something you can't come close to replicating with CSVs.

https://lite.datasette.io/?sql=https://gist.github.com/simon...
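A minimal sketch of the idea, again with Python's sqlite3 module. The two tables and the `opcode_summary` view are invented for this example and only loosely modelled on the kind of schema being discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE addressing_modes (
    mode          TEXT PRIMARY KEY,
    operand_bytes INTEGER NOT NULL
);
CREATE TABLE opcodes (
    opcode   INTEGER PRIMARY KEY,
    mnemonic TEXT NOT NULL,
    mode     TEXT NOT NULL REFERENCES addressing_modes (mode)
);
INSERT INTO addressing_modes VALUES
    ('immediate', 1), ('zero_page', 1), ('absolute', 2);
INSERT INTO opcodes VALUES
    (169, 'LDA', 'immediate'), (165, 'LDA', 'zero_page'), (173, 'LDA', 'absolute');

-- The view ships with the schema, so every consumer gets the same join.
CREATE VIEW opcode_summary AS
SELECT o.opcode, o.mnemonic, o.mode, 1 + m.operand_bytes AS length
FROM opcodes AS o
JOIN addressing_modes AS m ON o.mode = m.mode;
""")

rows = [
    (f"{opcode:02X}", mnemonic, length)
    for opcode, mnemonic, length in conn.execute(
        "SELECT opcode, mnemonic, length FROM opcode_summary ORDER BY opcode"
    )
]
```

The instruction length is never stored; it is derived from the addressing mode each time the view is queried, so the two tables cannot drift out of sync the way two CSV files could.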


I like how it looks, and I appreciate the effort that went into it, but other than using it to export a table in just a right format that you will then embed into your assembler directly, I'm not sure how to use it.

And I use opcode references [1] very often (sometimes daily, depending on the project). I even wrote my own disassemblers. But I mostly use opcode references for manual cross checking, so maybe I'm not a target of this project?

[1] My favorite one for x64 is https://ref.x86asm.net/coder64.html


The ability to export data from a table or query in whatever format you need is one of my favorite things about distributing data using SQLite.
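For example, a short sketch with the Python standard library that turns any query result into CSV or JSON. The helper names and the two-row sample table are assumptions for illustration only.

```python
import csv
import io
import json
import sqlite3


def query_rows(conn, sql):
    """Run a query and return (column names, rows)."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


def to_csv(columns, rows) -> str:
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


def to_json(columns, rows) -> str:
    return json.dumps([dict(zip(columns, row)) for row in rows])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opcodes (opcode INTEGER PRIMARY KEY, mnemonic TEXT)")
conn.executemany("INSERT INTO opcodes VALUES (?, ?)", [(0xA9, "LDA"), (0xEA, "NOP")])

columns, rows = query_rows(conn, "SELECT opcode, mnemonic FROM opcodes ORDER BY opcode")
csv_text = to_csv(columns, rows)
json_text = to_json(columns, rows)
```

The same two helpers work for any SELECT, including one against a view, so the output format becomes a decision made at export time rather than at authoring time.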


Something I realized years ago: the 6502 instruction set is small enough that it can be (almost) entirely implemented as memory lookups; no actual logic computation need occur.


if you allow multiple sequential memory lookups, this is true for any instruction set


This is a very nice presentation; the .sql file contains a lot of notes about the sourcing for the data. I could imagine adding test vectors to the database as well.


It might be related to some extent: intel8080.com.


The 6502 only has 56 opcodes.

The db also includes modern variants of the 6502.


> The 6502 only has 56 opcodes

In terms of bytes that the original CPU officially recognised as instructions, it was more like ~150 (working from old memories, I may be off by one or a few there). Some of the other ~106 did something unofficially, and a number were valid instructions on later versions of the design.

That ~150 were grouped into 56 instructions, many with multiple addressing modes (so "load A immediate", "load A direct", "load A indexed", etc, were different opcodes but considered the same instruction).

Because register use was far from orthogonal (one accumulator, two index registers, and a flags register), instructions for each register were considered different (LDA, LDX, and LDY for load, for instance). In other instruction sets, for chips with multiple general-purpose registers, they might be considered the same instruction affecting a different register. Considering them the same instruction wouldn't reduce the opcode count, though, just the instruction group count.

(Apologies for failing to keep my inner pendant properly inner!)
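The opcode/instruction distinction above is easy to see with a GROUP BY. A small sketch using Python's sqlite3 module and only the documented load opcodes; the one-table layout is an assumption for illustration, not the project's schema.

```python
import sqlite3

# Documented opcode bytes for the three load instructions,
# one byte per addressing mode.
LOADS = {
    "LDA": [0xA9, 0xA5, 0xB5, 0xAD, 0xBD, 0xB9, 0xA1, 0xB1],
    "LDX": [0xA2, 0xA6, 0xB6, 0xAE, 0xBE],
    "LDY": [0xA0, 0xA4, 0xB4, 0xAC, 0xBC],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opcodes (opcode INTEGER PRIMARY KEY, mnemonic TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO opcodes VALUES (?, ?)",
    [(op, mnemonic) for mnemonic, ops in LOADS.items() for op in ops],
)

# Opcodes per instruction (mnemonic).
per_instruction = conn.execute(
    "SELECT mnemonic, COUNT(*) FROM opcodes GROUP BY mnemonic ORDER BY mnemonic"
).fetchall()

# Total opcodes versus total instructions.
totals = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT mnemonic) FROM opcodes"
).fetchone()
```

For this subset, `totals` comes out as 18 opcodes across 3 instructions; treating LDA, LDX and LDY as one "load" instruction would change the second number but not the first.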


The original 6502 had exactly 151 opcodes, the same as the number of pokémon in the original Pokémon games.


This is a wonderful useless fact. Thanks!


> pendant

I assume you mean pedant.


Auto-carrot strikes again…

I really must stop writing things on the phone. I'm bad enough with a decent keyboard; it's a wonder I make sense at all via phone input.


I meant instructions/mnemonics.

I was referencing Simonw's post:

> Here are the 65 hardcoded opcodes


In a reductionist sense this is "SQL is Turing complete", which has long been known. The voyage of discovery aside, the joy is in the execution and efficiency. I'd be delighted if I'd done this.



