Show HN: AI-optimized x86-64 assembly vs. GCC -O3 on three production kernels (github.com/cleonard2341)
1 point by cod-e 27 days ago | 1 comment
Three kernels extracted from real open source projects, optimized with AI-generated x86-64 assembly, each verified with 100K differential fuzz iterations:
Kernel           AI strategy                     Speedup   Verdict
Base64 decode    SSSE3 pshufb table-free lookup  4.8–6.3x  AI wins
LZ4 fast decode  SSE 16-byte match copy          ~1.05x    AI wins (marginal)
Redis SipHash    Reordered SIPROUND scheduling   0.97x     GCC wins
The base64 win: GCC can't auto-vectorize a 256-byte lookup table (it's a gather pattern). The AI replaces it with a pshufb nibble trick — 16 parallel lookups in one instruction, zero table accesses. 1.8 GB/s → 11.6 GB/s.
The SipHash loss: on pure ALU kernels (adds, rotates, XORs), GCC's scheduler is already near-optimal.
300K total fuzz iterations, zero mismatches. Every result is one command to reproduce.
