I built this because I've become paranoid about "safe" refactors in the wake of supply chain attacks like the xz backdoor.
We spend a lot of time reviewing code for syntax, but we lack good tools for verifying that a large refactor (e.g., renaming variables, changing loop styles) preserves the exact business logic. Standard SHA256 hashes break if you change a single whitespace or variable name, which makes them useless for verifying semantic equivalence.
I built Semantic Firewall (sfw) to solve this. It is an open-source tool that fingerprints Go code based on its behavior, not its bytes.
How it works:
1. SSA Conversion: It loads the Go source into Static Single Assignment form using golang.org/x/tools/go/ssa.
2. Canonicalization: It renames registers (v0, v1) deterministically and normalizes control flow graphs. This ensures that `if a { x } else { y }` fingerprints the same even if branches are swapped with inverted conditions.
3. Scalar Evolution (SCEV): This was the hardest part. I implemented an SCEV engine that mathematically solves loop trip counts. This means a `for range` loop and a `for i++` loop that iterate N times produce the exact same fingerprint.
Here is a quick example of what it catches:
// Implementation A
func wipe(k []byte) {
for i := range k { k[i] = 0 }
}
// Implementation B (Refactor?)
func wipe(buf []byte) {
for i := 0; i < len(buf); i++ { buf[i] = 0 }
}
These two produce identical hashes. If you change the logic (e.g. `i < len(buf)-1`), the hash diverges immediately.
It’s written in Go and available as a CLI or GitHub Action. I’d love to hear your thoughts on the approach or edge cases I might have missed in the normalization phase.
I solve that with Scalar Evolution (SCEV) analysis. The tool doesn't just hash the AST; it solves the loop math.
You are right that AST hashing is brittle. That is why I wrote an engine that virtually executes the induction variables to determine that a `range` loop, a C-style `for` loop, and a raw `goto` loop are all mathematically performing the same operation (Iterate 0 to N).
I just pushed a proof to the repo that runs those three exact scenarios. They produce the identical SHA-256 fingerprint.
It also handles Control Flow Normalization, so `if a > b { return 1 }` fingerprints identically to `if b <= a { return 1 }` (inverted condition + swapped branches).
It won't catch O(n) vs O(log n) algorithm changes, but it catches the "syntactic sugar" refactors that make up 90% of code churn.
I built this because I've become paranoid about "safe" refactors in the wake of supply chain attacks like the xz backdoor.
We spend a lot of time reviewing code for syntax, but we lack good tools for verifying that a large refactor (e.g., renaming variables, changing loop styles) preserves the exact business logic. Standard SHA256 hashes break if you change a single whitespace or variable name, which makes them useless for verifying semantic equivalence.
I built Semantic Firewall (sfw) to solve this. It is an open-source tool that fingerprints Go code based on its behavior, not its bytes.
How it works:
1. SSA Conversion: It loads the Go source into Static Single Assignment form using golang.org/x/tools/go/ssa.
2. Canonicalization: It renames registers (v0, v1) deterministically and normalizes control flow graphs. This ensures that `if a { x } else { y }` fingerprints the same even if branches are swapped with inverted conditions.
3. Scalar Evolution (SCEV): This was the hardest part. I implemented an SCEV engine that mathematically solves loop trip counts. This means a `for range` loop and a `for i++` loop that iterate N times produce the exact same fingerprint.
Here is a quick example of what it catches:
These two produce identical hashes. If you change the logic (e.g. `i < len(buf)-1`), the hash diverges immediately.It’s written in Go and available as a CLI or GitHub Action. I’d love to hear your thoughts on the approach or edge cases I might have missed in the normalization phase.
Repo: https://github.com/BlackVectorOps/semantic_firewall
reply