Vectorizing usually refers to compilers (or people) rewriting code to take advantage of vector instructions (like AVX or SSE).
What the author did is more commonly referred to as embedding by the ML community.
So, a better title would be: Generating embeddings for the Code of Federal Regulations.
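For anyone unfamiliar with the term, here's a minimal sketch of what "generating embeddings" means in practice. This assumes the sentence-transformers library and its all-MiniLM-L6-v2 model; the author's actual pipeline uses Mighty, not this:

    # Minimal embedding sketch (illustrative; not the author's pipeline).
    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each paragraph of text becomes a fixed-length vector of floats.
    paragraphs = [
        "No person may operate an aircraft in a careless or reckless manner.",
        "Each certificate holder shall establish a training program.",
    ]
    vectors = model.encode(paragraphs)
    print(vectors.shape)  # (2, 384) for this model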
Mighty is built in Rust with AVX turned on. But I get your point. I kinda dislike the name "embeddings". It's super niche, and not easy to explain to most people I talk to who aren't working in ML. I always end up saying "well, embeddings are a vector…". So this saves a step.
Many more people know what a vector is if they have some college math experience.
What are your larger plans here? Do you have an interface in mind for searching or organizing things?
I think an amazing eCFR search experiment would be transformer vectors in a graph, using the hierarchy, citations, and references as edges to (sub)paragraph and section nodes - perhaps even using a modified HNSW somehow. The graph that exists there now isn't leveraged enough.
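A rough sketch of that idea, just to make the structure concrete. The node IDs, edge kinds, and tiny stand-in vectors below are all illustrative, and it assumes the networkx library:

    # Hypothetical sketch: eCFR hierarchy + citations as a graph whose
    # nodes carry embedding vectors.  pip install networkx
    import networkx as nx

    G = nx.DiGraph()

    # Nodes are sections and (sub)paragraphs, with their vectors attached.
    # Three-float vectors stand in for real embeddings.
    G.add_node("14-CFR-91.13", vector=[0.12, -0.03, 0.41], level="section")
    G.add_node("14-CFR-91.13(a)", vector=[0.09, 0.11, 0.35], level="paragraph")

    # Hierarchy edges follow the title/part/section structure.
    G.add_edge("14-CFR-91.13", "14-CFR-91.13(a)", kind="contains")
    # Citation/reference edges cut across the hierarchy
    # (add_edge auto-creates the target node here).
    G.add_edge("14-CFR-91.13(a)", "14-CFR-91.119", kind="cites")

    # After a vector search returns a hit, expand along graph edges
    # to pull in structurally related text for re-ranking.
    hit = "14-CFR-91.13(a)"
    related = list(G.successors(hit)) + list(G.predecessors(hit))
    print(related)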
As for this dataset itself, I already output Vespa-formatted JSON (as noted in https://github.com/maxdotio/ecfr-prepare ), and the resulting vectors from inference get appended to the original JSON doc as a field.
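For concreteness, the append step looks roughly like this. The document ID and field names below are illustrative, not necessarily what ecfr-prepare emits:

    # Hypothetical sketch of the append step: take a Vespa-formatted doc
    # and attach the inference output as an extra field.
    import json

    doc = {
        "put": "id:ecfr:paragraph::14-91.13-a",  # illustrative doc ID
        "fields": {
            "text": "No person may operate an aircraft...",
        },
    }

    vector = [0.12, -0.03, 0.41]  # stand-in for the model's output

    # {"values": [...]} is one accepted Vespa JSON form for dense tensors.
    doc["fields"]["embedding"] = {"values": vector}

    print(json.dumps(doc, indent=2))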
I have a Vespa schema that I need to upload (it doesn't include the vector field yet, but one can be added using the Vespa vector search walkthroughs). It's been a busy day but I'll quickly try to find a place to put it for now :)
--EDIT-- Pushed the schema to the above repo, along with some bash. You'll need Docker, and you should first follow the Vespa MSMARCO semantic search tutorial at https://docs.vespa.ai/en/tutorials/text-search-semantic.html to get used to the engine.
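Once that's running, a nearest-neighbor query against the vector field looks something like the sketch below. It assumes the schema defines an "embedding" tensor field, a query tensor input named "q", and a rank profile named "semantic"; all three names are illustrative:

    # Hypothetical query sketch against a local Vespa container.
    # pip install requests
    import requests

    # Embed the query text with the same model used at feed time.
    query_vector = [0.12, -0.03, 0.41]  # stand-in for a real embedding

    response = requests.post(
        "http://localhost:8080/search/",
        json={
            "yql": "select * from sources * where "
                   "({targetHits: 10}nearestNeighbor(embedding, q))",
            # Older Vespa versions use ranking.features.query(q) instead.
            "input.query(q)": query_vector,
            "ranking.profile": "semantic",
        },
    )
    for hit in response.json()["root"]["children"]:
        print(hit["relevance"], hit["fields"].get("text"))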
My experience has been with 14 CFR and 21 CFR. I would love to see any tool you come up with in the future and would be happy to give you feedback.
As but one example: contracts often include the phrase "a reasonable effort".
Here's a definition: "Reasonable Efforts means, with respect to a given goal, the efforts that a reasonable person in the position of the promisor would use so as to achieve that goal as expeditiously as possible."
Try defining that in an algorithm!
But perhaps better tools can be built to make the process more efficient, less biased, and less redundant.
What are your favorite ways to do sentence and paragraph embeddings, and is there a framework you like that can be tuned to custom data? Do you find fine-tuning your embedding model helpful?
One more situation where the raw power of hardware (building the vectors, in this case) beats the old "knowledge engineering" approach: previously, one would use an ontology to get generalized terms and siblings for the user's search term.
Curious if anyone has recommendations on good stashes or datasets of already-encoded embeddings? This sounds geeky, but to some extent I don't even "care" about the original text; I'd love to just get the embedding vectors and play with those.
Do you have a working search demo people can try?