Mighty is built in Rust with AVX turned on. But I get your point. I kinda dislike the name "embeddings". It's super niche, and not easy to explain to most people I talk to who aren't working in ML. I always end up saying "well, embeddings are a vector…". So this saves a step.
Many more people know what a vector is, if they have some college math experience.
Oh yeah, I could tinker forever, it's an amazing dataset that I think needs more attention from the ML community. Glad to see the working team at https://www.ecfr.gov/ finally making their search better, as Cornell Law has been the de facto go-to forever (for me at least).
I think an amazing eCFR search experiment would be putting transformer vectors in a graph, using the hierarchy, citations, and references as edges between (sub)paragraph and section nodes - perhaps even using a modified HNSW somehow. The graph that already exists there isn't leveraged enough.
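A minimal sketch of the idea, with entirely made-up node IDs, toy 3-dimensional stand-in "embeddings", and hypothetical hierarchy/citation edges (a real version would use encoder output and a proper ANN index):

```python
import math

# Hypothetical CFR-like nodes with tiny stand-in embedding vectors.
embeddings = {
    "part-170":        [0.9, 0.1, 0.0],
    "170.3":           [0.8, 0.3, 0.1],
    "170.3(a)":        [0.7, 0.5, 0.2],
    "cited-sec-201.1": [0.1, 0.9, 0.3],
}

# Edges: hierarchy (part -> section -> paragraph) plus a citation link.
edges = {
    "part-170": ["170.3"],
    "170.3": ["170.3(a)", "cited-sec-201.1"],
    "170.3(a)": [],
    "cited-sec-201.1": [],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand_and_rank(seed, query_vec):
    """Rank the seed node's graph neighbors by similarity to the query,
    i.e. use the citation/hierarchy graph to constrain the search."""
    candidates = edges.get(seed, [])
    scored = [(n, cosine(embeddings[n], query_vec)) for n in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)

query = [0.2, 0.9, 0.2]  # pretend this came from a sentence encoder
print(expand_and_rank("170.3", query))
```

The point of the sketch is just the combination: graph edges pick the candidate set, vector similarity orders it.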
As for the dataset itself, I already output Vespa-formatted JSON (as noted in https://github.com/maxdotio/ecfr-prepare), and the resulting vectors from inference get appended to the original JSON doc as a field.
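Roughly like this - a sketch of wrapping a prepared paragraph and its vector in a Vespa-style feed document (the field names and namespace here are illustrative, not the actual ecfr-prepare schema):

```python
import json

def to_vespa_feed(doc_id, text, vector, doctype="paragraph"):
    """Wrap a paragraph and its embedding in a Vespa-style put document.
    'ecfr' namespace and field names are assumptions for illustration."""
    return {
        "put": f"id:ecfr:{doctype}::{doc_id}",
        "fields": {
            "text": text,
            "embedding": {"values": vector},  # vector appended as a field
        },
    }

doc = to_vespa_feed("170.3-a", "Enrollee means ...", [0.12, -0.05, 0.33])
print(json.dumps(doc, indent=2))
```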
I have a Vespa schema that I need to upload (it doesn't include the vector field yet, but that can be added using the Vespa vector search walkthroughs). It's been a busy day, but I'll quickly try to find a place to put it for now :)
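For anyone curious, adding the vector field per those walkthroughs looks roughly like this - a hypothetical schema fragment, not the actual one (the 384 dimension assumes a MiniLM-style encoder; adjust to whatever model you use):

```
schema ecfr_paragraph {
    document ecfr_paragraph {
        field text type string {
            indexing: summary | index
        }
        field embedding type tensor<float>(x[384]) {
            indexing: attribute | index
            attribute {
                distance-metric: angular
            }
        }
    }
}
```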
Yes, I was happy to see the modern changes. I agree Cornell Law had done a better job, although I think a lot of people use Google as the search tool and then link to their preferred site, since those are always the first two results.
My experience has been with 14 CFR and 21 CFR. I would love to see any tool you come up with in the future and would be happy to give you feedback.
I think vectorizing and eventually algorithmizing federal regulations is important work and will be important for outcome-driven federal policy, and I'm glad that you're doing it.
Thanks! I hope newcomers see the importance, beauty, and complexity of the dataset and run with it as well. The more interest the better. An "Augmented Federal Register" would be of real help for better crafting final rules as well - which is another area I'm looking into - as FR is truly organic.
As but one example: contracts often include the phrase "a reasonable effort".
Here's a definition: "Reasonable Efforts means, with respect to a given goal, the efforts that a reasonable person in the position of the promisor would use so as to achieve that goal as expeditiously as possible."
Awesome work! Curious how you handle paragraphs and niche language like federal regulations.
What are your favorite ways to do sentence and paragraph embeddings, and is there a framework you like where you can tune to custom data? Do you find fine-tuning your embedding model helpful?
Thanks! The post doesn't cover fine-tuning of the model, which would be absolutely necessary (but is out of scope for the post). Nils Reimers (the author of SBERT) has been on a speaking circuit covering Generative Pseudo Labelling, which handles the vocabulary gap of new domains that a pretrained SBERT model hasn't seen yet.
Great project! Definitely curious about this for lots of reasons. We can add it as a resource to our legal topic filter list at Breeze search whenever you go live; there may also be some collab opportunities.
>But if you don't know how to work in this hierarchy, it is difficult to find information. Also, if you don't know the specific keyword "Enrollee", you will spend a long time trying to find what you need. Contemporary search now uses dense vectors that embed language meaning, so recall is much more robust.
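To make that concrete, here's a toy sketch of the keyword-vs-dense-vector contrast. The vectors are hypothetical, as if produced by a sentence encoder where "enrollee" and "plan member" land close together:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical encoder outputs (made up for illustration).
vec_enrollee = [0.8, 0.5, 0.1]   # "Enrollee means an individual ..."
vec_member   = [0.7, 0.6, 0.2]   # query: "plan member"
vec_takeoff  = [0.1, 0.2, 0.9]   # unrelated text

# Keyword search misses: the query term isn't in the regulation text.
text = "Enrollee means an individual covered under the plan."
print("member" in text.lower())           # keyword miss

# Dense vectors recover the match via similarity in embedding space.
print(cosine(vec_member, vec_enrollee))   # high: semantically close
print(cosine(vec_member, vec_takeoff))    # low: unrelated
```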
One more situation where the raw power of hardware (building the vectors, in this case) beats the old "knowledge engineering" approach: previously one would use an ontology to get the generalized term and its siblings for the user's search term.
Curious if anyone has recommendations on good stashes or datasets of already-encoded embeddings? This sounds geeky, but to some extent I don't even "care" about the original text - I'd just love to get the embedding vectors and play with those.
Vectorizing usually refers to compilers (or people) rewriting code to take advantage of vector instructions (like AVX or SSE).
What the author did is more commonly referred to as embedding by the ML community.
So, a better title would be: Generating embeddings for the Code of Federal Regulations.