Hacker News
Vectorizing the Code of Federal Regulations (max.io)
66 points by gk1 65 days ago | hide | past | favorite | 19 comments

If anybody was confused by the title:

Vectorizing usually refers to compilers (or people) rewriting code to take advantage of vector instructions (like AVX or SSE).

What the author did is more commonly referred to as embedding by the ML community.

So, a better title would be: Generating embeddings for the Code of Federal Regulations.
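To make the terminology concrete, here is a deliberately simplified sketch of what "embedding" means: text goes in, a fixed-length float vector comes out. This toy hashes tokens into buckets, which is nothing like the transformer model the post actually uses; it only illustrates the shape of the operation.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Map text to a fixed-length vector by hashing tokens into buckets.
    A real model (e.g. SBERT) learns meaningful dimensions; this stub
    just shows the signature: string -> list of floats."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Stable hash so the same token always lands in the same bucket.
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, like most sentence embeddings

v = toy_embed("the enrollee must submit a claim")
print(len(v))  # 8
```

Whatever the model, the downstream machinery (nearest-neighbor search, indexing) only ever sees these fixed-length vectors.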

Actually it does both :)

Mighty is built in Rust with AVX turned on. But I get your point. I kinda dislike the name embeddings. It's super niche, and not easy to explain to most people I talk to who aren't working in ML. I always arrive at saying "well, embeddings are a vector…". So this saves a step.

Many more people know what a vector is if they have some college math experience.

Hi all, author here! My submission fell off the new page after 5 minutes sometime this morning, and I'm really glad to see it get re-posted.


I have worked with eCFR and have thought about some ideas for processing large sections and how it might be useful.

What are your larger plans here? Do you have an interface in mind for searching or organizing things?

Oh yeah, I could tinker forever; it's an amazing dataset that I think needs more attention from the ML community. Glad to see the working team at https://www.ecfr.gov/ finally making their search better, as Cornell Law has been the de facto go-to forever (for me at least).

I think an amazing eCFR search experiment would be transformer vectors in a graph, using the hierarchy, citations, and references as edges to (sub)paragraph and section nodes - perhaps even using a modified HNSW somehow. The graph that exists there now isn't leveraged enough.
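The graph idea above can be sketched with a plain adjacency structure: section/paragraph nodes with typed edges for hierarchy and citations. The node IDs and the citation below are hypothetical, and a real system would layer ANN search (e.g. HNSW) on top; this only shows the edge types being proposed.

```python
from collections import defaultdict

# edges[src] is a list of (dst, kind) pairs; kinds are "child" (hierarchy)
# and "cites" (cross-reference). IDs are illustrative, not real eCFR paths.
edges = defaultdict(list)

def add_edge(src, dst, kind):
    edges[src].append((dst, kind))

# Hierarchy: Title 21 -> Part 820 -> Section 820.30
add_edge("21", "21/820", "child")
add_edge("21/820", "21/820.30", "child")
# Hypothetical citation: 820.30 cites 820.40
add_edge("21/820.30", "21/820.40", "cites")

def neighbors(node, kind=None):
    """One-hop walk, optionally filtered by edge type, e.g. to expand an
    ANN hit into the sections it cites before re-ranking."""
    return [dst for dst, k in edges[node] if kind is None or k == kind]

print(neighbors("21/820.30", "cites"))  # ['21/820.40']
```

The point is that hierarchy and citations give you free, human-curated edges to traverse after (or during) a vector search, rather than relying on vector similarity alone.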

Per this dataset itself, I already output to Vespa formatted JSON (as noted in https://github.com/maxdotio/ecfr-prepare )...and the resulting vectors from the inference get appended to the original JSON doc as a field.
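Appending the vector to the JSON doc is a one-line operation; a hedged sketch follows. The document ID, field names, and values here are illustrative stand-ins, not the repo's exact schema, though the `{"values": [...]}` shape matches Vespa's documented JSON feed format for dense tensor fields.

```python
import json

# One document in Vespa "put" feed format; IDs and fields are made up.
doc = {
    "put": "id:ecfr:paragraph::21-820-30-a",
    "fields": {
        "text": "Each manufacturer shall establish procedures...",
    },
}

# Pretend this short vector came from the inference step.
vector = [0.12, -0.03, 0.88, 0.45]

# Append the embedding as an extra field on the same JSON doc.
doc["fields"]["embedding"] = {"values": vector}

print(sorted(doc["fields"].keys()))  # ['embedding', 'text']
```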

I have a Vespa schema that I need to upload (it doesn't include the vector field yet, but that can be added using the Vespa vector search walkthroughs). It's been a busy day, but I'll quickly try to find a place to put it for now :)

--EDIT-- Pushed the schema to the above repo, and some bash. You'll need Docker and to follow the Vespa MSMARCO instructions first at https://docs.vespa.ai/en/tutorials/text-search-semantic.html to get used to the engine.

Yes, I was happy to see the modern changes. I agree Cornell Law had done a better job, although I think a lot of people use Google as the search tool and then link to their preferred site, since those two are always the first results.

My experience has been with 14 CFR and 21 CFR. I would love to see any tool you come up with in the future and would be happy to give you feedback.

I think vectorizing and eventually algorithmizing federal regulations is important work that will matter for outcome-driven federal policy, and I'm glad that you're doing it.

Thanks! I hope newcomers see the importance, beauty, and complexity of the dataset and run with it as well. The more interested the better. "Augmented Federal Register" would be of real help for better crafting final rules as well - which is another area I'm looking into - as FR is truly organic.

It's completely impossible without strong AI.

As but one example: contracts often include the phrase "a reasonable effort".

Here's a definition: "Reasonable Efforts means, with respect to a given goal, the efforts that a reasonable person in the position of the promisor would use so as to achieve that goal as expeditiously as possible"

Try defining that in an algorithm!

I don't think full-automation would even be a goal - not when crafting laws for humans!

But perhaps better tools can be built to make the process more efficient, less biased, and less redundant.

Awesome work!!!! Curious how you handle paragraphs and niche language like federal regulations.

What are your favorite ways to do sentence and paragraph embeddings, and is there a framework you like that you can tune to custom data? Do you find fine-tuning your embedding model helpful?

Thanks! The post doesn't cover fine-tuning of the model, which would be absolutely necessary (but is out of scope for the post). Nils Reimers (the author of SBERT) has been on a speaking circuit covering Generative Pseudo Labelling, which handles the vocabulary gap of new domains that a pretrained SBERT model hasn't seen yet.
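For readers unfamiliar with Generative Pseudo Labelling, here is a stubbed sketch of its data pipeline only. Real GPL uses a trained query generator (a T5-style model), a dense retriever for hard-negative mining, and a cross-encoder for margin labels; the three functions below are crude stand-ins that exist purely to show how the pieces fit together.

```python
import random

def generate_query(passage):
    # Stub: real GPL samples synthetic queries from a generator model.
    return "what does this regulate: " + passage.split(".")[0].lower()

def mine_negative(passage, corpus):
    # Stub: real GPL retrieves hard negatives with a dense retriever.
    return random.choice([p for p in corpus if p != passage])

def cross_encoder_margin(query, pos, neg):
    # Stub: real GPL scores (query, passage) pairs with a cross-encoder
    # and trains the bi-encoder on the score margin pos - neg.
    q = set(query.split())
    return float(len(q & set(pos.split())) - len(q & set(neg.split())))

def gpl_training_examples(corpus):
    """Yield (query, positive, negative, margin) tuples: the training
    data a bi-encoder would be fine-tuned on."""
    for passage in corpus:
        query = generate_query(passage)
        neg = mine_negative(passage, corpus)
        yield query, passage, neg, cross_encoder_margin(query, passage, neg)

corpus = [
    "Enrollees must file claims within 90 days.",
    "Importers shall declare customs value.",
]
for query, pos, neg, margin in gpl_training_examples(corpus):
    print(query, "->", margin)
```

The key idea is that no human labels are needed: the queries and the margin labels are both generated, which is what makes GPL attractive for a domain like the CFR with its unusual vocabulary.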


Great project! Definitely curious about this for lots of reasons; we can add it as a resource to our legal topic filter list at Breeze search whenever you go live; there may also be some collab opportunities.

>But if you don't know how to work in this hierarchy, it is difficult to find information. Also, if you don't know the specific keyword "Enrollee", you will spend a long time trying to find what you need. Contemporary search now uses dense vectors that embed language meaning, so recall is much more robust.

One more situation where the raw power of hardware - building the vectors in this case - beats the old "knowledge engineering" approach: previously one would use an ontology to get a generalized term and its siblings for the user's search term.
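The "Enrollee" example above can be made concrete with cosine similarity over toy vectors. These 3-d vectors are chosen by hand purely to make the point; in a real system a model produces them and near-synonyms land close together. The query stands in for something like "member enrollment rules": note it shares no keyword with the top hit.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hand-made toy "embeddings" for three document titles.
vectors = {
    "Enrollee eligibility requirements": [0.9, 0.1, 0.0],
    "Member eligibility requirements":   [0.8, 0.2, 0.1],
    "Customs duties on imports":         [0.0, 0.1, 0.9],
}
# Pretend embedding of the query "member enrollment rules".
query = [0.85, 0.15, 0.05]

best = max(vectors, key=lambda k: cosine(query, vectors[k]))
print(best)  # Enrollee eligibility requirements
```

A keyword search for "member" would never surface the "Enrollee" document; the dense vectors retrieve it because the meanings are close, with no hand-built ontology required.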

Thanks for this. We've been doing something similar with the Universal Sentence Encoder en masse (https://www.tensorflow.org/hub/tutorials/semantic_similarity...)

Curious if anyone has recommendations on good stashes or datasets of already-encoded embeddings? This sounds geeky, but to some extent I don't even "care" about the original text; I'd love to just get the embedding vectors and play with those.

BEIR is what you’re looking for :). There should be stashes of vectors for the datasets floating around.


Have seen many attempts to work with the dataset! It’s a fun challenge. https://ecfr.report and ecfr.io come to mind, but .gov is still the best.

Cool project!

Do you have a working search demo people can try?

working on it ;)
