The core idea here is to use semantic search & LLMs to make it easier to search the tens of thousands of different demographic indicators available from the US Census API. I'm definitely not the first to try something like this, but I think this solution has some nice properties that I haven't seen in similar tools:
- Barring serious bugs, BlockAtlas won't "lie" to users. It may fail to find something relevant, or misunderstand a query, but the results (map title & data) will faithfully reflect the underlying Census estimates.
- BlockAtlas covers a much wider set of Census data than other tools I've seen. Almost every "Detailed Table" from the American Community Survey is available, across the entire range of release years (2005-2022). There are ~29,000 demographic indicators in the search index as it stands, plus some combinations of indicators (e.g. "X and above") for popular tables.
Similar LLM+Census things I've seen have used an approach akin to "replicate some data into my DB, have LLM generate SQL over it", which makes it hard to avoid issues with both of these points. I've taken a bit of a different approach - creating a search index over metadata, i.e. searching for API parameters and pulling the data itself directly from the Census. That way, the LLM is limited to "selecting between known-valid options", rather than generating a SQL query and displaying the results under a potentially-misleading name.
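A rough sketch of the "selecting between known-valid options" idea, not the actual BlockAtlas code. The `catalog`, `resolveSelection`, and `buildCensusUrl` names are made up for illustration, and the two-entry catalog stands in for the real search index. The point is that the model's output is only ever used as a key into known metadata, and both the map title and the API request come from that metadata rather than from generated text.

```typescript
// Hypothetical metadata record; the real index schema is not described in this post.
interface CensusVariable {
  id: string;      // Census variable ID, e.g. "B01003_001E"
  label: string;   // title taken from Census metadata, never from the LLM
  year: number;    // release year
  dataset: string; // dataset path, e.g. "acs/acs5" (assumed here)
}

// Tiny stand-in for the search index of known-valid variables.
const catalog = new Map<string, CensusVariable>([
  ["B01003_001E", { id: "B01003_001E", label: "Total Population", year: 2022, dataset: "acs/acs5" }],
  ["B01001_001E", { id: "B01001_001E", label: "Sex by Age: Total", year: 2022, dataset: "acs/acs5" }],
]);

// Treat the model's choice purely as a lookup key; anything else is rejected.
function resolveSelection(llmChoice: string): CensusVariable {
  const variable = catalog.get(llmChoice.trim());
  if (!variable) {
    throw new Error(`Unknown variable: ${llmChoice}`);
  }
  return variable;
}

// Build the Census API request from catalog metadata only.
function buildCensusUrl(v: CensusVariable, geography = "state:*"): string {
  const params = new URLSearchParams({ get: `NAME,${v.id}`, for: geography });
  return `https://api.census.gov/data/${v.year}/${v.dataset}?${params.toString()}`;
}

const chosen = resolveSelection("B01003_001E");
console.log(chosen.label, buildCensusUrl(chosen));
```

If the model returns anything that is not in the catalog (a made-up ID, a SQL fragment, a free-form title), `resolveSelection` throws instead of producing a map.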
This is the second iteration of BlockAtlas - the first was a ChatGPT plugin. The LLM would query my API for candidate variables, and generate a link to my site with the variables to display and the map title as query parameters. This made for a cool demo but ultimately was very hard to trust - the LLM could select a map title which was not at all reflected by the variables in question, or could combine variables in a nonsensical way, so it failed to solve the "don't lie to users" problem. The plugin ("GPT" now) is still available, but the standalone search engine is my effort to remedy those issues.
The tech stack: The frontend uses React for the search form and Leaflet map. The API is written in TypeScript and hosted on Cloudflare Workers. The search indexes are in a Postgres DB using pgvector + OpenAI embeddings as well as pg's built-in full-text-search feature, and the OpenAI API is used for query-parsing and result reranking/selection as well (gpt-3.5-turbo).
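For anyone curious how two result lists (one from vector search, one from full-text search) can be merged before the reranking step, here is a minimal sketch using reciprocal rank fusion. This is one common technique and an assumption on my part for illustration; it is not necessarily how BlockAtlas combines its results. The function name and the constant `k = 60` (a conventional default for this method) are illustrative.

```typescript
// Merge several ranked lists of IDs into one ranking.
// Each list contributes 1 / (k + rank) to an item's score, with rank starting at 1,
// so items that place well in more than one list rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Sort by fused score, highest first, and return just the IDs.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Placeholder IDs standing in for Census variables.
const vectorHits = ["A", "B", "C"];   // e.g. ordered by embedding distance
const fullTextHits = ["B", "D", "A"]; // e.g. ordered by full-text rank
console.log(reciprocalRankFusion([vectorHits, fullTextHits]));
```

Items that appear in both lists ("A" and "B" here) end up ahead of items that appear in only one, which is the main appeal of rank-based fusion: it needs no calibration between embedding distances and full-text scores.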
I think there's a ton of room for improvement here, but wanted to gauge public interest a bit before putting more time into this (I have a newborn and a full time job, so it's been hard to carve out time to work on this lately).
For some reason, when I gave it a very broad query I got the suggested result "[Table B18104?] Sex by Age by Cognitive Difficulty (Civilian noninstitutionalized population 5 years and over): Total".
No idea why it picked that table instead of the more general "[Table B01003]: Total Population" or "[Table B01001] Sex by Age". In general, I think a query's first result should be the least specific match.
And the embeddings/full-text-search mishandle things that have no close match: the query "People who look like Kevin Bacon" returns "Number of People: Population by Ancestry: Basque (2022)"
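One possible mitigation, sketched under assumptions: only accept the nearest neighbour if its similarity clears a minimum threshold, and otherwise report "no match". The `Candidate` shape, the function names, and the `0.8` cutoff are all hypothetical; a real cutoff would have to be tuned against the actual embedding model, and the toy two-dimensional vectors below are only there to make the example runnable.

```typescript
// Hypothetical candidate record: a label plus its embedding vector.
interface Candidate {
  label: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the closest candidate, or null when nothing is similar enough.
function bestMatchOrNull(
  query: number[],
  candidates: Candidate[],
  minSimilarity = 0.8, // assumed cutoff, needs tuning in practice
): Candidate | null {
  let best: Candidate | null = null;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = cosineSimilarity(query, candidate.embedding);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return bestScore >= minSimilarity ? best : null;
}

const candidates: Candidate[] = [
  { label: "Population by Ancestry: Basque", embedding: [1, 0] },
  { label: "Total Population", embedding: [0, 1] },
];
console.log(bestMatchOrNull([0, 2], candidates)); // close to "Total Population"
console.log(bestMatchOrNull([1, 1], candidates)); // equally far from both, so no match
```

A query like the Kevin Bacon one would ideally land in the second case: something is always "nearest", but nearest is not the same as relevant.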