Hacker News new | past | comments | ask | show | jobs | submit | login
Show HN: I built an interactive map and search engine for US Census data (blockatlas.com)
3 points by NameError 55 days ago | 6 comments
The core idea here is to use semantic search & LLMs to make it easier to search the tens of thousands of different demographic indicators available from the US Census API. I'm definitely not the first to try something like this, but I think this solution has some nice properties that I haven't seen in similar tools:

- Barring serious bugs, BlockAtlas won't "lie" to users. It may fail to find something relevant, or misunderstand a query, but the results (map title & data) will faithfully reflect the underlying Census estimates

- BlockAtlas covers a much wider set of Census data than other tools I've seen. Almost every "Detailed Table" from the American Community Survey is available, across the entire range of release years (2005-2022). There are ~29,000 demographic indicators in the search index as it stands, plus some combinations of indicators (e.g. "X and above") for popular tables

Similar LLM+Census tools I've seen use an approach akin to "replicate some data into my DB, have the LLM generate SQL over it", which makes it hard to avoid issues with both of these points. I've taken a different approach: creating a search index over metadata, i.e. searching for API parameters and pulling the data itself directly from the Census. That way, the LLM is limited to "selecting between known-valid options", rather than generating a SQL query and displaying the results under a potentially-misleading name.
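To make the "selecting between known-valid options" idea concrete, here's a minimal sketch (the interface, IDs, and function names are illustrative, not BlockAtlas's actual code): the model's choice is only honored if it matches a known candidate, and the displayed title is always the official Census label, never model-generated text.

```typescript
// Illustrative sketch of constraining the LLM to known-valid options.
interface CensusVariable {
  id: string;    // Census variable ID, e.g. "B01003_001E"
  label: string; // official Census label, used verbatim as the map title
}

function pickVariable(
  llmChoice: string,
  candidates: CensusVariable[]
): CensusVariable | null {
  // Anything outside the candidate set is rejected outright.
  return candidates.find((v) => v.id === llmChoice) ?? null;
}

const candidates: CensusVariable[] = [
  { id: "B01003_001E", label: "Total Population" },
  { id: "B01001_001E", label: "Sex by Age: Total" },
];

console.log(pickVariable("B01003_001E", candidates)?.label); // "Total Population"
console.log(pickVariable("DROP TABLE maps", candidates));    // null
```

Because the worst the model can do is pick the wrong valid option, a bad result is "irrelevant", never "fabricated".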

This is the second iteration of BlockAtlas - the first was a ChatGPT plugin. The LLM would query my API for candidate variables, and generate a link to my site with the variables to display and the map title as query parameters. This made for a cool demo but ultimately was very hard to trust - the LLM could select a map title that was not at all reflected by the variables in question, or could combine variables in a nonsensical way, so it failed to solve the "don't lie to users" problem. The plugin ("GPT" now) is still available, but the standalone search engine is my effort to remedy those issues.

The tech stack: the frontend uses React for the search form and the Leaflet map. The API is written in TypeScript and hosted on Cloudflare Workers. The search indexes live in a Postgres DB using pgvector + OpenAI embeddings as well as Postgres's built-in full-text search, and the OpenAI API (gpt-3.5-turbo) is used for query parsing and result reranking/selection.
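The embedding half of that search boils down to nearest-neighbor ranking by vector similarity. A toy TypeScript sketch of the idea (in the real system the ranking happens inside Postgres via pgvector, and embeddings come from the OpenAI API; nothing here is BlockAtlas's actual code):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank candidate rows by similarity to the query embedding, best first.
function rank<T extends { embedding: number[] }>(query: number[], rows: T[]): T[] {
  return [...rows].sort(
    (x, y) => cosine(query, y.embedding) - cosine(query, x.embedding)
  );
}
```

pgvector exposes the same operation as a distance operator in SQL, so the top-k candidates can be fetched with an ordinary `ORDER BY ... LIMIT k` query and then handed to the LLM stage.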

I think there's a ton of room for improvement here, but wanted to gauge public interest a bit before putting more time into this (I have a newborn and a full time job, so it's been hard to carve out time to work on this lately).




Great work.

For some reason, when I gave it a very broad query I got the suggested result "[Table B18104?] Sex by Age by Cognitive Difficulty (Civilian noninstitutionalized population 5 years and over): Total".

No idea why it picked that table instead of the more general "[Table B01003]: Total Population" or "[Table B01001] Sex by Age". In general I think a query's first result should be the least specific match.

And the embeddings/full-text-search mishandle things that have no close match: the query "People who look like Kevin Bacon" returns "Number of People: Population by Ancestry: Basque (2022)"


Hi - thank you for trying it out! These are both definitely real issues with the current approach. I've tried to rein in the "selecting an overly specific table" issue in the final "LLM-selects-from-search-results" stage but clearly have some work left to do there.

As far as the second issue - when people search for things way outside of the available data - I have not done much to address this, but really should. This happens for more plausible queries too, e.g. "Crime Rate" seems like it could be cataloged by the Census, but is not part of the tables indexed by the site (ACS Detailed Tables). It selects variables somewhat randomly here when it should really say something like "no relevant results found".
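One plausible shape for that fix, sketched in TypeScript (the threshold value is an assumption and would need tuning against real queries):

```typescript
// Hypothetical out-of-domain guard: if even the best search hit scores
// below a cutoff, answer "no relevant results" instead of returning the
// least-bad table.
const MIN_SIMILARITY = 0.8; // assumed cutoff, not a tuned value

function bestMatchOrNone<T>(
  scored: { item: T; similarity: number }[]
): T | null {
  if (scored.length === 0) return null;
  const best = scored.reduce((a, b) => (b.similarity > a.similarity ? b : a));
  return best.similarity >= MIN_SIMILARITY ? best.item : null;
}
```

A query like "People who look like Kevin Bacon" would then fall below the cutoff and produce an honest "no results" instead of the Basque-ancestry table.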


The query engine doesn't understand the area of a state/county/zipcode(/census tract), unlike the official Census viewer https://www.census.gov/library/visualizations/2021/geo/demog...

When I query for "population density" I only get tables of total population, not "people/sq mile".

Also, the default legend breaks are exponential (good) but not rounded to the nearest n significant figures. And the color scheme is monochrome green (hard to quickly read the map).


Thanks for trying it out - adding land area and supporting queries like "population density" will definitely be doable, and I'd like to make the legend and map color scheme a bit better (and ideally user-configurable) as well.

Really appreciate the feedback!


Uhuh. So when I query for "counties with highest population density", that specifies both the table and the geographical unit. (But currently that gives "B26001_001E: Group Quarters Population (Population in group quarters): Total")


I just made an update which should make "population density"-type queries work a bit better - there is now an option to divide any variable by 'LAND_AREA' (though this should probably be limited a bit), and this option is automatically selected for queries including 'density' or a few related strings, e.g. 'per sq. mile'.
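A minimal sketch of how such a trigger-plus-division could work, assuming land area in square miles (the names and trigger list here are hypothetical, not the actual update):

```typescript
// Hypothetical trigger strings for selecting the 'divide by LAND_AREA' option.
const DENSITY_TRIGGERS = ["density", "per sq. mile", "per square mile"];

function wantsDensity(query: string): boolean {
  const q = query.toLowerCase();
  return DENSITY_TRIGGERS.some((t) => q.includes(t));
}

// Population estimate divided by land area gives people per square mile.
function populationDensity(population: number, landAreaSqMi: number): number {
  return population / landAreaSqMi;
}

console.log(wantsDensity("counties with highest population density")); // true
console.log(populationDensity(10000, 25)); // 400
```

Keyword triggers like this are brittle compared to letting the LLM decide, but they are cheap and can't misfire into a fabricated metric - the worst case is an unwanted (but correctly labeled) division.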



