Last year, I needed to find all software companies in São Paulo for a project. The good news: Brazil publishes all company registrations as open data at dados.gov.br. The bad news: it's 85GB of ISO-8859-1 encoded CSVs with semicolon delimiters, decimal commas, and dates like "00000000" meaning NULL. My laptop crashed after 4 hours trying to import just one file.
So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline
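To give a sense of what "handles this mess" means: even reading one raw file without exhausting memory takes some care. Here is a minimal sketch with pandas, assuming the file characteristics described above; the filename and chunk size are illustrative, not the pipeline's actual code:

import pandas as pd

# The raw files are ISO-8859-1, semicolon-delimited, and headerless; reading
# every column as a string keeps CNPJ leading zeros intact, and chunking
# avoids loading an entire multi-GB file at once.
reader = pd.read_csv(
    "Estabelecimentos0.csv",   # illustrative filename
    encoding="latin-1",
    sep=";",
    dtype=str,
    header=None,
    chunksize=500_000,
)

rows = 0
for chunk in reader:
    # numeric columns still hold decimal-comma strings like "1234,56" here;
    # convert with str.replace(",", ".") before casting to float
    rows += len(chunk)
print(rows)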
THE PROBLEM NOBODY TALKS ABOUT
Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks on the same cleanup steps (a rough sketch follows this list):
- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
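A rough sketch of the cleanup those bullets imply; the function names and exact rules are mine, not the pipeline's API:

import re
from datetime import date
from typing import Optional

def normalize_cnpj(raw: str) -> str:
    # "12.345.678/0001-90" and "12345678000190" should compare equal
    return re.sub(r"\D", "", raw).zfill(14)

def parse_activity_date(raw: str) -> Optional[date]:
    # dates arrive as YYYYMMDD strings; "00000000" means "unknown", not year 0
    if not raw or raw.strip("0") == "":
        return None
    parsed = date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))
    # some registrations legitimately carry future dates, so keep them and let
    # callers decide whether to flag anything beyond today
    return parsed

def fix_mojibake(text: str) -> str:
    # some strings are UTF-8 bytes that were decoded as Latin-1 ("SÃ£o Paulo");
    # re-encoding as Latin-1 and decoding as UTF-8 recovers the original
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text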
WHAT YOU CAN NOW DO IN SQL
Find all fintechs founded after 2020 in São Paulo:
SELECT COUNT(*) FROM estabelecimentos e
JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
WHERE e.uf = 'SP'
AND e.cnae_fiscal_principal LIKE '64%'      -- CNAE division 64: financial services
AND e.data_inicio_atividade > '2020-01-01'
AND emp.porte IN ('01', '03');              -- porte 01 = micro, 03 = small
Result: 8,426 companies (as of Jun 2025)
SURPRISING THINGS I FOUND
1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.
2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper; a query sketch for spotting them follows this list.
3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they all turned out to be government entities registered before the classification system existed.
4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.
5. The MEI Boom: Micro-entrepreneur (MEI) registrations grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.
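The "ghost companies" in item 2 fall out of a simple anti-join. A sketch with psycopg2 against the schema used in the SQL example above (connection details are placeholders):

import psycopg2

# connection parameters are placeholders
conn = psycopg2.connect("dbname=cnpj user=postgres host=localhost")

GHOSTS_SQL = """
-- companies registered in empresas with no rows at all in estabelecimentos
SELECT COUNT(*)
FROM empresas emp
LEFT JOIN estabelecimentos e ON e.cnpj_basico = emp.cnpj_basico
WHERE e.cnpj_basico IS NULL
"""

with conn, conn.cursor() as cur:
    cur.execute(GHOSTS_SQL)
    print(cur.fetchone()[0])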
TECHNICAL BITS
The pipeline:
- Auto-detects available RAM and adapts its strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of row-by-row INSERTs (10x faster); both ideas are sketched after this list
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that official files omit
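The two ideas that do most of the heavy lifting, in simplified form. This is a sketch of the approach, not the pipeline's actual code; psutil and psycopg2 are my library choices, and the middle memory tier is my guess:

import io
import psutil
import psycopg2

def pick_strategy() -> str:
    # choose a processing strategy from total system memory
    gb = psutil.virtual_memory().total / 1024**3
    if gb < 8:
        return "streaming"   # small chunks, one file at a time
    if gb > 32:
        return "parallel"    # several files processed concurrently
    return "chunked"         # in between: larger single-process batches

def copy_rows(conn, table, rows):
    # bulk-load with COPY, which avoids per-row INSERT round-trips;
    # `table` is assumed to be a trusted identifier, and fields containing
    # ';' would need proper CSV quoting (omitted for brevity)
    buf = io.StringIO()
    for row in rows:
        buf.write(";".join("" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv, DELIMITER ';')", buf
        )
    conn.commit()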
Processing 60M companies:
- VPS (4GB RAM): ~8 hours
- Desktop (16GB): ~2 hours
- Server (64GB): ~1 hour
THE CODE
It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline
One command setup:
docker-compose --profile postgres up --build
Or if you prefer Python:
python setup.py # Interactive configuration
python main.py # Start processing
WHY OPEN SOURCE THIS?
I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent two months building an ETL pipeline that processed 10% of the data before crashing.
The Brazilian tech ecosystem loses countless engineering hours reinventing this wheel. That's time that could be spent building actual products.
COMMUNITY RESPONSE
I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."
QUESTIONS FOR HN
1. What other government datasets are this painful? I'm thinking of tackling more.
2. For those who've worked with government data - what's your worst encoding/format horror story?
3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.
The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.