Ask HN: How Do You Clean and Structure Data at Work?
5 points by dataflowmapper 48 days ago | 9 comments
Curious to hear from different industries: what's the most frustrating or repetitive data task you deal with, and how are you solving it?

I do software implementations, and we get customer data exports from legacy systems as CSV or XLSX. Cleaning and mapping them for import is always a pain.

Anyone else constantly structuring, formatting, or fixing data? How do you deal with it, any good tools or workarounds?




> we get customer data exports from legacy systems as CSV or XLSX

Convert the XLSX to CSV. Every database system out there has blazing-fast import of CSV files.

As a SQL wizard, I prefer to use SQL to clean and reshape data. So my first goal is to get the data into a SQL DB as quickly as possible: no cleaning, no reshaping, just a raw dump. Now the data is in my house. I clean and reshape the data with batch UPDATE/INSERT statements. Finally, I batch-insert into the target tables.
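A minimal sketch of that raw-dump-then-clean workflow, using SQLite and Python. The table and column names, the sample rows, and the cleanup rules (trimming whitespace, normalizing US-style dates) are illustrative assumptions, not the commenter's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical legacy export rows (assumed columns: name, joined).
rows = [
    {"name": "  Alice ", "joined": "2021-03-05"},
    {"name": "BOB", "joined": "03/05/2021"},
]

# 1. Raw dump into a staging table: everything lands as TEXT, no cleaning yet.
conn.execute("CREATE TABLE staging (name TEXT, joined TEXT)")
conn.executemany("INSERT INTO staging VALUES (:name, :joined)", rows)

# 2. Clean and reshape with batch UPDATE statements.
conn.execute("UPDATE staging SET name = TRIM(name)")
# Normalize US-style MM/DD/YYYY dates to ISO 8601 (YYYY-MM-DD).
conn.execute("""
    UPDATE staging
    SET joined = substr(joined, 7, 4) || '-' ||
                 substr(joined, 1, 2) || '-' ||
                 substr(joined, 4, 2)
    WHERE joined LIKE '__/__/____'
""")

# 3. Batch insert into the target table.
conn.execute("CREATE TABLE customers (name TEXT, joined TEXT)")
conn.execute("INSERT INTO customers SELECT name, joined FROM staging")

print(conn.execute("SELECT name, joined FROM customers").fetchall())
```

The point of the staging table is that a load can never fail on dirty data; all the validation and reshaping happens afterwards, in set-based SQL, where it's easy to inspect and rerun.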

> what’s the most frustrating

Every import job is a custom scenario. I feel special tools don't give you much. You have to understand both the source and destination data to clean and re-shape it. Tools don't have that understanding. AI is less than worthless. At the end of the day you have to roll up your sleeves and start shaping data.


I don't deal with this kind of job much, but sometimes I have to migrate data from one system to another. The cleaning is hard and specific to the data, but most of the time I was able to handle it with regex find-and-replace and some command-line tools. For mapping, I would take 10 lines out of the CSV for a test import, use AI or map the fields manually, and then do the batch import once the mapping is correct. AI can also help you clean and transform the data, but the output may not be consistent.
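A small sketch of that approach: regex-based cleanup of a couple of fields, then a slice of the first rows to use as a test import. The column names, sample data, and patterns are assumptions for illustration only:

```python
import csv
import io
import re

# Hypothetical legacy CSV export (assumed columns: name, phone).
raw = """name,phone
"Smith,  John",(555) 123-4567
"Doe,Jane",555.987.6543
"""

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    # Normalize spacing after the comma in "Last, First" names.
    row["name"] = re.sub(r",\s*", ", ", row["name"])
    # Strip everything that isn't a digit from phone numbers.
    row["phone"] = re.sub(r"\D", "", row["phone"])
    cleaned.append(row)

# Take a small sample (e.g. the first 10 rows) for a test import,
# to verify the field mapping before running the full batch.
sample = cleaned[:10]
print(sample)
```

Once the sample imports cleanly into the target system, the same loop runs over the full file.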


I've played around with regex for some use cases like cleaning up names, but I don't know the syntax well enough to manipulate data on the fly. How have you been leveraging AI for those use cases you mentioned?


I use (and also develop) RAMADDA's SeeSV package: https://ramadda.org/repository/a/seesv

It's a Java-based package for data wrangling that reads all sorts of formats, offers 300+ commands for transmogrifying the data, and supports a number of output formats. There is an interactive interface provided by RAMADDA and a stand-alone command-line release.


Thanks for sharing, this is actually pretty cool. I could see how this could be a really flexible tool for power users. How'd you get into developing it? Were you trying to solve a specific problem you ran into?


I found I was continually trying to manipulate information in Excel and CSV files from my software business. So I built a drag and drop tool for doing it: https://www.easydatatransform.com .


I also have to do this kind of work (with financial data), and actually built a product to make it easy. It's currently in closed beta but please send me an email (listed in my profile) if this is truly a frequent pain for you and you'd like to try it out.


Datasette

Visidata

More tools at: https://github.com/dbohdan/structured-text-tools

I also often use PyCharm Community Edition with multi-cursor ("Mark next occurrence" in the keymap) to live-edit text files.


Your username suggests that you might already have an answer to this.

