Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: What tools do you use for data munging and merging?
114 points by hellectronic on Jan 13, 2020 | hide | past | favorite | 72 comments

what tools do you use or know about, e.g. something like Talend DataPreparation (https://www.talend.com/products/data-preparation) ?

I found OpenRefine (http://openrefine.org), which is browser based like Talend DataPreparation.

It would be nice to have a "true" desktop application :)

Looks like you're looking for ETL solutions. It's funny reading some of the replies here, you can tell who's coming from a more BI background compared to a software engineering background.

Anyways, I think Alteryx does this job really well. It's great for ETL that the average Joe can pick up and learn very quickly. Sure, the UX could be improved, but it's miles better than AWS' Data Pipeline and other tools. It is a bit costly like another user mentioned, but well worth it IMO. Tableau has introduced some new products the past year or two to compete in this space as well, so if you use Tableau for reporting, look into their new offerings. Dell introduced an ETL tool a few years ago called Boomi. It had some promise a few years ago and might be something to consider. I have no idea what it costs though. Another option is SSIS if the data is eventually going to live in SQL Server or some Microsoft database.

Ultimately, I would consider options based on your target database/environment.

I totally agree with your first sentence. As a former enterprise data warehouse architect, it's interesting seeing some of the responses. Surprised I'm not seeing more Informatica and SSIS. I'll also say SSIS is a good solution for nearly any data storage platform and is included in a SQL Server license so might be a good, cheap enterprise solution if you already have SQL Server.

From my experience SSIS suffers mightily from the Write-Once Read Only problem as well as huge manual click-fests when package metadata changes. That and a lack of development from MSFT with obvious direction towards ADF means SSIS is not really great for modern data architecture.

If you're a MSFT shop then sure and if you're willing to drop the box-line GUI and move to .NET assembly-based packages even more so, but I've spent enough of my life keeping SSIS running and not going back.

+1 for SSIS. Used it extensively in a previous position at a large fortune 500 insurance company

I do not search for an ETL solution, but thanks for your reply because I am using Pentaho Data Integration for ETL and reading about other tools can help :)

A project I'm using, which I'm also one of the authors of, is OctoSQL[1].

It allows you to transform and join data from different data sources (including databases like mysql, postgres, redis, and CSV or JSON files, more to come) using a familiar interface - plain SQL.

Other than that, I like Exploratory[2].

And for nested JSON data I use another project of mine, jql[3] - an easier to use (in my opinion) lispy alternative to jq. (originally inspired by a hackernews comment actually)




Interesting project and I like the name. It seems that the underlying relationship in each data source has to be relational though (even for JSON files) - is that a fundamental limitation of using SQL as the glue language, or are there plans to integrate hierarchical models of data in some form as well?

If you mean hierarchical data querying (like nested json, or json in sql rows), then that's very high on our radar.

OctoSQL sounds interesting thanks.

I try to avoid GUI-based tools as they generally don't lend themselves well to version control, code reuse, etc. There are several orchestration tools that allow you to create arbitrarily complex ETL pipelines using SQL, Python, Spark, etc. Here are links to a few:





dbt is a fantastic tool. We really on it for basically all of our transformations at GitLab.

+1 for dbt.

Great tool with a really active community along with it

Dbt looks awesome. Too bad no oracle support.

Hi there - I run product for dbt - happy to chat more about Oracle support if you're interested. There are a handful of community-supported dbt plugins out in the wild. Oracle could very well be one of the next ones. Check some of these out to see what's involved in building a database plugin:

- https://github.com/fishtown-analytics/dbt-spark

- https://github.com/fishtown-analytics/dbt-presto

- https://github.com/mikaelene/dbt-sqlserver

- https://github.com/jacobm001/dbt-mssql

cool. would be an interested user, but building a plug in ... beyond my team's ability.

QuestDB (https://www.questdb.io) might help. Import your file(s) by drag and drop directly into the web console. Once data is loaded, you can run SQL queries in the console to order, format, concatenate, join (including time-based joins), etc. Once finished, you can export your formatted results back to CSV. Also, it's pretty quick!

Visidata (http://visidata.org/) is my tool of choice these days.

yup lots of love for visidata, a nice quick way to eyeball data and then munge: https://visidata.org/docs/join/ before proper carpentry with pandas.

That looks interesting, thanks.

I really like R / Tidyverse, but that requires that you have more memory than your datasize (especially if you are using narrow and long dataframes, such as recommended by tidyverse) and it also requires you to code, instead of using a UI

Is there really no way of processing dataframes by chunks in the R ecosystem? I'm firmly in the Python camp but I wonder how much I'm missing out on in the R world.

There are many ways to process larger than RAM objects in R.

The dbplyr package makes it possible to use a local db table as if they are in-memory data frames (https://dbplyr.tidyverse.org/).

https://diskframe.com/ is also a pretty great solution that processes data by chunks.

Sorry to butt in, but could you point me to a resource on processing data in chunks using Python?

We use Talend. It does have a gigantic Java-based Windows desktop application. It's pretty powerful and I don't hate it. We looked at MuleSoft as well but it's not ready for Enterprise prime-time like we need it to be.

The data preparation Wikipedia page mentions Talend by name along with Paxata, Trifacta, Alteryx, and Ataccama.

Alteryx is very good, especially if you want to hand off maintenance and operation of data processing flows to non coders.

The python ecosystem is really good here.

I do a lot of exploratory coding in ipython, though the threshold for "switch to a real editor and run git init and poetry new" is pretty low.

Want to munge CSV? stdlib, or pandas(https://pandas.pydata.org/pandas-docs/stable/)

Want to munge JSON? stdlib.

Want to munge a database? pandas, stdlib, anything that speaks sqlalchemy

Want to validate your json/sql/CSV/whatever and have it come in as a bunch of structured classes not std types? jsonschema https://pypi.org/project/jsonschema/, attrs+marshmallow, attrs+cattrs. http://www.attrs.org/en/stable/, https://desert.readthedocs.io/en/latest/ https://marshmallow.readthedocs.io/en/stable/ https://github.com/Tinche/cattrs

Want to transform csv into sqlite? https://github.com/simonw/csvs-to-sqlite

Want to serve sqlite as a REST api? https://datasette.readthedocs.io/en/stable/

Want to stuff simple things into a database really fast? https://dataset.readthedocs.io/en/latest/

Want to flip numerical data around? Numpy.https://docs.scipy.org/doc/numpy/reference/

Want to model it? Scipy https://www.scipy.org/docs.html, pandas

Want to plot it? seaborn https://seaborn.pydata.org/, plotnine https://plotnine.readthedocs.io/en/stable/

Want to futz around with structured data in a sensible way? glom https://glom.readthedocs.io/en/latest/, python-lenses https://github.com/ingolemo/python-lenses.

Want to spit out tabular data? Tabulate https://github.com/astanin/python-tabulate

Want to figure out where not to eat in chicago? built-ins: counter, defaultdict, comprehensions: https://www.youtube.com/watch?v=lyDLAutA88s

There's a LOT you can do pretty fast, and I can more or less hammer out a basic ETL cli script in my sleep at this point.

Sweet. Thanks for the links to seaborn and plotnine. I hadn't seen those before.

I do of automation with csv, yaml, openpyxl, and jinja2. Wrapped in a simple GUI (Qt, Gtk, or Tk, pick your poison) so my non-programmer colleagues can quickly run it and get a PDF report, generated with reportlab.

I think it is not that rich when dealing with binary data.

fsvo 'binary', i think i agree. I've written my fair share of struct.unpack-heavy code, and it's just tedious.

I also think that __str__ and __repr__ behavior on bytes is misleading at best. I do not like getting ascii values for some bytes and not others... and iterating to get base-10 ints!?

Pillow[simd], soundfile, and imageio can do nice things for image and sound data.

Thank you for the list.

I'm working in this area, been doing a lot of "exploratory coding" that has evolved into something I find quite powerful, a modern spreadsheet application with features that I think makes it suited to the current time. (Summary at https://lightsheets.app)

At the moment I'm working on improving performance. I can already load 1 million rows into it without too much trouble, and the next step is to load some Kaggle-size CSV's (e.g. 5.6 GB) and then be able to run data cleaning or other transforms on it without it choking.

If anyone's interested in this kind of stuff feel free to drop me an email (in my profile)!

Usually spark, but it depends on what the source data is. Whether it needs more preprocessing, etc. for json, there are some good command- line tools available for stripping, flattening, if u need to do that. Spark can usually handle most of the data sources.

Sublime text isn't as featured as Talend tools but it does what it does well and quickly and stays out of my way. Great for loading up a text file, scrolling around, and doing quick find/replaces.

Shell scripts using mostly sed and awk go a very, very long way when I want something repeatable or I'm dealing with a lot of data. And, when that starts getting heavy or long and I need more structure, python.

GUIs just get in the way in this space. By necessity, they need to be opinionated because there's only so much screen to get around in. They also tend to create artifacts that aren't quick to understand later and are locked into some particular product or way of doing things.

SQL, Python tools such as Pandas, Rapids.ai (if you have a recent Nvidia GPU) and Talend are all good choices. If it is a small amount of data you can just use Python or Bash to pretty quickly accomplish what you need.

I use SQLite myself, although I would like that those who provide the data should make the data available in a format such as CSV, or TSV, or a SQLite database, or to have some sort of protocol which a SQLite virtual table module can be made to access any data using such a protocol, so that it is possible to do so without having to deal with complicated stuff in a web browser or whatever else it may be, sometimes which is not even compatible with my computer. SQLite is common on all computers, I think, and CSV or TSV will be simple enough to work with many programs, so it should do.

I have recently released a Windows/Mac tool for data munging: https://www.easydatatransform.com

It is a GUI based solution with an emphasis on ease of use, aimed at people who aren't coders or data science professionals. For example a marketer who has several different Excel and CSV files they need to merge, clean, filter and restructure to create a report. Any feedback would be very welcome.

Shell scripts and simple Java applications. Not interested in navigating/learning a GUI which may or may not work for me.

Downloading everything locally isn't desirable for me.

https://www.getdbt.com/ is excellent if your users are SQL-friendly.

At this point, why wouldn't you use Excel?

What use cases do you have that go beyond Excel?

I say this as someone that develops pipelines in notebooks and workflow frameworks.


It's been a few years but I use to use Altova suite specifically MapForce (https://www.altova.com/mapforce) DiffDog, etc. Like I said it's been a few years but it was super powerful, this was back in my XML days when building stylesheets was a big part of my job.

The installation from the docs on the readme is broken!

Just to put it here on HN, the installation command is corrected now. You need to have go modules turned on to use go get for installation.

Structured data rarely lends itself to a GUI paradigm because its structure is arbitrary. I would argue that it would only fit with a GUI if it represents something specific that said GUI explicitly models (e.g., a very specific lump of JSON could be interpreted as a tree, etc.). As such, command line and scripting tools are vastly superior.

The model works absolutely fine with structured data. In these kinds of applications, you usually connect different nodes on a canvas, each node representing a data source, join, transformation etc. and each connection represents the flow of data.

But those nodes are going to be either relatively limited in their ability, highly specific, or just expose scripting tools; the linkage between nodes is trivial. Why waste time connecting all that up visually, when a good shell will do the job?

I agree with this and I also prefer a programming language over a GUI. My point was that there isn't a limitation of GUI ETL tools that's specific to structured data.

> Structured data rarely lends itself to a GUI paradigm because its structure is arbitrary.

I really can't tell what this means.

There is an incredible ecosystem of tools used for "data preparation" that it almost always necessitates defining what the usecase and workflow would be first. There are tools from UI-based to command-line-based.

Tools: Avora, FiveTran, Informatica, Excel, Alteryx, Datameer, Tableau Prep, etc. ... list could go on and on.

If I need to combine a bunch of files in a 'union' or 'concat' type fashion I will usually use UNIX tools or perl. Ditto for dropping or re-arranging columns or filtering out entries. The syntax is usually very terse and for simple tasks isn't too bad in terms of readability/

A thread like this would not be complete witho7na shoutout to Monarch.

It’s exciting to see new tooling coming out, depending on your need, watching a video or two about how Monarch runs could be helpful in assessing what tool is for you.

data.table has both an R and Python version. The python is version getting pretty mature now.

Munging and Merging pretty large datasets using data.table is very efficient and intuitive after a bit of practice.

The vignette in R is very good. Datacamp has a course (I think little dated). Also a search for videos by Matt Dowle and Arun Srinivasan can give a very good quick idea of data.table's capabilities. I think both are involved in some capacity with h2o.ai which seems to be pretty good and on my to-do list.

Desktop coding using Rstudio is a wonderful experience with the R version.

Technical and free: Airflow

Technical and has a fee: dbt, prefect (while prefect is free, prefect cloud is not)

Non-technical and has a fee: datacoral.ai, ascend.io

You can solve your problem with any of these solutions, and many others not mentioned here.

I used KNIME at a previous job. Open source desktop application is free. Server is paid. Competes with other workflow based GUI tools such as Alteryx. I liked it a lot.

Been trying https://www.matillion.com. It’s nice.

cat | cut -d | sort | uniq or when in doubt, just write a few lines of perl.

R+dplyr and trifacta (GCP Dataprep)


to expand on my reply, I use to be a full on python / pandas advocate, but in comparison to alteryx it's slow and the best is that you can get business users to understand alteryx enough to provide value on the data cleansing / merging part

Problems with Alteryx come when you need to share a workflow with other people, or need any functionality outside of the basic tools, or need a loop, or need good documentation, or your internet is down which completely blocks this otherwise-desktop application, or need compatibility with an older version of Alteryx...

I could go on but my point is that beyond the most basic tasks the functionality of Alteryx is limited, the user experience is horrendous and the price isn't justified whatsoever.

The internet being down doesn't block Alteryx at all. I routinely use it without internet. Sharing of workflows hasn't been an issue either.

Understood on the loop part. You have to think on terms of vectorization for that, or use what they call batch macros.

The connection would prevent it from starting up for me because it verifies the license each time you start the program.

Neither vectorisation, nor batch macros (nor iterative macros) allow you to run the same set of commands until a condition is met in Alteryx.

It verifies the license against a local cache. The remote licensing check only occurs upon activating a license. Once activated, there is a local cached copy it uses that I happen to know about because I accidentally deleted it once. I start up Alteryx in an air-gapped environment on a routine basis. I wouldn't use the tool if it had to check a license server every time it started.

You're incorrect on your last statement. I'm a coder, and don't enjoy using Alteryx for more iterative/recursive operations, but I've built several macros that leverage loops and DO run the same set of commands until a condition is met. There are examples of this online.

Typically, when I run into more loop oriented operations, I use the python or R tool to write a snippet of code to perform the operation within the workflow.

> There are examples of this online.

Can you link to any of them? I wasn't able to find anything that actually works when I needed to do this a while ago. (I needed to repeatedly hit a REST API until it returned no results.)



Let me know if that doesn't cover your use case.

BTW, if you're hitting a REST API, you might want to try using the Python tool (it's an embedded Jupyter interface that will default to just running the pure python code when the workflow is executed without a user configuring it), just because I prefer using requests library over using the Alteryx download/curl tool myself.

Yep, I've seen the link. Never managed to get it to run "until the iteration output stream does not receive any more records" though, nor did I see that documented anywhere.

> you might want to try using the Python tool

Yep, I just end up using Python outside of Alteryx altogether.

Yeah, it all comes down to your needs and the use case. If there is a lot of value in allowing non-coders to leverage your work, and more importantly, modify pieces of it, then a tool like Alteryx is worth it. However, if you can just write a Python script, and aren't constantly being hassled to update it for whatever data source or new need, then just go that route. Otherwise, if you don't empower the non-tech users, eventually you'll produce enough stuff that you'll end up spending an increasing percentage of your time maintaining and dealing with your own products. For me, that's been the biggest benefit. Nothing better than telling somebody to just switch the input on a workflow to a db now that the file isn't being sent directly, instead of having to do it for them.

Would you say that the main reason to use a software instead of code is to onboard non-coding business partners?

edit: After looking at the website, I can't help but feel that they are surfing on the Asteryx clothing brand to overprice their products...

> Would you say that the main reason to use a software instead of code is to onboard non-coding business partners

That is usually a key selling point of GUI based ETL solutions. Relatively non-technical users, that know little of coding but do understand their data and what they want out of it, can put together basic (and sometimes not so basic) data flows and debug them.

I'm a user of Alteryx and had never heard of Asteryx. haha.

Alteryx is definitely high-priced. It's enterprise software and priced that way. Which sucks if you're on a budget.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact