
Ask HN: What tools do you use for data munging and merging? - hellectronic
Hello,

What tools do you use or know about, e.g. something like Talend Data Preparation (https://www.talend.com/products/data-preparation)?

I found OpenRefine (http://openrefine.org), which is browser based like Talend Data Preparation.

It would be nice to have a "true" desktop application :)
======
hn12345
Looks like you're looking for ETL solutions. It's funny reading some of the
replies here; you can tell who's coming from a more BI background versus a
software engineering background.

Anyway, I think Alteryx does this job really well. It's great for ETL that
the average Joe can pick up and learn very quickly. Sure, the UX could be
improved, but it's miles better than AWS' Data Pipeline and other tools. It is
a bit costly, like another user mentioned, but well worth it IMO. Tableau has
introduced some new products over the past year or two to compete in this
space as well, so if you use Tableau for reporting, look into their new
offerings. Dell introduced an ETL tool a few years ago called Boomi. It showed
some promise and might be something to consider. I have no idea what it
costs, though. Another option is SSIS if the data is eventually going to live
in SQL Server or some Microsoft database.

Ultimately, I would consider options based on your target
database/environment.

~~~
mipmap04
I totally agree with your first sentence. As a former enterprise data
warehouse architect, I find it interesting to see some of the responses.
Surprised I'm not seeing more Informatica and SSIS. I'll also say SSIS is a
good solution for nearly any data storage platform, and it's included in a
SQL Server license, so it might be a good, cheap enterprise solution if you
already have SQL Server.

~~~
kmerrol
In my experience, SSIS suffers mightily from the Write-Once, Read-Only
problem, as well as huge manual click-fests when package metadata changes.
That, and a lack of development from MSFT with an obvious push towards ADF
(Azure Data Factory), means SSIS is not really great for modern data
architecture.

If you're a MSFT shop then sure, and if you're willing to drop the
box-and-line GUI and move to .NET assembly-based packages, even more so. But
I've spent enough of my life keeping SSIS running, and I'm not going back.

------
cube2222
A project I'm using, which I'm also one of the authors of, is OctoSQL[1].

It allows you to transform and join data from different data sources
(including databases like MySQL, PostgreSQL and Redis, as well as CSV or JSON
files, with more to come) using a familiar interface: plain SQL.

Other than that, I like Exploratory[2].

And for nested JSON data I use another project of mine, jql[3]: an
easier-to-use (in my opinion) lispy alternative to jq. (Originally inspired
by a Hacker News comment, actually.)

[1]:[https://github.com/cube2222/octosql](https://github.com/cube2222/octosql)

[2]:[https://exploratory.io](https://exploratory.io)

[3]:[https://github.com/cube2222/jql](https://github.com/cube2222/jql)

~~~
hatmatrix
Interesting project, and I like the name. It seems that the underlying model
for each data source has to be relational, though (even for JSON files). Is
that a fundamental limitation of using SQL as the glue language, or are there
plans to integrate hierarchical models of data in some form as well?

~~~
cube2222
If you mean hierarchical data querying (like nested json, or json in sql
rows), then that's very high on our radar.

------
westonsankey
I try to avoid GUI-based tools, as they generally don't lend themselves well
to version control, code reuse, etc. There are several orchestration tools
that allow you to create arbitrarily complex ETL pipelines using SQL, Python,
Spark, etc. Here are links to a few, with a minimal Airflow sketch after the
list:

[http://airflow.apache.org/](http://airflow.apache.org/)

[https://www.prefect.io/](https://www.prefect.io/)

[https://dagster.readthedocs.io/en/0.6.7/](https://dagster.readthedocs.io/en/0.6.7/)

[https://www.getdbt.com/](https://www.getdbt.com/)
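
To give a flavor of the code-first approach, here is a minimal sketch of an
Airflow DAG with two dependent Python tasks (the DAG id, task names and
callables are made up for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical extract/load steps; in practice these would call your
    # own code (pandas, SQL, Spark, etc.).
    def extract():
        print("pull data from the source")

    def load():
        print("write cleaned data to the target")

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task  # load runs only after extract succeeds

Because the pipeline is plain Python, it diffs and version-controls like any
other code.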

~~~
eyeball
dbt looks awesome. Too bad there's no Oracle support.

~~~
drewbanin
Hi there - I run product for dbt - happy to chat more about Oracle support if
you're interested. There are a handful of community-supported dbt plugins out
in the wild. Oracle could very well be one of the next ones. Check some of
these out to see what's involved in building a database plugin:

\- [https://github.com/fishtown-analytics/dbt-spark](https://github.com/fishtown-analytics/dbt-spark)

\- [https://github.com/fishtown-analytics/dbt-presto](https://github.com/fishtown-analytics/dbt-presto)

\- [https://github.com/mikaelene/dbt-sqlserver](https://github.com/mikaelene/dbt-sqlserver)

\- [https://github.com/jacobm001/dbt-mssql](https://github.com/jacobm001/dbt-mssql)

~~~
eyeball
Cool. I would be an interested user, but building a plugin is... beyond my
team's ability.

------
TheTank
QuestDB ([https://www.questdb.io](https://www.questdb.io)) might help. Import
your file(s) by drag and drop directly into the web console. Once data is
loaded, you can run SQL queries in the console to order, format, concatenate,
join (including time-based joins), etc. Once finished, you can export your
formatted results back to CSV. Also, it's pretty quick!
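
If you'd rather script the same steps than use the web console, queries can
also be sent over QuestDB's HTTP endpoint; a rough Python sketch (the table
name is made up, and a local instance on the default port 9000 is assumed):

    import requests

    # Run a SQL query against a local QuestDB instance over its HTTP API.
    # /exec and port 9000 are the defaults; adjust for your setup.
    resp = requests.get(
        "http://localhost:9000/exec",
        params={"query": "SELECT * FROM trades LIMIT 10"},
    )
    print(resp.json())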

------
xn
Visidata ([http://visidata.org/](http://visidata.org/)) is my tool of choice
these days.

~~~
ysr23
Yup, lots of love for VisiData: a nice, quick way to eyeball data and then
munge it ([https://visidata.org/docs/join/](https://visidata.org/docs/join/))
before proper carpentry with pandas.

------
wodenokoto
I really like R / Tidyverse, but it requires that you have more memory than
your data size (especially if you are using long, narrow dataframes, as
recommended by the tidyverse), and it also requires you to code instead of
using a UI.

~~~
stuxnet79
Is there really no way of processing dataframes by chunks in the R ecosystem?
I'm firmly in the Python camp but I wonder how much I'm missing out on in the
R world.
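
For reference, the chunked pattern I'd reach for on the pandas side looks
something like this (the file name and column are made up):

    import pandas as pd

    # Process a CSV too large for memory in fixed-size chunks,
    # keeping only a running aggregate in memory.
    total = 0
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        total += chunk["amount"].sum()  # hypothetical numeric column
    print(total)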

~~~
deadcaribou
There are many ways to process larger than RAM objects in R.

The dbplyr package makes it possible to use database tables as if they were
in-memory data frames
([https://dbplyr.tidyverse.org/](https://dbplyr.tidyverse.org/)).

[https://diskframe.com/](https://diskframe.com/) is also a pretty great
solution that processes data by chunks.

------
2data222
We use Talend. It does have a gigantic Java-based Windows desktop application.
It's pretty powerful and I don't hate it. We looked at MuleSoft as well, but
it's not ready for Enterprise prime-time like we need it to be.

The data preparation Wikipedia page mentions Talend by name along with Paxata,
Trifacta, Alteryx, and Ataccama.

~~~
JPKab
Alteryx is very good, especially if you want to hand off maintenance and
operation of data-processing flows to non-coders.

------
hprotagonist
The python ecosystem is really good here.

I do a lot of exploratory coding in ipython, though the threshold for "switch
to a real editor and run git init and poetry new" is pretty low.

Want to munge CSV? stdlib, or pandas
([https://pandas.pydata.org/pandas-docs/stable/](https://pandas.pydata.org/pandas-docs/stable/))

Want to munge JSON? stdlib.

Want to munge a database? pandas, stdlib, anything that speaks sqlalchemy

Want to validate your json/sql/CSV/whatever and have it come in as a bunch of
structured classes, not std types? jsonschema
[https://pypi.org/project/jsonschema/](https://pypi.org/project/jsonschema/),
attrs+marshmallow, attrs+cattrs.
[http://www.attrs.org/en/stable/](http://www.attrs.org/en/stable/),
[https://desert.readthedocs.io/en/latest/](https://desert.readthedocs.io/en/latest/),
[https://marshmallow.readthedocs.io/en/stable/](https://marshmallow.readthedocs.io/en/stable/),
[https://github.com/Tinche/cattrs](https://github.com/Tinche/cattrs)

Want to transform csv into sqlite?
[https://github.com/simonw/csvs-to-sqlite](https://github.com/simonw/csvs-to-sqlite)

Want to serve sqlite as a REST api?
[https://datasette.readthedocs.io/en/stable/](https://datasette.readthedocs.io/en/stable/)

Want to stuff simple things into a database really fast?
[https://dataset.readthedocs.io/en/latest/](https://dataset.readthedocs.io/en/latest/)

Want to flip numerical data around? NumPy:
[https://docs.scipy.org/doc/numpy/reference/](https://docs.scipy.org/doc/numpy/reference/)

Want to model it? SciPy
[https://www.scipy.org/docs.html](https://www.scipy.org/docs.html), pandas

Want to plot it? seaborn
[https://seaborn.pydata.org/](https://seaborn.pydata.org/), plotnine
[https://plotnine.readthedocs.io/en/stable/](https://plotnine.readthedocs.io/en/stable/)

Want to futz around with structured data in a sensible way? glom
[https://glom.readthedocs.io/en/latest/](https://glom.readthedocs.io/en/latest/),
python-lenses
[https://github.com/ingolemo/python-lenses](https://github.com/ingolemo/python-lenses).

Want to spit out tabular data? tabulate
[https://github.com/astanin/python-tabulate](https://github.com/astanin/python-tabulate)

Want to figure out where not to eat in Chicago? Built-ins: Counter,
defaultdict, comprehensions:
[https://www.youtube.com/watch?v=lyDLAutA88s](https://www.youtube.com/watch?v=lyDLAutA88s)

There's a LOT you can do pretty fast, and I can more or less hammer out a
basic ETL CLI script in my sleep at this point.
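
As a concrete (if contrived) example, a basic CSV-to-SQLite script in that
style might look like this (file, column, and table names are made up):

    import sqlite3

    import pandas as pd

    # Extract: read the raw CSV (path is hypothetical).
    df = pd.read_csv("raw_orders.csv")

    # Transform: normalize a text column and drop rows missing a key field.
    df["customer"] = df["customer"].str.strip().str.lower()
    df = df.dropna(subset=["order_id"])

    # Load: write the cleaned frame into a local SQLite table.
    with sqlite3.connect("orders.db") as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)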

~~~
billfruit
I think it is not that rich when dealing with binary data.

~~~
j88439h84
Sure it is.
[https://construct.readthedocs.io/](https://construct.readthedocs.io/)
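
For instance, a length-prefixed binary record can be declared and parsed
along these lines (the field layout is invented for illustration):

    from construct import Bytes, Int16ub, Struct, this

    # A 2-byte big-endian length followed by that many bytes of payload.
    record = Struct(
        "length" / Int16ub,
        "payload" / Bytes(this.length),
    )

    parsed = record.parse(b"\x00\x05hello")
    print(parsed.length, parsed.payload)  # 5 b'hello'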

------
davedx
I'm working in this area. I've been doing a lot of "exploratory coding" that
has evolved into something I find quite powerful: a modern spreadsheet
application with features that I think make it suited to the current time.
(Summary at [https://lightsheets.app](https://lightsheets.app))

At the moment I'm working on improving performance. I can already load 1
million rows into it without too much trouble, and the next step is to load
some Kaggle-size CSVs (e.g. 5.6 GB) and then be able to run data cleaning or
other transforms on them without it choking.

If anyone's interested in this kind of stuff feel free to drop me an email (in
my profile)!

------
cmollis
Usually Spark, but it depends on what the source data is and whether it needs
more preprocessing, etc. For JSON, there are some good command-line tools
available for stripping and flattening, if you need to do that. Spark can
usually handle most of the data sources.

------
clavalle
Sublime Text isn't as fully featured as the Talend tools, but it does what it
does well and quickly, and it stays out of my way. Great for loading up a
text file, scrolling around, and doing quick find/replaces.

Shell scripts using mostly sed and awk go a very, very long way when I want
something repeatable or I'm dealing with a lot of data. And, when that starts
getting heavy or long and I need more structure, python.

GUIs just get in the way in this space. By necessity, they need to be
opinionated because there's only so much screen to get around in. They also
tend to create artifacts that aren't quick to understand later and are locked
into some particular product or way of doing things.

------
xs83
SQL, Python tools such as Pandas, Rapids.ai (if you have a recent Nvidia GPU)
and Talend are all good choices. If it is a small amount of data you can just
use Python or Bash to pretty quickly accomplish what you need.

------
zzo38computer
I use SQLite myself, although I would like those who provide the data to make
it available in a format such as CSV or TSV, or as a SQLite database, or to
have some sort of protocol that a SQLite virtual table module can be made to
access, so that it is possible to get at the data without having to deal with
complicated stuff in a web browser or whatever else it may be, which
sometimes is not even compatible with my computer. SQLite is common on all
computers, I think, and CSV or TSV is simple enough to work with many
programs, so that should do.
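
For example, a CSV can be pulled into a SQLite table with just the Python
standard library (file, table, and column names are made up):

    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")

    # Load rows from a CSV whose header matches the columns above.
    with open("people.csv", newline="") as f:
        rows = [(r["name"], int(r["age"])) for r in csv.DictReader(f)]

    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    conn.commit()
    conn.close()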

------
hermitcrab
I have recently released a Windows/Mac tool for data munging:
[https://www.easydatatransform.com](https://www.easydatatransform.com)

It is a GUI-based solution with an emphasis on ease of use, aimed at people
who aren't coders or data science professionals: for example, a marketer who
has several different Excel and CSV files they need to merge, clean, filter
and restructure to create a report. Any feedback would be very welcome.

------
aww_dang
Shell scripts and simple Java applications. Not interested in
navigating/learning a GUI which may or may not work for me.

Downloading everything locally isn't desirable for me.

------
truculent
[https://www.getdbt.com/](https://www.getdbt.com/) is excellent if your users
are SQL-friendly.

------
RocketSyntax
At this point, why wouldn't you use Excel?

What use cases do you have that go beyond Excel?

I say this as someone who develops pipelines in notebooks and workflow
frameworks.

Snowflake.

------
hkchad
It's been a few years, but I used to use the Altova suite, specifically
MapForce ([https://www.altova.com/mapforce](https://www.altova.com/mapforce)),
DiffDog, etc. Like I said, it's been a few years, but it was super powerful.
This was back in my XML days, when building stylesheets was a big part of my
job.

------
jhoechtl
Octosql

[https://github.com/cube2222/octosql](https://github.com/cube2222/octosql)

~~~
rhombocombus
The installation command from the docs in the readme is broken!

~~~
cube2222
Just to put it here on HN: the installation command is corrected now. You
need to have Go modules turned on to use go get for installation.

------
Xophmeister
Structured data rarely lends itself to a GUI paradigm because its structure is
arbitrary. I would argue that it would only fit with a GUI if it represents
something specific that said GUI explicitly models (e.g., a very specific lump
of JSON could be interpreted as a tree, etc.). As such, command line and
scripting tools are vastly superior.

~~~
psv1
The GUI model works absolutely fine with structured data. In these kinds of
applications, you usually connect different nodes on a canvas, each node
representing a data source, join, transformation, etc., and each connection
representing the flow of data.

~~~
Xophmeister
But those nodes are going to be either relatively limited in their ability,
highly specific, or just expose scripting tools; the linkage between nodes is
trivial. Why waste time connecting all that up visually, when a good shell
will do the job?

~~~
psv1
I agree with this and I also prefer a programming language over a GUI. My
point was that there isn't a limitation of GUI ETL tools that's specific to
structured data.

------
tixocloud
There is such an incredible ecosystem of tools for "data preparation" that it
almost always necessitates defining what the use case and workflow would be
first. There are tools ranging from UI-based to command-line-based.

Tools: Avora, FiveTran, Informatica, Excel, Alteryx, Datameer, Tableau Prep,
etc. ... list could go on and on.

------
Zhyl
If I need to combine a bunch of files in a 'union' or 'concat' type fashion,
I will usually use UNIX tools or perl. Ditto for dropping or re-arranging
columns, or filtering out entries. The syntax is usually very terse, and for
simple tasks it isn't too bad in terms of readability.

------
j45
A thread like this would not be complete without a shoutout to Monarch.

It's exciting to see new tooling coming out. Depending on your need, watching
a video or two about how Monarch works could be helpful in assessing which
tool is for you.

------
santa_boy
data.table has both an R and Python version. The python is version getting
pretty mature now.

Munging and Merging pretty large datasets using data.table is very efficient
and intuitive after a bit of practice.

The vignette in R is very good. Datacamp has a course (I think little dated).
Also a search for videos by Matt Dowle and Arun Srinivasan can give a very
good quick idea of data.table's capabilities. I think both are involved in
some capacity with h2o.ai which seems to be pretty good and on my to-do list.

Desktop coding using Rstudio is a wonderful experience with the R version.
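
To give an idea of the Python version's syntax, a filter-and-aggregate looks
roughly like this (file and column names are invented):

    import datatable as dt
    from datatable import by, f

    # Read a CSV with the fast fread parser (path is hypothetical).
    DT = dt.fread("sales.csv")

    # Keep positive amounts, then sum them per city.
    result = DT[f.amount > 0, :][:, dt.sum(f.amount), by(f.city)]
    print(result)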

------
iblaine
Technical and free: Airflow

Technical and has a fee: dbt, prefect (while prefect is free, prefect cloud is
not)

Non-technical and has a fee: datacoral.ai, ascend.io

You can solve your problem with any of these solutions, and many others not
mentioned here.

------
apohn
I used KNIME at a previous job. The open-source desktop application is free;
the server is paid. It competes with other workflow-based GUI tools such as
Alteryx. I liked it a lot.

------
eyeball
Been trying [https://www.matillion.com](https://www.matillion.com). It’s nice.

------
anon176
cat file.csv | cut -d, -f1 | sort | uniq, or when in doubt, just write a few
lines of perl.

------
mikecb
R+dplyr and trifacta (GCP Dataprep)

------
mgaitan09
Alteryx

~~~
mgaitan09
To expand on my reply: I used to be a full-on Python / pandas advocate, but
in comparison to Alteryx it's slow, and the best part is that you can get
business users to understand Alteryx enough to provide value on the data
cleansing / merging side.

~~~
131012
Would you say that the main reason to use software instead of code is to
onboard non-coding business partners?

edit: After looking at the website, I can't help but feel that they are
surfing on the Asteryx clothing brand to overprice their products...

~~~
dspillett
_> Would you say that the main reason to use software instead of code is to
onboard non-coding business partners_

That is usually a key selling point of GUI-based ETL solutions. Relatively
non-technical users, who know little of coding but do understand their data
and what they want out of it, can put together basic (and sometimes not so
basic) data flows and debug them.

