what tools do you use or know about, e.g. something like Talend DataPreparation (https://www.talend.com/products/data-preparation) ?
I found OpenRefine (http://openrefine.org), which is browser based like Talend DataPreparation.
It would be nice to have a "true" desktop application :)
Anyways, I think Alteryx does this job really well. It's great for ETL that the average Joe can pick up and learn very quickly. Sure, the UX could be improved, but it's miles better than AWS' Data Pipeline and other tools. It's a bit costly, as another user mentioned, but well worth it IMO.
Tableau has introduced some new products the past year or two to compete in this space as well, so if you use Tableau for reporting, look into their new offerings.
Dell introduced an ETL tool a few years ago called Boomi. It showed some promise and might be worth considering, though I have no idea what it costs. Another option is SSIS if the data is eventually going to live in SQL Server or some other Microsoft database.
Ultimately, I would consider options based on your target database/environment.
If you're a MSFT shop then sure, and even more so if you're willing to drop the box-and-line GUI and move to .NET assembly-based packages, but I've spent enough of my life keeping SSIS running and I'm not going back.
It allows you to transform and join data from different sources (including databases like MySQL, Postgres, and Redis, plus CSV or JSON files, with more to come) using a familiar interface: plain SQL.
Other than that, I like Exploratory.
And for nested JSON data I use another project of mine, jql - an easier to use (in my opinion) lispy alternative to jq. (originally inspired by a hackernews comment actually)
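The plain-SQL-over-mixed-sources idea described above can be sketched with nothing but Python's stdlib `sqlite3`: load each source into an in-memory table, then join with ordinary SQL. This is just an illustration of the concept, not the commenter's tool; the data and table names are made up.

```python
import csv, io, json, sqlite3

# One "source" is CSV text, the other is JSON; both end up as SQLite tables.
users_csv = "id,name\n1,Ada\n2,Grace\n"
orders_json = '[{"user_id": 1, "total": 9.5}, {"user_id": 2, "total": 3.0}]'

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(int(r["id"]), r["name"]) for r in csv.DictReader(io.StringIO(users_csv))],
)
con.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(o["user_id"], o["total"]) for o in json.loads(orders_json)],
)

# Once everything speaks SQL, joining across the original formats is trivial.
rows = con.execute(
    "SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id"
).fetchall()
```

Tools like the one described (or DuckDB, for that matter) do essentially this plumbing for you.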
Great tool with a really active community around it.
It is a GUI based solution with an emphasis on ease of use, aimed at people who aren't coders or data science professionals. For example a marketer who has several different Excel and CSV files they need to merge, clean, filter and restructure to create a report. Any feedback would be very welcome.
The dbplyr package makes it possible to use local db tables as if they were in-memory data frames (https://dbplyr.tidyverse.org/).
https://diskframe.com/ is also a pretty great solution that processes data by chunks.
The data preparation Wikipedia page mentions Talend by name along with Paxata, Trifacta, Alteryx, and Ataccama.
I do a lot of exploratory coding in ipython, though the threshold for "switch to a real editor and run git init and poetry new" is pretty low.
Want to munge CSV? stdlib, or pandas (https://pandas.pydata.org/pandas-docs/stable/)
Want to munge JSON? stdlib.
Want to munge a database? pandas, stdlib, anything that speaks sqlalchemy
Want to validate your JSON/SQL/CSV/whatever and have it come in as a bunch of structured classes, not std types? jsonschema (https://pypi.org/project/jsonschema/), attrs+marshmallow, or attrs+cattrs (http://www.attrs.org/en/stable/, https://marshmallow.readthedocs.io/en/stable/, https://github.com/Tinche/cattrs, https://desert.readthedocs.io/en/latest/)
Want to transform csv into sqlite? https://github.com/simonw/csvs-to-sqlite
Want to serve sqlite as a REST api? https://datasette.readthedocs.io/en/stable/
Want to stuff simple things into a database really fast? https://dataset.readthedocs.io/en/latest/
Want to flip numerical data around? NumPy (https://docs.scipy.org/doc/numpy/reference/)
Want to model it? Scipy https://www.scipy.org/docs.html, pandas
Want to plot it? seaborn https://seaborn.pydata.org/, plotnine https://plotnine.readthedocs.io/en/stable/
Want to futz around with structured data in a sensible way? glom https://glom.readthedocs.io/en/latest/, python-lenses https://github.com/ingolemo/python-lenses.
Want to spit out tabular data? Tabulate https://github.com/astanin/python-tabulate
Want to figure out where not to eat in Chicago? Built-ins: Counter, defaultdict, comprehensions (https://www.youtube.com/watch?v=lyDLAutA88s)
There's a LOT you can do pretty fast, and I can more or less hammer out a basic ETL cli script in my sleep at this point.
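The Counter/defaultdict approach from the talk linked above boils down to something like the following sketch (the inspection data here is made up, just to show the shape of the idiom):

```python
from collections import Counter

# Tally failed health inspections per establishment with a generator
# expression feeding a Counter -- no loops or temp dicts needed.
inspections = [
    ("Sketchy Subs", "fail"), ("Tasty Tacos", "pass"),
    ("Sketchy Subs", "fail"), ("Tasty Tacos", "pass"),
    ("Sketchy Subs", "pass"),
]
fails = Counter(name for name, result in inspections if result == "fail")
worst, n_fails = fails.most_common(1)[0]
```

That's most of a "where not to eat" analysis in five lines of stdlib.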
I do a lot of automation with csv, yaml, openpyxl, and jinja2, wrapped in a simple GUI (Qt, Gtk, or Tk, pick your poison) so my non-programmer colleagues can quickly run it and get a PDF report, generated with reportlab.
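The csv-to-templated-report core of that pipeline can be sketched with the stdlib alone, using `string.Template` as a stand-in for jinja2 (the column names and template are invented; the real version would render a jinja2 template and hand the result to reportlab):

```python
import csv, io
from string import Template  # stand-in for jinja2 in this sketch

# Render one line of report text per CSV row.
raw = "name,hours\nalice,38\nbob,41\n"
line = Template("$name worked $hours h")
report = "\n".join(
    line.substitute(row) for row in csv.DictReader(io.StringIO(raw))
)
```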
I also think that __str__ and __repr__ behavior on bytes is misleading at best. I do not like getting ascii values for some bytes and not others... and iterating to get base-10 ints!?
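For anyone who hasn't hit this, the behavior being complained about is easy to reproduce: printable bytes render as ASCII, the rest as `\x` escapes, and iterating yields base-10 ints rather than one-byte values.

```python
# repr() mixes ASCII characters with \x escapes in the same literal,
# and iteration produces ints, not length-1 bytes.
b = bytes([72, 105, 0, 255])
r = repr(b)     # mixed ASCII and escapes
ints = list(b)  # base-10 ints
```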
Pillow[simd], soundfile, and imageio can do nice things for image and sound data.
At the moment I'm working on improving performance. I can already load 1 million rows without too much trouble; the next step is to load some Kaggle-size CSVs (e.g. 5.6 GB) and then run data cleaning or other transforms on them without it choking.
If anyone's interested in this kind of stuff feel free to drop me an email (in my profile)!
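One common way to keep big CSVs from choking a tool is to stream or chunk the file so memory stays flat regardless of file size (pandas exposes the same idea as `read_csv(..., chunksize=...)`). A minimal stdlib sketch, with an invented `value` column:

```python
import csv, io

def clean_stream(lines):
    """Yield cleaned rows one at a time; the file is never held in memory."""
    for row in csv.DictReader(lines):
        if row["value"]:          # example transform: drop rows missing a value
            yield row

# io.StringIO stands in for an arbitrarily large file handle here.
rows = list(clean_stream(io.StringIO("id,value\n1,a\n2,\n3,c\n")))
```

In real use you would write each cleaned row straight to the output sink instead of collecting them in a list.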
Shell scripts using mostly sed and awk go a very, very long way when I want something repeatable or I'm dealing with a lot of data. And, when that starts getting heavy or long and I need more structure, python.
GUIs just get in the way in this space. By necessity, they need to be opinionated because there's only so much screen to get around in. They also tend to create artifacts that aren't quick to understand later and are locked into some particular product or way of doing things.
Downloading everything locally isn't desirable for me.
What use cases do you have that go beyond Excel?
I say this as someone that develops pipelines in notebooks and workflow frameworks.
I really can't tell what this means.
Tools: Avora, FiveTran, Informatica, Excel, Alteryx, Datameer, Tableau Prep, etc. ... list could go on and on.
Munging and merging pretty large datasets using data.table is very efficient and intuitive after a bit of practice.
The vignette in R is very good. Datacamp has a course (a little dated, I think). Searching for videos by Matt Dowle and Arun Srinivasan also gives a good quick sense of data.table's capabilities. I think both are involved in some capacity with h2o.ai, which seems pretty good and is on my to-do list.
Desktop coding using Rstudio is a wonderful experience with the R version.
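For readers on the Python side of the thread, the kind of one-expression munge data.table is praised for, e.g. `dt[value > 0, .(total = sum(value)), by = group]`, looks roughly like this in pandas (column names invented for illustration):

```python
import pandas as pd

# Filter, group, and aggregate in one chained expression.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, -2, 3]})
totals = (
    df[df["value"] > 0]
    .groupby("group", as_index=False)["value"]
    .sum()
    .rename(columns={"value": "total"})
)
```

data.table's version is terser and typically faster on large tables, which is much of its appeal.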
It’s exciting to see new tooling coming out. Depending on your needs, watching a video or two about how Monarch works could help you assess which tool is right for you.
Technical and has a fee: dbt, prefect (while prefect is free, prefect cloud is not)
Non-technical and has a fee: datacoral.ai, ascend.io
You can solve your problem with any of these solutions, and many others not mentioned here.
I could go on but my point is that beyond the most basic tasks the functionality of Alteryx is limited, the user experience is horrendous and the price isn't justified whatsoever.
Understood on the loop part. You have to think in terms of vectorization for that, or use what they call batch macros.
Neither vectorisation nor batch macros (nor iterative macros) lets you run the same set of commands until a condition is met in Alteryx.
You're incorrect on your last statement. I'm a coder, and don't enjoy using Alteryx for more iterative/recursive operations, but I've built several macros that leverage loops and DO run the same set of commands until a condition is met. There are examples of this online.
Typically, when I run into more loop oriented operations, I use the python or R tool to write a snippet of code to perform the operation within the workflow.
Can you link to any of them? I wasn't able to find anything that actually works when I needed to do this a while ago. (I needed to repeatedly hit a REST API until it returned no results.)
Let me know if that doesn't cover your use case.
BTW, if you're hitting a REST API, you might want to try the Python tool (it's an embedded Jupyter interface that defaults to just running the pure Python code when the workflow executes without a user configuring it), if only because I prefer the requests library over Alteryx's download/curl tool.
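Independent of Alteryx, the "call until it returns no results" pattern discussed above is just a small loop. A sketch with the fetcher injected as a callable, so a real `requests.get`-based function (or a test stub, as here) can be swapped in:

```python
def fetch_all(fetch_page):
    """Accumulate results by calling fetch_page(page) until it returns []."""
    results, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:         # empty page -> the API is exhausted
            return results
        results.extend(batch)
        page += 1

# Fake two-page API standing in for the real REST endpoint:
pages = {1: ["a", "b"], 2: ["c"]}
items = fetch_all(lambda p: pages.get(p, []))
```

In production you'd also want a timeout and retry/backoff around the real HTTP call.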
> you might want to try using the Python tool
Yep, I just end up using Python outside of Alteryx altogether.
edit: After looking at the website, I can't help but feel that they are surfing on the Asteryx clothing brand to overprice their products...
That is usually a key selling point of GUI-based ETL solutions. Relatively non-technical users, who know little of coding but do understand their data and what they want out of it, can put together basic (and sometimes not so basic) data flows and debug them.
Alteryx is definitely high-priced. It's enterprise software and priced that way. Which sucks if you're on a budget.