
OpenRefine – free, open source, powerful tool for working with messy data - joubert
https://openrefine.org
======
vcdimension
You can do similar stuff using the visidata command line tool:
[https://www.visidata.org/](https://www.visidata.org/)

You can use Python code for more advanced data manipulation and for creating
plugins.

~~~
rasmusei
Wow, that is a nice little tool. Just installed it and tested on some random
files in my current data analysis project.

By the way, I installed it using pipx
[https://github.com/pipxproject/pipx](https://github.com/pipxproject/pipx) by
running `pipx install visidata`. To also read HDF and Excel files, I added the
necessary packages by running `pipx inject visidata h5py openpyxl`.

~~~
thadguidry
Thanks @rasmusei! If you are a data scientist, you might also be interested in
how to use OpenRefine alongside Jupyter. Our community has some documentation
about that on our wiki here:
[https://github.com/OpenRefine/OpenRefine/wiki/Jupyter](https://github.com/OpenRefine/OpenRefine/wiki/Jupyter)

------
cstuder
Formerly known as Google Refine.

The history of the rename can be found in the blog:
[https://openrefine.org/blog/2013/10/12/openrefine-history.ht...](https://openrefine.org/blog/2013/10/12/openrefine-history.html)

~~~
riedel
And before that it was known as Freebase Gridworks, as the article you linked
also mentions.

------
Chris2048
Anyone used this before?

My own experience: I had a _lot_ of data to process, which I thought was the
use case for a tool like this, but it took a _long_ time, and it seemed to
have to process all of the data just to ingest it properly.

What underlying storage/tech is used? Is it all just web-stack?

~~~
thadguidry
Our current architecture is here:
[https://github.com/OpenRefine/OpenRefine/wiki/Architecture](https://github.com/OpenRefine/OpenRefine/wiki/Architecture)

YES! We want users to be able to process much larger data sets too! We have
started experimenting with Apache Spark on the backend, in the hope that it
will help users with much larger datasets. This work is being funded by CZI,
and you can read the grant proposal here:
[http://openrefine.org/blog/2019/11/14/czi-eoss.html](http://openrefine.org/blog/2019/11/14/czi-eoss.html)

~~~
Chris2048
Before you go down the Spark route, consider that Perl/Unix tools may do this
kind of thing quite well:
[https://livefreeordichotomize.com/2019/06/04/using_awk_and_r...](https://livefreeordichotomize.com/2019/06/04/using_awk_and_r_to_parse_25tb/)
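
A minimal sketch of that streaming style, written in Python rather than awk or
Perl. It is an illustration only, not taken from the linked post, and the
names `clean_rows` and `main` are made up for this example. The idea is to
read one row at a time, normalize it, and write it straight back out, so
memory use does not grow with the size of the file:

```python
import csv
import sys


def clean_rows(lines):
    """Yield cleaned CSV rows one at a time from an iterable of lines."""
    for row in csv.reader(lines):
        # Collapse runs of whitespace and trim each cell.
        cleaned = [" ".join(cell.split()) for cell in row]
        # Skip rows in which every cell is empty.
        if any(cleaned):
            yield cleaned


def main(src=sys.stdin, dst=sys.stdout):
    """Stream rows from src to dst without loading the whole file."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in clean_rows(src):
        writer.writerow(row)
```

Calling `main()` at the bottom of a script (here a hypothetical `clean.py`)
lets it sit in a shell pipeline, e.g. `python clean.py < messy.csv > clean.csv`.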

~~~
thadguidry
That author did not have Spark tuned well for the use case. This is a common
issue with Spark. Since OpenRefine is commonly used with strings, we plan to
optimize for that in many areas, such as a few mentioned here:
[https://databricks.com/glossary/spark-tuning](https://databricks.com/glossary/spark-tuning)
But in general, there are always tradeoffs when trying to provide immediate
feedback for interactions. Since OpenRefine has many interactive features,
some will need to support batching and advise the user in the interface that
things will take longer ("do you want to send this to batch?"). Some of the
tradeoffs and the ways we plan to address them are mentioned in our general
OpenRefine on Spark issue here:
[https://github.com/OpenRefine/OpenRefine/issues/1433](https://github.com/OpenRefine/OpenRefine/issues/1433)

------
easygenes
I have used and loved this since it was a project from MIT CSAIL SIMILE (circa
2006).

~~~
easygenes
Follow-up: Looking over the old SIMILE site, I couldn't find the original
project. David Huynh doesn't mention it on his own website either, but some
searching turned up the original project, "Parallax":

[https://books.google.com/books?id=Y_FZPtpgntwC&pg=PA36](https://books.google.com/books?id=Y_FZPtpgntwC&pg=PA36)

More from the era:
[https://blog.jonudell.net/2008/08/25/motivating-people-to-wr...](https://blog.jonudell.net/2008/08/25/motivating-people-to-write-the-semantic-web-a-conversation-with-david-huynh-about-parallax/)

~~~
thadguidry
The SIMILE library is used in OpenRefine for certain faceting, like timeline,
clustering, etc. Parallax was created by David to show how time series data
visualizations could be enhanced. David was one of the original designers of
OpenRefine, and I worked closely with him and Stefano in testing it.

------
canada_dry
Reminded me a bit of the 'data wrangler' tool from Stanford. It was* a
fantastic tool for dealing with messy data.

[http://vis.stanford.edu/wrangler/](http://vis.stanford.edu/wrangler/)

*It's now a commercial product maintained by Trifacta
([https://www.trifacta.com/start-wrangling/](https://www.trifacta.com/start-wrangling/))

------
chrisweekly
See also [https://lnav.org](https://lnav.org)

~~~
aembleton
That is really interesting but I don't see what it has to do with OpenRefine.

Thanks for the link though.

~~~
chrisweekly
Was "really interesting" sarcastic? If you found lnav to be interesting, I
don't understand how you'd fail to see how it's relevant.

lnav is a mini ETL tool, which, like OpenRefine, aids in transforming data
from various formats to make it more useful. They're in the same space.

