

How Akka Streams can be used to process the Wikidata dump in parallel - ArturSoler
http://engineering.intenthq.com/2015/06/wikidata-akka-streams/

======
mtrn
On a related note: when I indexed the whole English Wikipedia last year, I was
surprised that it was possible to have a JSON version of it indexed[1] and
searchable within half an hour on my laptop.

[1] Using a parallel bulk indexer for ES:
[https://github.com/miku/esbulk](https://github.com/miku/esbulk)
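
esbulk itself is written in Go, but the core trick — batch newline-delimited JSON and fan the batches out to parallel workers — is easy to sketch in Python. This is only an illustration of the batching idea, not esbulk's actual implementation; the `index_batch` stub stands in for a real Elasticsearch `_bulk` request:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batches(lines, size=1000):
    """Yield lists of up to `size` parsed JSON documents."""
    it = iter(lines)
    while chunk := list(islice(it, size)):
        yield [json.loads(line) for line in chunk]

def index_batch(docs):
    # A real indexer would POST these docs as a single _bulk
    # request to Elasticsearch; here we just count them.
    return len(docs)

def bulk_index(lines, workers=4):
    """Feed batches to a pool of workers; return total docs indexed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_batch, batches(lines)))
```

The batching matters: one HTTP round trip per document is what makes naive indexers slow.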

~~~
mhuffman
How about ~17 minutes (including Wikipedia data download and extraction time)!
Using json-wikipedia[1] and lbzip2.

[1] [https://github.com/diegoceccarelli/json-wikipedia](https://github.com/diegoceccarelli/json-wikipedia)
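
For comparison, here is a minimal Python sketch of streaming records out of a compressed dump without ever materializing it on disk. Unlike lbzip2, the stdlib `bz2` module decompresses on a single core, and the newline-delimited-JSON layout is an assumption about the export format:

```python
import bz2
import json

def stream_articles(path):
    """Yield one parsed JSON document per non-empty line of a
    .bz2 file, decompressing incrementally as we read."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Swapping in a parallel decompressor (e.g. piping `lbzip2 -d -c` into the process) is what buys the wall-clock speedup the numbers above suggest.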

~~~
mtrn
Thanks, JSON exports make wikipedia data much more approachable.

------
frik
> Process the whole Wikidata in 7 minutes with your laptop

Wikidata is several orders of magnitude smaller than Freebase (which Google
closed in May), and Freebase wouldn't fit in your laptop's RAM.

~~~
thibaut_barrere
What are your favorite large, publicly available datasets?

~~~
Smerity
Biased reply (I'm a data scientist there): Common Crawl[1]. We build and
maintain an open repository of web crawl data that can be accessed and
analyzed by anyone completely free.

[1]: [http://commoncrawl.org/](http://commoncrawl.org/)

------
cristianpascu
From their video. The presenter: "Why would you (the assistant) be
interested in cars?" The assistant: "I'm the perfect chick to be into
Maserati."

It's a bit disturbing to see an employee presenting her personal life, kids,
interests, and whatnot. Good job, IntentHQ!

The video: [https://www.intenthq.com/resources/interest-fingerprint/](https://www.intenthq.com/resources/interest-fingerprint/)

~~~
laumars
I wouldn't say it was disturbing, but it was definitely cringeworthy. A lot
of their blog feels that way. They've gone for an informal corporate approach,
using puns[1] and memes[2] as headings. Even that video felt badly scripted:
it was meant to sound like an informal pub conversation, but instead it came
off as awkward and unprofessional.

I'm sure their products are of the highest quality, but their blog isn't a
great advert in my opinion.

[1] [http://engineering.intenthq.com/2015/06/for-those-about-to-c...](http://engineering.intenthq.com/2015/06/for-those-about-to-code-we-salute-you/)

[2] [http://engineering.intenthq.com/2015/06/wikidata-akka-stream...](http://engineering.intenthq.com/2015/06/wikidata-akka-streams/#paralellise-all-the-things)

~~~
cristianpascu
I have to say I have mixed feelings about the video. On one hand, I understand
there's a whole world of people out there, and I don't mind openness and
honesty. Big thumbs up to her for being honest and cool. On the other hand, it
cuts the other way when you're bragging about your awesome product that
analyzes people's lives and sells that info to corporations.

~~~
laumars
You're assuming those details aren't made up ;)

The privacy thing didn't really bother me, because either the data is fake or
she consented to publishing real data about herself - either way it's a
considered decision. My issue was just how awkwardly the presentation was
delivered. Maybe that could have been resolved by using a fictional
character like Homer Simpson? But then that would have its own issues.

------
jimbokun
Could this example have been accomplished with awk and xargs just as fast,
with the same or less memory usage, and in fewer lines of code?

Seems so to me after skimming the article, but maybe I missed an important
advantage of using Akka Streams for this task?

~~~
thelastnode
Yes, the initial parts of the example could be accomplished with awk and
xargs, but as the article goes on to demonstrate, even doing something like
printing every nth element would be difficult.

I think the intent was for this to be more of a demonstrative example; with a
more complex, evolving, real-world processing pipeline, Akka Streams could be
really useful.
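
The "every nth element" operation the article uses can be sketched as just one more generator stage in a plain-Python pipeline — this is an illustration of the streaming-stage idea, not Akka Streams itself, and the names are made up:

```python
def every_nth(stream, n):
    """Keep every nth element of an iterable (the nth, 2nth, ...)."""
    return (x for i, x in enumerate(stream, start=1) if i % n == 0)

# Composing stages, pipeline style: a lazy source feeding a sampler.
lines = (f"item-{i}" for i in range(1, 101))
sample = list(every_nth(lines, 25))
```

The point of a streams library is that stages like this come with backpressure and parallelism for free, rather than being re-implemented per pipeline.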

------
MrDosu
Are streaming JSON parsers that rare?
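
They don't have to be exotic: even Python's standard library can consume a stream of concatenated JSON values incrementally via `json.JSONDecoder.raw_decode`, one document at a time rather than one giant parse. A minimal sketch:

```python
import json

def iter_json(text):
    """Yield successive JSON values from a string of concatenated
    documents, decoding one value at a time."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace/newlines between documents.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

For a multi-gigabyte dump like Wikidata's (one entity per line), the even simpler approach is to parse line by line instead of loading the whole file.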

