
Show HN: Hadoop in Excel - karamazov
https://datanitro.com/hadoop_in_excel.html
======
karamazov
Hi, I'm one of the developers. I'd be happy to answer any questions you have
on this.

If you're in New York, I'd love to meet you in person at our Big Data in Excel
meetup this Monday:
[http://www.meetup.com/DataNitro/events/149402612/](http://www.meetup.com/DataNitro/events/149402612/)

And, as the page says, we're looking for beta users! If you're interested in
this, know someone who might be, or just have an opinion, I'd love to talk to
you. You can comment here or reach me at ben at datanitro.com.

~~~
cs702
Looks like a _killer product_, because there are a lot of business people who
readily know how to write what is essentially functional code in Excel but for
whatever reason cannot or will not write even simple map and reduce functions
in a conventional programming language to extract information from a large
Hadoop data set.

Are formulas or spreadsheet browsing limited in any way?

~~~
karamazov
Formulas should be used as columnar operations: you can apply one formula to
every element of a column (map) or aggregate an entire column (reduce). (This
isn't that much of a restriction - you can use IF statements to make complex
expressions here.)

Spreadsheet browsing is limited to a sample of the data (head + tail); you can
set the size of the sample before pulling.
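
A minimal sketch of the map/reduce idea described above, written in Python rather than Excel formulas. It is only an illustration, not DataNitro's implementation; the column values and the function names `flag_large` and `count_flags` are made up. The map step mirrors a per-cell formula such as `=IF(A1>100, 1, 0)`, and the reduce step mirrors a `=SUM(...)` over the resulting column.

```python
from functools import reduce

# Hypothetical column of order totals; the values are invented for illustration.
order_totals = [120, 45, 300, 80]


def flag_large(total):
    """Per-cell formula, like =IF(A1>100, 1, 0)."""
    return 1 if total > 100 else 0


def count_flags(flags):
    """Whole-column aggregate, like =SUM(B:B)."""
    return reduce(lambda acc, value: acc + value, flags, 0)


# Map: apply the same formula to every element of the column.
flags = [flag_large(total) for total in order_totals]

# Reduce: collapse the mapped column into a single value.
large_order_count = count_flags(flags)

print(flags)              # [1, 0, 1, 0]
print(large_order_count)  # 2
```

The conditional inside `flag_large` is what the IF-statement remark refers to: the formula stays a single per-element expression, but it can still branch.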

------
pvnick
Did you write your mappers and reducers in Java using the Hadoop API, or does
this translate into HiveQL or some other higher-level language? Great job btw,
this looks super helpful for the business types to get useful reports on their
own rather than interrupt the workflow of someone with more formal training
(huge issue typically).

~~~
karamazov
Thanks! We're working directly in Java right now, but might explore
alternatives later. We're also planning to add support for Impala/Presto/etc.

------
monstrado
What are you using on the back-end to perform the queries? Are you using
MapReduce? What are the average latency expectations when using the
application?

~~~
karamazov
We are using MapReduce. Latency will depend on your cluster and the query;
it's just a regular MapReduce operation from Hadoop's point of view.

------
staunch
Funny as this sounds, it may in fact be a perfect fit for a large subset of
Hadoop use cases, if it works well.

------
prawks
Being pretty naive to the space, I'm assuming the killer differentiator from
Microsoft's own Power Query (which looks like it can pull from Hadoop) is that
this pulls a subset of data as an initial workspace, while Power Query pulls
all of the data? Any other key differences?

Really cool tool! Wish I had some large real-world Hadoop cluster to try it
out on...

~~~
karamazov
The major difference is the ability to run queries on Hadoop in addition to
being able to pull data.

------
eigenvalue
I think this would really benefit from a dead simple tool that would allow
users to import from CSV files into a local Hadoop instance, without having to
do anything besides install Hadoop. But this seems like something that could
really democratize data analysis on large data sets considering the number of
people who are pretty good with Excel.

------
RobGoretsky
I've seen demos of a tool called Datameer which seems to offer very similar
functionality (an Excel-like interface for configuring a job on a small set of
data, followed by submission of that job to a Hadoop cluster as a MapReduce
job). How does DataNitro compare to that?

------
jackmaney
Ummmm...doesn't Excel have a row limit of somewhere around 1 million?

~~~
karamazov
Yes, it does. This doesn't involve pulling all of your data into a
spreadsheet.

------
wbsun
Can Excel open a 1-billion-row data file?

~~~
karamazov
No, it can't - the limit is just over one million rows. This doesn't involve
pulling anywhere near that many rows into your spreadsheet.

~~~
wbsun
Then why MapReduce?

~~~
karamazov
We let people with Hadoop clusters pull a small sample of data into Excel,
analyze it with Excel formulas, and then run the analysis on the full data
set. The last part happens outside of Excel.
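
A rough sketch of that workflow in Python, assuming a head + tail sample and a simple counting analysis. Everything here is hypothetical (the helper names, the in-memory list standing in for data on the cluster); it only shows the shape of "prototype on a sample, then run the same logic on everything".

```python
def head_tail_sample(rows, n):
    """Return the first n and last n rows; return all rows if there are few."""
    if n <= 0:
        return []
    if len(rows) <= 2 * n:
        return list(rows)
    return list(rows[:n]) + list(rows[-n:])


def analysis(rows):
    """Stand-in for the spreadsheet formulas: count values above 500."""
    return sum(1 for value in rows if value > 500)


# Stand-in for the full data set, which in practice would live on the cluster.
full_data = list(range(1, 1001))

# Step 1: pull a small sample and develop the analysis interactively.
sample = head_tail_sample(full_data, 5)
preview = analysis(sample)

# Step 2: run the same analysis over the full data set.
result = analysis(full_data)

print(sample)   # [1, 2, 3, 4, 5, 996, 997, 998, 999, 1000]
print(preview)  # 5
print(result)   # 500
```

The preview and the full result differ, which is the point: the sample is for building and checking the formulas, not for estimating the final answer.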

------
Fomite
While impressive in terms of a technical achievement, Excel is a pretty
appalling analysis tool generally. I fear for what it will turn into when you
throw this much at it. Big Data doesn't let you power through being wrong.

~~~
karamazov
This is aimed at people doing simple analyses on massive sets of data, which
can work extremely well. [1] We're not advocating that people without a data
science background start doing ML or something.

[1] See "The Unreasonable Effectiveness of Data", by Peter Norvig, Alon
Halevy, and Fernando Pereira at Google.

~~~
Fomite
Which is why I'm not besmirching your technical achievement as much as...Excel
is widely abused by the ignorant, Big Data is widely abused by the
ignorant...Hadoop in Excel...

