Ask HN: Building out analytics department, what platforms should we use? - uptownfunk
======
canterburry
I'd very much recommend you first think about the nature of your problem,
data, environment, transformation and storage needs and then start evaluating
platforms. Don't just take the latest buzz words and start hacking away.

Do you need batch analysis or stream processing? How much data do you have?
How will you need to combine stored data with streaming data? What data needs
to be in memory vs disk? How far back in time does your analysis need to go?
Can you pull that off in a streaming fashion or do you need a Lambda
architecture? Are all your queries known ahead of time or ad-hoc? How will you
access historical events? Do you really have data that requires 'big data'
tools or can you get away with R on a powerful workstation?

You don't give many details here so I can only throw out questions which
first need answers.

For example, in our environment we have the whole Cloudera stack, we use
MapReduce for some things, Spark for others and Storm for others. We discovered we don't
need Kafka but can get away with just a message broker. We have HDFS, Hive,
Couchbase, Oracle and a few other stores all with their own sweet spot.

------
happyslobro
Same here :) Right now, we're just starting to automate a bunch of work that
our one overworked analyst does manually with GA, FB marketing insights, and a
bunch of third party services. She is asking dev to extract data sets that
crash Excel when she imports them :/ We need to integrate all of this data
into one somewhat consistent stream. Then whenever she asks for a data set, we
(I) need to ask her what she's going to do with it, and whip up a little
stream processing service that can answer that question continuously.
Preferably in chat, not with a big fancy dashboard, because really, all those
graphs are just a means to an end. Her output isn't a graph, it's a single
paragraph recommendation or a boolean answer to a vague question.

I've just started looking into Apache Kafka and Storm. The whole stream
oriented paradigm looks really promising. If we could archive a _huge_ torrent
of incoming events, and then build little stream processing services over
time, as specific questions come in from analytics, and replay old events
through the system, I think that would put us in the position we want to be
in. I've been playing around with Storm on my laptop in clojure (Clojure is
_awesomesauce_ ), and contemplating how we should deploy this.
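
To make the archive-and-replay idea concrete, here is a minimal sketch in plain Python rather than Kafka or Storm. It assumes events are small JSON-serializable dicts and uses a local JSON-lines file as a stand-in for the archive; the names `append_event`, `replay` and `count_by` are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def append_event(log_path: str, event: dict) -> None:
    """Append one event as a JSON line; existing lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def replay(log_path: str) -> Iterator[dict]:
    """Yield archived events in the order they were written."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def count_by(events: Iterable[dict], key: str) -> Counter:
    """A tiny 'processor' written after the fact: count events by one field."""
    counts = Counter()
    for event in events:
        counts[event.get(key)] += 1
    return counts
```

A processor written months later can be run over `replay(path)` to answer a new question against old events. A real deployment would swap the file for a durable log and the function for a long-running consumer.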

As far as deployment goes, what I would like to do, is set up a big fancy
Mesos cluster to run all these little stream processors, and heck, might as
well run all of our product services in it too. What I will probably have to
settle for, is starting off on a single EC2 instance that occasionally has
little accidents. My ops skills are... a work in progress.

Anyway, if you want to get in touch, shoot me an email or something.
daniel.ross at kasra.co. It would be really cool to watch several of these
systems built up in parallel, so that we could all see what is working and
what problems need to be solved by a bigger group than our own. It isn't easy
convincing the CEO to let me open source stuff like this, but it also isn't
easy to hire developers who can learn a new sub-field and build something cool
in it, at least not for a relatively small company.

~~~
happyslobro
Just realised how silly it is for the CEO to think that we gain more over the
competition by working in isolation with a small team, than we do by
collaborating with people who are highly unlikely to actually be interested in
our niche. I'm going to have to thoroughly explain how this open source thing
works, over a beer or three.

~~~
uptownfunk
The thing with open source is that when it breaks, you have no one to yell at.
We've been using R/Python for some time now and have not had any problems there
at all, but it is probably their biggest concern. And I'm sure this very
subject has come up so often you should be able to find some very convincing
arguments around it.

~~~
happyslobro
Oh, the business peeps are totally fine with using OS. It's the idea of
sharing our own work that scares them; they worry that our competitors will
leapfrog us. I think that if our stuff is really that good, then we don't
really need to worry about competition from the tech angle. Especially since
we are dominating an obscure niche at the moment.

------
fenier
Setting aside specific platforms...

Platforms typically run in one of a few ways:

Generation: When someone falls into a segment.
Impression: The actual act of serving.
Session: A block of continuous time that defines a user session.

You'll find it much easier on yourself if you are dealing with the same kinds
of data when doing comparisons. Otherwise you'll have to do conversions of
that data to get it to be roughly in sync.

An example of this is trying to reconcile a session based system with a
generation based one. A user may only generate during the first session, and
get recorded once, whereas they could actually have multiple sessions recorded
over a given time box.
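
A rough sketch of that mismatch, in Python. It assumes events arrive as `(user_id, unix_timestamp)` pairs, that a session ends after 30 minutes of inactivity, and that a generation based system records each user exactly once; all three are illustrative assumptions, not how any particular platform defines them.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity cutoff, purely illustrative


def count_sessions(events):
    """Session view: events is a list of (user_id, unix_timestamp) pairs.

    Returns {user_id: number_of_sessions}.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        count = 1
        # A gap longer than the cutoff starts a new session.
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > SESSION_GAP_SECONDS:
                count += 1
        sessions[user_id] = count
    return sessions


def count_generations(events):
    """Generation view (simplified): each user is recorded once."""
    return {user_id: 1 for user_id, _ in events}
```

With the same raw events, the two views give different totals for the same time box, so one side has to be converted before the numbers can be compared.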

Once you've identified the above - does the data need to be real time? If so
what are the write / read requirements?

Will you be getting bulk feeds for import? What are the import times, etc.? Can
certain reports only be run after those processes execute?

Do you need a dashboard? Do you need Data Viz tools? Do you need very granular
data as far back as you can go? Is summarizing that data OK after a span of
time?
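
If summarizing is acceptable, the retention rule can be as simple as the sketch below. It assumes events are `(timestamp, name)` pairs and a 90-day window for granular data; the window and the function name `roll_up` are placeholders to adjust.

```python
from collections import Counter
from datetime import timedelta


def roll_up(events, now, keep_days=90):
    """Keep recent events as-is; collapse older ones into per-day counts.

    events: iterable of (timestamp: datetime, name: str)
    Returns (recent_events, daily_counts) where daily_counts maps
    (date, name) -> number of events.
    """
    cutoff = now - timedelta(days=keep_days)
    recent, daily = [], Counter()
    for ts, name in events:
        if ts >= cutoff:
            recent.append((ts, name))
        else:
            daily[(ts.date(), name)] += 1
    return recent, daily
```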

Do you feel you need to run the data collection internally, or are 3rd party
storage areas on the table?

Is the data transaction based? What happens when you roll back the transaction
- does the correct data get deleted?

Do you need to filter data based on internal vs external sources?

Answering a lot of those questions would help you identify your specific use
cases, and let you make a buy/build decision.

------
austinhulak
We use Segment > Lambda Function > Kinesis Firehose > Redshift + S3 > Tableau

If you don't mind writing some boilerplate JS to capture things like Device,
Location, etc, then you could really just send all of the events straight to the
lambda function via Amazon's API Gateway. This would save anywhere from
$100/mo to $2k a month.

The only downside to this approach is that we had to define our analytics
schema early on in the process in order to have it match up with the Redshift
database. If you went with Elasticsearch instead of Redshift you would
probably gain some flexibility in the schema (but at the cost of a normalized
dataset to wire up in Tableau or some other BI tool).
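
To illustrate what a fixed schema means in practice, here is a hedged sketch of the check an ingest function might run before forwarding an event. The column names in `EVENT_SCHEMA` and the helper `to_row` are hypothetical, not our actual schema, and the forwarding step to Firehose is left out.

```python
import json

# Hypothetical schema: column name -> required Python type.
EVENT_SCHEMA = {
    "event": str,
    "user_id": str,
    "timestamp": str,
    "device": str,
}


def to_row(payload: dict) -> str:
    """Validate payload against EVENT_SCHEMA and return one JSON line."""
    missing = [c for c in EVENT_SCHEMA if c not in payload]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    wrong = [c for c, t in EVENT_SCHEMA.items() if not isinstance(payload[c], t)]
    if wrong:
        raise ValueError(f"wrong type for columns: {wrong}")
    # Drop any fields the table does not have a column for.
    row = {c: payload[c] for c in EVENT_SCHEMA}
    return json.dumps(row) + "\n"
```

Rejecting or trimming events at this point keeps the warehouse table consistent, at the price of having to change the schema and the table together whenever a new field is needed.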

I would also be very interested to hear if anyone's used Amazon's BI service
(Quicksight?) or other tools like Looker or Periscope.

------
pamelanorwood
Please lead with the location of the position and include the keywords REMOTE,
INTERNS and/or VISA when the corresponding sort of candidate is welcome. When
remote work is not an option, please include ONSITE. Submitters: please only
post if you personally are part of the hiring company—no recruiting firms or
job boards.

Readers: please only email submitters if you personally are interested in the
job—no recruiters or sales calls.

You can also use kristopolous' nifty console script to search the thread:
[https://news.ycombinator.com/item?id=10313519](https://news.ycombinator.com/item?id=10313519).

~~~
uptownfunk
This was not a job solicitation. This was a question as to the tech stack of
an analytics team. Thanks.

