
Data Wrangling at Slack - dianamp
https://slack.engineering/data-wrangling-at-slack-f2e0ff633b69
======
ransom1538
For what it is worth, every company I have worked for - and almost every
company I know - builds its own bizarre stats system. At each presentation I
attend (the last one being Uber's), the ideas for storing columnar data get
even nuttier. Frankly, I gave up. Now I've just installed New Relic Insights
and I can run queries, have dashboards, and get infinite scale. I understand
that Slack has scale - but why on earth hook together 30 random technologies
and become an analytics company too?

~~~
rb808
I'm surprised too. I work at companies that have their own data centers, so
they can't use New Relic, Datadog, etc. I'm really surprised there aren't more
free open-source analytics platforms for small projects. I'm going to start
one when I "get some spare time". lol.

Anyone know of anything out there?

~~~
user5994461
Using paid tools has no relation to whether or not you have your own
datacenter.

If you have a small project (what is "small"?), you can just use Google
Analytics or direct SQL queries against the single database you have. No need
for fancy tools.

The two free options I can think of are Piwik and Snowplow Analytics. They
clearly suffer from being "free open source" when compared to the paid tools
out there.

------
coldcode
Sometimes, looking at people's stacks, I wonder if we've made computing so
complicated that most of the time is spent dealing with stuff that is broken,
and little time is left to do anything useful. Data science seems even deeper
into this than programming in general, and sometimes you wonder if the result
is actually worth all the pain.

~~~
joaodlf
I feel like this happens because Data Science can only work when two
professional areas clash and mix: Programming and Maths. The two are very well
connected, of course, but the concepts behind the maths of Data Science are
much deeper than what the typical programmer is used to. Programmers need
Mathematicians as much as Mathematicians need Programmers. This is where it
gets hard: Programmers find it hard to implement these concepts. On the other
hand, Mathematicians don't understand what good software is.

Good data analytics software can only come when these two areas learn to teach
each other. Programmers need to learn maths to the point where they are
comfortable enough to implement a valid solution, Mathematicians need to learn
about building software that others can use.

~~~
user5994461
It is not my impression that data science mixes programming and maths, except
perhaps in the limited field of finance, where all the data and analysis are
maths-heavy.

~~~
joaodlf
I felt the same when our stats were based on simple arithmetic: "sum those
revenue figures", "divide that by the total number of users", "percentage of
returning members"...

It can easily spiral into "Pearson's Correlation" or "give me the linear
regression of the bastard".
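
For concreteness, here is the kind of jump being described, as a minimal
Python sketch (numpy and the toy figures are my own illustration, not anything
from the thread):

    import numpy as np

    # Toy per-user figures, purely illustrative.
    sessions = np.array([3, 7, 2, 9, 4])
    revenue = np.array([10.0, 22.0, 8.0, 30.0, 15.0])

    # The "simple arithmetic" tier: totals and averages.
    print(revenue.sum(), revenue.mean())

    # One step up: Pearson's correlation between the two series.
    print(np.corrcoef(sessions, revenue)[0, 1])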

~~~
user5994461
Still not hard maths. If all you have to do is apply a simple, standard,
well-documented algorithm, there is really no obstacle to your success =)

That being said, I guess that having had maths classes in my engineering
degree skews my point of view, combined with working at times with quants, who
do far more advanced analysis than that.

~~~
joaodlf
If you are familiar with those concepts, I would count that as a big step over
what I typically see in "data science" - and certainly a big step over what a
lot of people think data analytics is.

Like yourself, I had quite a bit of contact with maths during my engineering
degree - whether I took most of it in is a different question :) (Financial
Calculus nearly destroyed me).

Developers aren't typically aware of concepts outside basic statistics, and
even though a lot of algorithms are readily available for everyone to
implement and benefit from, how can you use what you don't know conceptually?

I guess everyone has a different experience; it depends on where you're
working, really. I do know of quite a few shops where the push for analytics
came from the tech people, mostly because the companies don't employ people
with the maths knowledge to identify these business gains.

~~~
user5994461
They are really just maths algorithms, seen in maths courses or found with a
quick Google search.

The typical Reddit developer who got a job without a degree is unaware of
many, many things.

The typical developer who got a job through a hardcore interview at a random
financial company and is surrounded by other master's and PhD holders? Not so
much.

The typical tech company doesn't need much advanced analysis. If they could
figure out how many recurring users and how much revenue they have, that would
be a good start :D

------
bhntr3
Seems like a pretty typical set of problems. Dependency conflicts: hard.
Schema evolution: hard. Upgrades: hard.

The big data space still feels like an overengineered, fractured, buggy mess
to me. I was hoping Spark would simplify the user experience, but it's as much
of a clusterf*ck as anything else.

How hard can fast, reliable distributed computation and storage for petabytes
of data be? He said, ironically.

~~~
zaptheimpaler
IMO one major problem is integration between different projects. Like you
said, it's a hard problem, and any solution typically depends on many, many
different open-source projects because of the scope of the challenges. All of
those projects move forward without much coordination between the teams
because they're open source. Then we end up in this fun, fun clusterfuck.

~~~
joaodlf
There is some hope: Apache Arrow is, in my opinion, a step in the right
direction. A common in-memory data layer for storage and data-analysis
systems? Yes please. It's important to start thinking about how all these big
data storage/analytics tools can bridge the gaps between themselves, and
hopefully projects like Apache Arrow will help... as long as there is
adoption.

~~~
dgudkov
A common in-memory _columnar_ data layer would make a lot of sense because a)
columnar is generally better for analytics, and b) converting from one
columnar format to another can theoretically be done without decompression,
because columnar data is typically compressed using standard algorithms
(dictionary/vocabulary compression, RLE, etc.). I wrote a few suggestions for
such an open-source data layer here: [http://bi-review.blogspot.ca/2015/06/the-world-needs-
open-so...](http://bi-review.blogspot.ca/2015/06/the-world-needs-open-source-
columnar.html)
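
To make the encoding point concrete, here is a minimal sketch of dictionary
encoding on a single column, using pyarrow (i.e. Apache Arrow; the library
choice and the toy column are my assumptions, not the author's):

    import pyarrow as pa

    col = pa.array(["us", "us", "de", "us", "de"])
    enc = col.dictionary_encode()

    # The distinct values are stored once, plus a small integer index per
    # row; two formats sharing this encoding could exchange the column
    # without decompressing it.
    print(enc.dictionary)  # ["us", "de"]
    print(enc.indices)     # [0, 0, 1, 0, 1]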

~~~
infinite8s
Have you seen the Apache Arrow project?
[https://arrow.apache.org/](https://arrow.apache.org/)

------
buremba
We actually have a pretty similar architecture at
[https://rakam.io](https://rakam.io): we use Presto for ad-hoc analysis, Avro
for hot data, and ORC for columnar storage. Similar to Slack, we have an
append-only schema (stored in MySQL instead of Hive). Since Avro has field
ordering, the parser uses the latest schema, and if it hits EOF in the middle
of the buffer, it fills the unread columns with null. We modified the Presto
engine and built a real-time data warehouse: Avro is used when pushing data to
Kafka, and the consumers fetch the data in micro-batches, process and convert
it to ORC format, and save it to both local SSD and AWS S3.
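
As a minimal sketch of that append-only evolution idea (fastavro and the toy
"Event" record are my own assumptions, not Rakam's actual code):

    import io
    from fastavro import parse_schema, schemaless_reader, schemaless_writer

    # v1: the schema the producer wrote with.
    writer_schema = parse_schema({
        "type": "record", "name": "Event",
        "fields": [{"name": "user_id", "type": "long"}],
    })

    # v2: one field appended; the null default lets old data resolve.
    reader_schema = parse_schema({
        "type": "record", "name": "Event",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "country", "type": ["null", "string"], "default": None},
        ],
    })

    buf = io.BytesIO()
    schemaless_writer(buf, writer_schema, {"user_id": 42})
    buf.seek(0)

    # Reading old data with the latest schema fills the missing column with
    # null, analogous to the fill-unread-columns-as-null behaviour described
    # above.
    print(schemaless_reader(buf, writer_schema, reader_schema))
    # {'user_id': 42, 'country': None}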

~~~
BrandonBradley
Are you using Avro by your own choice, or because of Confluent's toolset
(which uses Avro on Kafka)?

~~~
buremba
We tried Avro, Thrift, and Protobuf, and Avro was our choice. The schema of
collections in Rakam is dynamic, and with both Thrift and Protobuf, schema
evolution is not that easy at runtime. Avro is easier to use in Java and
doesn't enforce code generation, and its dynamic classes are optimized for
performance, so it's a better option for us.

------
zaptheimpaler
I had a very similar experience with Parquet and cross-system pains. Pretty
much the whole big data space is a giant clusterfuck of poorly documented and
ever-so-slightly incompatible technologies... with hidden config flags you
need to find to get things to work the way you want, classpath issues, tiny
incompatibilities between data storage formats and SQL dialects, and so on.

Hoping someone on this thread could answer a related question: how do you
store data in Parquet when the schema is not known ahead of time? Currently we
create an RDD and use Spark to save it as Parquet (which I believe has an
encoder/decoder for Rows), but this is a problem because we can't stream each
record as it comes, and we use a lot of memory to buffer before writing to
disk.
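
One possible approach (my own sketch, not an answer from the thread): infer
the schema from the first micro-batch and stream the rest through pyarrow's
ParquetWriter, so only one batch is buffered at a time. Note this locks the
schema after the first batch, so it sidesteps the buffering problem rather
than the unknown-schema problem itself:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_stream(batches, path):
        """batches: an iterator of dicts of equal-length column lists."""
        writer = None
        try:
            for batch in batches:
                table = pa.Table.from_pydict(batch)
                if writer is None:
                    # The schema is fixed by the first batch.
                    writer = pq.ParquetWriter(path, table.schema)
                writer.write_table(table)
        finally:
            if writer is not None:
                writer.close()

    write_stream(
        iter([{"user": ["a", "b"], "clicks": [3, 5]},
              {"user": ["c", "d"], "clicks": [1, 9]}]),
        "events.parquet",
    )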

------
mastratton3
We're actually having a debate now, as we start to process larger datasets,
about whether we should keep everything on S3 or start using HDFS with Hive.
I'm curious whether you considered HDFS, why you decided to go strictly with
S3, and, additionally, whether there are any issues you encounter with S3.

~~~
dianamp
We considered HDFS, but we really liked the idea of having compute-only
clusters and keeping our data completely separate. Cluster failures happen,
and having the data on S3 makes us worry less if a cluster goes down: just
spin up a new one and you're good to go.

There is a bit more latency when using S3 compared to HDFS, but it's not bad,
and the benefits outweigh it. We do have a couple of jobs that store some
intermediate results in HDFS, but in the end everything lands in S3.

We encountered a few issues with S3 at the beginning, mostly around eventual
consistency, but nothing that could not be fixed.

~~~
mastratton3
Oh great, thanks for the reply. I think that's about where we'll land: keep S3
as the primary store, but use HDFS for intermediate jobs.

~~~
dianamp
Good luck and have fun! :D

------
vikiomega9
I'm curious about how much time is spent moving data back and forth from S3.
It sounds like they don't currently have an ETL per se.

~~~
user5994461
Pick one solution among:

- alooma.io (SaaS queuing and transformation pipeline that saves to S3)

- segment.io (SaaS analytics platform that can save to S3)

- snowplowanalytics (clusterfuck open-source self-hosted analytics pipeline)

------
Plough_Jogger
We are implementing a very similar architecture, and have decided to use Avro
for schema validation/serialization rather than Parquet.

Does anyone have experience with both who can speak to their strengths and
weaknesses?

~~~
maxnevermind
Parquet may consume less space because it uses encoding enhancements like
delta encoding, run-length encoding, and dictionary encoding. Also, a large
number of tools support Parquet as a format, whereas Avro is Java- and
Hadoop-centric.

~~~
andrioni
The other way around: Avro is supported by pretty much every language out
there, while you can't even write a Parquet file in Python, and even reading
one is pretty hard.

------
eng_monkey
Data engineering is about developing the technology for data management. Data
management/analysis is about using that technology to produce results.

So this is not about data engineering, but about data management/analysis.

------
dangoldin
We (adtech) use a very similar approach. We consume a ton of data through
Kafka and then use Secor to store it on S3 as Parquet files. We then use Spark
for both aggregations and ad-hoc analyses.

One thing that sounds very interesting, and that worked surprisingly well when
I played around with it, is Amazon's Athena
([https://aws.amazon.com/athena/](https://aws.amazon.com/athena/)), which lets
you query Parquet data directly without relying on Spark, which can get
expensive quickly. I wouldn't trust it for production use cases just yet, and
it ties you more and more into the AWS ecosystem, but it might be worth
exploring as a simple way to do basic queries on top of Parquet data. I
suspect it's simply a managed service on top of Apache Drill
([https://drill.apache.org/](https://drill.apache.org/)).
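
For a feel of the API surface, here is a minimal Athena query submitted via
boto3 (the region, database, table, and results bucket are made up for
illustration):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Runs asynchronously; poll get_query_execution() with this id for status.
    resp = athena.start_query_execution(
        QueryString="SELECT user, SUM(clicks) FROM events GROUP BY user",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(resp["QueryExecutionId"])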

~~~
idunno246
Not Drill; it's on top of Presto. Presto is quite good, but the open-source S3
support is definitely second-class because FB doesn't use it; hopefully AWS is
contributing their connector back. Likewise, FB uses ORC, and Parquet is
better supported externally.

Since S3 listing is so awful, and given the huge number of partitions we
needed, we had to write a custom connector that was aware of the file
structure on S3 instead of using the Hive metastore, which has lots of
limitations, so I'm a little wary of Athena. CREATE TABLE AS SELECT is amazing
too: write SQL to generate temporary Parquet/ORC files back to S3 to query
later. I hope Athena will support this, if it doesn't already.

------
v0g0n
With Qubole you can offload data engineering to their platform. Cluster
management is super simple. Hand-rolled solutions are, in my experience, a
pain, and elastic cloud features take time to build. Qubole's offering
provides an out-of-the-box experience for most big data engines out there:
Presto, Spark, Hive, Pig - what have you - all work with your data living in
S3 (or any other object storage). I believe they have offerings on other
clouds too.

Qubole's engineering team has also done some S3 listing optimisation:
[https://www.qubole.com/blog/product/optimizing-s3-bulk-
listi...](https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-
performant-hive-queries/)

They also have features that auto-provision additional capacity in your
compute clusters as your query processing times increase.

~~~
ktamura
When Amazon Athena actually matures, wouldn't it solve at least the
interactive query needs, probably at a much lower/elastic price point than
Qubole?

~~~
v0g0n
True, I've tried Athena, and it's great in terms of cost, performance, and
ease of use. However, most data engineering teams need lots of custom tweaks
and a certain level of control to add JARs, applications, and UDFs to their
queries. I don't see that available through Athena today.

------
poorman
Apparently the concept of sampling has been lost to time.

~~~
disgruntledphd2
I think that many people don't trust sampling.

I like sampling for figuring out how something works; it allows me to iterate
much, much quicker.

However, if you need individual-level predictions, sampling probably isn't
going to help.
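
A trivial sketch of that iterate-on-a-sample workflow (pure Python; the sizes
are arbitrary):

    import random

    population = range(10_000_000)               # stand-in for the full dataset
    sample = random.sample(population, 100_000)  # a 1% sample

    # Develop and debug the analysis against `sample`, then run the final
    # version over the full population once it works.
    print(len(sample), sum(sample) / len(sample))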

------
henrygrew
Isn't moving data back and forth from S3 rather expensive?

~~~
gashad
AWS doesn't charge to put data into S3, and it's free to pull data out to any
AWS service within the same region. It can get expensive to pull data out
across regions or out of AWS infrastructure (i.e. to your private data
center).

~~~
meritt
AWS does indeed: they charge $0.005 per 1,000 PUT requests (which is 12.5x
more expensive than GET requests), and then you're immediately paying for
storage space as well.

~~~
vacri
Wow, I hadn't noticed that before. Storing data in S3 costs less than a third
of what it costs to pull it across the wire (2.3c/GB to store, 9c/GB over the
wire, in us-east-1).
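
Checking the arithmetic on the figures quoted in this subthread:

    # Request pricing: $0.005 per 1,000 PUTs vs $0.0004 per 1,000 GETs.
    print(0.005 / 0.0004)  # 12.5, the "12.5x more expensive" figure

    # Storage vs transfer in us-east-1: 2.3c/GB to store, 9c/GB over the wire.
    print(0.023 / 0.09)    # ~0.26, i.e. less than a third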

------
OskarS
This is off-topic, but I can't help myself:

Slack, Hive, Presto, Spark, Sqooper, Kafka, Secor, Thrift, Parquet.

I sometimes can't tell the difference between real Silicon Valley product
names and parodies. I'm starting to miss the days when it was all just letters
and numbers.

~~~
ivm
There's a game "Pokemon or Big Data?"

[https://pixelastic.github.io/pokemonorbigdata/](https://pixelastic.github.io/pokemonorbigdata/)

~~~
reuven
This is _AMAZING_. Thank you.

------
vs2370
Well, for what it's worth, my experience interviewing for the data team there
was terrible: a long coding exercise that, when submitted, resulted in a
seven-day wait and a two-line email. Wouldn't recommend.

~~~
guessmyname
What surprises me the most about Slack's jobs page is that most - if not all -
of the positions are on-site. It surprises me because most of the
remote-friendly companies that I know use Slack as their main communication
method, so I would expect Slack itself to have some remote positions, if only
for the dogfooding [1]. I have applied three times for a regular SDE position
there: twice I was rejected because I was not (permanently) living in the US,
and the third time I got no response, even while I was staying in NYC.

[1]
[https://en.wikipedia.org/wiki/Eating_your_own_dog_food](https://en.wikipedia.org/wiki/Eating_your_own_dog_food)

~~~
tyingq
This article talks about that specifically:
[http://readwrite.com/2014/11/06/slack-office-
communication-p...](http://readwrite.com/2014/11/06/slack-office-
communication-productivity/)

An excerpt...

 _Which raises the question: With such a good tool for team communication, why
does Slack need an office? Why not do all your work virtually?

“There are some conversations that are much easier in person,” says Brady
Archambo, Slack’s head of iOS engineering._

~~~
draw_down
Revealing.

