
Ask HN: All of you working with Big Data, what is your Data? - sysk
Big data is a trending topic these days and I'd like to get my hands dirty, both out of curiosity and to make myself more relevant in the marketplace. That being said, I'm not sure which data sets are both interesting to play with and easily accessible. My question is:

For those of you already working with big data, what kind of data do you work with?
======
alanctgardner3
If you want to work in a "big data"-type role as a developer, I wouldn't worry
about finding huge data sets. There's a dearth of candidates, especially ones
who actually have hands-on experience, and having deep knowledge of (and a
little experience with) a broad range of tools will make you a pretty good
candidate:

Fire up a VM with a single-node install on it [1] and just grab any old CSVs.
Load them into HDFS, query them with Hive, query them with Impala (Drill,
Spark SQL, etc.). Rinse and repeat for any size of syslog data, then JSON data.
Write a MapReduce job to transform the files in some way. Move on to some
Spark exercises [2]. Read up on Kafka, understand how it works and think about
ways to get exactly-once message delivery. Hook Kafka up to HDFS, or HBase, or
a complex event processing pipeline. You'll probably need to know about
serialization formats too, so study up on Avro, Protobuf and Parquet (or
ORCFile, as long as you understand columnar storage).
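
To make the first couple of steps concrete, here's a minimal PySpark sketch
of the kind of exercise I mean (a sketch, not gospel: the HDFS path, file
name and column layout are all made up, and it assumes a CSV already loaded
into HDFS on the quickstart VM):

    from pyspark import SparkContext

    sc = SparkContext(appName="csv-exercise")

    # Hypothetical CSV already in HDFS, e.g. via:
    #   hdfs dfs -put sales.csv /user/cloudera/sales.csv
    lines = sc.textFile("hdfs:///user/cloudera/sales.csv")

    # Toy aggregation: count rows per value of the first column -- the
    # same shape of job you'd then redo as MapReduce, HiveQL or Impala.
    counts = (lines
              .map(lambda line: line.split(","))
              .map(lambda fields: (fields[0], 1))
              .reduceByKey(lambda a, b: a + b))

    for key, n in counts.take(10):
        print(key, n)

Once the same query works in Hive, Impala and Spark, you'll have a feel for
where each tool shines.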

If you can talk intelligently about the whole grab bag of stuff these teams
use, that'll get you in the door. Understanding RDBMSes, data warehousing
concepts, and ETL is a big plus for people doing infrastructure work. If
you're focused on analytics you can get away with less of the above, but
knowing some of it, plus stats and BI tools (or D3 if you want to roll your
own visualization) is a plus.

[1]
[http://www.cloudera.com/content/cloudera/en/downloads/quicks...](http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html)
[2] [http://ampcamp.berkeley.edu/5/](http://ampcamp.berkeley.edu/5/)

~~~
vijayr
For someone with a typical web dev background (knows how to handle databases
like Oracle or MySQL, but nothing about tools like Hadoop etc.), could you
recommend a course/book to get started with big data? Also, with so much to
learn, how does one go about deciding which field within big data to
specialize in?

~~~
IndianAstronaut
>how does one go about deciding what field within big data to specialize in?

In my experience your job generally dictates what you specialize in. I ended
up being more of a data engineer than a data scientist, since my job had a
lot of tricky data warehousing problems.

------
me_bx
The Twitter social graph (follow connections between people) is my data
source; I extract it from the API and cache it in a database.

The MariaDB table storing this information currently takes a bit more than
500GB; it has about 4 billion rows (based on the statistics; I don't run
SELECT count(*) on it anymore).

I usually don't use the term "big data" because the buzzword is so popular
that it doesn't mean anything anymore.

~~~
billyhoffman
Won't SELECT COUNT(*) be super slow? Isn't SELECT COUNT(some_primary_key) a
better idea?

~~~
icebraining
It should be the same. COUNT(*) only needs to return the number of rows
(regardless of their values), so it can use an index alone, while
COUNT(column) must count only non-null values. But since the primary key is
non-null, it should end up taking the same time.
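
If you want to check this on your own schema, here's a quick sketch (the
table name, credentials and the mysql-connector-python dependency are my own
assumptions, not anything from this thread):

    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(user="root", password="secret",
                                   database="test")
    cur = conn.cursor()

    # EXPLAIN shows the access path; on InnoDB both forms should report
    # an index-only scan ("Using index") over the same key.
    for query in ("SELECT COUNT(*) FROM follows",
                  "SELECT COUNT(id) FROM follows"):
        cur.execute("EXPLAIN " + query)
        print(query, cur.fetchall())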

~~~
billyhoffman
Wow, just tested this against a table with some pretty wide columns and ~7M
rows. They take exactly the same time! I would have thought COUNT(*) would be
like SELECT *, but I guess the query planner is smart enough to know what to
do.

Thanks!

------
jawns
Day job: Web site user sessions and offline retail sales data.

Side project: Poll responses on
[http://www.correlated.org](http://www.correlated.org)

------
malux85
Here's some stuff I have done in the past year. I work for a small company,
but run a personal computing cluster of 167 servers that I pay for out of my
own pocket. I really enjoy loading "big" datasets into them and working on
improving algorithms or gaining insight into the data.

I (try to) network around London and offer my services for free to people who
have interesting problems.

\- Very high resolution fMRI data. A single scan can be 10-20GB

\- Infringing URLs for a piracy company, 4 billion rows

\- DNA sequences and protein data, lots of variation in sizes, from a few
hundred MB of string data to hundreds of GB

\- Raw radio data for a military skunkworks project (tens of GB/min)

I would really like to find an investor who could take me off my full-time
job. I have 3 quite large projects I would like to build, one of which I have
almost finished.

~~~
sysk
Do you actually have 167 servers running in your home or do you rent them on
e.g. EC2 when you need them? If it's the former, is it because it makes
financial sense (I'd be surprised) or is it for the experience/fun?

~~~
malux85
I have 3 blade servers in my house; these act as command-and-control
machines. The 167 are rented on EC2.

------
ScottBurson
If you're looking for data sets to play with, check out Kaggle [0]. Companies
post data sets there along with questions they want answered, and people
compete to find the best way to answer them.

[0] [http://www.kaggle.com](http://www.kaggle.com)

~~~
billsossoon
Interesting idea. I'm going to start posting my company's workload online in
the form of competitions, letting people work for the _possibility_ of being
compensated at sub-market rates.

------
Maro
I work at Prezi. We have about a petabyte of data. It's usage data coming from
the product and the website: clicks in the editor and such. Then we have a
data warehouse with cleaned and accurate datasets; that's much less. We are on
AWS; we use S3, EMR for Hadoop, Pig, Redshift for SQL, Chartio, etc. We have
our own hourly ETL written in Go which we will open-source this year.

I recently talked at Strata, here's the Prezi:

[https://prezi.com/d1889jmlziks/strata-2014/](https://prezi.com/d1889jmlziks/strata-2014/)

------
nevinera
Retail transaction/loyalty, network traffic, financial, and health data.

To be clear, 'big data' is poorly defined, and I mostly do not work with
terabyte+ data sets, but rather with high-dimensional data in moderate
volume. Data is only 'big' relative to the algorithms you try to use on it.

------
serhanbaker
You can actually think up an interesting application and generate your own
data. For example, we were developing a product for processing network events
in real time. There were 6-10K events per second, and we were creating alerts
for several different scenarios. For testing purposes, we actually wrote a
program to simulate those events, at 20K events per second. It was generating
fake (but realistic) data in the right format.

Application idea off the top of my head: generate turnstile data for
different subway stations (enter/exit, time) and write an application to show
the density of those stations over time. You can create a scenario where a
certain station is denser than the others, and this could be your test. This
application could be your proof of concept.
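
A minimal sketch of such a generator in Python (the station names, weights
and event rate are invented; the point is just skewing one station to be
denser than the rest):

    import csv
    import random
    from datetime import datetime, timedelta

    # Hypothetical stations; "Union Sq" is weighted heavier to simulate
    # the "one station is denser than the others" test scenario.
    STATIONS = ["Union Sq", "Astor Pl", "Canal St", "Bergen St"]
    WEIGHTS = [0.4, 0.2, 0.2, 0.2]

    def generate_events(n, start=datetime(2015, 3, 1)):
        """Yield n fake turnstile events as (station, direction, time)."""
        t = start
        for _ in range(n):
            t += timedelta(seconds=random.expovariate(1.0))  # ~1 event/s
            station = random.choices(STATIONS, weights=WEIGHTS)[0]
            yield station, random.choice(["enter", "exit"]), t.isoformat()

    with open("turnstile_events.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["station", "direction", "timestamp"])
        writer.writerows(generate_events(100000))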

------
crzrcn
Treasure Data, 400k records per second. For us it's less about the data we
manage and more about how easy we make it for customers to store and query
it.

Data comes from IoT devices ranging from wearables to cars to frickin'
windmills, plus analytics from various websites and mobile games.

------
laughfactory
I am the data modeler for an organization which lends to small businesses. In
my experience "big data" is all in the eye of the beholder, and it's not all
about how many gigabytes of data you work with, or how wide or how long it
is. The challenges are the same: how to use the data in relevant ways to
further organizational goals. In my case the data isn't particularly long in
terms of number of rows, but it is exceptionally wide in terms of potential
variables. It's enough data that I have to spend a reasonable amount of time
thinking about the most efficient way to model (statistically) and data mine.
The issues are similar to other data-oriented jobs I've had: how to determine
which variables are relevant, how to clean and transform the data... and
ultimately how to turn a big pile of data into a model which effectively
predicts the likelihood of charge-off if the loan were to be approved.
Scintillating stuff, but obscenely difficult. Of course, it's harder too
because I'm the only modeler and am fairly inexperienced. My last experience
building predictive models was a couple of classes in college... which was
also my last experience using R (which I prefer to SAS).

To answer your implied question, I'd recommend picking up ANY size of
real-world data and playing with it. Build statistical models (predictive or
otherwise), apply supervised and unsupervised machine learning methods to it,
but above all develop a foundation of experience working with real-world
data. In class in college we used "canned" data sets which were already
cleaned, validated, organized, and so forth. This made it unrealistically
easy to model. In the real world, just working with the data effectively is a
hard-won skill. So from the get-go you need to learn how to explore data,
visualize it, interpret plots and statistics, clean/transform/normalize it,
formulate a question your data can answer, and apply the relevant methods in
pursuit of the answers you seek. Once you have the fundamentals down, the
size of the data is immaterial--only requiring you to put additional thought
into what you can computationally achieve (for instance, how to determine
which of 150 candidate variables are statistically relevant).
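
As a hedged sketch of that last point (the file, column names and the choice
of sklearn's univariate F-test are all my own invention, just one of many
ways to screen candidates):

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    # Hypothetical loan data: ~150 numeric candidate columns (already
    # cleaned/encoded) plus a 0/1 charge-off flag.
    df = pd.read_csv("loans.csv")
    X = df.drop(columns=["charged_off"])
    y = df["charged_off"]

    # Score each candidate against the outcome; keep the top 20.
    selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns)
    print(scores.sort_values(ascending=False).head(20))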

------
mgkimsal
OT, but I've never liked the term 'big data' precisely because it's so
ill-defined. Most people I speak with on this think they have 'big data'.
Anything they can't comprehend is "big data". Anything that makes their Excel
97 crash is "big data". It's a pervasive enough term that people have heard
it and use it wrongly.

A colleague of mine is at a company that's advertising for someone with "big
data" experience. Collectively, over more than 10 years in business, they
have maybe 100GB of data. They just do not know how to organize the data
sanely in a relational database, and actively refuse to consider normal data
structures.

~~~
sixdimensional
There is a definition for "very large database" (VLDB) on Wikipedia, the
precursor to the term "big data", although it is somewhat outdated.

------
kfor
Governmental health records and survey data. A lot of the really big stuff we
use requires academic licenses, but there's still a lot of publicly accessible
data.

For the U.S. try

\- CDC's National Center for Health Statistics:
[http://www.cdc.gov/nchs/](http://www.cdc.gov/nchs/)

\- CDC WONDER: [http://wonder.cdc.gov/](http://wonder.cdc.gov/)

\- NIH's Unified Medical Language System:
[http://www.nlm.nih.gov/research/umls/](http://www.nlm.nih.gov/research/umls/)

And for global try the Global Health Data Exchange:
[http://ghdx.healthdata.org](http://ghdx.healthdata.org)

------
kenrick95
There are organizations that collect big data at various locations and share
it among themselves. Have a look:
[http://webscience.org/web-observatory/](http://webscience.org/web-observatory/)

------
jayshahtx
I think the most interesting datasets are within reach but require you to
curate them yourself. For example, there are extremely powerful scraping
libraries in just about every popular language today, not to mention APIs
such as Twitter's.

If you're looking for a cool dataset to play with, I think it is more
productive to ask yourself what questions you want to answer and then
find/curate the data, vs. finding a dataset and then asking "what questions
can I answer?". The former approach will also keep motivation high if you're
driven by curiosity.

~~~
graycat
> I think it is more productive to ask yourself what questions you want to
> answer

I second that. An old remark is, "We often find that a good question is more
important than a good answer," as I recall due to Richard Bellman, the
leading proponent of _dynamic programming_ , i.e., usually a case of _optimal
control_ , either deterministic or stochastic (where the system gets _random
exogenous_ inputs while we are trying to control it). Bellman was into a lot
of pure and applied mathematics, engineering, medicine, etc. Bright guy. As I
recall, his Ph.D. was in the stability of solutions of initial value problems
for ordinary differential equations, from Princeton.

------
alexatkeplar
Human and machine-generated structured event streams, via Snowplow
([https://github.com/snowplow/snowplow](https://github.com/snowplow/snowplow)).

The largest open-access event stream archive I know about is from GitHub; I
think it's about 100GB:
[https://www.githubarchive.org/](https://www.githubarchive.org/)

------
matt_s
Data collected from devices, and it is large but not big: around 40-60TB of
very repetitive data. Find some open set of data that interests you and just
do something with it to get familiar with the tools.

I think most data sets could be handled via an RDBMS, and big data is just
another choice. The more interesting thing to me is what you accomplish, and
whether a new tech can get you there faster or cheaper, etc.

------
jongos
Job: Data Science Consultant, Governments and NGOs

For me it's primarily population data. It's not exactly 'big' data in the raw
form, but what makes it bigger are the variations we analyze, the
applications of predictive models, and the new metadata values extracted from
it.

Because of all the analysis, the data grows exponentially, faster than we're
collecting it.

------
bsmartt
Working with information about attacks all the way down the kill chain.
Everything from IDS sigs, English descriptions, attribution, and IP/host
reputation.

AlienVault is hiring security researchers.

edit: we have some limited data sets that we make public, in case you're
interested, hence the name 'Open Threat Exchange'.

------
dmichulke
Financial data (tick to EOD), network traffic data (TCP packet-level
sends/receives) and farm data (sensor + farm ERP data).

All of them are basically time series with some master data; none of them is
more than a few dozen GB.

So in any case, I think time series data is worth a look.

~~~
noname123
I run backtesting for options trading as a hobby, and storing EOD tick data,
querying it and extracting it is a pain. I currently download my source data
from a retail historical data provider, then store it in MongoDB on AWS. What
tech stack would you recommend for backtesting on time series data?

~~~
dmichulke
I also first used MongoDB, but I switched to Cassandra and I find it much
better.

MongoDB was bad for this case because it devours memory and it lacks a
primary key (so saving a tick that already exists creates a second tick).

Cassandra is better (it has a PK, although it's a bit weird because of its
distributed-first attitude), but in the long run I think Postgres would be
even better (because it's space-efficient).

Apart from that I use some Java libraries and Clojure/Incanter and program
the rest around it myself.
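
For what it's worth, a sketch of the tick table I mean, via the Python
cassandra-driver (the keyspace, columns and values are invented; the point is
that the (symbol, ts) primary key turns a re-saved tick into an idempotent
overwrite instead of a duplicate):

    from datetime import datetime
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS ticks WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # Partition by symbol, cluster by timestamp: two writes with the
    # same (symbol, ts) key collapse into one row.
    session.execute("""
        CREATE TABLE IF NOT EXISTS ticks.eod (
            symbol text, ts timestamp, price double, volume bigint,
            PRIMARY KEY (symbol, ts))
    """)

    session.execute(
        "INSERT INTO ticks.eod (symbol, ts, price, volume) "
        "VALUES (%s, %s, %s, %s)",
        ("AAPL", datetime(2015, 3, 2, 21, 0), 128.54, 48000000))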

------
danmaz74
I work on the relationships between hashtags and between hashtags and
influencers: [http://hashtagify.me](http://hashtagify.me)

For this analysis, we collect the data from Twitter's public API.

------
alexvay
Working on reducing Big Data to help network security engineers investigate
threats faster and respond more accurately.

The data currently comes mainly from various Network Security Monitoring
appliances & SIEMs.

------
calinet6
I work at Localytics. We have analytics data from billions of mobile and web
users, including specific user actions, usage in general, and user profiles.
It really is a fascinating dataset.

~~~
spacefight
Are you really supposed to be looking at your customers' data?

~~~
calinet6
I never said I looked at it—we work with it, meaning we develop products which
allow customers to operate on their own data.

------
Arkid
Have worked on a few Big Data projects:

\- Sensor data from haul trucks to predict their failures and optimize their
routes in the mines

\- Telematics data for insurance companies

------
valevk
Mostly logfiles and other machine-generated data (of which 99% can be thrown
away, but that's what "big data" does for me: filter out what's important).

------
gesman
Banking and brokerage portal access data.

Utilizing Splunk as an analytics and alerting platform to correlate real-time
financial activity events with multiple threat intelligence feeds.

------
lolwhat
IMDb and Box Office Mojo data. Thinking of moving to MongoDB.

------
robinho364
At search engine companies, we sort out cookies every day.

------
hijinks
I'm in devops but support the Hadoop cluster. We are an adtech company with
close to a 2-petabyte cluster that is around 76% full.

------
bobosha
Perhaps the biggest of big data problems: imagery (photos & video). We build
algorithms to extract value from imagery.

------
byoung2
Currently sentiment analysis on business reviews (Yelp, Google, Citysearch,
Facebook, OpenTable, TripAdvisor).

------
iskander
Genetic sequence data, mostly from cancer.

(Tools are terrible, data sizes up to hundreds of gigabytes per patient)

------
quentindemetz
Hotel reservations, prices, and numerous market indicators, for thousands of
hotels.

PriceMatch is hiring in Paris!

------
sjwhitworth
GPS and transport data.

~~~
3apo
Same. >1TB/hour of vehicle and sensor data, including video, GPS and radar.

------
daemonk
I work with genomics data. The data is more complex than big.

------
ronreiter
Log data of browsing history. 500k requests per sec.

~~~
bizzleDawg
How big is each event? How much indexing? How much infrastructure?

------
Demiurge
Working with ~1PB of remote sensing data and station-derived data. Never used
the term 'big data' in any work context.

------
vishalzone2002
Clickstream logs to build recommendation systems.

------
bkruse
Genomics!

------
gaius
I think I would consider anything 100TB and up to be big data. There is no
big data that is "easily accessible"; that's why it's "big": because it
requires extremely powerful hardware and advanced techniques to work on.
Otherwise it's just "data".

NOTE: there are people in the world who would laugh at my definition and say
that big data starts at 1PB.

~~~
benihana
> _NOTE: there are people in the world who would laugh at my definition and
> say that big data starts at 1PB._

I commend them for having a larger penis ^H^H^H^H^H^H data stack than you.

I thought big data was less about the actual size of the data store and more
about where it comes from (typically passive collection from user activity)
and how it's accessed (through some kind of large map-reduce style framework)
and used (to inform product decisions or learn more about human behavior)?

~~~
gaius
So some other definition of "big" than umm "big" then?

~~~
calinet6
Yes, absolutely. When people talk about big data, more often than not it's a
measure of _complexity_ and _difficulty,_ not size.

~~~
saalweachter
Yeah, for everyone but physicists it's really "big enough" data: it's a big
enough data set that you've started recording things you didn't even try to
record.

An excellent example was on HN the other day, using the NYC taxi data to
determine which drivers are observant Muslims. It's not something anyone set
out to record, but the data set has gotten so large that if you turn it
sideways and shake, random facts like that fall out.

