
The New Data Engineering Ecosystem: Trends and Rising Stars - ShaunaLAnderson
http://insightdataengineering.com/blog/new-ecosystem/
======
demian
It seems to me we are in the same place front-end web development was a couple
of years ago: there is an explosion of new tech and frameworks. BUT there seems
to be a speed bump for the adoption of new tech in data engineering: the HIGH
cost of migrating terabytes and terabytes of a company's data to a new system.

Migrating and integrating data is incredibly expensive. When you implement a
new system you either have to migrate the data from the older system (=
expensive) OR run both systems in parallel, one for "older" data and
one for newer data (= really expensive).
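
To give a feel for the scale, here is a rough back-of-envelope sketch of the raw copy time alone. The data size and throughput are made-up numbers for illustration, not figures from the article, and `transfer_days` is a hypothetical helper. It ignores validation, schema conversion, and downtime, so treat the result as a lower bound at best.

```python
def transfer_days(terabytes: float, throughput_mb_per_s: float) -> float:
    """Days needed to copy `terabytes` at a sustained throughput (decimal units)."""
    total_mb = terabytes * 1_000_000  # 1 TB = 1,000,000 MB
    seconds = total_mb / throughput_mb_per_s
    return seconds / 86_400  # seconds per day


# Hypothetical figures, for illustration only.
DATA_TB = 500
THROUGHPUT_MB_S = 200

print(f"{transfer_days(DATA_TB, THROUGHPUT_MB_S):.1f} days")  # prints "28.9 days"
```

Even under these generous assumptions the copy alone takes about a month, before any of the integration work starts.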

Currently the established, risk-averse corporations, the owners of most of the
non-social-media, non-internet data, still seem to be testing enterprise
Hadoop distributions like Cloudera and Hortonworks for daily enterprise data
operations, and maybe have some R&D projects for harnessing new "types" of
data (like sensor data from a factory floor or high-granularity transport data
for supply chains).

Still, I hope the best of the new tech can get a place inside the modern
corporations that work on important problems like Energy and Healthcare.

------
jszymborski
One file format I would also recommend, although definitely neither a trend nor
a rising star, is HDF5[1], which has done a stellar job for me. It also has
a ton of wrappers, and has been widely used for ages (it's been around in some
form or another since 1987 and has been adopted by NASA to store their Earth
Observing System data).

[1]
[https://en.wikipedia.org/wiki/Hierarchical_Data_Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)

~~~
ap22213
HDF5 is great, but has anyone developed a native Java version? The official
version has a Java binding with a nonintuitive interface. And, it requires a
lot of prerequisite installations which makes it a bit difficult to deploy.

But I may be completely missing the best practices.

~~~
jszymborski
JHDF5[1] seems to be a high-level abstraction that has a nice API from what I
can tell. Can't speak to deployment or prereqs, but they seem to have a JAR[2]
you can just throw in.

[1] [https://wiki-bsse.ethz.ch/display/JHDF5/JHDF5+%28HDF5+for+Ja...](https://wiki-bsse.ethz.ch/display/JHDF5/JHDF5+%28HDF5+for+Java%29)
[2] [http://www.hdfgroup.org/products/java/hdf-object/##DOWNLOAD](http://www.hdfgroup.org/products/java/hdf-object/##DOWNLOAD)

------
ameyamk
It's actually very simple, with some clear winners emerging now:

Log / Stream Processing - Kafka
Scalable Storage - HDFS
Data Processing - Spark, MapReduce (in that order)
Historical Analytics - Hive / Spark SQL
Real-Time Processing - Spark Streaming, Storm
NoSQL - Cassandra / HBase
NoSQL (in memory) - Redis
Search - Elasticsearch

Some more honorable mentions:
Kibana on Elasticsearch - for analytics visualization
Druid - for analytics

Above are the basics - if you add them you will have 90% of the standard stack
for big data.

------
iblaine
Ah, job security for Data Engineers. Nice.

