
Ask HN: Simple, beginner friendly ETL / Data Engineering project ideas? - zabana
Hi HN,

I'm a seasoned Python software developer. Recently I have found a new obsession with data processing, management, engineering, etc. I'd like to (eventually) branch off into that field, but I find the lack of beginner-friendly resources is slowing me down. All I can find is Spark- and Hadoop-related articles (I know these are prominent in the field, but I want to learn to walk before I run). So if any of you have pointers, websites, or project ideas I can use to get a good grasp of the fundamental concepts, I'd really appreciate it.

Thanks a lot in advance
======
3pt14159
Spark, etc. are great, but honestly, if you're just getting started I would
forget all about existing tooling that is geared towards people working at
300-person companies and instead read The Data Warehouse ETL Toolkit by Kimball:

[https://www.amazon.com/gp/product/1118530802/](https://www.amazon.com/gp/product/1118530802/)

I learned from the second edition, but I've heard even better things about the
third. As you're working through it, create a project with real data and from-
scratch re-implement a data warehouse as you go. It doesn't really matter what
you tackle, but I personally like ETLing either data gathered from web crawling
a single site[0] or a weekly Wikipedia dump. You'll learn
many of the foundational reasons for all the tools the industry uses, which
will make it very easy for you to get up to speed on them and to make the
right choices about when to introduce them. I personally tend to favour tools
that have an API or CLI so I can coordinate tasks without needing to click
around, but many others like a giant GUI so they can see data flows
graphically. Most good tools have at least some measure of both.

[0] Use something like Scrapy for python (or Mechanize for ruby) with CSS
selectors and use the extension Inspector Gadget to quickly generate CSS
selectors.

~~~
Kagerjay
I own all 3 Kimball books, they are fantastic

~~~
mipmap04
Probably the best live training I've ever attended for data warehousing.

------
drej
1) Learn to do as much in plain Python as possible, focusing on lazy evaluation
(itertools, yielding, ...); you'll be able to process gigabytes with a tiny
memory footprint, deployment will be a breeze, etc. (see the small sketch at the
end of this comment).

2) Get to know some of the basic Python data processing/science packages like
pandas, numpy, scipy etc.

3) Get used to writing short shell scripts - they probably won't be a part of
your pipeline, but data engineering, especially development, involves a lot of
data prep that coreutils will help you with. Throw in a bit of jq and you'll
handle a lot of prep work.

4) Only once you've gotten used to the above, look at Dask, PySpark, Airflow
etc. It really depends on your use cases, but chances are you won't have to
touch these technologies at all.

Bottom line: hold off on the heavyweight tools; they might be needlessly
powerful for what you need. Also, work closely with DevOps, because seeing the
deployment side of things will help you understand the consequences of your
actions.
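
Here is a minimal sketch of the lazy-evaluation idea from point 1, assuming a large CSV file called events.csv with "status" and "amount" columns (all hypothetical). The file is streamed one row at a time through generators, so memory use stays roughly constant regardless of file size:

```python
import csv
from itertools import islice

def read_rows(path):
    """Yield rows from a CSV file one at a time (nothing is loaded up front)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def only_errors(rows):
    """Lazily filter rows; each row is inspected only when the consumer asks for it."""
    return (row for row in rows if row.get("status") == "error")

def to_amounts(rows):
    """Lazily convert rows into floats."""
    return (float(row["amount"]) for row in rows)

# The pipeline is composed here, but nothing runs until we iterate over it.
pipeline = to_amounts(only_errors(read_rows("events.csv")))

# Consume only the first 10 results; the rest of the file is never read.
print(list(islice(pipeline, 10)))
```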

~~~
bdhess
I like this answer. I'd also add:

0) Spend time finding data sets that actually interest you.

If you're working with data that you're actually excited about, the questions
will flow naturally. And when the questions flow naturally, there's an obvious
path to picking up new techniques as you need them.

~~~
lmiller1990
This was the thing that helped me most. I just can’t make myself excited about
the price of a house or the size of a tumor. Stats and predictions for my
favorite sport, on the other hand, gave me context that naturally made me ask
questions regarding the accuracy of my results and how I can improve them.

------
dbatten
I second the Luigi recommendation. We use it at Republic Wireless for all of our
data warehouse ETL, and it's been fantastic to work with.

I also second the other comment that recommends starting with basic data
extraction rather than diving into Hadoop or Spark immediately. Sure, at some
point, you might need to process 100 billion lines of data. But in your
average business, you're far more likely to be working with thousands or
millions of records on customers, sales, orders, invoices, sales leads, etc.
That stuff doesn't need Hadoop/Spark, it needs a Postgres database and a DBA
with a good head on their shoulders keeping everything organized.

In my experience, government data sets (particularly demographics and other
geographically-related data sets) are a fantastic way to get your feet wet
with data processing. They're published by a bunch of different agencies, so
they're not necessarily conveniently available in one place. However, they
usually use standardized identifiers for geographies, which makes it easy to
join the data sets together in new and interesting ways.

For instance, here at Republic, we recently used Form 477 data on wireless
broadband availability from the FCC, data from Summary File 1 of the US
Census, and a couple of Census geographic crosswalk files to be able to
calculate the percentage of population in given zip codes and cities covered
by various wireless carriers. That required reading the docs for several
different data sources, automating some downloads, building database tables to
hold all of the information, and then carefully crafting some SQL to pull it
all together.

Of course, government data sets generally won't require a whole lot of
automation (they're updated yearly or less than yearly, not daily). To build
your skills on that front, I'd recommend learning to extract data from various
APIs, structure it in a meaningful way, and make it available in a database.
For example, if you have a website, set up a free Google Analytics account for
it, then build a daily ETL that extracts some meaningful information from the
Google Analytics API and stuffs it in a Postgres DB. Then see if you can build
some charts or something that sit on top of that database and report on the
information.
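
As a rough sketch of that last idea: assume you already have a function that pulls yesterday's metrics from whatever API you choose (the fetch_metrics helper and the table and column names here are hypothetical); psycopg2 handles the load into Postgres, and cron handles the "daily" part.

```python
import datetime
import psycopg2

def fetch_metrics(day):
    """Placeholder: call your API of choice (Google Analytics, etc.) and
    return rows as (date, page, pageviews) tuples for the given day."""
    raise NotImplementedError

def load(rows):
    conn = psycopg2.connect("dbname=analytics user=etl")  # adjust to your setup
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS daily_pageviews (
                day date, page text, pageviews integer
            )
        """)
        cur.executemany(
            "INSERT INTO daily_pageviews (day, page, pageviews) VALUES (%s, %s, %s)",
            rows,
        )
    conn.close()

if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    load(fetch_metrics(yesterday))  # run this from cron once a day
```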

~~~
mtrn
Luigi is a great Python library for building multi-stage pipelines with
parameterized tasks. It easily extends to new storage types and targets.

We currently use it to build a moderately complex product from dozens of data
streams (files, APIs and things in-between) and millions of records. At its
core is a DAG and topological sort, which capture the essence of pipelines and
execution.

A DAG is easy to visualize, even its development over time[1].

[1] [https://imgur.com/a/RuC0B5Y](https://imgur.com/a/RuC0B5Y) (generated from
the project's code, in this case
[https://github.com/miku/siskin](https://github.com/miku/siskin))
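
To give a feel for what a parameterized Luigi task looks like, here is a minimal sketch (the file names, date, and the fetch_csv_for helper are hypothetical). requires() is what defines the edges of the DAG, and Luigi's scheduler resolves them in topological order:

```python
import luigi

class Download(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw-{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write(fetch_csv_for(self.date))  # hypothetical helper

class CountRows(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Download(date=self.date)  # edge in the DAG

    def output(self):
        return luigi.LocalTarget(f"data/count-{self.date}.txt")

    def run(self):
        with self.input().open() as f, self.output().open("w") as out:
            out.write(str(sum(1 for _ in f)))

# Run with: luigi --module this_module CountRows --date 2018-01-01 --local-scheduler
```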

------
ebullientocelot
Hi, data engineer here. Other comments have a lot of good suggestions. I
especially agree with ideas like avoiding high-powered frameworks in the
beginning and learning to write effective transform wrappers for different
kinds of weird input data. One thing you'll find is that (for enterprise
situations at least) your source data is going to come from extremely odd,
old-fashioned, or very poorly documented sources. Be prepared, at times, to
find twenty- and thirty-year-old manuals on formats you've never heard of.

That aside, the other, very important side of the coin is getting very
familiar with how folks talk about their data problems. Managers, analysts,
etc. will often request a specific solution that they've heard of or that seems
popular; sometimes it is what they need, often it isn't. To get a solid
footing in this space (not that this isn't true for all types of software
engineering), it is critical to have a very strong understanding of business
requirements, to understand the work of many other roles in your org, and to
be able to communicate with business folks in ways that allow you to develop a
rational plan for a solution while getting them to realize what their needs
really are.

Best of luck!

~~~
zabana
Thanks for your insight!

------
veritas3241
Another commenter mentioned it as well, but "Designing Data-Intensive
Applications" by Martin Kleppmann
[https://dataintensive.net/](https://dataintensive.net/) is a _fantastic_
overview of the field and, I think, more approachable and enjoyable to read
than Kimball's book. But Kimball is a classic, especially for how to do
warehouse design.

I'll also make a plug for the Meltano[0] project that my colleagues are
working on. The idea is to have a simple tool for extraction, loading,
transformation, and analysis from common business operations sources
(Salesforce, Zendesk, Netsuite, etc.). It's all open source and we're tackling
many of the problems you're interested in. Definitely poke around the codebase
and feel free to ping me or make an issue / ask questions.

[0] [https://gitlab.com/meltano/meltano/](https://gitlab.com/meltano/meltano/)

------
RickJWagner
I worked in that field for a number of years. My recommendation to you is to
start with some form of data that you are passionate about: baseball
statistics, business metrics, investment figures, whatever.

Once you have the data, figure out what you're going to do with it. (Don't
agonize over it; this should all take just a day or so.)

Then go after the toolkit. You'll find many interesting questions if you start
with the end goal in mind.

Good luck, and have fun!

~~~
KasianFranks
This is the best advice I've seen here. The purpose of data engineering is to
support a much larger effort related to scientific discovery or increasing
bottom line revenue. Don't start with the language, tooling, libs and
architecture, start with a purpose and goal. Most goals relate to mimicking a
portion of human cognition in terms of pattern matching.

If you're building a house you don't start by picking out hammers and saws.

This takes being passionate about your data and its pre- and post-processing
as a data engineer.

------
jakestein
[http://Singer.io](http://Singer.io) is an open source ETL project written in
Python. The components are small, composable programs that you can run
independently, so you should be able to walk before you run.

A good beginner project is to build an adapter to a new data source (known as
a "tap" in Singer). Most taps pull data out of business tools like Salesforce
or Marketo, but people also build them to pull characters from the Marvel API
([https://www.stitchdata.com/blog/tapping-marvel-
api/](https://www.stitchdata.com/blog/tapping-marvel-api/))
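
A tap is essentially a program that writes JSON messages (SCHEMA, RECORD, STATE) to stdout, and the singer-python helper library keeps that to a few lines. Here is a minimal sketch of a tap against a hypothetical endpoint (the URL, stream name, and schema are made up; a real tap would also read config and state):

```python
import requests
import singer

SCHEMA = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    }
}

def main():
    # Declare the stream and its schema once.
    singer.write_schema("characters", SCHEMA, key_properties=["id"])

    # Hypothetical endpoint returning a list of {"id": ..., "name": ...} objects.
    resp = requests.get("https://example.com/api/characters")
    singer.write_records("characters", resp.json())

    # Emit state so the next run can pick up where this one left off.
    singer.write_state({"characters": {"last_run": "2018-01-01"}})

if __name__ == "__main__":
    main()
```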

Check out the getting started guide ([https://github.com/singer-io/getting-
started](https://github.com/singer-io/getting-started)), or jump into the
Singer Slack and ask for help (linked from the guide)

~~~
vladsanchez
Fantastic reference! Thanks for sharing. As an ETL specialist I've looked for
a composable solution but only found Luigi. I wanted something simpler, more
integrated, like IFTTT. This has the feel of Yahoo Pipes. I think Singer is
the answer. ;-)

~~~
dttos
Composableanalytics.com might be the tool you are looking for then....

------
danthelion
I would recommend looking into Python-based workflow managers: Luigi[0], then
Airflow[1], to get the hang of scheduling, DAGs, etc. Both are fairly simple to
get started with (especially for a seasoned Python developer) and are used in
production environments as well.

[0] [https://github.com/spotify/luigi](https://github.com/spotify/luigi)

[1] [https://github.com/apache/incubator-
airflow](https://github.com/apache/incubator-airflow)
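
For reference, a minimal Airflow DAG looks something like the sketch below (the extract/load functions are hypothetical placeholders; the import path is the Airflow 1.x one). The `>>` operator is what declares the dependency edge:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def extract(**context):
    print("pull data from the source here")  # hypothetical placeholder

def load(**context):
    print("write data to the warehouse here")  # hypothetical placeholder

dag = DAG(
    "toy_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

t_extract = PythonOperator(task_id="extract", python_callable=extract,
                           provide_context=True, dag=dag)
t_load = PythonOperator(task_id="load", python_callable=load,
                        provide_context=True, dag=dag)

t_extract >> t_load  # load runs only after extract succeeds
```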

~~~
tomnipotent
Only just look - don't actually try to use one of these on a side project
unless you want to waste a LOT of time. They're amazing for large projects
moving 100GB+ a day and running dozens or hundreds of different jobs to
orchestrate a moving system, but are a total time sink for anything small.

~~~
ljw1001
100 GB per day is pretty small, but your point holds

------
diogofranco
I wrote a simple blog post recently with links to data engineering resources I
found useful to get into the field, hopefully it is helpful:
[https://diogoalexandrefranco.github.io/data-engineering-
reso...](https://diogoalexandrefranco.github.io/data-engineering-resources/)

~~~
zabana
Brilliant post, currently looking at the SQL section of leetcode, thanks a lot
my friend. Very much appreciated.

------
djeebus
I've always found enjoyment in finding statistics in things that interested
me, so here's a few thoughts on project ideas. The main thing is to find a
data set that interests you, and use that interest for fun or for profit:

\- pull down your Facebook backup, run it through sentiment analysis, throw it
in a db, explore yourself through Facebook's eyes!

\- a while back I was looking to buy a car and used Scrapy to process cars.com,
ran their VINs through a lookup tool to find cars that were _actually_ manuals
(and not just manumatics, seems few can tell the difference these days). Found
reasonable national prices, average miles, etc.

\- interested in politics? pull down the data sets for voting records, explore
your local politicians' voting records, compare them to national averages or
historical information, etc.

\- interested in movies? find movie datasets (or scrape imdb.com,
themoviedb.org, etc.) and find which genres pay the least per actor, have the
smallest cast, etc.

Lots of datasets are available online, if not in machine readable format, then
in a format that can be easily scraped. Have fun!

------
collinf
If you want to make it as a career choice, I think you should start with
learning Java and Scala. For better or worse, this field is tied to the JVM,
and learning these languages will make picking up Spark and Hadoop (which, tbh,
are a prereq on the resume for any Data Engineering position) a lot easier.

Also if you are looking to stay in the Python world, PySpark is pretty
intuitive for any Python developer and tons of companies are using it.

------
pacuna
Build a web scraper and save some raw data somewhere like S3. Then run some job
on top of that data to get aggregated measures and save them somewhere. I built
a project[1] like this and learned a lot in the process. I used Airflow to run
the scraping tasks, save the data in S3, run queries with AWS Athena, and load
data into Redshift. I did all of this just to learn more about Airflow and some
AWS tools.

[1] [https://skills.technology/data-engineer](https://skills.technology/data-
engineer)
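
As a rough sketch of the "save raw data to S3" step with boto3 (the bucket name, key layout, and records are hypothetical), dumping a day's scrape as newline-delimited JSON so a query engine like Athena can pick it up later:

```python
import json
import boto3

def save_raw(records, day):
    """Dump one day's worth of scraped records to S3 as newline-delimited JSON."""
    body = "\n".join(json.dumps(r) for r in records)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-scraper-raw",            # hypothetical bucket
        Key=f"scrapes/{day}/records.json",  # partitioned by day
        Body=body.encode("utf-8"),
    )

save_raw([{"title": "example", "price": 42}], "2018-01-01")
```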

------
mmaia
I find the "ETL Best Practices with Airflow" guide a good start. Even if you
don't go the Airflow route, you can benefit from its example implementations of
Kimball and Data Vault ETLs.

[https://gtoonstra.github.io/etl-with-
airflow/index.html](https://gtoonstra.github.io/etl-with-airflow/index.html)

[https://github.com/gtoonstra/etl-with-
airflow](https://github.com/gtoonstra/etl-with-airflow)

------
kfk
I have put together a (very short) post on building a dashboard using Shopify
data, with the ETL handled by Stitch. You could try to implement the Stitch
part in Python and you would have a complete solution. Likewise, looking at
Stitch/Blendo will give you some ideas of simple ETL workflows. Keep in mind
that ETL changes depending on what you want to do. In theory you can do all ETL
with just Python code. If you have a SQL-compliant database that can hold your
data, the ETL process can simply be a matter of running SQL queries to
transform data within the database. Then you basically load your data from any
data source into your db and run a bunch of SQL to do the transform part.

[https://assemblinganalytics.com/post/airbnbapache-
superset-a...](https://assemblinganalytics.com/post/airbnbapache-superset-and-
shopify/)
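
That load-then-transform-with-SQL pattern can be tried end to end with nothing but the standard library. A minimal sketch with SQLite (the table and column names and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect("shop.db")

# Load: dump raw rows into a staging table as-is.
conn.execute(
    "CREATE TABLE IF NOT EXISTS raw_orders "
    "(id INTEGER, customer TEXT, amount REAL, created TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, "alice", 10.0, "2018-01-02"), (2, "bob", 25.5, "2018-01-02")],
)

# Transform: the 'T' is just SQL running inside the database.
conn.execute("""
    CREATE TABLE IF NOT EXISTS daily_revenue AS
    SELECT created AS day, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY created
""")
conn.commit()

print(conn.execute("SELECT * FROM daily_revenue").fetchall())
```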

------
lixtra
Software-wise, you could also have a look at Dask, which is more lightweight
than Hadoop and Spark.

But since you are asking for a project, why not do something local? I.e. you
could scrape some data for some time (cinemas, crime) and structure it nicely.
After a few months you can start the analysis. Bonus points if you make your
data available.

------
indogooner
Like others, I would say start with a problem which interests you.

To mimic the real world, try to have extraction from a variety of data sources
(RDBMS, NoSQL, files in different formats (csv/json), APIs (Salesforce)). Once
you have different sources, extract the data into your data lake built on
S3/GCS/HDFS. Once the data is present, you need to integrate with tools which
can extract value. You can use vendor-specific tools like Athena/BigQuery or
open source ones like Presto/Impala/Hive. You can do analytics where you
require filtering, cleansing, and joining various datasets. You can also look
at storing the results in different formats so that other tools like Tableau
can use them. To orchestrate all of this you can use Azkaban or Airflow.

My suggestion is slightly biased towards the Hadoop ecosystem, but the good
thing is that most tools here have open source alternatives.

------
cl0vn
I think you can easily learn to walk with Spark as well. There are a lot of
beginner Spark tutorials online. See, for instance,
[https://community.cloud.databricks.com](https://community.cloud.databricks.com).
They give you a Spark cluster where you can start, and there are several
tutorials in a notebook style. Check it out. Of course, you could start your
ETL / Data Engineering in a more "traditional" way, trying to learn about
relational databases and the like. But I would suggest you start directly with
Spark. You can use it with lots of other big data tools (such as Hadoop/Hive
and also S3) and you could also find some interesting Machine Learning use
cases.
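
If you do go that way, local mode needs nothing beyond `pip install pyspark`. A minimal sketch (the CSV path and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("toy-etl").getOrCreate()

# Extract: read a CSV with a header row (path and columns are hypothetical).
df = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Transform: aggregate revenue per customer.
revenue = df.groupBy("customer").agg(F.sum("amount").alias("revenue"))

# Load: write the result out as Parquet.
revenue.write.mode("overwrite").parquet("revenue.parquet")

spark.stop()
```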

~~~
zabana
Bookmarked, thanks!

------
d__k
This list has numerous pointers to resources related to data engineering:
[https://github.com/igorbarinov/awesome-data-
engineering](https://github.com/igorbarinov/awesome-data-engineering)

------
Zaheer
It may not be exactly what you're looking for but perhaps to get started
without having to set up all the infrastructure you could use something like
AWS Glue. There are some tutorials / examples in the official AWS docs:
[https://docs.aws.amazon.com/glue/latest/dg/aws-glue-
programm...](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-
python-samples-legislators.html)

Disclosure: I work on AWS Glue. Note: We're Hiring! Email me at: zaheeram <at>
amazon dot com if you're interested in this space!

------
account2
Do Data Engineering jobs require you to have knowledge of statistics, machine
learning, etc.? Or can you get data engineering jobs where you are focused on
connecting pipelines and data flow? I am interested in a data engineering job,
but a lot of data engineering job listings seem to require machine learning
expertise. I am interested in a purely data-engineering-focused role.

~~~
gnulinux
Look for functional-programming data engineering jobs, e.g. in Haskell; they
might be what you're looking for, since most people don't run their algorithms
in Haskell (some do!) and instead use it as duct tape/piping.

------
hhs19832
Many posts here are focused on classic ETL. I'm working on a small project for
handling data just after ETL.

It’s for dealing with annoyingly large data, bigger than RAM but sitting on a
personal PC. It basically performs sampling and munging for this data. There’s
no good solution for this right now (I know because I've been looking for more
than a year).

What might be interesting to you is that there's little abstraction in the
project, but it's non-trivial to execute. To me, this makes it fun. Despite
its simplicity, it has high utility and could be used by others. This would be
a great outcome for an initial project.

I've got a working version of it, but it would benefit from the eye of a
seasoned Python dev.

Maybe it would be interesting to you to get in touch? My email is:
mccain.alex@yandex.com

Cheers,

------
debarshri
I think the basic concepts of data warehousing, like creating data marts,
building star or snowflake schemas, dimensional modelling, and slowly changing
dimensions, are quite important before you jump into why Hadoop, Hive, HBase or
Spark is relevant.

------
idiotclock
Spark is not too tricky to dive into, even though you can't really take
advantage of it unless you have a big cluster to use :)

If you want to practice data manipulation, and a lot of the map-reduce type
stuff you can do with Spark, I find Pandas useful for small datasets (and there
is a lot of overlap in functionality as far as DataFrames are concerned).

For pipeline stuff, definitely take a look at Luigi, but again, without a
cluster it'll be less fun. Still, if you try automating tasks with a mini Luigi
scheduler on your localhost, it would be good practice.
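
As a small illustration of that Spark/Pandas overlap, here is the same groupBy-style aggregation in Pandas (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["alice", "bob", "alice"],
    "amount": [10.0, 25.5, 4.5],
})

# Roughly the equivalent of Spark's df.groupBy("customer").sum("amount")
revenue = df.groupby("customer", as_index=False)["amount"].sum()
print(revenue)
```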

------
meterplech
Google Cloud Composer, built on top of Airflow (mentioned by a number of
people here) has a great Getting Started Guide to check out:
[https://cloud.google.com/composer/docs/quickstart](https://cloud.google.com/composer/docs/quickstart)

You can dive right in to Airflow here:
[https://airflow.apache.org/tutorial.html](https://airflow.apache.org/tutorial.html)

------
textmode
What are the industries that have the highest costs for ETL/Data Engineering?

What companies in these industries are interested in reducing their costs for
this work?

"costs" as used above includes time expenditures as well as spending money

[http://web.archive.org/web/19991023120316/http://www.dbmsmag...](http://web.archive.org/web/19991023120316/http://www.dbmsmag.com:80/9509d05.html)

------
eleijonmarck
Hey Zabana!

I noticed you are in Europe. We are looking for developers with an interest in
going into the field.

We produce a data pipelining/analytics platform for complex data for our
customers.

Get in touch if you would like to know more about our company! The position is
in Stockholm.
[https://gist.github.com/eleijonmarck/1b384480aaa3d22ab7e6ea0...](https://gist.github.com/eleijonmarck/1b384480aaa3d22ab7e6ea036d297d10)

------
chrisweekly
You might want to take a look at [lnav](https://lnav.org), a sort of mini-ETL
CLI powertool. Performance is fine up to a few million rows.

Edit: of course this comment (like many others below) pertains to tooling vs a
particular project per se. FWIW I agree w/ others' sentiment about doing
things "by hand" and working with data that holds your interest.

------
jakecodes
While still alpha right now, Meltano is a project from GitLab. We are looking
for contributors who have a passion for bringing the excellent ideas of
software development to the data science world.
[https://gitlab.com/meltano/meltano/](https://gitlab.com/meltano/meltano/).
Feel free to post some issues and give it a try.

------
twocats
an excellent survey of the field:
[https://dataintensive.net/](https://dataintensive.net/)

~~~
veritas3241
I cannot recommend this book enough! I thoroughly enjoyed it and really
appreciated the clear definitions and strong opinions.

~~~
eddiepeter
Agreed. This book will give you a fantastic way to think about all of this.

------
stadeschuldt
You could check out [https://github.com/mara/mara-example-
project](https://github.com/mara/mara-example-project)

The project is just a demonstrator for the Mara framework but it gives a good
overview and you could take it as a starter for something you want to build.

------
currymj
if you have any interest in baseball,
[https://www.retrosheet.org/](https://www.retrosheet.org/) might be of
interest. They have play-by-play accounts for every Major League baseball game
going back to 1914.

this strikes me as a good testbed for personal projects, because:

\- it's enough data to be inconvenient, but can still definitely fit on a
single machine

\- it comes in a weird format that is sort of a pain to process, which is good
practice

\- just loading each event into a database won't be enough, you'll have to
transform and reorganize it further to support answering interesting questions

\- last I checked, there are undocumented but public APIs for more recent MLB
games (on some MLB website), so automatically scraping those and incorporating
them into the retrosheet data is another interesting challenge.

------
zabana
OP here. Thanks for all the great links, suggestions and ideas. One question I
forgot to ask is the following: what exactly does Transform mean? It seems to
me that it's a very loose concept that can change depending on the project.
I'd love to know your thoughts on this. Cheers.

~~~
dscpls
I think it's as vague as it sounds. But the stereotypical transform is probably
something like translating rows to columns, normalizing a column, or creating a
composite column.
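
In case it helps make "transform" concrete, here is a tiny Pandas illustration of those three stereotypes (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b"],
    "metric": ["clicks", "views", "clicks"],
    "value": [3.0, 10.0, 7.0],
})

# 1) Rows to columns: one column per metric.
wide = df.pivot(index="user", columns="metric", values="value")

# 2) Normalize a column: scale values into [0, 1].
df["value_norm"] = df["value"] / df["value"].max()

# 3) Composite column: combine two fields into one key.
df["user_metric"] = df["user"] + "_" + df["metric"]

print(wide)
print(df)
```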

------
danbrooks
Web scraping is a good project. LinuxAcademy has some good content on AWS and
Hadoop ($50/month).

------
Swinx43
Hi, I have been working in this field (and some of its previous incarnations)
for quite a while. I would be happy to have a discussion, give some pointers,
and relay some experience if you want.

My email and keybase.io is in my profile so feel free to get in touch.

------
dfsegoat
Luigi. We literally use this for anything requiring a series of more than a
few steps:

[https://github.com/spotify/luigi](https://github.com/spotify/luigi)

------
sbussard
SQL. Most of my ETL jobs consist of setting up the overhead pieces in pyspark
then using SQL for the most important logic. It's way more portable that way.

~~~
sitkack
What I like about using SQL is that I can prototype with SQLite, Postgres, etc.
and then adapt and move to Hive, Impala, BigQuery, etc. SQL is literally
everywhere, and I don't have to learn new paradigms and frameworks.

------
boredmgr
Convert .mat files to json, csv, or xls formats.
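
For instance, a minimal sketch of the .mat to CSV/JSON direction with SciPy and Pandas (the file name and the variable stored inside it are hypothetical):

```python
import pandas as pd
from scipy.io import loadmat

mat = loadmat("experiment.mat")   # hypothetical file; returns a dict of arrays
data = mat["measurements"]        # hypothetical 2-D variable inside the file

df = pd.DataFrame(data)
df.to_csv("experiment.csv", index=False)
df.to_json("experiment.json", orient="records")
```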

