Ask HN: Simple, beginner friendly ETL / Data Engineering project ideas?
233 points by zabana 8 months ago | 72 comments
Hi HN,

I'm a seasoned Python software developer. Recently I have found a new obsession with data processing, management, engineering, etc. I'd like to (eventually) branch off into that field, but I find the lack of beginner-friendly resources is slowing me down. All I can find is Spark- and Hadoop-related articles (I know these are prominent in the field, but I want to learn to walk before I run). So if any of you have pointers, websites, or project ideas I can start with to get a good grasp of all the fundamental concepts, I'd really appreciate it.

Thanks a lot in advance




Spark, etc., are great, but honestly, if you're just getting started I would forget all about existing tooling that is geared towards people working at 300-person companies, and I would read The Data Warehouse ETL Toolkit by Kimball:

https://www.amazon.com/gp/product/1118530802/

I learned from the second edition, but I've heard even better things about the third. As you're working through it, create a project with real data and re-implement a data warehouse from scratch as you go. It doesn't really matter what you tackle, but I personally like ETLing either data gathered from web crawling a single site[0] or a weekly Wikipedia dump. You'll learn many of the foundational reasons for all the tools the industry uses, which will make it very easy for you to get up to speed on them and to make the right choices about when to introduce them. I personally tend to favour tools that have an API or CLI so I can coordinate tasks without needing to click around, but many others like a giant GUI so they can see data flows graphically. Most good tools have at least some measure of both.

[0] Use something like Scrapy for Python (or Mechanize for Ruby) with CSS selectors, and use the SelectorGadget extension to quickly generate CSS selectors.
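To make that concrete, a minimal Scrapy spider using CSS selectors might look like the sketch below (it targets the quotes.toscrape.com practice site; the selectors and field names are specific to that site):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Pull one record per quote block using CSS selectors.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination, if present.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Run it with something like "scrapy runspider quotes_spider.py -o quotes.json" and you've got the extract half of an ETL feeding a JSON file.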


Agreed. This book will give you a fantastic way to think about ETL strategy rather than simply pointing you to the latest library.

Some of the recent popular toolkits / services aren't "real" ETL -- they simply move data from one place to another. This is obviously a crucial part of ETL, but it's not the hard part. And without an understanding of data warehousing such as from this book, it will not be easy to discern the difference.

(This is based on many conversations with people on both sides of the table.)


I interned for a data warehousing team when I was in college (a random assignment) and this is the book everybody there lived by and recommended.


Second this, and also The Data Warehouse Lifecycle Toolkit.


I own all 3 Kimball books; they are fantastic.


Probably the best live training I've ever attended for data warehousing.


Your link points to The Data Warehouse Toolkit, not the ETL one.


Thanks, I linked to the right book, but I wrote the wrong title because I was originally going to recommend that one but changed my mind when I remembered what content was in what book.


1) Learn to do as much in plain Python as possible, and focus on lazy evaluation (itertools, generators/yield, ...); you'll be able to process gigabytes with a tiny memory footprint, deployment will be a breeze, etc. (see the sketch after this list).

2) Get to know some of the basic Python data processing/science packages like pandas, numpy, scipy etc.

3) Get used to writing short shell scripts - they probably won't be a part of your pipeline, but data engineering, especially development, involves a lot of data prep that coreutils will help you with. Throw in a bit of jq and you'll handle a lot of prep work.

4) Only once you've gotten used to the above, look at Dask, PySpark, Airflow etc. It really depends on your use cases, but chances are you won't have to touch these technologies at all.

Bottom line: hold off on the heavyweight tools; they might be needlessly powerful. Also, work closely with DevOps, because the deployment side of things will help you understand the consequences of your actions.
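To make (1) concrete, here is a minimal sketch of a lazy pipeline built from generators; it streams a gzipped CSV row by row, so memory use stays flat no matter how big the file is (the file name and column are made up):

    import csv
    import gzip

    def read_rows(path):
        # Stream rows one at a time; nothing is held in memory.
        with gzip.open(path, "rt", newline="") as f:
            yield from csv.DictReader(f)

    def clean(rows):
        for row in rows:
            if row.get("amount"):            # drop rows with a missing amount
                row["amount"] = float(row["amount"])
                yield row

    def total(rows):
        return sum(row["amount"] for row in rows)

    if __name__ == "__main__":
        # Handles multi-gigabyte files with a tiny, constant memory footprint.
        print(total(clean(read_rows("events.csv.gz"))))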


I like this answer. I'd also add:

0) Spend time finding data sets that actually interest you.

If you're working with data that you're actually excited about, the questions will flow naturally. And when the questions flow naturally, there's an obvious path to picking up new techniques as you need them.


This was the thing that helped me most. I just can’t make myself excited about the price of a house or the size of a tumor. Stats and predictions for my favorite sport, on the other hand, gave me context that naturally made me ask questions regarding the accuracy of my results and how I can improve them.


A million times yes. Whenever I teach people anything, I force them to use the tools on problems they actually have. You can solve a lot for Alice and Bob, but I'd rather you solved something for yourself.


Calling out the deployment side is huge - you may have tested your ETL software itself, but once real data starts flowing through the pipes, all bets are off. You might be interested in this read about the concept of pipeline debt: https://medium.com/@expectgreatdata/down-with-pipeline-debt-...

In full disclosure: I'm co-founder of Superconductive Health, our team co-created great-expectations and co-authored the above blog post.


Do you have recommendations on resources for learning 1)?

I'd add a possible 5) of Jupyter notebooks (or some dashboard framework) plus a visualization library, for learning to squeeze actionable analyses out of whatever data you have.


Thanks for your suggestions. I want to deal with the least amount of abstraction possible so I can truly understand the benefits of using industry-renowned frameworks (like the ones you've listed in 4).


What do you mean by "least amount of abstraction"? Unless I am misunderstanding what you mean by that, the suggestions by user drej entail less abstraction than Spark / Hadoop etc. Also, there's a difference between familiarity with "industry-renowned" frameworks and strong knowledge of the core small-scale libraries. FWIW, I'd certainly prefer to hire a developer/data scientist with the latter experience (if you feel like you have already mastered this, then congrats!).


I second the Luigi recommendation. Republic Wireless uses it for all of our data warehouse ETL, and it's been fantastic to work with.

I also second the other comment that recommends starting with basic data extraction rather than diving into Hadoop or Spark immediately. Sure, at some point, you might need to process 100 billion lines of data. But in your average business, you're far more likely to be working with thousands or millions of records on customers, sales, orders, invoices, sales leads, etc. That stuff doesn't need Hadoop/Spark, it needs a Postgres database and a DBA with a good head on their shoulders keeping everything organized.

In my experience, government data sets (particularly demographics and other geographically-related data sets) are a fantastic way to get your feet wet with data processing. They're published by a bunch of different agencies, so they're not necessarily conveniently available in one place. However, they usually use standardized identifiers for geographies, which makes it easy to join the data sets together in new and interesting ways.

For instance, here at Republic, we recently used Form 477 data on wireless broadband availability from the FCC, data from Summary File 1 of the US Census, and a couple of Census geographic crosswalk files to be able to calculate the percentage of population in given zip codes and cities covered by various wireless carriers. That required reading the docs for several different data sources, automating some downloads, building database tables to hold all of the information, and then carefully crafting some SQL to pull it all together.

Of course, government data sets generally won't require a whole lot of automation (they're updated yearly or less than yearly, not daily). To build your skills on that front, I'd recommend learning to extract data from various APIs, structure it in a meaningful way, and make it available in a database. For example, if you have a website, set up a free Google Analytics account for it, then build a daily ETL that extracts some meaningful information from the Google Analytics API and stuffs it in a Postgres DB. Then see if you can build some charts or something that sit on top of that database and report on the information.
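That daily extract-and-load shape is small enough to sketch; the snippet below assumes requests + psycopg2 and a placeholder JSON endpoint and table (the real Google Analytics API needs OAuth and its own client library, which I'm leaving out):

    import datetime

    import psycopg2
    import requests

    API_URL = "https://example.com/api/daily-metrics"   # placeholder endpoint
    DSN = "dbname=analytics user=etl"                    # placeholder connection string

    def extract(day):
        # Pull one day's worth of metrics as JSON (a real API will need auth).
        resp = requests.get(API_URL, params={"date": day.isoformat()})
        resp.raise_for_status()
        return resp.json()["rows"]

    def load(day, rows):
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            # Idempotent daily load: wipe and rewrite this day's partition.
            cur.execute("DELETE FROM daily_metrics WHERE day = %s", (day,))
            cur.executemany(
                "INSERT INTO daily_metrics (day, page, sessions) VALUES (%s, %s, %s)",
                [(day, r["page"], r["sessions"]) for r in rows],
            )

    if __name__ == "__main__":
        yesterday = datetime.date.today() - datetime.timedelta(days=1)
        load(yesterday, extract(yesterday))

Schedule that with cron (or, later, Airflow) and you have a working daily pipeline.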


Luigi is a great Python library for building multi-stage pipelines with parameterized tasks. It easily extends to new storage types and targets.

We currently use it to build a moderately complex product from dozens of data streams (files, APIs, and things in between) and millions of records. At its core are a DAG and a topological sort, which capture the essence of pipelines and execution.

A DAG is easy to visualize, even its development over time[1].

[1] https://imgur.com/a/RuC0B5Y (generated from the project's code, in this case https://github.com/miku/siskin)
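For a flavour of what a parameterized Luigi pipeline looks like, here is a toy two-task sketch (task names and paths are made up); each task declares its dependency and its output, and Luigi works out what still needs to run:

    import datetime

    import luigi

    class Download(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"data/raw/{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("id,value\n1,42\n")   # stand-in for a real fetch

    class Aggregate(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return Download(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"data/agg/{self.date}.txt")

        def run(self):
            with self.input().open() as inp, self.output().open("w") as out:
                rows = inp.read().splitlines()[1:]          # skip the header
                out.write(str(sum(float(r.split(",")[1]) for r in rows)))

    if __name__ == "__main__":
        luigi.build([Aggregate(date=datetime.date.today())], local_scheduler=True)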


There's a lot of NTSB data in this repository that would be good to practice ETL with:

https://github.com/rtidatascience/data-scientist-exercise02


Hi, data engineer here. Other comments have a lot of good suggestions. I especially agree with ideas like avoiding high-powered frameworks in the beginning and learning to write effective transform wrappers for different kinds of weird input data. One thing you'll find is that (for enterprise situations at least) your source data is going to come from extremely odd, old-fashioned, or very poorly documented sources. Be prepared to find twenty- and thirty-year-old manuals on formats you've never heard of at times.

That aside, the other side of the coin that is very important is to get very familiar with how folks talk about their data problems. Managers, analysts, etc. will often request a specific solution that they've heard of or that seems popular--sometimes it is what they need, often it isn't. To get a solid footing in this space (not that it isn't important for all types of SE), it is critical to have a very strong understanding of business requirements, to understand the work of many other roles in your org, and to be able to communicate with business folks in ways that allow you to develop a rational plan for a solution while getting them to realize what their needs really are.

Best of luck!


Thanks for your insight!


I worked in that field for a number of years. My recommendation to you is to start with some form of data that you are passionate about: baseball statistics, business metrics, investment figures, whatever.

Once you have the data, then figure what you're going to do with it. (Don't agonize over it, this should all take just a day or so.)

Then go after the toolkit. You'll find many interesting questions if you start with the end goal in mind.

Good luck, and have fun!


This is the best advice I've seen here. The purpose of data engineering is to support a much larger effort related to scientific discovery or increasing bottom line revenue. Don't start with the language, tooling, libs and architecture, start with a purpose and goal. Most goals relate to mimicking a portion of human cognition in terms of pattern matching.

If you're building a house you don't start by picking out hammers and saws.

As a data engineer, this takes being passionate about your data and its pre- and postprocessing.


Quick question: is there a clearinghouse for this data? I've been wanting to get at football, hockey, or soccer data, but finding a data source has been daunting.

I could scrape pages, but don't really feel that is the best way to go about it.


Thanks for the advice!


I would recommend looking into Python-based workflow managers: Luigi[0], then Airflow[1], to get the hang of scheduling, DAGs, etc. Both are fairly simple to get started with (especially for a seasoned Python developer) and are used in production environments as well.

[0] https://github.com/spotify/luigi

[1] https://github.com/apache/incubator-airflow
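For reference, the smallest useful Airflow DAG is just a couple of Python callables wired into a schedule. A toy sketch (import paths and operator arguments differ a bit between Airflow versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract():
        print("pull data from a source")        # stand-in for real extraction code

    def load():
        print("write data to a destination")    # stand-in for real load code

    dag = DAG(
        dag_id="toy_etl",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

    extract_task >> load_task   # run load only after extract succeeds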


Only just look - don't actually try to use one of these on a side project unless you want to waste a LOT of time. They're amazing for large projects moving 100GB+ a day and running dozens or hundreds of different jobs to orchestrate a moving system, but are a total time sink for anything small.


100 GB per day is pretty small, but your point holds


http://Singer.io is an open source ETL project written in Python. The components are small, composable programs that you can run independently, so you should be able to walk before you run.

A good beginner project is to build an adapter to a new data source (known as a "tap" in Singer). Most taps pull data out of business tools like Salesforce or Marketo, but people have also built them to pull characters from the Marvel API (https://www.stitchdata.com/blog/tapping-marvel-api/).

Check out the getting started guide (https://github.com/singer-io/getting-started), or jump into the Singer Slack and ask for help (linked from the guide).
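To give a sense of scale, a toy tap is only a few lines with the singer-python library; it just writes SCHEMA and RECORD messages to stdout for a target to consume (the stream and field names here are made up):

    import datetime

    import singer

    schema = {
        "properties": {
            "id": {"type": "integer"},
            "fetched_at": {"type": "string", "format": "date-time"},
        }
    }

    # Describe the stream, then emit records; whatever target you pipe this
    # into (target-csv, target-postgres, ...) decides where the rows land.
    singer.write_schema("my_stream", schema, ["id"])
    singer.write_records(
        "my_stream",
        [{"id": 1, "fetched_at": datetime.datetime.utcnow().isoformat()}],
    )

Running "python tap_example.py | target-csv" gives you a complete, if tiny, extract-and-load pipeline.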


Fantastic reference! Thanks for sharing. As an ETL specialist, I've looked for a composable solution but only found Luigi. I wanted something simpler and more integrated, like IFTTT. This has the feel of Yahoo Pipes. I think Singer is the answer. ;-)


Composableanalytics.com might be the tool you are looking for then....


And after you use Singer to ingest data, add in dbt [0] to transform the raw data into something more useful - I cannot speak highly enough of this library and how much work it will save you.

Also keep an eye on Gitlab's new Meltano [1] project which IMHO has huge potential.

[0] https://www.getdbt.com/

[1] https://gitlab.com/meltano/meltano/tree/master


Thanks for the shout-out on Meltano! And we love dbt! Great project and great people behind the code. Their next release is going to be awesome - the new docs feature is beautiful.


An Apache Solr (Search Engine) Target could be quite nice here.


I will look into this, thanks for sharing!


I wrote a simple blog post recently with links to data engineering resources I found useful to get into the field, hopefully it is helpful: https://diogoalexandrefranco.github.io/data-engineering-reso...


Brilliant post; currently looking at the SQL section of LeetCode. Thanks a lot, my friend. Very much appreciated.


If you want to make it as a career choice, I think you should start with learning Java and Scala. For better or worse, this field is tied to the JVM, and learning these languages will make picking up Spark and Hadoop (which, tbh, are a prerequisite to have on your resume for any data engineering position) a lot easier.

Also if you are looking to stay in the Python world, PySpark is pretty intuitive for any Python developer and tons of companies are using it.
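PySpark code really does read like ordinary Python; a toy extract-transform-load might look like this (the file names and columns are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("toy-etl").getOrCreate()

    # Extract: read a CSV into a DataFrame.
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

    # Transform: revenue per customer, completed orders only.
    revenue = (
        orders
        .filter(F.col("status") == "completed")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("revenue"))
    )

    # Load: write the result out as Parquet.
    revenue.write.mode("overwrite").parquet("revenue_by_customer")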


Another commenter mentioned it as well, but "Designing Data-Intensive Applications" by Martin Kleppmann https://dataintensive.net/ is a _fantastic_ overview of the field and, I think, more approachable and enjoyable to read than Kimball's book. But Kimball is a classic, especially for how to do warehouse design.

I'll also make a plug for the Meltano[0] project that my colleagues are working on. The idea is to have a simple tool for extraction, loading, transformation, and analysis from common business operations sources (Salesforce, Zendesk, Netsuite, etc.). It's all open source and we're tackling many of the problems you're interested in. Definitely poke around the codebase and feel free to ping me or make an issue / ask questions.

[0] https://gitlab.com/meltano/meltano/


Software-wise, you could also have a look at Dask, which is more lightweight than Hadoop and Spark.

But since you are asking for a project, why not do something local? E.g., you could scrape some data over time (cinemas, crime) and structure it nicely. After a few months you can start the analysis. Bonus points if you make your data available.


This list has numerous pointers to resources related to data engineering: https://github.com/igorbarinov/awesome-data-engineering


I've always found enjoyment in finding statistics in things that interested me, so here are a few thoughts on project ideas. The main thing is to find a data set that interests you, and use that interest for fun or for profit:

- Pull down your Facebook backup, run it through sentiment analysis, throw it in a DB, and explore yourself through Facebook's eyes!

- A while back I was looking to buy a car and used Scrapy to process cars.com, then ran the VINs through a lookup tool to find cars that were actually manuals (and not just manumatics; it seems few can tell the difference these days). Found reasonable national prices, average miles, etc.

- Interested in politics? Pull down the data sets for voting records, explore your local politicians' voting records, compare them to national averages or historical information, etc.

- Interested in movies? Find movie datasets (or scrape imdb.com, themoviedb.org, etc.) and find which genres pay the least per actor, have the smallest cast, etc.

Lots of datasets are available online, if not in machine readable format, then in a format that can be easily scraped. Have fun!


Do data engineering jobs require you to have knowledge of statistics, machine learning, etc.? Or can you get data engineering jobs where you are focused on connecting pipelines and data flow? I am interested in a data engineering job, but it appears that a lot of data engineering job listings require machine learning expertise. I am interested in a purely data-engineering-focused role.


Look for functional programming data engineering jobs, e.g. Haskell; they might be what you're looking for, since most people don't run their algorithms in Haskell (some do!) and instead use it as duct tape / piping.


I have put together a (very short) post on building a dashboard using Shopify data and pulling the ETL with Stitch. You could try to implement the Stitch part in Python and would then have a complete solution. Looking at Stitch/Blendo will give you some ideas of simple ETL workflows. Keep in mind that ETL changes depending on what you want to do. In theory you can do all ETL with just Python code. If you have an SQL-compliant database that can hold your data, the ETL process could simply be a matter of running SQL queries to transform data within the database. Then you basically load your data from any data source into your DB and run a bunch of SQL to do the transform part.

https://assemblinganalytics.com/post/airbnbapache-superset-a...
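A tiny sketch of that load-then-transform-in-SQL pattern, using sqlite3 and made-up file and table names:

    import csv
    import sqlite3

    conn = sqlite3.connect("shop.db")

    # Load: copy raw rows from a CSV straight into a staging table.
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (customer TEXT, amount REAL, status TEXT)")
    with open("orders.csv", newline="") as f:
        rows = [(r["customer"], float(r["amount"]), r["status"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: the business logic is plain SQL, run inside the database.
    conn.executescript("""
        DROP TABLE IF EXISTS revenue_by_customer;
        CREATE TABLE revenue_by_customer AS
        SELECT customer, SUM(amount) AS revenue
        FROM raw_orders
        WHERE status = 'completed'
        GROUP BY customer;
    """)
    conn.commit()
    conn.close()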


Like others, I would say start with a problem that interests you.

To mimic the real world, try to have extraction from a variety of data sources (RDBMS, NoSQL, files in different formats such as CSV/JSON, APIs like Salesforce). Once you have different sources, extract the data into your data lake built on S3/GCS/HDFS. Once the data is present, you need to integrate with tools that can extract value. You can use vendor-specific tools like Athena/BigQuery or open source ones like Presto/Impala/Hive. You can do analytics that requires filtering, cleansing, and joining various datasets. You can also look at storing the results in different formats so that other tools like Tableau can use them. To orchestrate all of this, you can use Azkaban or Airflow.

My suggestion is slightly biased towards the Hadoop ecosystem, but the good thing is that most tools here have open source alternatives.


Many posts here are focused on classic ETL. I'm working on a small project for handling data just after ETL.

It’s for dealing with annoyingly large data, bigger than RAM but sitting on a personal PC. It basically performs sampling and munging for this data. There’s no good solution for this right now (I know because I've been looking for more than a year).

What might be interesting to you is that there's little abstraction in the project, but it's non-trivial to execute. To me, this makes it fun. Despite its simplicity, it has high utility and could be used by others. This would be a great outcome for an initial project.

I've got a working version of it, but it would benefit from the eye of a seasoned python dev.

Maybe it would be interesting to you to get in touch? My email is: mccain.alex@yandex.com

Cheers,


I think you can easily learn to walk with Spark as well. There are a lot of beginner Spark tutorials online. You can see, for instance, https://community.cloud.databricks.com. They give you a Spark cluster where you can start, and there are several tutorials in a notebook style. Check it out. Of course, you could start your ETL / data engineering in a more "traditional" way by trying to learn about relational databases and the like. But I would suggest you start directly with Spark. You can use it with lots of other big data tools (such as Hadoop/Hive and also S3), and you could also find some interesting machine learning use cases.


Bookmarked, thanks!


I think basic concepts of data warehousing, like creating data marts, building star or snowflake schemas, dimensional modelling, and slowly changing dimensions, are quite important before you jump into why Hadoop, Hive, HBase, or Spark is relevant.


It may not be exactly what you're looking for, but perhaps to get started without having to set up all the infrastructure, you could use something like AWS Glue. There are some tutorials / examples in the official AWS docs: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programm...

Disclosure: I work on AWS Glue. Note: We're Hiring! Email me at: zaheeram <at> amazon dot com if you're interested in this space!


Build a web scraper and save some raw data somewhere like S3. Then run a job on top of that data to compute aggregated measures and save them somewhere. I built a project[1] like this and learned a lot in the process. I used Airflow to run the scraping tasks, saved the data in S3, used AWS Athena to run queries, and loaded the data into Redshift. I did all of this just to learn more about Airflow and some AWS tools.

[1] https://skills.technology/data-engineer
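The scrape-and-land step can be as small as the sketch below (the bucket, key layout, and URL are placeholders); Athena/Redshift then sit on top of whatever lands in the bucket:

    import datetime

    import boto3
    import requests

    BUCKET = "my-raw-data"   # placeholder bucket name

    def scrape(url):
        resp = requests.get(url)
        resp.raise_for_status()
        return resp.text

    def land_in_s3(html, source_name):
        # Partition raw files by date so downstream jobs (Athena, Spark, ...)
        # can pick up just the slice they need.
        key = f"raw/{source_name}/{datetime.date.today().isoformat()}/page.html"
        boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=html.encode("utf-8"))

    if __name__ == "__main__":
        land_in_s3(scrape("https://example.com"), "example")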


Spark is not too tricky to dive into, even though you can't really take full advantage of it unless you have a big cluster to use :)

If you want to practice data manipulation, and a lot of the map-reduce-type stuff you can do with Spark, I find pandas useful for small datasets (and there's a lot of overlap in functionality as far as DataFrames are concerned).

For pipeline stuff, definitely take a look at Luigi, but again, without a cluster it'll be less fun. Still, if you try automating tasks with a mini Luigi scheduler on localhost, it will be good practice.


I find "ETL Best Practices with Airflow" a good start. Even if you don't go the Airflow route, you can benefit from the example implementations of Kimball and Data Vault ETLs.

https://gtoonstra.github.io/etl-with-airflow/index.html

https://github.com/gtoonstra/etl-with-airflow


Google Cloud Composer, built on top of Airflow (mentioned by a number of people here) has a great Getting Started Guide to check out: https://cloud.google.com/composer/docs/quickstart

You can dive right in to Airflow here: https://airflow.apache.org/tutorial.html


What are the industries that have the highest costs for ETL/Data Engineering?

What companies in these industries are interested in reducing their costs for this work?

"costs" as used above includes time expenditures as well as spending money

http://web.archive.org/web/19991023120316/http://www.dbmsmag...


Hey Zabana!

I noticed you were in the Europe area. We are looking for developers with an interest in going into the field.

We provide a data pipelining/analytics platform for complex data to our customers.

If you would like to know more about our company, the position is in Stockholm: https://gist.github.com/eleijonmarck/1b384480aaa3d22ab7e6ea0...


You might want to take a look at lnav (https://lnav.org), a sort of mini-ETL CLI power tool. Performance is fine up to a few million rows.

Edit: of course this comment (like many others below) pertains to tooling vs a particular project per se. FWIW I agree w/ others' sentiment about doing things "by hand" and working with data that holds your interest.


While still alpha right now, Meltano is a project from GitLab. We are looking for contributors who have a passion for bringing the excellent ideas of software development to the data science world. https://gitlab.com/meltano/meltano/. Feel free to post some issues and give it a try.


An excellent survey of the field: https://dataintensive.net/


I cannot recommend this book enough! I thoroughly enjoyed it and really appreciated the clear definitions and strong opinions.


Agreed. This book will give you a fantastic way to think about all of this.


You could check out https://github.com/mara/mara-example-project

The project is just a demonstrator for the Mara framework, but it gives a good overview, and you could take it as a starter for something you want to build.


If you have any interest in baseball, https://www.retrosheet.org/ might be of interest. They have play-by-play accounts for every Major League Baseball game going back to 1914.

This strikes me as a good testbed for personal projects, because:

- it's enough data to be inconvenient, but can still definitely fit on a single machine

- it comes in a weird format that is sort of a pain to process, which is good practice

- just loading each event into a database won't be enough, you'll have to transform and reorganize it further to support answering interesting questions

- last I checked, there are undocumented but public APIs for more recent MLB games (on some MLB website), so automatically scraping those and incorporating them into the retrosheet data is another interesting challenge.


OP here. Thanks for all the great links, suggestions, and ideas. One question I forgot to ask is the following: what exactly does Transform mean? It seems to me that it's a very fluid concept and can change depending on the project. I'd love to know your thoughts on this. Cheers.


I think it's as vague as it sounds. But the stereotype is probably something like translating rows to columns, normalizing a column, or creating a composite column.
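For instance, with pandas and some made-up toy data, those three stereotypes look like:

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Oslo", "Oslo", "Bergen"],
        "metric": ["temp", "rain", "temp"],
        "value": [12.0, 3.5, 9.0],
    })

    # Rows to columns: one column per metric.
    wide = df.pivot(index="city", columns="metric", values="value").reset_index()

    # Normalize a column: scale temperatures to the 0-1 range.
    wide["temp_norm"] = (wide["temp"] - wide["temp"].min()) / (wide["temp"].max() - wide["temp"].min())

    # Composite column: combine two fields into one.
    wide["summary"] = wide["city"] + " / " + wide["temp"].astype(str) + "C"

    print(wide)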


Hi, I have been working in this field (and some of the previous incarnations of it) for quite a while. I would be happy to have a discussion to give some pointers and relay some experience if you want.

My email and keybase.io is in my profile so feel free to get in touch.


Luigi. We literally use this for anything requiring a series of more than a few steps:

https://github.com/spotify/luigi


SQL. Most of my ETL jobs consist of setting up the overhead pieces in PySpark, then using SQL for the most important logic. It's way more portable that way.
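A minimal sketch of that split, with PySpark handling the plumbing and SQL carrying the logic (the table and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-etl").getOrCreate()

    # Overhead pieces in PySpark: read the source and register it as a view.
    spark.read.parquet("orders").createOrReplaceTempView("orders")

    # The important logic stays in portable SQL.
    revenue = spark.sql("""
        SELECT customer_id, SUM(amount) AS revenue
        FROM orders
        WHERE status = 'completed'
        GROUP BY customer_id
    """)

    revenue.write.mode("overwrite").parquet("revenue_by_customer")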


What I like about using SQL is that I can prototype with SQLite, Postgres, etc., and then adapt and move to Hive, Impala, BigQuery, etc. SQL is literally everywhere, and I don't have to learn new paradigms and frameworks.


Web scraping is a good project. LinuxAcademy has some good content on AWS and Hadoop ($50/month).


Convert .mat files to JSON, CSV, and XLS formats.
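A minimal sketch with scipy + pandas (the file and variable names are made up, and it assumes the .mat variable is a 2-D numeric array):

    import pandas as pd
    from scipy.io import loadmat

    # loadmat returns a dict mapping MATLAB variable names to numpy arrays.
    mat = loadmat("measurements.mat")
    table = pd.DataFrame(mat["readings"])

    table.to_csv("readings.csv", index=False)
    table.to_json("readings.json", orient="records")
    table.to_excel("readings.xlsx", index=False)   # needs an Excel writer such as openpyxl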




