
How we run Spark and Sqoop in production - natekupp
https://www.thumbtack.com/engineering/building-thumbtacks-data-infrastructure-part-ll/
======
ktamura
Any good alternatives for Sqoop? I feel that an ETL tool _just_ for HDFS is
too limiting and leads to further fragmentation on the data pipeline.

~~~
rathboma
Using Sqoop from something like Luigi as the ETL manager is a pretty great
workflow -
[https://github.com/spotify/luigi](https://github.com/spotify/luigi)

You can define dependencies between jobs based on their output files, which
lets you re-run only part of your pipeline.
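
To make the output-file idea concrete, here is a minimal stdlib-only sketch. It is not Luigi's actual API: the `Task` base class, the `build` helper, and the `Extract` / `Summarize` tasks are hypothetical names made up for illustration. The only assumption is that a task counts as complete when its output file exists.

```python
import os
import tempfile


class Task:
    """Hypothetical stand-in for a workflow task (not Luigi's real base class)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks that must finish first

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file already exists.
        return os.path.exists(self.output())


class Extract(Task):
    def output(self):
        return os.path.join(self.workdir, "extract.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("1\n2\n3\n")


class Summarize(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "summary.txt")

    def run(self):
        with open(Extract(self.workdir).output()) as src:
            total = sum(int(line) for line in src)
        with open(self.output(), "w") as dst:
            dst.write(str(total))


def build(task, ran=None):
    """Run dependencies first, skipping any task whose output already exists."""
    ran = [] if ran is None else ran
    if task.complete():
        return ran
    for dep in task.requires():
        build(dep, ran)
    task.run()
    ran.append(type(task).__name__)
    return ran


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(build(Summarize(workdir)))  # first run: both tasks execute
    os.remove(os.path.join(workdir, "summary.txt"))
    print(build(Summarize(workdir)))  # only the task with the missing output re-runs
```

Deleting one output file and building again re-runs just that step, which is the "re-run only part of your pipeline" behavior described above.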

~~~
machbio
That's a great idea, but could you elaborate on how jobs are scheduled with
Luigi? It does not have a scheduler like Airflow, so how do you schedule Luigi
tasks?

~~~
rathboma
Check out this Foursquare talk that goes through how we used to do scheduling
-- basically you make jobs dependent on a date -
[http://www.slideshare.net/OpenAnayticsMeetup/luigi-presentat...](http://www.slideshare.net/OpenAnayticsMeetup/luigi-presentation-17-23199897)
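
One way to read "dependent on a date" is sketched below, stdlib only. This is not taken from the talk. The `output_path`, `run_for_day`, and `catch_up` names and the three-day lookback are assumptions for illustration, as is the idea that something external (cron, for example) calls the driver periodically. Each day gets its own output file, so the driver only runs days whose output is missing.

```python
import datetime
import os
import tempfile


def output_path(workdir, day):
    # One output file per calendar day, e.g. report-2016-04-20.txt
    return os.path.join(workdir, "report-%s.txt" % day.isoformat())


def run_for_day(workdir, day):
    # Placeholder job body; a real job would launch Sqoop or Spark here.
    with open(output_path(workdir, day), "w") as f:
        f.write("report for %s\n" % day.isoformat())


def catch_up(workdir, today, lookback_days=3):
    """Run the job for each recent day whose output file is missing.

    Meant to be called repeatedly by an external trigger (assumed: cron).
    Returns the list of days that were actually run, oldest first.
    """
    ran = []
    for offset in range(lookback_days - 1, -1, -1):
        day = today - datetime.timedelta(days=offset)
        if not os.path.exists(output_path(workdir, day)):
            run_for_day(workdir, day)
            ran.append(day)
    return ran


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(catch_up(workdir, datetime.date.today()))  # runs the last three days
    print(catch_up(workdir, datetime.date.today()))  # nothing left to do
```

Because completed days are skipped, calling the driver more often than needed is harmless, and a missed day is picked up on the next call as long as it falls inside the lookback window.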

------
sciurus
Their earlier post about rebuilding their data infrastructure is more
interesting imho:
[https://news.ycombinator.com/item?id=11474284](https://news.ycombinator.com/item?id=11474284)

------
natekupp
Hey all, feel free to reach out to me either on this thread or directly at
nate[at]thumbtack.com if I can answer any questions!

~~~
rahij
I had to connect to a US VPN to access the jobs page. Is that intentional?

~~~
natekupp
thanks for flagging, I'll look into it!

------
lazywizard
Thanks for sharing all the scripts. Great help.

------
lobster_johnson
Speaking of Spark, has anyone used it with Go? Is such a thing even possible?

------
mcrad
Thumbtack is great for lazy consumers. But word on the street is they put too
much price pressure on the market and therefore drive the overall quality of
services down. Good work Thumbtack! You have figured out a convenient way to
sacrifice long-term value in favor of short-term profit.

