Our company (EverythingMe) had a very similar journey to the one Samson describes (Splunk, MySQL, logs, ... -> Redshift). At the time Periscope wasn't there (or we didn't find it), so we built re:dash instead.
We've been using the python celery task queue library to coordinate python ETL-esque jobs, but as our setup gets more heterogeneous, both language- and job-wise, we're looking for something built more explicitly for the purpose. Luigi and Airflow are both contenders.
Data engineering is moving in a direction where pipelines are generated dynamically. "Analysis automation", "analytics as a service" and "ETL frameworks" all require dynamic pipeline generation. Providing services around aggregation, A/B testing and experimentation, anomaly detection and cohort analysis requires metadata-driven pipeline generation.
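To make "metadata-driven pipeline generation" concrete, here's a minimal, hypothetical sketch in plain Python: the experiment specs and task names are invented, and in a real setup the loop body would instantiate e.g. Airflow operators from a config store instead of building plain dicts.

```python
# Metadata describing the experiments we want pipelines for; in practice
# this would come from a config store or database, not a literal.
EXPERIMENTS = [
    {"name": "signup_button_color", "metric": "conversion"},
    {"name": "onboarding_flow", "metric": "retention"},
]

def build_pipeline(spec):
    """Generate the task list for one experiment from its metadata."""
    name = spec["name"]
    return [
        {"task": f"extract_{name}", "upstream": []},
        {"task": f"aggregate_{name}", "upstream": [f"extract_{name}"]},
        {"task": f"test_{name}_{spec['metric']}",
         "upstream": [f"aggregate_{name}"]},
    ]

# One pipeline per experiment, generated rather than hand-written --
# adding an experiment to the metadata adds a pipeline for free.
pipelines = {spec["name"]: build_pipeline(spec) for spec in EXPERIMENTS}
```

The point is that the pipeline shape lives in code once, and the metadata fans it out.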
Airflow is also more state-aware: job dependencies are tracked for every run independently and stored in the Airflow metadata database. Because Airflow manages this metadata, it's much easier to, say, rerun a task and every task downstream of it for a date range. You can perform very precise surgery on false positives/false negatives and easily rerun subsections of a workflow in time.
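A rough sketch of what that "rerun a task and everything downstream for a date range" surgery computes (task names invented; the real operation is Airflow's clear/backfill machinery working against its metadata DB):

```python
from itertools import product

# Hypothetical dependency graph: task -> tasks that consume its output.
DOWNSTREAM = {
    "extract": ["clean"],
    "clean": ["aggregate", "report"],
    "aggregate": [],
    "report": [],
}

def downstream_closure(task, graph):
    """All tasks reachable from `task`, including itself."""
    seen, stack = set(), [task]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(graph.get(t, []))
    return seen

def runs_to_clear(task, dates, graph):
    """(task, date) pairs whose state should be reset before a rerun."""
    return set(product(downstream_closure(task, graph), dates))
```

For example, clearing `clean` for two dates resets `clean`, `aggregate` and `report` for both dates, but leaves `extract` untouched -- that per-run state is exactly what the metadata database makes cheap to manipulate.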
It's a bit of a relief to see this after wondering why I've been unable to find an analytics solution that felt complete.
Snowplow is shown on your pricing page. Doesn't Snowplow already connect directly with Redshift? Or was Fivetran built before Snowplow finished their new setup?
1. Run your own Snowplow collector that writes to your Redshift.
2. Use Snowplow Inc's hosted collector that writes into your Redshift.
3. Use Fivetran's Snowplow collector that writes to your Redshift.
We have customers running each of these configurations. In the case of #1 and #2, they're just using us to sync other data into the same Redshift cluster as their Snowplow collector.
I didn't have this to read several months ago, so I did end up writing my own ETL solution. It has been fun, though, and now we're into the analytics phase and using Chart.io - did you compare that to Periscope? It's a tad pricey, but the charts are pretty easy to generate using their query builder if you don't want to get dirty with SQL (or have team members who don't have SQL skills yet).
I do think that rolling your own ETL can be rewarding though - especially if you are wrapping each attempt at an ETL process in a class and storing long term data about the monthly/daily/hourly/irregular processes for internal analysis, forecasting, bug-reporting, and providing fodder for visuals to sell the rest of the team on what you're doing.
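A minimal sketch of what "wrapping each attempt at an ETL process in a class" might look like -- all names are hypothetical, and a real version would persist the run history to a database table rather than a class attribute:

```python
import time

class EtlRun:
    """Wrap one ETL attempt and record its outcome, so run history can
    feed internal analysis, forecasting and bug reports later."""

    history = []  # long-term store; in practice a DB table

    def __init__(self, name, schedule="daily"):
        self.name = name
        self.schedule = schedule  # monthly/daily/hourly/irregular

    def execute(self, step):
        """Run one callable step, recording timing and success/failure."""
        started = time.time()
        record = {"name": self.name, "schedule": self.schedule,
                  "ok": True, "error": None}
        try:
            step()
        except Exception as exc:
            record["ok"] = False
            record["error"] = repr(exc)
        record["duration"] = time.time() - started
        EtlRun.history.append(record)
        return record
```

Even this much gives you failure rates and duration trends per process, which is the "fodder for visuals" part.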
Even for technical people, having access to analytics is helpful. Being informed about product metrics helps developers see the ‘why’ behind their work — a key ingredient to high performance.
You mean that the technical people who actually implement the software are so alienated from their users that they have to wait for a royal decree to spur them to work?
I wonder how few of their ideas flow back up to the top.
I like the article; it's nice to see one use case of ETL. Too often people directly hit the production database to get reports instead of ETLing into a read-only db that won't affect site performance.
I'm using Django, but I imagine this kind of system would be mostly platform-agnostic.
I know a true ETL solution like this will be more powerful and flexible, but how do you decide when it's worth the investment? Are there key use-cases or reports that only this type of solution can provide? Is the main issue with external analytics (like Mixpanel) that you need to know what data to collect before you can report on it?
The second thing is that Mixpanel is write-once, read-forever, i.e. if you misspell something or change definitions you end up in a bad place. These ETL processes are much more capable of adapting and normalizing a schema over time.
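A tiny sketch of the kind of schema normalization an ETL step can do (and a write-once store can't do retroactively) -- the field names and renames here are made up for illustration:

```python
# Map old/misspelled field names to their corrected names, and supply
# defaults for fields added after the earliest events were recorded.
RENAMES = {"singup_completed": "signup_completed"}   # fix a misspelling
DEFAULTS = {"country": "unknown"}                    # backfill a later addition

def normalize(event):
    """Apply renames and defaults so old and new events share a schema."""
    fixed = {RENAMES.get(k, k): v for k, v in event.items()}
    for key, default in DEFAULTS.items():
        fixed.setdefault(key, default)
    return fixed
```

Run during every load, this lets the whole history converge on the current definitions instead of being frozen at whatever was first written.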
Did you ever look into the Splunk DB Connect app?
It allows you to use a structured DB (such as MySQL and others) as a data source for Splunk. I haven't used it yet but I'm interested in feedback.
Great, detailed article though.
Back when Redshift was on its initial beta release (2012), there were hardly any ETL / charting tools available for us, so we ended up building most of the tools and libraries ourselves. A few of them were open source too:
- https://github.com/zalora/redsift A web-interface SQL tool for Redshift, letting the user query and export results to S3 (with an email alert to the user once done)
- https://github.com/zalora/postgresql-user-manager for managing user privileges
- https://github.com/lenguyenthedat/aws-redshift-to-rds for copying tables from Redshift to RDS (PostgreSQL)
- https://github.com/zalora/kraken a bit similar to Luigi or Airflow.
Things have changed quite a lot since then; there are now a lot of great solutions to our problems that are either free or very cheap, and production-ready:
- redash.io (https://github.com/EverythingMe/redash) - web-interface SQL tool with visualization - FREE
- redshift_console (https://github.com/EverythingMe/redshift_console) - redshift ops tool - FREE
- flydata (https://www.flydata.com/resources/flydata-sync/sync-rds-mysq...) sync live data from MySQL to Redshift - Subscription-based
- dreamfactory (https://aws.amazon.com/marketplace/pp/B00GXYDK18?sr=0-3&qid=...) for providing REST API interface to your database (supports redshift and other databases) - FREE
We didn't use Tableau at Zalora either (due to pricing and the number of users that we have in-house), and ended up building our own customized data dashboards with d3js and a few other frameworks.
However, as long as you are OK with the price, Tableau is pretty good. It's being used widely at my current company (http://commercialize.tv):
- You can minimize processing on the Tableau server by just creating another data mart layer from Redshift with your ETL tools / scripts and having Tableau connect directly to it.
- Creating visualizations / charts is pretty straightforward, and the end result looks exceptional compared to the other solutions we have tried.
- They also have a really good and active community.
I'm not sure how well they're doing business-wise. I did notice that they recently pivoted one aspect of their business. They used to facilitate users selling prints of their works; now they've dropped that service and are building a stock photo service instead.
My relative posts a photo almost every day and is always trying to figure out how to get a higher score: different photo subjects, posting at different times of the day, etc.
Looker (http://looker.com) is an alternative to Periscope that a ton of venture-backed tech companies are using (mostly in conjunction with Redshift). Here are two posts from Buffer and SeatGeek on their stacks using Redshift, Luigi and Looker. Buffer: https://overflow.bufferapp.com/2014/10/31/buffers-new-data-a.... SeatGeek: http://chairnerd.seatgeek.com/building-out-the-seatgeek-data...
Looker doesn't require every member of the team to know SQL to explore the data or create reports. It also has a text-based modeling language that is a thin abstraction layer over SQL. It makes SQL modular and reusable, making it far more efficient to support a wide range of analysis. It's more expensive than Periscope, but it's way more powerful.
I run marketing for Periscope. Like Chartio, you guys have a great product. We routinely send people your way when they have more business user needs and aren’t a good fit for Periscope.
For those evaluating tools like Periscope/Looker/Chartio:
Whereas Looker focuses on business users, we’ve built Periscope around serving SQL analysts better than anyone else.
If you don't have someone who knows SQL on your team, Looker (or Chartio) is probably a better fit. (though I should mention that some customers have learned SQL through our customer support :)
We’re focused on the SQL analysts who are building dashboards and running reports, day in, day out. With that said, every one of our customers has employees who don’t know SQL, but need to dive into their data. So we’ve built a ton of features to make this easy as well.
If you want to find out more, visit https://www.periscope.io. If Periscope looks like a good fit, sign up and we'll get you set up with a free trial.
As I understand it, Periscope is cloud based, so could you elaborate on how it might plug into large locally hosted databases, or is the caveat that my databases must also be in the cloud?
There needs to be a route from the internet to your DB, yes. Sometimes that works because the DB is in the cloud; other times a port is opened or an SSH tunnel is set up into the local network.
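For the tunnel case, one common pattern is a reverse SSH tunnel run from inside the local network, exposing the internal DB on a publicly reachable relay host. Here's a sketch that just composes such an `ssh` command (all hostnames and ports are hypothetical; you'd run the printed command, not this script, on the machine inside your network):

```python
import shlex

def reverse_tunnel_command(relay, db_host, db_port, exposed_port):
    """Build an ssh invocation that reverse-forwards `exposed_port` on
    the relay host back to the internal database.
    -N: no remote shell, just forwarding; -R: remote (reverse) forward."""
    return ["ssh", "-N", "-R",
            f"{exposed_port}:{db_host}:{db_port}", relay]

cmd = reverse_tunnel_command("relay.example.com", "db.internal", 5432, 5439)
print(shlex.join(cmd))
# ssh -N -R 5439:db.internal:5432 relay.example.com
```

The analytics service then connects to `relay.example.com:5439` as if it were the database itself.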