Show HN: Prophecy.io – Cloud Native Data Engineering (medium.com)
64 points by ibains 58 days ago | 22 comments

Hi Everyone! Super excited for our first release!!

I was product manager for Apache Hive at Hortonworks and, before that, an early engineer on CUDA at NVIDIA, focusing on compiler optimizations.

As product manager, I saw so many customers struggle with Hive/Hadoop. So I decided to build a product-first company - one that just works to solve real (and sometimes unsexy) pain points of customers.

We want to support the entire Enterprise journey to Open Source (Apache Spark) and then to the cloud (Kubernetes).

What I'm personally excited about is that with some compiler magic - I can make code and visual interfaces both work together - making all developers happy!

I'd love to hear what you think, what you wish we'd build, and I'll be here to answer any questions!

This looks very compelling. I'd love to read more about it. Having suffered in various visual ETL trenches, I've pretty much walked away from them as much as possible. When can we see the source?

I'd love to understand what your preferred interface looks like, and get your input in building the interface. Please send me a mail at raj.bains@prophecy.io

Very cool that you can go between drag-n-drop and code for development. How is Prophecy different from datacoral.com or Astronomer.io? Will it be open sourced (e.g., like dbt, Airflow, Dagster)?

Foremost, we're focused on Transforms - helping users create and manage complex ETL transforms - this is your business logic.

One is the category of schedulers - airflow, astronomer - these are focused on scheduling these Transforms, so these are very different (we integrate well with airflow).

Then there is the NoCode category of transform services on cloud - these focus on simple movement of data - datacoral, segment, fivetran and all might fit here.

We're quite different from both!

Out of personal (and painful) experience, using Spark for general-purpose ETL processes is a bad idea. Spark is meant to be used in highly-distributed systems and with tables that have like trillions of rows. RDDs use an optimized distributed programming model that takes a lot of practice and getting used to. Some operations are virtually impossible due to the fact that executors run on separate contexts. Caveat emptor.

What issues did you encounter? We've been using Spark since 1.0, and the dataframes API and Catalyst query engine have abstracted away RDDs nicely, we rarely have to use them. We recently converted a bunch of legacy ETL jobs that ran against the same dataset in Vertica to Spark in EMR by storing the raw data as Parquet, caching it in memory, and running the 30 or so different SQL queries against it using Spark SQL.

Only issue we encountered was one query used three unions which was rather inefficient, but once that was replaced with a query using grouping properly, problem solved.
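The original queries aren't shown, so the following is only a sketch of the general shape of that rewrite: several UNION ALL branches over the same table collapsed into one grouped query. It uses Python's built-in sqlite3 rather than Spark SQL so it runs anywhere, and the `events` table and its columns are made up for illustration.

```python
import sqlite3

# Hypothetical data; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("us", "paid", 10.0),
        ("us", "refunded", 4.0),
        ("eu", "paid", 7.0),
        ("eu", "pending", 3.0),
        ("us", "paid", 5.0),
    ],
)

# Union form: each branch filters and aggregates the same table separately.
union_sql = """
SELECT 'paid' AS status, SUM(amount) AS total FROM events WHERE status = 'paid'
UNION ALL
SELECT 'refunded', SUM(amount) FROM events WHERE status = 'refunded'
UNION ALL
SELECT 'pending', SUM(amount) FROM events WHERE status = 'pending'
"""

# Grouped form: one pass over the table, one output row per status.
grouped_sql = "SELECT status, SUM(amount) AS total FROM events GROUP BY status"

union_rows = sorted(conn.execute(union_sql).fetchall())
grouped_rows = sorted(conn.execute(grouped_sql).fetchall())

# Both forms agree on this data set.
print(union_rows == grouped_rows)
```

One difference to keep in mind: a union branch with no matching rows still emits a row with a NULL total, while the grouped form emits nothing for that status.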

First, like I mentioned, Spark uses separate contexts for all executors (by design). This means that running nested RDD mappings is a nightmare. Just Google "nested for each loop Spark" or something along those lines[1]. This is an extremely common paradigm, especially when dealing with more complex transforms (in my specific case, we needed to hit an API, and then hit another API based on results of the previous API).

Second, Spark is not re-entrant. This means that to really get usefulness out of Spark's massive parallelization, you need to use clever and non-obvious tricks to do something that "seems" simple[2]. In some extreme cases, the RDD needs to be fully serialized (danger zone if it's a big one).

Finally, and related to the two above, Spark is just simply strange. You really need to have a "parallel" mindset when writing code on a Spark cluster. I constantly had to look stuff up and documentation was spotty at best. Once you throw in the rest of the ensemble (Pandas, Numpy, etc.) you end up with a very domain-specific codebase. In reality, most companies will have datasets that could be ETL'd on my iPhone and a Spark cluster is just overkill.

[1] https://stackoverflow.com/questions/34339300/nesting-paralle...

[2] https://stackoverflow.com/questions/32619570/can-i-use-spark...
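For readers unfamiliar with the chained-API case described above, the usual way around nested parallel operations is to flatten them into sequential stages: run the first call for every input, flatten the results, then run the second call over the flattened list. Below is a minimal sketch of that shape in plain Python with a thread pool, not Spark; `fetch_orders` and `fetch_order_detail` are hypothetical stand-ins for the two APIs. In Spark terms this roughly corresponds to chaining `map` and `flatMap` on one RDD instead of touching a second RDD inside a closure.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_orders(customer_id):
    """Hypothetical first API: returns the order ids for one customer."""
    return [customer_id * 10 + i for i in range(2)]


def fetch_order_detail(order_id):
    """Hypothetical second API: returns details for one order id."""
    return {"order_id": order_id, "total": order_id * 1.5}


def run_pipeline(customer_ids, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1: call the first API once per customer, in parallel.
        orders_per_customer = list(pool.map(fetch_orders, customer_ids))
        # Flatten the nested results so stage 2 sees a single flat list.
        order_ids = [oid for orders in orders_per_customer for oid in orders]
        # Stage 2: call the second API once per order, in parallel.
        return list(pool.map(fetch_order_detail, order_ids))


details = run_pipeline([1, 2, 3])
print(len(details))
```

The point of the design is that no parallel operation is ever started from inside another one; each stage finishes before the next begins.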

The variation of complexity of transforms across Enterprises and teams within Enterprises is quite large.

Spark and Hive (that I was product manager for, and used by hundreds of Enterprises) are used quite heavily in data engineering. Many transforms fit well into relational/set based model.

In your particular case, perhaps due to the nature of your transforms, it might not have been the best fit.

Also, if your data is small, it's not always a great fit. However, in many large companies, if they have 500 transforms and 30 are small - there is simplicity in using a single technology.

Just to be clear, I'm not discounting the work you're doing with Prophecy. We were using Databricks, and I can definitely see how a platform like Prophecy is a step up. I have a personal rule that I will always prop up any budding startup or entrepreneur (the golden rule and all that).

So huge congrats on your release!

> In reality, most companies will have datasets that could be ETL'd on my iPhone and a Spark cluster is just overkill.

I like that you said this and I've seen a similar sentiment before. I was first exposed to it watching an ElixirConf keynote from José Valim[1] about the Flow framework they built in Elixir where he summarized it as "For between 40-80% of the jobs submitted to MapReduce systems, you'd be better off running them on a single machine." which referenced the paper Musketeer: all for one, one for all in data processing systems [2].

While I'm no Data Engineer myself, I do often wonder whether distributing the workload is always better. The anecdote above indicates that a powerful single multicore machine may be the right solution for many.

Now that isn't to discount what Prophecy is trying to do. My company just went through a huge re-platforming, moving from on-premises to the cloud, and it is not easy; any company trying to tackle that space is on the right track. But I just wonder if it's overkill for most use cases?

[1] https://www.youtube.com/watch?v=srtMWzyqdp8

[2] http://www.cs.utexas.edu/users/ncrooks/2015-eurosys-musketee...

This looks very similar to products like Apache Nifi, Apache CDAP and Google Data Fusion. When would I pick prophecy.io over the others?

We're focused on complex transforms for high volume & performance workloads - something you'd find in Enterprise core production ETL workflows.

We're a complete replacement for ETL products - you'd use us to replace an Ab Initio, Informatica, IBM Datastage - we'd move your workflows through automated conversion.

You can program using visual drag-and-drop or write pure Spark Scala/Java/Python code, they are equivalent in our system. You get configuration management, lineage, metadata management. We're not tied to any private/public cloud - so you can use us with any Apache Spark - every public/private cloud has it.

This is quite different from these products! Different focus and level of abstraction.

How does this compare to cask hydrator/cdap? What does the migration from R or python ETLs to this resemble?

We're not adding a layer on top of all big data platforms; instead, we work on Spark directly. This has advantages and disadvantages.

When you use our product, your code looks very similar to writing Spark code in IntelliJ. You get visual workflows along with it.

Our users are not having to learn or use another layer of abstractions that locks them into our API, and we avoid performance

Apart from interface, we automatically "convert" your legacy code into Spark & "migrate" these Spark workflows from one cluster to another. You can move workflows from private cloud to public cloud, or to another datacenter.

What about existing spark/scala workflows? Can they be imported?

I read all this technical jargon but have no idea what any of it actually does.

Well, the simplest way to put it is: you get a Data Engineering (ETL) product that'll move your 10s of thousands of ETL workflows from a legacy ETL product to Spark, give you a full-stack ETL product on Spark, and move you from on-premise to cloud.

Plus it has improvements along multiple dimensions such as interfaces - that developers care a lot about!

what is "legacy ETL" exactly? you have a transpiler but what code does it transpile? what if all my ETL is written in Haskell and SQL (it's not - I don't have any ETL processes - I'm just trying to figure out what you're transpiling)

These are the workflows written in legacy formats - Ab Initio, Informatica, IBM DataStage - this is a $10B+ industry, so there's lots of usage out there. They typically have graph formats and Domain Specific Languages to specify transformations. We reverse engineer them. Enterprises have 10s of thousands of these workflows and getting out of legacy lock-in is hard!


That is awesome. I could use your service. Will mail
