
Show HN: Prophecy.io – Cloud Native Data Engineering - ibains
https://medium.com/prophecy-io/introducing-prophecy-io-cloud-native-data-engineering-1b9247596030
======
ibains
Hi Everyone! Super excited for our first release!!

I was product manager for Apache Hive at Hortonworks and, before that, an early
engineer on CUDA at NVIDIA - focusing on compiler optimizations.

As product manager, I saw so many customers struggle with Hive/Hadoop. So I
decided to build a product-first company - one that just works to solve real
(and sometimes unsexy) customer pain points.

We want to support the entire enterprise journey to open source (Apache Spark)
and then to the cloud (Kubernetes).

What I'm personally excited about is that, with some compiler magic, I can make
code and visual interfaces work together - making all developers happy!

I'd love to hear what you think, what you wish we'd build, and I'll be here to
answer any questions!

~~~
reilly3000
This looks very compelling. I'd love to read more about it. Having suffered in
various visual ETL trenches, I've pretty much walked away from them as much as
possible. When can we see the source?

~~~
ibains
I'd love to understand what your preferred interface looks like, and to get your
input on building the interface - please send me a mail at
raj.bains@prophecy.io

------
iblaine
Very cool that you can go between drag-n-drop and code for development. How is
Prophecy different from datacoral.com or Astronomer.io? Will it be open
sourced (e.g., dbt, Airflow, Dagster)?

~~~
ibains
Foremost, we're focused on Transforms - helping users create and manage
complex ETL transforms - this is your business logic.

First, there's the scheduler category - Airflow, Astronomer - these focus on
scheduling those transforms, so they're very different from us (we integrate
well with Airflow).

Then there's the no-code category of cloud transform services - these focus on
simple movement of data - Datacoral, Segment, Fivetran, and the like might fit
here.

We're quite different from both!

------
dvt
Out of personal (and painful) experience, using Spark for general-purpose ETL
processes is a bad idea. Spark is meant to be used in highly-distributed
systems and with tables that have like trillions of rows. RDDs use an
optimized distributed programming model that takes a lot of practice and
getting used to. Some operations are virtually impossible due to the fact that
executors run on separate contexts. Caveat emptor.

~~~
EdwardDiego
What issues did you encounter? We've been using Spark since 1.0, and the
dataframes API and Catalyst query engine have abstracted away RDDs nicely, we
rarely have to use them. We recently converted a bunch of legacy ETL jobs that
ran against the same dataset in Vertica to Spark in EMR by storing the raw
data as Parquet, caching it in memory, and running the 30 or so different SQL
queries against it using Spark SQL.

The only issue we encountered was one query that used three unions, which was
rather inefficient; once it was replaced with a query that used grouping
properly, problem solved.
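
The "load once, cache, query many times" shape described above can be sketched
without a Spark cluster at all. Here's a stdlib sqlite3 stand-in (the table,
rows, and queries are made up for illustration, not the actual EMR jobs): the
raw data is loaded a single time, and every query reuses that cached copy
instead of re-reading the source per job.

```python
import sqlite3

# Load the raw data once into an in-memory store (stand-in for
# reading Parquet into a cached DataFrame).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL, category TEXT)")
rows = [(1, 10.0, "a"), (1, 5.0, "b"), (2, 7.5, "a"), (3, 2.5, "b")]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Many independent queries against the same cached dataset -
# analogous to running ~30 Spark SQL queries on one cached view.
queries = {
    "total_by_user": "SELECT user_id, SUM(amount) FROM events GROUP BY user_id",
    "count_by_category": "SELECT category, COUNT(*) FROM events GROUP BY category",
}
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
```

In real Spark terms this corresponds to reading the Parquet once, calling
`.cache()` on the DataFrame, registering a temp view, and pointing each Spark
SQL query at that view.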

~~~
dvt
First, like I mentioned, Spark uses separate contexts for all executors (by
design). This means that running nested RDD mappings is a nightmare. Just
Google "nested for each loop Spark" or something along those lines[1]. This is
an _extremely_ common paradigm, especially when dealing with more complex
transforms (in my specific case, we needed to hit an API, and then hit another
API based on results of the previous API).
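
The usual escape hatch here is to flatten the nesting into sequential stages:
instead of launching a second parallel map from inside the first (which Spark
forbids, since executors can't create their own RDDs), run one flat pass, then
a second flat pass over its output. A plain-Python sketch of that shape, where
`first_api` and `second_api` are hypothetical stand-ins for the real calls and
list comprehensions stand in for `rdd.map`:

```python
def first_api(record):
    # Hypothetical first call, e.g. look up an account id for a record.
    return {"record": record, "account": record * 10}

def second_api(enriched):
    # Hypothetical second call that depends on the first call's result.
    return {**enriched, "balance": enriched["account"] + 1}

records = [1, 2, 3]

# Stage 1: one flat pass (rdd.map(first_api) in Spark terms).
stage1 = [first_api(r) for r in records]

# Stage 2: a second flat pass over stage 1's output (another rdd.map),
# rather than a nested parallel loop inside stage 1.
stage2 = [second_api(e) for e in stage1]
```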

Second, Spark is not re-entrant. This means that to really get use out of
Spark's massive parallelization, you need clever and non-obvious tricks to do
things that "seem" simple[2]. In some extreme cases, the RDD needs to be fully
serialized (danger zone if it's a big one).
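
For what it's worth, the standard workaround in that second case is to pull the
small side of the computation out of Spark entirely: collect it to the driver
as a plain value (and broadcast it, in real Spark) and close over that inside
the task, since a DataFrame/RDD can't be used from inside an executor. A
pyspark-free sketch with illustrative names:

```python
# Imagine df.collect() on a small lookup DataFrame on the driver.
lookup_rows = [("a", 1), ("b", 2)]
lookup = dict(lookup_rows)  # plain value, safe to ship to tasks
                            # (sc.broadcast(lookup) in real Spark)

def enrich(record):
    key, value = record
    # Uses the broadcast dict, NOT a DataFrame, inside the "task".
    return (key, value + lookup.get(key, 0))

data = [("a", 10), ("b", 20), ("c", 30)]
enriched = [enrich(r) for r in data]  # rdd.map(enrich) in real Spark
```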

Finally, and related to the two above, Spark is just simply strange. You
really need to have a "parallel" mindset when writing code on a Spark cluster.
I constantly had to look stuff up and documentation was spotty at best. Once
you throw in the rest of the ensemble (Pandas, Numpy, etc.) you end up with a
very domain-specific codebase. In reality, most companies will have datasets
that could be ETL'd on my iPhone and a Spark cluster is just overkill.

[1] https://stackoverflow.com/questions/34339300/nesting-parallelizations-in-spark-whats-the-right-approach

[2] https://stackoverflow.com/questions/32619570/can-i-use-spark-dataframe-inside-regular-spark-map-operation

~~~
ibains
The complexity of transforms varies widely across enterprises, and across teams
within an enterprise.

Spark and Hive (which I was product manager for, and which hundreds of
enterprises use) are used quite heavily in data engineering. Many transforms
fit well into a relational/set-based model.

In your particular case, perhaps due to the nature of your transforms, it
might not have been the best fit.

Also, if your data is small, Spark is not always a great fit. However, in many
large companies, if they have 500 transforms and 30 are small, there is
simplicity in using a single technology.

~~~
dvt
Just to be clear, I'm not discounting the work you're doing with Prophecy. We
were using Databricks, and I can definitely see how a platform like Prophecy
is a step up. I have a personal rule that I will always prop up any budding
startup or entrepreneur (the golden rule and all that).

So huge congrats on your release!

------
quirmian
This looks very similar to products like Apache Nifi, Apache CDAP and Google
Data Fusion. When would I pick prophecy.io over the others?

~~~
ibains
We're focused on complex transforms for high volume & performance workloads -
something you'd find in Enterprise core production ETL workflows.

We're a complete replacement for ETL products - you'd use us to replace an Ab
Initio, Informatica, IBM Datastage - we'd move your workflows through
automated conversion.

You can program using visual drag-and-drop or write pure Spark
Scala/Java/Python code; they are equivalent in our system. You get
configuration management, lineage, and metadata management. We're not tied to
any private/public cloud - so you can use us with any Apache Spark, and every
public/private cloud has it.

This is quite different from these products! Different focus and level of
abstraction.

------
tgtweak
How does this compare to Cask Hydrator/CDAP? What does migration from R or
Python ETLs to this look like?

~~~
ibains
We're not adding a layer on top of all big data platforms; instead, we work on
Spark directly. This has advantages and disadvantages.

When you use our product, your code looks very similar to writing Spark code
in IntelliJ. You get visual workflows along with it.

Our users don't have to learn or use another layer of abstraction that locks
them into our API, and we avoid the performance overhead such a layer adds.

Apart from the interface, we automatically "convert" your legacy code into
Spark and "migrate" those Spark workflows from one cluster to another. You can
move workflows from a private cloud to a public cloud, or to another
datacenter.

~~~
tgtweak
What about existing spark/scala workflows? Can they be imported?

------
transfire
I read all this technical jargon but have no idea what any of it actually
does.

~~~
ibains
Well, the simplest way to put it is: you get a Data Engineering (ETL) product
that'll move your tens of thousands of ETL workflows from a legacy ETL product
to Spark, give you a full-stack ETL product on Spark, and move you from on-
premise to cloud.

Plus, it has improvements along multiple dimensions, such as interfaces - which
developers care a lot about!

~~~
mlevental
what is "legacy ETL" exactly? you have a transpiler but what code does it
transpile? what if all my ETL is written in Haskell and SQL (it's not - I
don't have any ETL processes - I'm just trying to figure out what you're
transpiling)

~~~
ibains
These are workflows written in legacy formats - Ab Initio, Informatica, IBM
DataStage - this is a $10B+ industry, so there's lots of usage out there. They
typically use graph formats and domain-specific languages to specify
transformations; we reverse-engineer those. Enterprises have tens of thousands
of these workflows, and getting out of legacy lock-in is hard!

------
nexuslab
Awesome!

------
rpooranprasad
That is awesome. I could use your service. Will mail

