I was the product manager for Apache Hive at Hortonworks and, earlier, an early engineer on CUDA at NVIDIA - focusing on compiler optimizations.
As product manager, I saw so many customers struggle with Hive/Hadoop. So I decided to build a product-first company - one that just works and solves real (and sometimes unsexy) customer pain points.
We want to support the entire enterprise journey to open source (Apache Spark) and then to the cloud (Kubernetes).
What I'm personally excited about is that, with some compiler magic, we can make code and visual interfaces work together - making all developers happy!
I'd love to hear what you think and what you wish we'd build - and I'll be here to answer any questions!
One is the category of schedulers - Airflow, Astronomer - which focus on scheduling transforms, so they're quite different (we integrate well with Airflow).
Then there is the no-code category of cloud transform services, focused on simple movement of data - Datacoral, Segment, and Fivetran might fit here.
We're quite different from both!
The only issue we encountered was one query that used three unions, which was rather inefficient - but once it was replaced with a query that used grouping properly, problem solved.
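For anyone curious, here's a minimal sketch of that pattern in Spark Scala (table and column names are invented for illustration) - three filtered scans stitched together with unions, versus a single scan that groups:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("union-vs-group").getOrCreate()
    import spark.implicits._

    val events = spark.read.parquet("/data/events")  // hypothetical input

    // Anti-pattern: three separate scans of the same data, unioned together.
    val unioned = Seq("a", "b", "c")
      .map(k => events.filter($"kind" === k).groupBy($"kind").count())
      .reduce(_ union _)

    // Better: one scan, letting the grouping do the splitting.
    val grouped = events
      .filter($"kind".isin("a", "b", "c"))
      .groupBy($"kind")
      .count()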
Second, Spark is not re-entrant. This means that to really get value out of Spark's massive parallelism, you need clever and non-obvious tricks to do things that "seem" simple. In some extreme cases, the RDD needs to be fully serialized (danger zone if it's a big one).
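Here's a minimal sketch of what I mean, assuming a large RDD and a small lookup table (names invented) - nested RDD operations fail because the SparkContext only lives on the driver, so the usual trick is to collect and broadcast the small side:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("no-nested-rdds").getOrCreate()
    val sc = spark.sparkContext

    val big    = sc.parallelize(1 to 1000000)
    val lookup = sc.parallelize(Seq(1 -> "a", 2 -> "b"))

    // Fails at runtime: RDD operations can't run inside another RDD's tasks.
    // val broken = big.map(x => lookup.lookup(x))

    // The non-obvious trick: pull the small RDD to the driver and broadcast
    // it (the full-serialization "danger zone" if that RDD is big).
    val table  = sc.broadcast(lookup.collectAsMap())
    val joined = big.map(x => (x, table.value.get(x)))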
Finally, and related to the two above, Spark is just plain strange. You really need a "parallel" mindset when writing code for a Spark cluster. I constantly had to look things up, and the documentation was spotty at best. Once you throw in the rest of the ensemble (Pandas, Numpy, etc.) you end up with a very domain-specific codebase. In reality, most companies have datasets that could be ETL'd on my iPhone, and a Spark cluster is just overkill.
Spark and Hive (which I was product manager for, and which is used by hundreds of enterprises) are used quite heavily in data engineering. Many transforms fit well into a relational/set-based model (see the sketch below).
In your particular case, perhaps due to the nature of your transforms, it might not have been the best fit.
Also, if your data is small, it's not always a great fit. However, in many large companies, if there are 500 transforms and 30 of them are small, there is simplicity in using a single technology.
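To illustrate the kind of set-based transform I mean, here's a hypothetical join-and-aggregate in Spark Scala (all table and column names invented):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder.appName("set-based-etl").getOrCreate()

    val orders    = spark.read.parquet("/data/orders")     // order_id, customer_id, amount
    val customers = spark.read.parquet("/data/customers")  // customer_id, region

    // A classic relational transform: join, then aggregate by a dimension.
    val revenueByRegion = orders
      .join(customers, "customer_id")
      .groupBy("region")
      .agg(sum("amount").as("revenue"))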
So huge congrats on your release!
I like that you said this, and I've seen a similar sentiment before. I was first exposed to it watching an ElixirConf keynote from José Valim about the Flow framework they built in Elixir, where he summarized it as "for between 40-80% of the jobs submitted to MapReduce systems, you'd be better off running them on a single machine" - a point drawn from the paper "Musketeer: all for one, one for all in data processing systems".
While I'm no data engineer myself, I do often wonder whether distributing the workload is always better. The anecdote above suggests that a single powerful multicore machine may be the right solution for many.
Now, that isn't to discount what Prophecy is trying to do - my company just went through a huge re-platforming, moving from on-premises to the cloud, and it is not easy; any company trying to tackle that space is on the right track. But I just wonder if it's overkill for most use cases?
 - https://www.youtube.com/watch?v=srtMWzyqdp8
 - http://www.cs.utexas.edu/users/ncrooks/2015-eurosys-musketee...
We're a complete replacement for ETL products - you'd use us to replace Ab Initio, Informatica, or IBM DataStage - and we'd move your workflows over through automated conversion.
You can program using visual drag-and-drop or write pure Spark Scala/Java/Python code; the two are equivalent in our system. You get configuration management, lineage, and metadata management. We're not tied to any private or public cloud - you can use us with any Apache Spark, and every public/private cloud has it.
This is quite different from these products! Different focus and level of abstraction.
When you use our product, your code looks very much like the Spark code you'd write in IntelliJ, and you get visual workflows along with it.
Our users don't have to learn or use another layer of abstraction that locks them into our API, and we avoid the performance overhead such a layer would add.
Apart from the interface, we automatically "convert" your legacy code into Spark and "migrate" those Spark workflows from one cluster to another. You can move workflows from a private cloud to a public cloud, or to another datacenter.
Plus, it brings improvements along multiple dimensions - such as interfaces - that developers care a lot about!