
Ask HN: What's the Heroku for Data? - shay_ker
Heroku made it so easy to one-click deploy a web application. What's the equivalent for building a data pipeline, data warehouse, or data lake?
======
alpb
I work at Google, so I'll give examples from Google land:

* Google BigQuery (data warehouse) [https://cloud.google.com/bigquery/](https://cloud.google.com/bigquery/)

* Google Dataflow [https://cloud.google.com/dataflow/](https://cloud.google.com/dataflow/)

* Looker [https://cloud.google.com/looker](https://cloud.google.com/looker)

More on data analytics:
[https://cloud.google.com/products#section-6](https://cloud.google.com/products#section-6)
Databases:
[https://cloud.google.com/products#databases](https://cloud.google.com/products#databases)

Most of these solutions are 'serverless', similar to the Heroku model: you
don't manage the infrastructure, it scales with demand, and you pay per usage.

~~~
shay_ker
Is there anything that actually builds the pipeline for you?

------
streetcat1
For a data warehouse, try ClickHouse.

------
Jugurtha
We're working on an internal platform: object storage, structured or
unstructured data, a plug-in architecture that makes it easy to add
applications. Notebooks with a GPU and a choice of images with lots of
pre-installed libraries, in one click; in other words, our people won't have
to deal with CUDA and won't have to get a powerful machine.

One-click publishing of an automatically parametrized notebook as an
application, an AppBook, without writing widget code or tagging cells.
Automatic tracking of parameters and metrics, and saving of models to object
storage, with no boilerplate code, all behind the scenes so people don't
_forget_ to log. You can send POST requests to your instrumented training
notebooks and programmatically tweak training parameters. You can also take a
runtime notebook with the parameters you want and publish it as an AppBook, so
a domain expert can either try to validate it by entering values, or train
models themselves by tweaking parameters.
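A rough sketch of how POSTed parameter overrides might be merged onto a
notebook's defaults; the parameter names, defaults, and the `resolve_params`
helper are hypothetical, not our platform's actual API:

```python
# Hypothetical defaults an instrumented training notebook might declare.
DEFAULTS = {"learning_rate": 1e-3, "epochs": 10, "batch_size": 32}

def resolve_params(overrides):
    """Merge POSTed overrides onto the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

# A domain expert tweaks two parameters; the rest keep their defaults.
params = resolve_params({"learning_rate": 1e-2, "epochs": 5})
```

Rejecting unknown keys keeps a typo in a request body from silently training
with the defaults.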

Also, one-click deployment that gives you an endpoint to serve your model. You
can invoke the model with a POST request, using an authentication token you
generate from the application.
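A minimal sketch of what such an invocation might look like from the client
side; the endpoint URL, payload shape, and `build_invoke_request` helper are
assumptions for illustration, not the platform's actual API:

```python
import json
import urllib.request

def build_invoke_request(endpoint_url, token, features):
    """Build an authenticated POST request for a model-serving endpoint."""
    payload = json.dumps({"instances": [features]}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_invoke_request(
    "https://example.com/models/churn/invoke",  # hypothetical endpoint
    "generated-token",                          # token from the application
    {"tenure_months": 12, "plan": "pro"},
)
# urllib.request.urlopen(req) would perform the call; omitted here.
```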

We're adding functionality for projects and better user management, and
supporting more data sources (S3 only for now, with Google Cloud Storage and
Azure in the pipeline). We don't have async training or invocation yet.

We're opening it up to about 30 students of one of our colleagues so they can
prepare their master's theses on an instance we deployed on Google Cloud
Platform, with pretty big datasets (they get to use a Tesla K80, which they
couldn't before).

If you're interested, drop me a line. We're also supporting the fast.ai course
so people can start learning hassle-free.

We have worked as a consulting company on ML projects with large
organizations, and we've suffered through the usual problems of configuration,
hardware, libraries, dependencies, notebook sharing, model deployment, and
building applications all the way from problem and model to the JavaScript on
the front end. It's not trivial to find or keep people who can go through all
of that, and it's risky to depend on them; as a small consulting company doing
contract work, the cost of full-time specialists, if you can find them, is
draining. We also had the problem of key people leaving projects with no one
left who knew what the project was about. Now we're adding features for
collaboration and for institutionalizing project knowledge, as we do in our
day-to-day software engineering, where we're all aligned and everyone knows
why something was done (rationale, assumptions, hypotheses, etc.).

So we built this tool to give us leverage: to let our data scientists deploy
models and share interactive notebooks without having to write widget code or
ask the people who can deploy things to take care of it. It must be one click.

