Ask HN: What is your ML stack like?
387 points by imagiko 24 days ago | 131 comments
How did your team build out AI/ML pipelines and integrate them with your existing codebase? For example, how did your backend team (using Java?) work in sync with data teams (using R or Python?) so that as little rewriting/glue code as possible was needed to deploy models in production? What were your architectural decisions that worked, or didn't?

I'm currently working to make an ML model written in R work on our backend system written in Java. After the dust settles I'll be looking for ways to streamline this process.




What didn't work:

Shipping pickled models to other teams.

Deploying Sagemaker endpoints (too costly).

Requiring editing of config files to deploy endpoints.

What did work:

Shipping http endpoints.

Deriving api documentation from model docstrings.

Deploying lambdas (less costly than Sagemaker endpoints).

Writing a ~150 line python script to pickle the model, save a requirements.txt, some api metadata, and test input/output data.

Continuous deployment (after model is saved no manual intervention if model response matches output data).
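
A minimal sketch of what such a packaging script and deployment check could look like, using only the standard library. Everything named here (`MeanModel`, `package_model`, `verify_package`, the file names) is hypothetical and not the poster's actual script, and the requirements list is passed in by hand rather than generated.

```python
import json
import pickle
import tempfile
from pathlib import Path


class MeanModel:
    """Toy stand-in for a real model: predicts the mean of each input row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def package_model(model, out_dir, requirements, test_inputs):
    """Write the pickled model, requirements.txt, API metadata, and test I/O."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.pkl").write_bytes(pickle.dumps(model))
    (out / "requirements.txt").write_text("\n".join(requirements) + "\n")
    # API metadata is derived from the model's class name and docstring.
    metadata = {"name": type(model).__name__, "doc": type(model).__doc__}
    (out / "metadata.json").write_text(json.dumps(metadata))
    # Record what the model returns now, so deployment can be checked later.
    test_io = {"inputs": test_inputs, "outputs": model.predict(test_inputs)}
    (out / "test_io.json").write_text(json.dumps(test_io))
    return out


def verify_package(pkg_dir):
    """Deployment gate: reload the model and compare against the saved outputs."""
    pkg = Path(pkg_dir)
    model = pickle.loads((pkg / "model.pkl").read_bytes())
    test_io = json.loads((pkg / "test_io.json").read_text())
    return model.predict(test_io["inputs"]) == test_io["outputs"]


with tempfile.TemporaryDirectory() as tmp:
    pkg = package_model(MeanModel(), tmp, ["numpy"], [[1.0, 3.0], [2.0, 2.0]])
    package_ok = verify_package(pkg)
    print("safe to deploy:", package_ok)
```

The idea is that `verify_package` runs in the deployment pipeline, so a model only ships without manual intervention when its response matches the saved output data.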


Hi Aaron, we built exactly what works for you into an open source Python library, github.com/bentoml/bentoml.

It packages your model into a standardized format that you can use in multiple serving scenarios: online serving with an API endpoint, offline serving with a Spark UDF, CLI access, or importing it as a Python module. It also helps you deploy to different platforms such as Lambda, SageMaker, and others.

Our value is getting from a model in a notebook to a production service in 5 mins. We'd love to hear your feedback on this. You can try out our quick start on Google Colab (https://colab.research.google.com/github/bentoml/BentoML/blo...)


It's been great seeing this space fill out with solutions in the last year. MLFlow[1] is another open source solution I have my eyes on.

BentoML looks more cohesive than our homegrown solution because it targets a more general case. One of the things I would miss when switching to BentoML would be automatic requirements generation. We use pipreqs[2] to generate a requirements.txt given a model instance. Any thoughts on how difficult it would be for a user to extend BentoML to integrate pipreqs?

Another difficulty question: we have a few statsmodels[3] predictors, and it isn't clear how much work would be involved in extending BentoML to accept those too.

Thanks for pointing out BentoML. I'll keep an eye on it as a migration target as this space develops.

[1] https://mlflow.org/docs/latest/index.html

[2] https://github.com/bndr/pipreqs

[3] https://www.statsmodels.org/stable/index.html


Hi Aaron, I'm one of the BentoML authors. Great suggestion on pipreqs, we will look into incorporating that into BentoML!

It should be very straightforward to add support for saving/loading statsmodels models in BentoML. In fact, you should also be able to just use the existing "PickleArtifact" in BentoML for statsmodels predictors. We will add an example notebook for working with the statsmodels library soon!


Hi Aaron! We at Kubeflow[1] would love it if you took a look at us as well. We're always open to feedback!

[1] https://kubeflow.org


Hey Aaron, I work on Cortex which is a tool for continuously deploying models as HTTP endpoints on AWS. Under the hood we use Kubernetes instead of Lambda to avoid cold starts, enable more flexibility with customizing compute and memory usage (e.g. running inference on GPUs), and support spot instances. Could you clarify your comment regarding editing of config files? Is it still a problem if the configuration is declarative and tracked in git? I'd love to hear your feedback! (GitHub: https://github.com/cortexlabs/cortex | website: https://cortex.dev/)


Sure, I'm thinking about the development lifecycle in terms of what actions data scientists have to take to get a model deployed. Any time the process has a branch (i.e. you need to change this file whenever something elsewhere changes), I know I'm going to forget to do that.

If we were to use Cortex, we would likely wrap the creation of cortex.yml in a function and call it when we're saving our models. We do something similar right now and store the meta in json files for later deployment. I love tracking config in git too.


That makes sense. Programmatically updating cortex.yaml is a common use case especially when you're thinking about continuous deployment. We also have a Python client which can replace the cortex.yaml file (https://www.cortex.dev/deployments/python-client).


We had a similar problem with SageMaker re: cost. We tried a few different things out, but ultimately wound up sticking with Cortex https://github.com/cortexlabs/cortex/


Could you possibly define "pickling" in this context for us ML noobs?


Pickling is a protocol to serialize Python objects. In scikit-learn that would be serializing an Estimator.

https://docs.python.org/3/library/pickle.html https://scikit-learn.org/stable/modules/model_persistence.ht...


We save the state of an object (an instance of a class with a predict() method) to disk once we have a model that we are happy with. During deployment we copy this file to a server which loads the file from disk and restores the state of the object on the remote machine.

We use dill[0], but there are other similar libraries.

[0] https://pypi.org/project/dill/
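
For the noobs, a minimal sketch of that save/restore round trip using the standard library's `pickle` (dill exposes the same `dumps`/`loads` calls). `ThresholdModel` is a made-up toy class, not anything from our codebase.

```python
import pickle


class ThresholdModel:
    """Toy model: predicts 1 when the input exceeds a learned threshold."""

    def __init__(self):
        self.threshold = None

    def fit(self, values):
        # "Training" here is just taking the mean as the threshold.
        self.threshold = sum(values) / len(values)
        return self

    def predict(self, values):
        return [int(v > self.threshold) for v in values]


model = ThresholdModel().fit([1.0, 2.0, 3.0, 6.0])

# Serialize the object's state to bytes (in practice, written to a file).
blob = pickle.dumps(model)

# On the serving machine: restore the object and call predict() as before.
restored = pickle.loads(blob)
print(restored.threshold)            # 3.0
print(restored.predict([2.0, 5.0]))  # [0, 1]
```

Note that pickle stores a reference to the class rather than its code, so the machine that loads the file needs the same class definition (and compatible library versions) available.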


There's a serialization module called pickle that can be used to store models:

https://docs.python.org/3/library/pickle.html


Pickling isn't ML-specific. Pickle is an object serialization library in Python.


If by ML noob you mean to say that you're like me and have zero formal CS training (as in, I don't know what a data structure is), pickling lets you write your Python workspace to a file just like Matlab's .mat file loading. It's excellent for writing scripts defining different parts of a data pipeline, or just for debugging/trying new things without waiting 20 minutes for something to filter.


Here is a simple explanation:

1. The universe is composed of things.

2. We can use the computer to store information about those things (this is the data).

3. In order to gain useful insights about those things, we want to do operations on the data, i.e. to compute. Computation is done by algorithms.

4. Data structures are the bridge between the data about things and the algorithm. They hold the data such that the algorithm will have an easier time computing.


> as in, I don't know what a data structure is

Basically everything you work with in programming is a value (the number one-hundred-seventy-five, for example: "175") or the address—location in computer memory, say—of a value. You might record the address of that value above as the count of characters from the beginning of this post, for example, were this post the layout of data in some RAM, just as numbering houses on a street. Add the concept of data "width"—how long the number is, in terms of how many characters represent it (3, in this case) and you've got basically all there is in terms of primitive stuff that computer programs operate on.

Observe that the value stored at a location in memory, say, might itself be an address—location—of some other thing stored in memory.

The width+value concept can get you pretty far, in that you can store a bunch of stuff and find it again, given the address of the beginning and some convention that the first so-many bits of the value describe the width of the rest of the value, or some other means of knowing the size (width) of the value, as long as all its parts are stored right next to each other in memory, and in the correct order. That's called an array, in fact, which is a data structure! One problem with arrays is that if you want to make them longer, you might not have more memory available at the end of them—something else may be using that location already, and if you just overwrite it that'll likely break something. So you'll have to copy your entire array to a larger piece of empty memory to add on to the end of it.

EXAMPLE!

Say we have some RAM large enough to store nine things, and we know that we have something stored at position 4—programmers like to order items starting with zero, but I'll refrain because it's not really important here and makes it more confusing. The RAM contains the stuff we're looking for, plus some other stuff that we don't care about right now:

[429317501]

We look at position 4, and know (by convention, or whatever) that the "3" we find tells us how many more locations to read past that to get our entire value, and extract "175" as our value, by proceeding to do just that. That's an array! One of the most basic data structures. Of course all this is binary under the hood, and those binary numbers can also represent letters or color values of a pixel in an image or whatever, but I'm using simple base-10 numbers to keep things easier to follow.
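
If it helps, here's the same lookup as a tiny Python sketch. The string stands in for our make-believe RAM, and `read_length_prefixed` is just a name I made up for "read the width, then read that many cells".

```python
def read_length_prefixed(ram: str, position: int) -> str:
    """Read a value whose first cell holds its width (1-based positions)."""
    start = position - 1                  # convert to Python's 0-based index
    width = int(ram[start])               # the cell at `position` is the width
    return ram[start + 1 : start + 1 + width]


ram = "429317501"
value = read_length_prefixed(ram, 4)
print(value)  # 175
```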

You can use these basic pieces to build up more complex data structures than that, of course. You can have a series of places in memory, not necessarily next to each other, each containing a value and then the address of another value+address pair. Given the address of the first of these, one could write a program to read each in order, following the addresses to hop from one to the next until it finds one without an address provided. Ta-da, it's (one kind of) a linked list! That's another kind of data structure. Now if you want to add to the end of your value, you just pick an empty spot in memory, add its address to the last piece of the existing list (first "walking" the list to find it by starting at the beginning and reading each piece in turn, if you don't already know where the last part is located), then fill it in with the value you want. This takes up more space than an array, though, and may take a little longer to "read" (get the value[s] from). It's very common for many types of data structure to be able to represent the same thing, just with different trade-offs in terms of space used, or time to locate a given part of the structure, or how much space or time it takes to modify it (recall, having to copy an entire array in order to add on to it) and so on.

EXAMPLE!

[950818972]

If we know (somehow) that our linked list starts at position (address) 5, and know that we can expect a value then an address at that location, next to one another, we see "18", so our value starts with 1 and we should look at address 8 next, where we find "72"—value is 7, and now look at address 2, finding "50". Zero in our little make-believe addressing system here conventionally means there are no further addresses to look up (there is no address 0) so, without knowing how long the list would be when we started, we now know we're done, that there were three values stored in the list, and that they are, in order, "1", "7", and "5".
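
And the same walk as a Python sketch, again treating a string as our toy RAM with 1-based addresses and 0 meaning "no next item". `walk_linked_list` is a made-up helper for illustration.

```python
def walk_linked_list(ram: str, head: int) -> list:
    """Follow value+address pairs starting at `head` until the address is 0."""
    values = []
    address = head
    while address != 0:
        values.append(ram[address - 1])   # the value cell at this address
        address = int(ram[address])       # the next cell holds the next address
    return values


ram = "950818972"
print(walk_linked_list(ram, 5))  # ['1', '7', '5']
```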

If you've followed this far, you may be able to see how one could make "trees" (pairs of addresses instead of just one, suggesting a "left" and "right" path) and other things from these fundamental parts, and may be able to think of reasons why one might do this. A linked-list where all the "values" are addresses to parts of some other structure (with the "address" portion of the linked list item used as normal)? That's a sort of index, right? A list where the "value" at each position is the address of the beginning of another list? We call that a multidimensional list (or multidimensional array, if it's laid out as an array) and it's one way a person might represent a grid of, for example, colors in an image (this is basically what a bitmap is). And so on.

Disk storage uses the same fundamental building blocks. FAT, as in FAT32 or FAT16, old DOS and Windows file systems? File Allocation Table is what FAT stands for. There's a table (kinda like conjoined lists, much as above), starting at a conventional spot (address) on a FAT-formatted disk partition, that describes where all the files are on the rest of the disk, along with some other info about the filesystem. It's basically exactly what it sounds like, and uses precisely the same concepts as above, just applied to locations of files on a disk rather than locations of values in RAM.

One last insight: program code—the instructions for the CPU—is also stored in memory, and that works basically the same way as the stuff above. This unified, undifferentiated storage system is called Von Neumann architecture, and it's what pretty much all computers you're likely to encounter use. Point being, those addresses pointing to things stored in memory? They can also point to places where code is stored, which, once located, one might direct the CPU to execute. A little thought on that, combining it with the above notions, should suggest some cool things this would enable.

And that's about it. That's data structures, and indeed much of programming. It's all values, addresses, widths, and more than a little bit of convention.


I have never read such a crystal clear explanation of what a computer / a Von Neumann architecture is. Simpler is always better. Thanks a lot for this.

Any other clear descriptions of CS concepts, especially for Data Science, to share? Links to them?


Haha, thanks. Nah, I wrote that just now and don't have, like, a source I go to for this stuff that's not pretty well-known already, or a blog or anything. Maybe I should start one. Not exactly ready to explain anything else like that off-the-cuff in an HN post at the moment :-)


I've read a lot about data structures and nothing's ever stuck, but this "did it" for me. Thanks so much, you're a really excellent writer.


To add to the other responses, I would recommend Joblib (over pickle or cpickle). Reasons here: https://stackoverflow.com/a/12617603/1868436


"Pickling" is just the pythonic term for serialization. In this context, it most likely means persisting the model to disk as some sort of file.


Not really - "pickling" an object in python is applying a very specific serialization protocol. That protocol happens to be built into the python language itself, but there are alternatives.


What about A/B testing? What do you use for your A/B strategy? How many predictions are being served by the model per second?



Please do not use `pipenv`, use `poetry` or plain old `pip` instead.

1. https://news.ycombinator.com/item?id=18612590


Hmmm

Point about "Official tool" is valid

The others seem strange to me

I'd add to my comment something like "Try it. If you like it - use it"

Thank you for the link


This is oddly similar to what we're doing. Except my team is heading towards sagemaker against my wishes.


Did you have the two systems talking to each other through HTTP endpoints? I mean the ML system receiving data from a source API and sending back a result? Is this where AWS Lambda comes in? Are there any formal tools that facilitate making these endpoints?


Yes. We use aws sam cli [1] to facilitate testing and deployment to AWS's api-gateway + lambdas. It works, and even though the configuration is automatically generated using model metadata, I'm still not too thrilled about this choice. TBD on whether this was a good or bad choice.

[1] https://github.com/awslabs/aws-sam-cli


algorithmia.com ? HTTP endpoints, serverless, versioned, logging, auth, etc.


Yes to pickling models!


Would love to hear your thoughts on this? cortex.dev


We use Cortex and I'd say I'm pleased with it. It doesn't offer the end-to-end solution that something like SageMaker does, but it's the best tool we've used for deploying models. Also, and this is less of a technical feature and more of a nice to have, but the team has been really responsive when we've had problems and they seem to be shipping new features at a steady clip.


Custom Unreal Engine simulator, simulating agents with NVidia Physx and publishing sensors through GStreamer. GStreamer has sinks and sources for ROS, and tensorflow elements for inferencing. We package this all into NVidia Docker for scalable simulations. Setup is similar for training and inference. The core framework is a streaming engine with stream combinators that enable reasoning about spatio-temporal data streams, where each datum is related to a point in space and time. The goal is for tensors to be the streaming primitives, but the pipeline is still fragile; it’s a challenge to keep all this working with so many core technologies changing constantly (UE4 + Physx + Cuda + CuDNN + Tensorflow + …). We train robots in simulation.

Most of our tools are built in Rust. Several of those are for creating (cleaning) data streams out of datasets. They are converted into tensors or ROS messages.


How well does the training in the simulator transfer over to real world environments? Do the models require fine tuning on real life data, or do they immediately work? I would imagine they would get confused by things like wheel/tracks slipping a bit on the ground versus in the Unreal simulation.


Traditional simulations don’t translate well. Usually one has to choose a fidelity level: controls, navigation, exploration, mission, etc. We are having more success with models that can tolerate variable dynamics. Simulations don't need to be realistic, only consistent. We then reinforce models with an infinite number of simulation universes and rules, such that reality is just another sample. We don’t have the resources for end-to-end training, and our initial tests did not yield good results. But chaining smaller specific-purpose networks within traditional control and planning systems is looking really good.


This sounds pretty advanced! What lab do you work in?


Out of Central America. https://vertexstudio.co/


This is really cool, what are you training?


Quadcopters, railway vehicles and perching devices.


I want to be where you are. How can I get in contact with you?


I'm building these labs/teams in Latin America, with a few remote people from around the world. Ping me at: alex.rozgo @ vertexstudio.co


Our organization looks similar: models are almost all written in R but the business operates in C#. We just use simple HTTP APIs to intermediate. Specifically, we do the following:

1. Use R packages to bundle up the models with a consistent interface.

2. Create thin Plumber APIs that wrap these packages/models.

3. Build Docker images from these APIs.

4. Deploy API containers to a Docker Swarm (but you could use any orchestration).

5. Stick Nginx in front of them to get pretty, human-readable routes.

6. Call the models via HTTP from C#.

This stack works pretty well for us. Response times are generally fast and throughput is acceptable. And the speed at which we can get models into production is massively better than when we used to semi-manually translate model code to C#...

Probably the biggest issue was getting everyone on board with a standard process and API design, after a few iterations. And putting in place all the automation/process/culture to help data teams write robust, production-ready software.


We train models as kubernetes cronjobs defined by a minimal properties file per model specifying the number of cpus/gpus/mem. They start with a given image (e.g. pytorch or tf) based on where in the repository these files are placed, and then run a user-specified bash file to start the job.

Data scientists have a similar docker image running in kubernetes which includes all of these images as conda environments for experimenting in prod-like environments. Spark is used to fetch data for the most part.

Models report a finished state over Kafka after getting persisted to buckets in Google Cloud, and then get mirrored over to a ceph cluster connected to our serving kubernetes cluster.

We have an in house Golang server binding to c++ for serving pytorch neural nets persisted with the torch.jit API (I can really recommend this for hassle-free model serving). We also have some Java apps for serving normal ALS or Annoy based models.

Our traffic is not as wild as many here, but we're serving around 10M user requests a day.

We also merge results from several models, joining them together with a separate "meta-model" that estimates which model the user has had a preference for recently, to weight those up.

There are probably a lot of details left out here, especially about the serving part, since we have various services in front of the models enriching data and presenting it to the user, but that's the gist of it.


What has worked fairly well so far:

Models:

- Models are structured as python packages, each model inherits a base class

- The base class defines how to train and how to predict (as well as a few other more specific things)

- ML engineer can override model serialization methods, default is just pickle

Infra:

- Code is checked in to github, Docker container built on each merge into master

- Use Sagemaker BYO container to train models, each job gets a job_id that represents the state produced by that job (code + data used)

Inference / deployment:

- Deploy model as http endpoints (SageMaker or internal) using job_id

- Have a service that centralizes all score requests, finds the correct current endpoint for a model, emits score events to kinesis, and tracks the health of model endpoints

- A/B test either in scoring service or in product depending on requirements

- deploy prediction jobs using a job_id and a data source (usually sql) that can be configured to output data to S3 or our data warehouse

So far this has been pretty solid for us. The tradeoff has been that there's a step between notebook and production for ML engineers, which can slow them down, but it forces code review and increases the number of tests checked in.
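
A rough sketch of the base-class pattern from the "Models" list, under the assumption that serialization defaults to pickle and can be overridden per model. The names (`BaseModel`, `MajorityClassModel`, `serialize`, `deserialize`) are illustrative, not our actual internal API.

```python
import json
import pickle
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Every model package subclasses this and implements train/predict."""

    @abstractmethod
    def train(self, rows, labels):
        ...

    @abstractmethod
    def predict(self, rows):
        ...

    def serialize(self) -> bytes:
        # Default serialization is plain pickle.
        return pickle.dumps(self)

    @classmethod
    def deserialize(cls, blob: bytes):
        return pickle.loads(blob)


class MajorityClassModel(BaseModel):
    """Toy model that always predicts the most common training label."""

    def __init__(self):
        self.majority = None

    def train(self, rows, labels):
        self.majority = max(set(labels), key=labels.count)
        return self

    def predict(self, rows):
        return [self.majority for _ in rows]

    # Override: store only the learned state as JSON instead of a pickle.
    def serialize(self) -> bytes:
        return json.dumps({"majority": self.majority}).encode()

    @classmethod
    def deserialize(cls, blob: bytes):
        model = cls()
        model.majority = json.loads(blob)["majority"]
        return model


trained = MajorityClassModel().train([[0], [1], [2]], ["a", "b", "a"])
restored = MajorityClassModel.deserialize(trained.serialize())
print(restored.predict([[5], [6]]))  # ['a', 'a']
```

Because training and serving code only ever call `train`, `predict`, `serialize`, and `deserialize`, the infrastructure does not need to know which library a given model uses.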


> The tradeoff has been that there's a step between notebook and production for ML engineers, which can slow them down, but it forces code review and increases the number of tests checked in.

This was a game-changer for us. What does your testing story look like?


We're still ironing out a few things, but we have unit tests for various functions, and then small, statistically representative datasets for each model. In CI we train a model on the small dataset (aim for <5 mins e2e), then run a suite of model metrics we care about. Tests confirm the values are within acceptable bounds, and the test values are pulled into the PR.
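
To make the bounds check concrete, here is a minimal sketch, assuming the CI job has already trained on the small dataset and produced predictions. The metric, the bounds, and the helper names (`accuracy`, `check_metrics`) are placeholders, not our real suite.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def check_metrics(metrics, bounds):
    """Return the names of metrics that fall outside their (low, high) bounds."""
    failures = []
    for name, (low, high) in bounds.items():
        if not low <= metrics[name] <= high:
            failures.append(name)
    return failures


# Stand-ins for the held-out labels and the freshly trained model's predictions.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 1, 0, 0, 1, 0, 1]

metrics = {"accuracy": accuracy(predictions, labels)}
bounds = {"accuracy": (0.7, 1.0)}  # hypothetical acceptable range

failures = check_metrics(metrics, bounds)
print(metrics, "failures:", failures)
```

In a test runner this becomes a single assertion that `failures` is empty, and the `metrics` dict is what gets posted back to the PR.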


We are framework agnostic for model development, models get converted to ONNX[1] and served with the ONNX runtime[2]. They are deployed as microservices with docker.

We are currently looking at MLflow[3] for the tracking server, though it has some major pain points. We use Tune[4] for hyperparameter search, and MLflow provides no way to delete artifacts from the parallel runs, which will lead to massive amounts of wasted storage or dangerous external cleanup scripts. They have also been resisting requests for the feature in numerous issues. Not a good open source solution in the space.

Note that this is for an embedded deployment environment.

[1] https://github.com/onnx/onnx

[2] https://github.com/Microsoft/onnxruntime

[3] https://mlflow.org/

[4] https://ray.readthedocs.io/en/latest/tune.html


Any issues with the relatively new ONNX format? How do you handle model monitoring, verifying accuracy of model over time?


The systems I've seen basically break things into different services, tied together with gRPC or Thrift, which have code generators for most languages. So the Java backend simply makes RPC requests to a server running R.

Although in one case we had very tight latency requirements (i.e. 10ms), so the ML results were pre-computed and loaded from a cache on the backend servers.


I imagine I will be breaking down into different services as well. An ML "blackbox" that makes a call to the back-end for data and returns a result/prediction. This could happen through an API. What kind of API to choose is still open.

I'm not very sure what you mean when you say the ML results were pre-computed?


If you're pulling data from somewhere automatically, then you should make sure to define the data contracts well and make sure they're not changed. Also, A/B test things if possible. I've had issues in the past where the data pipeline view of the data and the API view of the data weren't the same, or where a bug fix resulted in the values of certain fields changing.

>I'm not very sure what you mean when you say the ML results were pre-computed?

We were scoring ads per page, and possible values for both were known ahead of time. So for each ad-page combination we generated the scores and then pushed them into a giant cache.
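
A toy sketch of the idea, with a dict standing in for the giant cache. `expensive_score` is a placeholder for the real model call, and the ad/page names are made up.

```python
from itertools import product


def expensive_score(ad: str, page: str) -> float:
    """Placeholder for a slow model call; here, the share of letters in common."""
    shared = set(ad) & set(page)
    return len(shared) / len(set(ad) | set(page))


def build_score_cache(ads, pages, score_fn):
    """Score every ad-page combination offline, ahead of any request."""
    return {(ad, page): score_fn(ad, page) for ad, page in product(ads, pages)}


ads = ["running-shoes", "city-bikes"]
pages = ["sports-news", "commuting-tips", "recipes"]
cache = build_score_cache(ads, pages, expensive_score)

# At request time the backend does a lookup instead of calling the model.
print(cache[("city-bikes", "commuting-tips")])
```

This only works because both sets of keys are known in advance; the cache has `len(ads) * len(pages)` entries and has to be rebuilt whenever the model or either set changes.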


Almost entirely in Go. I use Gorgonia [0] and Gonum [1]. Granted I wrote Gorgonia. All solutions fit into the company's CI/CD infra with almost no additional overhead.

Sometimes the end result is gRPC services, sometimes it's some sort of serialized model (weights). Sometimes the model is specified in protobuf. Very rarely it's an HTTP API. I don't fancy those.

Ironically I haven't done many distributed models. Or if it's distributed, it's not some Kafka-esque monstrosity.

I rarely use Python for anything other than exploratory analyses now.

Being able to type `go build .` and have it run anywhere is pretty awesome

[0] https://gorgonia.org [1] https://gonum.org


Check out the CRAN task view on this topic: https://cran.r-project.org/web/views/ModelDeployment.html

One dead simple way to do this (R model —> Java production) that I’ve done in the past is to use PMML (via pmml package), which converts models to an XML representation. ONNX is a similar/newer framework along these lines. You can also look at dbplyr for performing (dplyr-like) data preprocessing in-database.


I didn't see this comment when I posted, but I would strongly recommend a PMML-based approach.


What we do at https://quillbot.com

Training:

Currently we just use a bunch of beefy desktop workstations for training (using Pytorch).

Deployment:

This is the vast majority of our cost. Each time a paraphrase request comes in, we add it to a queue through Google Cloud Pub/Sub. We have a cluster of GPU (T4) servers pulling from the queue, generating paraphrases and then sending the responses back through Redis pub/sub. I think ideally we would have a system that makes it easier to batch sentences of similar length together, but this seems to be the most cost-effective approach that is still relatively simple to put together for models that are too computationally expensive for the CPU.
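
For what it's worth, the length-based batching we don't have yet could be sketched roughly like this. It is not what we run in production; `batch_by_length` is a hypothetical helper that just sorts pending sentences by word count and slices them into batches.

```python
def batch_by_length(sentences, batch_size):
    """Group sentences into batches of similar word count to reduce padding."""
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


queue = [
    "A long sentence that needs quite a lot of padding otherwise.",
    "Short one.",
    "Another fairly long sentence with many more words in it.",
    "Tiny.",
]
batches = batch_by_length(queue, batch_size=2)
for batch in batches:
    print(batch)
```

The hard part in a real system is not the sorting but deciding how long to let requests wait in the queue before a batch is considered full enough.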


I'm a PhD student at Caltech, working on the theoretical foundations of ML. I personally don't do a lot of coding, but basically everyone in my department uses Python (especially Pytorch) for deep learning/ML. This all runs on Nvidia GPUs (never seen an AMD GPU in the office). Occasionally people code in Matlab, especially if they work in optimization or control. Tmux and git are the only command line tools I see commonly used. Occasionally people ssh into an Amazon box if they need more compute.


You didn't really describe a stack. Which is fine, because academic research usually doesn't really reuse code ;)

A proper ML stack is something like:

- Data format in X schema

- Model trained on Y library/platform

- Evaluated and tested using Z

- Serialized in A format

- Stored on cloud B

- Deployed using C

- Versioned using D

- Real-time monitoring using E


I'd recommend exporting the R model as a PMML file, and getting your Java team to interact with an Openscoring server.

PMML is a language-agnostic model specification (XML-based). The Python and R machine learning ecosystems can easily generate these (caveat: I've only tried this for gbdt and linear models, and I'm not sure it works well for neural nets).

Openscoring is a Java library that creates a REST API for scoring models. It's lightweight and battle-tested, with a nice API and good model versioning, and in my experience it is 10x faster than Python Flask. You don't need to write any Java code: just download and run the .jar and post valid PMML to the right endpoint.

Another feasible approach is SageMaker deploy: code from a Jupyter notebook can deploy an API in one line. I think this can be less economical and have higher latency if you have high usage, but a data scientist can do model updates from within a notebook.

Please NEVER hardcode regression model coefficients within Java. This is a nightmare to maintain, prevents increasing model complexity and is no simpler than PMML + openscoring. I think you can wrap the Java PMML library in another Java web framework like spring if you need something more bespoke.

https://www.rdocumentation.org/packages/pmml/versions/2.1.0/...

https://github.com/openscoring/openscoring

https://aws.amazon.com/blogs/machine-learning/using-r-with-a...


Looks very interesting. Will definitely explore in this direction.

> Please NEVER hardcode regression model coefficients within Java.

Amen to that.


If you aren't wedded to R, then pickling an sklearn Pipeline and loading it in a Flask app can be nice. An advantage of this is that data pre-processing can also be included in the sklearn Pipeline.

https://scikit-learn.org/stable/modules/generated/sklearn.pi...

The bit I'm not sure about how to do well is model monitoring.


We have a bit of a problem like what you mention. Our backend/app is in Java but the DS/ML team generally works in python. The ML team basically doesn't ship production code.

Here are the artifacts we produce:

1. For new models we often build demo endpoints/glue code written in python/flask that can be compared against the prod output in dev/psup.

2. Deep learning models (much of what I do personally): saved in TF saved model format. If it is an update to an existing model, often it is just a drop-in replacement. If it is a brand new model, I will often include a flask demo (the python code does proper data transformation before calling on tf). On the production side, after testing/regression these models are deployed via tensorflow-serving containers and used as gRPC endpoints. For production, whatever data pre-processing needs to be done is written by the backend team, who compare the preprocessing output with our demo.

3. Logistic regression/tree models: again, for new models we provide the demo but what goes into production are either csv (logistic regression) or json (tree) of the weights/decision boundaries which are used as resources by the backend team's Java code.
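
As a rough illustration of item 3, this is the kind of CSV hand-off and scoring logic involved for a logistic regression, written in Python rather than the backend's Java. The column names, the `(intercept)` row, and the helper names are assumptions for the sketch, not our actual file format.

```python
import csv
import io
import math


def export_weights(weights: dict, bias: float) -> str:
    """Serialize logistic regression coefficients as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["feature", "weight"])
    writer.writerow(["(intercept)", bias])
    for name, weight in weights.items():
        writer.writerow([name, weight])
    return buf.getvalue()


def load_weights(csv_text: str):
    """What the consuming side does: parse the CSV back into weights + bias."""
    rows = csv.DictReader(io.StringIO(csv_text))
    table = {row["feature"]: float(row["weight"]) for row in rows}
    bias = table.pop("(intercept)")
    return table, bias


def predict_proba(features: dict, weights: dict, bias: float) -> float:
    """Logistic regression score: sigmoid of the weighted sum plus bias."""
    z = bias + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


csv_text = export_weights({"age": 0.5, "clicks": -1.0}, bias=0.5)
weights, bias = load_weights(csv_text)
probability = predict_proba({"age": 1.0, "clicks": 1.0}, weights, bias)
print(probability)  # 0.5
```

The demo endpoint and the backend both have to apply the same feature transformations before this step, which is why the demo-vs-backend comparison in the flow matters.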

The overall flow is:

ETL (via apache airflow/custom code) => model training/feature engineering => (saved model file + flask demo endpoint/documentation on feature transformations) => dev incorporate model/test into java backend => comparison of demo vs java backend => regression of java backend (if they had previous versions of model) => psup (small amount of prod data duplicated and run in parallel with prod) => prod (model deployed + monitored)

There is a caveat that we also do some batch processing/not really live analysis that is just done in python and then results are pushed wherever they need to be pushed. In this case we don't involve the backend/java team.


BentoML(https://github.com/bentoml/BentoML) may help you with the process of building endpoints with both Deep learning models and logistic regression/tree models, and it automatically helps you to containerize the API server into docker image that's ready for production deployment.

It also provides OpenAPI spec for your API endpoint, which allows you to generate API client in Java, for your backend/app teams.


FYI. One way to automatically generate API clients is to use OpenAPI Generator (https://github.com/OpenAPITools/openapi-generator), which is free, open-source and supports 30+ programming languages.

Disclosure: I'm the top contributor to the project.


So we've been using swagger, is there some comparison between OpenAPITools and swagger code gen?



Currently:

Models and feature engineering done in python, trained locally, weights uploaded to S3. Dockerfile with a tiny little web server gets deployed through our CI/CD pipeline for serving.

Soon: Argo workflows + Polyaxon for data collection, feature engineering, training etc. Push best model to S3, same CI/CD process with docker container deploys little web server onto our Kubernetes environment.

Deep learning stuff will probably use a similar setup, but with PyTorch instead of Sklearn. Would like to look at serving with ONNX exporting.

When the Julia packages evolve a little more, will be looking forward to using that in production.


Glad to see that you are interested in using Polyaxon[0] for your MLOps. I was going to write a blog post about the upcoming v1.0 release of Polyaxon, but I just wanted to point out that there will be native support for different types of workflows. Currently it supports parallelism and distributed learning, and the next release will add native support for DAGs as well. Here's a test fixture[1] of what a DAG workflow will look like in Polyaxon.

Happy to answer any question or provide more information.

[0]: https://github.com/polyaxon/polyaxon

[1]: https://github.com/polyaxon/polyaxon/blob/master/cli/tests/f...


This looks incredibly exciting. What's the story around what happens after training/validation? I can't see anything on your site specific to this - do you currently (or plan to) offer anything to help track or version model "releases"?


Since it's possible to create custom components, one can, for instance, create a component for packaging models. The component could extract the model from a path (mounted volume or blob storage) and create a Python package with a requirements file and a Flask-based microservice. Every time a user wants to promote a model to production, she can run this component, or add it to a workflow to be triggered after a training or hyperparams operation.

From Polyaxon's side, we will be providing a set of reusable components. Some of these will target deployment and packaging, for instance AWS Lambda, Azure ML, SageMaker, and open source projects for deployment (some of them are mentioned in this HN thread). Users can also contribute components or create them for use inside their organization.

For versioning, all components have versions, and all runs have a full overview of their dependencies and provenance (inputs/outputs). This gives all the information about when and how a model was created.

The platform knows how to create and manage services, e.g. tensorboards, notebooks, dash, simulators for RL agents ..., so one can also deploy the model as an internal Polyaxon service to be used as an internal tool or for testing purposes. The only thing to keep in mind is that the API endpoint will be subject to the same access rights as other components created by a given user.


Any inputs on Argo workflows vs Kubeflow vs MLFlow? Which is better suited?


I will try to be as objective as possible answering this question, since I am working on a project in a competing space.

Argo workflow is a pipeline engine that is cloud and Kubernetes native. It tries to solve graph and multi-step workflows using containers on Kubernetes. It can be leveraged for ML pipelines as well as other use-cases.

Kubeflow is a large project that has several components: training operators, serving (based on Istio and Knative), metadata (used by tensorflow TFX), pipelines, ... and integrates with other projects. Kubeflow pipelines is using Argo workflow as a workflow engine, although I think there are efforts to support other projects such as Tekton which is also a google project, and possibly TFX as a DSL for authoring pipelines in python.

The main focus for MLFlow, I think, is tracking ML models and providing an intuitive interface to model deployment and governance. The main strength of MLFlow is that it's easy to install and use.

Polyaxon has been used mainly for fast development and experimentation. It has a tracking interface and several integrations for dashboarding, notebooks, and distributed learning. Polyaxon also has native support for some Kubeflow components, e.g. TFJob, Pytorch job, MPIJob for distributed learning.

The upcoming Polyaxon release will provide a larger set of integrations for dashboards. In addition to tensorboards, notebooks and jupyter labs, users will be able to start and share zeppelin notebooks, voila, plotly dash, shiny, and any custom stateless service that can consume the outputs of another operation.

The new workflow interface focuses mainly on an easy declarative way to handle DataOps and MLOps. The main idea is to provide a very simple interface for the user to go from a data transformation to training models. Since the component abstraction is based on containers, it can be used to do other operations, e.g. packaging models and preparing them to be served on other open source projects, cloud providers, or lambda functions. Operators for frameworks such as dask, spark and flink could also be used as steps in a workflow, ...

For hyperparameter tuning, the platform currently has grid search, random search, hyperband, and bayesian optimization. One of the major changes in the next release is a new interface for people to create their own algorithms, plus a mapping interface to traverse a search space provided by the user or based on the output of another operation.
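For readers unfamiliar with the first two strategies named above, here is a minimal standard-library sketch of grid search and random search over a discrete search space. It is a generic illustration, not Polyaxon's interface. The search space, the toy objective and the function names are all hypothetical.

```python
import itertools
import random


def grid_search(space, objective):
    """Evaluate every combination in the space and keep the lowest score."""
    names = sorted(space)
    best = None
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


def random_search(space, objective, n_trials, seed=0):
    """Sample n_trials random combinations and keep the lowest score."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


def toy_objective(params):
    # Stand-in for a validation loss; minimised at lr=0.1, depth=4.
    return (params["lr"] - 0.1) ** 2 + (params["depth"] - 4) ** 2


space = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
grid_best = grid_search(space, toy_objective)
random_best = random_search(space, toy_objective, n_trials=5)
```

Grid search is exhaustive, so its cost grows multiplicatively with every added parameter. Random search caps the cost at `n_trials` but may miss the best combination.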


Kubeflow, unless I’m missing some things, is for Tensorflow pipelines, if you’re not using TF, or it’s not the only thing you use, it’s not ideal.

I thought MLFlow was a Spark thing, and we’re trying to migrate off of Spark/Databricks due to the resource inefficiencies of Spark (at our scale) and the maintenance nightmare that Python notebooks are causing us.

Argo is just a container workflow tool, not ML specific. We’re planning on using Argo for the data engineering parts, and polyaxon for the ML training parts because of the convenient monitoring and hyper parameter search tools.


Hi! Co-founder of Kubeflow here - definitely not TensorFlow only! You can see ([1]) many many different repos and operators. The nice part about Argo for us is it let us build an ML specific DSL that was also Kubernetes native.

[1] https://github.com/kubeflow/


What about the A/B testing? What do you use for A/B strategy. How many predictions are being served by the model per second?


For most of the stuff we’ve deployed, we’re not yet operating at a scale/level of interest where A/B testing is worth it. Additionally, the purposes we’re using most of these models for don’t really necessitate A/B testing.

When we do need A/B testing, we’ll probably use something like Seldon. As for predictions/second, not very much at the moment: 1 per 30 seconds maybe? It’s not deployed into a Kubernetes cluster because of scaling requirements; it’s because that’s where all our other services get deployed to, and it’s more beneficial (ops and cost wise) to also deploy there than to bother with a separate workflow for deploying to Lambdas or SageMaker.


So how do you know if a new version of a model is better than the existing serving version?


As currently the only person doing data science things for the team, I’ll test to make sure changes I make to model/feature engineering/etc result in a better model. We’re not constantly retraining our models, because our incoming data still behaves the same. We’ve had the same model in prod for 4 months now; we don’t have any pressing issues with its predictions, and looking through the logs of what the input was and the output, it’s still performing as expected, so we’ll probably leave it longer.


I see, so how do you measure the difference between the incoming data and your training data?

Also, it looks like you have a very low volume of predictions?


Background:

We have a team of 6 DS working on parallel projects. We tried the Java service approach. It's great for a one-time model, but very painful to iterate on.

We develop on top of Sagemaker, and since we're a funded company, can somewhat get away with the 40% price increase of "ML instances".

We have a mix of R/Python models. For each, we keep a separate repo with a Dockerfile, build file, and src code.

Training:

If the jobs are small, we train them locally, package assets into the container, and deploy. If it's a bigger job, we leverage Sagemaker training jobs and S3 for model storage.

Serving:

We have boilerplate web service layer with an entrypoint that DS fills in with their own code. Yes, this allows almost arbitrary code to be written, but we do force code reviews and enforce standards. Convention over configuration.

We do the feature engineering using Python/R, which when parallelized, has good enough performance (sub 200ms latency on Sagemaker prod). If we need latencies in the 1-10ms range, we'd consider refactoring the feature engineering into a separate layer written in a more performant language. It's always FE that takes the most time.

One learning from tuning Python services is: for max performance, try to push the feature engineering work onto consumers. Have them fully specify the data in the format and shape your models expect, so you do as little feature engineering as possible in the serving step.
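One way to enforce that contract is a strict schema check at the service boundary, so model-ready payloads pass straight through and anything else is rejected before it costs compute. The sketch below uses only the standard library. The schema, field names and helper functions are hypothetical.

```python
# Hypothetical contract: consumers send exactly these fields with these types.
EXPECTED_SCHEMA = {
    "age": float,
    "num_purchases": int,
    "country_code": str,
}


def validate_payload(payload, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the payload is model-ready."""
    problems = []
    for name, expected_type in schema.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(payload[name], bool) or not isinstance(payload[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    for name in payload:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


def to_feature_vector(payload, schema=EXPECTED_SCHEMA):
    """Order the validated fields as the model expects; no transformation happens here."""
    problems = validate_payload(payload, schema)
    if problems:
        raise ValueError("; ".join(problems))
    return [payload[name] for name in schema]


good = {"age": 31.0, "num_purchases": 4, "country_code": "DE"}
bad = {"age": "31", "num_purchases": 4}
vector = to_feature_vector(good)
bad_problems = validate_payload(bad)
```

The serving step then only reorders fields. Casting, imputation and encoding stay on the consumer side.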


Data ingest: AWS Lambda (JavaScript) using the Serverless Framework to Kinesis Firehose.

Continuous Training: Lambdas triggering SageMaker BYO training jobs.

Continuous Model Deployment: Lambdas polling SageMaker training jobs to update SageMaker BYO endpoints.

Inference endpoint: Lambda proxying to a SageMaker endpoint.

Our BYO endpoints allow us to do batch inference without a bunch of round trips between Lambda and SageMaker.

If endpoint costs get too high, we’ll implement some caching at the lambda layer.
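A minimal sketch of what caching in front of the endpoint could look like, using an in-process LRU cache. `invoke_endpoint` is a hypothetical stand-in for the real SageMaker call, and the call counter exists only to show the effect. An in-memory cache like this only helps while the same warm Lambda container is reused. A shared cache would need an external store.

```python
from functools import lru_cache

# Counts how often the (paid) endpoint is actually hit.
CALLS = {"count": 0}


def invoke_endpoint(features):
    """Hypothetical stand-in for the network call to the model endpoint."""
    CALLS["count"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features):
    # lru_cache keys on the arguments, so features must be hashable (a tuple).
    return invoke_endpoint(features)


first = cached_predict((1.0, 2.0, 3.0))
second = cached_predict((1.0, 2.0, 3.0))  # served from the cache
third = cached_predict((4.0, 5.0, 6.0))   # new input, endpoint is called
```

This only pays off when identical inputs recur, and cached entries should be cleared (`cached_predict.cache_clear()`) whenever a new model version is deployed.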


For our current projects we use our open source Apache Spark framework Arc (https://arc.tripl.ai/) for feature prep then depending on the type of model we will either:

- use builtin Spark ML models

- call a model running as a service

- write files for a model to ingest (for a legacy project)

- develop a custom plugin or UDF (for calling via SQL)

We have built in stages for running Spark ML models in the framework as well as HTTP and Tensorflow Serving stages to call services. We recently ran a series of models for NLP that were in Python and Ocaml via the HTTP stage, sending payloads either in JSON or other formats that the services needed. The text extraction via OCR (tesseract) had been done as a prior Spark stage. This design allows us to call these more custom ML models but keep them part of a larger Spark job and use SQL and other features when needed. The services were deployed in AWS Fargate to allow for scaling. For other jobs we are deploying our Arc jobs using Argo for orchestration. We spin up compute on demand vs running inside a persistent cluster.

For training we use Jupyter Notebooks where possible. We have a plugin that generates Arc jobs from these notebooks.

For special cases we can add custom plugins or UDF functions to extend the framework. I have done similar plugins to run XGBoost models in Spark for example.

Whilst we try to be prescriptive around the ML stack for Data Scientists, this approach has allowed flexibility where needed and lets different teams own their part of the job. This is particularly useful in larger teams where development is more federated.


We use self deployable configuration to allow data scientists to control the model's destiny.

Models are written in Python (a mix of PyTorch, NLP libraries, and TensorFlow) and serve about 35 predictions/second on average. The API server is also written in Python. The API server container writes incoming requests to a distributed queue cluster, and the models pick up samples from the queue in batches. This lets us experiment with different flavors of a model based on the routing that is set at deployment time, which in turn is stored in the cache. We use AWS managed cache, queuing, and container orchestration platforms. Next: 1) Our training and production pipelines are currently two separate pipelines, which we want to combine, possibly using MLFlow, Airflow or KubeFlow. Deployment to production is done through Jenkins. 2) Active retraining and auto deployment to production. 3) Tie the version of the model in production to the model being trained. Currently there is no way for us to tie back the version.
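The "pick up samples from the queue in batches" step might look roughly like the sketch below. It uses Python's in-process `queue.Queue` as a stand-in for the distributed queue, and `predict_batch` is a hypothetical placeholder for the model. The batch size and timeout are arbitrary.

```python
import queue


def drain_batch(q, max_batch_size, timeout_s=0.05):
    """Wait briefly for one item, then take whatever else is ready, up to the cap."""
    batch = []
    try:
        batch.append(q.get(timeout=timeout_s))
        while len(batch) < max_batch_size:
            batch.append(q.get_nowait())
    except queue.Empty:
        # Queue ran dry: return a partial (possibly empty) batch.
        pass
    return batch


def predict_batch(samples):
    """Hypothetical model: one call scores the whole batch."""
    return [len(sample) for sample in samples]


requests = queue.Queue()
for text in ["a", "bb", "ccc", "dddd", "eeeee"]:
    requests.put(text)

first_batch = drain_batch(requests, max_batch_size=3)
second_batch = drain_batch(requests, max_batch_size=3)
third_batch = drain_batch(requests, max_batch_size=3)
predictions = predict_batch(first_batch)
```

Batching trades a little latency (the wait for the first item) for fewer, larger model calls, which usually matters most for GPU-backed models.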


We developed an OSS ML Server called Harness. It does all the ingest, prepare, algorithm management, and workflow bits for pluggable ML "Engines". These are Algorithms + Datasets + Models and are flexible enough to do almost anything. We use the built-in Universal Recommender Engine, and have built our own for other uses.

Harness exposes a framework for adding Engines and does all the routing for Engine Instance workflow and lifecycle management. It also provides a toolbox of abstractions for using the Spark ecosystem with Mongo and Elasticsearch.

It comes as a docker-compose system for vertical scaling and on Kubernetes for the ultimate in scaling and automation. Quite a nice general system with out of the box usefulness.

https://github.com/actionml/harness


We are currently experimenting with workflow engines for orchestrating different components, e.g. data ingestion, data preprocessing, feature engineering, scoring, automated decision making / escalation. Namely, Argo for offline processing, and bulk processing; Zeebe, and Cadence (trying both out) for online processing, and business logic / application services.

We don't yet have a polyglot architecture, but we do have the requirement of running distributed services (partly because there are certain components of the pipeline that need to be run on-premise), and we have found that workflow engines / orchestrators definitely make it a lot easier to reason about the wider architecture / have a bird's eye view. It works for us. No need to handle callbacks, events, queues, etc. We also do have the potential to run a polyglot architecture.

We tried out Celery Workflows, and struggled to get it "production ready", so I'd advise against this for complex workflows. We also found the visibility lacking.

We have yet to fully try out Kubeflow, and MLflow. What is not quite working at the moment is creating, and deploying portable models. And I don't mean simply pickling, and storing an artifact.

Leveraging containers (Docker), and slapping simple anti-corruption layers (e.g. simple web APIs) has also helped. We have a more consistent way of deploying, and isolating code without having to rewrite much.

We want to look into using Nuclio, and/or knative to ease the process of deployment, and to empower the data scientists to deliver without much engineering expertise.

Others have mentioned using base classes or standard interfaces for their models. We tried this too, but it didn't work. The generalisation early on was met with conflicting requirements, and broke the interface segregation principle (not that it matters too much, but it can be confusing to not know precisely what is being used or not used). We figured it's much easier to procrastinate any abstractions. Let the data, and its flow do the talking.


Full stack Julia for RL (and trajectory optimization)


This sounds pretty awesome. Do you have any write-ups that explain more?


After wrapping the Mujoco physics engine in Julia, I did a few reinforcement learning / robotics control projects, combining it with the Flux.jl library for neural nets.

While I never made my research code public, https://openreview.net/forum?id=SyxytxBFDr documents a similar setup to what I use.


Plotly's Dash to prototype front ends that ingest/use the model output (Our team is Python only, but others use R, so this works great, because it supports both.) https://dash.plot.ly/

FastAPI for quickly creating new API endpoints. It has automatic _interactive_ docs and super simple data validation via Python typehints, so that we don't waste compute time with malformed data. https://fastapi.tiangolo.com/

We deploy on prem most of the time, but have started using GCP on occasion.


> For example, how did your backend team(using Java?) work in sync with data teams

I guess we are a bit of an outlier, but we deploy the ML using Java / JVM. Not really in the same league as others here so the models are simple enough that the various Java ML frameworks are fine for it (DL4J, Smile, etc). We even do a lot of the interactive exploratory / training type work on the JVM (though via Groovy and Scala with BeakerX notebooks [1] - sometimes combined with Python and R).

I think as the field matures a lot more could move to this model.

[1] http://beakerx.com/


We once had a team member re-write their R code in Java, using weka just to avoid too much hassle for the back-end team. So I guess you're not alone!


crystal / shainet (https://github.com/NeuraLegion/shainet)

I contract for some clients in fintech and some defense-related stuff.


What kind of targets is crystal used on? Strictly x86_64 linux?

EDIT: Took a little digging but I found it [1] (and yes it's primarily x86_64 linux+macOS)

[1] https://github.com/crystal-lang/crystal/wiki/Platform-Suppor...


Windows will probably be done in a year, but it's hard to say. Hardest part is finding people who want to work on windows support xD


No kidding. (I wasn't complaining btw nor was I even focused on OS support, I was curious about which arches people use it on).

What are the popular use cases - either ones driving current development or ones where it's best suited?


Right now people mainly use it for web apps, sometimes data processing. With windows support, I think there would be a rush to make cross platform GUI apps.


I'm going to have a hard time convincing my team to adopt a completely new language :)


Is there a specific reason why you chose shainet/crystal?


I love crystal and I have a lot of autonomy so I get to use what I want. Shainet is also really good in general for coming up with new network topologies, which I find is a lot harder to do in a lot of the python-based mega-frameworks, Caffe, etc. In general, fibers are really good for concurrent/parallel data processing, and now that crystal has true parallelism (enabled with a flag), for my purposes there is no reason not to use it.

Occasionally I will use rust or C/C++ for some of these tasks, but I try to keep things in crystal whenever I can.


Development of models in our data environment: notebooks, pyspark EMR clusters for analytical workloads and offline models, tensorflow/EC2 P2s for online models.

Jobs are scheduled (Azkaban) for reruns/re-training and pushed from data env to the feature/model-store in live env (Cassandra). Online models are exported to SavedModel format and can be loaded on any TF platform, e.g. java backends.

Online inference using TF Serving. Clients query models via grpc.

A lot of our models are NN embedding lookups, we use Annoy for indexing those.
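For context, an embedding lookup boils down to "find the stored vectors closest to a query vector". The sketch below is the exact brute-force version in plain Python, which is the baseline that an approximate nearest neighbour index such as Annoy speeds up. It does not use Annoy's API. The item ids and vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(embeddings, query, k):
    """Return the ids of the k embeddings most similar to the query (exact search)."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(item[1], query),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical 2-d item embeddings; real ones have far more dimensions.
embeddings = {
    "item_a": [1.0, 0.0],
    "item_b": [0.9, 0.1],
    "item_c": [0.0, 1.0],
    "item_d": [-1.0, 0.0],
}
top2 = nearest(embeddings, [1.0, 0.05], k=2)
```

Brute force scans every vector per query, which is fine for thousands of items but too slow for millions. That is where an index becomes worthwhile.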


I'm pretty impressed with the level of automation I'm seeing in general. Looks like many are using docker/k8s or containers in some way or another. Inspiring.


It's pretty horrifying to me. Why are we adding layers and layers of abstraction that does not really make life easier?


We're currently running a single NVIDIA RTX2080 with Tensorflow 2.0 on a Windows 10 station. We'll soon be switching to a standard multi-GPU rig running an air gapped Linux distro. Linux seems overall much better for ML because of better Docker integration and tensor core support on the newer GPUs. Also, we'll probably be switching from Tensorflow to Pytorch for model development. Pytorch requires a little bit more code, but debugging is 10X easier.


Why airgapped? Is it a business/security requirement? If you have to share the machine, does everybody have to thumb drive over their files to run with the big GPU?


Yes. The air-gapping isn’t ideal, but trying to get our IT org to accommodate a Linux workstation on the network just isn’t worth the hassle.


And how do you get data on the fly [prediction phase]. Do you have an API call you make to get data that your ML algorithm can munch on?


Hi Aaron, I was having similar issues. My main problem was integration of different programming languages and tools under the same roof. So I started with my own ML platform. Currently C# is supported, but there are other ones on the roadmap (Python, R, Nodejs...). You can check it here: https://github.com/Zenodys/ZenDevTool


How do you all monitor concept drift, or otherwise detect when it's time to deploy a fresh model?


Wondering the same thing!


On a meta note, if you're interested in viewing & sharing stacks, that's the primary feature of the startup I'm working on: Vetd (app.vetd.com). The communities we host (often VC portfolio companies) share their stacks and leverage for discounts.

- CTO (chris at vetd.com)


Python (the Conda distribution, not small but quite easy and batteries included), Keras, Tensorflow, NVidia hardware (GTX1080ti, not sure what the current sweet spot for price/performance in GPU land is but that was the best I could get at the time).


I'm also curious—and maybe someone here can chime in—about how you get organizational buy in for introducing ML. There are a couple of problem areas at my company that I think would be great for ML, but I don't know how to get others onboard.


Someone here wrote a POC. They happened to be fairly high up, not C-level but well respected. Basically, I can save x% cost using this. You can bootstrap that if you're not high enough to sit at the table, just need to convince someone who does.

I appreciate that this approach won't work for everyone.


For my pet projects, do training and testing locally on my machine either using Notebooks or on an IDE. Test and validate it further on my local machine before deploying it on a server as a micro service. This is for my pet projects only.


We used TensorFlow Serving running in a docker container that is a rest endpoint for predictions. Then the backend can be agnostic to Serving. We personally used Node to query the container for whatever we needed.


our team built a Postgres backend on a single physical server (with a lot of good disk in RAID and others), python ran against the database, calling sklearn libs .. (skipping problem specific libs involved, but they added to either PG or the python side) Worked really, really well, easy to work on .. good architectural separation of stages in the process. Completed and shipped to an impressed customer. No GPUs


scikit-learn + Optuna for hyperparameter search + Python + Docker + AWS Batch/AWS Fargate. We are going to incorporate XGBoost or LightGBM at some point. The AWS services have some slowness in starting up, but the pipelines are not for online services so it works.


why not write the ML model in Java?


As someone slightly involved in the Deeplearning4J[1] project it always surprises me that more people don't consider this option.

[1]: https://github.com/eclipse/deeplearning4j


One way is to obviously go all out Java - definitely makes things streamlined. But not all team members are familiar with Java. Especially not ones formally trained on data science - who tend to work with R/python etc. At least that has been my experience.


Because all of the large tools are in Python. Tensorflow, PyTorch, OpenCV, Numpy, Pandas, etc.


The OP mentions that they are currently working in R.


The better question, is why Java? I don’t think I’ve ever encountered any company or person to use Java for ML. Scala yes. Clojure, surprisingly, yes. Java, no. Not to say they don’t exist, but it’s not a good idea. The ecosystem isn’t there, and the language (I want to say sucks), isn’t there either.


Where did you see clojure being used in production for ML? I am curious, because I am a clojure dev


We have used clj-ml[0], which wraps a bunch of Weka stuff, as well as the XGBoost JVM bindings. I've used Bayadera[1] and Dragan's other tools for linear algebra, although not in production. I was always sad that Incanter didn't really go anywhere, but at this point I wouldn't be surprised if Clojure became a respectable platform for data science, especially given Clojurists Together's funding focus this quarter[2].

0: https://github.com/shark8me/clj-ml

1: https://github.com/uncomplicate/bayadera

2: https://www.clojuriststogether.org/news/q4-2019-funding-anno...



We use our own product https://PI.EXCHANGE

We deploy the models as HTTP endpoints and consume them in R/Python/Excel.

We also have advanced functionality available to enterprise clients that are exposed as APIs with a customised JSON format to trigger various agents.


My company is considering licensing DataRobot.

It looks to be a time saver.


OpenCV + dlib + CUDA or caffe2 as alternative


Idk i would love good answers here



