Hacker News new | comments | show | ask | jobs | submit login
Ask HN: What does your production machine learning pipeline look like?
339 points by bayonetz 201 days ago | hide | past | web | 103 comments | favorite
Doing some design for an upcoming project and taking a survey.

I'll go first. Model training happened nightly on a Spark cluster. This output a PMML-based SVM model. The model was instantiated on a cluster of compute servers running Openscoring. A thin Node web service wrapper used the Openscoring cluster to serve realtime client prediction requests. Dataset size in the hundred millions of examples with hundreds of features. Handled thousands of requests per second, no problem.

Separating the training technology from the execution technology was nice but the PMML format is limiting in the kinds of models you can use that both you trainer and executor will support. What are people doing who use same tech for both? For something like Tensorflow, I assume you must have to save the model as binary from the train step and then send it off to the prediction cluster to be instantiated again for execution?

(ML engineer from Dropbox here)

For deep learning oriented projects, we train on EC2 GPU instances, generally p2.xlarge instances these days to get the Nvidia K80 GPUs. We can spin up many of these in parallel if we are doing model architecture searches or hyperparameter exploration.

We have an in-house data turking setup where we can efficiently roll new UIs to get ground-truth data for given problems, generally in the thousands to tens of thousands of real data. We also use data augmentation where possible, synthetically generating millions of example data points, combining it with the real turked data for fine tuning. Note that we never look at or train with real user data, unless explicitly donated, so data efficiency is important to us.

We've standardized on TensorFlow these days, doing inference on CPUs currently on Dropbox's compute infrastructure. We have a jail setup based off of LXC and Provost that allows us to safely execute these trained models in a contained way, responding to user requests as they come in. We use standard distributed systems plumbing for RPCs, queues, etc. to respond to user requests and route them to trained models. We version our trained models, and have an in-house experiments framework that we use to deploy new models and test them against subsets of user traffic to see how they are doing as we ramp up a new model.

Most of our day-to-day work is in Python, with occasional use of C++; other parts of Dropbox sometimes use Go and Rust, though we haven't had need for that on the ML team. Note that Dropbox is one of the largest users of Python in the world (Guido van Rossum actually works here).

BTW, the Machine Learning team at Dropbox is hiring. Come join us! Details: https://www.dropbox.com/jobs/listing/533100

I'd also be really interested to learn how you generate synthetic data. Is it a 'simple' case of manually generating as much as possible and then using the statistics of this to bulk it out? Or are you using something more complex to augment it?

The general class of model that you can use to create synthetic data is called a generative model [1]. The style of learning that pits one of these generative models against a discriminative one (the class you can't generate data from, including traditional neural networks) is called adversarial learning [2]. It's worth noting that normally you can use generative models for classification on their own as well.

[1] https://en.wikipedia.org/wiki/Generative_model

[2] https://en.wikipedia.org/wiki/Adversarial_machine_learning

One example of synthetic data generation was for our OCR project. We took a corpi of word choices (Project Gutenberg, modern books, the UPC database for receipts, etc.), took several thousand fonts, and combined it with geometric transformations that mimic distortions like shadows, creases, etc. to bootstrap millions of fake OCR like scannable documents.

We aren't using GANs yet, but are definitely keeping an eye on them. Work like InfoGANs which has the GAN learn a ground-truth like label are very promising, but GANs don't yet work at the image sizes necessary to really make this promising. I do think in the next year or two we will see these problems solved and GANs will become an integral part of synthetic data generation.

Ooh, neat! I've used GANs to generate synthetic temporal sequence data for training electrophysiologic (i.e., EEG, EMG) signal decoders. In fact, I wrote up the section of my dissertation on this topic today! In my experience it worked quite a bit better than other generative techniques (I've used convolutional variational autoencoders in the past for this and had so-so results). Looking forward to seeing what you guys do with this!

I am surprised that you are using AWS. It's generally cheaper and faster when you are at massive scale to build your own. Why do you do it this way?

in-house data turking setup

Can you clarify: Are you are turking with customers inputs or with internal team members?

synthetically generating millions of example data points

Are you implementing this via GAN or hand tuned?

You're right about scale, but I think forgetting that the model training is on very different hardware than what production web and storage servers use.

Unless the have the need for dedicated GPU hardware at scale paying by compute unit still makes the best business sense.

Disagree. That's the whole point of me asking. It's cheaper and will be better tailored to build your own GPU stack if you need it at scale than it is to "rent one" from the cloud providers.

I think it comes down to comfort and how critical it is to business processes.

I am studying ML this semester at university. I would like to get some ideas on where does a company like DropBox use ML. Can you state some problems that the team is working to solve or use cases?

Two examples of where we use machine learning at Dropbox is corner and edge detection in our mobile docscanner (https://blogs.dropbox.com/tech/2016/08/fast-and-accurate-doc...) and computer vision for OCRing.

Does Dropbox do OCR-ing? I couldn't find how to do that in the mobile app. I love the scan upload feature.

I really hope you are employing world class security folks and not gonna have another password leak. I have found after the scan feature, I'm uploading more of my paper life.

When using the Docscanner, we do OCRing of the images for Dropbox for Business users only right now. The OCRed Docscanned PDF gets copy/pasteable text, and its search tokens go into the search index for later searching. We only offer full text search for Dropbox for Business customers as well right now, not Pro.

I'm curious, how (if at all) do you verify how well your models are performing on the user's data?

What the heck is data turking?

Using something like Mechanical Turk to manually add ground truth data to something, though we don't use Mechanical Turk but rather other providers.

What are other providers that you use? Since you are using it as input to your training set, it should have fairly high accuracy (even more accurate than manual errors in tagging)

Ditto, are there any providers you are allowed to mention that you have had good experience with?

I think they mean manually labeling data en-masse. Maybe tagging photos with a description of what's in it, as a common example.

"Turking" is probably a reference to Amazon's Mechanical Turk service, where you can to get lots of low-effort work done by people.

I guess crowd sourcing ground truthing/generating more validated data with mechanical turk?

The crowd sourcing aspect is accurate, but I think the turk half is used more generically past the actual Amazon browser GUI experience tool to whatever annotation tool one has available or has invested development in. Stuff like https://www.crowdflower.com/

One other thing worth pointing out is why you can't always lean on the Amazon Turk per se because some of the tasks you need to train from require specialist knowledge that takes years to hone. It is not a task for the common layperson who can spot a cat in an image or anything else that exploits "common knowledge". In the case of the company I'm at we had to pay contract work for these specialists to come in house and operate our annotation tools to build up our ground truth data.

Yep, it's just a historical reference to Mechanical Turk since that was one of the first of such services. You are probably not going to be able to get minimum wage online workers in 3rd world countries to annotate medical images.

What was your data set like?

It was basically a bunch of microscope images of a certain kind of tissue.

Ditto on not being able to rely on some mechanical turk services - although the quality of answers/labels you get back inevitably depends on how well you have laid out the task at hand, we have additionally witnessed poor quality output that was outright rogue - tasks being completed in sub seconds or same answers given by a user no matter the question - both pointing to tasks being completed by bots not humans.

In fact we had resolved to building a turker bot detector and started rejecting tasks completed in suspicious ways. Only once we have built ourselves a trusted turkers population did we start to get quality data back. I suspect most people don't bother to go back and reject poor quality answers and that is why the bots survive.

Wow, I believe it. We didn't have so much a rogue situation, but you really do have to constrain their actions to just what you wanted. I tried to find the source but there was some YouTube video I watched where the guy made this good comment about creating GUIs where you have to put a real emphasis on preventing users from doing things you do not want them to do. You can't always focus on features but also constraints. I really took that message to heart after experiencing some of the human unpredictability found in building up training data. It made for some interesting payment debates to ask them to redo some work that was incomplete. Fun stuff.

> we never look at or train with real user data, unless explicitly donated

What is the rationale behind this? Why not train?

I can answer this. I'm a paying Dropbox customer and I didn't donate my data. The information is mine, and we can discuss using it, but my expectation of privacy includes Dropbox not using it for any purpose that isn't storing and serving it to me unless I've expressly stated otherwise.

While I'm not that interesting of a person, I do have an interest in things like my tax returns, financials, and other PII to not be in the hands of Dropbox employees.

Short answer: Dropbox is doing the right thing, and I really appreciate that.

Couldn't user data make it into the model, and be reverse-engineered from it, if you used it to train the model?

I mean, machine learning models aren't designed as one-way cryptographic hash functions, so you can't be sure that some of the training data can be inferred through the model's behavior.

Yes! there is an interesting paper on this subject: https://arxiv.org/abs/1609.02943

Preserving users' privacy first and foremost is absolutely the right thing to do here. I'm really glad Dropbox is willing to trade away a bit of model accuracy for better user privacy.

Because the ML models might start reproducing snippets of user data.

Perhaps they don't have the metadata for the data?

What does Dropbox use machine learning for?

See my reply to dnt404 for two examples. We have nearly an exabyte of data at Dropbox ( https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr... ); there's lots of interesting ML things we can do with that, whether its searching, suggestions, etc.

You have an exabyte of other peoples private data that you're mining and essentially exporting to other files?

Am i misunderstanding or are you admitting to severely abusing your users trust?

See my original parent comment about not touching user data for privacy reasons :)

He says they don't use user data without explicit permission and they generate the training data using "turking," which means paying people to classify things manually.

Are there any public posts discussing how you augment data?

Also, what do you use you spin up multiple jobs in parallel using EC2?

No public posts yet unfortunately. For spinning up jobs we have an internal tool that can spin up machines and provision them, but in my own experiments I just usually end up spinning up machines and rsyncing experiments over.

Ah, fair enough. Too bad, I'd be really interested in hearing how you do that. I work for a consulting firm as a data scientist, and we have a lot of issues with data size. I'm aware of the tricks for image data (rotating & flipping, playing with different gamma levels), but I haven't seen much for standard, tabular style data.

Fraud detection at GoCardless (YC 11). We use the same tech for training and production classification: logistic regression (sklearn).



- train on an ad-hoc basis, every few months right now moving to more frequently and regularly as we streamline the process

- training done locally, in memory (we're "medium data" so no need for distributed process), using a version-controlled ipython notebook

- we extract model and preprocessing parameters from the model and preprocessors that were fit in the retraining process, dump to a json config file in the production classification repo


Production classification

- we classify activity in our system on a nightly cron*

- as part of the cron we instantiate the model using config dumped from the retraining process. This means the config is fully readable in the git history (amazing for debugging by wider eng. team if something goes wrong)

- classifications and P(fraud) gets sent to the GoCardless core payments service which then decides whether to escalate cases for manual review


* We're a payments company, processing Direct Debit bank-to-bank transfers. Inter-bank Direct Debit payments happen slowly (typically >2 days) so we don't need a live service for classifications.

Quite simple as production ML services go, but it's currently only 2 people working on this (we're hiring!).

When using sklearn, I've seen a lot of folks just pickle the model and use that as the interchange format. I like the human-readable interchange format you are using better. I assume you just rolled your own. Why not something like PMML?

Yep, we made our own. I haven't heard of PMML before - quite cool! What we've made is a bit more readable for what we're using it for though, IMO. Looks like this:

        "intercept": 1.0,
        "features": {
            "feature_1": {
                "coefficient": 1.0,
                "range": [0.1, 10.0],
                "mean_feature_score": 1.0,
                "imputation_value": 1.0

Is this open source? We were looking for something like this.

Sadly not. I'd be totally up for open sourcing if there's clear demand. If you can find it, send me an email at angus@{company_I_work_at}.com

Note that it's very tied down to our use case right now: only compatible with Logistic Regression, and currently it assumes fixed hyperparameters (will change this in future though), assumes a production pipeline of min-max scaling, imputation, then classification.

PMML is fairly verbose and limited to a particular set of models. It's often easier to pickle the models and then keep tagged versions. I think a human readable format could be created, but since most models are just a pile of numbers it's unclear what is gained.

For Logistic Regression we find human readable config makes a lot of sense. It's pretty intuitive if there aren't too many features - if the model starts behaving weirdly, we can sometimes track it down to a change in a single feature using this (especially when viewing recent git diffs).

Sure. I tend to keep my postprocessing of a model under version control. In particular, what features were most helpful for predictions.

Can't really talk about features on here :(

What kind of features do you look at? Obviously I don't expect you to be able to talk specifics, but I'm curious about the generalities.

Also, how did you settle on logistic regression? Have you tried any other models?

Can't really talk about features on here. Any smart fraudster should be watching every single thing I say :)

We're using logistic regression not because it performs the best, but because it's the most understandable. When cases get flagged for manual review people need to know exactly what seems dodgy about the account, and with Logistic Regression you can read the exact contribution from each feature to the final fraud probability. Seen as the features mean something real and tangible (unlike in neural nets), this means a manual reviewer immediately knows which aspects of someone's behaviour are out of the ordinary when they get presented with a new case (we have a really nice internal UI for presenting this). This saves several minutes per case which really adds up.

Performance-wise Logistic Regression is good, but it can't automatically learn non-linearities in a feature value and its propensity for fraud, and it can't learn about two features that together should indicate a probability of fraud greater than the sum of its parts* . If this becomes a problem for us we'll start looking into nonlinear models where the inner workings are somewhat communicable to the manual review team.

* You can alter feature definitions manually to capture nonlinearities (e.g. a feature which is "user_has_done_x_and_has_done_y_too", but this is very very manual, and needs to be potentially rewritten/manually re-optimised on every retrain. We don't do this.

Just a note on human readability of models: for sure glm gives you a human readeable representation for "free" but there are many ways to get the same kind of readability for neural Networks. Great article, though, cheers!

Ah interesting! Blind spot in my knowledge right there, thanks for pointing it out

I like this approach - first medium data real world solution I've read

My case is probably not quite what you're looking for, but I'll describe it anyway. I develop models for automated EEG analysis. The models are all in a standalone program that neurologists use. Their computers may or may not be connected to the internet, so all computation has to be local.

I have a set of training records (usually on the order of ~100). I do three folds, and train models on 2/3 of the data, validate on the last 1/3. I train using Tensorflow, usually experimenting with a few hundred different architectures. Then I combine the best models we have trained on the three folds into an ensemble. The ensemble can be too big to be practical, so sometimes the ensemble is distilled to a smaller model by training to the ensemble outputs. (The models have to run in real-time, and there are a lot, so it should take any particular model no more than ~10ms to process 1 second of data.) The final model is then tested on a completely independent dataset to determine our performance.

As these models are training I develop different experiments to try to probe the behavior of the models to make sure that they're working as expected. This usually motivates new architectures or features to experiment with.

Once the model has been trained, I export all the model weights to a bunch of big static C++ arrays. I've written my own feed-forward NN layers (fully connected, convolutional, and LSTM) to use these arrays of weights in our C++ code. (Tensorflow Serving didn't seem to make much sense for our use case and it was just easier to write the basic NN layers.)

When our resident neurologist feels like it's working well enough, we apply to the FDA for approval, and, who knows, maybe in a few years it'll be allowed in the product!

I assume these models, and your end-product goal, is to make some kind of diagnosis, or flag anomalies? This is very interesting, and any more info on what your use-case or end-product would be great.

The end goal is to make routine EEG interpretation easier for neurologists. The usual way to look at an EEG is to page through the record 10 seconds at a time, and look for different (sometime subtle) features to make a diagnosis. For a 24 hour record, that's 8600 pages the neurologist has to look through, the vast majority of which contain nothing interesting.

One of the main features of our software is epileptogenic spike detection. Epileptogenic spikes are little blips in the record and are indicative that the patient has epilepsy. Our models are now good enough at finding these spikes as human neurologists. There may be only a handful of these spikes in the record, so finding them automatically can be a huge time savings for the neurologist.

Our software also finds and eliminates various EEG artifacts. Blinking, for example, causes a huge artifact in the records. (It turns out that your eye is a dipole, and rolls upward every time you blink, which produces a huge signal.) The next major goal is seizure detection. It turns out there are a lot of things that can look a lot like a seizure in an EEG record. Even disconnecting the electrodes can look like a seizure.

The holy grail would be to give the program an EEG record and have it spit out the neurologist's report. We're not there yet, but we're working towards it. (By the way, we're hiring another ML engineer and software engineer. Email jobs@persyst.com if you're interested!)

That sounds like a very rewarding application that will be a great net benefit for humanity.

I'm looking at using machine learning on time series data in a similar way, replacing EEG with 3D motion data. However, ML is not my forte and I am just cutting my teeth. I would really appreciate if you be able to share some of the 'buzzwords' or ML topics that I might look into as I explore this work?

I have found recurrent NNs and their variants (LSTMs and GRUs) to be very useful for the time series I've been working with. It can also be useful to use 1-d convolutional NNs as well. Depending on your data, sometimes working in frequency space can be easier (i.e., train on the FFT of your data). I'm still a neophyte in this area (my background is in astronomy), so I'm learning, too!

Many thanks.

Thanks for the detailed response! It's remarkable to see yet another example of the impact of innovations like this.

Having only hundreds of training examples and fitting hundreds of models seems like a good recipe for overfitting. Have you noticed any issues with overfitting? Or maybe there is something that I am missing.

Oh, that's definitely something we worry about a lot! One thing to note is that each record contains a lot of data in it, so it's not as though there's just 100 data points we have to go on. A 24 hour record has ~20 million data points, it's just that they are highly correlated with each other.

Cross validation helps to some extent. If you're fitting enough models, though, you can still trick yourself into thinking that some models are not overfitting, when they just happen to do well on your validation set by chance. But one of the things I look for is a smooth transition from underfitting to overfitting as the model capacity increases.

Then the other thing we do is probe the models so that we can understand what they're doing as much we can. I like to generate synthetic data and look at the transfer function of the model as I vary a particular parameter in my synthetic dataset, for example.

We wrote a blog post detailing a lot of how we do our streaming ML system at Distil (https://resources.distilnetworks.com/all-blog-posts/building...) We classify web traffic as humans or bots in realtime.

Scoring: Essentially raw logs go to kafka. Java processes then read raw logs and aggregate the per-user state in real-time. State is stored in Redis. When a user's state is updated it is sent into another kafka topic that is the "features" topic. Features are simultaneously saved to HDFS for future training data and then consumed by a Storm cluster for doing prediction. Storm is running pickled scikit-learn models in bolts that read in the features and output scores. Scores are sent into a "score" kafka topic. Downstream systems reading the scores can read from this kafka topic.

Time from receiving a log to producing a score is ~x seconds or so.


Training data is stored into HDFS from the real-time system, so our training and production data is identical. We have IPython notebooks that pull in the training data, build a model and sckikit-learn feature transformation pipeline. This is pickled and saved to HDFS for versioning. The storm cluster when it starts a topology loads the appropriate model from HDFS for classifying data.

I never thought that ipython notebooks could be used to build production models.

Can you talk about this? How big a machine do you need to run notebooks effectively ? Why a notebook...Why not a Python script,etc

> How big a machine do you need to run notebooks effectively ?

- not too big. I've run tests with performance vs. sampling of data and for the models we use we can sample a significant amount (i.e. use a fraction of the full training data). - If we did need a bigger machine, we're using ensemble models that parallel train nicely.

> Why a notebook...Why not a Python script

I was originally going to use a Python script, but I found it useful to have the notebook output inline performance charts and metrics. It's easier to contain them in the notebook than output a bunch of image artifacts that have to be added to VC. This way I can pull open the notebook and scroll through to check all my visual metrics.

I'm not opposed to ditching the notebook for training entirely, but for now it works just fine.

so you build a notebook and play around with it... and then run this notebook in an automated way ? so u can open the notebook anytime and work with it ?

I really love this quick iterative way of working (atleast in the early days). Could you talk about your production setup of training ? I'm just concerned about performance, etc - is it OK to train manually each time (by opening the notebook, etc)

So far our models remain fairly stable in the ~weeks timeframe. If we needed to train daily or similar I would invest time in something other than a notebook. But, right now, it's easier to have the training steps documented in the notebook that does the actual training than to build a separate system and document it.

Not claiming it's a good way of doing this, but just how it is right now.

Adding on to this, how do you guys version control your notebooks?

We still just check them in directly. Not great, but it works. You can at least have github render the notebook at any point.

Conversational ML Pipeline (Myra Labs).

We have a few interesting challenges. Our system is set up to create models for a variety of customers, and thus needs to be able to build, update and serve a couple of thousand models at any given time. Since we need GPUs, this has required some custom work rather than being able to piggy back on existing frameworks. We leverage Tensorflow for most things.

- We have a distributed job processor cluster running on GPU nodes. All models are trained on this. We wrote a custom framework in Python for this. It can train any kind of model (keras - tf/theano, sklearn, et al).

- We store our models on S3. Versioning of models becomes important, and we looked around to see how others do it before we rolled our own.

- We have a serving system that is a heavily modified/forked version of Tensorflow Serving, written in C++. Again on GPU machines. Why C++? Because we can leverage binary model formats and reduce memory usage/make things more efficient. The reason we forked TF Serving almost a year ago was that we had a large number of features we needed that TF's rapid versioning cycle/breaking API updates just didn't let us do. This system is able to load and unload models dynamically from the S3 source, as well as distribute them across nodes and balance queries into them. It can load all TF graphs, but also supports running other kinds of models through callouts to other "engines" (eg, if we want to use sklearn or CNTK in the future).

>Versioning of models becomes important, and we looked around to see how others do it before we rolled our own.

could you talk about model versioning ? not necessarily your own - but interesting and relevant examples you found elsewhere.

Not sure if this was one of the examples the parent looked at in their search, but Pachyderm is a tool that's built specifically for training and versioning ML models (and any other data pipeline). You can express your analysis operations as Docker images so it should support all of the ML tools listed in this thread.

Full disclosure: I'm one of the creators of the project.

I work at Spotify on ML. I spent quite sometime working on our production vector model pipelines.

We have three components:

1. Training data: Scalding pipelines (could be any large scale production framework) to preprocess training data and reduce it to a few hundred gigs.

2. Run latent vector models (Matrix factorization, Embedding models etc) to generate item vectors. This stage is designed to be non specific to framework. We use typed avro objects to communicate with the previous stage.

3. Production deployment of models. Essentially glue code that ensures that models are versioned, atomic (user, music vectors are all from the same batch) and deployed into music recommendation and realtime frameworks.

Read here[1] more.

[1] https://labs.spotify.com/2016/08/07/commodity-music-ml-servi...

I find a lot of these frameworks to be overkill.

A rating event comes into our web server (python), then sent to SQS, then the event is pulled by our custom artificial neural network (deep learning) script, --written in Go. A model is trained and serialized. Next, the serialized model is uploaded to Postgres where it is fetched by the web service (also written in Go) to serve predictions.

We update our models within 15 seconds of a user rating. Every month with millions of ratings, we re-train millions of models and serve billions of predictions.

Interesting. Do you mind going into more detail?

I'm thinking of implementing a similar system – though not using Go.

What sort of details are you looking for?

Mostly persistence and scaling.

I assume you're using a jsonb column, and going horizontal?

Honestly, we haven't needed to scale horizontally. The Postgres column is a jsonb, although probably not necessary since we don't need to query the serialized model. If our reads ever start to slow down, the plan is to just add a follower to read from.

Woops. I was referring to the prediction tier.

What frameworks are you thinking of, specifically?

At Custora (YC W11), analyzing consumer behavior and presenting the insights through a web-based interface.

We have an in-house system for modeling DAGs of statistics, perhaps like Airflow (I've only read about Airflow, never used it, so can't comment more there). Computed values at different nodes in the DAG can be cached by their arguments. So for example, a fitted model would be a very logical node to cache, as you'll have many other potential statistical requests that would depend on the model.

Refitting the model would entail either clearing that node; if you changed the inputs to the fit, a separate stat would be cached and other dependent stats would know the difference and point to the right stat depending on the arguments you passed.

A lot of our models are Bayesian in nature, so generating predictions is typically a two-part process: training parameters, which is slow, can happen infrequently and which need not critically include all of the latest data, and applying the posterior update, which is faster and which needs to be redone on every data update. (We import data in batch.) Retraining is somewhat ad-hoc right now, although we've got an active project on the docket to systematize this and produce streamlined before-and-after comparisons.

Computation work is dispatched to EC2 instances by a Redis-based job manager (qless, developed by Moz, formerly SEOmoz). We do the stats work in R, in-memory. For things like order and transactional data this is feasible even for relatively large retailers, but we're gradually looking at involving Spark more so that we have the capacity for larger analyses. (We already do use Spark for some non-ML tasks, like customer data import.)

Thanks everyone for the useful answers.

I'll go too: We have 20M customers with a history 1-1000 activities per customer, updated once daily. We do nightly runs of a collaborative filtering algo to produce per-customer suggested activities, and push this to a sql database. Business users can load recommendations for each customer from this db...

We use scikit-learn to train the models every few weeks when we get more labeled data. Once a model is trained we use joblib to save the entire pipeline (normalization, feature processing etc). In production we have a thin Rest wrapper that loads the model pipeline to memory and serves prediction requests. We scale the number of these servers based on the load.

Same here.

Works well as long as you're disciplined w/ your code and have regression tests, continuous integration.

The hard part is ETLing data for training in a consistent way.

Deepdetect for both dev and prod, with the minimum code in front of it. This setup is definitely not able to accommodate all modern ML needs, but the fast and secure model update from dev to prod is the easiest for us. Disclaimer: DD author so the bias is very high, apologies for this, maybe my comment will remain useful to some.

Data Engineer at Thread (YC '12) here.

We're a relatively small team so it's nothing terribly sophisticated - Spark jobs prep log data for a nightly model training, done with sklearn - the model is just a python pickle that get loaded into our ranking system, Thimble. All done on EC2.

We also have a real-time pipeline (Bazalgette) that pulls events off a Kinesis stream and turns them into features, saved to Redis.

We're hiring a software engineer (in London) at the moment so if this sounds like the kind of tech you'd be interested in working alongside, drop me an email (in my profile) or look at https://thread.com/jobs

These replies are all very valuable, so appreciate everyone's comments. I have been involved in implementing a risk model for a financial services company, and the system is written completely in R and split as follows:

live model: R script listening to redis list, data is json, scores customer applications as they come in; if needed can trivially scale more workers.

batch update: also R script, generates model files that live system uses and runs nightly on cron, past models are backed up so rollbacks are possible.

The major pain points have been ensuring the live and batch systems match exactly, as we have to rewrite the code to do the same thing on both of them. Unfortunately, since we're doing quite a bit of processing to engineer features from multiple data sources this couldn't easily be expressed as a pipeline in scikit learn.

Redis has been invaluable in managing the queues and glueing the different systems together.

(product person at Appuri in Seattle)

We work with customer data, so lots of time-series. We use SQL for exploration and feature engineering. Model-building happens inside a Docker container that runs on ECS. Scores and predictions are inserted back into the data warehouse (Redshift). It's a simple, practical system that works well without needing a ton of infrastructure.

Data Engineer at Shopify here.

We use ML in both streaming and batch use cases to do things like fraud detection, app/theme recommendations and to give merchants insights into their customers and business.

Our ML architecture is very similar to the OP, we use PySpark to train models and fill some of the missing gaps using scikit-learn. The batch jobs consume and produce data to HDFS and our streaming pipeline read off a Kafka Stream.

We've been experimenting with exporting models to PMML and using the JPMML Evaluator to serve them. However in some use cases we've had to rewrite models in ruby so that they can be used in synchronous paths with our rails app, these are usually very simple things like logistic regression.

Our ML pipeline is in its infancy, theres still a lot of work to be done, if this is the sort of thing you'd like to build shoot me a message at


Is Shopify open to remote work?

Some of our positions are open to remote work. But unfortunately there are no remote positions in the ML and Data teams. We do have offices in several Canadian cities and do work on getting a visa for candidates.

heres a link to our positions https://jobs.lever.co/shopify?lever-via=XBuWsYM_Q2

Can anyone comment on the relative computational requirements of training vs using the classifier? antognini described a system that trained in tensorflow and then is used by a simple C++ program (not on a GPU?).

Do you use the same resources/hardware for both training and executing?

At Dropbox we use different machines for training and inference, as their computational needs are so different. We train on Amazon GPU machines, while we deploy on our own Dropbox compute infrastructure on CPU machines.

When I joined Airbnb, we did something similar with PMML on the risk team. The issue I had with PMML is that sometimes the scores would differ in weird ways due to the differences in the model implementation. Also, it limited what models we could use.

I moved to our ML Infra team and built a service to allow inference on the pickled sklearn model or serialized TensorFlow model. This also worked with our ml framework: Aerosolve. (If you're using TensorFlow only then you can use TensorFlow serving.) One of the primary goals I had was to make everything config driven so data scientists could launch models without engineering help. This has worked pretty well. We've put about 20 models into production on this service in the last year.

For batch scoring, the solution is similar. We use Airflow for scheduling and can just do inference on the model in the airflow worker. It's easy enough when it's python or spark because Airflow plays nicely with those.

Some complications occur in:

1) Dependency management. We let users package up their own virtualenvs and use them in both the real time scoring and batch scoring. But it's a bit clunky in certain cases and we're moving to docker.

2) Heterogenous resource requirements. If you need a lot of memory or gpu access then it can be a bit tricky to do that with Airflow. We're working on integrating it with docker and kubernetes to solve this at the moment. Typically we use yarn for problems like this but our ML use cases require some things that are just hard to do on our hadoop cluster. I'm not a big fan of yarn for ML. No one has needed GPUs for real time prediction yet so I'll cross that bridge when we get to it.

3) Online / offline consistency. Sometimes we'd end up re-implementing transform logic in Java so we could run it in production. This led to lots of sadness when things didn't quite match up. Now, we try to make it easy to deploy the whole sklearn pipeline including custom python transformations. That way we have some confidence we're running the exact same code at training and inference time. Similarly, we built a framework to compute offline aggregations in hive and upload the results to a KV store for use in online scoring. We have the ability to augment this with streaming data to reduce the data delay on the aggregation. This introduced some lambda architecture requirements that were kind of interesting. We actually ran into quite a few scaling issues here so we've gone through a couple iterations as new projects come in. I'm currently working on the third generation of this framework.

These are some of the issues we tackled early on with the ML Infra team. We're now focusing on improving the quality of our offline training tools. Particularly we're trying to make:

1) A seamless process for going from a notebook to production that integrates all these tools.

2) A good notebook environment that includes simple provisioning of resources for large training jobs.

3) Sample notebooks for common ML tasks that encode best practices and make it easier for people to get their initial model quickly.

4) A system for making it easier to do complex training data backfills.

Outside the ML Infra team we have a few ML focused product teams that own their own infrastructure and optimize it for very specific needs they have.

I'm interested to see if https://computes.io could be used for ML.

Training/retraining: Done arbitrarily, mostly locally, sometimes distribute. Done with either TensorFlow or Torch. We have a custom backend for it. Often times linked with Keras.

Inference: For our custom platform, we have our own framework (in progress). For other products where our hardware is unavailable, we will use either MXNet mobile or a custom framework on mobile frameworks. For deployments where we have the luxury of a cloud link, we will either use TensorFlow serving (with a custom backend once it's done) or Flask linked to TF/Caffe/Keras (also with a custom backend once it's done).

For you feedback I'm offering karma or bug bounties (if you can identify me on github)!

Forestry High Resolution Inventory - Predicting Forest Attributes at Landscape Scale ----------------------------------

This a really promising business line. Our pipeline is not very automated and it has no application level data management - we're a ways away from that (e.g. the code isn't even under version control).

For our current business and data volumes, the system works.

We get target attributes (forest characteristics, e.g. species composition) by doing field surveys or other methods (e.g. expert photo interpretation).

We acquire Lidar, Color Infrared data, and Climatic models to develop landscape-scale features. For the lidar derived features we use LAStools. We use Safe software FME for generating some features from the color infrared data (e.g. vegetation indices). We use regional climate models suitable for the area of interest for the climatic indices.

The reference data and the target attributes are spatially joined. We end up with a lot of features and use the subselect library's "improve" function in R in an iterative fashion to reduce the number of features; leave-one-out LDA is used to assess the performance of the subset of features. If the target variable is not categorical, we have to classify it in order to run the LDA. The procedure produces a lot of candidate feature sets; our process to select a particular feature-set is human driven. We do not have a formalized or explicit rule. The subset of feature chosen are used in KNN routine. Some components are in python and state is shared using the filesystem.

We do a lot of tuning on a project by project basis.

At the end of the prediction, there are several transformations to derive other forest characteristics and munge the data into regional standards.

The whole modeling process is done on a somewhat high powered desktop. There is one directory that holds all the code and configs; scripts are invoked which pull settings from a corresponding config. Some of the code is stateful (i.e. stores configuration) and the configs are global (their location is not abstracted) so in order to run this process concurrently, it has to be deployed in separate machines.

Municipal Sewer Backup ----------------------

We ported some of the components described above to predict sewer backup risk. Key components were broken into R and python libraries and dockerized. The libraries are here pylearn - https://github.com/tesera/pylearn and rlearn - https://github.com/tesera/rlearn.

A python library (learn-cli, which we haven't open sourced) uses r2py to coordinate/share state between the two libraries. The process for training still requires a user to select a model from all of the candidates that the variable selection routine produces. This selection is made a prediction time; all the candidate models are stored in one file and an id is specified for which one to use. learn-cli is dockerized and we have it deployed on ECS. It scales pretty well.

This solved many of the challenges in the forestry pipeline, but we haven't been able to bring everything from the forestry model into deployment this way due to a gap in data science and development. I've been looking into Azure Machine Learning as a possible solution for this. I have benchmarked some builtin models there and gotten identical performance as with our highly customized process.


Would love to hear your advice for formalizing & automating, or alternatives to, our process for the forestry model pipeline.

Also, if you have a highly automated machine learning pipeline - what are your data scientist responsibilities? It's not clear to me how our data scientist jobs would evolve if they didn't have to manually run several scripts for fit and select a model and generating the predictions.

wow, that sure looks like it could be automated, starting with the indexing passes (script them in Python maybe?) and slowly factoring out bits of it into sklearn modules or whatever. Alternatively, as a batch job spawned from within R, but that would get annoying over time. Still, the idea is great and it sounds like you guys end up with time to go kayaking etc.

I'd think that Microsoft would be bending over backwards to get you to run the R stuff on Azure, since they bought the Revolution operation to show off such things. In the long term I imagine that you'll move most of this to Python.

You might consider poaching some spatial statistics people from Uber. Like, say, some of the women. At least if the market is big and robust enough to bring in some heavy hitters. It seems like a lot of the steps might be automated with convolutions and validated against your expert pool, but I can't say for sure.

Neat projects. Best of luck.

I'd love to hear more about what you're doing. I'm looking a bit into this space and could share some of what I've seen in the market - my email is in my profile.

FYI: HN email field isn't public; you have to put it in the about field too.

Glad to chat if you throw your email into your profile

(ML engineer from Twizoo here)

We use a multitude of models - linear and non-linear, supervised and unsupervised - to surface the best user-generated content from social for our clients.

Our ML stack is written almost entirely in Python - because of the readily available excellent libraries that give you many of the tools you'll need in your ML toolbox (matplotlib, numpy, pandas, scikit-image, scikit-learn, scipy to mention just a few). We have some more sophisticated image processing in C++. We find ourselves attacking quite a number of machine learning problems so to make our workflow more efficient we have built a layer of functionality above these libraries to manage the flow of data and speed up model prototyping and assessment. Processing input data and model fitting is usually done locally, with any hyper-parameter grid search or other computationally intensive tasks being run on a remote, cloud hosted cluster spun up on-demand.

Our tech is built around lambda architecture principles - we have a live path that processes new social content to the extent required to make content immediately available to relevant clients, and a batch path that processes complete datasets daily. Our ML models get used on both paths.

With a neat set of Python libraries for data manipulation and model definitions and our own model prototyping and assessment rig the biggest challenge is always building big, quality datasets. This continues to be our biggest learning and we are always keen to explore how others tackle this. We rely on a combination of in-house turking (which we consider a vital model development stage), external turking services (for volume) and have additionally looked at some knowledge transfer techniques.

See https://www.periscope.tv/w/1RDGlYXeDpOJL for a quick talk where I expanded on why and how Twizoo uses ML.

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact