
Ask HN: What does your production machine learning pipeline look like? - bayonetz
Doing some design for an upcoming project and taking a survey.

I'll go first. Model training happened nightly on a Spark cluster. This output a PMML-based SVM model. The model was instantiated on a cluster of compute servers running Openscoring. A thin Node web service wrapper used the Openscoring cluster to serve realtime client prediction requests. Dataset size was in the hundreds of millions of examples with hundreds of features. Handled thousands of requests per second, no problem.

Separating the training technology from the execution technology was nice, but the PMML format limits you to the kinds of models that both your trainer and executor support. What are people doing who use the same tech for both? For something like Tensorflow, I assume you must have to save the model as binary from the training step and then send it off to the prediction cluster to be instantiated again for execution?
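
For TensorFlow, the checkpoint/restore pattern I'm imagining would look roughly like this (a minimal sketch using the TF 1.x Saver API; the toy variable and paths are just stand-ins):

    import tensorflow as tf

    # toy stand-in for a real model: one trainable weight vector
    w = tf.get_variable("w", shape=[128], initializer=tf.zeros_initializer())
    saver = tf.train.Saver()

    # training side: fit, then write a binary checkpoint to disk
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... run training ops here ...
        saver.save(sess, "/tmp/model.ckpt")

    # prediction side: rebuild the same graph definition, restore the weights
    with tf.Session() as sess:
        saver.restore(sess, "/tmp/model.ckpt")
        # sess.run(prediction_op, feed_dict={...})

TensorFlow Serving and the SavedModel format are the more production-oriented versions of the same idea.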
======
bradneuberg
(ML engineer from Dropbox here)

For deep learning oriented projects, we train on EC2 GPU instances, generally
p2.xlarge instances these days to get the Nvidia K80 GPUs. We can spin up many
of these in parallel if we are doing model architecture searches or
hyperparameter exploration.

We have an in-house data turking setup where we can efficiently roll new UIs
to get ground-truth data for given problems, generally in the thousands to
tens of thousands of real data points. We also use data augmentation where
possible, synthetically generating millions of example data points and
combining them with the real turked data for fine-tuning. Note that we never
look at or train with real user data, unless explicitly donated, so data
efficiency is important to us.

We've standardized on TensorFlow these days, doing inference on CPUs currently
on Dropbox's compute infrastructure. We have a jail setup based off of LXC and
Provost that allows us to safely execute these trained models in a contained
way, responding to user requests as they come in. We use standard distributed
systems plumbing for RPCs, queues, etc. to respond to user requests and route
them to trained models. We version our trained models, and have an in-house
experiments framework that we use to deploy new models and test them against
subsets of user traffic to see how they are doing as we ramp up a new model.

Most of our day-to-day work is in Python, with occasional use of C++; other
parts of Dropbox sometimes use Go and Rust, though we haven't had need for
that on the ML team. Note that Dropbox is one of the largest users of Python
in the world (Guido van Rossum actually works here).

BTW, the Machine Learning team at Dropbox is hiring. Come join us! Details:
[https://www.dropbox.com/jobs/listing/533100](https://www.dropbox.com/jobs/listing/533100)

~~~
soheil
> we never look at or _train_ with real user data, unless explicitly donated

What is the rationale behind this? Why not train?

~~~
perlgeek
Couldn't user data make it into the model, and be reverse-engineered from it,
if you used it to train the model?

I mean, machine learning models aren't designed as one-way cryptographic hash
functions, so you can't be sure that none of the training data can be inferred
through the model's behavior.

~~~
zek
Yes! There is an interesting paper on this subject:
[https://arxiv.org/abs/1609.02943](https://arxiv.org/abs/1609.02943)

~~~
j_s
Counterpoint:
[https://arxiv.org/abs/1701.04082](https://arxiv.org/abs/1701.04082)

Source:
[https://news.ycombinator.com/item?id=13824047](https://news.ycombinator.com/item?id=13824047)

------
angusb
Fraud detection at GoCardless (YC 11). We use the same tech for training and
production classification: logistic regression (sklearn).

-------

**Training / retraining**

- train on an ad-hoc basis: every few months right now, moving to more
frequent and regular retraining as we streamline the process

- training done locally, in memory (we're "medium data", so no need for a
distributed process), using a version-controlled ipython notebook

- we extract model and preprocessing parameters from the model and
preprocessors that were fit in the retraining process and dump them to a json
config file in the production classification repo (see the sketch below)
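
A minimal sketch of that dump step (assuming a fitted sklearn
LogisticRegression, MinMaxScaler and mean Imputer; exact key names are
illustrative, not our actual code):

    import json

    def dump_model_config(clf, scaler, imputer, feature_names, path):
        """clf: fitted LogisticRegression; scaler: fitted MinMaxScaler;
        imputer: fitted mean Imputer."""
        config = {"intercept": float(clf.intercept_[0]), "features": {}}
        for i, name in enumerate(feature_names):
            config["features"][name] = {
                "coefficient": float(clf.coef_[0][i]),
                "range": [float(scaler.data_min_[i]), float(scaler.data_max_[i])],
                "imputation_value": float(imputer.statistics_[i]),
            }
        with open(path, "w") as f:
            json.dump(config, f, indent=4)  # human-readable diffs in git history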

-------

**Production classification**

- we classify activity in our system on a nightly cron*

- as part of the cron we instantiate the model using the config dumped from the
retraining process. This means the config is fully readable in the git history
(amazing for debugging by the wider eng. team if something goes wrong)

- classifications and P(fraud) get sent to the GoCardless core payments
service, which then decides whether to escalate cases for manual review

-------

* We're a payments company, processing Direct Debit bank-to-bank transfers. Inter-bank Direct Debit payments happen slowly (typically >2 days) so we don't need a live service for classifications.

Quite simple as production ML services go, but it's currently only 2 people
working on this (we're hiring!).

~~~
bayonetz
When using sklearn, I've seen a lot of folks just pickle the model and use
that as the interchange format. I like the human-readable interchange format
you are using better. I assume you just rolled your own. Why not something
like PMML?

~~~
angusb
Yep, we made our own. I haven't heard of PMML before - quite cool! What we've
made is a bit more readable for what we're using it for though, IMO. Looks
like this:

    
    
    {
        "intercept": 1.0,

        "features": {
            "feature_1": {
                "coefficient": 1.0,
                "range": [0.1, 10.0],
                "mean_feature_score": 1.0,
                "imputation_value": 1.0
            },
            "feature_2": {
                ....
            }
        }
    }

~~~
sandGorgon
Is this open source? We were looking for something like this.

~~~
angusb
Sadly not. I'd be totally up for open sourcing if there's clear demand. If you
can find it, send me an email at angus@{company_I_work_at}.com

Note that it's very tied to our use case right now: it's only compatible with
logistic regression, it currently assumes fixed hyperparameters (we'll change
this in future though), and it assumes a production pipeline of min-max
scaling, imputation, then classification (a sketch of the scoring side below).
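
For illustration, the scoring side then needs only a few lines to apply that
pipeline from the config (a sketch, not our actual code):

    import json
    import math

    def score(config, raw_features):
        """P(fraud) from the JSON config: impute, min-max scale, logistic."""
        z = config["intercept"]
        for name, spec in config["features"].items():
            x = raw_features.get(name)
            if x is None:
                x = spec["imputation_value"]      # imputation
            lo, hi = spec["range"]
            x = (x - lo) / (hi - lo)              # min-max scaling
            z += spec["coefficient"] * x
        return 1.0 / (1.0 + math.exp(-z))         # logistic link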

------
antognini
My case is probably not quite what you're looking for, but I'll describe it
anyway. I develop models for automated EEG analysis. The models are all in a
standalone program that neurologists use. Their computers may or may not be
connected to the internet, so all computation has to be local.

I have a set of training records (usually on the order of ~100). I do three
folds, and train models on 2/3 of the data, validate on the last 1/3. I train
using Tensorflow, usually experimenting with a few hundred different
architectures. Then I combine the best models we have trained on the three
folds into an ensemble. The ensemble can be too big to be practical, so
sometimes the ensemble is distilled to a smaller model by training to the
ensemble outputs. (The models have to run in real-time, and there are a lot,
so it should take any particular model no more than ~10ms to process 1 second
of data.) The final model is then tested on a completely independent dataset
to determine our performance.
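
The distillation step is conceptually simple: fit the small model to the
ensemble's soft outputs instead of the hard labels. A hedged Keras sketch
(shapes, sizes and the random stand-in data are made up; not our actual
training code):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # stand-ins: features plus the ensemble's averaged output probabilities
    X = np.random.randn(1000, 64).astype("float32")
    teacher_probs = np.random.dirichlet([1.0, 1.0], 1000)

    student = Sequential([
        Dense(32, activation="relu", input_shape=(64,)),
        Dense(2, activation="softmax"),
    ])
    student.compile(optimizer="adam", loss="categorical_crossentropy")

    # train against the ensemble's soft targets rather than one-hot labels
    student.fit(X, teacher_probs, epochs=10, batch_size=32)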

As these models are training I develop different experiments to try to probe
the behavior of the models to make sure that they're working as expected. This
usually motivates new architectures or features to experiment with.

Once the model has been trained, I export all the model weights to a bunch of
big static C++ arrays. I've written my own feed-forward NN layers (fully
connected, convolutional, and LSTM) to use these arrays of weights in our C++
code. (Tensorflow Serving didn't seem to make much sense for our use case and
it was just easier to write the basic NN layers.)
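
The export step amounts to code generation: dump each weight tensor into a
generated C++ header that the hand-written layers compile against. A sketch
(all names illustrative; ours differs in the details):

    import numpy as np

    def export_weights_to_cpp(weights, path):
        """weights: dict of name -> numpy array (e.g. from sess.run on variables)."""
        with open(path, "w") as f:
            f.write("// auto-generated; do not edit\n#pragma once\n\n")
            for name, w in weights.items():
                flat = w.flatten()
                f.write("// shape: [%s]\n" % ", ".join(str(d) for d in w.shape))
                f.write("static const float %s[%d] = {\n    " % (name, flat.size))
                f.write(", ".join("%.8ef" % v for v in flat))
                f.write("\n};\n\n")

    # e.g. export_weights_to_cpp({"lstm_kernel": np.zeros((4, 8))}, "weights.h")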

When our resident neurologist feels like it's working well enough, we apply to
the FDA for approval, and, who knows, maybe in a few years it'll be allowed in
the product!

~~~
gricardo99
I assume these models, and your end-product goal, is to make some kind of
diagnosis, or flag anomalies? This is very interesting, and any more info on
what your use-case or end-product would be great.

~~~
antognini
The end goal is to make routine EEG interpretation easier for neurologists.
The usual way to look at an EEG is to page through the record 10 seconds at a
time, looking for different (sometimes subtle) features to make a diagnosis.
For a 24-hour record, that's 8,640 pages the neurologist has to look through,
the vast majority of which contain nothing interesting.

One of the main features of our software is epileptogenic spike detection.
Epileptogenic spikes are little blips in the record and are indicative that
the patient has epilepsy. Our models are now as good at finding these
spikes as human neurologists. There may be only a handful of these spikes in
the record, so finding them automatically can be a huge time savings for the
neurologist.

Our software also finds and eliminates various EEG artifacts. Blinking, for
example, causes a huge artifact in the records. (It turns out that your eye is
a dipole, and rolls upward every time you blink, which produces a huge
signal.) The next major goal is seizure detection. It turns out there are a
lot of things that can look a lot like a seizure in an EEG record. Even
disconnecting the electrodes can look like a seizure.

The holy grail would be to give the program an EEG record and have it spit out
the neurologist's report. We're not there yet, but we're working towards it.
(By the way, we're hiring another ML engineer and software engineer. Email
jobs@persyst.com if you're interested!)

~~~
kejaed
That sounds like a very rewarding application that will be a great net benefit
for humanity.

I'm looking at using machine learning on time series data in a similar way,
replacing EEG with 3D motion data. However, ML is not my forte and I am just
cutting my teeth. Would you be able to share some of the 'buzzwords' or ML
topics that I might look into as I explore this work?

~~~
antognini
I have found recurrent NNs and their variants (LSTMs and GRUs) to be very
useful for the time series I've been working with. It can also be useful to
try 1-d convolutional NNs. Depending on your data, sometimes working
in frequency space can be easier (i.e., train on the FFT of your data). I'm
still a neophyte in this area (my background is in astronomy), so I'm
learning, too!
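
For the FFT idea, something like this gets you frequency-domain features from
windows of a 1-d signal (a minimal sketch; `motion_trace` stands in for your
raw samples):

    import numpy as np

    def fft_features(signal, window=256, step=128):
        """Slide a window over a 1-d signal; each row is a magnitude spectrum."""
        rows = []
        for start in range(0, len(signal) - window + 1, step):
            chunk = signal[start:start + window]
            rows.append(np.abs(np.fft.rfft(chunk)))  # frequency-domain features
        return np.vstack(rows)

    # X = fft_features(motion_trace)  ->  feed into any classifier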

~~~
kejaed
Many thanks.

------
gallamine
We wrote a blog post detailing a lot of how we do our streaming ML system at
Distil ([https://resources.distilnetworks.com/all-blog-posts/building...](https://resources.distilnetworks.com/all-blog-posts/building-a-streaming-log-processing-and-machine-learning-system-at-distil)).
We classify web traffic as humans or bots in realtime.

Scoring: Essentially raw logs go to kafka. Java processes then read raw logs
and aggregate the per-user state in real-time. State is stored in Redis. When
a user's state is updated it is sent into another kafka topic that is the
"features" topic. Features are simultaneously saved to HDFS for future
training data and then consumed by a Storm cluster for doing prediction. Storm
is running pickled scikit-learn models in bolts that read in the features and
output scores. Scores are sent into a "score" kafka topic. Downstream systems
reading the scores can read from this kafka topic.

Time from receiving a log to producing a score is ~x seconds or so.

Training:

Training data is stored into HDFS from the real-time system, so our training
and production data are identical. We have IPython notebooks that pull in the
training data and build a model and scikit-learn feature transformation
pipeline. This is pickled and saved to HDFS for versioning. When the Storm
cluster starts a topology, it loads the appropriate model from HDFS for
classifying data.
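
The pickling step is plain sklearn; a minimal sketch of what the notebook
saves and the Storm bolt loads (model choice, sizes and paths are
illustrative, and the HDFS client details are omitted):

    import pickle
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # stand-ins for the training data pulled from HDFS
    X_train = np.random.randn(1000, 20)
    y_train = np.random.randint(0, 2, 1000)

    # notebook side: fit and pickle transforms + model as one object
    pipeline = Pipeline([("scale", StandardScaler()),
                         ("clf", RandomForestClassifier(n_estimators=100))])
    pipeline.fit(X_train, y_train)
    with open("model-v42.pkl", "wb") as f:
        pickle.dump(pipeline, f)

    # bolt side: load once when the topology starts, then score features
    with open("model-v42.pkl", "rb") as f:
        model = pickle.load(f)
    scores = model.predict_proba(X_train[:5])[:, 1]  # P(bot) for five rows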

~~~
sandGorgon
I never thought that ipython notebooks could be used to build production
models.

Can you talk about this? How big a machine do you need to run notebooks
effectively? Why a notebook... why not a Python script, etc.?

~~~
gallamine
> How big a machine do you need to run notebooks effectively?

- Not too big. I've run tests with performance vs. sampling of data, and for
the models we use we can sample a significant amount (i.e. use a fraction of
the full training data).

- If we did need a bigger machine, we're using ensemble models that
parallel-train nicely.

> Why a notebook...Why not a Python script

I was originally going to use a Python script, but I found it useful to have
the notebook output inline performance charts and metrics. It's easier to
contain them in the notebook than output a bunch of image artifacts that have
to be added to VC. This way I can pull open the notebook and scroll through to
check all my visual metrics.

I'm not opposed to ditching the notebook for training entirely, but for now it
works just fine.

~~~
sandGorgon
So you build a notebook and play around with it... and then run this notebook
in an automated way? So you can open the notebook anytime and work with it?

I really love this quick iterative way of working (at least in the early
days). Could you talk about your production setup for training? I'm just
concerned about performance, etc. - is it OK to train manually each time (by
opening the notebook, etc.)?

~~~
gallamine
So far our models remain fairly stable in the ~weeks timeframe. If we needed
to train daily or similar I would invest time in something other than a
notebook. But, right now, it's easier to have the training steps documented in
the notebook that does the actual training than to build a separate system and
document it.

Not claiming it's a good way of doing this, but just how it is right now.

------
viksit
Conversational ML Pipeline (Myra Labs).

We have a few interesting challenges. Our system is set up to create models
for a variety of customers, and thus needs to be able to build, update and
serve a couple of thousand models at any given time. Since we need GPUs, this
has required some custom work rather than being able to piggyback on existing
frameworks. We leverage Tensorflow for most things.

- We have a distributed job processor cluster running on GPU nodes. All
models are trained on this. We wrote a custom framework in Python for this. It
can train any kind of model (keras - tf/theano, sklearn, et al).

- We store our models on S3. Versioning of models becomes important, and we
looked around to see how others do it before we rolled our own.

- We have a serving system that is a heavily modified/forked version of
Tensorflow Serving, written in C++. Again on GPU machines. Why C++? Because we
can leverage binary model formats and reduce memory usage/make things more
efficient. The reason we forked TF Serving almost a year ago was that we had a
large number of features we needed that TF's rapid versioning cycle/breaking
API updates just didn't let us implement. This system is able to load and
unload models dynamically from the S3 source, as well as distribute them
across nodes and balance queries across them. It can load all TF graphs, but
also supports running other kinds of models through callouts to other
"engines" (e.g., if we want to use sklearn or CNTK in the future).

~~~
sandGorgon
_> Versioning of models becomes important, and we looked around to see how
others do it before we rolled our own._

Could you talk about model versioning? Not necessarily your own - but
interesting and relevant examples you found elsewhere.

~~~
jdoliner
Not sure if this was one of the examples the parent looked at in their search,
but Pachyderm is a tool that's built specifically for training and versioning
ML models (and any other data pipeline). You can express your analysis
operations as Docker images so it should support all of the ML tools listed in
this thread.

Full disclosure: I'm one of the creators of the project.

------
eshvk
I work at Spotify on ML. I spent quite some time working on our production
vector model pipelines.

We have three components:

1. Training data: Scalding pipelines (could be any large-scale production
framework) to preprocess training data and reduce it to a few hundred gigs.

2. Latent vector models: run matrix factorization, embedding models, etc. to
generate item vectors. This stage is designed to be framework-agnostic. We use
typed avro objects to communicate with the previous stage.

3. Production deployment of models: essentially glue code that ensures that
models are versioned, atomic (user and music vectors are all from the same
batch) and deployed into music recommendation and realtime frameworks.

Read more here [1].

[1] [https://labs.spotify.com/2016/08/07/commodity-music-ml-servi...](https://labs.spotify.com/2016/08/07/commodity-music-ml-services/)
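
For a sense of what stage 2 computes, here is a toy SGD matrix factorization
over (user, item, playcount) triples (illustrative only; nothing like the
production scale, loss or solver):

    import numpy as np

    def factorize(triples, n_users, n_items, dim=32, lr=0.01, reg=0.1, epochs=10):
        """triples: iterable of (user_idx, item_idx, count). Returns U, V."""
        U = np.random.normal(0, 0.1, (n_users, dim))
        V = np.random.normal(0, 0.1, (n_items, dim))
        for _ in range(epochs):
            for u, i, r in triples:
                err = r - U[u].dot(V[i])                 # prediction error
                U[u] += lr * (err * V[i] - reg * U[u])   # L2-regularized SGD
                V[i] += lr * (err * U[u] - reg * V[i])
        return U, V

    # U, V = factorize([(0, 1, 3.0), (0, 2, 1.0)], n_users=1, n_items=3)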

------
poorman
I find a lot of these frameworks to be overkill.

A rating event comes into our web server (Python), is sent to SQS, and is then
pulled by our custom artificial neural network (deep learning) script, written
in Go. A model is trained and serialized. Next, the serialized model is
uploaded to Postgres, where it is fetched by the web service (also written in
Go) to serve predictions.

We update our models within 15 seconds of a user rating. Every month with
millions of ratings, we re-train millions of models and serve billions of
predictions.

~~~
devbug
Interesting. Do you mind going into more detail?

I'm thinking of implementing a similar system – though not using Go.

~~~
poorman
What sort of details are you looking for?

~~~
devbug
Mostly persistence and scaling.

I assume you're using a jsonb column, and going horizontal?

~~~
poorman
Honestly, we haven't needed to scale horizontally. The Postgres column is a
jsonb, although probably not necessary since we don't need to query the
serialized model. If our reads ever start to slow down, the plan is to just
add a follower to read from.
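
A sketch of that storage pattern (in Python here for readability; ours is Go;
the table, column and connection string are made up, and the upsert assumes a
UNIQUE constraint on user_id):

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=ratings")

    def save_model(user_id, weights):
        """weights: a plain dict of serialized model parameters."""
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO models (user_id, model) VALUES (%s, %s) "
                "ON CONFLICT (user_id) DO UPDATE SET model = EXCLUDED.model",
                (user_id, json.dumps(weights)))
        conn.commit()

    def load_model(user_id):
        with conn.cursor() as cur:
            cur.execute("SELECT model FROM models WHERE user_id = %s", (user_id,))
            row = cur.fetchone()
        return row[0] if row else None  # psycopg2 returns jsonb as a dict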

~~~
devbug
Woops. I was referring to the prediction tier.

------
andy_wrote
At Custora (YC W11), analyzing consumer behavior and presenting the insights
through a web-based interface.

We have an in-house system for modeling DAGs of statistics, perhaps like
Airflow (I've only read about Airflow, never used it, so can't comment more
there). Computed values at different nodes in the DAG can be cached by their
arguments. So for example, a fitted model would be a very logical node to
cache, as you'll have many other potential statistical requests that would
depend on the model.

Refitting the model would entail clearing that node; if you changed the inputs
to the fit, a separate stat would be cached, and other dependent stats would
know the difference and point to the right stat depending on the arguments you
passed.

A lot of our models are Bayesian in nature, so generating predictions is
typically a two-part process: training parameters, which is slow, can happen
infrequently and which need not critically include all of the latest data, and
applying the posterior update, which is faster and which needs to be redone on
every data update. (We import data in batch.) Retraining is somewhat ad-hoc
right now, although we've got an active project on the docket to systematize
this and produce streamlined before-and-after comparisons.
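
As a toy illustration of that split (not our actual models): with a conjugate
prior, the slow step estimates the prior hyperparameters across customers, and
the per-customer posterior update is just arithmetic on each data import:

    # Beta-Binomial toy example of the slow-fit / fast-update split.
    # Slow step (infrequent): fit prior hyperparameters (alpha, beta) across
    # the customer population. Fast step (every batch import): below.

    def posterior(alpha, beta, purchases, opportunities):
        """Posterior over a customer's repeat-purchase rate."""
        a = alpha + purchases
        b = beta + (opportunities - purchases)
        return a, b, a / (a + b)  # posterior params and posterior mean

    # population prior alpha=2, beta=8; customer bought 3 times in 5 chances
    print(posterior(2.0, 8.0, 3, 5))  # -> (5.0, 10.0, 0.333...)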

Computation work is dispatched to EC2 instances by a Redis-based job manager
(qless, developed by Moz, formerly SEOmoz). We do the stats work in R,
in-memory. For things like order and transactional data this is feasible even for
relatively large retailers, but we're gradually looking at involving Spark
more so that we have the capacity for larger analyses. (We already do use
Spark for some non-ML tasks, like customer data import.)

------
isoprophlex
Thanks everyone for the useful answers.

I'll go too: we have 20M customers with a history of 1-1000 activities per
customer, updated once daily. We do nightly runs of a collaborative filtering
algo to produce per-customer suggested activities and push these to a SQL
database. Business users can load recommendations for each customer from this
db...

------
gidim
We use scikit-learn to train the models every few weeks when we get more
labeled data. Once a model is trained we use joblib to save the entire
pipeline (normalization, feature processing, etc.). In production we have a
thin REST wrapper that loads the model pipeline into memory and serves
prediction requests. We scale the number of these servers based on the load.
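
A minimal sketch of the shape of this (endpoint, filenames and payload format
are illustrative):

    from flask import Flask, jsonify, request
    from sklearn.externals import joblib  # plain `import joblib` on newer sklearn

    # saved at train time with joblib.dump(pipeline, "pipeline.pkl")
    pipeline = joblib.load("pipeline.pkl")  # normalization + model together

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]  # e.g. [[1.2, 3.4, ...]]
        preds = pipeline.predict(features).tolist()
        return jsonify({"predictions": preds})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)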

~~~
hcarvalhoalves
Same here.

Works well as long as you're disciplined w/ your code and have regression
tests and continuous integration.

The hard part is ETLing data for training in a consistent way.

------
rtkaleta
(ML engineer from Twizoo here)

We use a multitude of models - linear and non-linear, supervised and
unsupervised - to surface the best user-generated content from social for our
clients.

Our ML stack is written almost entirely in Python - because of the readily
available excellent libraries that give you many of the tools you'll need in
your ML toolbox (matplotlib, numpy, pandas, scikit-image, scikit-learn, scipy
to mention just a few). We have some more sophisticated image processing in
C++. We find ourselves attacking quite a number of machine learning problems
so to make our workflow more efficient we have built a layer of functionality
above these libraries to manage the flow of data and speed up model
prototyping and assessment. Processing input data and model fitting is usually
done locally, with any hyper-parameter grid search or other computationally
intensive tasks being run on a remote, cloud hosted cluster spun up on-demand.

Our tech is built around lambda architecture principles - we have a live path
that processes new social content to the extent required to make content
immediately available to relevant clients, and a batch path that processes
complete datasets daily. Our ML models get used on both paths.

With a neat set of Python libraries for data manipulation and model
definitions, and our own model prototyping and assessment rig, the biggest
challenge is always building big, quality datasets. This continues to be our
biggest learning and we are always keen to explore how others tackle this. We
rely on a combination of in-house turking (which we consider a vital model
development stage), external turking services (for volume) and have
additionally looked at some knowledge transfer techniques.

See
[https://www.periscope.tv/w/1RDGlYXeDpOJL](https://www.periscope.tv/w/1RDGlYXeDpOJL)
for a quick talk where I expanded on why and how Twizoo uses ML.

------
pilooch
DeepDetect for both dev and prod, with the minimum code in front of it. This
setup is definitely not able to accommodate all modern ML needs, but the fast
and secure model update from dev to prod is the easiest for us. Disclaimer:
I'm the DD author, so the bias is very high - apologies for this; maybe my
comment will remain useful to some.

------
hythloday
Data Engineer at Thread (YC '12) here.

We're a relatively small team so it's nothing terribly sophisticated - Spark
jobs prep log data for a nightly model training, done with sklearn - the model
is just a python pickle that gets loaded into our ranking system, Thimble. All
done on EC2.

We also have a real-time pipeline (Bazalgette) that pulls events off a Kinesis
stream and turns them into features, saved to Redis.

We're hiring a software engineer (in London) at the moment so if this sounds
like the kind of tech you'd be interested in working alongside, drop me an
email (in my profile) or look at
[https://thread.com/jobs](https://thread.com/jobs)

------
maslam
(product person at Appuri in Seattle)

We work with customer data, so lots of time-series. We use SQL for exploration
and feature engineering. Model-building happens inside a Docker container that
runs on ECS. Scores and predictions are inserted back into the data warehouse
(Redshift). It's a simple, practical system that works well without needing a
ton of infrastructure.

------
psandersen
These replies are all very valuable, so I appreciate everyone's comments. I
have been involved in implementing a risk model for a financial services
company, and the system is written completely in R and split as follows:

Live model: an R script listening to a Redis list; data is JSON; it scores
customer applications as they come in. If needed, we can trivially add more
workers.
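
The shape of that worker, rendered in Python for illustration (ours is R;
queue and field names are made up):

    import json
    import redis

    r = redis.Redis()

    def serve_forever(model):
        """Block on a Redis list and score each application as it arrives."""
        while True:
            _, raw = r.blpop("applications")  # blocking pop from the queue
            application = json.loads(raw)
            score = model.predict_proba([application["features"]])[0][1]
            r.rpush("scores", json.dumps({"id": application["id"],
                                          "risk": float(score)}))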

Batch update: also an R script; it generates the model files the live system
uses and runs nightly on cron. Past models are backed up so rollbacks are
possible.

The major pain points have been ensuring the live and batch systems match
exactly, as we have to rewrite the code to do the same thing on both of them.
Unfortunately, since we're doing quite a bit of processing to engineer
features from multiple data sources, this couldn't easily be expressed as a
pipeline in scikit-learn.

Redis has been invaluable in managing the queues and gluing the different
systems together.

------
quadrature
Data Engineer at Shopify here.

We use ML in both streaming and batch use cases to do things like fraud
detection, app/theme recommendations and to give merchants insights into their
customers and business.

Our ML architecture is very similar to the OP's: we use PySpark to train
models and fill some of the gaps using scikit-learn. The batch jobs consume
and produce data on HDFS, and our streaming pipeline reads off a Kafka stream.

We've been experimenting with exporting models to PMML and using the JPMML
Evaluator to serve them. However, in some use cases we've had to rewrite
models in Ruby so that they can be used in synchronous paths with our Rails
app; these are usually very simple things like logistic regression.
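
The export experiment looks roughly like this with the sklearn2pmml package
from the JPMML ecosystem (a sketch; check the package docs for the current
API):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = load_iris(return_X_y=True)  # stand-in training data

    # wrap the estimator so the whole pipeline can be serialized to PMML
    pipeline = PMMLPipeline([("classifier", LogisticRegression())])
    pipeline.fit(X, y)

    # writes an XML PMML document the JPMML Evaluator can load on the JVM
    sklearn2pmml(pipeline, "model.pmml")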

Our ML pipeline is in its infancy; there's still a lot of work to be done. If
this is the sort of thing you'd like to build, shoot me a message at

franklyn.dsouza@shopify.com

~~~
gallamine
Is Shopify open to remote work?

~~~
quadrature
Some of our positions are open to remote work, but unfortunately there are no
remote positions on the ML and Data teams. We do have offices in several
Canadian cities and do work on getting a visa for candidates.

Here's a link to our positions: [https://jobs.lever.co/shopify?lever-via=XBuWsYM_Q2](https://jobs.lever.co/shopify?lever-via=XBuWsYM_Q2)

------
bhntr3
When I joined Airbnb, we did something similar with PMML on the risk team. The
issue I had with PMML is that sometimes the scores would differ in weird ways
due to the differences in the model implementation. Also, it limited what
models we could use.

I moved to our ML Infra team and built a service to allow inference on the
pickled sklearn model or serialized TensorFlow model. This also worked with
our ml framework: Aerosolve. (If you're using TensorFlow only then you can use
TensorFlow serving.) One of the primary goals I had was to make everything
config driven so data scientists could launch models without engineering help.
This has worked pretty well. We've put about 20 models into production on this
service in the last year.

For batch scoring, the solution is similar. We use Airflow for scheduling and
can just do inference on the model in the airflow worker. It's easy enough
when it's python or spark because Airflow plays nicely with those.

Some complications occur in:

1) Dependency management. We let users package up their own virtualenvs and
use them in both the real time scoring and batch scoring. But it's a bit
clunky in certain cases and we're moving to docker.

2) Heterogenous resource requirements. If you need a lot of memory or gpu
access then it can be a bit tricky to do that with Airflow. We're working on
integrating it with docker and kubernetes to solve this at the moment.
Typically we use yarn for problems like this but our ML use cases require some
things that are just hard to do on our hadoop cluster. I'm not a big fan of
yarn for ML. No one has needed GPUs for real time prediction yet so I'll cross
that bridge when we get to it.

3) Online / offline consistency. Sometimes we'd end up re-implementing
transform logic in Java so we could run it in production. This led to lots of
sadness when things didn't quite match up. Now, we try to make it easy to
deploy the whole sklearn pipeline including custom python transformations.
That way we have some confidence we're running the exact same code at training
and inference time. Similarly, we built a framework to compute offline
aggregations in hive and upload the results to a KV store for use in online
scoring. We have the ability to augment this with streaming data to reduce the
data delay on the aggregation. This introduced some lambda architecture
requirements that were kind of interesting. We actually ran into quite a few
scaling issues here so we've gone through a couple iterations as new projects
come in. I'm currently working on the third generation of this framework.
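
Packaging custom transformations into the pipeline looks something like this
generic sketch (not our actual framework; the transform itself is made up):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    class LogClip(BaseEstimator, TransformerMixin):
        """Custom transform that ships inside the pickled pipeline, so training
        and inference run the exact same code (the class must be importable
        wherever the pickle is loaded)."""
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return np.log1p(np.clip(X, 0, None))

    pipeline = Pipeline([("logclip", LogClip()),
                         ("clf", LogisticRegression())])
    # pickle.dumps(pipeline) now carries the transform along with the model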

These are some of the issues we tackled early on with the ML Infra team. We're
now focusing on improving the quality of our offline training tools.
Particularly we're trying to make:

1) A seamless process for going from a notebook to production that integrates
all these tools.

2) A good notebook environment that includes simple provisioning of resources
for large training jobs.

3) Sample notebooks for common ML tasks that encode best practices and make it
easier for people to get their initial model quickly.

4) A system for making it easier to do complex training data backfills.

Outside the ML Infra team we have a few ML focused product teams that own
their own infrastructure and optimize it for very specific needs they have.

------
gumby
Can anyone comment on the relative computational requirements of training vs.
using the classifier? antognini described a system that is trained in
Tensorflow and then used by a simple C++ program (not on a GPU?).

Do you use the same resources/hardware for both training and executing?

~~~
bradneuberg
At Dropbox we use different machines for training and inference, as their
computational needs are so different. We train on Amazon GPU machines, while
we deploy on our own Dropbox compute infrastructure on CPU machines.

------
deepnotderp
Training/retraining: done arbitrarily, mostly locally, sometimes distributed.
Done with either TensorFlow or Torch, for which we have a custom backend.
Often linked with Keras.

Inference: For our custom platform, we have our own framework (in progress).
For other products where our hardware is unavailable, we will use either MXNet
mobile or a custom framework on mobile frameworks. For deployments where we
have the luxury of a cloud link, we will either use TensorFlow serving (with a
custom backend once it's done) or Flask linked to TF/Caffe/Keras (also with a
custom backend once it's done).

------
cmatthieu
I'm interested to see if [https://computes.io](https://computes.io) could be
used for ML.

------
d4rth
For your feedback I'm offering karma or bug bounties (if you can identify me
on github)!

Forestry High Resolution Inventory - Predicting Forest Attributes at Landscape Scale
----------------------------------

This a really promising business line. Our pipeline is not very automated and
it has no application level data management - we're a ways away from that
(e.g. the code isn't even under version control).

For our current business and data volumes, the system works.

We get target attributes (forest characteristics, e.g. species composition) by
doing field surveys or other methods (e.g. expert photo interpretation).

We acquire Lidar, Color Infrared data, and Climatic models to develop
landscape-scale features. For the lidar derived features we use LAStools. We
use Safe software FME for generating some features from the color infrared
data (e.g. vegetation indices). We use regional climate models suitable for
the area of interest for the climatic indices.

The reference data and the target attributes are spatially joined. We end up
with a lot of features and use the subselect library's "improve" function in R
in an iterative fashion to reduce the number of features; leave-one-out LDA is
used to assess the performance of the subset of features. If the target
variable is not categorical, we have to classify it in order to run the LDA.
The procedure produces a lot of candidate feature sets; our process for
selecting a particular feature set is human-driven. We do not have a
formalized or explicit rule. The subset of features chosen is used in a KNN
routine. Some components are in python and state is shared using the
filesystem.
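
A Python analogue of that selection loop, for readers without the R subselect
package (a sketch; our actual process is R-based and more involved):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def greedy_select(X, y, max_features=10):
        """Greedy forward selection scored by leave-one-out LDA accuracy."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            scores = {}
            for j in remaining:
                cols = selected + [j]
                scores[j] = cross_val_score(LinearDiscriminantAnalysis(),
                                            X[:, cols], y,
                                            cv=LeaveOneOut()).mean()
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected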

We do a lot of tuning on a project by project basis.

At the end of the prediction, there are several transformations to derive
other forest characteristics and munge the data into regional standards.

The whole modeling process is done on a somewhat high powered desktop. There
is one directory that holds all the code and configs; scripts are invoked and
pull settings from a corresponding config. Some of the code is stateful (i.e.
stores configuration) and the configs are global (their location is not
abstracted), so in order to run this process concurrently, it has to be
deployed on separate machines.

Municipal Sewer Backup
----------------------

We ported some of the components described above to predict sewer backup risk.
Key components were broken into R and python libraries and dockerized. The
libraries are here: pylearn -
[https://github.com/tesera/pylearn](https://github.com/tesera/pylearn) - and
rlearn - [https://github.com/tesera/rlearn](https://github.com/tesera/rlearn).

A python library (learn-cli, which we haven't open sourced) uses rpy2 to
coordinate/share state between the two libraries. The process for training
still requires a user to select a model from all of the candidates that the
variable selection routine produces. This selection is made at prediction
time; all the candidate models are stored in one file and an id is specified
for which one to use. learn-cli is dockerized and we have it deployed on ECS.
It scales pretty well.

This solved many of the challenges in the forestry pipeline, but we haven't
been able to bring everything from the forestry model into deployment this way
due to a gap between data science and development. I've been looking into
Azure Machine Learning as a possible solution for this. I have benchmarked
some built-in models there and gotten performance identical to our highly
customized process.

-----------------------------------

Would love to hear your advice for formalizing & automating, or alternatives
to, our process for the forestry model pipeline.

Also, if you have a highly automated machine learning pipeline - what are your
data scientists' responsibilities? It's not clear to me how our data
scientists' jobs would evolve if they didn't have to manually run several
scripts to fit and select a model and generate the predictions.

~~~
abakker
I'd love to hear more about what you're doing. I'm looking a bit into this
space and could share some of what I've seen in the market - my email is in my
profile.

~~~
j_s
FYI: HN email field isn't public; you have to put it in the about field too.

