
How to build your own feature store for ML - LexSiga
https://www.logicalclocks.com/blog/how-to-build-your-own-feature-store
======
StonyRhetoric
This is a good idea - every ML operation should have something like this, to
store, organize, and version data, check for drift, do time-travel,
backups/replication, et cetera.

But to borrow from Steve Jobs, I think this is a feature, not a product. If
you've already done the hard work of setting up a data lake or data warehouse
in a cloud provider, the cloud provider can give you backups and replication,
and even some time-travel. Using something like Delta Lake or even just the
standard Kimball DW audit columns will get you point-in-time queries. Feature
versioning is just query versioning in source control, and if you have a
schema, you can version it with views if you need to. If you don't have a data
lake or data warehouse ... well, you'll still need to gather and clean all your
data before you put it into a feature store, and that's where 90% of the work
is.
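
For what it's worth, Delta Lake's time travel is roughly a one-liner from
Spark. A sketch only (the table path is made up, and it assumes a Spark
session with the Delta Lake extensions configured):

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake extensions are already on the Spark classpath.
spark = SparkSession.builder.getOrCreate()

# Read the table as it existed at an earlier version; use "timestampAsOf"
# instead for a wall-clock point in time.
features_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("/data/lake/customer_features")  # made-up table path
)
```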

I'd love to learn more, I'm sure I'm missing something, but it seems that
they're re-solving the solved part - data storage and versioning. Checking for
drift and data integrity is a nice bonus, but again, there are lots of
libraries for that. I guess I could see it being beneficial for ML shops that
don't have modern development practices, but if you don't have those, you have
bigger problems anyway.

~~~
jamesblonde
All your points are valid points. However, operational models (models used by
online applications, for example) typically need access to lots of historical
features that are not available in the application. In that case, you need to
go to a low-latency database/store to get your feature values (to build your
feature vectors). If you want to reuse those features in different models, you
will need join support for building the feature vectors, so a key-value DB
won't help there. Now your features are duplicated between this online/serving
layer and the data warehouse. How do you sync them up?

The other thing you're missing is time-travel queries (temporal logic for SQL,
in data warehouse speak). Yes, Delta Lake gives you this, but you will need to
wrap that data in APIs so that your data scientists will be able to use it.

For data drift, a library alone won't cut it. You need to compare descriptive
statistics/distributions of the data used to train the model against the live
data coming in. Where do you get those statistics from? The feature store, in
our case (with the help of versioning + metadata).

Then there is end-to-end governance of ML models: what training dataset was
used to train this model, and can I reproduce that training dataset if it
hasn't been archived? You need metadata to manage all of that. So yes, you can
do it, but you have to build something (as the article describes) or buy it.
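
To make that drift check concrete, here is a minimal, self-contained sketch of
the kind of comparison I mean (not our actual implementation; the feature name
and the two-sample KS test are just illustrative choices):

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical feature "orders_last_7d". In a feature store, the training-time
# values/statistics would be looked up by feature version; here they are just
# synthetic arrays for illustration.
train_values = np.random.poisson(lam=2.0, size=10_000)  # logged at training time
live_values = np.random.poisson(lam=3.5, size=1_000)    # arriving at serving time

# Two-sample Kolmogorov-Smirnov test: are the samples drawn from the same
# distribution? A small p-value suggests the live feature has drifted.
statistic, p_value = ks_2samp(train_values, live_values)
if p_value < 0.01:
    print(f"drift suspected for orders_last_7d (KS={statistic:.3f}, p={p_value:.3g})")
```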

------
LexSiga
As this topic will inevitably become more trendy, here are some additional
interesting resources on the subject:

\- [https://www.quora.com/What-are-the-implementation-challenges...](https://www.quora.com/What-are-the-implementation-challenges-of-a-machine-learning-feature-store)

\- [http://featurestore.org/](http://featurestore.org/) (a list of -some of-
the available feature stores)

------
tristanz
A great collection of real-world case studies and various implementations can
be found here: [http://featurestore.org/](http://featurestore.org/)

------
encyclopedia
Similar to generating feature vectors for dataset augmentation here
[https://vectorspace.ai/covid19.html](https://vectorspace.ai/covid19.html)

------
jamesblonde
I'm the author. Let me know if you have any questions.

~~~
strgcmc
Just to add a counter-anecdote, as I see lots of (good/valid) questions about
"why do I need this?", here's an anecdote about "yes we definitely benefited
from this":

\- Years and years ago, we already had a data warehouse (DWH)

\- In the data warehouse, you would store data like each and every order that
all customers have made (i.e. up to and including full-fledged facts and
dimensions about each)

\- Now, let's say axiomatically/hypothetically, a very useful and highly
predictive feature for ML is "# of orders made in past 7 days" for each
customer

\- Can this be computed from the data already in the DWH? Yes, absolutely, but
it's a new computation and not an existing column/attribute in the dimensional
model.

\- What if you need to recompute this feature daily, for millions of customers
and orders? Well, we could always just add it to the dimensional model,
compute it once, and let people use and share it... but why? Most internal
users of the DWH probably don't care about something like "# of orders past 7
days" being added to a customer dimension or per-customer grain (too specific
or whatever). Moreover, the DS/ML folks want the same feature broken down for
every 1/3/7/30/90/180/365/730/etc. days, as well as a bunch of variations on
orders and things other than orders (e.g. "average time between new orders,
over past 7/30/90 days" or "average $ spent over past 7/30/90 days"), as
features that serve as a proxy for frequency of activity and level of
engagement (see the sketch after this list)

\- Hence, it makes sense to keep the "golden copy" of data in a canonical form
in the classic/standard DWH as a baseline, and to separately/independently
compute features out of that data and to store them in a different system
(which can also be optimized for the different query/access patterns that
DS/ML have, vs traditional BI). Over time, it also made sense in certain cases
to go upstream of the DWH to source data from and process it more directly
(for performance/efficiency reasons), though generally deriving features out
of the transformed dimensional models was still very useful.

\- It took our teams ~1-2 years to really go through this evolution and reach
a mature-ish state, but for the past 2-3 years, we've benefited tremendously
from having an independent feature repository/store, that is separate from the
classic DWH. Benefits came in all the obvious and some non-obvious ways, i.e.
in faster iteration/cycle time, in better quality/repeatability, and in being
able to automatically discover interesting relationships that no human could
have anticipated - simply by having a very broad/large repository of features
and running automatic feature selection over it.
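
To make the above concrete, here is a minimal pandas sketch of that kind of
derivation, turning a DWH-style orders fact table into per-customer trailing-
window features (this is not our actual pipeline; table and column names are
made up):

```python
import pandas as pd

# Hypothetical orders fact table pulled from the DWH: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "order_ts": pd.to_datetime(
        ["2020-05-01", "2020-05-03", "2020-05-04", "2020-05-06", "2020-05-07"]),
    "order_total": [20.0, 35.0, 10.0, 5.0, 60.0],
})

as_of = pd.Timestamp("2020-05-08")  # the feature "snapshot" date

def window_features(days: int) -> pd.DataFrame:
    """Per-customer aggregates over the trailing `days`-day window."""
    recent = orders[orders["order_ts"] >= as_of - pd.Timedelta(days=days)]
    return recent.groupby("customer_id").agg(
        **{f"orders_last_{days}d": ("order_ts", "count"),
           f"spend_last_{days}d": ("order_total", "sum")})

# The same derivation, repeated for every window the DS/ML folks want.
features = pd.concat([window_features(d) for d in (7, 30, 90)], axis=1)
print(features)
```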

~~~
ivalm
In your use case is a feature store roughly a collection of incrementally
updated feature tables + time/version meta?

We've been storing our features in an RDBMS, normalized, with each feature
table having a run_id column (run_id is unique to pipeline version + timestamp
of execution if batch processed; if streaming, there is a run_id for the
pipeline version but the date comes from the interaction date, which is part
of the raw data already), and I'm curious what we're potentially missing.

In this sense you can query features for given users generated by particular
version and/or date but it does involve potentially lots of joins (to get a
collection of features).
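
Roughly, the pattern is: filter each feature table to a run_id, then join on
the entity key. A toy sketch of that pattern (made-up tables; in practice this
is SQL against the RDBMS):

```python
import pandas as pd

# Toy versions of two feature tables, each carrying a run_id column.
orders_7d = pd.DataFrame({
    "customer_id": [1, 2],
    "run_id": ["v3_2020-05-08", "v3_2020-05-08"],
    "orders_last_7d": [3, 2],
})
spend_30d = pd.DataFrame({
    "customer_id": [1, 2],
    "run_id": ["v3_2020-05-08", "v3_2020-05-08"],
    "spend_last_30d": [60.0, 70.0],
})

run_id = "v3_2020-05-08"  # pipeline version + execution timestamp

# Filter each table to the chosen run, then join on the entity key. With many
# feature tables, this becomes the long chain of joins mentioned above.
tables = [orders_7d, spend_30d]
vectors = tables[0][tables[0]["run_id"] == run_id].drop(columns="run_id")
for t in tables[1:]:
    vectors = vectors.merge(
        t[t["run_id"] == run_id].drop(columns="run_id"),
        on="customer_id", how="left",
    )
print(vectors)
```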

~~~
strgcmc
Conceptually, I think that's about right (meaning, similar concepts to ours,
but who's to say whether this is objectively the best approach...).
Practically, we're heavy on AWS, so we've found that for our
size/scope/breadth, it was more performant/efficient to store the data as
Parquet files in S3, cataloged in AWS Glue. And yes, there are a lotttttt of
joins needed, so it's worth investing time in some deep thinking about how
best to partition your data and how best to optimize the
type/variety/complexity/number of joins you'll have to do.

I can see an RDBMS being reasonable for small-to-medium data sizes, but beyond
a certain threshold I think it starts to break down (at the few 10s/100s of TB
level, maybe?).
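
To give a flavour of the Parquet side mentioned above (a sketch only; the
output prefix and partition column are made up, and in our case the path would
be an s3:// location that Glue then catalogs):

```python
import pandas as pd

# Hypothetical derived-feature table with a snapshot-date column.
features = pd.DataFrame({
    "customer_id": [1, 2, 1, 2],
    "snapshot_date": ["2020-05-07", "2020-05-07", "2020-05-08", "2020-05-08"],
    "orders_last_7d": [2, 1, 3, 2],
})

# Write Parquet partitioned by snapshot date (requires pyarrow). Pointing the
# path at s3://bucket/prefix/ (with s3fs installed) lands it in S3, where Glue
# can catalog it and Athena/Spark can prune by partition.
features.to_parquet("customer_orders/", partition_cols=["snapshot_date"])
```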

~~~
ivalm
Yeah, we are in the 1-10 TB range. We are bound to on-prem Oracle Exadata and
so far it's ok.

