
What the SATs Taught Us about Finding the Perfect Fit - elsherbini
http://multithreaded.stitchfix.com/blog/2017/12/13/latentsize/
======
lsiebert
One of the great joys of my bachelor's degree in psychology was being invited
to take a graduate level course on Item Response Theory (with Professor Jack
Vevea, now at UC Merced). I wouldn't have fallen in love with programming and
become a software developer if I hadn't taken it. 1

The Rasch model is a deliberately simplified case of item response theory, but
I'd argue that it may not be the best one for Stitch Fix. That's not to say
that it can't be useful, but rather that the simplifications and assumptions
of the Rasch model may lead to information that does not reflect the
customer's measurements as well as a more sophisticated model could. Of course
it very well may be good enough, but it serves as a somewhat useful
exploration of the

The Rasch model is an attempt to differentiate two associated sets of
information, the latent trait of the test taker/question answerer (in this
case, their measurements) and the difficulty of the question (in this case,
whether the item is too big, too small or just right). Basically the Rasch
model treats the level of a latent trait of an individual as a function of the
difficulty of a test question and what they answered.
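
A minimal sketch of that idea in Python, assuming the textbook binary form of the Rasch model (a yes/no response rather than the three-way too small / just right / too big answer). The function names, item difficulties, and the grid-scan estimator are illustrative assumptions, not anything from the article: the probability of a positive response depends only on the gap between the person's trait and the item's difficulty, and the trait estimate is the value that makes the observed answers most likely.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """P(positive response) under the binary Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of one person's 0/1 responses at trait level theta."""
    total = 0.0
    for b, r in zip(difficulties, responses):
        p = rasch_probability(theta, b)
        total += math.log(p) if r == 1 else math.log(1.0 - p)
    return total


def estimate_theta(difficulties, responses, lo=-4.0, hi=4.0, steps=800):
    """Crude maximum-likelihood estimate: scan an evenly spaced grid of theta values."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, difficulties, responses))


# Hypothetical person: "yes" on an easy item (difficulty -1), "no" on a hard one (+1).
print(estimate_theta([-1.0, 1.0], [1, 0]))
```

By symmetry, the estimate for that hypothetical person lands midway between the two item difficulties.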

But the model purposely ignores the question of the discrimination of the
question, that is, how good the question is at differentiating between those
whose latent traits differ. It just assumes that the discrimination (the
slope of the curve reflecting the model of the question's difficulty) is the
same for every question. Other models estimate it per question.
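
As a rough illustration, assuming the standard two-parameter logistic (2PL) form and made-up parameter values, discrimination is a multiplier on the gap between trait and difficulty. With discrimination fixed at 1 this reduces to the Rasch curve.

```python
import math


def two_pl_probability(theta: float, difficulty: float, discrimination: float) -> float:
    """P(positive response) under the two-parameter logistic (2PL) model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Two hypothetical people, one unit below and one unit above the item's difficulty.
for a in (0.5, 1.0, 2.0):
    low = two_pl_probability(-1.0, 0.0, a)
    high = two_pl_probability(1.0, 0.0, a)
    print(f"discrimination={a}: P(low)={low:.3f}, P(high)={high:.3f}, gap={high - low:.3f}")
```

The gap between the two people's response probabilities widens as discrimination grows, which is what makes a high-discrimination question more informative about where someone sits.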

For example, if Stitch Fix offers a belt with a number of different holes, some
people may feel the belt is too small if they are forced to use the last hole,
some the second-to-last hole. A question about such a belt that just asked if
it was too large, too small, or just right might have low discrimination in
terms of identifying an individual's underlying size. Likewise, someone who has
bigger thighs but a relatively slim torso might give different answers about a
pair of slim fit pants of size x, which are too small for their thighs, and a
belt of size x. Thus questions about pants may have a higher discrimination
than questions about a belt.

Item Response Theory outside of the Rasch model also has a third factor to
consider on a per question and per individual basis, which is the propensity
to guess: how likely is someone to think carefully about the question as
opposed to just putting down a random answer, and likewise, are some questions
more likely to have people answer blithely instead of earnestly?
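
A hedged sketch of how this usually appears, assuming the standard three-parameter logistic (3PL) form: there the guessing parameter belongs to the item and acts as a floor on the response probability, so a per-person tendency to guess would need an extension beyond what is shown. The names and values are illustrative.

```python
import math


def three_pl_probability(theta, difficulty, discrimination, guessing):
    """P(positive response) under the three-parameter logistic (3PL) model.

    `guessing` is the lower asymptote: the chance of a positive response
    even for someone far below the item's difficulty.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# Someone far below the item's difficulty still answers positively
# about a quarter of the time when guessing = 0.25.
print(three_pl_probability(-20.0, 0.0, 1.0, 0.25))
```

Setting `guessing` to 0 recovers the two-parameter model, and additionally setting `discrimination` to 1 recovers the Rasch curve.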

The other thing to consider is that in most IRT tests, the latent trait is
assessed at a single time for multiple questions. But weight/fit/measurements
are here being assessed item by item, as they are tried, and the underlying
fit may be changing if a person is gaining weight or bulk, retaining water, or
recovering from Thanksgiving dinner. While it's unlikely that someone's weight
or size would change radically in a brief period, a model that gave more
weight to items that were tried more recently might better reflect the
individual's current measurements.
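
One simple way to do that, sketched here under loud assumptions: each tried item yields a numeric size signal, and older signals are down-weighted with an exponential half-life. The function name, the 90-day half-life, and the idea of averaging signals directly (rather than weighting terms in the model's likelihood) are all hypothetical choices for illustration.

```python
def recency_weighted_size(observations, half_life_days=90.0):
    """Average size signals, halving an observation's weight every half_life_days.

    observations: list of (age_days, size_signal) pairs, where age_days is
    how long ago the item was tried.
    """
    if not observations:
        raise ValueError("need at least one observation")
    weights = [0.5 ** (age / half_life_days) for age, _ in observations]
    weighted_sum = sum(w * signal for w, (_, signal) in zip(weights, observations))
    return weighted_sum / sum(weights)


# A signal from today counts twice as much as one from 90 days ago.
print(recency_weighted_size([(0, 12.0), (90, 10.0)]))
```

A shorter half-life tracks recent changes faster but throws away more of the older feedback, so the right value would depend on how much data each person has.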

Of course it's been years and years since I took the class, so any screwup in
this comment should reflect on me, and not my professor.

1 I was writing a function in R to speed up fitting a curve for an IRT model,
so that it ran in seconds instead of hours. (It's been a while, but I think it
was identifying the point of the curve where the slope is maximized.) In any
case, it was a time-consuming computation if you checked every possibility
linearly to 6 decimals for hundreds of test takers. I figured that there
weren't local maxima and optimized with something like a binary search (but by
decimal place) before I had ever heard of binary searches, and getting that
sort of efficiency jump was deeply satisfying.
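
The search described in this footnote might look roughly like the following. This is a reconstruction under assumptions, written in Python rather than R, and not the original code: it takes any function with a single peak on an interval, scans a coarse grid, then repeatedly narrows to the neighborhood of the best grid point with a step ten times smaller.

```python
def refine_maximum(f, lo, hi, decimals=6):
    """Locate the maximizer of a single-peaked f on [lo, hi], one decimal place at a time."""
    lower, upper = lo, hi
    step = (hi - lo) / 10.0
    tolerance = 10.0 ** (-decimals)
    while True:
        count = int(round((upper - lower) / step))
        candidates = [lower + k * step for k in range(count + 1)]
        best = max(candidates, key=f)
        if step <= tolerance:
            return best
        # For a single-peaked f, the true peak lies within one step of the best grid point.
        lower = max(lo, best - step)
        upper = min(hi, best + step)
        step /= 10.0


# Hypothetical single-peaked objective standing in for the real curve.
peak = 1.234567
print(refine_maximum(lambda x: -(x - peak) ** 2, -4.0, 4.0))
```

On an interval of width 8, a linear scan to 6 decimals needs millions of evaluations per test taker, while this needs on the order of a hundred. It only works when there are no local maxima, which is the assumption the footnote mentions.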

~~~
ouid
I didn't actually read the article, so maybe you're making a point specific to
the article, but in response to

>For example, if StitchFix offers a belt with a number of different holes...

surely the solution is just to model belts, shirts, and pants separately. Just
because there's a correlation between those things doesn't mean you have to
use it, and it fucks with your convergence.

------
wenc
Truly good sizing is a much more complex problem than can be solved through
recommendation systems. You can get incrementally better at it, but nothing
beats actually trying something on at the store. Problems that cannot be
solved through recommendations alone include:

1) Body types that aren't average (all of our bodies are irregular, but some
of us are more irregular than others). Clothing sizes are based on an average
model of the human body.

2) Fabric interaction with body. Softer fabrics drape in different ways from
more rigid fabrics. Also how loose or muscular your flesh is can affect fit if
you're in the market for really slim fits (European style).

3) Non-representative snapshots: your measurements in the morning will differ
from those at other times of day, and they will differ over the course of
weeks and months, even if your diet is stable.

4) Shrinkage/expansion: depending on material, there is some shrinkage or
expansion after the first wash/wear. Although this is mostly a known quantity
and good clothiers account for this.

Really good bespoke tailors understand these principles, and make allowances
for them as they build your clothing. Also they know to put elastic fabrics in
the right places so the suit will still fit even after a large meal.

I think the one way we can get closer to a good fit while being remote is to
have pop-up stations/kiosks where we can get multiple 3D scans of our body on
separate occasions (kinda like how most bespoke tailors require at least 3
fittings). That still doesn't account for fabric-body interactions, but it
gets us a lot closer than recommendation systems.

p.s. the other problem is cultural. Most Americans don't know or care quite as
much about fit (because it's not as prized in the culture) as their European
counterparts, so their data is going to be skewed slightly towards the left
end of the competence curve.

~~~
Jun8
I'd say the situation is even more complex than that, i.e. even trying it on
at the store may be insufficient: I hypothesize that a non-small percentage of
people, like me, have no good subjective function to assess apparel fit when
they put it on. That's why (I'm a bit embarrassed to admit) I have to bring in
my wife with me for all non-trivial shopping to get her expert opinion.

~~~
wenc
> I have to bring in my wife

Excellent point. :) What is the point of finding the perfect fit, after all?

It's so that the person (or persons) whose opinions you care about think you
dress well and look well put together. That is an important piece of data.

------
snegu
I find it surprising they do all this math since the model they seem to use is
"Send everybody tent-like shirts that would fit a horse." After repeated
requests for more fitted styles, I kept getting tents (although eventually
smaller size tents).

I understand why they do this, because this style is likely to fit more
people. But if the technique they describe in this post actually works,
perhaps they could be more adventurous?

------
pjc50
Related: [http://sizes.darkgreener.com/](http://sizes.darkgreener.com/)
recording the discrepancy between label size and actual fit for a lot of UK
high street shopping.

The variance in label sizes is bad enough - and worse for women's clothing
than men's - but when you get into "large/medium/small" it's just a lottery.
Especially if you're a westerner ordering direct from China.

~~~
Bartweiss
The variance in label sizes is definitely more extreme in women's sizes, but
I'm particularly struck by how it happens with supposedly-objective men's
clothing. As men's pants get larger, the waistband size becomes increasingly
dishonest - despite supposedly being measured in "inches". I'll look for the
study, but it can get up to an impressive 10%+ discrepancy.

------
jatsign
Stitch Fix has done a good job learning my size, but I wish I actually could
"shop" at Stitch Fix. When you get a box, there's a foldout paper that shows
you examples of what you could wear these new clothes with...but I don't own
any of those things.

I wish they had some sort of follow up experience that would let me "complete"
the outfit.

They talk a bit about why they don't have any sort of online shopping here:

[http://multithreaded.stitchfix.com/blog/2015/07/07/personali...](http://multithreaded.stitchfix.com/blog/2015/07/07/personalizing-beyond-the-point-of-no-return/)

------
robterrin
They used Stan! [http://mc-stan.org/](http://mc-stan.org/)

Gelman and the rest of the crew at Columbia are doing great work. Check out
[https://www.generable.com/](https://www.generable.com/) too.

------
not_that_noob
Nice! Using IRT for clothes sizing is indeed innovative.

One factor that is difficult to account for is users not telling you the
truth. Users may feel embarrassed to say something is too small, as that may
have negative connotations for some. It’s difficult of course to control for
that, but I wonder how honest people might be in their feedback.

------
perseusprime11
The simple model didn't work for me. They sent me shirts and pants that did
not fit.

~~~
dgritsko
This is the "cold start" problem, which they mention in the post ("What do we
do about clients who recently signed up and have no past history on the
service?") along with how they attempt to mitigate it. It's not surprising
that their approach doesn't work for all users (as it would appear that it
didn't in your case). I'm curious though, how much feedback does their system
need in order to converge? E.g., how many ratings would you need to provide
before receiving well-fitting clothes?

~~~
perseusprime11
Interesting. Thanks for pointing that out. Most of the clothes they sent are
from their own brands, which makes me wonder what is new about this model.
Any retailer can ship their clothes if they are willing to process the
returns for free.

~~~
gumby
I'm not in B2C or really any consumer play, and haven't looked at SF closely.
But: I assume it's an execution play.

Anybody can sell their books and electronics online (well, they could before
Amazon started, yet Amazon exceeded). Anybody can make clothes, yet Inditex is
a monster. And why is there more than one restaurant? SF's claim is that their
tech is a differentiator, and they are a tech company (unlike, say, Target or
Blue Nile).

There are a couple of thoughts about "in store brand": their business plan may
have required tighter control and visibility into the dimensions and materials
of their product (dimension for the reasons discussed in the article and its
footnotes; material because stuff that is tried on not in store and returned
may need to be more durable, I don't know). It makes the plan more complex,
but perhaps improves execution enough that it's worth the expense.

In the long run, if they really have a tech advantage they have the option of
expanding beyond just in-house product or simply splitting into sub brands, or
both.

