

Don't Get Burned By Heatmaps - bkrausz
http://www.gazehawk.com/blog/dont-get-burned-by-heatmaps/

======
petercooper
_After reviewing the individual heatmaps, however, we get the impression that
the site might have a bit too much going on_

Normal A/B split tests have a similar issue. Let's say you have a sales page
and you're testing button colors: a red buy button, a blue one, a green one.
And let's say blue wins overall with a high level of confidence.

Without drilling down (and most systems I've worked with are bad at this or
can't do it at all) you can't tell if certain _types_ of user actually
preferred other scenarios. For example, visitors from east Asia might convert
WAY better for the red button (red being a lucky color in China). And visitors
coming from _certain sites_ might always convert better on the green.

Problem is, your split test results show only the overall numbers, so you
optimize for blue... when your A/B tool could have analyzed the data and
suggested segments for you. Does this tool exist (without making users guess
segments up front)? If not, there's a ton of money being left on the table.
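A toy sketch of the drill-down I mean (every number and segment name below is
made up for illustration): an overall winner can lose badly inside a segment.

```python
from collections import defaultdict

# Hypothetical conversion log: (variant, segment, converted?)
records = [
    ("blue", "east_asia", 1), ("blue", "east_asia", 0),
    ("blue", "other", 1), ("blue", "other", 1),
    ("blue", "other", 1), ("blue", "other", 0),
    ("red", "east_asia", 1), ("red", "east_asia", 1),
    ("red", "other", 1), ("red", "other", 0),
    ("red", "other", 0), ("red", "other", 0),
]

def rates(records, segment_of):
    """Conversion rate keyed by (segment_of(segment), variant)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for variant, segment, converted in records:
        key = (segment_of(segment), variant)
        hits[key] += converted
        totals[key] += 1
    return {k: hits[k] / totals[k] for k in totals}

overall = rates(records, lambda s: "all")  # what the A/B tool reports
by_segment = rates(records, lambda s: s)   # the drill-down it usually skips
# blue wins overall (4/6 vs 3/6), but red converts 2/2 vs blue's 1/2
# inside the east_asia segment
```

A tool could scan every available segment attribute for sign flips like this
instead of only reporting the aggregate winner.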

~~~
mwexler
Actually, this is a problem with most entry-level tools. Professional-level
tools like Omniture Test and Target, or Optimost with Autonomy Segments,
handle MVT testing per segment, with segments defined by external variables
(appended data or data from Dataxu et al.), business variables (return
customer, prospect, etc.), or just online variables (IP-based geo, repeat vs.
new visitor, etc.)

Professional level, btw, is defined as "costs a lot" and "are painful to use"
and "have poor interfaces"; more recent tools give up some of the power and
cost for ease of use.

However, as you point out, these easier MVT tools (or Split-Test tools) focus
on showing the "winner" or "winning variables", winner defined as "most
impactful on the total traffic or sample of this traffic during this time".
You are right that this is not the final answer... but it's a good starting
place; if you solve for the majority of users, it's usually the biggest step.
Then solving for unique smaller groups can often result in lower volume but
higher value conversions... if you have time.

That being said, including segment variables in the analysis is a wonderful
thing to do, and I encourage it. I look forward to when Visual Website
Optimizer and other "entry level" tools include these in their analyses by
default... and we can look to the end of Optimost, Omniture TnT, and the other
"enterprise" tools.

------
lsb
Very cool stuff, Brian! It'd be interesting to try to cluster users based on
what bits of the page they look at (women 65+ look at navigation breadcrumbs
significantly more than average, say).

~~~
lbarrow
GazeHawk intern here -- I wrote the post. Our post next week is going to be
focused more on using different clustering metrics as a way of making the data
easier to understand.

We definitely have plans for running some big studies where we can do
demographic comparisons, since that sort of information helps us bring a lot
of value to our customers, but it might not be for a few weeks.

------
moultano
Seems like you need a heatmap for the variance as well as the mean.
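A minimal sketch of what I mean, assuming one fixation-density map per
participant (the random data here just stands in for real gaze maps):

```python
import numpy as np

# Synthetic stand-in: 20 participants, each with a 48x64 fixation-density map
rng = np.random.default_rng(0)
per_user_maps = rng.random((20, 48, 64))

mean_map = per_user_maps.mean(axis=0)  # the usual aggregate heatmap
var_map = per_user_maps.var(axis=0)    # where participants disagree

# A high-variance pixel is one that some people dwelt on and others ignored,
# exactly the split behavior the mean alone hides.
```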

~~~
lbarrow
That's an interesting idea... I'll see if anything comes of it.

~~~
pacaro
This was my first thought also (although 7 hours later!). It may be
interesting to look at skew as well as variance (kurtosis is probably going
too far), i.e. the difference between a diffuse normal distribution and split
peaks...

It may even be interesting to do some kind of k-means; it would be awesome to
see that correlate with demographics...
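A rough sketch of that k-means idea, clustering users by their flattened
heatmaps (everything here is synthetic; a real run would use per-participant
fixation maps):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means; each row of X is one flattened per-user heatmap."""
    # deterministic init: k rows spread evenly through X
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centers = X[idx].astype(float)
    for _ in range(iters):
        # assign each user to the nearest center (squared distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute centers, keeping the old one if a cluster empties
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two synthetic "viewing styles" on an 8x8 grid: top-lookers vs. bottom-lookers
rng = np.random.default_rng(1)
top = np.hstack([rng.random((10, 32)) + 1.0, rng.random((10, 32))])
bottom = np.hstack([rng.random((10, 32)), rng.random((10, 32)) + 1.0])
labels, centers = kmeans(np.vstack([top, bottom]), k=2)
```

Cross-tabulating the resulting cluster labels against demographic fields
would show whether viewing styles track demographics at all.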

------
roryokane
To ensure a heatmap does not “eliminate the element of time”, the heatmap
could show later-looked-at spots with more desaturated colors.

Of course, this has the same problem as the main point – that you can’t tell
whether a spot is medium-saturated because everyone looks at it in the middle
of browsing or because some people see it at the beginning and others see it
only at the end. Still, this problem, like the article’s, could be fixed by
moultano’s suggestion of making a heatmap for the variance as well as the
value.

Or you could sacrifice the display of time, and display _variance_ as
saturation – less-variant, more-sure spots would have more saturated colors.
This would be easier to read than two separate heatmaps for value and
variance.
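A sketch of that encoding, assuming we already have a per-pixel mean map and
variance map (both hypothetical inputs), normalized and mapped through HSV:

```python
import colorsys
import numpy as np

def confidence_colored(mean_map, var_map):
    """Value encodes fixation density; saturation encodes agreement.
    Vivid pixels are spots everyone treats the same way."""
    # normalize both maps to [0, 1]; guard against a flat map
    m = (mean_map - mean_map.min()) / (np.ptp(mean_map) or 1.0)
    v = (var_map - var_map.min()) / (np.ptp(var_map) or 1.0)
    rgb = np.empty(mean_map.shape + (3,))
    for i in range(mean_map.shape[0]):
        for j in range(mean_map.shape[1]):
            # fixed red hue; low variance -> high saturation
            rgb[i, j] = colorsys.hsv_to_rgb(0.0, 1.0 - v[i, j], m[i, j])
    return rgb
```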

------
terzza
During my final-year project for my BSc, I tried to address the problem of
losing the temporal data from eye tracking in heatmaps with an
accelerated-replay heatmap animation. Here's an example:

[http://www.youtube.com/watch?v=L319pLmzHVc&feature=chann...](http://www.youtube.com/watch?v=L319pLmzHVc&feature=channel_video_title)

This video shows a selection of sittings from radiologists reporting on chest
x-rays.

If I recall correctly, it is sped up to approximately 5x real time.

~~~
bkrausz
Nice visualization! We have something similar that we generate:

<https://s3.amazonaws.com/gazehawk-public/example_track.mp4>

Definitely a step up from heatmaps, though I think there's a lot to be said
for providing the data as a static image rather than a video, especially when
it's being added to a PowerPoint or PDF.

------
rgbrgb
Though I'm pretty optimistic about the possibility for algorithmic user
interface design (or at least quantitative evaluation), I've yet to see any
conclusive studies on the matter. Anybody got any hard info on this?

~~~
lbarrow
I think it's really difficult to rigorously evaluate that sort of thing.
Different websites serve different purposes and need different designs.

~~~
rgbrgb
Yes, I suppose different sites will need slightly different fitness functions
but I'm confident in the future of automated design.

------
georgieporgie
Using the example of GazeHawk, it looks like they claim to work using your
system's built-in webcam. Being interested in eye tracking for non-advertising
purposes, I did a few calculations, and I fail to see how this is possible at
typical monitor distance and typical webcam resolutions.

Does anyone have any insight as to how this is done? Currently, I'm highly
skeptical that these heatmaps are the least bit accurate.

~~~
jgershen
What calculations did you do?

As cofounder of GazeHawk, I've written on different aspects of this topic
previously [1, 2]. Is that information helpful / can you elaborate on your
skepticism?

[1] <http://www.gazehawk.com/blog/on-accuracy/>

[2] [http://www.quora.com/GazeHawk/What-broad-computer-vision-
tec...](http://www.quora.com/GazeHawk/What-broad-computer-vision-techniques-
does-GazeHawk-use-for-gaze-tracking)

~~~
georgieporgie
Thanks for the links, they were very interesting.

I did basic trigonometry and came up with an estimate of about 50 pixels of
accuracy using a high-res, third-party webcam. That's why I doubt claims
about built-in webcams, since they're typically pretty low resolution.

Your first link mentions an accuracy of around 70 pixels on a MacBook Pro,
which is impressive but doesn't strike me as impossible (assuming FaceTime HD
camera, which is 1280x720, I believe).
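For what it's worth, here's my back-of-envelope version of that estimate
(every number below is an assumption, not a GazeHawk figure): treat the pupil
center as sitting on a ~12 mm eyeball and ask how far the gaze point moves on
screen when the pupil moves by one camera pixel.

```python
import math

# All values are hypothetical assumptions for a rough estimate
eye_radius_mm = 12.0      # approximate human eyeball radius
distance_mm = 600.0       # subject-to-screen (and camera) distance
cam_hfov_deg = 60.0       # webcam horizontal field of view
cam_hres_px = 1280        # webcam horizontal resolution
screen_px_per_mm = 3.78   # ~96 dpi display

# Width of the scene the camera images at the subject's distance
scene_width_mm = 2 * distance_mm * math.tan(math.radians(cam_hfov_deg / 2))
mm_per_cam_px = scene_width_mm / cam_hres_px

# Eye rotation needed to shift the pupil center by one camera pixel
gaze_step_rad = mm_per_cam_px / eye_radius_mm

# That rotation, projected onto the screen, in screen pixels
screen_err_px = distance_mm * math.tan(gaze_step_rad) * screen_px_per_mm
# comes out around 100 screen pixels per camera pixel of pupil error
```

Sub-pixel pupil localization (fitting the iris contour across many pixels)
divides that figure, which is presumably how accuracy well under 100 pixels
is reachable even with modest webcams; different distance and FOV assumptions
shift the estimate a fair bit either way.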

~~~
jgershen
While the resolution of the webcam is (obviously) important when discerning
accuracy, I feel like you may be conflating two terms here. Specifically,
going from the resolution of the webcam to an estimate of so many pixels of
accuracy using basic trigonometry will necessarily depend on the method you're
using to convert the webcam input into eye-tracking data.

The <70 pixel figure for GazeHawk's accuracy is based on testing against real,
labeled training data. That is the distance, on the screen, by which our
calculated gazepoint differs from the true location at which the user's gaze
was directed. It is only loosely correlated with webcam resolution, in that a
higher webcam resolution corresponds to a larger pipeline - more input pixels
being dumped into the eye tracking algorithm. I could be wrong, but it sounds
like you're discussing the size of the eyes in the input image.

Also, at this point a discussion of accuracy vs. precision becomes germane.
The use of higher resolution video as an input can often impact one but not
the other.

~~~
georgieporgie
_I feel like you may be conflating two terms here._

Probably. :-)

 _I could be wrong, but it sounds like you're discussing the size of the eyes
in the input image._

I believe I am. I assume that increased pixel count in the eye region
corresponds directly with increased accuracy. This could be accomplished by
either moving the camera closer to the eye, or by using a higher resolution
camera.

------
kjames
Pardon my ignorance, but I find your efforts quite insignificant. If you were
to ask me where most people looked on that example, I could accurately tell
you what got the most looks first, second, third, etc., without ever seeing
your heatmap. Do I really need a third party to tell me people like boobs
more than Ron Paul?

I'd like to add that the length of eye contact is irrelevant compared to a
more valuable metric: interpretation. Let's say we could determine (or even
narrow down) users' interpretation of the content and heatmap its relevance
to their visit. Then we could take a proactive stance, predicting future
visits and adjusting the content accordingly, rather than being reactionary
and simply saying (after the user is gone) that people looked at boobs more
than Ron Paul.

Just my two cents, sorry if I was a dick.

~~~
potatolicious
Did you even read the article?

The _aggregate_ confirms what you're saying - that people will look down the
middle at all the half-naked people.

But look at the individual heatmaps, it seems like a not-insignificant number
of people followed more text than images, some drifted all over the place, etc
etc.

That's the entire _point_ of the post I think - pointing out that aggregates
can be deceiving when the distribution is not even close to uniform.

