
What tool would *you* use to solve this? - ColinWright
http://nrich.maths.org/7731
======
VonLipwig
I would use my right hand to slap Alison and Charlie and get them to do their
work again... this time forcing them to label their samples.

I would then take them into my office and hold an inquest on how two
completely different data-sets got mixed up in the first place.

At which point I would probably find out Alison and Charlie have been knocking
boots on company time.. Alison and Charlie would then be fired as this is
against company policy.

Due to the difficulty in finding new jobs Alison and Charlie's fledgling
romance would end... leaving Alison unexpectedly pregnant and Charlie with an
18 year bill for child support.

The moral of this story? Label your f**king work.

Note: I was unable to solve this problem as my math skills are poor.

~~~
ovi256
You're on the management fast track aren't you ?

~~~
VonLipwig
I probably would be if I spent less time writing fictional stories to
accompany math problems. ^^

------
antonb2011
PEOPLE!!! Are you hackers or not!?!? Download the Excel spreadsheet and look
at the raw numbers. The numbers in columns A, D and E have 2 significant digits
after the decimal point, whereas the numbers in columns B, C and F have 13
significant digits after the decimal point! No calculations necessary!

~~~
bluemanshoe
Nice.

------
bluemanshoe
I would use the Kolmogorov-Smirnov test:
<http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test>

It's in scipy.stats.ks_2samp
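
(A minimal sketch of how such a table could be produced with that function; the
file name and loading step are assumptions, not part of the original comment:)

    import numpy as np
    from scipy.stats import ks_2samp

    # assumed: a tab-separated export of the six columns, with a header row
    data = np.genfromtxt("data.txt", delimiter="\t", names=True)

    for known in "AB":              # the labelled reference samples
        for unknown in "CDEF":      # the mixed-up samples
            d, p = ks_2samp(data[known], data[unknown])
            print(known, unknown, round(d, 3), round(p, 3))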

Results:

    Sets |   D   | p-value
    -----+-------+--------
    A,C  | 0.275 |  0.080
    B,C  | 0.175 |  0.531
    -----+-------+--------
    A,D  | 0.125 |  0.893
    B,D  | 0.275 |  0.080
    -----+-------+--------
    A,E  | 0.100 |  0.983
    B,E  | 0.300 |  0.043
    -----+-------+--------
    A,F  | 0.300 |  0.043
    B,F  | 0.100 |  0.983

As far as the test goes, if D is small and p is high, you cannot reject the
hypothesis that the two datasets came from the same distribution. The p-value
is roughly how often you would get data that looks this different just by
chance, assuming the null hypothesis (in this case, that the two samples are
drawn from the same distribution).

In light of this evidence, if they are not lying to us, and each of these sets
really came from an A-like or B-like distribution, I'd say fairly confidently
that:

F is B-like

E is A-like

D is A-like and

C is B-like (though with lower confidence)

The box-plot: <http://i.imgur.com/epPw7.png> seems to confirm.

~~~
WeWin
Based on the final selections:

there is a 98.3% chance that F is B-like

there is a 98.3% chance that E is A-like

there is an 89.3% chance that D is A-like

there is a 53.1% chance that C is B-like

If these are multiplied together, it appears that there is only a 45.8% chance
that they are all classified correctly?

~~~
mturmon
You're asking a good question -- but you know a lot more than what you write
above.

The main thing is, you know that C, D, E, and F came from _either_ A or B. The
p-values above don't account for that; they just say what's the chance, due to
random fluctuation, that a sample could have come from the same source as A.

That's reflected in the fact that the pairs of p-values don't add to one!
(Like (A,C) and (B,C) in the table above.)

You also implicitly know that at least one of {C,D,E,F} is A-like and one is
B-like (otherwise there would not be a problem). So even if you know P(X and Y
have same source) for all (X,Y), which you don't, you couldn't multiply them.

Finally, the p-value returned by the KS test will underestimate the true
probability of discrepancy. This is because it's only looking at one thing,
the max value of a CDF difference. The significant differences between the
distributions may lie elsewhere, like in the tails, and the KS test is known
to be relatively insensitive to tail behavior. (Although at n=40 you won't be
able to see far into the tails.)

There are a host of other tests that use the same idea (empirical CDF
difference) but weight differently. Some can be more effective than the KS
test if you're looking for certain types of difference. Here's an OK overview,
albeit for the goal of assessing normality:

<http://www.instatmy.org.my/downloads/e-jurnal%202/3.pdf>

In a real problem, it's always a good idea to use more empirical-cdf tests
than just the KS test, to compare variances and other moments as some people
in the thread have done, and to make histogram or CDF plots -- especially if
you're in just 1 dimension and the plots are easy to interpret.
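
(For example, recent scipy versions also ship a k-sample Anderson-Darling test,
scipy.stats.anderson_ksamp, which is one such tail-weighted variant of the same
empirical-CDF idea; a sketch, with the file name and loading step assumed:)

    import numpy as np
    from scipy.stats import ks_2samp, anderson_ksamp

    # assumed: a tab-separated export of the six columns, with a header row
    data = np.genfromtxt("data.txt", delimiter="\t", names=True)
    a, c = data["A"], data["C"]

    ks_stat, ks_p = ks_2samp(a, c)   # statistic is the largest CDF gap
    ad = anderson_ksamp([a, c])      # weights differences in the tails more heavily
    print(ks_p, ad.significance_level)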

------
revorad
I used my app <https://app.prettygraph.com> to quickly look at the histograms.
B and F look almost identical with a very normal distribution. The others look
more like twin-peaks.

Edit: If you want to try, copy the data and click the "Paste data from Excel"
link on the left under the Data tab, then paste and load. Then choose histogram
under the Graph tab. It plots the X variable (and ignores Y). Sorry it's a bit
shabby; I haven't worked on it in a while.

~~~
ovi256
Nice app! Computing the histogram hung on the Apple stock data for me.

~~~
revorad
Thanks! It gets hung sometimes, but it should work if you select any field
other than the Date as the X variable. Refresh and try again if you like.

------
praptak
An interesting problem. My first guess is that the distribution of teenagers'
weights is two-peaked (boys & girls), but this assumes that the teenagers are
around the same age and that the peaks are far enough apart to be detectable in
the samples. I can't be arsed to check this hypothesis though :-)

Anyway, a serious approach to this problem would require comparing against
solid real-life data. Off-the-bat assumptions about the distributions of
real-life data are often very wrong.

------
anthonyb
I can just scan the numbers and see the patterns. Column A has much more
extreme ranges, lots of 40s and 70s, whereas B is much narrower.

So D and E are temperature, and C and F are weights.

------
ColinWright
Several people have said they would, or have, plotted histograms - but how?

* By hand?

* With Excel?

* With R?

* With Processing?

I'm a little sad to see that this item has been flagged heavily, but I guess
there are people who think this is sufficiently off-topic that it should be in
the same category as spam.

~~~
TheColonel
use R

#copy data into a txt file without the first two lines from the xls

par(mfrow = c(3,2))

dat <- read.table("data.txt", sep="\t", header = TRUE)

for(let in c("A","B","C","D","E","F"))

plot(density(dat[,let]), main = let)

easy...(why doesnt HN recognise newlines?)

~~~
Dove
_why doesnt HN recognise newlines?_

It treats single newlines as the same paragraph and double newlines as a new
one. Alternatively, if you want to post code, you can indent it by four
spaces. Like this:

    
    
        par(mfrow = c(3,2))
        dat <- read.table("data.txt", sep="\t", header = TRUE)
        for(let in c("A","B","C","D","E","F"))
        plot(density(dat[,let]), main = let)

~~~
TheColonel
Thanks Dove. I probably should read the posting FAQ or something.

~~~
Dove
Not your fault, actually. That particular quirk may be written down somewhere,
but I've not found it in my three years here.

I offer you knowledge obtained by sheer trial and error. :)

------
Cieplak
Bash

    for i in {1..6}; do
        COL='$'$i
        awk -F, "{delta = $COL - avg; avg += delta / NR; mean2 += delta * ($COL - avg); } END { print sqrt(mean2 / NR); }" list.csv
    done

[troll]

~~~
haldean
Troll... or hero?

~~~
Cieplak
haha thanks, but definitely trolling. This is perhaps the most inelegant
solution in the world. But it does the job in this case. Output is 10.6014,
4.44403, 5.71353, 9.46718, 10.6471, 5.12933, for columns A to F, respectively

------
nichol4s
I would use one of these two tools:

Orange <http://orange.biolab.si/> "Open source data visualization and analysis
for novice and experts. Data mining through visual programming or Python
scripting. Components for machine learning. Extensions for bioinformatics and
text mining. Packed with features for data analytics."

Weka <http://www.cs.waikato.ac.nz/ml/weka/> "Weka is a collection of machine
learning algorithms for data mining tasks. The algorithms can either be
applied directly to a dataset or called from your own Java code. Weka contains
tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well-suited for developing
new machine learning schemes."

------
intended
I calculated the mode in Excel; the weight columns have one, while the
Fahrenheit columns don't.

EDIT: I realize that this was a somewhat curious result.

------
bmunro
I used a spreadsheet to calculate the standard deviations. A, D and E have
roughly twice the deviation of B, C and F (about 10 vs 5).

The means are all the same.

------
demallien
We cannot say with 100% certainty that a particular set of data fits in
either group. This is more of an AI problem than a statistics problem,
although in this particular case the differences between the two types of data
sets are big enough that a relatively crude statistical analysis can generate a
clear answer to the question.

In a more general sense, I would tend to use a Self-Organising Map
(<http://en.wikipedia.org/wiki/Self-organizing_map>) to identify the two
groups of data, and then discriminate between them. The dimensions of the SOM
would be different statistics of the data sets: the mean, the absolute spread,
the standard deviation, the median, etc.

The nice thing about using a SOM is that it will show you whether or not you
have succeeded in finding a measure that can successfully discriminate between
different data sets - whilst at the same time actually doing the
discrimination for you.

------
mshron
Permutation tests! Easy to use and easy to explain.

The two examples have the same mean and median, but differ substantially in
their min/max/standard deviation/median absolute deviation. Pick your spread,
they differ. I got this by poking around in iPython.

Find the standard deviation of A. Reshuffle A and B together, taking half of
the resulting list, calculate the standard deviation, record it and repeat
that a thousand times. See that you basically never get a standard deviation
as large as you did for A alone, so they meaningfully differ.

Repeat on each of the data sets, and you see that D and E are plausibly
similar to A whereas C and F are not. Repeat for B to see if we're being
fucked with, and lo and behold it all seems to work out, though D is a little
iffy.

I used numpy and ten lines of glue code. Happy to post if there is interest.
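
(Not the poster's script, but a rough sketch of the kind of numpy glue code
described above; the file name and loading step are assumptions:)

    import numpy as np

    def spread_permutation_test(x, y, n_iter=1000, seed=0):
        """Fraction of random re-splits of the pooled data whose first half
        has a standard deviation at least as large as std(x)."""
        rng = np.random.default_rng(seed)
        observed = x.std()
        pooled = np.concatenate([x, y])
        hits = 0
        for _ in range(n_iter):
            rng.shuffle(pooled)
            if pooled[:len(x)].std() >= observed:
                hits += 1
        return hits / n_iter

    # assumed: a tab-separated export of the six columns, with a header row
    data = np.genfromtxt("data.txt", delimiter="\t", names=True)
    for col in "CDEF":
        # near 0: reshuffled halves essentially never reach A's spread, so the
        # column differs from A; around 0.5: plausibly A-like
        print(col, spread_permutation_test(data["A"], data[col]))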

------
LachlanArthur
Since the lists are independent, sort them individually and graph them.

Here's the result in Excel: <http://ii.snag.gy/vSt4H.jpg>

Alison's lists are A, D and E

Charlie's lists are B, C and F

~~~
rufibarbatus
Holy cow, I did the exact same thing as you!

For the sake of completeness, my first steps before sorting and plotting were
to:

    
    
      1. take the mean, median and mode of each set,
      2. try and fail to use more sophisticated statistical analysis [1],
      3. have a look at the minimum and maximum value for each set.
    

By #3 I already had a good enough guess, but I felt I had to _see_ it.

[1] In fact, could anyone point me to a refresher? I'm talking about
distribution curves, regression, that kind of thing. I seem to have lost the
hang of it; I couldn't pass my own "5 minutes to implement the simple thing"
criterion.

------
mberning
I'm sure you can easily identify which set is which using statistical methods.
It's probably as simple as finding out which type of distribution each set
resembles. Then again, my undergrad statistics is very rusty.

------
bobby07
Standard dev seemed to be enough. I sorted and graphed the data to be sure.

~~~
ianterrell
Temps have standard deviation around 10ish, weights around 5ish.

------
Dn_Ab
I would look at the mean and variance in each column and cluster by that. My
intuition is that the variance of temperature measured in Fahrenheit will be
greater than that of weight in kg.

Checking it seems that this is borne out. So is the data real? What is being
assessed here?

Interesting to see that people prefer std dev to variance for this problem.
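
(A minimal sketch of that per-column check; the file name and loading step are
assumptions:)

    import numpy as np

    # assumed: a tab-separated export of the six columns, with a header row
    data = np.genfromtxt("data.txt", delimiter="\t", names=True)
    for col in data.dtype.names:
        x = data[col]
        print(col, round(x.mean(), 1), round(x.var(), 1), round(x.std(), 1))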

------
sepent
Drawing several charts with LibreOffice Calc:

[http://www.hostmypic.net/pictures/93969c88a3ecaac84ba81c8d38...](http://www.hostmypic.net/pictures/93969c88a3ecaac84ba81c8d38f776df.png)

And it is almost obvious that we have two data sets: (A, D, E) and (B, C, F)

------
stephen789
Here is the scatter plot. Sorted. The difference isn't hard to see.
[https://docs.google.com/spreadsheet/oimg?key=0Ar0JB_woDdlodD...](https://docs.google.com/spreadsheet/oimg?key=0Ar0JB_woDdlodDZEbFNQVUJSVTg5eWt0UHlwd2oyTkE&oid=3&zx=giv4rpctce72)

------
singular
Excel: compare standard deviations and means in a slightly hand-wavey way.
Interestingly, the mean is around the same for each; however, the standard
deviation varies in such a way as to imply which belongs to which.

I am quite rusty on this stuff I must say!

------
aw3c2
Weird, just looking at the raw numbers I would have expected A to be the
weights and B to be the temperature averages because A has a bigger range.

I guess one could correlate each column with A and B and see which correlates
the most?

------
bjcubsfan
I used python with pylab/matplotlib to sort and plot the columns. It was easy
to spot the two sets. I suppose a histogram would have worked nicely. This
also would have been easy with my selected tools.
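
(A minimal sketch of that sort-and-plot approach; the file name and loading
step are assumptions:)

    import numpy as np
    import matplotlib.pyplot as plt

    # assumed: a tab-separated export of the six columns, with a header row
    data = np.genfromtxt("data.txt", delimiter="\t", names=True)
    for col in data.dtype.names:
        plt.plot(np.sort(data[col]), label=col)  # each column sorted ascending
    plt.legend()
    plt.show()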

------
ideaoverload
Draw the distributions in Excel. B, C and F show a similar, normal-shaped distribution:

<http://i.imgur.com/Dg6OT.png>

------
andrewcooke
as others have said, the first thing to do is plot the data. but if you want a
statistical test (with significances, so that you have some idea of how
confident you can be in your decision) then you could use the two-sample ks
test. it's likely more useful than a simple t-test here because it's sensitive
to shape.

but to be honest the first thing i would do is google for suitable
approaches...

------
WeWin
I have only two numbers, 50 and 60 - one is a weight and the other is a
temperature. Now I have another number, 80. What kind of number is that? No
way to know. The fact that there are multiple values makes no difference - the
data is the data!

------
ars
I would plot some histograms and do it by eye.

------
dustinupdyke
Based on recent events, I'm a little disappointed that nobody proposed a
node.js solution.

------
z01d
SPSS -> Cluster Analysis

------
fmota
A computer.

------
evertonfuller
SPSS.

------
hackermom
Does the thread starter wonder about what software tool (i.e. language) people
would use, or about what method people would use? As far as language goes, the
problem requires mathematics on a level so simple that practically any
programming language is more than sufficient - the histogram method I'd use for
solving this problem would be doable in BASIC on a VIC-20 or on a TI-81 pocket
calculator.

------
georgieporgie
This is silly, but I find it to be more on-topic and fun than a _lot_ of other
items that I see on HN. It's really educational to see how people look at the
problem.

------
wedtm
Math

------
Shamiq
Excel.

------
bherms
Regression.

