

Statistical Reference Datasets - jcr
http://www.itl.nist.gov/div898/strd/

======
pash
Does anyone know of any other, similar resources with datasets and known
results for other techniques in statistics, machine learning, numerical
optimization, etc.?

While working on a library for deriving discrete maximum-entropy distributions
recently, I had trouble coming up with good test cases. For that sort of
thing, you either have to go flip through textbooks, search for other
implementations (and hope they have a decent test suite), or think up problems
to solve with pen and paper (which misses the sorts of problems you need
numerical techniques to solve).

Even a collection of standard problems and datasets without canonical
solutions would be valuable. Then everybody could start collecting results
from different pieces of software to see which ones are anomalous. Does a
website like that exist?
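
As a rough sketch of what that kind of comparison could look like, here is a
toy example in Python. The dataset, the `naive_variance` and
`two_pass_variance` helpers, and the `lre` scoring function are all made up
for illustration; the only real library call is `statistics.variance`. The
score is the log relative error, roughly the number of significant digits that
agree with the reference value. The reference value here (a sample variance of
exactly 1) comes from pen and paper, not from any published dataset.

```python
import math
import statistics

# Hypothetical toy dataset: three points with a large offset.
# By hand: mean = 1e9 + 2, sample variance = ((-1)**2 + 0**2 + 1**2) / 2 = 1.
DATA = [1e9 + 1, 1e9 + 2, 1e9 + 3]
CERTIFIED = 1.0  # hand-derived reference value (assumed nonzero in lre below)


def naive_variance(xs):
    """One-pass textbook formula; prone to catastrophic cancellation."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(v * v for v in xs) - n * mean * mean) / (n - 1)


def two_pass_variance(xs):
    """Compute the mean first, then sum the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((v - mean) ** 2 for v in xs) / (n - 1)


def lre(computed, certified, cap=15.0):
    """Log relative error: approximate count of matching significant digits."""
    if computed == certified:
        return cap
    rel_err = abs(computed - certified) / abs(certified)
    return max(0.0, min(cap, -math.log10(rel_err)))


# Each entry stands in for "a different piece of software".
results = {
    "naive one-pass": naive_variance(DATA),
    "two-pass": two_pass_variance(DATA),
    "statistics": statistics.variance(DATA),
}

for name, value in results.items():
    print(f"{name:16s} {value!r:>8}  LRE = {lre(value, CERTIFIED):.1f}")
```

In double precision the one-pass formula should lose essentially every digit
here, since the squares are around 1e18 and cannot hold the low-order bits,
while the other two should match the hand-computed value. That is the kind of
disagreement a shared problem collection would surface, even without a
canonical answer.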

~~~
jcr
Finding extremely clean, tested, and accurate "Reference" quality datasets
intended for verification and calibration purposes can be painfully difficult.
Making a claim that your data is "correct" and usable for validation is a
fairly bold move, and few are willing to go that far.

I believe that NIST is your best bet for finding reliable "Reference"
datasets. In the distant past (i.e. around 2000), NIST made various "Special
Databases" available. Some of them may be useful for the kinds of things you
mentioned. The information on this FTP server is ancient and crufty, but might
help in some way.

ftp://sequoyah.ncsl.nist.gov/pub/databases/catalog.txt

Also check the "README" files in the various directories. Though they are not
the complete databases, there's some sample data available:

ftp://sequoyah.ncsl.nist.gov/pub/databases/data/

I simply don't know if NIST still has reference datasets like the old "Special
Databases" that they used to provide, so I think your best bet is to contact
them and ask.

