Does anyone know of any other, similar resources with datasets and known results for further techniques in statistics, machine-learning, numerical optimization, etc.?
While working on a library for deriving discrete maximum-entropy distributions recently, I had trouble coming up with good test cases. For that sort of thing, you either have to flip through textbooks, search for other implementations (and hope they have a decent test suite), or think up problems you can solve with pen and paper (which misses the sorts of problems you need numerical techniques to solve).
Even a collection of standard problems and datasets without canonical solutions would be valuable. Then everybody could start collecting results from different pieces of software to see which ones are anomalous. Does a website like that exist?
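As an illustration of the kind of test case in question, here is a minimal sketch, not taken from any actual library. It assumes a single mean constraint, in which case the maximum-entropy distribution is known to have the form p_i ∝ exp(λ·x_i), and it finds λ by bisection. The function name `maxent_with_mean` and the bracket [-50, 50] for λ are made up for this example.

```python
import math

def maxent_with_mean(values, target_mean, tol=1e-12):
    """Maximum-entropy pmf on `values` subject to a fixed mean.

    Hypothetical helper: uses the known form p_i ∝ exp(lam * x_i)
    and solves for lam by bisection on an assumed bracket.
    """
    if not min(values) < target_mean < max(values):
        raise ValueError("target_mean must lie strictly inside the support")

    def pmf(lam):
        # Subtract the largest exponent before exponentiating to avoid overflow.
        top = max(lam * x for x in values)
        weights = [math.exp(lam * x - top) for x in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * x for p, x in zip(pmf(lam), values))

    # The mean is increasing in lam, so bisection applies.
    lo, hi = -50.0, 50.0  # assumed bracket, wide enough for small supports
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return pmf((lo + hi) / 2)

faces = [1, 2, 3, 4, 5, 6]

# Pen-and-paper case: a mean of 3.5 must give the uniform distribution.
uniform = maxent_with_mean(faces, 3.5)
assert all(abs(p - 1 / 6) < 1e-9 for p in uniform)

# Numerical case: a mean of 4.5 has no tidy closed form, but the result
# must still satisfy the constraint and have a constant ratio p[i+1]/p[i].
loaded = maxent_with_mean(faces, 4.5)
ratios = [loaded[i + 1] / loaded[i] for i in range(len(faces) - 1)]
assert abs(sum(p * x for p, x in zip(loaded, faces)) - 4.5) < 1e-9
assert max(ratios) - min(ratios) < 1e-9
```

The second check shows the gap described above: it verifies structural properties of the answer rather than comparing against an independently published value, which is exactly where a shared collection of reference results would help.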
Finding extremely clean, tested, and accurate "Reference" quality
datasets intended for verification and calibration purposes can
be painfully difficult. Making a claim that your data is "correct" and
usable for validation is a fairly bold move, and few are willing to go
that far.
I believe that NIST is your best bet for finding reliable "Reference"
datasets. In the distant past (i.e. around 2000), NIST made various
"Special Databases" available. Some of them may be useful for the kinds
of things you mentioned. The information on the FTP server below is
ancient and crufty, but might help in some way. Also check the "README"
files in the various directories. They are not the complete databases,
but some sample data is available:
ftp://sequoyah.ncsl.nist.gov/pub/databases/data/
I simply don't know if NIST still has reference datasets like the old
"Special Databases" that they used to provide, so I think your best bet
is to contact them and ask.
The UCI machine learning repository [1] contains a wealth of data sets intended to be used for machine learning. Many of the data sets have had analyses performed on them that could be considered canonical. For example, the Abalone dataset's [2] associated problem is the prediction of a specimen's age, given its measurements. The problem has been analysed thoroughly; a cursory Google search for "Abalone data set" reveals that plenty of people have considered the problem.
Also Amazon (via AWS) [3] have made it really easy to access public data sets.