    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

    # Train on 10 * 20 = 200 samples (roughly 20 per class, via stratify);
    # test on everything else.
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        stratify=y,
        random_state=1729,
        test_size=X.shape[0] - (10 * 20))

    model = MLPClassifier(random_state=1729)

    model.fit(X_train, y_train)
    p = model.predict(X_test)

    print(accuracy_score(y_test, p))

    # Repeat with 10 * 200 = 2000 training samples.
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        stratify=y,
        random_state=1729,
        test_size=X.shape[0] - (10 * 200))

    model = MLPClassifier(random_state=1729)

    model.fit(X_train, y_train)
    p = model.predict(X_test)

    print(accuracy_score(y_test, p))
	
This gets you 0.645 and 0.838 accuracy respectively (versus 62% and 76% in the paper). Sure, the validation differs: I validate on all of the remaining data, while they do 20 repeated 70%/30% splits of the 200 and 2000 samples, which needlessly lowers the number of training samples (a fairer comparison is 0.819 with 1400 training samples). Still, the scores seem at least comparable. Cool method though, I can dig this and look beyond benchmarks (though Iris and Wine are really toy datasets by now).
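
For reference, their protocol on the 2000-sample case would look roughly like this in sklearn, continuing from the snippet above (a sketch; the seed and the exact way the subset is drawn are my assumptions):

    # Sketch of the paper-style protocol: take a 2000-sample stratified subset,
    # then average accuracy over 20 repeated 70%/30% splits of that subset
    # (so only ~1400 samples actually train each model).
    import numpy as np
    from sklearn.model_selection import StratifiedShuffleSplit

    X_small, _, y_small, _ = train_test_split(X, y,
        stratify=y,
        random_state=1729,
        train_size=10 * 200)
    # In case fetch_openml returned a DataFrame, make integer indexing safe.
    X_small, y_small = np.asarray(X_small), np.asarray(y_small)

    splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.3,
        random_state=1729)
    scores = []
    for train_idx, test_idx in splitter.split(X_small, y_small):
        clf = MLPClassifier(random_state=1729)
        clf.fit(X_small[train_idx], y_small[train_idx])
        scores.append(accuracy_score(y_small[test_idx],
            clf.predict(X_small[test_idx])))

    print(np.mean(scores))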


Perhaps more importantly than the limit of 200/2000 samples, the main difference from your code is that in the paper they only use the first 10 principal components (i.e. X is 10-dimensional instead of 784-dimensional). That choice tends to contradict their claim that the proposed algorithm scales "weakly" with dimension, but for using just 10 dimensions (obtained as linear combinations of the inputs) 76% is not that bad.
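
In the sklearn setup above, that difference would look roughly like this (a sketch; whether they fit the PCA on the training subset only, as here, or on more data is an assumption on my part):

    # Sketch: project onto the first 10 principal components before the MLP,
    # as the paper reportedly does. The PCA is fit on the training subset only.
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(PCA(n_components=10),
                        MLPClassifier(random_state=1729))
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))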

Personally, my doubts are more in line with the others raised here, namely tdj's comment about not building on existing literature (the feeling that "this just looks too much like a classification tree" is very strong while reading), and the scalability issues. I also tried something vaguely similar in the past (but using iterated linear discriminant analysis instead of hypercubes / classification trees, and convolutional networks instead of fully connected ones), but never even finished the prototype because it was so terribly slow even on MNIST that nobody in their right mind would have used it (I could only load a few samples into RAM at a time, which made it messy).

In any case, it's a pity they didn't try to use these weights as initialization for the "usual" backpropagation; it might have led to interesting results (especially if in doing so they had extended it to the whole input rather than just the first 10 principal components, or at least to more than 10).


Scores become more comparable when you make the first hidden layer size 10 instead of 100 (with both methods using an X of 784 dimensions).
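
In the code above that's just (minimal sketch):

    # Same experiment, but with a single hidden layer of 10 units instead of
    # the default 100.
    model = MLPClassifier(hidden_layer_sizes=(10,), random_state=1729)
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))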

Instead of PCA on all features, they could subsample 10 random features to partition, and bag the results of multiple runs. That's basically Totally Random Trees combined with an arbitrarily handicapped Random Subspaces method. It scales well and can beat Logistic Regression, but not any of the more developed tree methods.
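
A rough sklearn sketch of that idea, reusing the X_train/y_train from the snippet at the top (the estimator and parameter choices are mine, not from the paper):

    # Sketch: bag "totally random" trees (random feature, random threshold at
    # every split), each grown on a random subspace of 10 features.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import ExtraTreeClassifier

    bag = BaggingClassifier(
        ExtraTreeClassifier(max_features=1, random_state=1729),
        n_estimators=100,
        max_features=10,   # 10 random features per tree (random subspaces)
        bootstrap=False,   # each tree sees all samples, only the features vary
        random_state=1729)
    bag.fit(X_train, y_train)
    print(accuracy_score(y_test, bag.predict(X_test)))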

Another difference from the established literature is that this algorithm does not use any kind of knowledge transfer from previously learned classes. In most one-shot methods, including those used by humans, the model has already been trained on other classes and uses that knowledge to adapt to unseen ones. Instead, the authors interpret the problem solely as deep learning in a small-training-data setting (which, as my code shows, does not require jumping through hoops).


If an established approach that has been made into a library matches a novel approach, then it's not something to "look beyond" -- it's validation that the novel approach is probably worth investigating more fully.


