
One-Shot Training of Neural Networks Using Hypercube-Based Topological Coverings - ghosthamlet
http://arxiv.org/abs/1901.02878v1
======
rococode
This seems like a really interesting approach and I think the numbers are
promising. Given that it's an entirely different type of model-building than
traditional methods, I think it's fine for it to just be up to par with a
basic shallow model. If the constructive approach turns out to be comparable
to current state-of-the-art models with sufficient refinement, it could be
really valuable for low-compute applications like IoT devices, etc.

To be honest, I can't say I know enough about the math here to do anything
more than vaguely follow their explanations despite my ML/NLP background. I'm
curious - other ML researchers out there, how much of this are you able to
understand? My impression is that this math is pretty far beyond what ML folks
typically know, although I'm on the lower end of the spectrum as far as math
knowledge goes, so I may be totally wrong (and need to spend more time reading
textbooks haha). I wonder if the complexity may slow down progress if it does
turn out that this kind of geometric construction can compete with iterative
training. It sounds like this approach could potentially support more complex
networks by working more on the geometric representation, so I certainly hope
this paper serves its purpose of motivating people with the right skillsets to
do further exploration.

~~~
logicchains
In this case the fundamental idea is not especially complicated, but the math
may make it seem more complex than it is to somebody unfamiliar with the
notation/vocabulary.

Imagine the case where an input dataset has three dimensions: each piece of
input data could be represented as a point in 3D space. We want to classify
each point with a category. Imagine a cube containing all such points, and
imagine the cube was subdivided into many tiny cubes of the same size. If
these tiny cubes were sufficiently small, then each cube would only contain
points belonging to a single category (sufficiently small cubes would only
contain a single point, making this trivially true). We could then assign to
each cube a category based on the category of the points in it, then if we
wanted to classify a new point, we'd just check what cube it's inside and look
up the category of that cube.
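The grid-lookup idea above can be sketched in a few lines of Python. This is only an illustration of the naive scheme (not the paper's algorithm), and all names here are made up for the example:

```python
from collections import Counter, defaultdict

def cell_of(point, cell_size):
    # Map a point to the index of the tiny cube that contains it.
    return tuple(int(coord // cell_size) for coord in point)

def build_grid(points, labels, cell_size):
    # Assign each occupied cell the majority category of its points.
    buckets = defaultdict(list)
    for p, y in zip(points, labels):
        buckets[cell_of(p, cell_size)].append(y)
    return {cell: Counter(ys).most_common(1)[0][0]
            for cell, ys in buckets.items()}

def classify(grid, point, cell_size):
    # Look up the category of the cell a new point falls into.
    return grid.get(cell_of(point, cell_size))

pts = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.15, 0.2, 0.1)]
ys = ["a", "b", "a"]
grid = build_grid(pts, ys, cell_size=0.5)
print(classify(grid, (0.2, 0.05, 0.3), cell_size=0.5))  # prints "a"
```

Note the scaling problem: with d dimensions and k cells per axis, the grid has k^d cells, which is exactly why the paper moves to adaptive subdivision.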

The above process doesn't scale well (would be too slow in practice). An
alternative is, rather than subdividing the cube into tiny evenly-sized cubes,
we use an iterative process, starting with the whole cube, finding an ideal
dimension along which to split it, splitting it (subject to the constraint
that splitting it wouldn't make the children too small), then repeating this
process with the two 3D rectangles that resulted from splitting the original
shape. This could produce extremely long, thin 3D rectangles however, which
might not generalise well to new data, so we cap the aspect ratio of the child
rectangles so they can't get too thin. This approach may leave some areas of
the original cube unclassified, so for each such area we find the category of
the 3D rectangle that the unclassified area touches most, and assign that
category to the unclassified area, repeating until there are no unclassified
areas remaining. The end result approximates the result of subdividing the
cube into tiny cubes, but is significantly faster to produce.
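A toy sketch of this adaptive subdivision, using a simple "split the widest axis" rule with a minimum-size stop. The paper's actual split criterion is more elaborate, and this sketch omits the aspect-ratio cap and the unclassified-region merging described above:

```python
def subdivide(lo, hi, pts, ys, min_size=0.1):
    # Recursively split the box [lo, hi] and return a flat list of
    # (lo, hi, label) boxes covering the data.
    if not pts or len(set(ys)) == 1:
        return [(lo, hi, ys[0] if ys else None)]  # pure (or empty) leaf
    widths = [h - l for l, h in zip(lo, hi)]
    axis = max(range(len(lo)), key=lambda d: widths[d])  # widest axis first
    if widths[axis] / 2 < min_size:
        return [(lo, hi, max(set(ys), key=ys.count))]    # too small: majority vote
    mid = (lo[axis] + hi[axis]) / 2
    hi_l = tuple(mid if d == axis else hi[d] for d in range(len(hi)))
    lo_r = tuple(mid if d == axis else lo[d] for d in range(len(lo)))
    left = [(p, y) for p, y in zip(pts, ys) if p[axis] < mid]
    right = [(p, y) for p, y in zip(pts, ys) if p[axis] >= mid]
    return (subdivide(lo, hi_l, [p for p, _ in left], [y for _, y in left], min_size)
            + subdivide(lo_r, hi, [p for p, _ in right], [y for _, y in right], min_size))

boxes = subdivide((0.0, 0.0), (1.0, 1.0), [(0.2, 0.2), (0.8, 0.8)], ["a", "b"])
print(boxes)  # two labeled boxes, split at x = 0.5
```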

The above approach is roughly what's described in the paper, except they
describe an n-dimensional hypercube rather than just a 3-dimensional one, and
provide more detail on how to determine the ideal axis along which to split a
hyperrectangle into child hyperrectangles.

~~~
taneq
This sounds like a K-d tree. Am I understanding it right or have I missed
something?

~~~
logicchains
Yep, pretty much, except it's constructed differently to how we'd normally
construct a K-d tree.

~~~
taneq
Cool, thanks for the translation! I'm gonna go read the paper now...

------
princeofwands

        from sklearn.datasets import fetch_openml
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier
        from sklearn.metrics import accuracy_score

        X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

        # Train on 200 samples (10 classes x 20), test on all the rest.
        X_train, X_test, y_train, y_test = train_test_split(X, y,
            stratify=y,
            random_state=1729,
            test_size=X.shape[0] - (10 * 20))

        model = MLPClassifier(random_state=1729)
        model.fit(X_train, y_train)
        p = model.predict(X_test)
        print(accuracy_score(y_test, p))

        # Same again with 2000 training samples (10 classes x 200).
        X_train, X_test, y_train, y_test = train_test_split(X, y,
            stratify=y,
            random_state=1729,
            test_size=X.shape[0] - (10 * 200))

        model = MLPClassifier(random_state=1729)
        model.fit(X_train, y_train)
        p = model.predict(X_test)
        print(accuracy_score(y_test, p))

This gets you 0.645 and 0.838 accuracy respectively (versus 62% and 76% in
the paper). Sure, the validation differs: I validate on all the remaining
data, while they do 20 repeated 70/30 splits of the 200 and 2000 samples,
which needlessly lowers the number of training samples (a fairer comparison
is 0.819 with 1400 samples). Still, the scores seem at least comparable. Cool
method though; I can dig this and look beyond benchmarks (though Iris and
Wine are really toy datasets by now).

~~~
spi
Perhaps more importantly than the limit of 200/2000 samples, the main
difference from your code is that the paper uses only the first 10 PCA
components (i.e. X is 10-dimensional instead of 784-dimensional). That
tends to contradict their claim that the proposed algorithm scales "weakly"
with dimension, but for just 10 dimensions (linear combinations of the
input ones) 76% is not that bad.
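To mirror the paper's 10-dimensional input, one could prepend a PCA step to an MLP pipeline. A sketch with synthetic stand-in data (swap in the MNIST X_train/y_train from the snippet above to reproduce the actual setting; only the component count comes from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for MNIST, just to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)

model = make_pipeline(
    PCA(n_components=10),                          # keep the first 10 components
    MLPClassifier(random_state=1729, max_iter=1000),
)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the 10-dim features
```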

Personally, my doubts are more in line with the others raised here, namely
tdj's comment on not building on existing literature (the feeling of "this
just looks too much like a classification tree" is very strong while
reading), and scalability issues. I also tried something vaguely similar in
the past (but using iterated linear discriminant analysis instead of
hypercubes / classification trees, and convolutional networks instead of
fully connected ones), but never even finished the prototype because it was
so terribly slow even on MNIST that nobody in their right mind would have
used it (I could only load a few samples into RAM at a time, which made it
messy).

In any case, it's a pity they didn't try to use these weights as
initialization for the "usual" backpropagation; it might have led to
interesting results (especially if in doing so they extended to the whole of
the input and not just the first 10 PCA components, or at least to more than
10).

~~~
princeofwands
Scores become more equal when you make the first hidden layer size 10 instead
of 100 (both methods use an X of 784 dimensions).

Instead of PCA on all features, they could subsample 10 random features to
partition, and bag the results of multiple runs. Basically, Totally Random
Trees with an arbitrarily handicapped Random Subspaces method. Scales well,
can beat Logistic Regression, but not any of the more developed tree methods.
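The "totally random trees over random feature subspaces" combination described above can be approximated in sklearn by bagging randomized trees, each restricted to a random subset of features. A sketch on synthetic data (all hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import ExtraTreeClassifier

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

model = BaggingClassifier(
    ExtraTreeClassifier(random_state=0),  # split thresholds chosen at random
    n_estimators=50,
    max_features=10,    # each tree sees a random 10-feature subspace
    bootstrap=False,    # subspaces only, no bootstrap resampling
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))  # training accuracy of the bagged ensemble
```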

Another difference with established literature is that this algorithm does not
use any kind of knowledge transfer from previously learned classes. In most
one-shot methods, including those used by humans, the model is already trained
on other classes, and uses this information to adapt to unseen classes.
Instead, the authors interpret it solely as deep learning under a small
training data setting (which -- my code shows -- does not require jumping
through hoops).

------
tdj
Their hypercube covering formalism can be seen as decision tree induction
with a specific partitioning rule, terminating branching only at uniformly
labeled leaves. But they are using the tree nodes as a kind of embedding to
apply a softmax on. I like the connection between ReLUs and the geometric
representation; it makes it easier to think about in spatial terms.
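The "tree nodes as an embedding fed into a softmax" reading can be tried directly in sklearn by encoding each sample as a one-hot vector of the leaves it lands in, then fitting a multinomial logistic regression (the softmax layer) on top. A sketch on synthetic data, not a reproduction of the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

model = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, random_state=0),  # one-hot leaf indices
    LogisticRegression(max_iter=1000),                      # softmax over classes
)
model.fit(X, y)
print(model.score(X, y))  # training accuracy of the leaf-embedding model
```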

Reading this I got repeated déjà vu from my grad school classes on classical
ML. I like the direction, but it feels like it could be better if it
admitted that it's a variant of decision tree embedding and built on the
massive amount of research in that area, at least in terms of understanding.

I suspect doing a random forest version of this would actually help. Perhaps
we will see this as a legit pre-training step.

~~~
princeofwands
> terminating branching only at uniformly labeled leaves

Also called Perfect Decision Tree.

------
tbenst
Wonder how this compares to single shot with an SVM or nearest neighbor. 76%
on MNIST is frankly embarrassing
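That baseline is quick to check: train an SVM and a 1-nearest-neighbor classifier on the same tiny split and compare. A sketch using sklearn's small digits set as a stand-in for MNIST to keep it fast (split sizes chosen to echo the 200-sample setting, not taken from the paper):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=10 * 20,
                                          stratify=y, random_state=0)

scores = {}
for clf in (SVC(), KNeighborsClassifier(n_neighbors=1)):
    # Fit each baseline on the same 200-sample split and score on the rest.
    scores[type(clf).__name__] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(clf).__name__, scores[type(clf).__name__])
```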

------
calvinmorrison
Interesting, we spent the last few days digging into the old Time Warp OS

[https://lasr.cs.ucla.edu/reiher/Time_Warp.html](https://lasr.cs.ucla.edu/reiher/Time_Warp.html)

------
stealthcat
MNIST is a joke. I can train a linear model and achieve 90% accuracy.

CIFAR-10 is the new MNIST today.
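Taking the linear-baseline claim above at face value (logistic regression being the usual linear classifier for this), a quick check on sklearn's small digits set as a stand-in for MNIST. The ~90% figure is the commenter's claim about MNIST and is not verified here:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A plain linear (softmax) classifier, no feature engineering.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy of the linear baseline
```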

