
Learning@home hivemind – train large neural networks across the internet - sebg
https://learning-at-home.github.io/
======
awalton
Lots of talk about fault tolerance, not a lot of talk about trusting peers and
preventing them from introducing bad data into your presumably precious
model...

So if you're forced to trust all of the peers, how is this better than a
cloud? Who out there is training models for purely benevolent reasons (i.e.
non-profit seeking) and can trust random nodes? If not for purely benevolent
reasons, who out there is going to donate CPU time to training your model,
essentially writing you a blank check?

~~~
hnjst
Shameless plug... [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent](http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent)
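
In case it's useful context: the paper's Krum rule replaces plain averaging with picking the submitted gradient that sits closest (in aggregate) to its nearest neighbors, so a bounded number of Byzantine peers can't drag the update arbitrarily far. A rough NumPy sketch of that selection rule (my own illustration, not the authors' code; `f` is the assumed number of Byzantine peers):

    import numpy as np

    def krum(gradients, f):
        """Pick the gradient with the smallest summed squared distance to its
        n - f - 2 nearest neighbors (roughly the Krum selection rule)."""
        n = len(gradients)
        assert n > 2 * f + 2, "Krum needs n > 2f + 2 peers"
        grads = np.stack(gradients)                      # shape: (n, dim)
        # pairwise squared distances between submitted gradients
        dists = np.sum((grads[:, None, :] - grads[None, :, :]) ** 2, axis=-1)
        scores = []
        for i in range(n):
            d = np.delete(dists[i], i)                   # distances to the others
            closest = np.sort(d)[: n - f - 2]            # n - f - 2 nearest neighbors
            scores.append(closest.sum())
        return grads[int(np.argmin(scores))]             # most "central" gradient

    # toy usage: 8 honest peers near the true gradient, 2 adversarial outliers
    rng = np.random.default_rng(0)
    honest = [rng.normal(1.0, 0.1, size=10) for _ in range(8)]
    byzantine = [np.full(10, 100.0) for _ in range(2)]
    print(krum(honest + byzantine, f=2))                 # stays close to ~1.0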

~~~
awalton
That's a damned timely shameless plug :). I'll add it to the reading list.

~~~
Iv
Thanks :-)

------
zabhi
Why would an average user want to participate in such a network? The only
reasons I can think of are tied to benevolence and altruism of the
participants. We saw this being successful in the SETI project. I doubt paying
participants would ever be profitable enough for either the operator (why not
just rent a few machines in the cloud?) or the participants (training would
cost power and CPU time).

How about tying the training and consumption of the model together? An
internet scale tool with a focused goal, like Alexa/Mycroft for speech and
intention recognition, that trains a distributed model while pushing
improvements back might be more successful in getting adoption.

~~~
0-_-0
Lots of people have machines sitting idle that they could rent out for neural
network training as long as the money they receive is more than the cost of
electricity. Cloud computing is significantly more expensive than just the
cost of electricity.
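
Back-of-envelope (all numbers below are assumptions, not measured prices): a ~300 W GPU box on typical residential electricity costs a few cents per hour to run, while on-demand cloud GPUs are on the order of a dollar or more per hour, so there's a wide band where a peer earns more than their power bill while still undercutting the cloud.

    # Back-of-envelope: what a peer could charge vs. electricity cost.
    # All figures below are assumptions for illustration, not measured prices.
    gpu_power_kw = 0.3          # assumed draw of a single-GPU desktop under load
    electricity_per_kwh = 0.15  # assumed residential price, USD
    cloud_gpu_per_hour = 1.50   # assumed on-demand cloud price for a comparable GPU

    electricity_per_hour = gpu_power_kw * electricity_per_kwh
    print(f"electricity cost: ${electricity_per_hour:.3f}/hr")   # ~$0.045/hr
    print(f"cloud price:      ${cloud_gpu_per_hour:.2f}/hr")
    # Anything a peer charges between ~$0.05/hr and ~$1.50/hr beats both sides.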

------
Nerada
I have no idea how ANNs work, but those GPT-3 numbers make it look like the
barrier to better AI is an expense issue (compute/financial), whereas I always
just assumed we lacked understanding or some better algorithm.

~~~
TaylorAlexander
“The scaling hypothesis” is the name given to the idea that the existing
algorithms might be all we need if we just throw more compute at it. Certainly
GPT-3 is a very interesting data point here. However we definitely also need
better algorithms. It’s a mix of scaling and new algorithms that will get us to
AGI.

~~~
ricklamers
I think that 'just' scaling today's algorithms is quite a naive approach, as it
would imply needing huge amounts of training samples for simple tasks (simple
to humans). Given that humans tend to need an order of magnitude fewer samples
before being able to generalize, I think we need more than just scaled-up
versions of today's NNs, SVMs, trees, what-have-you.

~~~
Reelin
I agree that there are clearly algorithmic improvements remaining to be made.
However, a counterpoint to your specific example would be the lottery ticket
hypothesis and related weight agnostic neural networks.

[https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635)

[https://ai.facebook.com/blog/understanding-the-generalization-of-lottery-tickets-in-neural-networks/](https://ai.facebook.com/blog/understanding-the-generalization-of-lottery-tickets-in-neural-networks/)

[https://ai.googleblog.com/2019/08/exploring-weight-agnostic-neural.html](https://ai.googleblog.com/2019/08/exploring-weight-agnostic-neural.html)

~~~
ricklamers
The way I interpret the lottery ticket hypothesis is that you don't actually
need the full sized networks (with their structure and parameters) in order to
perform well at some tasks (when comparing performance against larger
networks). I think everyone agrees that most neural networks are highly
overparameterized as successful distillation efforts have shown.

However, this doesn't directly make my point about the sample efficiency of
today's algorithms compared to humans less valid. What I'll grant you is that
with smaller networks the required sample size is expected to shrink (due to
the curse of dimensionality), although expressiveness is clearly harmed by the
reduced parameter count/altered network structure, which possibly reduces the
ability of the network to perform well on certain tasks.

I think it's important to clearly make a distinction between the required
amount of computation and the number of data samples that are necessary when
talking about scaling up existing methods. Compute is "cheap", while data
isn't.

As a side note, I think the usefulness of the lottery ticket hypothesis is
mostly about the ability of random initialization to already give a hint about
the quality of the 'prior' that is encoded by the network structure. Useful
for less computationally intense architecture search as also suggested by the
papers and a paper by Andrew Ng on this topic.

~~~
Reelin
> The way I interpret the lottery ticket hypothesis is that you don't actually
> need the full sized networks (with their structure and parameters) in order
> to perform well at some tasks

Actually that's not the point. Pruning typically results in networks that
still perform well but are harder to train. The idea is to explicitly search
for a subnetwork (via pruning) that is easy to train.
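
Concretely, the recipe there is roughly: train the dense network, prune the smallest-magnitude weights, rewind the survivors to their original initialization, and retrain only that subnetwork. A simplified one-shot PyTorch-style sketch (the `train_fn` hook and layer-wise pruning details are assumed/glossed over; not the paper's code):

    import copy
    import torch

    def find_winning_ticket(model, train_fn, prune_fraction=0.8):
        """One-shot lottery-ticket sketch: train, prune by magnitude,
        rewind the remaining weights to their original initialization."""
        init_state = copy.deepcopy(model.state_dict())   # remember the random init
        train_fn(model)                                   # train the dense network

        masks = {}
        for name, param in model.named_parameters():
            if param.dim() > 1:                           # prune weight matrices only
                k = max(1, int(param.numel() * prune_fraction))
                threshold = param.abs().flatten().kthvalue(k).values
                masks[name] = (param.abs() > threshold).float()

        # rewind surviving weights to their original values, zero out the rest
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])

        # retrain the sparse "ticket" (a full implementation would reapply the
        # masks after each optimizer step to keep the pruned weights at zero)
        train_fn(model)
        return model, masks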

> Although what I'll give you is that with smaller networks the required
> sample size is expected to shrink (due to curse of dimensionality).

I'm not so sure about that either. From
([https://arxiv.org/abs/2001.08361](https://arxiv.org/abs/2001.08361)):

> Larger models are significantly more sample-efficient, such that optimally
> compute-efficient training involves training very large models on a
> relatively modest amount of data and stopping significantly before
> convergence.

My second and third links are also important! The second talks about
generalizing "winning" tickets across other datasets and optimizers. The third
talks about weight agnostic neural networks, which in a nutshell are still
capable of more-or-less performing a task even with _randomized_ weights.

Weight agnostic networks have a lot of parallels to wildlife that is capable
of certain behaviors required for survival effectively immediately, before
there's been a chance for significant learning to take place. This is the
counterpoint I was referring to - an equivalent phenomenon could explain (at
least partially) why humans require so much less data when learning.

~~~
ricklamers
> Actually that's not the point. Pruning typically results in networks that
> still perform well but are harder to train. The idea is to explicitly search
> for a subnetwork (via pruning) that is easy to train.

They state "smaller network, same test accuracy, with similar number of
iterations". So it seems the original network size wasn't necessary for best
test accuracy, and compute requirement is reduced only because it's a smaller
network. Sample efficiency isn't increased according to
[https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635).

Good performance with random weights seems to indicate good 'priors' encoded
in the network. Like how convolutional networks encode the prior of
translational invariance and hence are naturally good performers on image
inputs/tasks.

I think the parallel to "wildlife that is capable of certain behaviors ...
before there's been a chance for significant learning to take place" is that
priors are also part of biological intelligence. I.e. brain structure at birth
enabling certain survival-oriented behaviors.

Hence, I'm optimistic about transfer learning, which could happen through
_both_ better models (priors that generalize well) and pretrained weights
(possibly partially pretrained, i.e. just initial feature extraction). Either
could potentially provide a better starting point from the 'how many samples
are necessary for good performance on a variety of tasks' perspective.
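
For example, the usual recipe of freezing a pretrained feature extractor and training only a small task head is exactly this kind of reuse of weights/priors, and it typically needs far fewer labeled samples. A small sketch using torchvision's standard ResNet-18 (the model choice and 10-class head are just assumptions for illustration):

    import torch
    import torch.nn as nn
    from torchvision import models

    # start from weights pretrained on a large generic dataset (ImageNet here)
    model = models.resnet18(pretrained=True)

    # freeze the pretrained feature extractor: its "priors" are reused as-is
    for param in model.parameters():
        param.requires_grad = False

    # replace the classification head with one for a small 10-class target task;
    # only these few thousand parameters are trained on the (small) new dataset
    model.fc = nn.Linear(model.fc.in_features, 10)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    # ...then run an ordinary training loop over the small labeled dataset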

The point is that either way information needs to be added for performance on
tasks to increase. Doing that in a task specific way by using today's
algorithms and a billion samples doesn't seem like the right approach. Finding
algorithms, models or specifically perhaps neural network architectures
(including training procedures, regularizers, loss function, weight tying)
that generalize across tasks without needing many samples due to their
informative priors seems like the way forward to me. That's _not_ a naive
scaling of today's algorithms to larger and larger training sets, which was the
point I was trying to make.

------
smabie
One of my friends has a startup where individuals can sell computer time to
the highest bidder. I've told him that I didn't think it was a good idea, but
this library could change that. I wonder what the performance overhead is
like.

He is focused on the gaming space, but with this, the data science space might
make more sense.

~~~
Nerada
Chessbase actually does something similar, where you can rent other users'
computers to run analysis on positions[1]. The users offering up their
machines set the price though, as opposed to an auction.

[1] [https://en.chessbase.com/post/tutorial-how-does-the-engine-cloud-work](https://en.chessbase.com/post/tutorial-how-does-the-engine-cloud-work)

------
sdenton4
Curious how this deals with moving training data around... If your dataset is
a few GBs, moving it around is a good bit of overhead, and it takes a decent
chunk of local disk space on the host system. Probably not bad if there's a
consistent task, but seems like a big problem if the tasks change often.

~~~
justheuristic
/* hypothesizing */ If you're using it for NLP, your dataset (token ids)
typically weighs much less than the intermediate tensors. So, I see two
scenarios here:

(1) distribute data chunks as you train, using more conventional bittorrent
systems (e.g. [https://academictorrents.com](https://academictorrents.com) but
internal)

(2) since you most likely use raw unlabeled data (e.g. just text), peers can
crawl it straight from the web
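
To put rough numbers on "token ids weigh much less than intermediate tensors" (all figures below are assumptions, just for scale):

    # Rough size comparison (assumed, illustrative numbers only).
    batch, seq_len = 8, 1024
    hidden, layers = 1600, 48          # roughly GPT-2-XL-sized transformer

    token_ids_bytes = batch * seq_len * 4                      # int32 ids
    activations_bytes = batch * seq_len * hidden * layers * 4  # fp32 hidden states

    print(f"token ids:   {token_ids_bytes / 2**20:.2f} MiB")    # ~0.03 MiB
    print(f"activations: {activations_bytes / 2**30:.2f} GiB")  # ~2.3 GiB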

~~~
sdenton4
Yeah, it's probably less of a concern for text tasks, where the data per
example is relatively light (though there is a whole internet's worth of text
data...)

I mostly work with audio, where individual examples are ~2MB, so the dataset
sizes get very heavy quickly.
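
For a sense of scale (assumed figures): even a modest corpus at ~2 MB per clip is orders of magnitude heavier than a comparable text corpus.

    # Rough dataset-size comparison, assumed figures for illustration.
    examples = 1_000_000
    audio_bytes_per_example = 2 * 2**20     # ~2 MB per clip, as above
    text_bytes_per_example = 2_000          # ~2 KB of raw text per example

    print(f"audio dataset: {examples * audio_bytes_per_example / 2**40:.1f} TiB")  # ~1.9 TiB
    print(f"text dataset:  {examples * text_bytes_per_example  / 2**30:.1f} GiB")  # ~1.9 GiB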

------
CleanItUpJanny
If Facebook had a community-sourced network for training combat drones, you
guys would trip over each other to volunteer your computing resources.

~~~
drusepth
I probably would, yeah.

