
Ask HN: How to avoid AI Dungeon-style bills? I’d like to share something similar - sillysaurusx
Hiya,

We’ve trained a GPT-2 1.5B model on chess PGN notation. Surprisingly, it’s not bad after only a day of training: https://lichess.org/UMyang4z

(Or rather, it’s not bad up until midgame, at which point it usually blunders. We think that’s because it’s “playing blindfolded”: it’s trained solely on PGN notation, rather than on an encoding of the full board state each move.)

We’d love to release a Colab demo similar to AI Dungeon. But as with AI Dungeon, our model is 5.6GB. Downloading from a GCS bucket would cost $0.056 per click, if I understand the outgoing bandwidth pricing correctly.

Our options seem to be:

1. Download the model via BitTorrent in a Colab notebook

2. Set up a server to power the demo rather than distribute the model to every client

3. Find a host with low bandwidth fees, and write the notebook to download from that

All three have tradeoffs, but #3 seems simplest. Does anyone know of a way to distribute 5.6GB to ~500k people for less than a few hundred dollars? BitTorrent might be fine if it can deliver the entire model within a couple of minutes (otherwise people will get bored and leave).
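As a rough sanity check on why serving straight from a bucket doesn’t scale at that size (assuming the ~$0.01/GB rate implied by the $0.056-per-download figure; actual egress pricing varies by destination):

    # Back-of-envelope cost of serving the checkpoint directly from a bucket.
    # The per-GB rate is inferred from the $0.056-per-download figure above,
    # not taken from an official price sheet.
    model_gb = 5.6
    cost_per_gb = 0.056 / model_gb          # ~$0.01 per GB
    downloads = 500_000

    per_download = model_gb * cost_per_gb   # ~$0.056
    total = per_download * downloads        # ~$28,000
    print(f"${per_download:.3f} per download, ${total:,.0f} total")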
======
nickwalton00
AI Dungeon creator here. Another option: if you can detect which region the
Colab notebook is running in and keep a multi-region bucket for each
international area, you could download from the right region, which may be
quite cheap. Our costs were primarily from US GCS buckets downloading to Colab
servers that were apparently running in Asia and Europe.
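A minimal sketch of what the region check might look like, assuming the Colab VM exposes the standard GCE metadata endpoint (I haven't verified that it does) and hypothetical per-continent bucket names:

    # Ask the GCE metadata server which zone this VM is in, then pick a
    # nearby bucket. Assumes the Colab runtime is a GCE VM with metadata
    # access; the bucket names below are made up for illustration.
    import requests

    ZONE_URL = "http://metadata.google.internal/computeMetadata/v1/instance/zone"

    def guess_region():
        try:
            zone = requests.get(ZONE_URL, headers={"Metadata-Flavor": "Google"},
                                timeout=2).text            # e.g. "projects/123/zones/asia-east1-b"
            return zone.rsplit("/", 1)[-1].rsplit("-", 1)[0]   # -> "asia-east1"
        except requests.RequestException:
            return "us-central1"                           # default if metadata is unreachable

    continent = guess_region().split("-")[0]               # "us", "europe", "asia", ...
    bucket = f"gs://my-model-{continent}"                  # hypothetical per-continent buckets
    print("Downloading from", bucket)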

~~~
sillysaurusx
Thanks! Do you happen to have an example of how to detect the region in a
Colab notebook? Apparently IP geolocation isn't reliable for GCE IPs (they all
show Mountain View, CA).

------
p1esk
Why is the model so large? I mean, how did you go from 1.5B to 5.6B? Have you
looked into compressing it (quantization, pruning, etc)?

How many playing sessions can a server with a single 2080Ti support? Is it
compute-bound or memory-bound? I'd plot num_sessions vs latency (time to
compute a move) and estimate the costs for the target scale/performance.

~~~
sillysaurusx
Sorry for the confusing wording. The model is 1.5B parameters. But every 1.5B
model is 5.6GB of data. (Roughly: 1,558 million params * sizeof(float32) ≈ 5.6GB)
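For the curious, the arithmetic (the naive float32 estimate comes out a bit above the 5.6GB we see on disk):

    # Approximate checkpoint size at different precisions.
    params = 1_558_000_000                     # GPT-2 "1.5B" parameter count (approx.)
    for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
        print(f"{name}: ~{params * bytes_per_param / 2**30:.1f} GiB")
    # float32: ~5.8 GiB, float16: ~2.9 GiB, int8: ~1.5 GiB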

Quantization is a good idea, but I've had some bad experiences with bfloat16.
(We couldn't get loss to decrease beyond a certain point when using bfloat16.)
But that might have been an artifact of training. Still, I'd rather not harm
the model if possible. It's hard to get a sense of perceptual quality, and I'm
not convinced that validation tests are enough to capture the full nuances of
GPT-2.

You're right that a single server would do the trick in this case. It's just
so much easier to write a notebook with the logic in it than to set up a
server, keep track of sessions, implement the read-eval-print logic, add error
handling, etc. I was hoping to throw a couple hundred dollars at the bandwidth
problem rather than spend time building a server.

A server also implies a fixed $300/mo cost, which is pretty expensive. After
the initial demo, we'd probably end up turning the server off. I dislike the
idea of the demo breaking after a month or two.

~~~
p1esk
Re quantization - you should be fine with INT8. So that's 1.5GB. And after
pruning you will probably be able to compress it down to 500MB. To clarify -
we are talking about distributing a trained model, right? You don't need to
_train_ it in reduced precision. Both TF and Pytorch provide tools to quantize
a model after training.
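For reference, a minimal sketch of post-training dynamic quantization in PyTorch. The toy module stands in for the real model; a GPT-2 port that uses custom layer types rather than nn.Linear would need those mapped first:

    # Dynamic quantization: weights of supported layers are stored as int8
    # and dequantized on the fly at inference time. No retraining needed.
    import torch
    import torch.nn as nn

    model = nn.Sequential(                    # stand-in for the trained model
        nn.Linear(1600, 6400),
        nn.GELU(),
        nn.Linear(6400, 1600),
    ).eval()

    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8)

    torch.save(quantized.state_dict(), "model_int8.pt")   # ~4x smaller weights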

~~~
sillysaurusx
Doesn't quantization harm model quality?

It's hard to quantify how much damage quantization does, but natural language
generation is very subtle.

~~~
p1esk
When you go below 16 bits during training, gradient descent can become
unstable and there may not be enough precision for small weight updates. But
for inference, 8 bits should be enough. It mostly depends on the model size
and whether there's excess learning capacity for the dataset. I suspect GPT-2
1.5B is heavily overparametrized, so in the future it will probably be
considered similar to VGG, which is highly compressible without any accuracy
loss. I wouldn't be surprised if you could finetune GPT-2 using only 3-4 bits
for both weights and activations, with no quality loss. By 'finetune' here I
mean finetuning for quantization, not transfer learning.
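A rough sketch of that "finetune for quantization" flow in PyTorch (the built-in tooling targets int8; 3-4 bit weights would need custom fake-quant ops). A toy model and random data stand in for the real network and corpus:

    # Quantization-aware finetuning: fake-quant observers are inserted, the
    # model trains briefly so the weights adapt, then real int8 modules are
    # swapped in for inference.
    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.quantization.QuantStub()
            self.net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
            self.dequant = torch.quantization.DeQuantStub()
        def forward(self, x):
            return self.dequant(self.net(self.quant(x)))

    model = Toy().train()
    model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
    torch.quantization.prepare_qat(model, inplace=True)    # insert fake-quant observers

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(100):                                   # brief finetuning loop
        x = torch.randn(32, 64)
        loss = loss_fn(model(x), x)
        opt.zero_grad(); loss.backward(); opt.step()

    model.eval()
    int8_model = torch.quantization.convert(model)         # real int8 modules for inference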

