
64 bits ought to be enough for anybody - beagle3
https://blog.trailofbits.com/2019/11/27/64-bits-ought-to-be-enough-for-anybody/
======
tyoma
There have been a few OpenCL versus CUDA posts lately and I wanted to add my
take on it after writing the GPU portion of sixtyfour, which was my first
GPGPU application.

OpenCL needs more examples, tutorials, and documentation. I really wanted to
use it because it is cross platform, but quickly gave up. The tutorials and
documentation weren't there, and what I did find was confusing. For CUDA there
is an immense body of examples, support, and documentation that makes creating
your first CUDA program dead simple.

I went with CUDA.
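
To give a taste of what that looks like: the core of a first CUDA brute-force
program is roughly a grid-stride loop like the sketch below (illustrative
only, not the actual sixtyfour kernel; the launch dimensions and target value
are made up).

    #include <cstdint>
    #include <cstdio>

    // Each thread strides across the candidate range and compares every
    // candidate it owns against the target.
    __global__ void search(uint64_t start, uint64_t count, uint64_t target,
                           unsigned long long *found)
    {
        uint64_t stride = (uint64_t)gridDim.x * blockDim.x;
        for (uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
             i < count; i += stride) {
            if (start + i == target)    // one comparison per candidate
                *found = start + i;     // benign race: every writer agrees
        }
    }

    int main()
    {
        unsigned long long *found;
        cudaMallocManaged(&found, sizeof *found);
        *found = 0;
        search<<<4096, 256>>>(0, 1ULL << 32, 0xdeadbeefULL, found);
        cudaDeviceSynchronize();
        printf("found: %llu\n", *found);
        cudaFree(found);
        return 0;
    }

The grid-stride loop is the standard CUDA idiom for decoupling the launch size
from the size of the range being scanned.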

~~~
cesarb
> For CUDA there is an immense body of examples, support, and documentation
> that makes creating your first CUDA program dead simple.

Isn't it dead simple only if you have compatible hardware? OpenCL works
everywhere, even in the worst case on the CPU through pocl.
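
And checking that is trivial: a stock OpenCL host program enumerates whatever
platforms the ICD loader finds, and with pocl installed the CPU shows up as an
ordinary device. A minimal sketch (assumes the OpenCL headers, an ICD loader,
and pocl are present):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id plats[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, plats, &nplat);
        for (cl_uint p = 0; p < nplat; p++) {
            char name[256];
            clGetPlatformInfo(plats[p], CL_PLATFORM_NAME,
                              sizeof name, name, NULL);
            printf("platform %u: %s\n", p, name);

            cl_device_id devs[8];
            cl_uint ndev = 0;
            clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 8, devs, &ndev);
            for (cl_uint d = 0; d < ndev; d++) {
                clGetDeviceInfo(devs[d], CL_DEVICE_NAME,
                                sizeof name, name, NULL);
                printf("  device %u: %s\n", d, name);
            }
        }
        return 0;
    }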

~~~
retrac
I'm surprised this isn't more of an advantage. Sometimes you just need to test
whether your code _runs_, and in my experience the CPU on my netbook is fine
for that purpose. With CUDA I have to upload the code to a VM and test it
there, which is both expensive and a bit awkward.

~~~
tyoma
I originally wanted to do just that, but my laptop is a MacBook and OpenCL is
deprecated there. With CUDA I could just ssh to an old box under my desk.

~~~
darkwater
I guess you could just ssh with OpenCL too, no?

~~~
rat9988
Yes he can, but why would he? It just doesn't make sense to do so.

------
Const-me
The source code is very hard to port to Windows: pthreads instead of OpenMP,
Linux syscalls instead of std::chrono, proprietary compiler extensions for CPU
SIMD for no reason, etc.
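
The timing part, at least, is an easy fix: a minimal sketch of the std::chrono
equivalent of the Linux clock_gettime() pattern:

    #include <chrono>
    #include <cstdio>

    int main()
    {
        auto t0 = std::chrono::steady_clock::now();
        // ... the workload being timed ...
        auto t1 = std::chrono::steady_clock::now();
        printf("elapsed: %.3f s\n",
               std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }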

That's unfortunate, because it would be interesting to compare a virtualized
V100 GPU with a real-hardware 1080Ti. On paper, the two are pretty close. If
that's true in practice, you can use [https://vast.ai](https://vast.ai) to
reduce the cost by a factor of ~5, i.e. it would only cost about $350.

~~~
kick
That website is kind of sketchy. The blog on it doesn't link anywhere. Is it
yours?

~~~
Const-me
Not mine; I found it via DDG with a query like "rent 1080Ti GPU", and there
are more like it.

I don’t need their services, but I think the prices are reasonable.

Nvidia is way too greedy: they charged $5,000 for the server equivalent of a
$700 consumer GPU. The major difference between the two GPUs is legal, not
technical.

Cloud providers like Google or Amazon have little choice but to pass the cost
on to users.

~~~
rrss
The server class gpus are generally different silicon than the consumer gpus.
That's a technical difference.

~~~
Const-me
> generally different silicon

1080Ti / P100:

Chip: GP102 / GP100

Cores: 3584 / 3584

Base clock: 1480 / 1328

Single precision TFlops: 10.6 / 10.6

TDP, Watts: 250 / 300

The only major difference between the chips is double-precision performance,
which is probably irrelevant for the OP's task. And I'm not convinced the
difference is in the silicon as opposed to firmware or drivers: games don't
need doubles, so Nvidia could cripple the consumer model without consequences,
just to differentiate the two products.

~~~
rrss
GP100 and GP102 are different sizes. They are certainly different silicon -
the die area can't be changed in firmware or drivers.

From techpowerup:

GP100: 610 mm^2

GP102: 471 mm^2

~~~
Const-me
Good point. You're right, they are different chips. But still, the only
difference is double-precision performance (that's probably what the extra
transistors do) and support for HBM2 memory as opposed to GDDR5X. Pretty sure
Amazon/Google/MS would be happy to offer affordable cloud-based GP102s, if
they could.

------
chaosfox
I don't think I understand the problem... to do this search you need to have
the target number, and if you already have the target number no search is
required... what am I missing?

~~~
saagarjha
It’s a thought experiment to figure out how long it takes to brute force a
64-bit space. In reality you won’t have the number you’re searching for.

~~~
bonoboTP
But in reality you'd need to do some complicated operation on each 64-bit
item, right? E.g. when cracking a password.

I also don't understand what type of real-world application is being simulated
here. I do understand that it is a simplified example, but of what?

The takeaway is that it is feasible to perform a single comparison operation
for each 64-bit bitstring. But we don't just do a single operation per item in
the real world.

~~~
empath75
The complicated operation will generally be a constant factor, though. Say it
takes 100 times longer per guess; now it'll cost you $100,000 instead of
$1,000.
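
Concretely, the loop shape doesn't change, only the per-candidate work does.
E.g. swapping the plain comparison for a toy 64-bit mixer (a stand-in for a
real hash; a sketch only, not anyone's actual cracking code):

    #include <cstdint>

    // Toy stand-in for a "complicated operation" (a 64-bit finalizer-style
    // mix, not a real password hash).
    __device__ uint64_t mix(uint64_t x)
    {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
    }

    // Same grid-stride scan as a plain search; only the per-candidate cost
    // grows, so time and dollars scale by that constant factor.
    __global__ void search_hashed(uint64_t start, uint64_t count,
                                  uint64_t target_digest,
                                  unsigned long long *found)
    {
        uint64_t stride = (uint64_t)gridDim.x * blockDim.x;
        for (uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
             i < count; i += stride) {
            if (mix(start + i) == target_digest)
                *found = start + i;
        }
    }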

------
tyoma
There is an associated GitHub repo with code for those who want to reproduce
the results
([https://github.com/trailofbits/sixtyfour](https://github.com/trailofbits/sixtyfour)).
Or if you'd like to contribute an ARM version :).

~~~
saagarjha
Don’t know enough about NEON to help, but do you even have hardware to run it
on if someone did contribute an ARM version?

~~~
tyoma
Yes. AWS has support for ARM instances, and I have an RPI3 :).

------
saagarjha
Up until recently I had considered even a 32-bit brute force to be infeasible
(fun story: I learned this during a CTF, when I complained to a team member
that I was stuck trying to find a way to not have to do that and they just
came back to me with the brute-forced answer about an hour later…). This’ll
teach me to not underestimate computers in the future :)

~~~
clarry
Advent of Code[1] often has problems that might seem just a bit too much to
bruteforce, or not, depending on your background and the language at hand.
Many of them are bruteforceable enough that I ended up setting up races: run a
bruteforce in the background while I try to write a less naive solution. Kinda
fun, but I also found out that many of the problems are a bit _too easy_ to
bruteforce (with C on a modern CPU), taking something like 10-15 minutes.

[1] [https://adventofcode.com/](https://adventofcode.com/)

------
josteink
In some CPU discussions I keep seeing people insist that Intel beats AMD...
because of AVX512.

But these tests show minimal difference between AVX2 and AVX512. So what's the
big deal?

~~~
jiggawatts
In current-generation Intel CPUs the use of AVX512 forces a reduction in clock
speed, negating much of the theoretical benefit.

The clock speed reduction also affects other cores, which may not be executing
AVX512 instructions at that time. Code that is nearly 100% AVX512 will get an
overall speed boost, but mixed multi-threaded workloads can actually regress
in performance. On virtualised or multi-user systems such as Citrix or some
database engines this is a serious issue at the moment.

This will improve or go away entirely once they're on 10nm or some smaller
process.

Secondly, the rarity of AVX512 means that few applications take advantage of
the instruction set, and those that do haven't had anywhere near the same
level of fine-tuning as the more common AVX2 code.

For example, SQL Server recently got vector instruction set support, called
"Batch Mode Processing", but I can't find any references to indicate that it
uses AVX512, and it probably doesn't.

Once AMD supports AVX512, it trickles down to mainstream CPUs, and the clock
speeds are maintained, it'll have a significant advantage over AVX2.
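
For a sense of what the extra width buys: the hot comparison in a search like
the article's looks roughly like this with AVX2 intrinsics (a sketch, not the
article's code; build with -mavx2). The AVX512 version is the same idea with
eight 64-bit lanes per compare (_mm512_cmpeq_epi64_mask), so at equal clocks
it's twice as wide; the downclocking described above is what eats into that
margin.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t target = 0xdeadbeefULL;
        __m256i vtarget = _mm256_set1_epi64x((long long)target);
        __m256i step    = _mm256_set1_epi64x(4);
        __m256i cand    = _mm256_set_epi64x(3, 2, 1, 0); // candidates i..i+3

        // Scan four 64-bit candidates per iteration.
        for (uint64_t i = 0; i < (1ULL << 32); i += 4) {
            __m256i eq = _mm256_cmpeq_epi64(cand, vtarget);
            int mask = _mm256_movemask_pd(_mm256_castsi256_pd(eq));
            if (mask) {
                printf("hit near %llu, lane mask %x\n",
                       (unsigned long long)i, mask);
                break;
            }
            cand = _mm256_add_epi64(cand, step);
        }
        return 0;
    }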

------
aianus
Does anyone know why the big cloud vendors don’t seem to have any offerings
that can beat a regular consumer machine in single thread performance? Surely
there must be a market for it?

~~~
FractalParadigm
AFAIK server CPUs aren't much different from consumer offerings, and generally
more cores means lower clock speeds, which (generally) leads to lower
single-threaded performance.

It's highly likely there's a market for it; however, I'm doubtful they'd be
willing to spec lower-density, higher-performance machines for the few people
who would actually take advantage of them. Considering high-core-count CPUs
are becoming the new norm, rewriting or finding software that can take
advantage of multiple cores/threads is certainly the way to go for maximum
performance (where possible, of course).

------
aewens
While using GPUs is one way to increase performance, with all of those
machines at your disposal you can also use a simple MPI script to run the job
in parallel across all of them and drastically bring down the total runtime.
This is actually how much of the industry handles parallel processing at
scale, since running on a single machine can only get you so far.
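
A minimal sketch of that pattern: static range-splitting across MPI ranks,
with a hypothetical scan_range() standing in for whatever each machine runs
locally (the GPU kernel, the SIMD loop, etc.):

    #include <mpi.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Split the space evenly; the last rank takes the remainder.
        uint64_t total = 1ULL << 40;   // demo-sized space, not 2^64
        uint64_t chunk = total / (uint64_t)size;
        uint64_t begin = (uint64_t)rank * chunk;
        uint64_t end   = (rank == size - 1) ? total : begin + chunk;

        unsigned long long local_hit = 0, global_hit = 0;
        // local_hit = scan_range(begin, end);  // hypothetical local search
        (void)begin; (void)end;

        // Combine results: any nonzero hit wins.
        MPI_Reduce(&local_hit, &global_hit, 1, MPI_UNSIGNED_LONG_LONG,
                   MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("hit: %llu\n", global_hit);
        MPI_Finalize();
        return 0;
    }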

~~~
iscoelho
If you run in parallel across many machines you will reduce the time, but the
cost stays the same (roughly, anyway).

The initial cost is what makes this insane, not the time, and the end result
(GPUs) is indisputably the most cost-efficient method (in this case very
likely by a factor of 50-100x).

A situation like this is not premature optimization. Developers sometimes need
to understand that in the scheme of things, their time is not worth very much
compared to the cost of the infrastructure required to run their code.
Throwing the corporate credit card at optimization issues is far too common
today in tech.

~~~
hinkley
On the flip side, we spent almost $100,000 in developer time one summer in
order to keep a handful of customers from having to upgrade hardware.

We could have gifted them all $5,000 machines and spent less. Plus, when you
remember that the point of spending on developers is to earn back many times
that cost in sales, the opportunity cost of those three months of very senior
development effort was massive.

------
BlueTemplar
No quantum supremacy comparisons?

------
rolltiide
The joke this time is that we switch to non-binary logic gates with qubits.

So we'll still laugh at this comment even though the addressable space turned
out to be "good enough"; it was just an antiquated framework to begin with.

------
why-oh-why
If my calculations are correct, that GPU is making 1,000,000,000,000 guesses
per second.

The takeaway here is that cracking a 64-bit password only costs $1,700 and is
totally doable in 3 weeks.

If you need things to be really safe I guess you need to start lengthening
those passwords.

~~~
oxfordmale
If you need things to be really safe, you should enable 2FA.

That $1,700 of computation is likely to be run on hacked infrastructure
anyway, so it's definitely doable.

~~~
michaelmrose
The $1,700 estimate is probably off by a factor of a million: $1.7 billion in
computation.

------
rstuart4133
The amount of yak shaving needed to get a one-line loop running on CUDA is
truly impressive. (_shudder_) But I bet an FPGA implementation would beat it,
in every way.

