Hacker News | vardump's comments

Opera Mobile works, but sadly it's only available for Android. It's the only mobile browser that can handle full desktop sites without excessive horizontal scrolling; most of the time it does this well enough that desktop versions become preferable to "mobile" versions. Not to be confused with the rather useless Opera Mini.

When I was last on iOS, I actually preferred Opera Mini. This was before Chrome, etc. were options and I have barely used iOS since then, but it was much faster, and had saner zooming.

Intel CPUs resemble GPUs more and more over time. I think the only things missing in Skylake (Xeon) are scatter, a GPU-style ultra-slow (high-latency) but wide memory interface, and texture lookup.

Gather was already added in Haswell, although it performs badly so far.

Skylake (Xeon AVX-512) handles 16 float wide vectors (512 bits) and can dual issue per clock, bringing effective rate to 32. That's definitely comparable to modern GPUs.

Wasn't an Nvidia warp just 16 floats wide per clock cycle? Or 32? For comparison, the high-end Nvidia GTX 980 GPU has only 16 such SIMD execution cores. However, they count those 16 cores as 2048 in their marketing literature.
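The lane arithmetic here is easy to check. This is just a back-of-the-envelope sketch in Python: the figures are the ones quoted in this comment (512-bit vectors, dual issue, 16 SIMD cores marketed as 2048), and the variable names are made up for illustration.

```python
# Figures as quoted in the comment; names are illustrative only.
VECTOR_BITS = 512        # AVX-512 register width
FLOAT_BITS = 32          # single-precision float
ISSUES_PER_CLOCK = 2     # "can dual issue per clock"

lanes_per_vector = VECTOR_BITS // FLOAT_BITS             # 16
floats_per_clock = lanes_per_vector * ISSUES_PER_CLOCK   # 32

MARKETED_CORES = 2048    # marketing "core" count for the GTX 980
SIMD_CORES = 16          # SIMD execution cores as counted in the comment

lanes_per_simd_core = MARKETED_CORES // SIMD_CORES       # 128

print(lanes_per_vector, floats_per_clock, lanes_per_simd_core)  # 16 32 128
```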

I do wonder if Intel is planning to unify CPU and GPU in 10 years or less. Things sure seem to be moving that way.

If Intel can add significant amounts of eDRAM in package, x86 CPUs aren't that far from being capable of handling GPU duties as well.

-----


Vector Instructions != Scalar Instructions

"WARP Scheduler" gives you a hint.

Okay, so how this works is you have a processor that is 16 scalar cores wide. Each scalar core is really just an out of order scheduler, for 32 in-order pipelined, boring, ALU's. These ALU's can each execute the same instruction, together, giving you the illusion that the scalar core is doing vector processing.

The reality is far weirder. I.E.: If you encounter a branch, the scalar processor can, and will execute both branches on different ALU's, and execute the branch statement on another, allowing for a 10 instruction section of code to run in ~3 instructions time. Try doing that with a vector processor.

Technically in CUDA you can schedule each ALU itself, thus marketing stuff.

Would you like to know more? http://haifux.org/lectures/267/Introduction-to-GPUs.pdf

-----


> The reality is far weirder. I.E.: If you encounter a branch, the scalar processor can, and will execute both branches on different ALU's

That's not so different than on x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask the results away you don't want. This is typically much faster than having a data dependent, unpredictable branch.
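A minimal sketch of the "compute both sides, then blend" idea, written as a per-lane model in plain Python rather than real SSE/AVX intrinsics. The function names and the example condition (`x < 0`) are hypothetical; on real hardware each list comprehension would correspond to one vector instruction over all lanes.

```python
def blend(mask, if_true, if_false):
    """Per-lane select: a scalar model of a SIMD blend."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

def negate_or_double(xs):
    # Scalar version would be: y = -x if x < 0 else 2 * x
    mask = [x < 0 for x in xs]         # compare: one flag per lane
    then_side = [-x for x in xs]       # "if" side, computed for ALL lanes
    else_side = [2 * x for x in xs]    # "else" side, computed for ALL lanes
    return blend(mask, then_side, else_side)  # discard the unwanted results

result = negate_or_double([-3, 1, -2, 4])
print(result)  # [3, 2, 2, 8]
```

Both sides do work on every lane, so this only pays off when the two sides are cheap compared to a mispredicted branch.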

Another way is to SIMD sort data according to criteria to different registers and process them separately. This completely sidesteps having to execute both sides of the branch, although some computational resources are still wasted.
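The partitioning approach can be sketched the same way. This is a plain Python model, not real intrinsics: `compress` is a hypothetical helper standing in for a mask-driven compress instruction, and the condition and per-side operations are made up for illustration.

```python
def compress(values, mask):
    """Keep the lanes whose mask flag is set, packed together in order."""
    return [v for v, m in zip(values, mask) if m]

def process_partitioned(xs):
    mask = [x < 0 for x in xs]
    negatives = compress(xs, mask)                        # "if" inputs
    non_negatives = compress(xs, [not m for m in mask])   # "else" inputs
    # Each side now runs only on the elements that actually need it.
    return [-x for x in negatives], [2 * x for x in non_negatives]

neg_out, pos_out = process_partitioned([-3, 1, -2, 4])
print(neg_out, pos_out)  # [3, 2] [2, 8]
```

Note that the results are no longer in the original lane order, and partially filled vectors at the end of each partition are where the remaining waste comes from.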

-----


>That's not so different than on x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask the results away you don't want.

What you're talking about is how x86_64 processors can optimize away some branches. They do this with the cmov instruction. This has nothing to do with SSE/AVX. It's common to confuse this because Intel says the branches are executed in parallel (and they often are), just as much in parallel as the OoO pipeline allows, which is actually quite a lot.

Both sides of the branch are pre-computed, then the branch condition is computed. But its output is sent to a cmov, which just re-assigns a register, instead of a jmp into a branch. This avoids pipeline flushes. cmov isn't perfect, still costing ~10 cycles, but compared to the ~100 of a pipeline flush it's still cheaper.

Provided the same operations are being done on both branches, SSE/AVX can be used, as both branches are just values, and that is literally what vector processors are good at. The chain will end with a cmov.
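The "precompute both values, then pick one without jumping" pattern can be modelled with a well-known bit-mask select idiom. This is only a sketch of the logic in Python, not of cmov itself or of its performance; `select` and `clamp_to_zero` are hypothetical names.

```python
def select(cond, a, b):
    """Return a if cond else b, using a mask instead of a branch on cond."""
    mask = -int(bool(cond))        # -1 (all bits set) when true, 0 when false
    return b ^ ((a ^ b) & mask)    # mask == -1 -> a, mask == 0 -> b

def clamp_to_zero(x):
    # Both candidate values (0 and x) exist before the choice is made.
    return select(x < 0, 0, x)

print(select(True, 7, 9), select(False, 7, 9))   # 7 9
print(clamp_to_zero(-5), clamp_to_zero(5))       # 0 5
```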


It has absolutely nothing to do with CMOV. I'm talking about computing, say, 16 results in parallel in a SIMD register, for both sides of "if"-statement. Then masking unwanted results out. SSE/AVX can simulate "CMOV", but for 128/256/512 bit wide vectors.

To make it even more clear, there's not a single CMOV in my code, anywhere. The data doesn't usually even touch general purpose (scalar) registers, because that'd totally destroy the performance.

What you are talking about is how things were done until 1997-1999 or so. SSE in 1999 and especially SSE2 in 2001 radically changed the way you compute with x86 CPUs.

I'm talking about things like vpcmpw [1] (compare 8/16/32 16-bit integers and store a mask), vpcompressd (compress 32-bit elements according to a mask, for example for SIMD "sorting" of if and else inputs separately), vpblendmd (blend-type combining, this example is for int32), and vmovdqu16 (for selectively moving according to a mask).

You can do most operations on 8, 16, 32, 64 unsigned and signed, and of course 32-bit and 64-bit floats. Some restrictions apply especially to 8 and 16 bit operands. When appropriate, it's kind of cool to process 64 bytes in one instruction. :)

[1]: https://software.intel.com/sites/landingpage/IntrinsicsGuide... SSE/AVX instruction and intrinsics guide.


GPUs have evolved with about the same pacing. Nvidia's Kepler architecture has a vector length of 192 (single prec.) per core and up to 15 of these cores on one chip.

The question really is, do you optimize the chip for heavily data parallel problems, saving overhead on schedulers and having a very wide memory bus, or do you optimize for single threaded performance of independent threads and give it some data parallelism (Xeon). As a programmer, when you're actually dealing with data parallel programs, doing so efficiently on a GPU is actually quite a bit easier since you have one less level of parallelism to deal with.

-----


Um, no. 192 = 6 * 32: each streaming multiprocessor operates on warps of size 32, and the 6 is the number of different functional units.
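Spelled out as arithmetic, using only numbers from this comment and its parent (the variable names are illustrative):

```python
WARP_SIZE = 32       # threads per warp
UNITS_PER_SMX = 6    # the "6" in 192 = 6 * 32
SMX_PER_CHIP = 15    # "up to 15 of these cores on one chip" (parent comment)

values_per_smx = UNITS_PER_SMX * WARP_SIZE        # 192
values_per_chip = values_per_smx * SMX_PER_CHIP   # 2880

print(values_per_smx, values_per_chip)  # 192 2880
```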

-----


I think we're mixing up terminologies here. One SMX operates on up to 192 values in parallel (Nvidia calls this 192 "threads" per SMX). "Functional units", AFAIK, is only used in the sense of "special functional units", which isn't relevant for this discussion. One SMX has 6 warp schedulers, but I'm not sure how independently these can operate. My guess is that branch divergence will only NOP out one whole warp, but I'm not sure whether the warps can enter different routines or even kernels (my guess is yes for routines/no for kernels).

-----


So the different functional units (this has a specific meaning in hardware design) are 32 wide, and indeed, if the instructions to be executed can utilize all 6 of them at the same time, the SMX will operate on 192 values. But that won't be the case if you only need to execute a large number of double-precision floating-point operations.

ARC (adaptive replacement cache) is the best generic caching strategy I know of, but the patent status is problematic.

https://en.wikipedia.org/wiki/Adaptive_replacement_cache

  In 2006, IBM was granted a patent for the adaptive replacement cache policy.
http://patft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=...

IBM's patent will take quite a while to expire. It's truly a shame this can't be used by operating system page cache policy because of patents.
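For readers who haven't seen how ARC works, here is a simplified sketch of the policy as described in the published algorithm: two resident lists (recently seen once, seen repeatedly), two "ghost" lists of recently evicted keys, and a target size `p` that adapts on ghost hits. The class and method names are hypothetical, and the adaptation step uses integer division, which is a simplification. Treat it as an illustration, not a reference implementation.

```python
from collections import OrderedDict


class ARCCache:
    """Simplified adaptive replacement cache for up to `capacity` entries."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0               # adaptive target size of t1
        self.t1 = OrderedDict()  # resident, seen once recently
        self.t2 = OrderedDict()  # resident, seen at least twice
        self.b1 = OrderedDict()  # ghost keys evicted from t1 (no values)
        self.b2 = OrderedDict()  # ghost keys evicted from t2 (no values)

    def _replace(self, key):
        # Evict the LRU entry of t1 if t1 is over its target, else of t2.
        over_target = len(self.t1) > self.p or (
            key in self.b2 and len(self.t1) == self.p)
        if self.t1 and over_target:
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def get(self, key):
        if key in self.t1:           # first re-use: promote to t2
            self.t2[key] = self.t1.pop(key)
            return self.t2[key]
        if key in self.t2:           # repeated use: refresh recency
            self.t2.move_to_end(key)
            return self.t2[key]
        return None                  # miss

    def put(self, key, value):
        if key in self.t1 or key in self.t2:
            self.get(key)            # treat an update as a hit
            self.t2[key] = value
        elif key in self.b1:         # ghost hit: grow the recency side
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = value
        elif key in self.b2:         # ghost hit: grow the frequency side
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = value
        else:                        # key not tracked in any list
            total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
            if len(self.t1) + len(self.b1) == self.c:
                if len(self.t1) < self.c:
                    self.b1.popitem(last=False)
                    self._replace(key)
                else:
                    self.t1.popitem(last=False)
            elif total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(key)
            self.t1[key] = value


cache = ARCCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" moves to the frequency list
cache.put("c", 3)    # evicts "b" from the recency list, keeps it as a ghost
cache.put("b", 2)    # ghost hit on "b": the target p adapts from 0 to 1
```

The ghost lists are what make the policy adaptive: a miss on a recently evicted key tells the cache which of the two resident lists it has been shrinking too aggressively.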

-----


Holy crap, that's the reason this algorithm isn't all over the place? That seriously irks me, and this one seems like a great patent to target for either a Tesla-esque giveaway or just invalidation (EFF, anybody?).

Smart generic caching could do so much for the global infrastructure with all the data we move around these days.

-----


While the patent may be annoying, in practice there are a number of other cache replacement algorithms and mechanisms that are just as effective as ARC. The tradeoffs inherent in cache replacement algorithms means there are numerous ways of achieving approximately equivalent results. ARC is not a globally optimal design or anything like that.

In that sense, the patent is kind of worthless. Anyone that knows what they are doing can design around it with no loss of functionality or efficiency. A lot of sophisticated systems end up designing custom algorithms tuned for the use case anyway.

-----


The link to http://www.varlena.com/GeneralBits/96.php from that Wikipedia page was an interesting read.

-----


> although 3 sticks doesn't make any sense

Yeah, indeed it doesn't make much sense when you have 4 memory channels. Memory performance is going to be very bad. Four sticks would be just fine.

> Or 6 sticks

Neither does 6. Go for 4 or 8 sticks. Not sure if it applies here, but on some motherboards having 2x more than <number of memory channels> sticks means there's a bit more latency.
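A quick way to see why 3 or 6 sticks are awkward with 4 channels. This sketch assumes sticks are spread round-robin across channels, which is an assumption for illustration; actual slot population order depends on the motherboard.

```python
def dimm_layout(sticks, channels=4):
    """Sticks per channel when filled round-robin (assumed fill order)."""
    per_channel = [sticks // channels] * channels
    for i in range(sticks % channels):
        per_channel[i] += 1
    return per_channel

def is_balanced(sticks, channels=4):
    """True when every channel carries the same number of sticks."""
    return sticks % channels == 0

for n in (3, 4, 6, 8):
    print(n, dimm_layout(n), is_balanced(n))
# 3 [1, 1, 1, 0] False
# 4 [1, 1, 1, 1] True
# 6 [2, 2, 1, 1] False
# 8 [2, 2, 2, 2] True
```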

-----


> with Haswells mediocre improvements Intel has really not come up with anything substantial in years

Unlocking Haswell's performance (up to 2x integer/FPU) requires using AVX2 instructions. That means at least recompiling and, to truly extract the performance, optimizing with AVX2 intrinsics or assembler.

-----


True classic. The other tunes in Wizball are great too.

Here's a great tune from another game. Rob Hubbard - Nemesis the Warlock.

https://www.youtube.com/watch?v=kdzfOXkZrY0

-----


Well, this seems fitting here: http://www.pagetable.com/?p=53

"The Ultimate Commodore 64 Talk @25C3"

-----


I still remember my reaction to Windows 95.

"So finally Amiga-like long file names, and plug and play like with the Amiga Zorro bus. But why does it still not let me rename volumes to something other than single drive letters C, D, E... etc.?"

My 1995 self would have been so disappointed to hear Windows is still stuck with drive letters in 2015.

-----


I've been using Linux as my primary desktop nearly continuously since 1999, and I'm typing this in Linux right now, but for what it's worth, you can now mount volumes in Windows as a path on another volume.

-----


The difference might seem subtle, but on Amiga you could refer to volumes as "Volume name": like WorkDisk:. Then whenever anything referred to say WorkDisk:dir/file, the system would request you to provide this volume, "WorkDisk".

So mounting a filesystem to a directory is not the same. When directories are mounted, you need to provide the volume first, and only then can you refer to it. On Amiga, you could name some USB stick "Important Files": and whenever you copy or save anything to drive "Important Files": the system would ask you to provide this named volume.
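A toy model of that lookup-by-volume-name behaviour, in Python. Everything here is hypothetical (the `mounted` table, `open_path`, the exception standing in for the system requester); it only illustrates that the name is resolved at access time, so a reference to a missing volume turns into a request for it rather than a failure at mount time.

```python
class VolumeNotPresent(Exception):
    """Stands in for the system's 'please insert volume' requester."""

# Hypothetical table of currently available volumes and their files.
mounted = {"WorkDisk": {"dir/file": b"hello"}}

def open_path(path):
    # Split "VolumeName:rest/of/path" at the first colon.
    volume, _, rest = path.partition(":")
    if volume not in mounted:
        raise VolumeNotPresent(f"Please insert volume {volume}")
    return mounted[volume][rest]

print(open_path("WorkDisk:dir/file"))  # b'hello'

try:
    open_path("Important Files:notes.txt")
except VolumeNotPresent as request:
    print(request)  # Please insert volume Important Files
```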

-----


Nice. That's way better than DOS's old "Insert disk for drive B:" message or whatever it was to fake having two floppy drives.

-----


Never too late to switch to Linux. As an ex Amiga fan Linux is the closest I can get to what the Amiga could do.

-----


Pretty much OS agnostic these days. Nowadays it's OS X or Windows + Linux VMs and remote servers.

-----


You can't customize much in OS X and Windows, though. The tweakability just isn't there like it was in AmigaOS.

-----


I know. When I care, I use Linux.

I do remember AmigaOS tweakability, it's still unmatched by anything modern.

-----


Double refresh rate first, this would help a tiny amount with input lag as well.
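To put a rough number on it: the time between refreshes halves when the rate doubles. The 60 Hz and 120 Hz values below are assumed for illustration, since the comment doesn't name specific rates.

```python
def frame_time_ms(refresh_hz):
    """Time between display refreshes, in milliseconds."""
    return 1000.0 / refresh_hz

base = frame_time_ms(60)       # about 16.7 ms (assumed starting rate)
doubled = frame_time_ms(120)   # about 8.3 ms
print(round(base - doubled, 1))  # 8.3
```

That per-frame saving is only one part of total input lag, which matches the "tiny amount" wording.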

-----


I have a PC from 2007 that is not only functional, but reasonable even now. 2.4 GHz Core2Quad Q6600, 16 GB RAM, Nvidia 8800 GTS. 120 GB SSD replaced original hard disk.

Still snappy to use. It's disappointing how little things have improved in 8 years.

-----


Same here: got a beefy one in 2008 with the Core 2 Quad Q9400, and only upgraded the video card midlife as the 8800 GTS overheated and died.

I am about to replace it, but it has been incredible value for money. While waiting for the next Intel CPU iteration I got a cheap Samsung 850 and it feels like new!

-----
