I think that figure of course refers to the peak bandwidth in burst mode, where you access data sequentially (tacc = 25 ns), in the ideal case. In that sense, the figures given for the Cortex M33 are to be taken as absolute maximums as well. 50-70 MB/s sounds representative of random reads of 8-byte blocks, where you pay the full access time. This will go even lower if you only make byte accesses.
Quake was indeed optimized to work on such 1996 Pentium PCs. Look for instance at how the edge/surface/span arrays are allocated on the stack: they allocate extra space, to be sure that the data will be aligned to the cache line size.
50-70 MB/s is the best-case burst scenario, a linear read into nothing, on contemporary 1996 chipsets/CPUs. Moving data more or less halves that; writing is slightly faster than reading because reads incur cache lookups.
320MB/s is even faster than the theoretical maximum of EDO on the Pentium platform. 8 bytes x 66MHz / 5-2-2-2 timings = <260MB/s burst.
Quake was optimized for prefilling caches, but not for contemporary cache sizes. https://dependency-injection.com/2mb-cache-benchmarks/ Doom gains a tiny amount when going from 256KB to 512KB, while Quake gains linearly all the way to a mindbogglingly absurd 2MB of L2. It could really have benefited from data-oriented design, but there was no tooling for that at the time, not to mention the time crunch. Abrash did all he could under the circumstances.
I don't think the initial access time (latency) should be included when speaking about peak bandwidth, otherwise it is not the peak bandwidth.
Peak bandwidth should be considered with an ideally infinite (large enough) payload, so that latency becomes negligible. When you have these two values, latency and peak bandwidth, you can estimate your (still theoretical, of course) performance for a given transfer size.
The article uses the 240-320 MB/s peak bandwidth and 110-130 ns latency for a comparison with the external flash used, which has latency in the us range and a peak bandwidth of 17 MB/s (arguably also assuming an infinite payload, as 136.5/8 is about 17, i.e. without counting the initial setup time).
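As a rough illustration of the "latency plus peak bandwidth" estimate described above, here is a minimal sketch. The function name and the simple model (one fixed latency per transfer, then streaming at peak rate) are my own assumptions, and the flash latency of 1.5 us is just a point picked inside the "us range", not a measured value. I am also assuming 136.5 is in Mbit/s.

```python
def effective_bandwidth(payload_bytes, latency_s, peak_bytes_per_s):
    """Average throughput of one transfer: payload / (latency + payload / peak)."""
    transfer_time = latency_s + payload_bytes / peak_bytes_per_s
    return payload_bytes / transfer_time


# Figures quoted in the thread. The 120 ns value is the middle of 110-130 ns;
# the 1.5 us flash latency is an assumed placeholder, not a datasheet value.
PENTIUM_EDO = {"latency_s": 120e-9, "peak_bytes_per_s": 240e6}
SPI_FLASH = {"latency_s": 1.5e-6, "peak_bytes_per_s": 136.5e6 / 8}

for size in (8, 32, 256, 4096):
    dram = effective_bandwidth(size, **PENTIUM_EDO)
    flash = effective_bandwidth(size, **SPI_FLASH)
    print(f"{size:5d} B  DRAM {dram / 1e6:7.1f} MB/s  flash {flash / 1e6:6.2f} MB/s")
```

With these inputs, an 8-byte random read from DRAM comes out at roughly 52 MB/s, and the figure only approaches the peak value as the payload grows.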
Still, even if you compare the actual speeds of a 1996 Pentium with the theoretical external flash speed values cited in the article, the consideration does not change: the external flash is much slower than what you could get even in 1996.
>I don't think the initial access time (latency) shall be included
It's not about the RAS. Bandwidth is bandwidth. When someone says
> In fact, the bandwidth for sequential reads varied a lot but with a 40 MHz EDO 64-bit DRAM (already available on 1996) one could get a maximum throughput of 320 MB/s
it tells me they multiplied 40MHz by 8 bytes and called it good. That's not how EDO works. EDO still needs a CAS cycle for every new access, even a linear one. It's BEDO (Burst EDO) that has a 5-1-1-1 pattern.
BEDO DRAM Read Timings (66MHz) 5-1-1-1
EDO DRAM Read Timings (66MHz) 5-2-2-2
FPM DRAM Read Timings (66MHz) 5-3-3-3
SDRAM Read Timings (66MHz) 5-1-1-1
The absolute maximum _purely theoretical_ EDO burst bandwidth at 66MHz is <260MB/s. That doesn't take into account the reality of 1996 hardware: processors (Intel still hasn't acknowledged that 'rep movsb' should be optimized), chipsets and their cache subsystems (cache on the same bus as RAM, so no parallel accesses, and the lookup slows down reads). On real hardware 50-70 MB/s is all you get.
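To make the arithmetic behind those timing patterns explicit, here is a small sketch. It assumes a 64-bit (8-byte) bus at a nominal 66 MHz and 4-transfer bursts, as in the table above; the function names are mine. It computes two theoretical numbers: back-to-back bursts with the lead-off cycles included, and an upper bound that ignores the lead-off entirely.

```python
BUS_BYTES = 8     # 64-bit Pentium data bus
BUS_HZ = 66e6     # nominal 66 MHz bus clock

# x-y-y-y read timings at 66 MHz, as listed in the thread
TIMINGS = {
    "FPM": (5, 3, 3, 3),
    "EDO": (5, 2, 2, 2),
    "BEDO": (5, 1, 1, 1),
    "SDRAM": (5, 1, 1, 1),
}


def burst_bandwidth(timing, bus_bytes=BUS_BYTES, bus_hz=BUS_HZ):
    """Bytes/s for back-to-back bursts, lead-off cycles included."""
    return len(timing) * bus_bytes * bus_hz / sum(timing)


def steady_state_bandwidth(timing, bus_bytes=BUS_BYTES, bus_hz=BUS_HZ):
    """Upper bound ignoring the lead-off: one transfer every timing[-1] cycles."""
    return bus_bytes * bus_hz / timing[-1]


for name, timing in TIMINGS.items():
    print(f"{name:6s} burst {burst_bandwidth(timing) / 1e6:6.1f} MB/s, "
          f"no lead-off {steady_state_bandwidth(timing) / 1e6:6.1f} MB/s")
```

For EDO this gives 192 MB/s with the 5-cycle lead-off counted and about 264 MB/s if you pretend every access takes 2 cycles. Either way it is nowhere near 320 MB/s, and both are still purely theoretical numbers.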
>The article uses the 240-320 MB/s
The article states "1996) one could get a maximum throughput of 320 MB/s", which is ~5x higher than reality. I'm not arguing that the achievement here is somehow lesser because of this mistake. I'm pointing out that assumptions about vintage hardware were incorrectly inflated. In fact those assumptions might have led to lower expectations and a worse outcome. Learning that something is possible with less is usually a strong catalyst to keep trying until you get there. A great example of this effect, while still staying on topic, is the Video7 FIFO story told by Abrash: https://www.bluesnews.com/abrash/chap64.shtml
Abrash: "push past the limits he had unconsciously set in coming up with his original design. And, in the end, I think that the single most important element of great design, whether it be hardware or software or any creative endeavor, is precisely what the Paradise news triggered in Tom: The ability to detect the limits you have built into the way you think about your design, and transcend those limits."
Upvoted as you have convinced me with your solid refs.
I don't see a way to comment on the link I posted (silabs community), but on the detailed blog one can leave comments. I will post a link to this conversation there, asking them to address the issue, and we'll see what they say.
But with a fraction of the CPU resources. The Arduino Nano's Cortex M33 is overclocked to 135 MHz, while the GBA's ARM7TDMI runs at a mere 16.78 MHz.
The ARM7TDMI takes 1-4 cycles to perform a simple 32bit x 32bit multiply, depending on the multiplier. I believe the Cortex M33 takes just 1 cycle to do the same. The ARM7TDMI has no divide instruction and, critically, no FPU, which Quake requires.
The GBA has only 32 kB of 0-wait-state RAM (AKA internal working RAM), versus 276 kB on the Arduino Nano.
The GBA's 256 kB RAM block (external working RAM) has a massive 6-cycle access time when loading a 32-bit value.
It's a true miracle someone managed to even get 1/3 of the resolution on this weak hardware!
Afaik, Quake does not do one divide per pixel; it does one every 8 pixels (see d_scan.c in winquake). Yes, there is no divide instruction, but instead of taking hundreds of cycles, tables and other approximations could be used.
Of course, div/vdiv, which take only 14 cycles or less, are a strong boost on CM4/33.
It means almost an order of magnitude fewer divisions (and fewer additional calculations as well).
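For anyone who has not seen the trick, here is a minimal sketch of the idea, not Quake's actual code: u/z and 1/z are linear in screen space, so you do the expensive divide only at the ends of each 8-pixel run and interpolate the texture coordinate linearly in between. All names and the example span are made up for illustration.

```python
STEP = 8  # subdivision length mentioned above


def perspective_u(uoz, ooz):
    """Recover u from u/z and 1/z: this is the expensive divide."""
    return uoz / ooz


def span_exact(uoz0, ooz0, duoz, dooz, width):
    """Reference version: one perspective divide per pixel."""
    return [perspective_u(uoz0 + x * duoz, ooz0 + x * dooz) for x in range(width)]


def span_subdivided(uoz0, ooz0, duoz, dooz, width, step=STEP):
    """Divide only at run boundaries, interpolate linearly inside each run.

    Returns (u values, number of perspective divides).
    """
    us = []
    u_left = perspective_u(uoz0, ooz0)
    divides = 1
    x = 0
    while x < width:
        run = min(step, width - x)
        x_right = x + run
        u_right = perspective_u(uoz0 + x_right * duoz, ooz0 + x_right * dooz)
        divides += 1
        # For a full run of 8 a fixed-point renderer would use a shift here.
        du = (u_right - u_left) / run
        us.extend(u_left + i * du for i in range(run))
        u_left = u_right
        x = x_right
    return us, divides


# Hypothetical 320-pixel span: z goes from 1 to 4, u from 0 to 64.
WIDTH = 320
approx, n_div = span_subdivided(0.0, 1.0, 16.0 / WIDTH, -0.75 / WIDTH, WIDTH)
exact = span_exact(0.0, 1.0, 16.0 / WIDTH, -0.75 / WIDTH, WIDTH)
worst = max(abs(a - e) for a, e in zip(approx, exact))
print(f"{n_div} divides instead of {WIDTH}, worst error {worst:.3f} texels")
```

On this example span that is 41 perspective divides instead of 320, and the approximation is exact at every run boundary, with the error confined to the pixels in between.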
Quake had to do this because it would have been too much especially for a low-end Pentium when it was released in 1996.
Yes it is not even noticeable, especially at low res.
Abrash did this in Quake because those divides are _free_ when interleaved with other code. The Pentium FPU is pipelined: you can issue an FDIV, then FXCH to other data and do something else for a while instead of waiting for the result. The price is hand-tuned assembly code that ran fast only on the Intel FPU in 1996. AMD caught up in 1998-99, finally implementing pipelined FDIV and 0-cycle FXCH.
Depends on the textures used. High contrast textures with vertical lines (e.g. dark wood on bright wall) would make the distortion very visible even at 320x200. However most of the game's textures are not like that.
There are some user-made maps, however, where this can be seen (e.g. I remember playing a map which was supposed to be inside a fantasy town, and it used a bunch of wood-on-wall textures that made the distortion apparent).
The article says that even if you put all the static data in flash, you still have to fit about 1.5 MB of non-static data if you don't optimize it. Besides that, all graphics are loaded from the relatively slow external SPI flash, which tops out at 17 MB/s with overclocking. Yes, the GBA is much slower, but access to cartridge data is faster than 17 MB/s (and the random-read time is in the 100 ns range, not the 1-2 us range).
The bird's average speed is 33 mph (29 knots). Cargo vessels tend to cruise at about 18-25 knots, and many move more slowly.
There's little direct traffic between Alaska and Australia. Shipping lines are visible through their emissions trails, as in this Nullschool link showing NO2 concentrations, from May of this year. The long lines are shipping lanes. You'll note these from Panama to New Zealand, tracking along the Western US coast and Alaska along the Great Circle route to Japan and China, and past Papua New Guinea, among other notable routes:
The data recorders would also likely note any marked variations in travel speed or direction. Again, ships tend not to cover the routes flown by Godwits.