Yet I also feel the things C910 does well are overshadowed by executing poorly on the basics. The core’s out-of-order engine is poorly balanced, with inadequate capacity in critical structures like the schedulers and register files in relation to its ROB capacity. CPU performance is often limited by memory access performance, and C910’s cache subsystem is exceptionally weak. The cluster’s shared L2 is both slow and small, and the C910 cores have no mid-level cache to insulate L1 misses from that L2. DRAM bandwidth is also lackluster.
I'm not a CPU designer, but shouldn't these be points that one could discover using higher-level simulators? I.e. before even needing to do FPGA or gate-level sims?
If so, are they doing a SpaceX thing where they iterate fast with known less-than-optimal solutions just to gain experience building the things?
Quite likely, yes. It should be possible to make estimates of how much your cache misses are going to impact speed.
But there's a tradeoff. It looks like they've chosen small area/low power over absolute speed. Which may be entirely valid for whatever use case they're aiming at.
No, that's when they open sourced it. It was designed in 2018/early 2019 and picked up the May 2019 RVV spec. By late 2021 I already had a commercially sold C910 dev board (RVB ICE).
I swear there's one brilliant chip journalist/analyst at any given moment who holds the Mandate of Heaven to do brilliant things. Anand Lal Shimpi was that person once, then Ian Cutress...
All the RTL basically. It’s in a directory called gen_rtl (generated RTL?) and has remarkably few comments for such a complex code base.
Also, although it's technically open source, if it's generated Verilog then isn't that a lot less useful than the code that was used to generate the RTL?
As long as it looks vaguely like any other register-based ISA, there is generally very little in an architecture that would prevent making a high-performance implementation. Some details might make it more difficult, but Intel has shown very effectively that with enough thrust even pigs can fly.
The details would be in the microarchitecture, which would not be specified by RISC-V.
> This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
> It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
The criticisms there are at the same time 1) true and 2) irrelevant.
Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or if it's not broken down then it's the critical path and forces a lower clock speed just as, for example, the single-cycle multiply on Cortex-M0 does).
Plus, an isolated indexing into an array is rare and never speed critical. The important ones are in loops, where the compiler uses "strength reduction" and "code motion out of loops" so that you're not doing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.
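As a rough sketch of what that means (hand-written illustration, not actual compiler output; function names are made up):

    #include <stddef.h>

    /* What the compiler conceptually starts from: an indexed access that
     * needs base + i*sizeof(long) on every iteration. */
    long sum_naive(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* What the loop effectively becomes after strength reduction and
     * unrolling by 4: the address arithmetic collapses to one pointer
     * increment per 4 loads.  (Tail handling omitted; assumes n is a
     * multiple of 4.) */
    long sum_unrolled(const long *a, size_t n) {
        long s = 0;
        const long *p = a, *end = a + n;
        while (p != end) {
            s += p[0]; s += p[1]; s += p[2]; s += p[3];
            p += 4;
        }
        return s;
    }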
"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.
A lot of the criticisms were of a reflexive "bigger is better" nature, without any examination of HOW MUCH better, or of the cost in something else you can't do instead because of it. For example, both conditional branch range and JAL/JALR range are criticised as being limited because the instructions include one or more 5-bit register specifiers: conditional branches encode "compare and branch" in a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.
RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need it then you can always do the opposite condition branch over an unconditional ±1 MB jump. If your loop is so very large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.
What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.
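To make that concrete, here's a minimal sketch (the loop is mine, and the instruction sequences in the comment are hand-written illustrations, not real compiler output):

    #include <stddef.h>

    /* Tight loop whose back edge is a register-register comparison.
     * Illustrative back-edge sequences:
     *   RV64:   bne  a0, a1, loop      # compare-and-branch in one instruction
     *   arm64:  cmp  x0, x1
     *           b.ne loop              # a general reg-reg compare needs two
     * When the whole loop body is only a handful of instructions, that one
     * extra instruction on a simple in-order core is where the 10-20% goes. */
    long sum_range(const long *p, const long *end) {
        long s = 0;
        while (p != end)
            s += *p++;
        return s;
    }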
"don't run any faster than a sequence of simpler instructions"
This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon through Family 10h has a 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.
For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need a shift and a dependent add. That's two extra cycles of latency.
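If anyone wants to measure this on their own hardware, a pointer-chasing harness along these lines should expose it (this is my own rough sketch, not something from the thread; the numbers are only meaningful because both chases stay in L1D and each load depends on the previous one, so only the addressing-mode latency differs):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     2048          /* 2048 * 8 bytes = 16 KiB, fits in L1D */
    #define ITERS 100000000L

    static size_t perm[N], idx[N];
    static void *ptrs[N];

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        /* Build one random cycle, expressed both as indices and as pointers. */
        for (size_t i = 0; i < N; i++) perm[i] = i;
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < N; i++) {
            idx[perm[i]]  = perm[(i + 1) % N];         /* next index in the cycle */
            ptrs[perm[i]] = &ptrs[perm[(i + 1) % N]];  /* next slot in the cycle */
        }

        /* Indexed chase: every load is idx[cur], i.e. base + cur*8. */
        size_t cur = 0;
        double t0 = now_sec();
        for (long k = 0; k < ITERS; k++) cur = idx[cur];
        double t1 = now_sec();

        /* Plain chase: every load is simply *p, no address arithmetic. */
        void **p = &ptrs[0];
        double t2 = now_sec();
        for (long k = 0; k < ITERS; k++) p = (void **)*p;
        double t3 = now_sec();

        /* Print the results (and keep cur/p live so the chases aren't removed). */
        printf("indexed: %.2f ns/load  plain: %.2f ns/load  (cur=%zu p=%p)\n",
               (t1 - t0) / ITERS * 1e9, (t3 - t2) / ITERS * 1e9, cur, (void *)p);
        return 0;
    }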
Note that Zba's sh1add/sh2add/sh3add take care of the problem of separate shift+add. But yeah, modern x86-64 doesn't have any difference between indexed and plain loads[0], nor Apple M1[1] (nor even cortex-a53, via some local running of dougallj's tests; though there's an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).
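For reference, here's what that looks like for an 8-byte element load on RV64 (hand-written sequences, not compiler output):

    #include <stdint.h>
    #include <stddef.h>

    /* Loading a[i] from an array of 8-byte elements.
     *
     *   RV64 without Zba:   slli   t0, a1, 3      # t0 = i * 8
     *                       add    t0, a0, t0     # t0 = a + i*8
     *                       ld     a0, 0(t0)
     *
     *   RV64 with Zba:      sh3add t0, a1, a0     # t0 = a + (i << 3)
     *                       ld     a0, 0(t0)
     *
     * sh3add folds the shift and the add into one instruction, so only one
     * ALU op sits in front of the load instead of two dependent ones. */
    int64_t load_elem(const int64_t *a, size_t i) {
        return a[i];
    }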
Apart from the obvious lip service to the idea that the latest chip technology is crucial to national security, I fail to understand why this is the case.
What is the disadvantage for a country that only has access to computer technology from the 2010s? They will still make the same airplanes, drones, radars, tanks and whatever.
It seems to me that SOTA manufacturing capability for semiconductors is nice to have, but not necessary.
Once China has a CPU that is really good enough for most critical tasks, it might as well start dealing with Taiwan in order to let other countries see how well they progress if they no longer have the manufacturing capabilities of TSMC and others at their disposal.
If played well, it could even let them win the AI race even if they and everyone else have to struggle for a decade.
In the early 2000s, bringing China into the global community was widely seen as a strategic decision. The Bill Clinton and George W. Bush administrations supported integrating China’s economy into the international rules-based system.
China did not want to integrate. China has been seeking strategic independence for its economy by developing alternative layers of global economic ties, including the Belt and Road Initiative, PRC-centered supply chains, and emerging-country groupings, for longer than the US has.
Too many technological ties with China are seen as a potential vulnerability. It's not just the technology itself, but its importance in trade and the economy. If the US or its allies have value chains tightly integrated with China for strategic components, it creates dependence.
This is dishonest. China didn't spend the last 20 years invading multiple countries, committing acts of mass murder and destabilizing the whole Middle East.
If anything, China's rise is a stabilizing factor for the whole world. It balances the aggression originating from the United States.
Running LLMs is something I'm absolutely positive I could have done with my Dell R810 cluster back in 2010, if I'd had access to DeepSeek.
Training a frontier model, probably not. Again, it's not clear what the strategic benefit is of having access to a frontier LLM vs a more compact, less capable one.
At this rate they're going to get them whether they need them or not. Big push in the west for "AI everywhere" e.g. Microsoft Copilot; UK has some ill-defined AI push https://www.bbc.co.uk/news/articles/crr05jykzkxo
>They will still make the same airplanes, drones, radars, tanks and whatever.
Eventually there'll be fully autonomous drones and how competitive they'll be will be directly proportional to how fast they can react relative to enemy drones. All other things being equal, the country with faster microchips will have better combat drones.
Alternatively, the country which has the biggest drone manufacturer in the world, one that can sell a $200 drone[0] capable of following a human using a single camera and sending the video in real time over 20 km using the same in-house-designed chipset for both AI control and video transmission[1], would probably win.
> All other things being equal, the country with faster microchips will have better combat drones
That's very unlikely imo.
When it comes to drones... no matter how fast your computation is, there are other bottlenecks, like how fast the motors can spin up, how fast the sensor readings are, how battery-efficient they are, etc.
Right now the 8-bit ESCs are still as competitive as 32-bit ESCs, and a lot of the "follow me" tasks use a lot less computational power than what your typical smartphone offers these days...
Current drones are very limited compared to what they could do with a lot more processing power and future hardware developments. E.g. imagine a drone that could shoot a moving target hundreds of metres away in the wind, while it itself was moving very fast.
"Large" drones (aircraft rather than quadcopter) seem to follow the same rules as manned aircraft and engage with guided or unguided munitions of their own. If the drone is cheap enough then "drone as munition" seems likely to win.
> They will still make the same airplanes, drones, radars, tanks and whatever.
At the same cost and speed? Volume matters.
> is crucial to the national security
National security isn't just about military power. Without the latest chips, e.g. if there were sanctions, it could impact the economy. The nation can become insecure, e.g. by means of more and more people suffering from poverty.
Off the top of my head: you would like to cut costs for materials and nuclear research, big data analytics, all kinds of simulations, and machine learning tasks that might not be LLMs but still give you an intelligence advantage, plus the economic security of being able to provide multiple services at lower prices. If necessary, one can throw money, people, and other resources at a problem, but those could be spent elsewhere for a higher return on investment. Especially if you have multiple compute-intensive tasks in the queue, you might have to prioritize and deny yourself certain capabilities as a result. So I'd say it is not any one single task that needs cutting-edge compute; it is the capability to perform multiple tasks at the same time at acceptable prices that is important.
From the benchmarks in another of the blog posts, I very roughly estimate this to be about 1/50 the performance of a Ryzen 7950X for CPU-bound tasks not requiring vector instructions. For vectorizable workloads it will be much slower still, due to the lack of software support for SIMD.
That said, isn't the C910 the one with the critically buggy vector block?
It is an amazing achievement in a saturated market. The road to a fully mature and performant large RISC-V implementation is still long (to catch up with the other ones)...
... but a royalty free ISA is priceless, seriously.
- If points are (significantly) higher than comments: the submission is so niche or highly technical that a lot of people can appreciate it but only a few can meaningfully comment. See this one right now: 120 vs 14
- If comments are higher than points, or gravitating towards a 1:1 ratio: casual topic or flamewar (politics). See the DOGE post on the front page now, 1340 vs 2131
That being said, I think "healthy" posts have a 1.5:1 - 2:1 ratio
I think "scary" is the best word. If, of course, the rumor about choosing RISC-V over ARM is true!
We just saw a big win for ARM over Intel's multi-decade scam, and yet here we are with another split from ARM, because of politics. Scary to see how stupid talking heads, convincing other people to hate without real reasons, can lead to such stories…
> We just saw a big win for ARM over Intel's multi-decade scam
We also just saw Arm sue one of their customers following an acquisition of another of Arm's customers, and try to make them destroy IP that was covered by both customers' licenses. Nobody wants to deal with licensing, and when the licensor is that aggressive it makes open alternatives all the more compelling, even if they're not technically on-par.
SoftBank is usually the dumb money at the poker table, funding bad ideas long after anyone intelligent has left them. WeWork is probably the best example.
Getting funded by SoftBank is probably a good proxy indicator for a company losing its competitive edge.