In the early 2000s I worked at a company where our IT section was in its own building, with only about 18-24 people spread across three mostly open-plan areas between development, testing and infrastructure.
Even so we still had an incident where two guys walked in and just collected a few laptops before making their escape.
We like to think that we are hyper-vigilant and intelligent as human beings, but in general we tend to focus on whatever is right in front of us. We assume that whatever is happening must be ordinary, or else why would it be happening?
Here's an easy-to-understand example. I've been playing EVE Online, which has an API you can query for information on its items and market (as well as several other unrelated things).
It seems like a prime candidate for using AI to quickly generate the code. You create the base project and give it the data structures and calls, and it quickly spits out a solution. Everything is great so far.
Then you want to implement some market trading, so you need to calculate opportunities from the market orders vs. their buy/sell prices vs. unit price vs. orders per day, etc. You add that to the AI spec and it easily creates a working solution for you. Unfortunately, once you run it, it takes about 24 hours to update, making it nearly worthless.
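To make that kind of opportunity calculation concrete, here is a hedged sketch. The fee rates, the 10% volume-capture assumption, and the parameter names are all made up for illustration; they are not the real EVE API schema.

```csharp
using System;

// All numbers here are assumptions: 3% broker fee, 3.6% sales tax, and
// capturing 10% of the item's average daily traded volume.
Console.WriteLine(DailyProfitEstimate(highestBuy: 100.0, lowestSell: 120.0, avgDailyVolume: 1000.0));

static double DailyProfitEstimate(double highestBuy, double lowestSell, double avgDailyVolume,
                                  double brokerFee = 0.03, double salesTax = 0.036)
{
    double cost   = highestBuy * (1 + brokerFee);            // buy via our own buy order
    double income = lowestSell * (1 - brokerFee - salesTax); // sell via our own sell order
    double marginPerUnit = income - cost;
    // Unprofitable spreads are simply skipped.
    return marginPerUnit > 0 ? marginPerUnit * avgDailyVolume * 0.1 : 0.0;
}
```

The point of the story is that the AI version of this logic worked, just far too slowly to be useful.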
The code it created was very cheap, but also extremely problematic. It gave no consideration to future usage, so everything from the data layer to the front end has issues that you're going to be fighting against. Sure, you can refine the prompts to tell it to start modifying code, but soon you're going to be sitting with more dead code than useful lines, and it will trip up along the way with many more issues that you'll have to fix.
In the end, it turns out that code wasn't cheap at all, and you needed to spend just as much time as you would have on "expensive code". Even worse, the end product is still nearly as terrible as the starting product, so none of that investment gave any appreciable results.
been there. for these kinds of projects i never start with code, i start with specs. i sometimes spend days just working on the specs. once the specs are clear i start coding, which is a completely different monster. basically not that different from a common workflow (spec > ticket > grooming > coding, more or less), but everything with AI.
Yeah, I make sure to spend my own time on the hard problems and the places where you need to design for the future. I use AI only up to around method level; there I have it do the drudge work of typing up tedious text, completing required boilerplate, etc.
Did you tell it to consider future usage? Have you tried using it to find and remove dead code? In my experience you can get very good code if you just do a few passes of AI adversarial reviews and revisions.
Stanford's cogen plant has an underground "ice cube" for campus/municipal chilled-water infrastructure. Perhaps scaling something like that makes sense, or perhaps using an absorption heat pump (AHP) that can operate like an Einstein–Szilard refrigerator, or as its reverse?
The funniest thing I've seen GPT do was a while back, when I had it try to implement ORCA (Optimal Reciprocal Collision Avoidance). It's a human-designed algorithm in which each entity uses its own and its N neighbours' current radii and velocities to project mathematical lines into the future, so that they can avoid walking into each other.
It came very close to success, but there were 2 or 3 big show-stopping bugs, such as forgetting to update the spatial partitioning when the entities moved, so it would work at the start but then degrade over time.
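That class of bug is easy to picture with a minimal uniform-grid sketch (hypothetical names, not GPT's actual output): the whole point of the structure is that an entity's cell assignment must be refreshed every time it moves, or neighbour queries silently go stale.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var grid = new SpatialGrid(cellSize: 10f);
grid.Update(id: 1, x: 5f, y: 5f);    // entity 1 lands in cell (0, 0)
grid.Update(id: 2, x: 55f, y: 55f);  // entity 2 lands in cell (5, 5)
grid.Update(id: 1, x: 55f, y: 55f);  // entity 1 moved: MUST be re-bucketed
Console.WriteLine(string.Join(",", grid.Nearby(55f, 55f).OrderBy(i => i)));  // both entities found

class SpatialGrid
{
    readonly float cellSize;
    readonly Dictionary<(int, int), List<int>> cells = new();
    readonly Dictionary<int, (int, int)> cellOf = new();

    public SpatialGrid(float cellSize) => this.cellSize = cellSize;

    (int, int) CellFor(float x, float y) =>
        ((int)MathF.Floor(x / cellSize), (int)MathF.Floor(y / cellSize));

    // Must be called whenever an entity moves; skipping this is exactly the
    // "works at the start, degrades over time" failure described above.
    public void Update(int id, float x, float y)
    {
        var cell = CellFor(x, y);
        if (cellOf.TryGetValue(id, out var old))
        {
            if (old == cell) return;
            cells[old].Remove(id);
        }
        if (!cells.TryGetValue(cell, out var list)) cells[cell] = list = new List<int>();
        list.Add(id);
        cellOf[id] = cell;
    }

    // Entities in the 3x3 cell neighbourhood around (x, y).
    public IEnumerable<int> Nearby(float x, float y)
    {
        var (cx, cy) = CellFor(x, y);
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                if (cells.TryGetValue((cx + dx, cy + dy), out var list))
                    foreach (var id in list) yield return id;
    }
}
```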
It got stuck on the belief that the algorithm itself must be the problem, so at some point it just stuck a generic boids solution into the middle of everything else. To make it worse, it didn't even bother to use the spatial partitioning; the entities were just brute-force scanning all their neighbours.
Had this been a real system it might have made its way into production, which makes one wonder about the value of the AI code out there. As it was, I pointed out that section and asked about it, at which point it admitted it was definitely a mistake and removed it.
I had previously implemented my own version of the algorithm, and it took me quite a bit of time, but along the way I built up a mental model of the code and understood both the problem and the solution by the end. In comparison, the AI implemented it easily 10-30x faster than I did, but it would never have managed to complete the project on its own. And if I hadn't previously implemented it myself and had just tried to have it do the heavy lifting, I wouldn't have understood enough of what it was doing to overcome its issues and get the code working properly.
It's probably the same; for example, in Afrikaans it's just gif. Vergif is the verb form of doing it, and vergiftig the past tense of it having happened previously.
What I agree on is that if we had modern .NET available we'd get a free 2-3x improvement, which would definitely be great. BUT, having said that, if you're into performance but unwilling to use the tools available, then that's on you.
From the article it seems that you're using some form of threading for generation, but you don't really specify which or how.
The default C# implementations are usually quite poor performance-wise. If you used, for example, the default thread pool, I can definitively say that I've achieved a 3x speedup over it using my own thread pool implementation, which would yield about the same 30s -> 12s reduction.
Burst threading/scheduling is also a lot better than the standard one. If I feed it a logic-heavy method (so no vectorization), my pool can beat it by a bit, but nowhere near the 3x over the normal thread pool.
But if your generation is number-heavy (vs. logic-heavy), then using Burst you could probably drop that calculation time down to 2-3 seconds (about the same as if you used 256-bit Vector<T> numerics).
Finally, you touch on GC; that's definitely a problem. Unity has upgraded its Mono variant over time, but C# remains C#, which was never meant for gaming. Even if we had access to the modern runtime there would still be issues with it. As with all the other C# libraries, gaming was never considered a target, whereas what we want is extremely fast access/low latency with no hiccups. C# in the business world doesn't really care if it loses 16ms (or 160ms) here and there due to garbage; it's usually not a problem there.
Coding in Unity means going over every instance of allocation outside of startup and eliminating it. You mention APIs that still need to allocate, which I've never run into myself. Again, modern .NET isn't going to simply make those go away.
Sure, we could use Burst to speed up some strategic parts, but that would not help with the core of the game.
To give some context, things are very complex in our game, we have fully dynamic terrain with terrain physics (land-slides), advanced path-finding of hundreds of vehicles (each entity has its own width and height clearance), trains, conveyors and pipes carrying tens or even hundreds of thousands of individual products, machines, rockets, ships, automated logistics, etc. There is no one thing that could be bursted to get 3x gain. At this point, we'd have to rewrite the entire game in C++.
So what's the reason we use C#? Productivity, ease of debugging and testing, and resilience to bugs (e.g. a null dereference won't kill the program). Messing with C++ or even Burst would cost us more time and, to be honest, the game would possibly not even exist at that point.
Could you share some details about your custom thread pool that got a 3x speedup? What was the speedup from? It is highly unlikely that a custom thread pool would have any significant impact on the benchmark in our case. As you can see from Figure 3, threaded tasks run for about 25% of the total time, and even with Mono, all tasks are reasonably well balanced between threads. Thread utilization is surely over 90% (there is always slight inefficiency towards the end as threads finish up, but that's hundreds of ms). An "oracle" thread pool could speed things up by 10% of that 25%, so that is not it.
Vectorization could help too, but the majority of the code is not easily vectorizable. It's all kinds of workloads: loading data, deserialization, initialization of entities, map generation, precomputation of various things. I highly doubt that automatic vectorization of the code generated by IL2CPP would bring more than a 20% speedup here. The speedup from Burst would mostly come from eliminating the inefficient code generated by Mono's JIT, not from vectorization.
For now, we are accepting the Mono tax to be more productive. But I am hoping that Unity will deliver on the CoreCLR dream. In the meantime, my post was meant to raise awareness and stir up some discussion, like this one, which is great. I've read lots of interesting thoughts in this comments section.
>Sure, we could use Burst to speed up some strategic parts... the game would possibly not even exist at that point.
Yeah, the thing with Burst is that it's a lot easier to work with if you start with it than it is to replace/upgrade code later, especially if you're not familiar with it. A big issue is usually that you create structs with data that reference other structs, etc.; all of those need to be untangled to really make use of Burst.
I myself am also a big C# fan; it is a lot easier than using C. Unity has a lot of issues, but there's a reason it's so widely adopted and used. (I'm currently working on a Unity C# tool that I believe will speed up code development significantly.)
Your game does sound like a VERY ripe target for Burst based on the elements you describe, but the real question should be whether you need it. For example, if you're already running at 60 fps on whatever your mid-range target hardware is, at whatever max + N% load/size for a game instance, then you don't need it. But if you're only hitting 40 fps and design-wise want to increase e.g. your map size by 2x, then it might be something to look into. Also, if you look at e.g. Factorio, they spend a LOT of time optimizing systems, but of course you first need to launch the game (which is, and should be, the priority).
If you have, for example, 25 systems (e.g. pathfinding, trains, pipes, etc.) and they're evenly balanced, then as you say you won't increase your game speed 2x by converting just one of them. BUT if, for example, your pipes are being processed in 4ms per frame and you're already adopting other strategies like only processing them every Nth frame or doing M pipes per frame, then using Burst to get that 4ms down to 0.5ms might be a really worthwhile target to make your game play better. The same goes for all your systems, where the upgrades will have a cumulative effect.
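The "M pipes per frame" strategy is just a cursor carried across frames. A minimal sketch (hypothetical types, independent of any Burst specifics):

```csharp
using System;
using System.Collections.Generic;

// Demo: 5 pipes, budget of 2 updates per frame.
var updates = new int[5];
var pipeIds = new List<int> { 0, 1, 2, 3, 4 };
var slicer = new PipeSlicer(perFrame: 2);
for (int frame = 0; frame < 5; frame++)
    slicer.Tick(pipeIds, id => updates[id]++);
Console.WriteLine(string.Join(",", updates));  // after 5 frames each of the 5 pipes was updated twice

// Round-robin time slicing: at most `perFrame` items are updated each frame,
// with the cursor persisting across frames so every item is eventually visited.
class PipeSlicer
{
    readonly int perFrame;
    int cursor;

    public PipeSlicer(int perFrame) => this.perFrame = perFrame;

    public void Tick<T>(IList<T> pipes, Action<T> update)
    {
        if (pipes.Count == 0) return;
        int n = Math.Min(perFrame, pipes.Count);
        for (int i = 0; i < n; i++)
        {
            update(pipes[cursor]);
            cursor = (cursor + 1) % pipes.Count;
        }
    }
}
```

Once each slice is small and self-contained like this, converting just the `update` body to a Burst job is a much more tractable change than rewriting the whole system.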
I highly suggest learning just the basics of Burst in your spare time and trying it out on something small to get a feel for it. As with all code/libraries, it'll unfortunately take some time to figure out how to use it effectively.
Roughly speaking:
- You don't have to have SOA data, but it helps. At the start, just convert methods over 1:1.
- You have to convert most C# container types to Burst-compatible ones. For example, in struct Vehicle { Wheel[] wheels } you need to change Wheel[] to NativeArray<Wheel>, and the Wheel struct itself also needs to avoid complex (managed) types, etc.
Other types such as NativeSlice are also very useful; instead of storing the wheels, just use a slice/span referencing them instead.
- After you have the basics going, you can try out SOA along with more math/less logic so that the code can be vectorized. Once you see that big speedup for certain types of code, it's hard to go back.
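To illustrate the AoS -> SOA step in plain C# (a hedged sketch with made-up Vehicle/Wheel fields; in real Burst code the flat arrays would be NativeArray<float> inside a job):

```csharp
using System;

// SOA demo: one wheel with radius 0.5 m spinning at 60 RPM.
var wheels = new WheelsSoA { Radius = new[] { 0.5f }, Rpm = new[] { 60f } };
Console.WriteLine(wheels.Speeds()[0]);  // 2*pi*0.5*60/60 = pi m/s

// AoS layout: fine for plain C#, but the nested managed array blocks Burst.
struct WheelAoS { public float Radius; public float Rpm; }
struct VehicleAoS { public WheelAoS[] Wheels; }

// SOA layout: one flat array per field, giving tight loops that an
// auto-vectorizer (or Burst) can chew through.
class WheelsSoA
{
    public float[] Radius = Array.Empty<float>();
    public float[] Rpm = Array.Empty<float>();

    public float[] Speeds()
    {
        var s = new float[Radius.Length];
        for (int i = 0; i < s.Length; i++)
            s[i] = 2f * MathF.PI * Radius[i] * Rpm[i] / 60f;  // rim speed in m/s
        return s;
    }
}
```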
> Could you share some details about your custom thread pool that got a 3x speedup? What was the speedup from? It is highly unlikely that a custom thread pool would have any significant impact on the benchmark in our case. As you can see from Figure 3, threaded tasks run for about 25% of the total time, and even with Mono, all tasks are reasonably well balanced between threads. Thread utilization is surely over 90% (there is always slight inefficiency towards the end as threads finish up, but that's hundreds of ms). An "oracle" thread pool could speed things up by 10% of that 25%, so that is not it.
My thread pool itself is pretty standard: it spins up some heavy threads and uses ManualResetEvent to trigger them. Its advantage lies in pre-registering simple Action calls (with/without parameters) as the methods to run when a thread wakes, plus more gaming-related options for whether we wait on thread completion, interleave runs with other threads, etc.
A big plus is that it has a self-optimization function: it adjusts the thread count against the total time runs take, the number of items processed per batch for the given workload, etc., so it automatically finds good values for the target computer, rather than just assuming e.g. 32, 64 or 128 inner elements and launching the maximum available threads on the PC (as thread pools usually do).
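The "persistent threads + pre-registered work" part can be sketched roughly like this. This is my own hedged reconstruction, not the commenter's code, and I've used one AutoResetEvent per thread rather than a shared ManualResetEvent to keep the sketch race-free:

```csharp
using System;
using System.Threading;

var counts = new int[2];
using (var pool = new MiniPool(2))
{
    pool.Register(worker => counts[worker]++);  // registered once; no per-run delegate allocation
    for (int run = 0; run < 3; run++)
        pool.RunAndWait();
}
Console.WriteLine(string.Join(",", counts));  // each worker ran 3 times

class MiniPool : IDisposable
{
    readonly Thread[] threads;
    readonly AutoResetEvent[] go;
    readonly CountdownEvent done;
    Action<int> work = _ => { };
    volatile bool running = true;

    public MiniPool(int count)
    {
        go = new AutoResetEvent[count];
        done = new CountdownEvent(count);
        threads = new Thread[count];
        for (int i = 0; i < count; i++)
        {
            int id = i;
            go[id] = new AutoResetEvent(false);
            threads[id] = new Thread(() =>
            {
                while (true)
                {
                    go[id].WaitOne();      // park until the next run is triggered
                    if (!running) return;
                    work(id);
                    done.Signal();
                }
            }) { IsBackground = true };
            threads[id].Start();
        }
    }

    public void Register(Action<int> action) => work = action;

    public void RunAndWait()
    {
        done.Reset();                      // arm the completion counter
        foreach (var e in go) e.Set();     // release every worker exactly once
        done.Wait();                       // block until all workers have signalled
    }

    public void Dispose()
    {
        running = false;
        foreach (var e in go) e.Set();
        foreach (var t in threads) t.Join();
    }
}
```

The self-tuning described above would sit on top of this: time a few runs, then vary the thread count and batch size toward whatever the target machine handles best.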
> Vectorization could help too, but the majority of the code is not easily vectorizable. It's all kinds of workloads: loading data, deserialization, initialization of entities, map generation, precomputation of various things. I highly doubt that automatic vectorization of the code generated by IL2CPP would bring more than a 20% speedup here. The speedup from Burst would mostly come from eliminating the inefficient code generated by Mono's JIT, not from vectorization.
Yeah, if it's startup/generation code that's mostly bypassed by loading a save, then it's not worth switching over. Do note that code compiled by Burst will in general be more optimized than Mono's output just due to better tooling, but it's generally not worth moving over for that alone given the amount of work required. The real wins come when some generation step that runs often is taking too long, or during gameplay, where you can replace elements that take e.g. N milliseconds to calculate every frame and drop them to 1/10th-1/100th of the time they used to take.
This part of your comment is wrong on many levels:
"The Burst compiler/HPC# plays on every meme perpetuated by modern gamedev culture (structure-of-arrays, ECS), but performance wise, generally still falls short of competently, but naively written C++ or even sometimes .NET C#. (Though tbf, most naive CoreCLR C# code is like 70-80% the speed of hyper-optimized Burst)".
C++ code can be much faster than C#, but modern C# has become a lot better with all the time that's been invested in it. You can't just take a random bit of C code and assume it's going to beat an optimized bit of C#; those days are long past.
Secondly, the whole point of Burst is that it enables vectorization, which means that if you've converted code to it and it's used properly, it's going to emit instructions up to 256 bits wide (from what I remember it doesn't use AVX-512). That means it's going to be significantly faster than standard C# (and C).
If the author is generating, for example, maps, and it takes 80 seconds with Mono, then getting to between 10-30 seconds with Burst is easy to achieve just from its thread usage. Once you then add focused optimizations that make use of vectorization, you can get that down to probably 4-odd seconds. (The actual numbers really depend on what you're doing: with a numerical calculation you can easily get an 80x improvement, but if there's a lot of logic being applied then you'll be stuck at e.g. 8x.)
For the last point: modern C# can't just magically apply vectorization everywhere, because developers intersperse far too much logic. It has a lot of libraries that have become much more performant, but again you can't compare those directly to Burst. To compare against Burst you have to use System.Numerics, etc.
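For reference, this is the kind of explicit System.Numerics code that is the fair comparison point against Burst (a hedged sketch; the lane width depends on the host CPU):

```csharp
using System;
using System.Numerics;

var data = new float[100];
for (int j = 0; j < data.Length; j++) data[j] = j + 1;  // 1..100
Console.WriteLine(SumVectorized(data));                 // 1+2+...+100 = 5050

// Sums `width` floats per SIMD add; Vector<float>.Count is e.g. 8 on an AVX2 CPU.
static float SumVectorized(float[] values)
{
    int width = Vector<float>.Count;
    var acc = Vector<float>.Zero;
    int i = 0;
    for (; i <= values.Length - width; i += width)
        acc += new Vector<float>(values, i);            // one hardware add per chunk
    float sum = Vector.Dot(acc, Vector<float>.One);     // horizontal sum of the lanes
    for (; i < values.Length; i++) sum += values[i];    // scalar tail
    return sum;
}
```

This is what "comparing with Numerics" means in practice: the loop body has to be restructured around lanes, which is exactly the work Burst (or any vectorizing compiler) needs logic-free code to do for you.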
It's funny how ideas come and go. I made this very comment here on Hacker News probably 4-5 years ago and received a few downvotes for it at the time (albeit I was thinking of computers in general).
It would take a lot of work to make a GPU do current CPU type tasks, but it would be interesting to see how it changes parallelism and our approach to logic in code.
> I made this very comment here on Hacker News probably 4-5 years ago and received a few down votes for it at the time
HN isn't always very rational about voting. It would be a loss to judge any idea on that basis.
> It would take a lot of work to make a GPU do current CPU type tasks
In my opinion, that would be counterproductive. The advantage of GPUs is that they have a large number of very simple cores. Instead, just put a few separate CPU cores on the same die, or on a separate die. Or you could even have a forest of GPU cores with a few CPU cores interspersed among them, sort of like how modern FPGAs have logic tiles, memory tiles and CPU tiles spread across the die. I doubt it would be called a GPU at that point.
GPU compute units are not that simple. The main difference from CPUs is that they generally use a combination of wide SIMD and wide SMT to hide latency, as opposed to the power-intensive out-of-order processing used by CPUs. Performing tasks that can't take advantage of either SIMD or SMT on GPU compute units might be a bit wasteful.
Also, you'd need to add extra hardware for various OS-support functions (privilege levels, address-space translation/MMU) that are currently missing from GPUs. But the idea is otherwise sound; you can think of the proposed 'Mill' CPU architecture as one variety of it.
Perhaps I should have phrased it differently. CPU and GPU cores are designed for different types of loads. The rest of your comment seems similar to what I was imagining.
Still, I don't think that enhancing GPU cores with CPU capabilities (OoO, rings, MMU, etc. from your examples) is the best idea. You may end up with the advantages of neither and the disadvantages of both. I was suggesting that you could instead have a few dedicated CPU cores distributed among the numerous GPU cores. Finding the right ratio of GPU to CPU cores may be the key to achieving the best performance on such a system.
As I recall, Gartner made the outrageous claim that upwards of 70% of all computing will be "AI" within some number of years - nearly the end of CPU workloads.
I'd say over 70% of all computing has already been non-CPU for years. If you look at your typical phone or laptop SoC, the CPU is only a small part. The GPU takes the majority of the area, with other accelerators also taking significant space. Manufacturers would not spend that money on silicon if it were not already being used.
> I'd say over 70% of all computing has already been non-CPU for years.
> If you look at your typical phone or laptop SoC, the CPU is only a small part.
Keep in mind that die area doesn't always correspond to the throughput (average rate) of the computations done on it. That area may be allocated for higher computational bandwidth (peak rate) and lower latency; in other words, getting the results of a large number of computations faster, even if it means the circuits idle for the rest of the cycles. I don't know the situation on mobile SoCs with regard to those quantities.
This is true, and my example was a very rough metric. But computation density per area is actually way, way higher on GPUs than on CPUs; CPUs spend only a tiny fraction of their area doing actual computation.
> If you look at your typical phone or laptop SoC, the CPU is only a small part
In mobile SoCs a good chunk of this is about power efficiency. On a battery-powered device there's always going to be a tradeoff between spending die area to make something like 4K video playback more power efficient versus general-purpose compute.
Desktop-focused SKUs are more liable to spend a metric ton of die area on bigger caches close to the compute.
Going by raw operations performed, that's probably true for computers/laptops whenever the workload uses 3D rendering for the UI. Watching a YT video is essentially the CPU pushing data between the internet and the GPU's video decoder, and on to the GPU-accelerated UI.
Looking at home computers, most "computing" counted as FLOPS is done by GPUs anyway, just to show more and more frames. Processors are only used to organize all that data to be crunched by the GPUs. The rest is browsing webpages and running Word or Excel a few times a month.
Is there any need for that? Just have a few good CPUs there and you’re good to go.
As for what the HW looks like, we already know: look at Strix Halo as an example. We are just getting bigger and bigger integrated GPUs; most of the FLOPS on that chip are in the GPU part.
HN in general is quite clueless about topics like hardware, high-performance computing, graphics, and AI performance. So you probably shouldn't care if you are downvoted, especially if you honestly know you are correct.
Also, I'd say if you buy, for example, a MacBook with an M4 Pro chip, it already is a big GPU attached to a small CPU.