No. Plausible code is syntactically correct BS disguised as a solution, hiding countless weird semantic behaviours, invariants and edge cases. It doesn't reflect the natural, common-sense thought process a human would follow. It's a jumble of badly-joined patterns with no integral sense of how they fit together in the larger conceptual picture.
Why do people keep insisting that LLMs don't follow a chain of reasoning process? Using the latest LLMs you can see exactly what they "think" and see the resultant output. Plausible code does not mean random code as you seem to imply, it means...code that could work for this particular situation.
Because they don't. The chain-of-reasoning feature is really just a way to get the LLM to prompt itself more.
The fact that it generates these "thinking" steps does not mean it is using them for reasoning. Its most useful effect is making it seem to a human that there is a reasoning process.
I love how generating strings like "let me check my notes" is effective at ending up with somewhat better end results - it pushes the weights towards outputting text that appears to be written by someone who did check their notes :D
I can't remember which lecture it was, but a guy said "they don't think, they only seem to think, and they won't replace a substantial portion of human labor, they will only seem to do so" ;)
Joking aside, this is exactly what happens with companies announcing "AI" replacing human labor when what they actually do is correcting for COVID-time overhiring while trying to make it appear in a way that won't make the stocks go too red.
It doesn't have to be either because the burden of proof is not on me. It's on whoever claims that chaining multiple prompts together produces thinking, even though a single prompt is just predicting n-grams.
The chain does not change the token generation process, it just artificially lengthens it.
Easy: humans can synthesize new facts using logic and context. And the conclusions can be checked against the real world to verify that the reasoning was correct.
Or another way: reasoning is a socially constructed concept, developed by humans. Humans therefore have defined reasoning, and must therefore know how to reason.
Or a third way: I experience reasoning, you experience reasoning. I am currently reasoning. You are currently reasoning. I am human, as are you. Therefore humans reason.
This one is even easier. LLMs record objective data about n-gram distribution; there is no room for any "subjective state" in their working set.
Or another way: an LLM will respond the same way whether you wait one second or a decade between prompts. The only way it is interacted with is through a stream of tokens. There is exactly one stream at all times, and each time the stream is input again, it is barely different from the previous input. The LLM does not behave differently depending on the contents of the stream. It may produce the exact same token for two different streams. It may also encounter the same stream twice, and will act the same in both cases. If it were a "subject" "experiencing", say, a discussion, it would use its past "experience". But it does not.
They haven't replied to my comment but have to yours, so I can only assume they actually cannot point out the difference, which makes sense as the philosophy of mind is a very old subject and there is no way a threaded conversation like this would produce any concrete answers.
We went from expressing computation via formal, mostly non-ambiguous languages with strict grammar and semantics, to a fuzzy and flaky probabilistic system that tries to mimic code that was already written. What could go wrong?
You are correct. The final output was polished by ChatGPT, but I originally presented the basic ideas in a few paragraphs. The LLM added a sense of flow and persuasiveness which are somehow lacking in my rather dry and non-native English writing style.
Too many people, myself included, often skip LLM content.
After I am done rubberducking with the LLM I write my own code. I think you should do that too. Leave a few days after your session to let the discussion percolate in your mind, then write the post, and only then check the discussion again to see if any details can be added.
In all fairness, I don't actually write blog posts like this, I just bang 'em out whenever I feel like it, so I'm not sure if it will work as well as I think it would.
It starts from the identifier. At every stage, it outputs a sub-expression which is the “mirrored use” and corresponds to the boxed representation below it. When it reaches the top of the expression, it prints the final type of the expression which is the lone specifier-qualifier list.
As per the screenshot, “arr” is an array of 4 elements. Consequently, “arr[0]” is an array of 8 elements. Then, “arr[0][0]” is a pointer. And so on, until we arrive at the specifier-qualifier list.
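As a hypothetical example (the actual declaration in the screenshot may differ), a declaration like this would be read out the same way, mirrored from the identifier outward:

const char *arr[4][8];

/* reading from the identifier outward:
   arr          is an array of 4 ...
   arr[0]       is an array of 8 ...
   arr[0][0]    is a pointer to ...
   *arr[0][0]   is a const char   <- the lone specifier-qualifier list */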
Good C code will try to avoid allocations as much as possible in the first place. You absolutely don’t need to copy strings around when handling a request. You can read data from the socket in a fixed-size buffer, do all the processing in-place, and then process the next chunk in-place too. You get predictable performance and the thing will work like precise clockwork. Reading the entire thing just to copy the body of the request in another location makes no sense. Most of the “nice” javaesque XXXParser, XXXBuilder, XXXManager abstractions seen in “easier” languages make little sense in C. They obfuscate what really needs to happen in memory to solve a problem efficiently.
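As a rough sketch of that style (handle_chunk is a made-up placeholder for whatever in-place parsing and response logic you need), the whole request path can run out of a single stack buffer:

#include <stddef.h>
#include <unistd.h>

#define BUF_SIZE 4096

/* hypothetical in-place handler: parses and responds without copying */
void handle_chunk(char *data, size_t len);

void serve(int fd) {
    char buf[BUF_SIZE];          /* the only buffer; fixed size, on the stack */
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        handle_chunk(buf, (size_t)n);   /* process this chunk in place, then reuse buf */
}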
> Good C code will try to avoid allocations as much as possible in the first place.
I've upvoted you, but I'm not so sure I agree.
Sure, each allocation imposes a new obligation to track that allocation, but on the other hand, passing around already-allocated blocks imposes a burden on each call to ensure that the callees have the correct permissions (modify it, reallocate it, free it, etc.).
If you're doing any sort of concurrency this can be hard to track - sometimes it's easier to simply allocate a new block and give it to the callee, and then the caller can forget all about it (callee then has the obligation to free it).
The most important pattern to learn in C is to allocate a giant arena upfront and reuse it over and over in a loop. Ideally, there is only one allocation and deallocation in the entire program. As with all things multi-threaded, this becomes trickier. Luckily, web servers are embarrassingly parallel, so you can just have an arena for each worker thread. Unluckily, web servers do a large amount of string processing, so you have to be careful in how you build them to prevent the memory requirements from exploding. As always, tradeoffs can and will be made depending on what you are actually doing.
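A minimal sketch of that pattern (names are mine, not from the comment): one malloc up front, bump allocations inside the loop, and a reset instead of per-object frees.

#include <stdlib.h>
#include <stddef.h>

struct arena {
    char  *mem;
    size_t cap;
    size_t used;
};

int arena_init(struct arena *a, size_t cap) {
    a->mem = malloc(cap);                     /* the single allocation */
    a->cap = cap;
    a->used = 0;
    return a->mem ? 0 : -1;
}

void *arena_alloc(struct arena *a, size_t n) {
    if (a->used + n > a->cap) return NULL;    /* budget exceeded */
    void *p = a->mem + a->used;
    a->used += n;
    return p;
}

void arena_reset(struct arena *a)   { a->used = 0; }    /* "free" everything each iteration */
void arena_destroy(struct arena *a) { free(a->mem); }   /* the single deallocation */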
Short-run programs are even easier. You just never deallocate and then exit(0).
Most games have to do this for performance reasons at some point, and there are plenty of variants to choose from. Rust has libraries for some of them, but in C rolling it yourself is the idiom. One I used in C++ that worked well as a retrofit was to overload new to grab the smallest chunk that would fit the allocation from banks of them. Profiling under load let the sizes of the banks be tuned for efficiency. Nothing had to know it wasn't a real heap allocation, but it was way faster and with zero possibility of memory fragmentation.
Most pre-2010 games had to. As a former gamedev from after that period, I can confidently say it's a relic of the past in most cases now. (Not that I don't care, but I don't have to be that strict about allocations.)
Yeah. Fragmentation was a niche concern of that embedded use case. It had an MMU, it just wasn't used by the RTOS. I am surprised that allocations aren't a major hitter anymore. I still have to minimize/eliminate them in Linux signal-processing code to stay realtime.
The normal, practical version of this advice, the one that isn't a "guy who just read about arenas" post, is that you generally kick allocations outward: the caller allocates.
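A tiny sketch of what "the caller allocates" looks like in practice (format_greeting is an invented example): the callee writes into storage the caller owns and never calls malloc itself.

#include <stdio.h>
#include <stddef.h>

/* invented example: fills a caller-provided buffer, reports truncation */
int format_greeting(char *out, size_t out_len, const char *name) {
    int n = snprintf(out, out_len, "Hello, %s!", name);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* the caller decides where the memory lives:
   char buf[64];
   if (format_greeting(buf, sizeof buf, "world") == 0) ... */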
> Ideally, there is only one allocation and deallocation in the entire program.
Doesn't this technically happen with most modern allocators? They do a lot of work to avoid having to request new memory from the kernel as much as possible.
Just because there's only one deallocation doesn't mean it's run only once. It would likely be run once every time the thread it belongs to is deallocated, like when it's finished processing a request.
parse(...) {
    ...
    process(...);
    ...
}

process(...) {
    ...
    do_foo(...);
    ...
}
It sounds like a violation of separation of concerns at first, but it has the benefit that you can easily do parsing and processing in parallel, and all the data can become read-only. I was also impressed when I looked at a call graph of this, since it essentially becomes the documentation of the whole program.
It might be a problem when you can't afford side effects that you later throw away, but I haven't experienced that yet. The functions still have return codes, so you can still test that correct input triggers no error path and that incorrect input triggers the expected one.
Is there any system where the basics of HTTP (everything up to handing structured data off to the framework) are done outside of a single concurrency unit?
Not exactly what you’re looking for, but https://github.com/simdjson/simdjson absolutely uses micro-parallel techniques for parsing, and those do need to think about concurrency and how processors handle shared memory in pipelined and branch-predicted operations.
Why does "good" C have to be zero alloc? Why should "nice" javaesque make little sense in C? Why do you implicitly assume performance is "efficient problem solving"?
Not sure why many people seem fixated on the idea that using a programming language must follow a particular approach. You can do minimal alloc Java, you can simulate OOP-like in C, etc.
Unconventional, but why do we need to restrict certain optimizations (space/time perf, "readability", conciseness, etc) to only a particular language?
Because in C, every allocation incurs a responsibility to track its lifetime and to know who will eventually free it. Copying and moving buffers is also prone to overflows, off-by-one errors, etc. The generic memory allocator is a smart but unpredictable complex beast that lives in your address space and can mess up your CPU cache, introduce undesired memory fragmentation, etc.
In Java, you don't care because the GC cleans after you and you don't usually care about millisecond-grade performance.
If you send a task off to a work queue in another thread, and then do some local processing on it, you can't usually use a single Arena, unless the work queue itself is short lived.
> Why should "nice" javaesque make little sense in C?
Very importantly, because Java is tracking the memory.
In java, you could create an item, send it into a queue to be processed concurrently, but then also deal with that item where you created it. That creates a huge problem in C because the question becomes "who frees that item"?
In java, you don't care. The freeing is done automatically when nobody references the item.
In C, it's a big headache. The concurrent consumer can't free the memory because the producer might not be done with it. And the producer can't free the memory because the consumer might not have run yet. In idiomatic java, you just have to make sure your queue is safe to use concurrently. The right thing to do in C would be to restructure things so the item isn't touched after it's handed off to the queue, or to send a copy of the item into the queue, so that the question of "who frees this" is straightforward. You can do both approaches in java, but why would you? If the item is immutable there's no harm in simply sharing the reference with 100 things and moving forward.
In C++ and Rust, you'd likely wrap that item in some sort of atomic reference counted structure.
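In plain C the closest equivalent is hand-rolled reference counting. A minimal C11 sketch (my own illustration, not from the thread): whoever drops the last reference does the free, so neither producer nor consumer has to know who finishes last.

#include <stdatomic.h>
#include <stdlib.h>

struct item {
    atomic_int refcount;
    char payload[256];
};

struct item *item_new(void) {
    struct item *it = calloc(1, sizeof *it);
    if (it) atomic_init(&it->refcount, 1);
    return it;
}

void item_retain(struct item *it) {
    atomic_fetch_add(&it->refcount, 1);      /* take another reference before handing off */
}

void item_release(struct item *it) {
    if (atomic_fetch_sub(&it->refcount, 1) == 1)
        free(it);                            /* last reference dropped: safe to free */
}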
In C, direct memory control is the top feature, which means you can assume anyone who uses your code is going to want to control memory throughout the process. That means not allocating from wherever and returning blobs of memory, which means designing different APIs, which is part of the reason why learning C well takes so long.
I started writing sort of a style guide to C a while ago, which attempts to transfer ideas like this one more by example:
> Of course, in both languages you can write unidiomatically, but that is a great way to ensure that bugs get in and never get out.
Why does "unidiomatic" have to imply "buggy" code? You're basically saying an unidiomatic approach is doomed to introduce bugs and will never reduce them.
It sounds weird. If I write Python code with minimal side effects like in Haskell, wouldn't it at least reduce the possibility of side-effect bugs even though it wasn't "Pythonic"?
AFAIK, nothing in the language standard mentions anything about "idiomatic" or "this is the only correct way to use X". The definition of "idiomatic X" is not as clear-cut and well-defined as you might think.
I agree there's a risk with an unidiomatic approach. Irresponsibly applying "cool new things" is a good way to destroy "readability" while gaining almost nothing.
Anyway, my point is that there's no single definition of "good" that covers everything, and "idiomatic" is just whatever convention a particular community is used to.
There's nothing wrong with applying an "unidiomatic" mindset like awareness of stack/heap alloc, CPU cache lines, SIMD, static/dynamic dispatch, etc in languages like Java, Python, or whatever.
There's nothing wrong either with borrowing ideas like (Haskell) functor, hierarchical namespaces, visibility modifiers, borrow checking, dynamic dispatch, etc in C.
Whether it's "good" or not is left as an exercise for the reader.
> Why does "unidiomatic" have to imply "buggy" code?
Because when you stray from idioms you're going off down unfamiliar paths. All languages have better support for specific idioms. Trying to pound a square peg into a round hole can work, but is unlikely to work well.
> You're basically saying an unidiomatic approach is doomed to introduce bugs and will never reduce them.
Well, yes. Who's going to reduce them? Where are you planning to find people who are used to code written in an unusual manner?
By definition alone, code is written for humans to read. If you're writing it in a way that's difficult for humans to read, then of course the bug level can only go up and not down.
> It sounds weird. If I write Python code with minimal side effects like in Haskell, wouldn't it at least reduce the possibility of side-effect bugs even though it wasn't "Pythonic"?
"Pythonic" does not mean the same thing as "Idiomatic code in Python".
Good C has minimal allocations because you, the human, are the memory allocator. It's up to your own meat brain to correctly track memory allocation and deallocation. Over the last half-century, C programmers have converged on some best practices to manage this more effectively: we statically allocate and kick allocations up the call chain as far as possible. Anything to get that bit of tracked state out of your head.
But we use different approaches for different languages because those languages are designed for that approach. You can do OOP in C and you can do manual memory management in C#. Most people don't because it's unnecessarily difficult to use languages in a way they aren't designed for. Plus when you re-invent a wheel like "classes" you will inevitably introduce a bug you wouldn't have if you'd used a language with proper support for that construct. You can use a hammer to pull out a screw, but you'd do a much better job if you used a screwdriver instead.
Programming languages are not all created equal and are absolutely not interchangeable. A language is much, much more than the text and grammar. The entire reason we have different languages is because we needed different ways to express certain classes of problems and constructs that go way beyond textual representation.
For example, in a strictly typed OOP language like C#, classes are hideously complex under the hood. Miles and miles of code to handle vtables, inheritance, polymorphism, virtual, abstract functions and fields. To implement this in C would require effort far beyond what any single programmer can produce in a reasonable time. Similarly, I'm sure one could force JavaScript to use a very strict typing and generics system like C#, but again the effort would be enormous and guaranteed to have many bugs.
We use different languages in different ways because they're different and work differently. You're asking why everyone twists their screwdrivers into screws instead of using the back end to pound a nail. Different tools, different uses.
A long time ago I was involved in building compilers. It was common that we solved this problem with obstacks, which are basically stacked heaps. I wonder whether one could not build more things like this, where freeing is a bit more best-effort but you have some checkpoints. (I guess one would rather need tree-like stacks.) Just have to disallow pointers going the wrong way. Allocation remains ugly in C, and I think explicit data structures are definitely a better way of handling it.
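Something like this can be sketched in a few lines (illustrative names, assuming a single contiguous region rather than a real obstack): you take a mark, allocate freely, and later roll everything back to the mark in one go.

#include <stddef.h>

struct stack_heap {
    char  *base;   /* backing region, allocated once elsewhere */
    size_t cap;
    size_t used;
};

void *sh_alloc(struct stack_heap *h, size_t n) {
    if (h->used + n > h->cap) return NULL;
    void *p = h->base + h->used;
    h->used += n;
    return p;
}

size_t sh_mark(const struct stack_heap *h)        { return h->used; }
void   sh_release(struct stack_heap *h, size_t m) { h->used = m; }   /* drop everything allocated since the mark */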
This shared memory and pointer shuffling is, of course, fraught: it requires correct logic to avoid memory-safety bugs. Good C code doesn't get you pwned, I'd argue.
This is not a serious argument because you don't really define good C code and how easy or practical it is to do.
The sentence works for every language. "Good <whatever language> code doesn't get you pwned"
But the question is whether "Average" or "Normal" C code gets you pwned? And the answer is yes, as told in the article.
The comment I was responding to suggested that Good C Code employs optimizations that, I opined, are more error-prone wrt memory safety, so I was not attempting to define it, but challenging the offered characterisation.
Agreed re: no need for heap allocation. For others, I recommend reading through the whole masscan source (https://github.com/robertdavidgraham/masscan), it's a pleasure btw. IIRC there are rather few/sparse malloc()s in the regular I/O processing flow (there are malloc()s that, depending on config etc., set up additional data structures, but those are part of setup).
Yes, you can do it with minimal allocations, provided that the source buffer is read-only, or is mutable but not used directly by the caller afterwards. If the buffer is mutable, any un-escaping can be done in place because the un-escaped string is always shorter. All the substrings you want are already in the source buffer; you just need a growable array of pointer/length pairs to know where each token starts.
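The pointer/length-pair idea is basically this (names are mine): tokens are just views into the source buffer, so nothing gets copied.

#include <stddef.h>

struct tok {
    const char *start;   /* points into the source buffer */
    size_t      len;
};

struct tok_list {
    struct tok *toks;    /* growable, e.g. doubled with realloc as needed */
    size_t      count;
    size_t      cap;
};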
Yes! The JSON library I wrote for the Zephyr RTOS does this. Say, for instance, you have the following struct:
struct SomeStruct {
    char *some_string;
    int some_number;
};
You would need to declare a descriptor, linking each field to how it's spelled in the JSON (e.g. the some_string member could be "some-string" in the JSON), the byte offset from the beginning of the struct where the field is (using the offsetof() macro), and the type.
The parser is then able to go through the JSON, and initialize the struct directly, as if you had reflection in the language. It'll validate the types as well. All this without having to allocate a node type, perform copies, or things like that.
This approach has its limitations, but it's pretty efficient -- and safe!
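Roughly, the descriptor idea looks like this (a hand-rolled approximation with invented names; the actual Zephyr json library has its own descriptor macros and parse entry point):

#include <stddef.h>

/* struct SomeStruct as declared above */
enum field_type { FIELD_STRING, FIELD_NUMBER };

struct field_descr {
    const char     *json_name;   /* how the field is spelled in the JSON */
    size_t          offset;      /* byte offset inside the struct, via offsetof() */
    enum field_type type;
};

static const struct field_descr some_struct_descr[] = {
    { "some-string", offsetof(struct SomeStruct, some_string), FIELD_STRING },
    { "some-number", offsetof(struct SomeStruct, some_number), FIELD_NUMBER },
};

/* the parser walks the JSON, matches each key against json_name, checks the
   type, and writes the value through (char *)obj + offset, which is about as
   close to reflection as C gets */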
It depends on what you intend to do with the parsed data, and where the input comes from and where the output is going. There are situations where allocations can be reduced or avoided, but that is not all of them. (In some cases, you do not need full parsing; e.g. to split an array, you can track whether you are inside a string and the nesting level, and then find the commas that lie outside of any array other than the first one, which is the one being split.) (If the input is in memory, then you can also consider whether you can modify that memory for parsing, which is sometimes suitable and sometimes not.)
However, for many applications, it will be better to use a binary format (or in some cases, a different text format) rather than JSON or XML.
(For the PostScript binary format, there is no escaping, and the structure does not need to be parsed and converted ahead of time; items in an array are consecutive and fixed size, and data it references (strings and other arrays) is given by an offset, so you can avoid most of the parsing. However, note that key/value lists in PostScript binary format is nonstandard (even though PostScript does have that type, it does not have a standard representation in the binary object format), and that PostScript has a better string type than JavaScript but a worse numeric type than JavaScript.)
Yes, you can first validate the buffer to know it contains valid JSON, and then work with pointers to the beginnings of individual syntactic parts of the JSON, with functions that decide what the type of the current element is, move to the next element, etc. Even string work (comparisons with other escaped or unescaped strings, etc.) can be done on escaped strings directly, without unescaping them into a buffer first.
Ergonomically, it's pretty much the same as parsing the JSON into some AST first, and then working on the AST. And it can be much faster than dumb parsers that use malloc for individual AST elements.
You can even do JSON path queries on top of this, without allocations.
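The shape of such an API might look like this (all names invented for illustration): everything is a pointer into the already-validated buffer, and nothing allocates.

enum json_type { JSON_OBJECT, JSON_ARRAY, JSON_STRING, JSON_NUMBER, JSON_BOOL, JSON_NULL };

enum json_type json_type_of(const char *p);               /* what kind of value starts at p? */
const char    *json_first_child(const char *p);           /* descend into an object/array */
const char    *json_next_sibling(const char *p);          /* skip over the current value */
int            json_string_equals(const char *p, const char *utf8);  /* compare without unescaping */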
Theoretically yes. Practically there is character escaping.
That kills any non-allocation dreams. The moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate. If the source is read-only, you have to allocate. If the source is mutable, you have to waste CPU rewriting the string.
I'm confused why this would be a problem. UTF-8 and UTF-16 (the two most common Unicode encodings) are at most 4 bytes per character (and usually just 1 or 2 for English text). The ASCII escape you gave is 6 bytes wide. I don't know of many ASCII escape representations that take fewer bytes than their native Unicode encoding.
Same goes for other characters such as \n, \0, \t, \r, etc.: each is half the size in its native byte representation.
It's just two pointers: the current place to write and the current place to read. Escapes are always more characters than they represent, so there's no danger of overwriting the read pointer. If you support compression this can become somewhat of an issue, but you simply support a max block size, which is usually defined by the compression algorithm anyway.
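A minimal sketch of that two-pointer approach (only a few escape kinds shown; \u handling omitted): the write cursor trails the read cursor, so the result shrinks in place and nothing is allocated.

#include <stddef.h>

size_t unescape_in_place(char *s, size_t len) {
    char *r = s, *w = s, *end = s + len;
    while (r < end) {
        if (*r == '\\' && r + 1 < end) {
            r++;
            switch (*r) {
            case 'n': *w++ = '\n'; break;
            case 't': *w++ = '\t'; break;
            case '"': *w++ = '"';  break;
            default:  *w++ = *r;   break;   /* pass unknown escapes through */
            }
            r++;
        } else {
            *w++ = *r++;                    /* plain character, copy as-is */
        }
    }
    return (size_t)(w - s);                 /* new, shorter length */
}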
You have a read buffer and somewhere you have to write to.
Even if we pretend that the read buffer is not allocating (plausible), you will have to allocate for the write side in the general case (think GiB or TiB of XML or JSON).
Not if you are doing buffered reads, where you replace slow file access with fast memory access. This buffer is cleared every X bytes processed.
Writing to it would be pointless because clears obliterate anything written; or inefficient because you are somehow offsetting clears, which would sabotage the buffered reading performance gains.
I thought we were talking about high performance parsing. Of which buffered reads are one. Other is loading entire document into mutable memory, which also has limitations.
It is conceivable to deal with escaping in-place, and thus remain zero-alloc. It's hideous to think about, but I'll bet someone has done it. Dreams are powerful things.
> These lies don’t just affect them but also the people reading it as they might never see what actually happens
This is what sustains this whole economic bubble built on debt and future promises. At all levels of society, you have these inflated unrealistic expectations and BS circulating in the media. Technically-incompetent but eloquent and charismatic CEOs predict that in 6 months, some major technological shift will happen. Managers preach about adjusting their organisations to these new realities. Workers have no choice but to play the game with all its dirty tricks, if they want to stay employed. Anyone who dares to say that the emperor has no clothes is isolated in a dark corner because they may suddenly deflate the value of the whole economy. This is corporate feudalism disguised as a competitive economy.
They were popular because there was no Unix culture in Eastern Europe at the time. Pretty much any computer geek was a DOS user. To me personally, it always seemed kind of lame because many of these people would not bother to properly learn the shell language.
Their English is sufficiently good. It's a cultural aspect regarding writing style. When Russians and most Eastern Europeans write about technical subjects, they tend to be concise, dense and straightforward. Americans, on the other hand, are over-expressive and tend to saturate their writing with pointless metaphors and rhetorical devices.
Rust encourages a rather different "high-level" programming style that doesn't suit the domains where C excels. Pattern matching, traits, annotations, generics and functional idioms make the language verbose and semantically complex. When you follow its best practices, the code ends up more complex than it really needs to be.
C is a different kind of animal that encourages terseness and economy of expression. When you know what you are doing with C pointers, the compiler just doesn't get in the way.
> Pattern matching should make the language less verbose, not more.
In the most basic cases, yes. It can be used as a more polished switch statement.
It's the whole paradigm of defining an ad-hoc enum here and there, encoding rigid semantic assumptions about a function's behaviour with ADTs, and using pattern matching for control flow. This feels like a very academic approach, and modifying such code to alter its opinionated assumptions isn't fun.