I think there are two main reasons you tend to be a slow thinker:
1. You're not focused enough. (You pay too much attention to what people think about what you're going to do, instead of focusing on the question itself.)
2. You haven't practiced enough. (This is the first time you've encountered this kind of problem.)
Why bother using human language to communicate with a computer? You interact with a computer using a programming language—code—which is more precise and effective.
Specifically:
→ In 1.0, you communicate with computers using compiled code.
→ In 2.0, you communicate with compilers using high-level programming languages.
→ In 3.0, you interact with LLMs using prompts, which arguably should not be in natural human language.
Nonetheless, you should communicate with AGIs using human language, just as you would with other human beings.
Why bother using higher-level programming languages to communicate with a computer? You interact with a computer using assembly - raw bit shifting and memory addresses - which is more precise and effective.
Using assembly is not really more precise in terms of solving the problem. You can definitely make an argument that using a higher-level language is equally if not more precise. Especially since your low-level assembly is limited in which architectures it can run on, you can argue that the C++ that generates that assembly "more precisely defines a calculator program".
I agree with your general point, but C++ isn't a great example, as it is so underspecified. Imagine as part of our calculator we wrote the function:
int add(int a, int b) {
return a + b;
}
What is the result of add(32767, 1)? The width of int is itself implementation-defined (it may be as narrow as 16 bits), and when the sum overflows, C++ does not presume to define just one meaning for such an expression. Or even any meaning at all: signed overflow is undefined behavior, so what to do when the program tries to add ints that large is left to the personal conscience of compiler authors.
Precision is not boolean (present or absent, 0 or 1); there are many numbers between 0 and 1. Compared to human languages, programming languages are much more precise, which makes the results much more predictable in practice.
I can imagine an OS being written in C++ and working most of the time. I don't think you can replace Linux, written in C, with any number of LLM prompts.
An LLM can be a [bad, so far] programmer, but a prompt is not a program.
Using code may not be more precise than English in terms of solving a problem. Take the NHS. With better AI, saying "build a good IT system for the NHS" might have worked better than this stuff: https://www.theguardian.com/society/2013/sep/18/nhs-records-...
You can express dang near anything you wish to express in assembly in a higher-level programming language, because it is designed to allow that level of clarity and specificity. In fact, most have compile-time checks to stop you if you have not properly specified certain behavior.
The English language is not comparable. It is a language designed to capture all the ambiguity of human thought, and as such is not appropriate for computation.
TLDR: There's a reason why programmers still exist after the dawn of 4GL / 'no code' frameworks. Otherwise we'd all be product managers typing specs into JIRA and getting fully formed applications out the other side.
That’s what this is. It’s caching the state of the model after the prompt tokens have been processed. Reduces latency and cost dramatically. The cache usually has a ~5 minute TTL.
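As a concrete sketch (assuming an Anthropic-style API, which is where the ~5 minute TTL figure comes from; the model name and the LONG_SHARED_CONTEXT placeholder are mine, not anything from this thread), you mark the long shared prefix as cacheable and reuse it across calls:

import anthropic

# Illustrative only: stand-in for a large system prompt / document dump.
LONG_SHARED_CONTEXT = "...many thousands of tokens of reference material..."

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SHARED_CONTEXT,
            # Ask the provider to cache the processed state of this prefix
            # for a few minutes, so repeat calls don't re-process it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "A question about the shared context."}],
)
print(response.content[0].text)

Calls within the TTL that repeat the exact same prefix hit the cache, which is where the latency and cost savings come from.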
Interesting! I’m wondering, does caching the model state mean the tokens are no longer directly visible to the model? i.e. if you asked it to print out the input tokens perfectly (assuming there’s no security layer blocking this, and assuming it has no ‘tool’ available to pull in the input tokens), could it do it?
The model state encodes the past tokens (in some lossy way that the model has chosen for itself). You can ask it to try, and assuming its attention is well-trained, it will probably do a pretty good job. Being able to refer to what is in its context window is an important part of being able to predict the next token, after all.
There's no difference between feeding an LLM a prompt and feeding it half the prompt, saving the state, restoring the state, and feeding it the other half of the prompt.
I.e., the data processed by the LLM is prompt P.
P can be composed of any number of segments.
Any number of segments can be cached, as long as all preceding segments are cached.
The final input is P, regardless.
So; tl;dr: yes. Anything you can do with a prompt you can do, because it's just a prompt.
When the prompt is processed, there is an internal key-value cache that gets updated with each token processed, and is ultimately used for inference of the new token. If you process the prompt first and then dump that internal cache, you can effectively resume prompt processing (and thus inference) from that point more or less for free.
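Here's a minimal sketch of that with Hugging Face transformers (gpt2 and the example strings are arbitrary choices of mine; past_key_values is the internal key-value cache being described):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

first = tok("The quick brown fox", return_tensors="pt")
second = tok(" jumps over the lazy", return_tensors="pt")

with torch.no_grad():
    # Process the first half of the prompt once and keep its key-value cache.
    out1 = model(**first, use_cache=True)
    cache = out1.past_key_values  # the state you could dump and later restore

    # Resume from the cache: only the second half's tokens are processed now.
    attn = torch.cat([first.attention_mask, second.attention_mask], dim=-1)
    out2 = model(second.input_ids, attention_mask=attn,
                 past_key_values=cache, use_cache=True)

    # Compare against processing the whole prompt in one go.
    full = tok("The quick brown fox jumps over the lazy", return_tensors="pt")
    out_full = model(**full, use_cache=True)

# The next-token logits should match up to floating-point noise, i.e.
# split-save-restore is equivalent to feeding the full prompt at once.
print(torch.allclose(out2.logits[:, -1], out_full.logits[:, -1], atol=1e-4))

Provider-side prompt caching is the same trick, just with that state held on the provider's servers and keyed by the cached prefix.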
Depends on what front end you use. But in text-generation-webui, for example, Prompt Caching is simply a checkbox under the Model tab that you can select before you click "load model".