This is such a technopurist take. People who use LLMs already know they can give wrong information. Your documentation won't be able to cover every single contextual scenario that an LLM can help with. I think there are valid reasons not to allow OpenAI to spider you, but this one is just really silly and feels pretty egotistical. People aren't going to go to this guy saying "well, OpenAI said your software works this way and it doesn't." It's an entirely contrived scenario that doesn't exist in reality.
> People who use LLM’s already know they can give wrong information
I think this is unfortunately much less true than expected... Lawyers using ChatGPT... teachers using ChatGPT... even professors using ChatGPT... as if it's a source of truth.
There have been a few instances, sure, and they made headlines, but that was pretty early on when LLM behavior was not well understood. I think that fake citations (as the most obvious and well documented example) are a well understood problem now, and if you google “ChatGPT fake citation” you only get a few articles mostly referencing the same couple of cases from months ago. It doesn’t seem pervasive at all.
Anecdotal, but every time I tell someone that the citations from ChatGPT can be bogus, they are very surprised. They know that the answers can be incorrect, but they don't understand the process behind an LLM well enough to understand that a citation can be generated in the same way the rest of the text is.
My CTO, oh so very happy, showed me that he had translated something I wrote in French into English to send to a foreign corporation. I read it, and the first word was wrong. Most of the rest had more or less the same meaning, but not that first word.
I argued that ChatGPT is dangerous because he was going to send an incorrect document because of it (he had not sent it yet), but he straight up _refused_ to admit the word was wrong, saying "the meaning is almost the same!" Well, it was not…
So yeah, some people are not aware that ChatGPT can be wrong/dangerous, and some people are worse: they refuse to believe/listen to actual people and prefer a robot.
It sounds like he was excited about using some new tech and then was upset when you blithely shot him down.
Would it have been more emotionally mature of him to put that aside and listen to your criticisms? Yes, of course. But you probably could have saved some trouble and conflict by sharing in his joy a little before helping him understand the pitfalls and issues.
They're only not doing that because my software is not common yet. But look at the GitHub issues for any semi-famous project, and you'll see a lot of questions rooted in misunderstandings, and that's before LLMs poisoned everything.
> But look at GitHub issues for any semi-famous project, and you'll see a lot of questions about misunderstandings
This usually happens because people don't read the documentation to understand why something isn't working in the first place, or because the documentation is not clear.
If anything, an LLM makes this sort of stuff more accessible.
Anecdotally, I find using something like ChatGPT to rubber duck engineering problems with various libraries to be much more enjoyable and useful than going to Stack Overflow or mucking through overly verbose (or not verbose enough) docs.
For the last two weeks my little webserver has been getting 200+ hits a day from bots with the user agent anthropic-ai. At first it was what you'd expect, mirroring all the PDFs and such. But for the last week it's been just /robots.txt, 200+ times per day, from amazon-ec2 addresses, so I have no way of knowing if it's actually anthropic-ai.
I was happy that they'd be including documents on topics I found interesting and things I wrote in the word adjacency training of their foundational model. That'd mean the model would be more useful to me. But the robots.txt stuff is weird. Maybe it's because I've had,
You should take down the documentation entirely, if you want to prevent incorrect interpretations of things. The LLMs won't be the ones emailing you; the people who would get things wrong if the LLM provided some kind of confident wrong answer would probably simply not read your documentation, as the vast majority of users do not. You're just shifting some, but not all, misunderstandings into totally uninformed questions that will mean an additional email pointing them to RTFM.
All of these "we're not letting bots crawl our site!" posts make me feel like I've travelled back in time to when having web spiders crawl your site was a big deal. You can't really prevent people from using tools wrong, and it is odd that so many people care about this futile attempt to insulate yourself from stupid users that it made it to the front page of HN.
The worst part is, if an LLM has already read in your docs and the interaction you fear your users having with LLMs comes to pass, they will have misapprehensions based on the old version of your docs, which will be even more wrong.
Allow me to prepare you for the future now, before you have to hear it from someone else: you will be getting email spam about LLM Algorithm Optimization soon. LLMAO firms are probably already organizing around the bend of time; we're just a little before they become visible.
I agree that LLMs are almost more likely than not to answer documentation questions wrong, to hallucinate methods that don’t exist, or just be silly. But the value I see in allowing LLMs to train on documentation is in the glue code that an LLM could (potentially!) generate.
Documentation, even good documentation, usually only answers the question "What does this method/class/general idea do?" Really good docs will come with some examples of connecting A and B. But they will often not include examples of connecting A to E when you have to transform via P because of business requirements, and they almost never tell you how to incorporate third-party libraries X, Y, and Z.
As an engineer, I can read the docs and figure out the bits, but having an LLM suggest some of the intermediary or glue steps, even if wrong sometimes, is a benefit I don’t get only from good documentation.
unpopular opinion: llm responses being wrong is still valuable to me, since it gives me a better jumping-off point for exploring than nothing at all. especially with something like coding, where wrong answers are easily caught by something not compiling/not working as intended. could be harmful in other areas tho.
yeah, if the LLM gives me two truths that are beyond the documentation, like an edge case or maybe an example described in a way that's easier for me to grok, and one false thing, usually the false thing is so bad I can tell it's false, or it's merely truthy but the value from the two truths exceeds the negative value of the falsehood.
Generally speaking, though, you can also cut back on hallucination by asking a second LLM for a source, or by using good retrospection and adding system messages telling it that if it doesn't know an answer it should say so and not make one up (rough sketch of that below).
Really, I think hallucination is the wrong word; bullshitting or gaslighting might be better. You're asking it something and it thinks you want an answer, any answer, so if it doesn't know, it makes one up. Similar to people who confess to crimes they didn't commit because of distressing interrogation tactics.
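Something like this is what I mean by a system message (just a sketch of the idea, not a recommendation: the model name is assumed, "FooLib" is a made-up library, and this reduces but does not eliminate made-up answers):

    # Sketch: nudge the model to admit ignorance instead of guessing.
    # Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    SYSTEM = (
        "You answer questions about the FooLib documentation. "
        "If you are not certain of an answer, say 'I don't know' instead of guessing. "
        "Never invent function names, flags, or citations."
    )

    def ask(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",  # assumed model name
            temperature=0,  # lower temperature tends to cut down on invented details
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content

    print(ask("Does FooLib have a frobnicate() helper?"))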
You can't realistically cover every use case. With an LLM you can say something like "Make me a program that sets up an SDL window with the title ABCD that has a 396x224 RGB565 framebuffer and moves around a red square using the WASD keys by using a loop to fill in pixels in that framebuffer and then quits when it reaches the right edge of the screen", and it has a reasonable chance of making something that works, or at least something that is easy to adjust into something that does. Just because it sometimes might not work the first time isn't a good reason to try to stop people from using it entirely.
It does become more wrong, yes, but blocking it isn't going to help it get any better. The idea that everything an LLM does can be replaced by documentation isn't true.
How is it false? I'd say an LLM is like the output you'd get if you forced someone to write something with a strict time limit and without being allowed to go back and edit things or look anything up: likely to be wrong about anything that needs deep thought, but not entirely useless for simple things that are just tedious, like boilerplate code.
> Despite the volume of documentation, my documentation would still be just a tiny blip in the amount of information in the LLM, and it will still pull in information from elsewhere to answer questions.
I sympathise. I've recently discovered that apparently I have enough Internet clout that ChatGPT knows about me. As in I can carefully construct a prompt and it will unmistakably reference me in particular. Don't even need to provide my name in the prompt.
Except, every fucking detail of what it "knows" about me is 100% false, and there's nothing I can do to correct it. It has me from the wrong country, says I did things in my career that I absolutely didn't, etc.
I understand that some people don’t want their work to train AI. Personally I like that the work I publish is not completely useless as it is at least used to train LLMs.
The guy who posted about blocking OpenAI so it will not answer questions about his software wrongly (meaning it will not answer them at all) ignores that his documentation is inaccessible to many less technically literate people. LLM AIs help bridge the gap to get newbies using software before they can understand the manuals.
When I entered college, my first Pascal course was on an SVR3 Unix system, and I read every manpage that I could find, because it was fantastic that I had access to that. Previously, I had read every shred of documentation for the Commodore 65xx systems, which generously included every technical detail possible. I mean, I had basically started on this in fifth grade. Reading manuals is how I gained my technical literacy.
IIRC LLMs also use Common Crawl data for training. Are they also blocking Common Crawl?
Another thing is that GPT-4 can do live retrieval of websites in response to users' questions. That is a different crawler doing that, I imagine. Are they going to block that too?
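(For what it's worth, those crawlers do publish separate user-agent tokens, so in principle you can address each of them in robots.txt, something like the snippet below, assuming the tokens haven't changed: GPTBot is OpenAI's training crawler, ChatGPT-User is the live browsing fetcher, and CCBot is Common Crawl.)

    # Opt out of OpenAI's training crawler
    User-agent: GPTBot
    Disallow: /

    # Opt out of ChatGPT's live browsing/retrieval
    User-agent: ChatGPT-User
    Disallow: /

    # Opt out of Common Crawl, whose dumps feed many training sets
    User-agent: CCBot
    Disallow: /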
This. Unfortunately, there is Common Crawl, there is Bing, and there are a million other ways they could get (or hide) the data. Or they could just ignore robots.txt; it's not like it's a very honest or transparent operation they run there.
I bet information about his software is around elsewhere, and now ChatGPT will make up even more. I don't know how this is fixed. Structured queryable data, I guess.
You are correct, but if I demonstrate that I have done what I could to deny OpenAI access, and they still have it in their model, then I probably have more legal recourse against them.
But what legal recourse? ChatGPT could be considered a search engine, and technically scraping public-facing sites without a login is perfectly allowed and legal. The best you'd be able to do is a DMCA request, though I'm not sure how they'd comply with that. I've seen DMCA requests in Google when someone's work is being offered for free without their permission. I'm guessing this would be the same sort of situation.
I wonder if they can selectively block or remove specific content from the LLM. Personally I think it's a fool's errand to even try.
AI chat is the new interface to search; I use AI-powered search engines for 90 percent of my searches. Sometimes I still go to the source website, so there's still a chance of search engines bringing sites revenue.
Personally, I think there should be a way for them to reward sites in a Medium-like program, where views or uses as a resource count as points toward a share of the month's revenue or something.
I don't think you have any legal recourse. Even if training isn't considered fair use (which it seems to be), I don't think you can copyright knowledge of how to use a library or piece of software (maybe the specific way you write the documentation is copyrightable, but the model will infer the knowledge from other sources instead).
> But here’s the problem: it will answer them wrong.
There is no way to know that, and even if it ends up being true, blocking OpenAI will likely make the problem worse, e.g. the AI answers will be worse without access to the documentation.
But if people come to me with problems, I can give them a link to that post and say, "GPTx does not know my stuff. You will want to read the docs yourself."
Most people with any sense of logic will likely try ChatGPT's answer; if it's wrong, they'll go to the docs and try to see why, or they'll tell ChatGPT it's wrong, give it a link to the docs to clarify things, and ask why, if they're using something like phind.com. Plus, just because OpenAI doesn't index it doesn't mean you can't use LangChain to scrape the site and ingest the data (rough sketch of the idea below). I'm pretty sure this is how Phind works when I reference a specific page.
For example, it gave me really wrong info when I asked about the latest version of Next.js. I asked it to double-check on their website at the URL, it said sorry, here's the correct info, and all was good. I've never gotten wrong answers I couldn't have it fix, assuming it has Internet access.
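Here's roughly what I mean by scraping and ingesting, minus the LangChain machinery. This is just a sketch: the URL and model name are placeholders, and a real setup would chunk the pages and put them in a vector store instead of stuffing one page into the prompt.

    # Sketch: fetch a docs page and hand it to the model as context.
    # Assumes the requests, beautifulsoup4, and openai packages,
    # plus OPENAI_API_KEY in the environment.
    import requests
    from bs4 import BeautifulSoup
    from openai import OpenAI

    client = OpenAI()

    def ask_with_docs(question: str, docs_url: str) -> str:
        html = requests.get(docs_url, timeout=30).text
        # Crude truncation so the excerpt fits in the context window.
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:12000]
        resp = client.chat.completions.create(
            model="gpt-4",  # assumed model name
            messages=[
                {"role": "system", "content": "Answer only from the documentation excerpt provided. If it does not cover the question, say so."},
                {"role": "user", "content": f"Documentation excerpt:\n{text}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    print(ask_with_docs("What changed in the latest release?", "https://example.com/docs"))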
I could, but with so many people trusting GPTx too much, I suspect that they would not come to me with questions but with bug reports, and they would insist that my software has the bug even if it does not.
If I can say that GPTx just doesn't know my stuff, it will be easier to tell them to go away.
I don’t see why you can’t just tell them that GPTx makes things up sometimes, and still tell them to go away (not literally because that would be rude)
Why do you think that? I'm sorry if I came across as rude or something, but I genuinely disagree with your point. I don't think trying to explain why I think this is a bad idea makes me a troll. I'm not alone either, so do you think everyone except you is a troll?
Just a thought I have: wouldn't it be better to block all robots and only whitelist a select few? More AI bots are scraping now, and there will be more in the future…
Seconding. robots.txt is just a way of "asking nicely." If somebody really wants to scrape, they'll spoof their UA and ignore it. You can't do anything other than monitor the logs and ban IPs one by one.
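Still, for the well-behaved crawlers, a whitelist-style robots.txt is easy enough. Something like the sketch below: the allowed agents are just an example, everything else is told to stay out, and, as noted above, it's only a polite request.

    # Allow a couple of chosen crawlers everything
    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    # Everyone else: nothing
    User-agent: *
    Disallow: /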