When ChatGPT first came out, I was surprised at the depth of information it had about region-specific market sizing in our (relatively niche) industry.
Turns out the model had ingested a pay-to-read article that cost upwards of $2000 [0] and was quoting the figures in it and referencing it directly (i.e. attributing the info to the article in question).
I knew about the paper but had never purchased it. I was actually surprised I could access the data through ChatGPT.
A few days later the same information was gone. I assumed someone might have decided to keep it on the down-low about having ingested all of these restricted-access sources. I now get a generic blurb in response to the same question.
The opacity of it all made me a little worried. What similar models exist, trained on actually private information, for more nefarious purposes?
Unless you actually checked the article, ChatGPT likely hallucinated the figures and picked a plausible-sounding source. Most likely, nothing changed when you tried a few days later. The responses are stochastic due to the temperature setting. If anything did change, it was probably an update to reduce hallucination.
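The stochastic part is easy to picture: at each step the model samples from a temperature-scaled softmax over token scores, so the same prompt can yield different answers run to run. A minimal sketch (the logit values here are made up purely for illustration):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random.Random(0)):
    """Sample a token index from logits after temperature scaling.

    Higher temperature flattens the distribution (more varied output);
    temperature near 0 approaches greedy, near-deterministic decoding.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return idx, probs

# Same logits, different temperatures: at T=0.1 the top token dominates,
# at T=2.0 the alternatives get real probability mass.
logits = [2.0, 1.0, 0.5]
_, sharp = sample_with_temperature(logits, 0.1)
_, flat = sample_with_temperature(logits, 2.0)
```

This is why asking twice can produce a confident, sourced-sounding answer one day and a generic blurb the next, with no retraining involved.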
I never saw the actual article, but the quoted figures were not just plausible and reasonable but also internally consistent throughout several exchanges. It produced exact figures for key (i.e. largest) submarkets and growth estimates for smaller ones, in the same way that I would expect the original content to be structured.
I would bet with a 95%+ confidence that it parroted the actual contents. $2000 for that extra 5% tho.
Yeah, it was probably just a random PDF that Common Crawl got to.
Plenty of people post that kind of stuff on accidentally indexed sites. The risk though is even if it remembers the right format, there’s no guarantee it remembers the right numbers. So caveat emptor.
I don't know what you mean. ChatGPT's reply said something along the lines of "according to study ABC done in 2020 by research firm XYZ...". It's not some theory I concocted.
> A few days later the same information was gone. I assumed someone might have decided to keep it on the down-low about having ingested all of these restricted-access sources. I now get a generic blurb in response to the same question.
This didn't happen. There is no way to remove information from the model and retraining it costs millions of dollars.
There's likely a smaller side model that just prints the "umm sorry I can't do that" message. And they did retrain it at least once - that's why it got so much faster and why the model chooser has "default" and "legacy" GPT-3.5.
Notice that if you hit regenerate it often actually answers the question though.
You can imagine a chain of GPTs: a gatekeeper ChatGPT monitors your instruction and refuses to pass it to the regular ChatGPT. If your prompt does get passed in and generates offensive content, a monitor ChatGPT checks whether the output said something bad, and if so refuses to return the result and apologizes.
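That two-gate idea can be sketched in a few lines. Everything here is a stub: `classify_prompt`, `generate`, and `classify_output` stand in for separate model calls, and none of these names are real OpenAI APIs.

```python
REFUSAL = "Sorry, I can't help with that."

def classify_prompt(prompt: str) -> bool:
    # Stand-in for a moderation model scoring the user's instruction.
    return "forbidden" in prompt.lower()

def generate(prompt: str) -> str:
    # Stand-in for the main model producing a draft answer.
    return f"Answer to: {prompt}"

def classify_output(text: str) -> bool:
    # Stand-in for a second moderation pass over the draft answer.
    return "offensive" in text.lower()

def moderated_chat(prompt: str) -> str:
    if classify_prompt(prompt):   # gate 1: refuse before generation
        return REFUSAL
    draft = generate(prompt)
    if classify_output(draft):    # gate 2: suppress a bad draft
        return REFUSAL
    return draft
```

Since the real generation step is sampled, a draft that trips the second gate on one attempt might pass on another, which would explain why hitting regenerate sometimes gets an answer through.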
I have no clear idea about the legality of AIs, but it has definitely incited violence in several cases. If a person did that, it would be against the law even in the US.
I was hoping this article would be about that. It's not. It's a waste of time spent pearl clutching about how the authors find some small percentage of the internet/Common Crawl "troubling." I don't know what they expected.
If you want a model to, say, never draw pornography or write a Hitler speech, you don't do that by excluding "bad" content from the foundation model, but rather you tackle it in later stages of training, particularly the fine-tuning phase.
Having some of that in the foundation model actually helps the model "understand" what it is not supposed to do. The whole point of the foundation model is that it generalizes from the examples you show it later to similar examples it saw during "pre-training", and a certain amount of offensive content will help it learn what is offensive much more quickly later.
> The opacity of it all made me a little worried. What similar models exist, trained on actually private information, for more nefarious purposes?
What do you mean by 'nefarious purposes'? If you have illicit access to private data, chatGPT won't reveal anything that isn't already in the data. If your access to private information is legitimate, anything you do with an LLM trained with the information will have the same moral/legal consequences as just using the information directly.
This is what Microsoft is trying with its Office 365 GPT approach, where it siphons training data from every company and offers graphs and summaries in return.
One of my friends was trying to figure out "how" ChatGPT knew about something and tried an approach I hadn't thought of: just ask it "do you know paper xyz by author abc?" And in fact it did.
I once asked ChatGPT if it knew about an esoteric programming language. It said "yes!" and then proceeded to quote information and code snippets for Python.
I am shocked that Discord is not on this list. I would bet my mortgage that Discord would sell their chats as a dataset to train on if the price was right.
I often get downvoted for suggesting this, but why not simply host all your chats, videoconferencing etc. on your own website? Make it a gated community and only post little teasers to the likes of Instagram and YouTube and TikTok and Twitter and Facebook. Let them come to YOUR site to participate and abide by YOUR terms. Why make a Facebook Group or a Discord server?
Why do so many of us, including the biggest influencers in the world, spend so much money only to post all of our content on the Big Tech’s sites and infrastructure? You give all your social capital to them too, and they turn around and sell it back to you (access to your own followers) and can deplatform you. They encourage you to pay them to attract eyeballs away from other videos, and vice versa, in a zero-sum game on their own platform.
They now increasingly even control our conversations and our democracy. Why not have these conversations on our own forums, powered by our own software, the way we do on HN?
We gave away all our social capital and our content to these platforms. And then they turn around, train their models on it internally, and scrape it for their AIs.
The intersection of people who can host it themselves and those who host chats, videoconferences, etc. is very, very small. Once they start trying to figure out SSH host keys and TLS firewalls, they will very quickly become overwhelmed and out of their depth. They would likely turn to an expert at that point, but it has become far cheaper and easier to just turn to a third-party service to handle this - the very service you're asking them to dismiss in favor of the trouble of hosting it themselves.
On the other end of the spectrum, I have dedicated systems for hosting my own chats, emails, source code, and more. My user count is super low. It seems that not everyone wants to come to my copy-cat Gitea server when "all the cool kids" are on GitHub.
Forget the SSH keys and TLS firewalls. What open source software would you actually use, to start to match Facebook Groups or Discord Forums in features?
Because they act as a portal and distribution service. The same reason freelance journalists don’t print their own newspapers but instead publish via either a syndicate or directly through papers. The paper has the brand, distribution, ability to monetize and pay journalists for their content, layout staff, etc. Similarly say YouTube is YouTube in the same way the NYT is the NYT - people go to YouTube for content, it provides hosting and distribution as well as ways to promote your content, monetize it, they provide a UX and technical infrastructure, etc.
There is nothing in any RFC or other technical standard stopping anyone from hosting their own content on equal footing with YouTube. People do host stuff on their own machines and sites. You’re just not directly aware of it because it’s basically impossible to find relative to the ease of content discovery on YouTube. No one is getting rich, or even paying their bills, on self-hosted content, because beyond having no funnel, monetization of content is actually kinda hard.
Case in point, politics aside, Parler and Truth Social were basically formed based off similar thoughts. Why do we let these big companies make content moderation policies? I’d like a place where I can be abusive and racist or whatever without fear of censorship. I’m after all entitled to hold those views and say them. But, it turns out, it’s hard to provide the quality of service of say Facebook on the premise of absolute freedom - especially when it’s so toxic most people are turned off. This gets to the point: you can host yourself, but if you want to build a “free YouTube” you probably need to ask why YouTube, a for-profit entity, took the positions it has in the market. Because the market rewards them for making a family-friendly middle-ground platform that provides easy access to personalized content and a simple experience for content creators to distribute and monetize their content. Do that on your own systems, and you’ll probably do alright - except one thing. They have the most users. Acquiring users, or in newspapers, subscribers, is not easy or cheap.
So, that’s why things are the way they are. But you’re absolutely able to self host. Just don’t expect everyone to show up at your random website.
> Why do so many of us, including the biggest influencers in the world, spend so much money only to post all of our content on the Big Tech’s sites and infrastructure? You give all your social capital to them too, and they turn around and sell it back to you (access to your own followers) and can deplatform you. They encourage you to pay them to attract eyeballs away from other videos, and vice versa, in a zero-sum game on their own platform.
Because we're lazy and Big Tech solves the immediate, short term problem.
Glad someone mentioned this!!! It is one good reason to self-host! And only send sneak peeks to social media sites. Stallman has a similar recommendation about what to post to FB.
Only use social media as a notification/publishing/advertising system, post important stuff on your own site.
Of course, I should clarify: it is his recommendation for organisations that need a Facebook presence, as many small businesses or event organizers struggle if they don't have a FB/Instagram presence.
Because we're not content with just a private little soapbox in our private little space talking to the few hipsters we can convince to drop by; we are vain and want to talk to the crowds, and the crowds are all over /there/.
Because the average person and the average HN user are different sets of people. The typical person who wants to start a forum on, say, trout fishing or stamp collecting has only a vague notion of what a server is, much less how to set one up and maintain it. Even if they did, they'd have little or no interest in doing those things. It's far easier for them to just make a Facebook group or whatever.
Self-hosting with channels, voice chats and permissions is harder. Discord also has one unified account for joining different channels, private messaging, etc.
There are a few other reasons - Midjourney uses Discord because it's a free CDN, but also because Discord takes care of the legal stuff (CSAM and copyright notices).
As it happens, I am in the process of building something pretty close to what you describe. Here is an extremely partial list of reasons people don't do it.
- Written content: because network effects are real. Also because building a blog in 2023 that doesn't have awful look-and-feel or seem like a spam site is remarkably hard. Substack is full of chuds, but you can firewall off from them most of the time and at least you can write things instead of writing code to write things. WordPress is maybe the least bad option today if you want interactivity (which I would assume you do, what with the "community" bit and all), and getting things like SSO (for playing nice with all those other things in that gated community) costs money for dubiously-supported plugins that claim to do things they might not actually do. Ghost has a ton of weird limitations that are often claimed to be by design, so unless you want a plain white (or, if you're bold, black) site that's Just Text and maybe a few images, it's not great either. Static sites--community, no bueno.
- Podcasts: this one is not "because network effects are real", this one's just "because open-source podcast hosting software basically doesn't exist and what does is bad at doing the job." Meanwhile, anchor.fm is free.
- Chats: because network effects are real. And because the options are mostly unpleasant. IRC is difficult and mobile-unfriendly. Mattermost is buggy. Zulip is pretty good, but now you're a sysadmin and most people on Discord just want to shitpost about Pokemon. (I am not doing this, either, because there are limited hours in the day.)
- Videoconferencing: because network effects are real. (You may be detecting a theme here!) And because self-hosting Jitsi is an enormous pain in the rear. And, again, now you're a sysadmin, and most people on Discord just want to shitpost about Pokemon. (Might do this one, just for latency's sake when recording stuff. Haven't decided yet. It is hard. It is also pretty expensive to run.)
- Video content: because network effects are real. Also because I don't own a CDN, and because YouTube is $free dollars per minute watched. Even my former employer, whose stack I quite like, is not $free dollars per minute watched--nor should they be! They're targeting businesses. (Needless to say: definitely not doing this one.)
Gated communities are great, but somebody needs to make achievable these things for normal people, and they're probably going to want money for it. And while I've written plenty of open source software, the stuff I'm building won't really be portable to somebody else because I'm not making decisions for others' consumption. I'm making decisions for exactly my (relatively high) tolerance for system administration and exactly my (relatively high) ability to fix things when they break. I'm going to open-source parts of it, but I can only open-source parts, and they'll still depend on my CMS of choice, etc. etc. unless I spend time not building my thing and instead building abstraction/indirection layers for hypothetical other people.
If you want people to do similar, maybe it should be you who goes and builds the path to doing it?
"Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set."
There is copying for personal or academic use, and then there is copying for commercial purposes. What if the latter is done without permission from the copyright owner? The incessant promotion of so-called "AI" sure seems commercial in nature. "Tech" companies rarely spam us so incessantly without some desperate scheme to make money, for example data collection about computer users, persistent surveillance, and annoying, inefficient online advertising.
As the article acknowledges, the training data that goes into the chatbots people actually use in 2023 -- ChatGPT (both the GPT-3.5 and GPT-4 flavors) -- is probably very different from C4, which was released 4 years ago. GPT-4's training data is probably a similarly big chunk of the public Web plus a bunch of licensed content. But we can't know for sure, because OpenAI has chosen not to disclose any information about the sources they take content from to build their service.
When asking ChatGPT about programming questions, it feels pretty similar to asking questions on stack overflow, except you get answers instantly. I assume that Q&A sites must be among the most important training data.
Also gists and GitHub issues, they would’ve had near perfect data sets of everything they had permission to touch at GitHub.
I saw something on Twitter, where someone was offered a job at OpenAI to write code for common coding problems and provide very verbose reasoning and comments. Can’t find it now..
It's disheartening to see AI companies profiting from free content I've created online over years and years. StackOverflow, forums, Github, even my personal crappy website is there.
> more than 72,000 instances of “swastika,” one of the banned terms from the list.
Isn't it common knowledge that the swastika is a common sacred symbol in Eastern, and especially Dharmic, religions? This, along with most of the religious websites being Christian, Islamic and so on, is quite concerning.
> 1 And, of course, put me in a team that uses Blub, and I’ll pick up Blub in a heartbeat. Except php. I tried to give php an honest chance recently (“It has changed”, they said, “It is much better with modern practices”) but it was painful all the way through, even when I tried to do everything right.
With this level of the author's expertise, you can throw this article right in the garbage bin.
It states that if an LLM is good at the LSAT, it must have trained on many practice LSAT tests. How do we know this? The test measures things like reading comprehension and logical reasoning. LLMs have reasoning ability with novel material, and there are plenty of ways to train logical reasoning outside of LSAT test prep materials.
Whoa, they claimed DeviantArt is a text-to-image generator being sued by artists? Is that really a thing? Last I heard, DeviantArt was really against it.
This article is so misleading. First there’s nothing secret about all this. Also these chat bots don’t just “sound” smart, they are providing value to millions of users.
OK, but the article itself is (or at least may be?) more interesting than the misleading title. Let's focus on that instead. We've edited the title now.
From https://news.ycombinator.com/newsguidelines.html: "Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."
Right. It's going to come as a shock to exactly no one that (e.g.) Wikipedia was used as a source.
Then there's the fear-mongering about using public databases (e.g., voter registration lists) in "unforeseen ways". While that may be true, they don't seem to be concerned about non-AI actors using these databases in "unforeseen ways".