June 2023 Data Dump is missing (meta.stackexchange.com)
556 points by JasonPunyon on June 9, 2023 | 257 comments



This, along with recent Reddit goings-on has made me realize a major risk with the current structure of online communication. Take either Reddit or Stack Exchange as examples. They build a platform, and users contribute their time, thought, energy, and knowledge to build a community on that platform. Those companies can then gatekeep and restrict access to all that the community built, when all they did is provide the platform, and store the data. We need to rethink this model.

The thought and knowledge of communities and users need to belong to those communities and users. To people they intentionally and thoughtfully delegate to and trust. We need to decentralize our communications, like how the internet used to be before the arrival of social media and mega forums. We need to revert to small, focused forums, with less anonymous, more persistent communication, run by people we trust. Otherwise, we will continue to see mega companies harvest our data and use it (or not provide it) against our wishes. If we don’t work to mitigate that dynamic, we have nobody to blame for the poor outcomes but ourselves.


This was one of the promises originally of Stack Overflow: all the content is Creative Commons licensed so that if they "turned evil" (I believe it was Joel that put it this way) the community could, in a way, create a fork. https://web.archive.org/web/20230203170609/https://stackover...

Unfortunately the dumps themselves are not a legal requirement, just a gentleman's agreement, so realistically exercising this ability was still at the whim of the company.


> This was one of the promises originally of Stack Overflow: all the content is Creative Commons licensed

This reminds me of the promise OpenAI was built on. Unfortunately, it turned out to be too bold a claim to be honored, and too good to be true [0]

0. https://news.ycombinator.com/item?id=34979981


Maybe just a gentlemen's agreement, but a nice canary too. Once the dumps stop, it's time to start waving middle fingers and GTFO.


I stopped answering questions when Monica got sacked as a moderator:

https://meta.stackoverflow.com/questions/393046/who-or-what-...

To me, this was the canary. Just another psychopath megacorp.


That kind of shit is poison. It's like there's a weakness in community run stuff that allows people to come in and co-opt it for their own agenda. I don't know that this is a corporate issue, it's an issue of people not pushing back because they don't want to get accused of anything and letting special interests walk all over them. It's happening everywhere. But I agree, no point in dealing with people who spend their time on this garbage.


> a weakness in community run stuff

But the community is not running it: all the infrastructure is in the hands of a for-profit corporation. Contrast this with the Freenode/Libera split: because not just moderation but also hosting was done by the community, they could continue operations fairly quickly when Freenode turned evil.

So I guess that's the lesson we should learn from it (again): the community doesn't own shit if it does not run the daily operations.


It’s starting to look like benevolent dictator is the way to go as that has at least a small chance of survival


I dunno, look what happened to RMS.


Torvalds, Micay


I always wonder why original founders just sell the company and do something else. Why don't they try to control it more and make sure it stays aligned with needs of society more? Either they can't because of shareholder/equity owners pressure, or they won't, because they really don't care and just said it for PR


I certainly wonder about the "do something else" in the sense of serial entrepreneurs. If I could cash in once I'd be done. If I had enough money to retire on, I would. Run a cat shelter or something.

But the actual answer here is probably a combo of a few things: One, running a company is probably not as much fun as building a company. Much of my career has been "pioneer" roles where nobody else has done the job before. At a certain point, the foundation is laid and the problems to solve are different and often less interesting -- at least to me. It's the build vs. maintain thing.

Two, they started with good and noble intentions. Money got involved. A lot of money got involved. The noble intentions were replaced with reality.

Three, have you met users? As a site grows you have to deal with more and more people and people can be very demanding and not very appreciative. Coupled with the previous factors, I think original founders get burnt out and decide to take the cash and move on. The allure of building anew is too much, the grind of maintenance is too much, and the cash is too good to pass up.

Also four... there's a peak for any site. You often don't know when or how, but you do know that someday your site's maximum value, interest, participation, and all that is going to peak and then decline. Sticking around to fight the good fight may just mean passing up a payday and being left with a declining property nobody wants anymore.


Probably 10-15 years ago now someone I knew built a dev focussed B2B SaaS company that was quite successful. I’m not sure if they ever raised investment, but they were hyper efficient and definitely not following the VC model. Profitable, very small team, and that’s how they liked it. I compared notes with the founder regularly and found it very inspirational. He didn’t want to follow the typical VC model. He wanted to build a long term company that could sustain itself. I just loved everything about what they were doing and how well they were doing it.

And then one day, completely unexpectedly (to me) I read that they’ve been acquired by a private equity firm. I reached out to find out what happened and what changed. His answer was along the lines of “turns out everyone has an exit point after all. Priorities and motivations change and I’ve given this company everything I have to give it. It’s time to explore what’s next”

I think about it a lot. And I’ve witnessed it in various ways numerous times since. People that were hacking on a side project with the idea of “I just want this to be a fun passive income stream” seeing the adoption and love their work gets suddenly thinking “oh my, there’s potential here I didn’t see before! I could build a whole team and company around this… let’s go raise investment!”.

I think we’re just really bad predictors of how achieving the things we want will impact us emotionally. When we reach those milestones we react either more positively or more negatively than we could understand previously.


And yet you get people like Zuckerberg who'll stick around until the end. It's not like he cares about users or connecting people beyond them being a means to grow the company. Yet he saw through the company from its founding to the gigantic megacorp it is now, and it doesn't seem like he'll ever want to quit. Why didn't he quit? It's not like his users are any more appreciative and he's far from beloved by anybody.


Maybe it’s because he can stay in pioneer mode, doing new things like Libra, VR etc.

Maybe in some businesses pioneering is less likely (by nature, or the corporate culture) like Stack Overflow making inroads into crypto or VR or AI.


When you have more money than God you measure success in different ways. One of which is to try to be a Steve Jobs or other famous business celebrity.


…or they might have determined that they‘d rather spend their time on something else.

Keeping control is a (mostly time) commitment and liability. You have to stay on top of things and actively decide on issues that inevitably come up.


> Either they can't because of shareholder/equity owners pressure, or they won't, because they really don't care and just said it for PR

That is assuming the worst in people. Have you ever wanted to move onto something new? If you make something cool, it is not your lifelong obligation to oversee it.


The original founders sold the site for $1.8 Billion.


That was in 2021. https://stackoverflow.blog/2021/06/02/prosus-acquires-stack-...

Joel left in 2019 https://twitter.com/spolsky/status/1111267189133316097

Jeff left in 2012 https://blog.codinghorror.com/farewell-stack-exchange/

I'm not sure that it is fair to say that the original founders sold it for that amount.


It's possible that they had equity in the company after they left.


They probably had some nice checks written with their name in the "pay to the order of"... but they weren't the ones doing the selling, negotiating the price, or having any say in it more than any other shareholder (weighted by the shares and voting status) would.

To that end, with their various rounds of funding ( https://www.crunchbase.com/organization/stack-overflow/compa... ), they likely had very little say in any of it after that series E ( https://stackoverflow.co/company/press/archive/series-e ) which is after Jeff and Joel had left and likely diluted any value they had substantially.


[flagged]


I didn't know, what can I read about this?



What? How do you justify declaring a person as a whole "super toxic" with only a link to an interview citing a link to a blog post including "Things like… hyper-competition… an over focus on aggressive competition… things like zero sum thinking" when:

* the post is about interviewing which mostly is, for better or worse, a competitive zero-sum process.

* the post is more than a decade old (then, more like two now.)

* the post is written by someone else.

I have no opinion on the guy one way or the other, and he may in fact be toxic, but that's hardly compelling evidence.


I linked to one article.

You are free to find your own.

I was part of Stack Overflow in the beginning and experienced it first hand back then


Because despite claims to the contrary most of these sites/projects aren't created for altruistic reasons, they were created to make money (at some point). Cashing out is typically part of the long term plan.

In the case of Stack Overflow, I think the reason for the data dumps was two-fold: one of the original founders (who left long ago) came across as at least idealistic and wanting to do the right thing. The other was pragmatic and most likely always thinking about the money angle. However, the other founder likely also saw the value of the data dumps from a PR standpoint which was quite valuable as they were initially trying to replace expertsexchange.com that paywalled most of the content. IIRC, they discussed the data dumps in the early days of their podcast.

Now that there's big money to be made from machine learning (both the models and the data they are trained on), they've likely decided 'screw it' on the PR value of the data dumps and would rather get some of that sweet, sweet machine learning money.


The thing is, I sympathize with them not wanting machine learning companies to make money off the site’s content without any benefit to the contributors, moderators, or the site itself. I worry that gating access won’t really change that and just mean that the site owners also benefit at the expense of the community.


Wow I read the text for that link you posted in a very different way than I intended.


It is a well known situation. The best thing is that I don't think it was intentional, contrary to other well-known "offenders".

Experts Exchange was well known for showing up in search results but not providing the answers without paying. Many people hated it and wanted search engines to implement some sort of deny list to filter it out automatically.


Heh... yes, that was a popular joke at the time and one of many cautionary tales in picking a multi-word domain name.


> I always wonder why original founders just sell the company and do something else.

They typically have millions of reasons. Sometimes billions.


So the idea is that in case leadership wants to 'carve out a kingdom' that is not in line with community wishes, the community could take the data dump and create a clone of sorts? Then now the last snapshot for doing so would be the last data drop from March?


Yes. There's moderately successful precedent: Wikivoyage is a fork of Wikitravel, which went evil after it was sold to a content farm.


So it's time to community fork?


Assuming that the linked post is accurate and that the "approval from senior leadership" to turn the dump back on does not come...then yes, I would say so. Actually there is already Codidact, although if I recall correctly they explicitly ruled out importing SE data when they started up. https://codidact.org


Yeah, Codidact isn't a "fork" because they don't use the SE data.


It’s a fork of the community rather than the data and content.


You really would need an existing dump to seed the new site.


There were a number of issues that led to the decision not to do a grab and seed of SE into codidact.

There was the "what license is that post actually under? Is it 2.5? 3.0? 4.0?" which made things difficult.

There was the "what are the actual attribution requirements that SE has for sites that use its content?" This is a bit of an issue because it's never really clear what those requirements are and what you need to do. It can also hurt SEO because it's duplicated content. Furthermore, codidact leadership had already had enough dealings with SE lawyers and likely wanted to avoid any others.

Lastly, there was the desire to make a philosophical break with SE. The codidact founders didn't want to have anything to do with SE.

Some sites are doing ok. Others stood up but didn't have sufficient involvement to keep them going.

For a counter example, "PhysicsOverflow" has an import tool that they use. https://www.physicsoverflow.org/4536/import-queue

Having an imported site that is mostly inactive with activity on that same content is even more disappointing than having a mostly empty site. And active mirroring is a time-consuming process that runs into rate limit issues with an API.


Thanks for the write up.

https://software.codidact.com/categories/38

I haven't used codidact (sorry, name needs replacing), ok just poked around

  * too slow
  * needs type ahead find search
  * needs a GIST experience

The site looks good, presentation is really clean. Lots to like about it. But the thing that replaces SO is going to have to be a step function in capabilities. That said, just fixing the weird descent into performative rule following and language-lawyering on SO might be that step function.

  * "Tipping" or actually giving money to a question answerer would be cool
  * Having a question asker being able to put a bounty on a question would be cool

On the face of it, I am not getting scalability (in many senses) vibes from codidact.


Personally when the fork happened, I was interested in it... but I'm mostly "meh" about it since it's another Q&A style format that maintains all the advantages and disadvantages of the format that SO provides.

I really hoped that they would have gone for something much more radical in terms of trying to create a way to share knowledge.

It's Stack(Exchange|Overflow) with better governance and a different development cycle time and focus - and I appreciate that... but it's still that same Stack in terms of underlying format.

I would have been more interested if they went to something like how Discourse is different from forums. It's still a forum, but it took the structure of a forum in a new way that solves some of the traditional forum issues.

I'd also suggest checking out https://topanswers.xyz ( https://topanswers.xyz/tex is one of the more active ones). For example https://topanswers.xyz/tex?q=4593 - and you'll note that there's a chat rather than comments on a question or an answer...

But getting that community there, and stable, and growing is a very hard problem. It's one of the things that Reddit and SO are fairly successful at doing because of the network effects and the corresponding exoduses from other services when they were growing.

There was a compelling story to tell of "why do you want to switch." There's a story now, but Mastodon, Lemmy, codidact and similar haven't really stood out. It feels like "we're the same but better... if you ignore the performance of the site."

"Yea, I could switch from reading reddit to reading Lemmy... but it doesn't have a feed of 200 cat pictures I need each day for my daily dose of eye bleach."


Totally agree!

The shtshow is the reason to switch, but there also needs to be new capabilities on the other side. As soon as they break old.reddit.com, I am out.

I feel like https://en.wikipedia.org/wiki/ActivityPub is 10x more complex than it needs to be. The SO replacement should be an application on top of an existing protocol.


We outlined some of our broad goals (intended differentiators) here: https://meta.codidact.com/posts/276296, in case that helps. Codidact is a work in progress. The biggest non-technical difference from SO is how we treat communities and their members: communities have a lot more autonomy, and we treat people decently. No stockholders are driving anti-community business decisions.

On the technical level, while Q&A is central, we also have other post types and other models. That post I linked to is an article in a blog that's part of our Meta community. The Electrical Engineering community has papers, so people can present information outside of the Q&A structure. Code Golf has a sandbox where people can get feedback on draft challenges before posting them. Software Development has a Code Review category. Some of our communities have added their own customizations to the code, like Code Golf's leaderboard for challenge answers. We want to work together with our communities to build what best serves their needs.

We've done some things that look small but might have larger effects. For example, the asker of a question can't mark one answer as "accepted" like on SO, but anybody can mark an answer as "works for me" -- or "outdated", or other annotations that communities can define. Scoring takes controversy into account, because +10/-5 and +5/-0 are very different even if they're both "net 5". With threaded comments, it doesn't matter so much if two people have an extended conversation; it's not in the way. Abilities are granted based on activity and reputation is just a number -- or can be turned off entirely if that's what a community wants. We're trying to make as much stuff configurable as we can, because we can't possibly know what's going to be best for every single community and don't have the hubris to claim we do.
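For anyone wondering what "taking controversy into account" can look like in practice, here is a minimal sketch using the Wilson score lower bound. To be clear, this is one well-known technique for ranking by vote ratio and sample size, not necessarily the formula Codidact uses; the function name and the choice of z = 1.96 (roughly a 95% confidence level) are my own assumptions for illustration.

```python
import math


def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the upvote fraction.

    Illustrative only: the name and default z are assumptions, not any
    site's actual scoring code.
    """
    n = upvotes + downvotes
    if n == 0:
        return 0.0  # no votes yet, so no evidence either way
    p = upvotes / n  # observed fraction of positive votes
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# Two posts with the same net score of 5 but different levels of disagreement.
clean = wilson_lower_bound(5, 0)
contested = wilson_lower_bound(10, 5)
print(f"+5/-0  -> {clean:.3f}")
print(f"+10/-5 -> {contested:.3f}")
```

Under this sketch the uncontested +5/-0 post ranks above the contested +10/-5 one even though both are "net 5", which is the distinction a plain up-minus-down score throws away.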

We have the usual bootstrapping problem of a new thing. Our communities are small and trying to grow. Because they're small, visitors don't see thousands of questions and high activity, so they don't participate either and wander away, making it harder to build activity. We would love to find people who want to work with us to build communities. We recognize that helping to build a community with us is going to be harder and slower than just asking your question on SO, but if everyone were happy with SO this thread wouldn't be here, so maybe we're an option to a few people reading this?

(I haven't posted much on Hacker News, so I hope I've read the room correctly and that this kind of comment is ok. If not, I apologize and would appreciate correction so I don't repeat mistakes. Thank you.)


A decentralized system will never work because 99% of users do not care at all; the centralized systems are easier to sign up for and use. It's been demonstrated over and over and over again.

Even if the underlying tech is decentralized, the community will settle around one or a few big instances (for example, Gmail and GitHub) which often end up having significant control over the trajectory of the entire ecosystem. If you run your own email server and you get put onto Google's spam list - you're fucked.


I don't know that I agree. I think most people don't care about decentralization but they do care about the effects it brings.

Email is a great example where most people wouldn't be interested in a version of email that only lets you email other @gmail.com users. Having an email address that can contact anyone, a phone number that can ring any other phone number etc instead of being locked into a single corporation network is a clear value add that people care about.

The main issue from my perspective is that we only have a select few large tech companies that operate as monopolies so are effectively able to block out new decentralized protocols from coming to be.

RCS messaging is a great example which I think most people would use over alternatives like WhatsApp and iMessage except that Apple refusing to support it locks a huge fraction of the market out and stops widespread adoption being possible.

I don't think it's a question of preference, or people being uninterested. It's just a boring and repeated story of corporate monopolies intentionally reducing consumer choice.


>Email is a great example where most people wouldn't be interested in a version of email that only lets you email other @gmail.com users. Having an email address that can contact anyone, a phone number that can ring any other phone number etc instead of being locked into a single corporation network is a clear value add that people care about.

That is only because those technologies predate those companies. Normal people don't care that you can't DM a Reddit user on Twitter or that your Instagram posts don't automatically show up on your Facebook page. People are generally fine with centralized corporate platforms as long as it isn't a restriction of a previously free technology and the network effect has done its thing to attract enough people to the platform.


Surely, normal people get exhausted and burned out with how many accounts they need to create on all the platforms, just to stay in touch with everyone in their circle of friends. It was already crazy for me in the 90s days of Instant Messaging, and pidgin was a Godsend to consolidate all those accounts and friends-list into one interface. Surely normal people hate on the sheer volume of apps they need to install on a phone with limited storage.

I know a few people who are on exactly one platform, and don't mind it, but most everyone seems to need tendrils all over the place to keep up with a normal Internet social life.


Pidgin was great.

Now I have all those different (web or not) applications, that never support all my devices. Most applications steal a crazy amount of data, consume bandwidth as if it were free along the way, and assume they are alone and can consume all the resources of the computer they are running on. And they all have different UI (dark) patterns, and bugs and this and that.

</rant>


The result of that is people stick to a known list of apps, and the chance of new apps breaking in is almost nil.

It's why WhatsApp wasn't replaced by the myriad of chat apps that offer a better experience and better features.


> normal Internet social life

I guess I'm abnormal? The only way I contact any friends/family is text message or email. I've never used Facebook, Instagram, or whatever other platforms people use to "keep up with a normal Internet social life". Yet, I have a plethora of friends and don't feel like I have any problem keeping up with family. Maybe people don't really need these platforms.


Yes you are. In the sense that you deviate from the norm. Which is fine, really. I would hope so anyway.


I think things like being able to contact anyone are important to people, but decentralisation doesn't necessarily provide that (e.g. if I sign up on a Mastodon instance will I be able to see the messages of everyone on every Mastodon instance, and will they be able to see mine? Will I even know if somebody I care about can see my messages or not?)

I think decentralisation is not a selling point to most people. It's an implementation detail that they're happy to go along with but it's a negative if it makes the experience worse, makes everything more complicated, if they can't talk to the people they know IRL, etc.


> [M]ost people don't care about decentralization but they do care about the effects it brings.

I’ve tried pointing those out as bluntly as possible as an experiment. As in “well, surprise, locked-in crap with impenetrable failure modes locks you in and is impenetrable when it fails, you signed up for that”.

People didn’t appreciate it, as I expected, but they did seem to recognize the truth of it. That is, the response was along the lines of being forced to use the thing to communicate with some person or institution, not of liking it or thinking it’s not at fault.

I don’t know how one would use this to organize an IM revolt (Riot? sorry), but there does seem to be at least some fuel for it even among people who are not outright IT professionals.

> RCS messaging is a great example which I think most people would use over alternatives like WhatsApp and iMessage [...].

RCS might make a slight amount of sense in a culture that still uses SMS / texts in non-negligible amounts, but that’s basically North America and Japan AFAIU? And I prefer that territory shrink, not grow, as I’m very much not thrilled by the idea of handing back over detailed control over IM—even just billing, not content—to phone carriers. Whatever the country, they have extensive history proving they can and will screw you for decades unless you can leave, and it will take everybody leaving for them to stop.


I agree, but to nitpick one thing: RCS isn't properly decentralized. It's controlled by carriers in the GSMA and, with the current way the infrastructure has been deployed, Google. Interoperable on the app level, yes, but not a poster child for decentralization.


> I don't think it's a question of preference, or people being uninterested. It's just a boring and repeated story of corporate monopolies intentionally reducing consumer choice.

not really. nothing at all is stopping one from starting a new social network that's federated. the issue is users have no reason to move.

it's more a question of incentives, and there's basically none to use something that you're not already using unless it's better, heavily advertised or you're simply paid to.


>If you run your own email server and you get put onto Google's spam list - you're fucked.

It's even worse than that- I ran my own email server, and for some reason gmail delayed any emails from their system to outside of their system. That meant that people would send me an email but I wouldn't get it for 20 minutes. These delays don't exist when using big email providers (it stopped being a problem when I switched to Fastmail, for example) but if you're running a small server Google makes it a nightmare.


Sounds like you need to allowlist Google from your greylist. They have a long retry on their side. Once you tell a server to 'go away, come back later' it is really up to the server to decide when or if to retry. Additionally, if they use multiple sending IPs, you can end up greylisting again and again before they try back with a good IP.

You'd either need to allowlist the big providers' sending blocks or just drop greylisting altogether.
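To make the mechanics concrete, here is a toy sketch of the greylisting behavior described above, including why a sender rotating through several IPs keeps getting deferred and how an allowlist sidesteps that. This is not any real mail server's implementation: the class name, the 300 second delay, and the addresses (taken from the reserved documentation ranges) are all made up for illustration.

```python
import ipaddress


class Greylist:
    """Toy greylist keyed on the (ip, sender, recipient) triplet."""

    def __init__(self, delay_seconds=300, allowlist=()):
        self.delay = delay_seconds
        # Networks that skip greylisting entirely (hypothetical values).
        self.allow = [ipaddress.ip_network(net) for net in allowlist]
        self.first_seen = {}

    def check(self, ip, sender, recipient, now):
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in self.allow):
            return "accept"
        key = (ip, sender, recipient)
        # Remember when this exact triplet first showed up.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "accept"
        return "tempfail"  # i.e. "go away, come back later"


plain = Greylist()
msg = ("alice@example.com", "bob@example.org")
print(plain.check("192.0.2.1", *msg, now=0))    # first contact
print(plain.check("192.0.2.2", *msg, now=600))  # retry from a different IP
print(plain.check("192.0.2.1", *msg, now=900))  # retry from the original IP

trusting = Greylist(allowlist=["192.0.2.0/24"])
print(trusting.check("192.0.2.2", *msg, now=0))  # allowlisted block
```

The second call is the painful case: the retry arrives after the delay has passed, but from a new IP, so it counts as a fresh first contact and gets deferred again. Some real setups reduce this by matching on a wider network prefix instead of the exact IP.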


I completely disabled greylisting to try and resolve that, but it is another good point on why gmail kind of sucks for people not on gmail.


So you accepted their mail on first contact? And first contact was 20 min after sent header timestamp?


My gray list behaves like this. It's seldom a problem in practice, but new senders may need a "warm-up".


I have run my own email server off a Linode for the last decade-plus, and I have never encountered this. Most of the people with whom I correspond (I run my own business from my server) are on Gmail, and I have always received their emails instantly. If you were getting emails only 20 minutes later, I wonder if there was some server misconfiguration on your end, e.g. sending messages into graylisting delays.


I assume you didn't have anything in between Gmail and your setup? Like a forwarding layer or something?


I disagree, it worked before and it is the reason the internet even exists.

The core issue is that user generated data is owned by one individual company. There are existing systems that don't have this issue, e.g. Usenet or BitTorrent.

We don't need to idiot proof the web. There are enough people to gather some place for a social network even if it's hard to use. The others can stay and will stay on reddit anyway until one day when they have also had enough and learn to use some alternative.


The "value" of Reddit as a website is vastly overrated anyway. There's nothing on Reddit that can't be obtained elsewhere, folks just get stuck in patterns that are familiar and presume it's because options are limited.

The world will continue to spin without Reddit, or if Reddit isn't popular anymore, or if Reddit kicks all of its current users out, and so on.


Reddit’s value these days is as a super-forum. Instead of needing five logins and five accounts for separate hobbies, you can have one login, one account, and if you occasionally want to comment or ask questions about something else, you can do so without having to create another account. Doesn’t hurt that Reddit has a very good network effect.


That's how you use it, but I'd argue that 90% of Reddit's users don't operate the same way, considering they don't even have accounts to begin with.

In fact, I would even argue that Reddit is actively trying to push users who use Reddit this way off of the platform, as they're not as easily monetized. Reddit is notorious for having low conversion rates on its ads and extremely high ad blocking rates.


I'm definitely a push-away user. I'll never pay them money to use their site and I rarely comment in the ten years or so I've had an account. Same goes for HN; I would never pay money to use this site - I would just find a new site to frequent. I understand this is unfortunate for the website operator but I consider operating a web server at this point a labor of love, not a business model.


> but I consider operating a web server at this point a labor of love, not a business model.

Hey, at least you're open about your entitlement!


IMO the internet post-2010ish is inferior to the one before in theory. Early creators were thoughtful, they created protocols foreseeing a lot of these problems. I'm not sure what's gonna happen next but the parallel universe I'd like to be in is one where the last 15 or so years of the internet were an anomaly, or a curve that rises quickly then dies off.


Decentralized software is not the only alternative. A non-profit site would be much better than a publicly owned one. Further, it could be operated by a co-op and democratically run, with its own “laws”.


> A non-profit site would be much better than a publicly owned one. Further, it could be operated by a co-op and democratically run, with its own “laws”.

If it's better why isn't that how these sites are run? Wikipedia for example is an anomaly.


They're hard to capitalize. You need a skilled founder willing to forego profits, or a group of well intentioned devs with management experience, to organize the business side of things. You have to figure out non investment sources of income (charge for services, donations, etc. maybe no ads or at least no ad networks if you're worried about privacy or ethics). It's a lot of work, especially trying to scale that up to where you can actually pay employees salaries and benefits.

It's one thing to have an open source model for software that doesn't necessarily need that level of official organization, but once you start getting into backend infra costs, you can't run that on volunteer labor alone, you eventually have to fundraise for infra costs. If only there were a public or community driven internet... it's too bad the peer to peer models (federation, torrents, etc.) don't work well for real time global communications. Centralized messages are much easier to read and write to.

That doesn't mean such a model can't work or isn't good, it's just much harder, and all for no profit motive.

Realistically I could see a bunch of ex FANGers pooling income into a worker owned coop and starting a tech collective, then maybe handing it over to a 501c3 at some point or just keeping it going as a private company without outside shareholders. But someone has to organize all that, and it's not the traditional strong suit of devs.

Fingers crossed though. Would love to see something like that happen.


The difficulty now is that you'll have to compete against Stack Overflow, whereas Stack Overflow had to compete with the paywalled Experts Exchange. Both for-profit and non-profit will have a very hard time; unless Stack Overflow really goes off the rails Quora-style, it will probably stick around.

Also I think people underestimate how much effort goes into moderating Stack Overflow, and how delicate the entire system is. There's already a whole bunch of Open Source Q&A software out there and I'm sure one of them will work fine, that's not really the hard part, and managing servers is also "just" a matter of spending time. Moderating and managing it all is much harder and more time-consuming; there are people whose ability to hold down a job or finish their homework quite literally depends on being able to ask questions on Stack Overflow: there's a lot of incentive to abuse the crap out of it, more so than many other sites.


SO also had pretty serious influencers in the tech community to bootstrap everything.

However I’d guess this is existential for them. I’d guess an AI is checking and generating a huge percentage of homework (and work) and that’ll only increase. The place to attack SO is through the chat / code gen UI and git repos. New languages and frameworks are probably on GitHub in their most hardened state approximately as quickly as snippets land on SO, and ChatGPT’s upvote is a good indicator of quality sans SO.


"It's been demonstrated over and over and over again."

Well, yes. But the risks of centralized sites have also been demonstrated over and over again. Maybe, just maybe, one of these days that lesson will stick and communities will take a longer term view. It seems like the cycle is happening a bit faster each time now, so maybe folks will get tired of the "damn, time to move to another site, again..." thing.

Or not. Convenience tends to trump lots of other considerations most of the time.


Between gmail and your own email server, there are thousands of medium size email providers that work ok, so email is a bad example; decentralization still works there. As for GitHub, the coders of all people should have known better than to pile into one site and sell their s̶o̶u̶l̶s̶ code to the devil for a little bit of convenience.


It does because e-mail is not a social network.

You don't "join" a gmail to see social streams of posts. There is no penalty or friction to start with some small e-mail provider.


> A decentralized system will never work because 99% of users do not care at all

Maybe, maybe not. It's worth a shot, though, isn't it?


Only one percent of users need to care about decentralization for it to work sufficiently well as a safety net and equalizer.


It's worse. Even in case users did care about decentralization, such solutions are less reliable, not more. Subjective moderation, rule changes, rug pulls.


…but how necessary are those 99% for the viability of the new system? Perhaps we're better off without them.


>when all they did is provide the platform, and store the data.

Is there not significant innovation and benefit that was designed and implemented in the first place that caused users to contribute their time, thought and energy?

I think the real problem here is when organizations that rely on crowd-sourced business models decide they just have to be billionaires or solve all the world's problems with their platforms, instead of just staying true to their model. I don't see what's wrong with just running a highly successful business that makes money for its founders and doesn't have to go out and strive every day to be the next Facebook or Google.

Make no mistake. Platforms like Reddit and Stackoverflow are real, serious businesses. But why can't they exist and be a general successful business like your local mom and pop restaurant or toy store or whatever?

I run RadioReference.com and Broadcastify, both of which are significant businesses but also rely almost solely on crowd sourced data and content. We're wildly successful - but I've never seen the need to hire 3,000 people, or IPO, or do series raises to expand into solving world peace. Our premium subscription pricing has been the same for 15 years. I completely eliminated advertising on one of the platforms last year. We make a lot of money. We provide a lot of value to our communities, and we carefully innovate and expand to provide value. It's a nice happy life for everyone involved, and I don't have to deal with a VC who will be determined to either make a trillion dollars or torpedo my business.


The core problem making it so difficult for this to ever actually happen is that it is 2023 and I guess you only just today somehow came to this realization as if it were new or unexpected or not something people had been saying for the past 25 years of us watching these online platforms abuse their positions of power and slowly turn the screws on people.

Over the past quarter of a century of people trying to create online walled gardens of hosted content we've seen this happen over and over again, and the examples are so numerous that reddit was itself a replacement for Digg and StackOverflow for Experts Exchange. And yet, somehow, today, you suddenly woke up :(.

The reality is that we live in a dystopian Eternal September where as people finally notice what is going on and leave they are just replaced by new people who don't care or simply didn't use the prior service and are attracted to the new shiny, and another 25 years from now you're going to see people making the same unapologetic "I now realize" statements.

What we need to do is figure out how to actually replicate the feeling you are having in a way that doesn't require you to have spent years on a platform and then watching it die so it can be communicated to people before they bother to use a new platform, and in a way that somehow makes them willing to collectively not experience viral lock-in.

(And we also need to figure out how to make people willing to accept doing that at some cost to themselves, whatever form that might take: people on HN continuously do the thing where they give up freedom for a little temporary convenience and then get angry at others for daring to suggest that something a bit harder to use or with any extra friction would ever be a sane thing for anyone to use :/.)

Back in 2017 I gave a talk at Mozilla Privacy Lab called "That's How You Get Dystopia" where I just documented a ton of examples of abuse of centralized power and the reality is that every few days I just come across more stuff to add to the list... and this talk doesn't even bother with all the numerous services that simply enshittified or shuttered.

https://youtu.be/vsazo-Gs7ms


+100

I was saying to someone yesterday that "enshittification" is a sub-optimal coinage for something that really shouldn't need a new term, and which focuses attention on symptoms rather than root causes. If you give someone a power of attorney over your assets, you'll likely find that they start behaving less well towards you. Or if you give up agency, others will treat you like less of an agent. But what matters is not their behavior at the end but your decisions before that point.


Ever since Gracenote/CDDB it was pretty clear that this is the model. Still pissed off about that.


> all they did is provide the platform, and store the data.

You're seriously underestimating the effort it took to build that platform and how much effort it continues to take to keep it running well. I'm not talking about technical challenges, but social ones. It took a long time for them to get the system and incentives right (and it's still not quite right, IMHO), and it takes continued effort to keep it running well in the form of moderation and stopping abuse (and here it also doesn't quite get things right).

I could bang out a "BufferUnderrun.com" in a few months; many people could. But that's not the hard part.


>We need to rethink this model.

Once upon a time, most people who wanted or had something to say wrote their own little website and hosted it themselves (be it in a datacenter or a server in their closet). Some even ran forums and got fancy with server-side magic because that's what nerds do. Even the kids who couldn't afford anything had free, basic hosting services to choose from (anyone remember those days?).

The internet was designed as a distributed network and the denizens then were distributed. You only got as centralized as a given ISP or datacenter provider.

Of course, we all know as more and more commoners came onto the internet they didn't want to bother with developing or hosting or maintaining a website or anything. They just wanted to shitpost, for free, with blackjack and hookers.

And so "free" services like Reddit, Facebook, et al. came about to serve that demand. Information became centralized, because who the fuck has time to be responsible? Offload that crap!

The cost of that offloading of responsibility has now come knocking with debt collectors in tow, with interest.

I guess what I'm trying to say is: We don't need to rethink anything. We just need to take some god damn responsibility for ourselves. Responsibility is power, and with power you can tell commercial interests you disagree with to screw off.


The problem is information hoarding. If you imagine going to a pub meeting people regularly, and the pub records everything you ever said, and then one day the pub owner says they're going to charge you for the recordings, you'd laugh. Nobody would pay. If they tried to charge you to enter, you'd go to another pub, you wouldn't lament the "loss of your culture".

In fact, people don't record what they talk about in pubs because the point is the chat experience not the records of previous chats. Data isn't oil and it isn't quite sewage, it's more like quicksand or thickets of weeds growing and tangling around your feet. Like minimalists say 'stuff is bad' but stuff is useful, it's having stuff hidden in cupboards and drawers and a garage full of stuff and wanting a bigger house to hold more stuff and most of the stuff going unused because you can't bring yourself to let go of it, and companies advertising that more and newer stuff will make your life better and solve your problems, which is the biggest problem with stuff. Sufficientism might be a more appropriate name - enough stuff to make your life better and no more.

Enough chat to make your life better isn't "all of it kept forever".


Really? The takeaway isn’t rather that rent-seeking AI models need to figure out a way to reimburse companies and communities who’ve stored up all this capital?

Seems to me SO built and delivered huge, huge amounts of value and it’s now all at risk because multibillion dollar companies are free riding.


Users on SO created value and freely shared it with a community in expectation that the value they created would be freely and collectively shared with everyone. In SO's case this expectation was explicit; the data backup and API were billed as a deliberate choice designed to give users the freedom to migrate and scrape data in case the company went "evil." It was designed specifically to reduce SO's ownership claim over user-generated content.

It's not that SO has a moral right to control and profit from that content. The reality is that SO holding that content at all is a conditionally granted privilege that the community affords the site, and it is a privilege that was always designed to be revocable and the data moveable if SO started abusing its position of power as a host and trying to lock down access.

Some writing/content sites have taken steps to restrict AI access based specifically on community request. That's a very different situation; if a community (particularly a closed or close-knit community) is collectively and (mostly) uniformly trying to avoid an AI scraping the content that they created, then good for them. There are communities online that are in that position. But "how will the company get reimbursed for our valuable asset" should not be part of that conversation. And SO in particular was set up around norms that deliberately allowed this kind of scraping. It's not their asset to protect.

> rent-seeking AI models

I have issues with modern AI economic models too but I don't think that "rent-seeking" is an accurate term to use. A better word would probably be "parasitic"; I understand (and somewhat agree with) the argument that OpenAI is looking to repackage information it didn't create in a way that redirects attention away from the original source of information.

But I'm having a really hard time figuring out how OpenAI is hoarding a scarce asset to extract value by controlling access to that asset. The more obvious rent seeking behavior here is coming from SO, a company trying to restrict access to Creative Commons licensed content created for free by unpaid volunteers, and trying to reclassify that content as their corporate property.

I guess being as charitable as possible, I do worry about the SaaS model of many AIs that are dedicated to content generation, and I worry a little bit about AI models becoming heavily integrated into creative processes and then extracting a kind of monetary "creative tax" from artists/creators while heavily restricting what they are allowed to make. That's at least adjacent to rent seeking, but I'm still not sure it's the term I would use and I'm not convinced it's a scenario that's applicable here.


Thank you for the really thoughtful response!

Good point that rent-seeking is maybe not the correct term now, but it looks increasingly like services will have to lock down content or shut down due to AI models frontrunning them with their own content. In that world, the AI models are in a great rent-seeking position (i.e. only they have the [old] content which was broadly available and now is not, due to their own incentive distortion).

In any case I buy your argument with regard to SO stewardship of this data and certainly my intuitions were that the major contributors are not super thrilled about their content being digested by models and spit out with no attribution, but that is absolutely an assumption on my part.

Would be interested to see a poll of those users on this question!


I do think if we were having this conversation about an explicitly community-owned forum or fanfic hosting service -- ie, a scenario where it's obvious that the community is behind the decision -- my reaction would likely be very different. I'm broadly pretty sympathetic to a forum saying, "we're doing this for us, not for a VC firm."

SO in specific though is an interesting site in that the value proposition of the site was very heavily based on this information being freely available and uncontrolled. I think they're in a position where it's much less appropriate for the site owners to try and clamp down on AI scraping.

If there is a strong movement from the SO community to change that, I'm not aware of it, but who knows, maybe I'm out of the loop.

Off the top of my head, another example of the distinction I'm getting at would be something like Wikipedia -- if the Wikipedia owners started trying to outright block site backups my immediate response would be, "well wait a second, that was not the deal we all made around this site, we signed up to help the Wikimedia foundation build an Open encyclopedia, even if that means it gets pulled into an AI dataset. We specifically didn't want the Wikimedia foundation to have the power to decide what usage of this data they would allow or deny."


> but it looks increasingly like services will have to lock down content or shut down due to AI models frontrunning them with their own content.

This feels like it's slightly off to me.

An LLM that was trained on job postings to be able to categorize them isn't trying to do job postings ( https://wfhmap.com/algorithm/ ) but rather be able to do meaningful classification of bulk unstructured data.

An LLM trained on reddit is... weird to talk to, but talking to it doesn't replace asking a proper subreddit with people answering and comments back and forth. Is ChatGPT stealing views from people complaining about their job in /r/antiwork? Going to something in /r/news and sort by controversial and getting some popcorn turns out to be much more interesting than ChatGPT ever will be.

Maybe you can say that ChatGPT with some training on Stack Exchange sites has some utility (and that the data being classified, tagged, and given feedback makes it even more useful), but GitHub Copilot was trained on just GitHub stuff and it's better at code than pretending "try {some broken code} hope that helps" is going to be useful for an LLM.

To me, this feels much more like CEOs that are having difficulty with existing monetization attempting to lock up the data that they have under a questionable pretext to monetize that to companies looking to train models for other things.

What the rights are on the output of models is something that needs to be sorted out - probably by the courts. I am still of the opinion that if something that might be copyrighted is used from any source, then the person doing the copying (who has agency) needs to do a license check themselves on it. I know that there is GPL code on Stack Overflow that looks like it's licensed under CC 4.0 and if you copied the SO answer and put it in a BSD licensed repository, you'd be in violation of the GPL - and that's without touching any LLM.

There are also lots of non copyright things that the data could be used for. I'd like to make an AI-CATegorizer. Train it on a representative number of images from each of the reddit cat subs so that someone can ask it "here's a picture of my cat, what subs can this be posted to" and get back "/r/airplaneears /r/blackcats /r/stealthbombers" - and that's not something that is potentially generating copyrighted content (though it inherently uses it)... pretending that those images were under a CC license, would it need to attribute all of the images that were part of the training data set to respond back with those three subreddits?


There are a couple of different motivations a company could have around blocking API access to prevent AI scraping:

A) scraping itself is too expensive. I suspect that's probably not the case with SO because they blocked backup. Downloading the database from the Internet Archive doesn't cost SO any money.

B) the AI is going to replace the original creators (or more likely, devalue their work and push wages lower) and they'd like to prevent that negative social consequence. This is the charitable interpretation, and I understand writers/programmers/artists being concerned about it, even if I'm slightly more cynical myself about how AI content generation is going to work out once the "shine" has worn off. Note that I'm not saying that this concern is necessarily right or that there aren't positive uses of AI that have nothing to do with replacing jobs; just that it's a concern that a site/community could reasonably have.

C) companies are realizing that there's a lot of VC money in AI right now, and they would very much like to be in the business of selling shovels, and their feeling is that if anyone anywhere is making money off of "their" content then they are morally deserving of some kind of cut no matter what. This is obviously the case for some companies, but is (charitably) probably not the case for all of them.

One test we could use to try and distinguish between B and C is -- if a company is blocking API access, are they then turning around and licensing that data or opening up paid API access, and if they are, is any of that money going to the users that made the content? If SO turns around and makes API access paid and continues to not pay any of the volunteers writing answers, at that point it's much easier to argue that they're trying to sell shovels, not trying to protect users.

This is also part of why I take a cynical view of what Reddit is doing with its API (although Reddit claims they're in camp A more than B). Reddit is probably not doing this to protect its users from theoretical AI displacement because it's still planning to license the data. It's just pricing it so high that only giant companies would be able to afford it.
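
For what it's worth, the B-versus-C test above could be written down as a rough decision rule. This is only a sketch of the heuristic as I've described it; the function name, the three yes/no inputs, and the labels it returns are all made up for illustration, and real cases are obviously messier than three booleans.

```python
def likely_motivation(blocks_free_access: bool,
                      sells_access: bool,
                      pays_contributors: bool) -> str:
    """Rough heuristic for telling camp B from camp C (hypothetical labels)."""
    if not blocks_free_access:
        # Nothing is being locked down, so the test does not apply.
        return "not applicable"
    if sells_access and not pays_contributors:
        # Data is blocked, then resold, and none of the money reaches the authors.
        return "C: selling shovels"
    if sells_access and pays_contributors:
        # Paid access, but contributors share in it: harder to call.
        return "unclear"
    # Blocked and not resold: consistent with trying to protect contributors.
    return "B: protecting contributors"


# Example: access is blocked, then licensed, and volunteers are not paid.
verdict = likely_motivation(blocks_free_access=True,
                            sells_access=True,
                            pays_contributors=False)
```

It says nothing about camp A (scraping cost), which needs a separate look at what the access actually costs the host.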


On the B/C test: From Jody Bailey (SO staff) https://meta.stackexchange.com/a/390040 (in full) posted 7 minutes ago (as I write this)

> Stack Overflow senior leadership is working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place.

> We are working on setting up the infrastructure to do this correctly in the age of LLMs --- where we continue to be open and share the data with our developer community but work to set up a formal framework for large AI companies that want to leverage the data.

> We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community. We are working to design and implement appropriate safeguards and still sorting out the details and timelines. We will provide regular updates on our progress to this group.

---

On reddit, for the A/C test... yea, I'm going to be cynical there that they're looking to sell the information (and it isn't so much trying to protect users). But also that 3rd party clients are not showing ads and may be poorly behaved when provided a free API with what (at one time) was generous rate limiting.


I looked into this a bit more and I'm somewhat doubtful of this justification. There's at least been discussion within SO about charging access: https://meta.stackexchange.com/questions/388551/is-se-going-...

> First, I'd like to say that the intent of what Prashanth is saying is very simple: to return value to the community for the work that you have put in. The money that we raise from charging these huge companies that have billions of dollars on their balance sheet will be used for projects that directly benefit the community.

This is worded very specifically. Is SO planning to give money to users? They don't say anything like that; instead they say that they'll be "spending that money on the platform."

Well what does that actually mean? Every feature that SO builds could be characterized as "for the benefit of the community." It's hard not to read that response as just another way of saying "we're going to profit from this as a company, but don't worry because we use our profits to fund product development."

Heck, Reddit could make exactly the same claim, and in fact the linked Wired article actually makes that comparison:

> "Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow’s Chandrasekar says. "We're very supportive of Reddit’s approach."


SO delivered some value, the users are the ones who delivered huge, huge amounts of value


Right, which SO (and not some other site) managed to entice.

Somehow users didn’t flock to my ethansusefulprogrammerquestionandanswers.com ¯\_(ツ)_/¯


It seems more likely that there, too, it was the users doing the work, not SO.

The primary value on SO is generated by the users, and thus the value proposition for enticing new users is also generated by the users. SO is just a forum.


And companies like OpenAI will take the profit and even kill some of the users' jobs.


and the heat death of the universe will eventually extinguish all life


I think these are different time frames


they're also all different topics


I can't count the number of 'crypto-bros' who told me web 3.0 was coming, to solve these problems (and apparently, any problem you could think of)


I see where you're coming from calling out AI data miners for rent-seeking, but most social media platforms are also engaging in rent-seeking behavior.


> The thought and knowledge of communities and users need to belong to those communities and users.

I would say: Need to be a public resource, belonging to no-one, i.e. no person, or group, or company should have legitimacy in denying access to it. They should all be considered _trustees_ of such a resource.

> when all they did is provide the platform, and store the data.

To be fair, SE Inc. did a lot more than provide the platform. A lot of development and design work, publication, a bit of the curation work, etc. I don't like how they behave but let's give them what they're due.

---

Also note the ongoing Moderator Strike (!): https://meta.stackexchange.com/q/389811/196834


I believe we'll have more of these oh s** moments soon when people will finally realise why we need web3. Yes, the whole space was full of scammers and charlatans, but the point of the technology was to create a substrate for networks on the internet.

The idea that these networks and communities need to run on centralised servers is archaic. The technology exists where people should be able to own their own network (followers, subs, following, posts).


Let's be honest though, the primary thing that attracted most powerful people to web3 wasn't decentralization, it was reintroduction of artificial scarcity into digital spaces. Web3 billed itself as empowering users, but it always had an undercurrent of commodification and gatekeeping.

And that's exactly why (ignoring the scammers or pump-and-dump businesses) it saw such heavy investment from VC/tech types. The promise they were interested in wasn't democratization even if that's what they told their users -- what they were interested in was taking a plentiful resource (digital bits) and building a scarce asset that they could use to further entrench exclusivity, status, and monopolistic control over what that asset represented.

Read back over every sales pitch for web3 games. At some point they always devolve into talking about how ordinary users will be able to rent seek: to "license" characters/weapons/gear and passively earn income from other players, or to hoard exclusive tokens/releases in the game and speculate on their future value. Web3 looked at infinite digital spaces and its response was, "infinity is a problem that we need to solve." And it's revealing to look at most web3 branded metaverse attempts and see just how quickly they reintroduced real-world concepts like housing/space scarcity (why on earth would we want a housing market in a digital space with no physical constraints?), and how quickly they leaned into cosmetics and customization as a monetization strategy rather than a user right to free expression.

In general, if a technological "paradigm" is primarily associated with and primarily popular with VC firms, it's probably not being developed with the user in mind.

On the other hand: federation, interoperability, mobile identities, and legal efforts to build a right to data export existed independently of web3 and have shown a lot more promise when it comes to actually increasing user agency.


Again I'm in agreement with everything you are saying here including the part with "other efforts".

I just think we shouldn't throw the baby out with the bathwater. Just because the space got ravaged by zero interest rates, VCs, scammers, charlatans, snake oil salesmen and even worse, it doesn't mean the technology and the premise were wrong.

Having a ledger that is secured by decentralised consensus is not only useful but will be a necessity for the digital first future we are heading towards.

We are reaching the limits of the current paradigm. We see companies like Meta having to be the arbiter of consensus. We've seen platforms showing their true faces and commoditising people's data and network which wasn't really theirs to begin with.


> Having a ledger that is secured by decentralised consensus is not only useful but will be a necessity for the digital first future we are heading towards.

I think I would disagree with this specifically; at least I would disagree that blockchain ledgers are necessary or helpful in most situations.

I'm not going to say that there are no useful applications that would rely on a blockchain, but they're extremely limited and the technology itself seemed to be mostly useful for and mostly oriented towards turning plentiful resources into scarce ones.

It's of course bad for Meta to be the arbiter of consensus. It doesn't necessarily follow that distributed (in some ways still very centralized, at least on a conceptual level) ledgers are a good alternative, particularly now that we're seeing that basing consensus around eventual consistency, fragmentation, etc... seems to be at least more promising than the alternatives, even if it isn't perfect.

I'm very much in support of efforts like 3rd Room, Fediverse, Matrix -- seeing their success has given me very little reason to believe that distributed ledgers would be necessary, and has given me some reason to believe that in many instances they would be actively harmful. 3rd Room for example would (imo) absolutely be a worse project if it was economically/socially built around a distributed ledger. Not only is there no need for 3rd Room to have that kind of coordination around state in VR rooms, it would be actually harmful for 3rd Room to require all of its VR rooms to use a shared ledger/state. Consensus (distributed or not) isn't necessary for what they're trying to build.


web3 failed because it was a rebranding exercise by cryptocurrency holders trying to create new demand for their random numbers.

Centralized servers aren’t archaic, they’re a natural outcome of how social systems work: finding communities is hard; people want to contribute their ideas, not play sysadmin; spammers and AI researchers will create enormous costs for you; etc. If you federate, you will spend more time dealing with those issues than a single focused competitor and you are unlikely to see free contributions which outweigh those costs.

Everything you mentioned is available now on Mastodon, and it’s really interesting to see how that works. Some people love having a small network of their friends, but a lot of people have trouble finding people they want to follow. Instances can have their own rules but dealing with abuse is now a multiparty process and since a lot of instances are run by volunteers that can be slow, unreliable, and inconsistent. Some small servers get hammered by storage and bandwidth demand but there’s no great path to monetization unless you have a ton of users willing to pay more than most people are used to paying for internet services.

In general, these are social problems and there is only so much technology can do to improve them.


I tend to agree but where we diverge is in the thinking that these problems cannot be solved with technology, I believe the opposite.

The point of web3 is to abstract away things like sysadmin by commoditising consensus. Once a blockchain gains mass & momentum it opens up a whole new world of possibilities to hack/reinvent social media.

You could use multiple different types of social media and still maintain a single identity (auth). This means you could find friends and friendlies everywhere you go.

The key is the consensus layer and the ability to store and read critical metadata. I'll give a personal example:

I've been helping a friend to create a combination of Netflix and DVDs. We package the movie licenses into NFTs (MovieKeys) so when the user signs in with their wallet they can stream all the movies they own.

There are so many possibilities with this, but let's focus on the social part. In theory, a social media service could scan their users' wallets for MovieKeys and create a social graph based on that. Heck, you could create entire forums just out of the people who own a certain MovieKey. I won't go further because we've got a lot of things cooking atm.
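As a rough sketch of the "social graph from wallets" idea, assuming a service already has a mapping from wallet address to the set of MovieKey ids it holds (the wallet addresses, key names, and the `social_graph` function are all hypothetical; no real chain or wallet API is used here):

```python
from collections import defaultdict
from itertools import combinations


def social_graph(wallets):
    """Group wallets by shared MovieKeys.

    wallets: dict mapping wallet address -> set of MovieKey ids (hypothetical data).
    Returns (holders, edges):
      holders: MovieKey id -> set of wallets holding it (a per-movie "forum")
      edges:   (wallet_a, wallet_b) -> set of MovieKeys both wallets hold
    """
    holders = defaultdict(set)
    for wallet, keys in wallets.items():
        for key in keys:
            holders[key].add(wallet)

    edges = defaultdict(set)
    for key, group in holders.items():
        # Every pair of wallets holding the same key gets an edge labelled with it.
        for a, b in combinations(sorted(group), 2):
            edges[(a, b)].add(key)
    return dict(holders), dict(edges)


wallets = {
    "0xA": {"alien", "heat"},
    "0xB": {"alien"},
    "0xC": {"heat", "alien"},
}
holders, edges = social_graph(wallets)
```

Here `holders["alien"]` is the membership of a forum for one movie, and the edge labels could be used to weight how much two wallets have in common.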

The general point is, the technology and the UX to make these things possible is just an arm's length away. The entire space got ravaged by scam artists instead of people trying to build real magical experiences people actually want.


> I've been helping a friend to create a combination of Netflix and DVDs. We package the movie licenses into NFTs (MovieKeys) so when the user signs in with their wallet they can stream all the movies they own.

How do you plan to deal with piracy? I can't see many rights-holders going for a system which doesn't allow them to revoke keys.


This is where the Netflix side of the equation comes in. You can sell or gift your MovieKey like a DVD, but at the end of the day it's a license to stream on our platform.

All content is DRM protected and we'll ban your account from the platform if you pirate or clone your keys ad infinitum.

I'm personally not a fan of this solution but the technology is just not here yet to do this without a centralised control system.

So I get where my friend is coming from, no artist will want to have anything to do with us if people can just pirate the stuff or clone their keys ad infinitum.


> I believe we'll have more of these oh s* moments soon when people will finally realise why we need web3.

Well no, what we need is web0, the original premise of the Internet.

Every protocol was documented in open RFCs, everything is decentralized and everyone is free to use any client and server (or write their own) and everything interoperates. Nobody can own it, there's no "it" to own. That's the only solution to eliminate the otherwise neverending cycle of proprietary platforms followed by their inevitable "oh s* moment".


The world we live in today is very different from the world web0 was created for; it's older than me.

Don't get me wrong, standardised protocols play a very important role in the current world and they will play a bigger role in the future of the internet (Scuttlebutt, IPFS, Matrix..).

It's just not enough; we need a decentralised way to provide consensus on the internet. People won't set up their own servers, and companies that provide these services will always look like a Pareto distribution (FAANG or MAMAA).

In other words, FAANG, or whatever the next incarnation of these companies is, is always incentivised to get rid of interoperability. Selfish reasons aside, after a certain point interop will directly stand in the way of providing better user experiences.

This is why web3 is such an elegant system. It provides a substrate that directly incentivises interoperability. Auth and payments are taken care of; the only thing remaining is custom features, but that gets rapidly commoditised, and then the only frontier remaining is interoperability.

A good example of this is the NFT marketplaces: Auth is just connecting your wallet. Payments are taken care of. Then you build cool features, but everybody else copies you and you copy them, so that's a stalemate. Then you have to be interoperable, like OpenSea going multi chain or Magic Eden supporting Eth.

The key here is, the moment someone else supports interoperability, if you don't, you are put at a large disadvantage. The same kind of dynamics will happen to decentralised social media platforms.


I got burned on this sort of thing for cddb. Hundreds of discs entered in. Suddenly that data was someone else's and they charged for it.


Yeah, most decentralised cloud companies are very silly.


> We need to rethink this model.

This problem is inherent to client/server software, and there are really only three ways to do it:

1. The server side of client/server is centralized and run by corporations

2. The server side is decentralized, meaning everyone has their own server

3. Abandon the server, clients connect directly to each other without a server intermediating

Option 3 would be ideal, but would require significant technological advances - it'll be a looong time before bandwidth is cheap enough that Kim Kardashian can serve photos and movies to all of her fans direct from her phone. Option 1 is what we have now, and is terrible in a variety of ways.

Option 2 would be hard but is not obviously impossible, so still our best bet - sure, it's not viable now, but it sure seems like it could be, if an iPhone's worth of R&D were put into it. I would honestly be amazed if no one at Amazon is working on such a thing, since no one would benefit more than AWS from a future in which a cloud VM becomes one of the things that most middle-class families rent monthly.


Content-addressing together with P2P and extra paid relays for those who really need it. In terms of "superstars" sharing content, if they share their image which is content-addressed and can be fetched from anyone, it's enough that one peer shares it with three others for it to be reliable enough in practice. Content like that is also usually just relevant because of recency, so large swaths of people try to access it within 24h, after that the news cycle already moved on so won't be fetched much after that.
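To make the content-addressing part concrete, here is a minimal sketch. It assumes the address is simply the SHA-256 digest of the bytes, and models peers as plain dicts; real systems use richer address formats and an actual network layer, and the names `address_of` and `fetch` are made up for illustration.

```python
import hashlib


def address_of(data: bytes) -> str:
    """The content address is the SHA-256 hex digest of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


def fetch(address, peers):
    """Ask each peer in turn and accept the first blob whose hash matches.

    Because the address is derived from the content, it does not matter
    which peer answers: a tampered or corrupted copy fails the check.
    """
    for peer in peers:
        data = peer.get(address)
        if data is not None and address_of(data) == address:
            return data
    return None


photo = b"photo bytes"
address = address_of(photo)

honest_peer = {address: photo}                # re-shares the original bytes
tampering_peer = {address: b"something else"}  # serves different bytes under the same address

result = fetch(address, [tampering_peer, honest_peer])
```

The fetch succeeds from the honest peer even though the first peer asked was serving bad data, which is why the original publisher does not need to be the one serving every request.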


> We need to revert to small, focused forums, with less anonymous, more persistent communication, run by people we trust.

You're onto something. Team-BHP [1] is run exactly like this, and it seems to be working.

For those wondering, it's a car-enthusiasts website based in India. They've been around for around 18 odd years I think.

The moderators all have actual dayjobs.

When signing up you have to write a paragraph about why you're really a petrolhead (or dieselhead because Indians love European turbo-diesels :) ), and there's a human on the other end vetting your sign-up application! Plenty, including me, have been rejected at least once. I got in on my 2nd attempt years later.

As a matter of principle they refuse to do car advertisements.

I don't know how well the site is engineered but it works. Check it out. But I suspect most non-Indians (such as most people on HN) wouldn't find it that useful as it's mostly about the Indian car scene.

[1] https://www.team-bhp.com/forum/


I'm more concerned for authors of published works.

Imagine writing a text book with a royalty publishing deal. Your publisher decides they're going to use your book, amongst others, to train an LLM that can answer questions on your subject, and they're not going to pay you anything.

It's a legal gray area and they've got teams of lawyers whereas you do not.


> realize a major risk with the current structure of online communication

I wish this lesson could be learned once and for all.

A long-lived community/repository cannot be built on a proprietary platform owned by some corporation. Full stop, no exception. It can't be done. A corporation will at some point need to maximize profit extraction which will ruin it for everyone. A corporation also won't support a platform forever nor can the entity itself survive forever. A single point of failure can't last forever.

> We need to decentralize our communications

Look at the solutions which have lasted longest. Email & mailing lists, going strong since the 1970s. Completely decentralized, interoperability defined by open standard protocols, anyone can build interoperable clients and servers. Nobody owns it. There's no "it" to own. That is what's needed for long term viability.


I'd argue it's not just forums, but other key parts of the internet. Like Microsoft training its Copilot AI on GitHub code, but not following the licensing of some code it straight up copies and suggests.

I'm kind of curious what is next.


The risk of centralized systems was discussed long ago. The Cathedral and the Bazaar was published in 1999. None of these ideas are new. Everyone who paid any attention knew it was coming.


Incentives in this structure go both ways though, ideally keeping everything in symbiotic balance. Companies that alienate their users tend to not do well shortly after.


I like the idea of decentralized but I'd suggest you don't need to go fully decentralized where every peer has a full copy. Actually I kind of like the Bitcoin approach in that you have the ability to create a full peer, but most people do not. This would allow some decentralization and reduce risk, but not burden everybody with running a full peer.


How did newsgroups work in the days of yore? Depending on which ISP you used, you may or may not have had all of the posts within a group, if they even had the group. I remember paying for access to a specific (can't remember the name) provider that had the most complete listing of newsgroups and had the longest retention of posts. viva a.b.m.a!!


https://en.wikipedia.org/wiki/Network_News_Transfer_Protocol

Originally over UUCP (Unix-to-Unix Copy) and done via dial-ups at night (when the rest of the batch transfers were done - email too, with the old bang path). The two servers would exchange all the batched email and news posts that were routed to the other side.

RFC 977 ( https://www.w3.org/Protocols/rfc977/rfc977 ) has an example of how files are copied between the two systems (section 4.6) including fetching and receiving mail.

Note that not all outbound posts are necessarily of interest to the other server. An IHAVE message could come back with either an "I want it" type response or a "not interested" one.

> The IHAVE command informs the server that the client has an article whose id is <messageid>. If the server desires a copy of that article, it will return a response instructing the client to send the entire article. If the server does not want the article (if, for example, the server already has a copy of it), a response indicating that the article is not wanted will be returned.

That's how some of the moderation worked - your server would say "I don't want anything that came by way of X host" or "not interested in that newsgroup."
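A rough sketch of that exchange from the receiving server's side. The response codes (335, 435, 235, 437) are the ones RFC 977 lists for IHAVE; everything else (the function names, the blocked-host and carried-group sets, the example hosts) is made up for illustration. Note that IHAVE itself only carries the message-id, so the Path and Newsgroups checks can only happen once the article has been sent.

```python
def on_ihave(message_id, known_ids):
    """Reply to 'IHAVE <message-id>': decline duplicates, otherwise ask for the article."""
    if message_id in known_ids:
        return 435  # article not wanted - do not send it
    return 335      # send article to be transferred


def on_article(headers, known_ids, blocked_hosts, carried_groups):
    """Reply after the article arrives, applying local policy to its headers."""
    path_hosts = headers["Path"].split("!")
    groups = [g.strip() for g in headers["Newsgroups"].split(",")]

    # "I don't want anything that came by way of X host"
    if any(host in blocked_hosts for host in path_hosts):
        return 437  # article rejected - do not try again
    # "not interested in that newsgroup"
    if not any(group in carried_groups for group in groups):
        return 437

    known_ids.add(headers["Message-ID"])
    return 235      # article transferred ok


known = {"<1@a.example>"}
blocked = {"spam.example"}
carried = {"news.answers"}

article = {
    "Message-ID": "<2@b.example>",
    "Path": "relay.example!b.example!user",
    "Newsgroups": "news.answers",
}
offer = on_ihave(article["Message-ID"], known)
accepted = on_article(article, known, blocked, carried)
```

Once the article is stored, a second offer of the same message-id from another neighbour gets a 435, which is how duplicates stop propagating.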

One of the amusing things to me (looking back at this), if you're familiar with HTTP response codes, you'll likely get most of the way through the NNTP ones.

   200 server ready - posting allowed
   400 service discontinued
   411 no such news group
   500 command not recognized
I'd also suggest a read of RFC 850 ( https://www.w3.org/Protocols/rfc850/rfc850.html ) for some other background and section 5: The News Propagation Algorithm
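
On the resemblance to HTTP status codes mentioned above: the first digit of an NNTP response carries the broad category, as described in RFC 977. A small sketch (the `classify` helper and the shortened category wording are my own):

```python
# First-digit categories, paraphrased from RFC 977's status response section.
NNTP_CLASSES = {
    "1": "informative message",
    "2": "command ok",
    "3": "command ok so far, send the rest of it",
    "4": "command was correct, but couldn't be performed",
    "5": "command unimplemented, incorrect, or a serious program error",
}


def classify(status_line):
    """Split a response line like '411 no such news group' into (code, category)."""
    code = status_line.strip().split()[0]
    return int(code), NNTP_CLASSES[code[0]]


examples = [
    "200 server ready - posting allowed",
    "400 service discontinued",
    "411 no such news group",
    "500 command not recognized",
]
classified = [classify(line) for line in examples]
```

The 3xx class is where the analogy with HTTP breaks down: in NNTP it means "keep going" (as in the 335 reply to IHAVE) rather than a redirect.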


But how did all of this line up within the federated or not conversation? If each ISP could host their own version, that doesn't sound federated. But who was in control of the "main" source of truth type of version?


There was no "main" source of truth for each version and each ISP could have its own set of posts. The bofh.* hierarchy for example had a very limited distribution and if it was found that one of the sites that provided it was leaking it to the general public they'd collectively cut off sites from being able to post or receive posts until they rectify their configuration.

http://usenet.trigofacile.com/hierarchies/index.py?status=pr... and https://web.archive.org/web/20220815151921/http://bofh.taron...

Sure, an ISP could host its own news.answers site and not accept posts from others for that group nor pass messages from it out to others... but that would be a rather lonely place.

Likewise, one ISP may have a different set of posts that it shows for a news group because of moderation actions (we don't accept any posts greater than 1MB in size because of disk space issues and we don't host any newsgroup named *.binaries).
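
That kind of per-site policy is easy to picture as a filter applied to each incoming post. A minimal sketch, assuming a 1,000,000-byte cutoff for "1MB" and shell-style wildcard matching on the group name (both are my assumptions, as is the `accepts` function):

```python
from fnmatch import fnmatch

MAX_BYTES = 1_000_000                      # assumed meaning of "greater than 1MB"
REFUSED_GROUP_PATTERNS = ["*.binaries"]    # groups this site chooses not to host


def accepts(newsgroup, size_bytes):
    """Return True if this site's policy allows storing the post."""
    if size_bytes > MAX_BYTES:
        return False
    return not any(fnmatch(newsgroup, pattern) for pattern in REFUSED_GROUP_PATTERNS)


small_text_post = accepts("news.answers", 2_000)
binary_group_post = accepts("alt.binaries", 2_000)
oversized_post = accepts("news.answers", 2_000_000)
```

Two sites with different values for these settings end up showing different sets of posts for the same group, without either of them being "wrong".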

Federation comes from exchanging those posts using a common protocol - NNTP.


Because there is no central controller of NNTP feeds or data streams.


That doesn't stop a grumpy admin of a federated instance from doing exactly the same thing.

> run by people we trust

People change, or retire, just like corporation goals change.

Focusing on more independence is not enough. If you want truly unbreakable stuff, the first part of the puzzle is saving the user's handle and identity in a way that can't be removed.

Then finding a way to link that to their content, so that when the place hosting it goes away people can follow to the new place.

Then just have all of that content be signed by that identity so users can verify that it is really that person.

And I can't believe I'm saying that unironically but blockchain might just be the solution for that.

Something like immutable log of:

* user declaring "I'm jeff@example.com, here are my public keys". Servers then validate via DNS record or some .well-known location entry whether the user is allowed to declare they are from @example.com

* user declaring "behold! jeff@example.com stuff is <here>, and <here> and <here> are addresses for various federation systems". Only passes if that request is signed with the above privkey, of course

* user declaring "behold! My new public key is X and Y. And Z key is revoked!"

* user declaring "behold! I am now george.effluent@company.com!" Re-does checks but for the new domain, and users previously subscribed to jeff@example.com get served a redirect

etc.

Then when server admin inevitably goes rogue you can take your posts and subscribers and go somewhere else.

And when the @example.com owner decides "well I'm just going to redirect stuff to ads", you can just change your handle and direct people to the right place, and the other handle is forever taken.
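
A very rough sketch of what such a log could look like, using a plain hash chain rather than an actual blockchain. Signature checks and the DNS/.well-known validation are deliberately left out (the Python standard library has no public-key signatures), so this only shows the append-only structure and how a handle change gets followed. The `IdentityLog` class, entry kinds, and field names are all invented for illustration.

```python
import hashlib
import json


def entry_hash(entry):
    """Hash an entry deterministically; each entry also embeds the previous hash."""
    payload = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


class IdentityLog:
    def __init__(self):
        self.entries = []

    def append(self, kind, handle, data):
        # NOTE: a real system would verify a signature against the handle's
        # current public keys here before accepting the entry.
        prev = entry_hash(self.entries[-1]) if self.entries else None
        self.entries.append({"kind": kind, "handle": handle, "data": data, "prev": prev})

    def verify_chain(self):
        """Check that every entry points at the hash of the one before it."""
        return all(
            self.entries[i]["prev"] == entry_hash(self.entries[i - 1])
            for i in range(1, len(self.entries))
        )

    def resolve(self, handle):
        """Follow rename entries, in log order, to the handle's current name."""
        current = handle
        for entry in self.entries:
            if entry["kind"] == "rename" and entry["handle"] == current:
                current = entry["data"]["new_handle"]
        return current


log = IdentityLog()
log.append("declare", "jeff@example.com", {"public_keys": ["KEY-X"]})
log.append("locations", "jeff@example.com", {"urls": ["https://example.com/jeff"]})
log.append("rename", "jeff@example.com", {"new_handle": "george.effluent@company.com"})

current_handle = log.resolve("jeff@example.com")
chain_ok = log.verify_chain()
```

Editing an earlier entry after the fact changes its hash and breaks the chain, which is the "immutable" part; who gets to append, and how disagreements are settled, is the part that would need the consensus layer.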


> when all they did is provide the platform, and store the data

And all Google did was build a search engine.



Perhaps these sorts of things shouldn't be for-profit enterprises, given the inability of companies to not slaughter the goose that lays the golden eggs.


The problem is defining "these sorts of things". StackOverflow didn't do anything evil, they created a useful website and people flocked to it voluntarily.


The world keeps going dark. What a terrible era.


What irks me about this is that 100% of their data is provided for free, by the community that they have fostered, the people like myself who have answered > 2500 questions[0], and now SO feels hard-done-by by LLMs using all their hard work to create tools like CodeGPT, GitHub Copilot, etc.

Were it really a site for helping developers to improve their skills and increase their productivity through the give-and-take model that SO was, at least once upon a time, SO should perhaps take a deep breath and realise that this might not change a thing apart from causing their contributors to feel like they were never part of it in the first place.

I'm not sure if I've correctly articulated that, but I do find SO's stance to be quite revealing. It feels to me like they're crying foul that ChatGPT and however many other systems out there are stealing their revenue. None of the contributors (apart from the employee ones, I suppose) ever got paid any currency other than high-fives in the form of rep, medals, the gamified stuff, moderation rights, and at certain rep levels some swag in the form of t-shirts and the usual.

I never wanted any money from SO, but the revelation of this attitude has left me feeling, well, a little sad to say the least.

[0]https://stackoverflow.com/users/70393/karim79


I think I see.

The economies of the internet are changing. Now with LLMs being accessible at an exponentially cheaper rate, we're seeing old models crumble and new models rising.

The era of moderated user content is changing drastically and the stalwarts of social networking, or adjacent, services are closing ranks to try and anticipate the change.

Thanks for the insight. I had a vague notion that these new policies were because of some recession or some other basic economic issue. I think a better theory is that the falling economic cost of the LLMs that are becoming available is the reason for all these changes.


Absolutely, and very well put. The costs are pretty close to zero at this point for something that would have been mostly considered, publicly at least, as science fiction just a year ago. It's what the printing press was to the scribe, only way, way more disruptive. I don't know if I'm correct here, but that is what it looks like right now and we're basically still at year zero. I mean, it even got the Google all riled up. The company which basically equates to "The Internet" is having to react, I'm pretty sure that's the first time I've seen such a thing in my life. It goes way beyond their reaction to the iPhone, way beyond the threat of MS exercising their monopoly(ish) power back in the day to try to mitigate the Google threat.

History repeats itself albeit in a fascinating way which I'm still trying to grasp.

I don't blame SO. I think they are acting rationally and as anyone would facing such a threat.

Secretly, I want the internet ad economy of nothing to go away. I won't mention any names (cough taboola cough cough) but that might be the only upside to this tech. Let's see what happens six months or so from now.

Side note, I'm running llama on a really crappy old server and that was enough to convince me that I'll be able to run an LLM on my watch in the near future.


Fair enough, but you were fully aware of this arrangement going in, and chose to participate. SO didn't opt into being training data for ChatGPT, and I doubt they would have given the chance. You may object that SO implicitly did so by making their site available to the public, but the ethics of GAI training data are a new moral gray area that we're still navigating. They at least have something of a case to be made.


> Fair enough, but you were fully aware of this arrangement going in, and chose to participate.

Yep, but that's not the point.

> SO didn't opt into being training data for ChatGPT, and I doubt they would have given the chance.

Neither did Wikipedia (at least to my knowledge). I thought the point of opening up information was to benefit the public, first and foremost, and without hidden terms which state something along the lines of "it's free and open information built by the community, but when something disrupts our ads-driven business model we make it unfree".

It would have been nice if they had at least allowed their contributors to vote on this, or have some sort of a say.


> "it's free and open information built by the community, but when something disrupts our ads-driven business model and we make it unfree"

I feel the same, that it would be much better today if that was the agreement we all entered into. (Though, I doubt anyone would have read it anyway so I’m not sure it would have made any difference.)

But it seems that no-one saw this disruption coming, so it wasn’t possible to plan ahead for this outcome. Call it complacency, call it ignorance, it’s too late to plan now, so we get this kind of reaction instead


> None of the contributors (apart from the employee ones, I suppose) ever got paid any currency other than high-fives in the form of rep, medals, the gamified stuff, moderation rights, and at certain rep levels some swag in the form of t-shirts and the usual.

I would love to see some kind of identity and reputation system where the "high-fives in the form of rep" could follow people across communities. It may not feel like much compensation if you've contributed over 2500 answers, but having reputation gained in your area of expertise grant you a high level of trust to interact in other communities could be valuable, at least in my opinion.

Assuming they're making this move to protect against AI / LLMs, I think SO is in an impossible situation here. When all the ChatGPT hype started, one of my first questions was "what happens to the incentive for contributors and creators?" Why would I want to contribute on a platform if I know an AI model is going to come in, take my contribution, and regurgitate it back to the masses in a way that I can't control?

Even if I get some attribution from the AI/LLM, do I even want it? If the LLM is blending content from multiple sources, which changes the context and presentation I put effort into, is the quality going to be high enough to match what I strive to achieve for myself when I'm trying to build a reputation as a high quality contributor? What if the AI is hallucinating objectively poor quality content and giving me partial attribution?

So, for me, part of the social contract with SO is that I provide answers, but I get to control the entire interaction; the context, the presentation (mark up), defending criticism in the comments, etc.. In addition to that, since the entire conversation happens inline, I can be corrected by someone even more knowledgeable than me and use that feedback for self improvement.

I think AI is going to be disruptive and the whole idea, for me anyway, behind disruption is that you break an existing system and then everyone is free to take a shot at claiming part of the new gold rush that occurs while trying to build the replacement. The problem with AI is that it's going to break a lot of services that do a good job of serving the community and shouldn't be broken. SO is a great example of a healthy community that doesn't need disruption, but the massive amount of high quality, curated content is going to make them a prime target for LLM training.

Personally I think the only solution is for "noai" variants of popular open source licenses so contributors have the ability to make it clear they don't want to contribute to AI/LLM companies. If SO had an option to flag contributions as CC-BY-SA-NOAI, I'd enable it on my stuff going forward.


> I would love to see some kind of identity and reputation system where the "high-fives in the form of rep" could follow people across communities. It may not feel like much compensation if you've contributed over 2500 answers, but having reputation gained in your area of expertise grant you a high level of trust to interact in other communities could be valuable, at least in my opinion.

Honestly I think that's an excellent idea - a rep "passport" of sorts which gains you a certain level of trust within certain communities.

> Assuming they're making this move to protect against AI / LLMs, I think SO is in an impossible situation here. When all the ChatGPT hype started, one of my first questions was "what happens to the incentive for contributors and creators?" Why would I want to contribute on a platform if I know an AI model is going to come in, take my contribution, and regurgitate it back to the masses in a way that I can't control?

Sadly, I think this is an unpreventable outcome of what is happening right now. I don't think anyone will have any control over this, at all. We can only hope it will never be the case that being active (actual human contributors) becomes a worthless pursuit.

> Even if I get some attribution from the AI/LLM, do I even want it? If the LLM is blending content from multiple sources, which changes the context and presentation I put effort into, is the quality going to be high enough to match what I strive to achieve for myself when I'm trying to build a reputation as a high quality contributor? What if the AI is hallucinating objectively poor quality content and giving me partial attribution?

Another excellent point, the prospect of this being possible today - AI attribution from a hallucinated version of a human's objective contribution sounds freaking terrifying to me. Not a world I want to live in, to be honest.

> I think AI is going to be disruptive and the whole idea, for me anyway, behind disruption is that you break an existing system and then everyone is free to take a shot at claiming part of the new gold rush that occurs while trying to build the replacement. The problem with AI is that it's going to break a lot of services that do a good job of serving the community and shouldn't be broken. SO is a great example of a healthy community that doesn't need disruption, but the massive amount of high quality, curated content is going to make them a prime target for LLM training.

As will every single human-created/curated content-source, IMHO. I think that "quality" will be really, really hard to objectively measure in the near future as the whole world of digital information becomes tainted with applied statistical models which can do a reasonably good job of predicting what people perceive to be high-quality reasoning, answers, content. I like the idea of underground speakeasies where there's no wifi, just humans.

> Personally I think the only solution is for "noai" variants of popular open source licenses so contributors have the ability to make it clear they don't want to contribute to AI/LLM companies. If SO had an option to flag contributions as CC-BY-SA-NOAI, I'd enable it on my stuff going forward.

That would be great, but I'm pretty sure that no LLM corporation would care about those flags, even with strict regulations in place from governments.


> I think that "quality" will be really, really hard to objectively measure in the near future as the whole world of digital information becomes tainted with applied statistical models which can do a reasonably good job of predicting what people perceive to be high-quality reasoning, answers, content.

That's the scariest thing I've heard today. Lol.

Even now, I think the proper use of grammar and spelling alongside assertive language has a lot of people fooled into thinking LLMs are actually intelligent. It's hard to explain to people how the LLMs know everything and understand nothing.

I've been bullish on the idea of using domains as identity for a long time. I think by using them as a universal ID you could build reputation and trust across the internet and that helps everyone a lot when trying to assess the reliability of information. If you add in attestations for factual info it gets even more interesting. Ex: GitHub attests user @john.example.com has 1000 commits to the XYZ project. Suddenly you have a more reliable way of ranking John's comments about XYZ as a topic, regardless of where they show up (as long as those identities are validated somehow).

If you look at that as "ranking people" and judge it in the context of being a valuable piece of input for LLMs/AIs, the big push for "better" identity systems like "Passwordless" starts to look like a hell of a coincidence. My cynical side wonders if we'll see a push for validated (via government ID) identity systems. Something as simple as a "real human from Canada" tag would provide immense value for AI training (and marketing).

No matter what, I think AI is going to cause changes in the way online identity and reputation work. I think if it evolves into some kind of system with domains as identity it'll be decentralized and provide long term benefit. I think if we see something with verified IDs controlled by the current big tech companies it could devolve into something disappointing or even detrimental for the average user.


It's unfortunate we are seeing all of these data platforms get locked off, because this is not going to affect AI development from big companies, it's only going to affect the ability for individuals to run AI development of any form in their home.

I hope the data that has been found so far is going to be big enough going forward, but it's incredibly unfortunate that this is happening.

I hope all the people making these decisions wake up with a bad headache and severe heartburn tomorrow.


IANAL, but I'm curious:

Suppose that deep-pocketed AI companies were paying Reddit, Stack Overflow, etc. to make it harder for other AI upstarts to access those data. I.e., to build a mote by denying competitors access to previously accessible data sets.

Would that violate antitrust laws in various major markets?


Nitpick, also because the contrast is kind of funny:

mote: a small particle, speck, atom, "mote of dust"

moat: a deep ditch, often filled with water, as a first line of defence around a castle.


Hopefully this comment won't be demoated by the algorithm - it truly holds water on its own!


You are only helping it with puns and we know that puns are the gateway to consciousness! They are like the corpus callosum of language, serving as the bridge that spans the moat between the castles of wit and creativity.


Given that this seems to happen all the time without antitrust issues it probably wouldn't, even though I feel like it should.

What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.


> What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

Couldn't that be accomplished by a law or ruling that using something for training AI doesn't exempt you from having to follow its license? OpenAI is already in blatant violation of both the "BY" and "SA" parts of the existing license.


Arguably, a model created by training on a corpus of data is a derived work of that corpus.

Let's say I take a collection of images and use a program to compress them. When decompressed, the images are close to, but not exactly the same as the originals. Despite being in a different format, and despite not being exactly the same as the originals, the copyright to the compressed images is still held by whoever previously held it.

If I take the collection of images from earlier and train a diffusion model based on it, I'm essentially just compressing it a different way. With the right prompt, you can get out something very similar to what you put in.


By this logic, isn't remembering something in your brain also a derived work? But that would not make any sense to protect until you create and distribute something based on that memory. The same logic should be applied to this.


If you remember it from your brain and perform it live, that’s perfectly fine. So there’s a line to be drawn somewhere and I don’t think it’s super clear cut in most cases.


No, according to copyright if you remember something from your brain and perform it live you are very much in violation.

If you remember it and make something that is a distinct work, something that maybe steals the idea without reproducing any of its elements, that's never been considered under copyright.

I think that's going to be the litmus test for these AIs. If you can get them to produce output that is distinct from anything else, it's not going to be a copyright violation because it's not a copy of anything.


> If I take the collection of images from earlier and train a diffusion model based on it, I'm essentially just compressing it a different way

Not really. If diffusion models were compression they'd be so lossy as to be totally worthless


> What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

inherently not possible as then it would not be "open" to begin with.


Open except you have to pay if it gets big enough seems perfectly reasonable to me.

I understand the idea, it's not truly open in that case, but so long as the ability to build new things on it and prosper from it is preserved I'm alright.

The key is that it's not doing something like trying to restrict you from using it in a certain way, only requiring you give a fair share of profits.

This was, fun fact, the original purpose of patents. They weren't designed to keep things closed and owned by individuals, they were designed to allow people to freely share and make a profit so that ideas could be built on by each other. The patent system is turned into this corrupted terrible mess where things are almost never shared through licensing or payment, and it's just a way to build monopolistic enterprises nowadays.

An open source system that allows for this sort of payment would also allow for many, many more things to be open, where currently bad actors will take that work, build on it, and just never pay you back for it.


There are ways to require payment for some uses of things that are legitimately open. As an example, consider the practice of selling exceptions to the GPL, as is done for Qt.


> As an example, consider the practice of selling exceptions to the GPL, as is done for Qt.

There are people who do not consider that "open".

See the whole debacle about what exactly constitutes "open source"


> There are people who do not consider that "open".

Who? Even Richard Stallman is okay with what they do: https://www.fsf.org/blogs/rms/selling-exceptions


Richard Stallman isn't necessarily the sole authority on such things. Consider Creative Commons vs. AGPL. Is CockroachDB "open"? etc.

In any case the software world has changed drastically since that article was published.


Those all seem black and white. Creative Commons' NC and ND licenses are not open, but the rest are. The AGPL is open. CockroachDB is not.


you are not understanding. why is it not open? Who is the authority of "open". Why is CockroachDB not open? I can see their source on GitHub.

"open" is not like 1+1=2. ultimately it is arbitrary. one definition of open is "to make available", by that definition all of them, including CockroachDB are "open".

open does not necessarily mean you can use it, just like how an open door does not necessarily mean you can enter the house.

in any case, we can agree to disagree.


You're confusing "open" with "visible".


I literally quoted a definition from Webster for "open". let's just stop this pedantry. I'm going to go back to "Open"AI.


"Open" is a pretty vague word which could mean all sorts of things.

"Open source" is defined by the Open Source definition according to the OSI [1]. In saying that, I realize that every couple of years somebody tries to claim that their understanding of the term "open source" should trump the one the community has settled on. I personally am not ready to acquiesce to this semantic drift, at least, not yet.

[1] - https://opensource.org/osd/


It's not even real semantic drift. It's basically the astroturfing version of it. The people trying to change the meaning are doing so because they want to capitalize on its good name without meeting the true definition.


Does this prevent any external contributions in GPL?


It means external contributors need to agree to a CLA for their changes to be incorporated into upstream Qt.


> It's unfortunate we are seeing all of these data platforms get locked off

Are there any AGPL-like licenses that address this?


Oof. This was one of the big central tenets of SO, the reason it wasn't Experts Exchange 2.0- the escrow of the community's contributions.


If you knew one simple trick all the answers on Experts Exchange were at least freely available.

That trick was to simply scroll past the paywall. They had all the answers exposed so that google would index them. It was hilarious and silly.


Back in the time when Google didn't play favorites on companies not following their terms of service.


As a reminder, all the SE sites have content under a Creative Commons, By Attribution, Share Alike license, allowing for, among other things, commercial re-use [0] [1].

Yes, it sucks that the SE sites are getting more draconian about allowing access to their content, but the SE sites are well insulated against it completely disappearing precisely because they're under a libre/free license. Note that neither Reddit [2] nor, I might add, HN [3] has any such licensing terms that allow for commercial reuse.

Decentralization might be a viable option in the future, but for right now, centralized sites are the norm and the way to protect the content from disappearing is to put it under libre/free licensing. Note that Wikipedia is centralized and it would certainly be a tragedy if they became more draconian about sharing their data, but the content itself is and will be available to the general public, effectively the "commons", because of the licensing terms.

To me, this is yet another reminder of why we need to future proof with libre/free/open licensing terms. Or reform copyright, but I don't see that happening within my lifetime.

[0] https://stackoverflow.com/legal/terms-of-service/public#lice...

[1] https://creativecommons.org/licenses/by-sa/4.0/

[2] https://www.redditinc.com/policies/developer-terms#text-cont...

[3] https://www.ycombinator.com/legal/#tou


The change to 4.0 was done without permission according to many in the community.

"Stack Exchange doesn't have the right to unilaterally change the license of previously submitted content." - https://meta.stackexchange.com/questions/333089/stack-exchan...


Older posts are under the older CC 3.0 license, newer posts under the CC 4.0.

https://meta.stackexchange.com/questions/344491/an-update-on...


Should petition Daniel Gackle et al to CC-BY user-generated content on HN.


I haven't studied the legalese, but I assume if they put all answers behind a paywall tomorrow nothing can be done. I don't think the license says they must share.


The intent is to prevent exactly this type of possibility. They may be able to put it behind a paywall and copyright future work but not work that's already been published.

Obviously not legal precedent, but there is some discussion on the matter by the Creative Commons organization [0].

[0] https://creativecommons.org/faq/#what-happens-if-the-author-...


Everyone wants to be "smart" by web scraping, harvesting data, building models. No one bothers to build and sustain platforms where quality content can be crowdsourced. This parasitic arrangement is slowly starting a new era of the internet. The question is how long until existing data dumps become outdated and fall into irrelevance.


We just needed enough data to awaken the mega mind, now we may rest and the mega mind shall bring an era of peace, prosperity, and scientific achievement.

Praise the mega mind.


All hail the mega mind.


Yikes. Reddit. Stack Overflow. It's all going south.

Maybe we won't even have to wait for LLMs to destroy the web we used to know.


This is LLMs destroying the web we used to know.

I would be willing to bet that the driving force behind the decision was to make it less trivial for LLMs to say "the data was already there under an open license, so we legally undercut stack overflow".


The fact that everyone is hoarding data because they think there is a gold rush afoot is obvious. Everyone with loads of data is clamping down, hoping they can get a cut of those AI VC dollars. Except for Wikipedia at least.

But let's be real about the morality here: Stack Overflow is a badge-powered mechanical Turk. It uses 100% unpaid labor to go and search Google for answers and post them on SO, providing a "service"[1]. For it to moralize about the ownership or sanctity of data is irony.

[1] - There are exceptions, obviously. There are true experts who wander the virtual halls of StackOverflow and dole out wisdom. But overwhelmingly it is clear that answers primarily come from people who rush to Google and then copy/paste from blogs and tech papers. And while Stack Overflow dumps are CC because that's the agreement that it made with contributors, a lot of the content on the site was ripped without attribution and in defiance of IP. So...maybe not too many tears for SO.


> There are exceptions, obviously. There are true experts who wander the virtual halls of StackOverflow and dole out wisdom.

This is what makes SO valuable for me. Every year it seems like "trust but verify" requires a little more emphasis on the verify part, so it's extremely valuable when I run into a true expert where I know I can take their input at face value and rely on it.

I think the winners of the AI gold rush will be the ones that figure out how to help users assess the trust they should be putting in the information being surfaced. The problem I see with the current ChatGPT, etc. systems is that they seem to treat visibility and popularity as credibility and expertise when that's not the case. In my experience, the real experts don't say anything unless they can add to the conversation and there's never a lack of low quality information from uninformed participants.

For example, via ChatGPT...

Q: How do I protect against the scrub of death with ZFS?

A: The term "scrub of death" is commonly used to refer to a situation in ZFS where a data scrub operation can lead to catastrophic data loss due to undetected errors or issues. While ZFS is designed to provide data integrity and protection, it is still important to take certain precautions to minimize the risk of encountering such a situation. Here are some steps you can take to protect against the scrub of death with ZFS:

Except it's not a real thing [1]. For bonus fun, read the comment right after Ahrens'.

I don't understand why SO doesn't lean into that. Anyone can train an LLM on the raw data, but SO has the information needed to do a better job of ranking the quality of the inputs, so wouldn't they be able to build an LLM that's significantly better than anyone else with the same raw data? Understanding the quality and reliability of an answer is far more important to me than getting an answer.

What's more frustrating than getting an answer on a programming question and taking hours to figure out that it was complete BS and doesn't work as described?

I don't know much about LLMs, but, if I were SO, I'd be figuring out how to lock down the ranking information as quickly as possible because that's where the value is. The ranking and acceptance of answers, alongside tags, overall user rank, participation frequency, etc. should mean that SO has a significant advantage when it comes to ranking and weighting the input data, right?

I want the input from subject matter experts to count the most and SO has the best data set to provide that. I don't see the point of locking down the content when the real value is in the ranking. It's odd that SO doesn't see that considering the entire network is modelled on that idea. Maybe they do and there are bigger changes coming down the pipe.

I think the real debates are going to come in the future if SO releases a paid LLM product that's trained on community contributed content and rankings.

1. https://arstechnica.com/civis/threads/ars-walkthrough-using-...


Time to adopt Nostr as future-proof path.


Without movement on this [1] I can't see adoption.

[1] https://github.com/nostr-protocol/nostr/issues/97


So, so much decentralized tech never gets adoption due to a lack of an identity management layer that nobody wants to build because it can’t be perfectly decentralized and have the account recovery features that 99% of regular folks need. This is an example where perfect is the enemy, nemesis even, of good.

Someone should build an identity system that is optionally centralized or federated (if you like your key custody, you can keep it), migrateable and that ONLY handles identity. That will still be orders of magnitude better than relying on Google, Twitter and friends, simply because there won’t be a glaring conflict of interest of platform rent-seeking.

Moreover, anyone who wants to build decentralized/federated apps doesn’t have to reinvent the wheel poorly. It’s so sad to see project after project fading into the ether because people can’t fucking sign in in a reasonable way.

At least with crypto currency, there’s a somewhat strong argument for individual key custody, but I’m not talking about protecting $20M while on the run from the feds, I’m talking about afternoon shitposting with friends and strangers.


Ahaha.. is this a serious post? I'll take the bait.

If you want to shitpost with friends and strangers, then there exists no realistic purpose for identity management, since the main goal is to remain anonymous and true anonymity comes by default on nostr.

If you do want to protect your identity, then protect your keys. In case you missed the last few months, there are browser extensions that do not grant access to private keys, similar to metamask and other crypto wallets.

All of those are battle-proven technologies with several years of practice and success in keeping private keys private. You should know that, the question is why don't you know that, or more frankly why won't you know that.


I think you’re misunderstanding my point. I’m not saying key custody is infeasible. I’m saying current solutions aren’t working for average people, ie non-techies who don’t even know what a private key is. Do you disagree with this?

If you agree with this premise, that also rules out browser extensions as a universal solution because most users are on mobile. They also have multiple devices and somewhat frequently forget their credentials. Nostr is amazing and if you read my history I have only good things to say about it. But that doesn’t mean that the UX works for everyone, and I simply argue that key custody is a recurring issue in practice.

(Btw, by shitpost, I mean any random discussion such as the ones here on HN or Reddit. Not 4chan.)


Please notice that the large majority of metamask users are complete non-techies. The rules of this mechanism are explained from the beginning: keep your keys somewhere safe. This hasn't been an issue for years now.

I'm sure you can agree that having someone above users with access to their private keys is a serious failure point to user privacy. Exactly the reason why nostr remains strongly out of reach when compared to government controlled media.


That's your own personal opinion, which is contrary to the growth metrics.

Nostr community is about true freedom to write just about anything.


How does nostr handle determined activists with establishment backing? What if a nostr user posts information an activist wants censored and the activist goes around threatening relays, their hosting providers, anyone connected to people operating the relay?


There is a quintessential feature that differentiates nostr from any other social network of the past decades: Your private key proves that you wrote a specific text.

Assuming the scenario where a user is basically chased away from major western relays, he can still continue writing new texts with his private key. As long as some relay located somewhere (e.g. China, Panama, Moon, ..) accepts his texts, then others will still be able to read and know it was from that specific person.

There are other ways to censor a person/nostr: 1) Block the whole traffic related to nostr relays at provider level inside a country. 2) Make it illegal to use nostr since it is "unregulated communication media". 3) Spam the network with hideous/horrible content, and then market the protocol as "darknet" only used by criminals or mentally ill people.

Any of these tactics are used often. The thing about nostr is that texts don't live just on relays and anyone can easily archive them. This means that history by specific users can be kept and safeguarded for the future. That is mostly the reason why I like it so much. We only know detailed history thanks to the records that survived until our days. Any closed platform eventually closes down their data (e.g. reddit, stackoverflow, twitter, etc) but in practice this is the same as denying access to our collective online history. Nostr will survive woketivism or any other *isms.
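
To make the "your key proves you wrote it" part a bit more concrete, here is a rough sketch of how an event gets its ID, as far as I understand NIP-01: the ID is the SHA-256 of a compact JSON array containing the author's public key, a timestamp, the kind, the tags, and the content. The author then signs that ID with a Schnorr signature over secp256k1; that step is left out here because it needs a third-party library. The function name and the sample values are made up for illustration, and Python's JSON escaping may differ from the spec for some unusual control characters.

```python
import hashlib
import json


def nostr_event_id(pubkey_hex, created_at, kind, tags, content):
    """Compute an event ID the way NIP-01 describes it (sketch, unsigned)."""
    # Serialize [0, pubkey, created_at, kind, tags, content] with no extra
    # whitespace and without escaping non-ASCII characters.
    payload = [0, pubkey_hex, created_at, kind, tags, content]
    serialized = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Hypothetical values: a placeholder public key and a fixed timestamp.
pubkey = "ab" * 32
event_id = nostr_event_id(pubkey, 1686300000, 1, [], "hello from any relay")
tampered_id = nostr_event_id(pubkey, 1686300000, 1, [], "hello from any relay!")
print(event_id != tampered_id)
```

Because the ID covers both the public key and the content, any relay or archive can recompute it and check the signature without trusting whoever handed over the copy.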


Something else will come up, until the endless quest for advertising revenue catches up and ruins that as well.


Really strange comment.

> I was recently impacted by the Company's layoff.

> I'm offering what I can to uphold the Company's values of Transparency & being Community-centric.

I wouldn't offer transparency about a former employer's internal operations. Let them respond or at least ping a current employee to respond.


I'm reading it as the ex-employee thinking that the company might not want to respond, and choosing to do so despite that, on the grounds that it should be acceptable to do so (i.e. the ex-employee couldn't be publicly "scolded" for it without the company publicly displaying not following their values)


There may be an NDA involved. And staying on good terms with previous employers (or at least not burning any bridges) is generally a good idea regardless.


For any curious, the original announcement of the data dump - https://stackoverflow.blog/2009/06/04/stack-overflow-creativ...


"Just sorta stating the obvious here, but the timing of this is unbelievably terrible; I actually can't fathom a worse time for this call to be made than in light of this week. –zcoop98"

Or, it's exactly the best time to do it. Doing it now allows your news to get blended in with the Reddit news. Doing it later after Reddit chatter settles down means all of the chatter is directed squarely at you.


I don't think they are referring to the Reddit news, I think they are referring to the ongoing disagreement between the SO Inc staff and volunteer moderators and resulting strike over the new policy on generated content: https://meta.stackexchange.com/questions/389811/moderation-s...


It also means people are more motivated to build a replacement than just by the timewasting reddit being unavailable.


Replace them both with a model somewhat like Wikipedia, open the content for the world, and get a cut of profits from the corporations that want to use the data to train on it.


After I made that comment, I stopped for a short time to think what I would do. My conclusion was that it is much better done with something like Mastodon (or even Mastodon itself) than with a web site.


And then community projects subscribe to the feeds and index and make searchable?

I could see basically structured toots representing questions and then ... oh man, did you just nerd snipe this?

Are you saying that it should be managed by Mastodon the org, or fediverse the technology?


I meant the technology. The one downside that I see is that AFAIK, there is poor support for editing questions and answers. If something like this is created, there could be threads of patches making any edit, but that requires tooling support.

(And yeah, I'm purposefully trying to nerd-snipe people here :)


You have to paint a nicer starting canvas to really lure in the nerds, they can add one pretty little bush and feel like they just made the Mona Lisa. Extra points if there are two natural yet divisive solutions. Truly a tire-fire of flame war in the making. Events! gRPC! Both! Kafka! Rabbit!


I only hope this and the Reddit slowmo-trainwreck-in-progress sensitize more people to the value of the data they contribute and how it is appropriated by the platforms.


Each contribution, and most individual contributors, are worthless though. They only have value in aggregate.


> I mention the timing, as this change long pre-dated the current moderator strike and related policy changes.

A mod strike? I hadn't heard about this.

https://meta.stackexchange.com/questions/389811/moderation-s...



Sad, I had a lot of fun with it making StackRoboflow[1] (This Question Does Not Exist) a few years ago.

The models (AWD-LSTM and GPT-2) weren't good enough back then to usefully answer programming questions -- but it's super cool to see that vision realized with GPT-4 and other modern LLMs.

[1] https://stackroboflow.com


Yesterday's data dumps/APIs fostered community, new market/channel discoveries & low risk acquisitions.

Today's data dumps/APIs foster easier access to train ML/AI models to put them on the path to irrelevance. They're pulling out all stops like there's no tmw, and there might not be, if they're willing to shake things up like this.


Stack Overflow and Reddit want money for AIs to train on their data which is why they made these changes, so which companies are next? Could HN get crappier in order to milk AI money for its valuable comments? I guess Wikipedia at least can't do jack to get AI cash for its valuable data.


This data dump was part of the compact between users (who created the content) and the platform (who hosts it). The data dump was insurance against the company going the CDDB/Gracenote, Experts Exchange or Quora route and either paywalling or even just gating that content. We don't need a repeat of that.

If the data dump is gone, that compact is broken and honestly it's time to stop contributing to SO.


Wild guess: somebody came up with a business plan to monetize all that data for future LLM usage.


twitter, reddit, stack overflow... the digital version of burning the library of alexandria

it was always a broken system built on dodgy contracts, but it is still sad to see how unceremoniously everything implodes

will any lessons be learned? unlikely.


All of our institutions are headed by the likes of Caligula, Nero and Elagabalus so it's only ever a matter of time before the charlatans in charge set it on fire themselves. Never count on anything lasting longer than a year. Motivations can change overnight.

With one exception, there are no instances of anything crowdsourced/community-supported that aren't later paywalled, gatekept or destroyed to prevent exfiltration. It's always an advance-fee scheme. The longer the duration of time, the more the terms are corrupted until the people expecting delivery on the original promise end up being told "what promise?" (The exception is piracy sites. Ironically the illegal nature of the activity seems to keep the owners honest.)

Never work for free, for any promise of long-term future payout, "exposure," or any other bullshit. When they fuck you over--and they will, because you made it so easy--you'll be too broke (and broken) to sue. Every inch, every day you give them is just more time for them to find ways to cheat you.

(You'll learn this lesson the hardest way in making concessions to a high-conflict ex-spouse armed with a 50/50 child custody agreement...they get you to agree to let the kid stay with them during your scheduled time, more and more, until they can prove the kid is basically with them 100% of the time-- then you get slapped with a vastly-increased child support order. You can't claw anything back because they have commitments now. Thus, you get cheated out of both your relationship and your money.)


The answer mentions a layoff. I haven't caught wind of that. What happened?



Stackoverflow has over 500 employees ?!


I don't know why people are so often surprised about the number of employees in a company. My company has half the number of employees, we're not remotely as relevant as SO.


Why is that surprising?


It's basically a wiki with well under a terabyte of data total and a billion requests a month (modest load in the grand scheme of web apps). It runs on fewer than 10 servers (https://www.datacenterdynamics.com/en/news/stack-overflow-st...). It's kind of bonkers to have hundreds of engineers supporting a handful of servers.
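
For a sense of scale, a back-of-the-envelope calculation with those figures. It assumes a 30-day month and uses 10 servers as an upper bound for "fewer than 10", and it only gives averages, so real peaks would be higher.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def average_rps(requests_per_month, servers=1):
    """Average requests per second, optionally split evenly across servers."""
    return requests_per_month / SECONDS_PER_MONTH / servers


total = average_rps(1_000_000_000)
per_server = average_rps(1_000_000_000, servers=10)
print(f"{total:.0f} req/s overall, {per_server:.0f} req/s per server")
# prints "386 req/s overall, 39 req/s per server"
```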


- Sales people, account managers, etc. for their ads business

- Sales people, account managers, etc. for their Teams product.

- Sales people, account managers, etc. for their Enterprise self-hosted product.

- Sales people, account managers, etc. for sponsored tags, collectives, etc.

- Support for the above (and the public Stack Exchange sites).

- Engineers for the above (and the public Stack Exchange sites).

- Community managers (who, among other things, fight abuse).

It all adds up. From what I remember most people working for SO weren't engineers, not even years ago (many were involved with the jobs site back then). There used to be a "Our Team" page which listed everyone who worked for SO, but it seems that's gone now.


500 employees, not 500 engineers.


Is this such a big problem? You could still scrape all the data, or not?


Yeah, this is a baffling part of this—you can still scrape (for now, I guess), if you have the time and effort to do so. Disabling the dump makes it harder only if you have e.g. a shoestring budget.

For example, my hobby search engine got started because I found out about these dumps and decided it would be an interesting challenge to try to work with them[1]. If I’d needed to build a scraper first the project would never have gotten off the ground.

[1]: https://search.feep.dev/blog/post/2021-09-04-stackexchange
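
To illustrate why the dumps are so much easier to work with than a scraper, here is a minimal sketch of streaming questions out of a `Posts.xml` file. It is not taken from the project above. It assumes the dump stores each post as a `<row>` element with attributes such as `Id`, `PostTypeId` (1 for questions), `Score`, and `Title`, which is how I remember the format; the sample data and the `iter_questions` helper are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for a real Posts.xml file from a dump.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Title="How do I parse XML?" Body="&lt;p&gt;Question body&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" Body="&lt;p&gt;Answer body&lt;/p&gt;" />
</posts>
"""


def iter_questions(source):
    """Yield (id, title, score) for each question row, one element at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            yield int(elem.get("Id")), elem.get("Title"), int(elem.get("Score", "0"))
        # Drop the element's attributes and children once it has been handled.
        elem.clear()


questions = list(iter_questions(io.BytesIO(SAMPLE)))
print(questions)
```

With a real dump you would pass an open file instead of the `io.BytesIO` wrapper; there is no rate limiting, HTML parsing, or retry logic to write, which is most of what a scraper ends up being.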


You can scrape the data today. If they lock the access (a little bit of a false dichotomy, they could limit the access as well, but to simplify the argument), there will be nothing to scrape - you will still be able to access the old dumps, however.


For those who downvote me, can you explain? I'm really curious about the answer to the question. I don't really understand what the problem is. The data stays under the same licence anyway, so scraping it shouldn't really be any issue.


This is an internet ecosystem issue that is simplified to thoughtless bashing of supposedly evil companies. Yes, these actions are clumsy and user-hostile but consider the big picture.

We have companies like Reddit and Stackoverflow not being profitable, despite being wildly successful in usage and internet mind-share. Neither of these companies is particularly over-staffed.

We post our "valuable" contributions there. So valuable that nobody wants to pay for it (structurally). We block ads. AI does the daylight robbery. We expect free APIs and data dumps.

Perhaps this is our wake-up call. The limitations of the "free" model and companies running at a loss for 15 years straight. It was always an anomaly.


Not sure if this is relevant, but the Hacker News BigQuery dataset has also not been updated since Nov 2022: https://issuetracker.google.com/issues/261579123


As a silver lining, perhaps the cash-grab, zero value-added clones will no longer clutter our google results?


I wonder if the execs at SO figure that OpenAI fed the CC data dump directly to ChatGPT and decided maybe they didn't want to make it quite so easy for them to do it again? Maybe they want to make OpenAI pay for it, or at least attach the license-required attribution.


I guess this is a defensive move against being inadvertently used for ChatGPT model.


I think you mean more like a "thieves have stolen the horse! quick, close the barn door"


Cat's out of the bag already with that one.


There is a time when the bill comes due for any "free" service.


The friends you thought you had weren't


[flagged]


I mean it's the same with HN. I'm here for the comments. I could get articles on another news aggregator.


Not surprising - why would any content driven business want all of their stuff to be vacuumed up for free?


It's not really their stuff though. The content comes from the users. And part of the reason users are willing to contribute to SE is because of the licensing model and the fact that the data is available outside of SE. Obviously this is more important to some users than others, and probably some percentage don't care about it at all. It's hard to say what those percentages actually look like though.

IF (and it's definitely an "IF") this is an intentional and permanent change by SE management, they are fundamentally changing the basic understanding between users and SE, and they have to understand that some subset of users are likely to quit using SE in response. Again, it's hard to say how many. Maybe enough to have a material impact, or maybe not. That would be the gamble they'd be taking though.


> they are fundamentally changing the basic understanding between users and SE

Given the way they've communicated in the last 2 weeks or so, this seems pretty clear. Before we had employees engaging as real human beings all over the place, and you were talking to Jon, Tim, Robert, Shog, etc. and not "Mr. Ericson, title such-and-such, representing Stack Exchange Inc."

Now all we have is a bunch of announcements, with no discussion, engagement, or even a recognition that anything is even being read. It feels like pissing in the wind – disagreement is one thing, reasonable people can disagree, but ignoring is so much worse; it's like you're not even taken seriously.

Stack Exchange has gone through various phases (e.g. the "Jeff era" was different from the "stagnation era" that followed after he left), but the implied social contract was always that the community would offer their spare time and in return they would get a platform and some voice in how that platform is run. There have certainly been moments of friction in this relationship, but the basics of it never changed until now (not even with the whole debacle surrounding the firing of a moderator a few years back).


Before the release of the LLMs where everyone could run it, the amount of slurping was probably manageable. Now that anyone can train an LLM and SO/SE/Reddit/etc are obvious places to go for training data, I can see where the systems would easily be overwhelmed. People contribute to SO/SE because it's a common place to go for community help. Training a for profit chatbot from the community data that wasn't provided to the chatbot by the community seems to break the spirit in which the contributions were made. I'm on the fence of the argument, but most definitely in the direction of not liking all of the model training for free.


A lot of people were only willing to contribute to StackOverflow because of the CC licensing, trusting the knowledge wouldn't be locked up. As a business that depends on vast amounts of volunteer effort they need to balance providing a site where people are willing to contribute against making as much money as they can.


I wonder how many of those contributors, if re-consulted, would sign up to having their contributions used to train a for-profit LLM though?

I certainly didn’t sweat it out helping people on SO to pay for Sam Altman’s fucking swimming pool.


I mean they would just scrape it if there's no data dump. It just makes it harder for the small guys. They probably scraped and are scraping HackerNews.

Generative AI doesn't follow copyright or even explicit software licenses as we have seen in AI art with human signatures and Microsoft Copilot.


They definitely scraped HN: one of my favorite ways to change the style of an LLM is to ask it to rephrase in the style of a snarky HN comment.


They probably didn’t even have to scrape it…

Sam Altman is still chairman of the board of directors of Y Combinator.


There was always the possibility of some sort of aggregator/other front end sitting on top of the SO data. We just didn't know exactly what a successful one would look like until relatively recently. I always limited how much I contributed based on that as likely outcome. Discontinuing the data dump is a much bigger deal to me and completely changes the value proposition of their various sites.


For what it's worth, as someone who has put a lot of writing online, I'm not bothered by having my writing included in the training sets of these LLMs. I write because I want to share knowledge, and it isn't important whether people get the knowledge directly from me versus mediated by friends, LLMs, etc.



