Some users comment frequently and uniquely enough that they get their own cluster. I also like how there's "Deno vs Node" lol. It'd be cool if you could click on a comment and get the most similar ones by whatever distance metric you used.
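For what it's worth, the "click a comment, get the most similar ones" feature could be sketched roughly like this. It assumes cosine similarity over the stored embeddings, which may not be the metric the tool actually uses; the toy vectors and the names `cosine_similarity` and `most_similar` are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query_id, embeddings, k=3):
    """Return up to k comment ids closest to query_id, most similar first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), cid)
        for cid, vec in embeddings.items()
        if cid != query_id  # never return the clicked comment itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]

# Hypothetical 3-dimensional embeddings keyed by comment id (assumption).
embeddings = {
    "deno-1": [0.9, 0.1, 0.0],
    "deno-2": [0.8, 0.2, 0.1],
    "gdpr-1": [0.0, 0.1, 0.9],
    "gdpr-2": [0.1, 0.0, 0.8],
}
nearest = most_similar("deno-1", embeddings, k=2)
```

A real version would presumably use an approximate nearest-neighbour index rather than a linear scan, since the dataset is large.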
Interesting breakdown of the dataset. Are you guys manually assigning labels to the clusters after the fact or is this using some kind of LLM to create a cluster name?
Everything is automatic and non-manually curated; we basically just uploaded the data without doing anything dataset-specific as far as the clusters are concerned. First we create an embedding of the rows, then we run projection & clustering on it. The first clusters we generate are narrow. After we generate the first round of clusters we label them with an LLM, and then cluster those to create the more generalized clusters. Have an LLM label those, and then we're done.
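A toy sketch of that two-level flow, for anyone who wants to see the shape of it. Everything here is a stand-in and not the actual pipeline: bag-of-words counts replace the real embedding model, a greedy similarity threshold replaces the real clustering algorithm, the projection step is skipped entirely, and picking the most common words replaces the LLM labelling call. All function names and thresholds are hypothetical.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "in", "of", "and", "to", "for"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def embed_all(texts):
    """Stand-in embedding: word counts over a vocabulary built from the texts."""
    vocab = sorted({t for text in texts for t in tokenize(text)})
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no cluster matched, so start a new one
    return clusters

def label(texts, n_words=2):
    """Stand-in for the LLM labelling step: the most common tokens in the cluster."""
    counts = Counter(t for text in texts for t in tokenize(text))
    return " ".join(w for w, _ in counts.most_common(n_words))

def two_level(rows, narrow_threshold=0.6, broad_threshold=0.4):
    # Round 1: narrow clusters over the rows, then label each cluster.
    narrow = cluster(embed_all(rows), narrow_threshold)
    narrow_labels = [label([rows[i] for i in members]) for members in narrow]
    # Round 2: cluster the labels themselves, then label the broader groups.
    broad = cluster(embed_all(narrow_labels), broad_threshold)
    broad_labels = [
        label([narrow_labels[i] for i in members], n_words=1) for members in broad
    ]
    return narrow_labels, broad_labels

ROWS = [
    "deno runtime is fast",
    "deno runtime permissions",
    "node runtime is stable",
    "node runtime modules",
    "gdpr privacy rules",
    "gdpr privacy in the eu",
    "license privacy terms",
    "license privacy of comments",
]
narrow_labels, broad_labels = two_level(ROWS)
print(narrow_labels)
print(broad_labels)
```

The point is only the structure: the second round never looks at the rows again, it clusters the labels produced by the first round.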
I'm not sure how to view the embedding, I clicked on the graph, narrowed down to a comment, but it only shows the row and not the raw array? (Or I've misunderstood)
We debated doing 2D vs 3D and 3D brought a bunch of usability issues. We also noticed most SOTA embedding visualizations were 2D and already yielded good insights.
I'm not sure about others here, but I occasionally spend time typing out a long reply to someone and then simply deleting the reply without posting. Most of the time I conclude -- a little too late -- that the effort was not worth it.
How nice it would be to have an LLM trained on all of my previous writings and simply be able to click a button to indicate "reply to this person, please." I know I don't have enough training data from HN, and maybe not even from all of the sites I contribute to, combined. It is still a nice thought, though.
But: Let's say I do acquire enough training data to have a local LLM do exactly what I describe. My volume of "replies" would certainly increase. Is that a good thing, on average? If the tool became ubiquitous, would it be a good thing for the average social media user? Or more pointedly, would it be a good thing for consumers of that social media? The cynic in me thinks "no" -- the effort required today surely weeds out _some_ idiots....
(Full disclosure, I nearly closed this window without clicking the "add comment" button.)
I sometimes do the same thing. Same with emails. I’m just not sure there’s enough training data to pick up my tone or to reply making a similar assessment to that which I would make.
Would be a wonderfully fun app to try though.
Would it be a good thing? Do you think there are replies you should have posted that you did not?
> Do you think there are replies you should have posted that you did not?
I think there are probably times where a reply I deleted would have added value to the conversation. Not always, but perhaps enough.
> Would it be a good thing?
That is the issue. More != better, for all cases, as it says nothing about quality.
There is not enough training data for an LLM to emulate me, but what about prodigious writers? Could we realistically emulate Paul Graham or Joel Spolsky?
Recently (before the Microsoft recall announcement) I created a keylogger that logs all keystrokes, mouse position as well as window system specific things such as focused window etc.
My idea was to eventually use that data to train some AI, e.g. to pick up on my different writing styles in a document editor, terminal, etc.
The original dataset is located at [1] (not our HF account). HN data is directly available via the HN API [2]. The privacy policy you point to does not cover HN posts.
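For reference, pulling a single comment from the API looks roughly like the sketch below. It assumes [2] is the official Firebase-hosted API at hacker-news.firebaseio.com; the helper names and the sample payload are made up for illustration, and `fetch_comment` is defined but not called here since it needs network access.

```python
import json
import urllib.request

API_ROOT = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    """URL of a single item (story, comment, etc.) in the HN API."""
    return f"{API_ROOT}/item/{item_id}.json"

def parse_comment(payload):
    """Turn a raw JSON payload into a small dict, or None if it is not a live comment."""
    item = json.loads(payload)
    if item is None:  # the API returns the JSON literal null for unknown ids
        return None
    if item.get("type") != "comment" or item.get("deleted") or item.get("dead"):
        return None
    return {
        "id": item["id"],
        "author": item.get("by"),
        "text": item.get("text", ""),
        "parent": item.get("parent"),
    }

def fetch_comment(item_id, timeout=10):
    """Fetch and parse one item over the network."""
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as resp:
        return parse_comment(resp.read().decode("utf-8"))

# Hypothetical payload in the shape the API returns for a comment (assumption).
sample = '{"by": "someuser", "id": 123, "parent": 100, "text": "hello", "type": "comment"}'
comment = parse_comment(sample)
```

Note that the username comes back in the `by` field of every item, which is relevant to the personal-data discussion further down.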
> Hacker News Information: If you create a Hacker News account (ID and profile), we do not collect any Personal Information unless you choose to provide your email address and/or information in the "about" field (“HN Information”). Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.
I'm a little surprised people don't know the ownership story on HN. Didn't it raise questions when they realized they can't delete their posts without mother-may-I'ing the mods?
HN is pretty up-front that when you post here you are providing them content for free to more-or-less consume as they please.
You did, although I have no idea if OP is an affiliate of YC (IANAL but maybe you could argue the public API is a form of sublicensing the data?):
> By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.
> IANAL but maybe you could argue the public API is a form of sublicensing the data?
Given that the license is explicitly identified as both sublicensable and transferable and includes the right to distribute, I have a very hard time seeing how anyone could argue that the recipient of data that YC explicitly exposes through their "Official HN API" isn't allowed to use it.
My data is licensed only to Y Combinator and its affiliated companies (I would have preferred it to be licensed only for news.ycombinator.com), not to any other rando that can access it via a browser or an API.
Sematic (OP's employer; rebranded to Airtrain?) is a YC company. Not a lawyer but I assume that would be included in "affiliated" since YC presumably has some ownership of them.
But I do feel saddened and personally betrayed; I thought the licence I gave to YC was just for news.ycombinator.com to store and show my comments, not for any other purposes.
All HN did was store and show your (public) comments.
What this other company did with that (public) data seems to me to be a separate issue that you should take up with that company, just like the fact that your public comments (which you explicitly gave permission to HN to show) have been indexed by Google, Bing, and probably thousands of other spiders, bots, scrapers, etc.
I'm curious how you expected this to work. Like if you only give HN permission to store and show your comments on the public web, then somehow no other entity out there will be able to do anything with them?
Yes, I expect it to work in the same way Instagram works, for example. If a commercial entity started yoinking photos from Instagram and using them for commercial purposes, shit will hit the fan.
Again, the fact my user data has been scrapped already doesn't mean it was scrapped legally. I'm ok with HN showing my comments, I'm not ok with anyone other than HN using my user data.
> commercial entity started yoinking photos from Instagram and using them for commercial purposes, shit will hit the fan.
Will it though? I would imagine Meta would block them and then posture with a C&D or a frivolous lawsuit, but if they share the photos you gave them on the public internet, they're publicly consumable right? What law do you feel is broken there?
It's not Meta that would sue them (although they would), it's the copyright owners (the users) that will. Photos or comments, the user retains copyright on their content, and only licenses it to Meta or YC for specific purposes. Yes, that means Meta/YC and their affiliated companies can use the content for purposes other than displaying it in a browser, but 3rd parties 100% can't.
Well, are you going to sue? Are you going to sue Google and Bing and whoever else? Have you even bothered to contact the people at Airtrain to ask them to remove your content? Have you contacted HN to have them delete your account and comments?
Or are you complaining here (ironically) to make a point, but you don't actually care that much?
Shit hits the fan because Instagram considers user content a golden goose, and they have a vested interest in not letting it get outside their control. Not because they feel a particular obligation to protect user privacy. That's generally been status quo for every social network.
HN cares a lot less; they're a tech comment site and don't actively discourage people using the dataset gleanable from the contents of the site for novel experimentation.
(Sidebar: I see "scrapped" coming up a lot in these conversations these days. Where is that neologism coming from? I'm familiar with people calling it "scraping" but it seems like the term has drifted for some reason?)
Anyone could (in a technical sense) scrape HN or access the data through the API and do whatever they want with it. It's unclear to me whether the license granted to HN by your use of the site gives someone doing so license to your comments (I suspect not but IANAL) but the general argument here is that this would fall under fair use. Certainly that seems the case if they didn't display the dataset itself. I'm not sure how it would fare though given they are displaying the content.
You put things in a place with an expectation of a certain standard of use and then go after people hammer and tongs with a strict interpretation. Sometimes, the strict interpretation need not be valid. You can just shake them down.
That's a contract between users and HN. Airtrain is a 3rd-party.
If HN exposes personal information publicly through their API then there is a problem.
And AFAICT the only way for HN to prevent user comments from being used by 3rd parties is preventing access to those comments, meaning a) sign-up will have to be more stringent and b) visitors will have to sign in just to read (or scrape) comments.
Back in the day, we called that "indexing" and it was fundamental to making the web in any way usable; without search engines, the whole thing was data with no ability to locate it.
I don't know precisely what changed that people decided that analysis is a bridge too far.
Actually, another problem might be GDPR: I have found my username... and that is clearly PII because it's directly and unambiguously bound to me.
I don't really care (for now) about this... but on principle, I'm a bit fed up too with companies just crawling anything to train any model without any care about the data, the people who produced it, and the consequences for people's lives.
I look forward to the not-too-distant future where the EU protections grow stronger and places like HN have to respond by banning all European users lest they run afoul of a draconic legal framework.
It'll kill a lot of experiments (Mastodon immediately comes to mind; can't be pulling comments from other people's servers if those comments are attached to personal data like the commenter's username, right?).
Well, maybe you should think about the real responsibilities: Europe makes laws in reaction to ABUSES. So don't blame Europe for the legislation, but the abusers who made this legislation necessary to defend European citizens ;-)
Actually, Europe is so slow that a lot of experiments may take place. And there won't be any legislation if there is no abuse...
It took a loooonnnng time for Europe to react to Facebook, Google & co's abuse of user data. Same for OpenAI using an awful lot of copyrighted material without giving anything back... So thank them for European legislation :-)
I'm not up on the nuance of the GDPR, but has it been tested that your public profile name - which you set knowing it will be displayed publicly - is PII?
It's not PII (an American term) but it is personal data (a GDPR term).
Personal data is (broadly) considered to be data that could be used to track or tie your behavior online together into a profile. The UK's ICO calls out usernames specifically as an example of such data. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-re....
(For those of us who have been around on the Internet long enough to remember the era where people intentionally chose handles to remain pseudonymous and separate from their IRL personas, this seems counter-intuitive and a little preposterous, but the GDPR doesn't care what "netizens" think about privacy; it's a broad attempt to impose a "non-native" concept of privacy over the preexisting net culture).
Well, you choose a username in a specific context, even if it's public.
For example, you may agree to have your LinkedIn profile name next to your HN username... maybe. But I'm not sure that you would agree to have your LinkedIn profile name next to your Tinder username.
And you sure don't want that to happen without your agreement and even without you knowing about it (but learning about it from a colleague for example).
That's why GDPR has some rights to deletion or modification. And why, some day, Europe may go after data brokers.
(as a side note: not sure why my comments were downvoted. I didn't say that I would go after anybody - and surely not HN - I only said that uncontrolled use of any data without any anonymization and without consent might be a source of problems with regard to legislation that was decided BECAUSE too many shady businesses abused it. You may not like it but then... well... downvote the abusers)