brianjking's comments

LLMWhisperer is great, seconded.

LLMWhisperer from Zipstack at https://llmwhisperer.unstract.com/ or https://github.com/VikParuchuri/surya will do a good job for you.

LLMWhisperer has some nice tooling where it can fall back to OCR, forcing text extraction from scanned documents as well as from documents that already have the text preserved as a text layer.
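
For anyone curious, using it is just a REST POST. A rough sketch with requests - I'm going from memory, so treat the endpoint URL, auth header, and parameter names below as assumptions and check their docs first:

  # Rough sketch of hitting the LLMWhisperer v2 REST API with requests.
  # The base URL, "unstract-key" header, and "mode"/"output_mode" values
  # are assumptions from memory -- verify against the current docs.
  import requests

  API_KEY = "your-llmwhisperer-key"  # placeholder
  BASE_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2"  # assumed

  with open("scanned_contract.pdf", "rb") as f:
      resp = requests.post(
          f"{BASE_URL}/whisper",
          headers={"unstract-key": API_KEY},
          # "mode" is what (I believe) drives the OCR fallback mentioned
          # above: native text extraction where a text layer exists,
          # OCR where the page is just a scan.
          params={"mode": "high_quality", "output_mode": "layout_preserving"},
          data=f.read(),
      )

  resp.raise_for_status()
  print(resp.json())  # v2 runs async, so expect a whisper ID to poll for results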


I'm equally curious.

Huggingface Inference Endpoints can autoscale to 0 and cost nothing when not being used.

Beautiful, would love a way to deploy this easily on Huggingface Inference Endpoints.

The need for a custom handler.py has always stopped me from doing so.
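
For reference, the handler contract itself is pretty small - a minimal sketch, where the pipeline task and model are just placeholders:

  # handler.py -- minimal sketch of a Hugging Face Inference Endpoints
  # custom handler. The EndpointHandler class with __init__/__call__ is
  # the documented contract; the pipeline task here is a placeholder.
  from typing import Any, Dict, List

  from transformers import pipeline


  class EndpointHandler:
      def __init__(self, path: str = ""):
          # `path` is where the endpoint mounts your repo's files on disk.
          self.pipe = pipeline("text-classification", model=path)

      def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
          # The endpoint POSTs JSON like {"inputs": ..., "parameters": {...}}.
          inputs = data.get("inputs", "")
          parameters = data.get("parameters") or {}
          return self.pipe(inputs, **parameters)

Drop that next to the model weights in the repo and the endpoint picks it up.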


Their legal risk doesn't really matter here, IMO. What matters in this case is the court of public opinion.

Even if they are able to show irrefutable proof that it wasn't ScarJo and is in fact another person entirely, it will not matter.

This is one of those times that no matter what the facts show people will be dug in one way or another.



Hey, that's great!

But wait, we already had a working mechanism to signal exactly this type of opt-out[1], so let me rephrase the OP's question: why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

It certainly does seem as if they're trying to position themselves as a new standard against which content owners have to actively opt out, thus disregarding the already existing active opt-out signals. But that would mean that they don't actually care about privacy, and their opt-out signal is disingenuous! That can't be right, can it?? Surely everything they do is in good faith, just like every other corporation ever!

Anyway, the fact that they disregarded existing privacy standards and rolled out their own definitely gives me a lot of confidence that they will forever follow the privacy standards they themselves created!

Now excuse me, but I have to go get treatment for terminally metastasized sarcasm.

[1] https://en.m.wikipedia.org/wiki/Robots.txt


I'm... pretty sure OpenAI respects robots.txt, as explained in the link GP shared?


Whether it respects robots.txt is irrelevant if its existence is secret for the entire time it's doing the scraping.


I'm sorry, but I'm confused by this comment. What exactly is secret?


The several years that they were scraping the web to build their models without telling anyone about it.


I don't think that's how robots.txt or scraping really works. Do you expect scrapers to announce every bot they run? Do you expect webmasters to add a robots rule for every bot?

If you didn't want OpenAI or anyone else scraping your site, it doesn't matter whether they announce their bots, as long as they respect robots.txt and you have rules in place to catch unannounced scrapers.
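
For example, a default-deny robots.txt that allowlists only the crawlers you trust will cover any unannounced bot, as long as it honors the standard:

  User-agent: Googlebot
  Disallow:

  User-agent: *
  Disallow: /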


What I'm saying is that it doesn't matter if you disallow them access now, because they've already gotten everything they want, whether you wanted it or not.

The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

There is little material difference between this behavior and someone downloading your site and using its content in a book they were selling. It doesn't matter that you only discovered the book two years after it was printed. Your work is still being used without your permission.

When the little guy does it, it's called piracy and theft. When billion-dollar corporate entities do it, it's called a technological marvel.


> The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

This doesn't seem accurate at all. Plenty of businesses are built on scraping data; see: Google.

> The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

I think the questions of fair use might keep us busy for hours.

> There is little material difference between this behavior and someone downloading your site and using its content in a book they were selling. It doesn't matter that you only discovered the book two years after it was printed. Your work is still being used without your permission.

I think a fairer comparison would be if someone used my website as reference/inspiration/etc. when writing a book.


I'll accept my reading comprehension mistake if you can quote the passages you're referring to, but I don't see what you're talking about in the GP link.


https://platform.openai.com/docs/gptbot

  Disallowing GPTBot

  To disallow GPTBot to access your site you can add the GPTBot to your site’s robots.txt:

  User-agent: GPTBot
  Disallow: /


I stand corrected. Thank you.


> why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

1) because robots.txt rules are not enshrined in law

2) because you too can ignore robots.txt


Am I missing something?

I thought it was obvious that Microsoft is clearly about to establish the next "standard" with near-Windows levels of ubiquity. It will end up as our primary starting point for using Microsoft stuff - we won't open apps, Copilot will.

Actions speak louder than words, though - look at how obvious they're being.

Copilot is included with Windows, they added a button for it to all keyboards made from here on out, built it into Edge, Office, and their search engine, made it a standalone app, and now their Xbox games' NPCs will be AI-powered, probably open to all their Game Pass studios.

If it goes the way I expect, Microsoft will be essentially done positioning themselves for a world we talk to and expect to listen to us - one that organizes, tracks, and recalls anything we talk to it about. Perfect for the smart glasses we're all about to buy.

Tbh, I think this will be the end of computing as we conceive it now - just not for the reason I expected originally.

Folders, for example - I think Copilot will end folders and all the file-organization stuff for normal users. After a certain point I shouldn't ever need to know where that stuff is on my PC, or manage it in any way.

Instead we'll have "real-time" folders, created from our own saved content, assembled in response to our query and according to our preferences - all named, topic-labeled, and dated, but not by us.

Stored and retrieved by AI - a lot like human memory, actually.

Because we'd then NEED Copilot just to access our stuff - I think that is most definitely coming sooner rather than later.


Because robots.txt is a standard people can choose to follow; it's not a law.



Nice, thanks for sharing. Also, I highly suggest checking out therapy. BetterHelp is what helped me bite the bullet, since they locate your therapist for you and you can switch anytime you want.


Glad BetterHelp worked for you, but I've heard some horror stories about their service: https://youtu.be/XcTssbRvA2w

AI summary by Kagi:

- BetterHelp is a controversial online therapy service that has faced numerous scandals over the past 6 years.

- In 2018, it was revealed that BetterHelp did not actually verify the qualifications of all its therapists as it had claimed.

- Recently, BetterHelp has been fined $7.8 million by the FTC for sharing users' sensitive mental health data with advertisers without proper consent.

- Many popular YouTubers have continued to promote BetterHelp despite its troubled past, leading to backlash from viewers.

- The video host's girlfriend, who has a master's degree in psychology, critiques BetterHelp's claims about the benefits of their service compared to in-person therapy.

- The pricing for BetterHelp's services is significantly higher than for traditional in-person therapy, making it inaccessible for many.

- The video host and his girlfriend share personal negative experiences with BetterHelp therapists, including lack of privacy and unprofessional behavior.

- There are numerous "horror stories" on social media about poor experiences with BetterHelp's therapists and services.

- While BetterHelp may have helped some users, the host believes the negatives outweigh the positives given the company's shady practices.

- The host encourages viewers to avoid BetterHelp and instead seek mental health support from more reputable and ethical providers.


This isn't really meant to replace therapy, to be honest - it's more of a stopgap between therapy appointments, if that makes sense.


Are you using Assistants API v2 with streaming?


Yeah, I do both in prod and in the lib. In the lib I even ported Anthropic's streaming API to be OpenAI-compatible. I'll write the docs over the coming days if there's interest.
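
In the meantime, consuming Assistants v2 streaming with the stock openai SDK looks roughly like this - the assistant ID is a placeholder:

  # Minimal sketch of Assistants API v2 streaming with the openai SDK.
  # The assistant ID below is a placeholder; create one first via the
  # dashboard or client.beta.assistants.create().
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  thread = client.beta.threads.create(
      messages=[{"role": "user", "content": "Say hello in one sentence."}]
  )

  # runs.stream() returns a manager that yields server-sent events;
  # text_deltas is the SDK's convenience iterator over text chunks.
  with client.beta.threads.runs.stream(
      thread_id=thread.id,
      assistant_id="asst_...",  # placeholder
  ) as stream:
      for text in stream.text_deltas:
          print(text, end="", flush=True)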

