Hacker News new | past | comments | ask | show | jobs | submit | solomatov's comments login

I would be much more likely to install this if it was published in the app store.

There are good reasons not to publish on the App Store, e.g. if you want to actually make any money from the app.

My main concern is security and privacy. App store apps are sandboxed but manually installed apps usually are not.

If you are small, the app store looks to me as the easiest solution for selling apps.

Also if you have gone through the hell that is publishing and signing Mac apps.

Thanks for the feedback! I haven't tried to do this yet, but it's built on Tauri 2.0 and it doesn't look too hard (https://tauri.app/distribute/app-store/). Will take a look at this.

Most popular Mac apps, like Spotify and VS Code, are not.

Because they're big enough that they can afford not to, and they want to do things that the sandbox/review process/monetisation rules wouldn't let them. I assume the sandbox is exactly why the parent wants the app to be there.

I would have thought the exact opposite of your statement: they are big enough that they can afford it. It seems like the ability to forgo the App Store on Mac allows Apple to get away with things like a high-friction review process and restrictive monetization rules. Without the big players pushing back, why would they change?

Doesn't Apple charge App Store apps 30% on all their transactions/subscriptions? What company in its right mind would opt into that if it doesn't have to?

A small to medium-sized company, for several reasons:

- Setting up payments with a third-party provider isn't that simple, and their fees are far from zero.

- Getting users. Popular queries in Google are full of existing results, and getting in there isn't easy or cheap. Also, search engines aren't the most popular way to get apps onto your devices; usually people search directly in app stores. Apple takes care of this, i.e. I'd guess that popular apps with good ratings get higher positions in search results.

- Trust. I install apps on my computer without Apple only if I trust the supplier of the software (or have to have it there). Apple solves this with their sandboxing.

Yep, 30% is a lot, but for these kinds of businesses it might be well worth it (especially with the reduced 15% commission for smaller revenue).
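To make the trade-off concrete, here's a rough back-of-the-envelope sketch. The 2.9% + $0.30 third-party fee below is an assumed, typical card-processor rate, not a quote from any provider, and it deliberately ignores the costs the comment above mentions (building checkout, taxes/refunds, and user acquisition):

```python
def net_after_apple(price, small_business=True):
    # Apple keeps 15% under the Small Business Program (revenue under $1M/yr),
    # otherwise 30%.
    rate = 0.15 if small_business else 0.30
    return price * (1 - rate)

def net_after_processor(price, pct_fee=0.029, flat_fee=0.30):
    # Assumed third-party processor fees; real fees and overheads vary.
    return price * (1 - pct_fee) - flat_fee

price = 9.99
print(round(net_after_apple(price), 2))                        # 8.49
print(round(net_after_apple(price, small_business=False), 2))  # 6.99
print(round(net_after_processor(price), 2))                    # 9.4
```

The raw per-transaction gap narrows a lot at the 15% tier, which is the point being made: for a small business, the delta may be cheaper than running payments, discovery, and trust yourself.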


My guess is they sample closed lambda terms representing functions.


Could you give examples of such companies (I am really curious)?


> Could you give examples of such companies

One example is ARM, which licenses the processor designs it creates and does not build or sell the chips itself.


I am aware of a few orgs that license interesting software R&D, often with engineering support and sometimes with an equity component. Another variant is the R&D holding company that creates separate companies to commercially exploit the R&D in different parts of the public or private sector. Most such R&D orgs are very low-profile; they usually don't even have an internet presence. Many use few or no patents these days, since those economics don't make sense unless the business is largely owned by lawyers, which creates a different kind of company (much closer to patent trolls).

It is a bespoke kind of business, tailored to the specific technology and investment network of the people involved.


Drag (air friction) is proportional to the square of velocity. If you knew the aerodynamic properties of an iPhone, you could find the velocity at which drag equals the gravitational force, at which point the speed stops changing (terminal velocity). You could use that as an approximation of the speed at which the iPhone hits the ground.
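As a sketch of the calculation: setting the quadratic drag force ½ρC_dAv² equal to the weight mg and solving for v gives the terminal velocity. The phone's mass, frontal area, and drag coefficient below are rough assumptions (a phone falling flat-side down, treated like a flat plate), not measured values:

```python
import math

def terminal_velocity(mass_kg, drag_coeff, area_m2, air_density=1.225, g=9.81):
    """Speed at which quadratic drag 0.5 * rho * Cd * A * v^2 balances m * g."""
    return math.sqrt(2 * mass_kg * g / (air_density * drag_coeff * area_m2))

# Assumed figures: ~170 g mass, ~0.01 m^2 frontal area, Cd ~ 1.1 (flat plate).
v_t = terminal_velocity(mass_kg=0.17, drag_coeff=1.1, area_m2=0.01)
print(f"{v_t:.1f} m/s")  # roughly 16 m/s under these assumptions
```

A phone tumbling or falling edge-on would present a smaller area and lower drag coefficient, so the real figure could be noticeably higher.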


We won’t need the semantic web if machines are able to understand natural language.


Eh, I could see this, but I think having well-structured data will still be useful, both for cutting down on 'prompt clutter' and for giving the model a better chance of coming up with something useful.

It might get better in later versions, but even so, I think passing well-structured data will always result in cleaner output than passing data with lots of noise.

If we are going to work alongside AGIs, we should treat them as equal to humans, and by that I mean passing them good data rather than assuming they'll figure out what we meant. Obviously it won't be _as_ big an issue or requirement, but if you can put it into such a format, why wouldn't you?


It's not a fair use question, it's a question about models reproducing the article text almost verbatim.

(IANAL)


If it's fair use, reproducing the article text verbatim is fine.


What is "it"? Training can be fair use, i.e. updating weights incrementally based on predicted next token probabilities. And I (not a lawyer) think that if a broadly trained foundation model can recall some verbatim text, that doesn't mean the model is infringing.

It seems like the lawsuit here is talking about specific NYT-related functionality, like retrieving the text of new articles. That essentially has nothing to do with training and running a foundation LLM; it's about some specific shenanigans with NYT content, and its legal status would appear to have nothing to do with whether training is fair use.


Good luck trying to explain "updating weights incrementally based on predicted next token probabilities" to completely non-technical lawyers and judges.


Good thing they don't have to. As I've said before, this sleight of hand of talking about the case as if it's about training is a great move by OpenAI; however, the case is about more than just training. NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

This is akin to having to explain to non-technical lawyers and judges how crypto works. In the FTX case it became irrelevant, because you could just nail them on fraud for using deposited funds for non-allowed purposes.


>NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

So if ChatGPT didn't refer to it verbatim, and instead trained on it and mixed it with other content, NYT would be OK with that? Tbh I don't get it.

Edit: I found this in my bookmarks archive - https://news.slashdot.org/story/23/08/15/2242238/nyt-prohibi...

Also this: https://www.cnbc.com/2023/10/18/universal-music-sues-anthrop...


It depends on the purpose of ChatGPT. If people use it as a substitute for the NYT, then yes, I suspect the NYT will not be OK with it.

I think the courts will also side with the NYT. Very recently there was a copyright case involving Andy Warhol [1], which he lost. Despite the artwork being visually transformative, its use was not transformative. So, to me, that means if you create a program using NYT's materials that is used as a replacement for the NYT, it will not count as fair use. Obviously you could just do what, say, Google does and fork money over to the NYT for a license.

However, my initial point is that this is a tangent. NYT has claimed that OpenAI is using NYT's works at least as-is, so OpenAI can just be nailed for that. Which is my point about FTX: it's irrelevant whether their exchange was legal, since you can just nail them for misuse of customer funds. Another example would be Al Capone; it doesn't matter that he was a mobster, because you can nail him for tax evasion.

[1]: https://www.cbsnews.com/news/andy-warhol-supreme-court-princ...


I think this is more a question of licensing content; sooner or later AI chatbots will have to license at least some of the content they are trained on.

But broadly speaking this is also a question of the "Open Web" and whether it will survive. Walled gardens like Facebook, Instagram, etc. are strong and pervasive, but the majority of people still use and rely on publicly open websites from the Open Web. If AI chatbots do not drive traffic to websites, then they are walled gardens, and Microsoft, Google, or whoever will lock users in and try to squeeze them for money.


I didn't see NYT allege that - their lawsuit explains pre-training pretty accurately I thought.


It's buried on page 37, #108. There are probably other examples in the lawsuit, but this is sufficient.

> Synthetic search applications built on the GPT LLMs, including Bing Chat and Browse with Bing for ChatGPT, display extensive excerpts or paraphrases of the contents of search results, including Times content, that may not have been included in the model’s training set. The “grounding” technique employed by these products includes receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural-language substitutes that serve the same informative purpose as the original. In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.

https://www.courtlistener.com/docket/68117049/1/the-new-york...


Oh I see - yeah, that's the part of the lawsuit that's about Bing and ChatGPT Browse mode retrieval augmented generation.

It's a separate issue from the fact that the model can regurgitate its NYT training data.

There's a section on page 63 which helps clarify that:

    Defendants materially contributed to and directly assisted
    with the direct infringement perpetrated by end-users of the
    GPT-based products by way of: (i) jointly-developing LLM
    models capable of distributing unlicensed copies of Times
    Works to end-users; (ii) building and training the GPT LLMs
    using Times Works; and (iii) deciding what content is
    actually outputted by the GenAI products, such as grounding
    output in Times Works through retrieval augmented generation,
    fine-tuning the models for desired outcomes, and/or
    selecting and weighting the parameters of the GPT LLMs.
So they are complaining about models that are capable of distributing unlicensed copies (the regurgitated training data issue), the fact that the models were trained on NYT work at all, and the fact that the RAG implementation in Bing and ChatGPT Browse further creates "natural-language substitutes that serve the same informative purpose as the original".
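The "grounding" pipeline the complaint describes (copy matching article text per prompt, then hand it to the LLM as context) can be sketched roughly like this. The keyword-overlap retrieval and prompt template below are illustrative stand-ins, not OpenAI's or Bing's actual implementation:

```python
def retrieve(query, documents, k=1):
    """Toy retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
    return ranked[:k]

def build_grounded_prompt(user_prompt, documents):
    """Assemble the prompt-plus-copied-content context that the LLM
    would then paraphrase or quote from."""
    context = "\n---\n".join(retrieve(user_prompt, documents))
    return (f"Context:\n{context}\n\n"
            f"Question: {user_prompt}\n"
            f"Answer using the context above.")

docs = [
    "City budget article on local spending",
    "Wirecutter review of the best kitchen knives",
]
prompt = build_grounded_prompt("what does the review say about kitchen knives", docs)
```

Note that nothing in this step depends on model weights at all, which is why its legal status is separable from the question of whether training is fair use.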


Yep, you seem to be right. Google stores quotes from pages, and that's fair use. Again, I am not a lawyer and hadn't thought about this.


This is argued extensively in the lawsuit document.

A key argument the NYT is making is that part of the definition of fair use is not producing a product that competes with the original.

They argue that ChatGPT et al DO compete with the original, in a way that harms the NYT business model.

One example they give: ChatGPT can reproduce recommendations made by the Wirecutter, without including the affiliate links that form the Wirecutter's main source of revenue - page 48 of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...


There are plenty of other services that strip affiliate links from content for users, such as ad blockers.

Notably, in both those cases, the user is specifically asking for what the Wirecutter thinks.

To me, that makes the infringing behavior clearly the fault of the user of the tool, not the tool itself.

When I saw this lawsuit announced, I assumed that large portions of content were being reproduced in response to generic queries. That isn't the case: every example I've seen from this lawsuit has prompts where the user specifically asks for the content. To me, any fault and liability rests on the user here.


Fair use is a key defense for OpenAI.

The article also mentions the idea that the model is merely extracting (uncopyrightable) facts, which is interesting, but might be a tough one to prove, since LLMs have no way of establishing what is a fact, and don't return facts by design.


> state-space models make transformer based models obsolete

We will see whether they work at large scale pretty soon. I hope they will, but they might not. There are models that outperform more advanced ones at smaller scales, and I haven't heard how Mamba performs at GPT scale.


Yes, it's doable. Your model won't be as large or as performant as a real large-scale model, but you could train something. You could watch this for a start: https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxb...
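For a taste of the idea, that series starts from a character-level bigram model before building up to a transformer. Here's a pure-Python sketch of that first step (this is my own illustration, not the code from the videos):

```python
import random
from collections import defaultdict, Counter

def train_bigram(text):
    """Count, for each character, which characters tend to follow it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length=20, seed=0):
    """Sample a continuation by repeatedly drawing a likely next character."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        chars, weights = zip(*nxt.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

model = train_bigram("hello world, hello there")
sample = generate(model, "h")
```

A real GPT-style model replaces the count table with a neural network conditioned on a long context window, but the training objective (predict the next token) is the same.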


It's much more complicated than that. Article 3 (https://gdpr-info.eu/art-3-gdpr/) gives two ways to fall within the territorial scope of the GDPR:

- the offering of goods or services

- the monitoring of behavior of data subjects

Offering doesn't mean the product is merely available and/or sellable in the EU; it's more complicated than that. The EDPB has guidance on this topic: https://edpb.europa.eu/our-work-tools/general-guidance/guide... In short, the document gives examples where services are available and sellable in the EU, yet the personal data involved isn't covered by the GDPR.

On the other hand, my understanding is that monitoring of behavior is always covered by GDPR.

(I am not a lawyer and this is not legal advice)


I don’t think I dispute that the GDPR and related laws claim to apply to me if I have a website that EU residents access.

I dispute that they have jurisdiction to actually apply their laws to me, any more than the US can charge somebody with violating FCC regulations for a radio signal sent from Norway.

There are specific things like extradition treaties, trade agreements, and parallel legislation that cover existing areas where this happens. Is there one that covers application of the GDPR in the US?


The U.S. and the EU signed the Data Privacy Framework over this past summer. https://www.dataprivacyframework.gov/s/ This offers methods for EU residents to exercise claims against U.S. businesses.

Among other requirements, a participating organization must provide you:

  Information on the types of personal data collected
  Information on the purposes of collection and use
  Information on the type or identity of third parties to which your personal data is disclosed
  Choices for limiting use and disclosure of your personal data
  Access to your personal data
  Notification of the organization’s liability if it transfers your personal data
  Notification of the requirement to disclose your personal data in response to lawful requests by public authorities
  Reasonable and appropriate security for your personal data
  A response to your complaint within 45 days
  Cost-free independent dispute resolution to address your data protection concerns
  The ability to invoke binding arbitration to address any complaint that the organization has violated its obligations under the DPF Principles to you and that has not been resolved by other means
https://www.dataprivacyframework.gov/s/article/My-Rights-und...


> There are specific things like extradition treaties, trade agreements, and parallel legislation that cover existing areas where this happens. Is there one that covers application of the GDPR in the US?

Nope. Extradition only covers the case where you go to some other country and commit a crime there, then return to the US. If the crime you committed there is serious, and is also a crime here, then extradition can apply. There are other conditions as well, but the key is that it has to be a crime in both places.

Europeans can claim that you must follow their laws until they are blue in the face but it won’t magically become true. You can safely ignore it. Enjoy competing against European businesses without having to pay any of the same costs.


Even if you do not have to comply with the GDPR, 12 states have passed data privacy regulations to date. You may still need to comply with data protection law if you fall under various state laws.

Even if state law doesn't apply, you have HIPAA, GLBA, SOX, etc.


All irrelevant to the question. But it is of course true that we have plenty of our own laws to follow.


>I am, personally... at least until it's all done on-device and functions offline.

You could do it now. Apple computers with a lot of RAM are pretty good at running Llama 2.


Not really; I'm talking about a mobile phone and near-OpenAI quality.

My workstation has allowed me to dabble - I'm familiar, a unified pool of memory does very little for me.

The experience with self-hosted stuff leaves a bit to be desired, both in generation speed and content.

The software needs work, I'm not saying we won't get there... just that we haven't, yet.

With a ridiculously beefy system I can eke out some slow nonsense from the machine. It's neat, and I can do it, I just don't find it very useful.

