ChatGPT Enterprise (openai.com)
860 points by davidbarker 8 months ago | 510 comments



Explicitly calling out that they are not going to train on enterprise data, along with SOC 2 compliance, is going to put a lot of enterprises at ease and encourage them to embrace ChatGPT in their business processes.

From our discussions with enterprises (trying to sell our LLM apps platform), we quickly learned how sensitive enterprises are when it comes to sharing their data. In many of these organizations, employees are already pasting a lot of sensitive data into ChatGPT unless access to ChatGPT itself is restricted. We know a few companies that ended up deploying chatbot-ui with Azure's OpenAI offering since Azure claims to not use users' data (https://learn.microsoft.com/en-us/legal/cognitive-services/o...).

We ended up adding support for Azure's OpenAI offering to our platform, as well as open-sourcing our engine to support on-prem deployments (LLMStack - https://github.com/trypromptly/LLMStack), to deal with the privacy concerns these enterprises have.
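For anyone curious, pointing an app at Azure's offering rather than api.openai.com is mostly a client-configuration change. A minimal sketch with the openai Python SDK (v1+); the endpoint, key, and deployment names below are placeholders, not real resources:

    from openai import AzureOpenAI

    # Requests go to the company's own Azure OpenAI resource, so prompts stay
    # inside the Azure tenant and fall under Azure's data-use terms.
    client = AzureOpenAI(
        azure_endpoint="https://my-company.openai.azure.com",  # hypothetical resource
        api_key="<key scoped to that resource>",
        api_version="2023-07-01-preview",
    )

    resp = client.chat.completions.create(
        model="gpt-35-turbo",  # the *deployment* name configured in Azure, not the raw model name
        messages=[{"role": "user", "content": "Summarize this internal memo: ..."}],
    )
    print(resp.choices[0].message.content)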


My company (Fortune 500 with 80,000 full time employees) has a policy that forbids the use of any AI or LLM tool. The big concern listed in the policy is that we may inadvertently use someone else’s IP from training data. So, our data going into the tool is one concern, but the other is our using something we are not authorized to use because the tool has it already in its data. How do you prove that that could never occur? The only way I can think of is to provide a comprehensive list of everything the tool was trained on.


It’s a legal unknown. There’s nothing more to it. Your employer has opted for one side of the coin flip, and it’s the risk-averse one. Any reasonably sized org is going to be raising the same questions, but many instead opt to reap the benefits and take on the legal risk, which is something organisations do all the time anyway.


There is a very real concern building about being “left behind” on these issues.

You’ve got to be early, but not so early that you get legal or business disruptions or consequences.

It’s quite the balancing act for exec teams.


For me that discussion is always hard to grasp. If a human learned coding autodidactically by reading source code, and later wrote new code, they could only do so because they had read licensed code. No one would ask about the license, right?

So why do we care from where LLMs learn?


> So why do we care from where LLMs learn?

Because humans aren't computers, and the similarities between the two, other than the overuse of the word "learning" in the computer's case, are nonexistent?


Are you really asserting that these models aren't learning? What definition of learning are you using?


Don't know if they are, and don't really care either; I especially don't care to anthropomorphize circuitry to the extent that AI proponents tend to.

Humans and Computers are 2 wholly separate entities, and there's 0 reason for us to conflate the two. I don't care if another human looks at my code and straight up copies/pastes it, I care very much if an entity backed by a megacorp like Micro$oft does the same, en-masse, and sells it for profit, however.


Okay, so the scale at which they sell their service is a good argument that this is different from a human learning.

However, on the other hand we also have the scale at which they learn, which kind of makes every individual source line of code they learn from pretty unimportant. Learning at this scale is a statistical process, and in most cases individual source snippets diminish in the aggregation of millions of others.

Or to put it the other way round, the actual value lies in the effort of collecting the samples, training the models, creating the software required for the whole process, putting everything into a good product and selling it. Again, in my mind, the importance of every individual source repo is too small at this scale to care about their license.


The idea that individual source snippets at this scale diminish in aggregation is undercut by the fact that OpenAI and MSFT are both selling enterprise-flavoured versions of GPT, and the one thing they promise is that enterprise data will not be used to further train GPT.

That is a fear for companies because the individual source snippets and the knowledge "learned" from them is seen as a competitive advantage of which the sources are an integral part - and I think this is a fair point from their side. However then the exact same argument should apply in favour of paying the artists, writers, coders etc whose work has been used to train these models.

So it sounds like they are trying to have their cake and eat it too.


Hmm. You sure this is the same thing? I would say it’s more about confidentiality than about value.

Because what companies want to hide are usually secrets that are available to (nearly) no one outside the company. It’s about preventing accidental disclosure.

What AIs are trained on, on the other hand, is publicly available data.

To be clear: what could leak accidentally would have value of course. But here it’s about the single important fact that gets public although it shouldn’t, vs. the billions of pieces from which the trained AI emerges.


It's really not different in scale. Imagine for a moment how much storage space it would take to store the sensory data that any two year old has experienced. That would absolutely dwarf the text-based world the largest of LLMs have experienced.


If you don't care, why are you confidently asserting things you're not even interested in examining? It just drowns out useful comments.


Do humans really read terabytes of C code to learn C?

Humans look at a few examples and extrapolate…


But that also exists in the AI world. It’s called „fine tuning“: a LLM trained on a big general dataset can learn special knowledge with little effort.

I’d guess it’s exactly the same with humans: a human that received good general education can quickly learn specific things like C.
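To make the "little effort" part concrete: against OpenAI's hosted fine-tuning API, the whole flow is roughly the sketch below, assuming the openai Python SDK (v1+) and a hypothetical examples.jsonl of chat-formatted samples:

    from openai import OpenAI

    client = OpenAI()

    # Upload a small set of domain-specific examples...
    training_file = client.files.create(
        file=open("examples.jsonl", "rb"),  # hypothetical file of chat-formatted examples
        purpose="fine-tune",
    )

    # ...then start a fine-tuning job on top of the general base model.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )
    print(job.id)  # once the job finishes, you call the resulting model like any other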


Humans have experienced an amount of data that absolutely dwarfs the amount of data even the largest of LLMs have seen. And they've got billions of years of evolution to build on to boot


You're straying away. Let's talk about learning C.

Also humans didn't evolve in billions of years.


The process of evolution "from scratch", i.e. from single-celled organisms took billions of years.

This is all relevant because humans aren't born as random chemical soup. We come with pre-trained weights from billions of years of evolution, and fine-tune that with enormous amounts of sensory data for years. Only after that incredibly complex and time-consuming process does a person have the ability to learn from a few examples.

An LLM can generalize from a few examples on a new language that you invent yourself and isn't in the training set. Go ahead and try it.


I can't even convince it to put the parameters in a function call in the correct order, despite repeatedly asking.


There is the element of the unknown with LLMs etc.

There is a legal difference between learning from something and truly making your own version, versus simply copying.

It's vague of course - take plagiarism in a university science essay - the student has no original data and very likely no original thought - but still there is a difference between simply copying a textbook and writing it in your own words.

Bottom line - how do we know the output of the LLM isn't a verbatim copy of something with the license stripped off?


> So why do we care from where LLMs learn?

same difference there is between painting your own fake Caravaggio and buying a fake Caravaggio (or selling the one you made).

the second one is forgery, the first one is not.


The way I see it is that with AI you have really painted your own Caravaggio, but instead of an electrochemical circuit of a human brain you've employed a virtual network.


> but instead of an electrochemical circuit of a human brain you've employed a virtual network.

technically it is still a tool you are using, as opposed to doing it on your own, with your hands, using your own brain cells that you trained over the decades, rather than a virtual electronic brain pre-trained in hours/days by someone else on who knows what.


Okay if it’s about looking at one painting and faking that. However, if you train your model on billions of paintings and create arbitrary new ones from that, it’s just a statistical analysis of what paintings in general are made of.

The importance of the individual painting diminishes at this scale.


And if you look at lots of paintings, and create a new painting which is in a very similar style to an existing painting?

Is that a forgery? Have you infringed on the copyright on all the paintings you looked at?


Why do people bring this up? People are not LLMs and the issues are not the same.


I'd add to this: the damage an LLM could do is much greater than what a human could do in terms of individual production. A person can only paint so many forgeries; a machine can create many, many more. The dilution of value from a person learning is far different from machine learning. The value extracted and diluted is night and day in terms of scale.

Not to say what will/won't happen. In practice, what I've seen doesn't scare me much in terms of what LLMs produce vs. what a person has to clean up after it's produced.


Why are the issues not the same? Are you privileging meat over silicon?


Yes they are. Most people will.

They are not the same because an LLM is a construct. It is not a living entity with agency, motive, and all the things the law was intended for.

We will see new law as this tech develops.

For an analogy, many people call infringement theft and they are wrong to do so.

They will focus on the "someone getting something without having followed the right process" part, while ignoring the equally important "someone else being denied the use of, or losing, their property" part.

The former is an element in common between theft and infringement. And it is compelling!

But, the real meat in theft is all about people losing property! And that is not common at all.

This AI thing is similar. The common elements are super compelling.

But it just won't be about that in the end. It will be all about the details unique to AI code.


Using the word "construct" isn't adding anything to the conversation. If we bioengineer a sentient human, would you feel OK torturing it because it's "just a construct"? If that's unethical to you, how about half meat and half silicon? How much silicon is too much silicon and makes torture OK?

> Most people will [privilege meat]

"A person is smart. People are dumb, panicky dangerous animals, and you know it". I agree that humans are likely to pass bad laws, because we are mostly just dumb, panicky dangerous animals in the end. That's different than asking an internet commenter why they're being so confident in their opinions though.


If we bioengineer:

Full stop. We've not done that yet. When we do, we can revisit the law / discussion.

We can remedy "construct" this way:

Your engineered human would be a being. Being a being is one primary difference between us and these LLM things we are toying with right now.

And yes, beings are absolutely going to value themselves over non beings. It makes perfect sense to do so.

These LLM entities are not beings. That's fundamental. And it's why an extremely large number of other beings are going to find your comment laughable. I did!

You are attempting to simplify things too much to be meaningful.


Define "being". If it's so fundamental, it should be pretty easy, no?

And I'd like if this were simple. Unfortunately there's too many people throwing around over-simplifications like "They are not the same because an LLM is a construct" or "These LLM entities are not beings". If you'll excuse the comparison, it's like arguing with theists that can't reason about their ideological foundations, but can provide specious soundbites in spades.


It is easy!!

First and foremost:

A being is a living thing with a will to survive, need for food, and a corporeal existence, in other words, is born, lives for a time, then dies.

Secondly, beings are unique. Each one has a state that ends when they do and begins when they do. So far, we are unable to copy this state. Maybe we will one day, but that day, should there ever be one, is far away. We will live our lives never seeing this come to pass.

Finally, beings have agency. They do not require prompting.


So these jellyfish aren't "beings" because they can live forever? Or do they magically become "beings" when they die?

https://en.m.wikipedia.org/wiki/Turritopsis_dohrnii

Also twice now you've said the equivalent of "it hasn't happened yet so no need to think about the implications". Respectfully, I think you need to ponder your arguments a bit more carefully. Cheers.


Of course they are beings!

They've got a few fantastic attributes; lots of different beings do. You know the little water bear things are tough as nails! You can freeze them for a century, wake them up, and they'll crawl around like nothing happened.

Naked mole rats don't get any form of cancer. All kinds of beings present in the world have traits like that, and it doesn't affect the definition at all.

You didn't gain any ground with that.

And I will point out, it is you who has the burden in this whole conversation. I am clearly in the majority with what I've said. And I will absolutely privilege meat over silicon any day, for the reasons I've given.

You, on the other hand, have a hell of a sales job ahead of you. Good luck, maybe this little exchange helped a bit. Take care.


> Or do they magically become "beings" when they die?

quoting from your link

although in practice individuals can still die. In nature, most Turritopsis dohrnii are likely to succumb to predation or disease in the medusa stage without reverting to the polyp form

This sentence does not apply to an LLM.

Also, you can copy an LLM's state and training data and you will have an equivalent LLM; you can't copy the state of a living being.

Mostly because a big chunk of the state is experience, like for example you take that jellyfish, cut one of its tentacles and it will be scarred for life (immortal or not). That can't be copied and most likely never will.


Regarding the copying of a being state, I'm not really sure that's ever even going to be possible.

So for the sake of argument I'll just amend that and say we can't copy their state. Each being is unique and that's it. They aren't something we copy.

And yes, that applies to all of us who think we're somehow going to get downloaded into a computer: I'll say it right here and now, that's not going to fucking happen.


Companies don't go around donating their source code to universities either, even if it was only for the purpose of learning.


> So why do we care from where LLMs learn?

Because humans dont put the "Shutterstock" watermark logo on the images they produce.


As with all absolutes* exceptions exist:

Viagra Boys - In Spite Of Ourselves (with Amy Taylor)

    I absolutely love that the entirety of the video is unpurchased stock footage with the watermark still on it. This is cinematic gold.
https://www.youtube.com/watch?v=WLl1qpDL7YA

* well, most ...


cargo cult programming is real though


> The only way I can think of is to provide a comprehensive list of everything the tool was trained on.

There are some startups working in the space that essentially plan to do something like this. https://www.konfer.ai/aritificial-intelligence-trust-managem... is one I know of that is trying to solve this. They enable these foundation model providers to maintain an inventory of training sources so they can easily deal with coming regulations etc.


Isn’t that a benefit of using a provider?

Microsoft/OpenAI are selling a service. They’re both reputable companies. If it turns out that they are reselling stolen data, are you really liable for purchasing it?

If you buy something that fell off a truck, then you are liable for purchasing stolen goods. But if it turns out that all the bananas in Walmart were stolen from Costco, you’re not as a customer liable for theft.

Similarly, I don’t know if Clarkson Intelligence has purchased a proper license for all the data they are reselling. Maybe they are also scraping some proprietary source and now you are using someone else’s IP.


> But if it turns out that all the bananas in Walmart were stolen from Costco, you’re not as a customer liable for theft.

Actually, that would be fencing stolen goods and customers could have obligations.

The case of bananas is a bit silly, as returning bananas would not be possible and the value is too small to bother with.

But imagine a reputable car dealer selling stolen cars; repossession is far more likely there.


Even if you find a way to successfully forward liability and damages to Microsoft and OpenAI - which I doubt you will be able to as the damages are determined by your use of the IP - you do not gain the right to use the affected IP and will have a cease and desist for whatever is built upon it.

How legitimate the IP concern is and whether it holds up in court is one thing, but finger pointing will probably not be sufficient.


Also I think that MS / OpenAI cannot and will not indemnify you. I think that their CEO and CFO have a duty not to...


> How do you prove that that could never occur?

Realistically you can prove that just as well as you can prove that employees aren't using ChatGPT via their cellphones.

There are also organizations that forbid the use of Stack overflow. As long as employees don't feel like you're holding back their career and skills by prohibiting them from using modern tools, and keep working there, hey. As long as you pay them enough to stay, people will put up with a lot, even if it hurts them.


Using chatgpt to code is not a skill. It’s a crutch. Any employees that feel held back by not being able to access it aren’t great in the first place.


Using $technological_aide \in {chatgpt,ide,stackoverflow,google,debuggers,compilers,optimizers,high-mem VMs}$ to code is not a skill. It’s a crutch. Any employees that feel held back by not being able to access it aren’t great in the first place.


I don't think using ChatGPT is similar to searching for answers on S.O. Maybe if you were asking people on S.O. to write your code for you, or plugging in exact snippets. The point here is that letting ChatGPT write code directly into your repo is effectively plagiarism and may violate any number of licenses you don't even realize you're breaking, whereas just looking at how other people did something, understanding, and then writing your own code, does not.


Honestly I couldn’t tell you whether copying code out of Stack Overflow or out of ChatGPT is more legally suspect. For SO, you don’t know where the user got the code from either (wrote it themselves? Copied it out of their work repo? Copied from some random GPL source?)


Well, this is why you don't copy code from S.O. You read it, understand why it works, then write your own.


I've been experiencing carpal tunnel on and off for a couple of weeks now. I can tell you that reading through some code generated by "insert llm x" is substantially less painful than writing all of it by my own hand.

Especially if you start understanding how to refine your prompts to the point where you use a single thread for project management and use that thread to generate prompts for other threads.

Not all value to be gained from this is purely copypasta.


Same here. End of last year I had to take more time off than I wanted to because of my wrists and hands. Copilot and GPT4 (combined with good compression braces) got me back in the game.


Take sick leave


Boy, those easy answers are right there! Can't miss 'em.

A guy could wonder why so many of us do not use those answers.

Could it be the details complicate things just enough to take the easy answer off the table?

Perhaps it is just me. What say you?


Do you have a point to make? Maybe I should ask chatgpt to find it because I sure can't.


Yes I do. The point is that blurting out some one-liner fix-all doesn't really help anyone.


Stackoverflow yes, but the others aren’t the same category. Asking someone else to give you code to do X means you struggled to synthesize the algorithm yourself. It’s not a good habit because it means you struggle to be precise about how the program behaves.


> It’s a crutch

problem is most of the code chatgpt spouts is wrong, in so many subtle ways, that sometimes you just have to run it to prove it.

so basically you have to be better than chatgpt at that particular task to spot its mistakes.

using it blindly is similar to the Gell-Mann amnesia effect

https://theportal.wiki/wiki/The_Gell-Mann_Amnesia_Effect

said by someone who uses chatgpt extensively, it is good for the structure, to get an idea, but as a code generator it kinda sucks.


Thank you.

I am not a programmer and only know some very rudimentary HTML and Java. After hearing everyone enthuse about how they use ChatGPT for everything, I thought that I could use it to generate a page that I thought sounded simple enough. Gist of it was that I needed 100 boxes of the same dimensions that text could be inputted into. I figured that it'd be faster with AI than with me setting up an excel sheet that others would have to access.

Instead, the AI kept spitting out slightly-off code, and no matter how many iterations I did, it did not improve. Had I known the programming language, I would have known what needed to be changed. I think that a lot of highly experienced people are using it as a short-hand to get started, and a lot of inexperienced people are going to use it to produce a lot of shoddy crap. Anyone relying on ChatGPT that doesn't already know what they're doing is setting themselves up for failure.


> said by someone who uses chatgpt extensively, it is good for the structure, to get an idea, but as a code generator it kinda sucks.

Interestingly, the same applies to text-to-image programs. Once you've used these for a while, you realize their utility and value are little more than an inspiration or a starter. Even if you wanted to ignore the ethical implications, very little they produce is useable. LLMs are amazing. However, their end-product application is overrated.


Anyone who can't figure out how to become more efficient with the aid of LLMs is a dinosaur.


I dunno about that. I honestly tried to extract _any_ value in my day to day work from LLMs, but aside from being an OK alternative to Google/SO, I mostly did find it to be a crutch.

I never had issues with quickly writing a draft or typing in code. I do realise that for a lot of people, starting on a green field is hard, but for me it's easier.

My going hypothesis is that people are just different, and some get true value out of it while others don't. If it works for you, I'm not gonna call you names for it.


I guess that it depends on how popular your thing is. If you are doing something not done before or really unique, then the hope for useful hints is lower.

If you do something done 10,000 times before, or a mix of two things done over and over, then you are more likely to get useful advice.


Exactly. It's a time-saver for bureaucratic chores. And that's great. There is no need to strive for excellence when mediocrity is required.

Useless for anything that requires originality, elaborate humor, or finesse.


Yes, try it with a proprietary language, a closed source environment and lots of domain and application knowledge required to achieve anything. There ChatGPT is completely out of it.


Here is a bunch of JSON. Output the c# classes that can deserialise it.

An intern in college could do that, but it isn’t worth our time to do.

For this function, write the unit tests. Now you do not have anything that you can blindly commit, but you are at the stage where you are reviewing code.

Could you do all of this by hand? Sure but you never would, you would use an IDE. Chatgpt is better than an IDE when you know how to use it.
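For the skeptics, the same chore also works through the API rather than the chat UI; a rough sketch (openai Python SDK v1+, made-up sample payload) of the "JSON in, deserialization classes out" workflow:

    from openai import OpenAI

    client = OpenAI()
    sample_json = '{"id": 1, "name": "Widget", "tags": ["a", "b"]}'  # made-up payload

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Generate C# classes that can deserialize this JSON:\n" + sample_json,
        }],
    )
    # Treat the output like intern code: review it before committing anything.
    print(resp.choices[0].message.content)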


I think it can be a productivity booster. At my company, I need to touch multiple projects in multiple languages. I can ask ChatGPT how to do something, such as enumerate an array, in a language I’m less familiar with so that I can more quickly contribute to a project. Using ChatGPT as a coach, I am also slowly learning new things.


I remember people saying literally the exact same thing almost word for word about Google almost a quarter century ago.


You mean the search engine that NEVER EVER gives me the documentation I'm looking for, but always goes for a vaguely related blog with hundreds of ads?


Horrible metaphor aside, at our company it's not the software developers most overusing ChatGPT.


Sounds like the perspective of someone who never gave that tool a real chance. To me it’s primarily just another aid like IDE inspections, IDE autocompletions etc.

I use GitHub Copilot mainly for smaller chunks of code, another kind of autocompletion, which basically saves me from typing, Googling or looking into the docs, and therefore keeps me in the flow.


Can you share your linkedin?


I'd avoid the use of the word crutch, it sounds ableist as fuck; my girlfriend has a joint disease and relies on a crutch to walk. In other words, while I understand it's just a figure of speech: how dare you.

It's a tool that can help, just like an IDE, code generators, code formatters, etc. No need to talk down on it in that fashion, and there's no need to look down your nose at tools or the people that use it.


How is it offensive to use crutch in this context? The implication being that it's a tool that helps people do something they'd struggle with otherwise. I don't see why anyone might be offended by that.


Why is the word "crutch" ableist, when the meaning is exactly analogous? It's a tool to help you do something you can't easily do yourself.


Your girlfriend is less able to walk and thus uses a crutch to compensate, but it doesn't let her walk as well as a normal person. Replace "walk" with "code" and the sentence works for ChatGPT if the grandparent is correct.


> It's a tool that can help, just like an IDE, code generators, code formatters, etc

And like a crutch for someone who cannot walk without it.

Or glasses (which I use), which allow me to regain almost as good vision as someone with no deformity of the eyeballs.


Someone that doesn't use the tools available to them would be like folks hammering in nails with their forehead.


You can't code without access to stackoverflow?

Official documentation is still available…


It’s an interesting question.

To effectively sue you, I believe the plaintiff would have to prove the LLM you were using was trained on that IP and it was not in the public domain. Neither seems very doable.


I don't actually think either of those things are all that hard, certainly it's a gray area until this actually happens but I think AI generation is not all that different from any other copyright situation. Even with regular copyright cases you don't need to prove "how" the copying occurred to show copyright infringement, rather you just have to show that it's the likely explanation (to the level of some standard). Point being, you potentially don't need to prove anything about the AI training as long as you can show that the AI's result is clearly identifiable as your work and is extremely unlikely to be generated any other way.

Ex. CoPilot can insert whole blocks of code with comments and variable names from copyrighted code, if those aspects are sufficiently unique then it's extremely unlikely to be produced any way other than coming from your code. If the code isn't a perfect copy then it's trickier, but that's also the case if I copy your code and remove all the comments, so it's still not all that different from the current status quo.

The bigger question is who gets sued, but I can't imagine any AI company actually making claims about the copyright status of the output of their AI, so it's probably on you for using it.


It could open quite a wide window for patent trolls though, who generally go for settlements under the threat of a protracted court battle which is of minimal cost to them, as they are often single purpose law firms that do that and only that.

Being able to have your legal counsel tell them to go bug openAI could potentially save you from quite a few anklebiters all seeking to get their own piece.


Your observation highlights the complexities of legal actions related to AI-generated content. Proving the exact source of a specific piece of content from a language model like the one I'm based on can indeed be challenging, especially when considering that training data is a mixture of publicly available information. Additionally, the evolving nature of AI technology and the lack of clear legal precedents in many jurisdictions further complicate the matter. However, legal interpretations may vary, and it's advisable for any legal proceedings to involve legal experts well-versed in both AI technology and intellectual property law. Also, check out AC football cases.


Just curious, do they have bans on "traditional" online sources like Google search results, Wikipedia, and Stack Overflow?

From my view, copying information from Google search results isn't that much different from copying the response from ChatGPT.

Notably Stack Overflow's license is Creative Commons Attribution-ShareAlike, which I believe very few people actually realize when copying snippets from there.


> Notably Stack Overflow's license is Creative Commons Attribution-ShareAlike, which I believe very people actually realize when copying snippets from there.

A lot of the snippets would not meet the standard for copyrightable code, though. At least that’s my understanding as non-lawyer.


With SO you also have no guarantee that the person had the license to post that snippet. Even that could have been copied from somewhere else. One customer was scanning for, and banning, SO snippets if that was the only source that could be determined.


> but the other is our using something we are not authorized to use because the tool has it already in its data.

We won't know if this is legally sound until a company who isn't forbidding A.I. usage gets sued and they claim this as a defense. For all we know the court could determine that, as long as the content isn't directly regurgitated, it's seen as fair use of the input data.


It's not logical, because how can the company prove that could never happen from 80,000 employees writing things?

i.e. Without ChatGPT an employee could still copy and paste something from somewhere. ChatGPT actually doesn't change the equation at all.


They are stupid and don't understand risk vs reward.


So, how do you plan to commercialize your product? I have noticed tons of chatbot cloud-based app providers built on top of ChatGPT API, Azure API (ask users to provide their API key). Enterprises will still be very wary of putting their data on these multi-tenant platforms. I feel that even if there is encryption that's not going to be enough. This screams for virtual private LLM stacks for enterprises (the only way to fully isolate).


We have a cloud offering at https://trypromptly.com. We do offer enterprises the ability to host their own vector database to maintain control of their data. We also support interacting with open source LLMs from the platform. Enterprises can bring up https://github.com/go-skynet/LocalAI, run Llama or others and connect to them from their Promptly LLM apps.

We also provide support and some premium processors for enterprise on-prem deployments.


But, in order to generate the vectors, I understand that it's necessary to use OpenAI's Embeddings API, which would grant OpenAI access to all client data at the time of vector creation. Is this understanding correct? Or is there a solution for creating high-quality (semantic) embeddings, similar to OpenAI's, but in a private cloud/on-premises environment?


Enterprises with Azure contracts are using embeddings endpoint from Azure's OpenAI offering.

It is possible to use llama or bert models to generate embeddings using LocalAI (https://localai.io/features/embeddings/). This is something we are hoping to enable in LLMStack soon.
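For reference, LocalAI exposes an OpenAI-compatible /v1/embeddings endpoint, so the client side barely changes. A sketch assuming a LocalAI instance on localhost with some embedding model configured (model name and document text are placeholders):

    from openai import OpenAI

    # Point the stock OpenAI client at the self-hosted instance; no text leaves the network.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

    resp = client.embeddings.create(
        model="bert-embeddings",  # whatever embedding model is configured in LocalAI
        input=["internal design doc, section 3 ..."],
    )
    vector = resp.data[0].embedding
    print(len(vector))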


Sentence-BERT is at least as good as OpenAI embeddings. But I think more importantly, the Azure OpenAI model API is already SOC 2 and HIPAA compliant.
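If you'd rather not run a server at all, the in-process route is even shorter; a minimal sketch with the sentence-transformers library (the model choice is just an example):

    from sentence_transformers import SentenceTransformer

    # Embeddings are computed locally; nothing is sent to any API.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(["internal design doc, section 3 ..."])
    print(embeddings.shape)  # (1, 384) for this particular model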


> Enterprises can bring up https://github.com/go-skynet/LocalAI, run Llama or others and connect to them from their Promptly LLM apps

So spin up GPU instances and host whatever model in their VPC and it connects to your SaaS stack? What are they paying you for in this scenario?


> is going to put a lot of the enterprises at ease and embrace ChatGPT in their business processes.

Except many companies deal with data of other companies, and these companies do not allow the sharing of data.


Usually that’s not a problem; it just means adding OpenAI as a data processor (at least under ISO 27017). There’s a difference between sharing data for commercial purposes (which is usually verboten), vs for data-processing purposes.


At the corp I work for, ChatGPT (even Bing) is blocked at the firewall. Hopefully now we'll be able to use it.


I've been maintaining SOC2 certification for multiple years, and I'm here to say that it's largely performative and an ineffective indicator of security posture.

The SOC2 framework is complex and compliance can be expensive. This can lead organizations to focus on ticking the boxes rather than implementing meaningful security controls.

SOC2 is not a good universal metric for understanding an organization's security culture. It's frightening that this is the best we have for now.


Will be doing a show HN for https://proc.gg, a generative AI platform I've built during my sabbatical.

I personally believe that in addition to OpenAI's offering, the ability to swap to an open source model e.g. Llama-2 is the way to go for enterprise offerings in order to get full control.


Azure's ridiculous agreement likely put a lot of orgs off. They also shouldn't have tried to "improve" upon OpenAI's APIs. OpenAI's APIs are a little underthought (particularly fine-tuning), but so what?


> we quickly learned how sensitive enterprises are when it comes to sharing their data

"They're huge pussies when it comes to security" - Jan the Man[0]

[0] https://memes.getyarn.io/yarn-clip/b3fc68bb-5b53-456d-aec5-4...


Non-use of enterprise data for training models is table-stakes for enterprise ML products. Google does the same thing, for example.

They'll want to climb the compliance ladder to be considered in more highly-regulated industries. I don't think they're quite HIPAA-compliant yet. The next thing after that is probably in-transit geofencing, so the hardware used by an institution resides in a particular jurisdiction. This stuff seems boring but it's an easy way to scale the addressable market.

Though at this point, they are probably simply supply-limited. Just serving the first wave will keep their capacity at a maximum.

(I do wonder if they'll start offering batch services that can run when the enterprise employees are sleeping...)


> don't think they're quite HIPAA-compliant yet

OpenAI offers a BAA to select customers.


Fedramp? High?


OpenAI says they offer a BAA but I haven’t heard of anyone actually being able to get to someone who could put one together.

You can get a BAA through Azure’s OpenAI service though, I believe the details are located in this document:

https://azure.microsoft.com/en-us/resources/microsoft-azure-...


Pretty sure I know an org that has a BAA with OpenAI directly. Agree that Azure is more straightforward (and what I did at my startup).


It’s probably a lot easier if you’re a corp with name recognition, but the regular folks who post online about it have said no luck so far.


Not HITRUST afaik, but they will do a HIPAA-eligible BAA with select customers. As the sister comment says, it's easier to go through Azure and you basically get the gamut of Azure compliance certs for free.


I thought they already didn't use input data from the API to train; that it was only the consumer-facing ChatGPT product from which they'd use the data for training. It is opt-in for contributing inputs via API.

https://help.openai.com/en/articles/5722486-how-your-data-is...

That said, for enterprises that use the consumer product internally, it would make sense to pay to opt-out from that input being used.


The ChatGPT model has violated pretty much all open source licenses (including the MIT license, which requires attribution. Show me one single OSS project's license attribution before arguing, please.) and is still standing. With the backing of Microsoft, I am confused. What will happen if they violate their promise and selectively train on data from competitors or potential small companies?

What is actually stopping them? Most companies won't have the firepower to go against Microsoft-backed OpenAI. How can we ensure that they can't violate this? How can they be practically held accountable?

This as far as I am concerned is "Trust me bro!". How is it not otherwise?


> The ChatGPT model has violated pretty much all open source licenses

Are you claiming this because they used copyrighted material as training data? If so, I think you're starting from the wrong point.

Please correct me if I'm wrong, but last I heard using copyrighted data is pretty murky waters legally and they're operating in a gray area. Additionally, I don't think many open source licenses explicitly forbid using their code as training data. The issue isn't just that most other companies don't have the resources to go up against Microsoft/OpenAI, it's that even if they did, it isn't clear whether the courts would find that Microsoft/OpenAI did anything wrong.

I'm not saying that I side with Microsoft/OpenAI in this debate, but I just don't think this is as clear cut as you're making it seem.


> Are you claiming this because they used copyrighted material as training data? If so, I think you're starting from the wrong point.

All open source licenses come under copyright law. That means if they violate the OSS license, the license grant is void and the material falls back to plain copyright protection. So yes, it would mean that it is trained on copyrighted material.

> Additionally, I don't think many open source licenses explicitly forbid using their code as training data.

It doesn't forbid it. For example, a permissive license like MIT can be used to train LLMs if they are in compliance. The only requirement when you train on an MIT-licensed codebase is that you need to provide attribution. It is one of the easiest licenses to comply with. It means you just need to copy-paste the copyright notice. The below is the MIT license of Ember.js.

Copyright (c) 2011 Yehuda Katz, Tom Dale and Ember.js contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

This copyright notice needs to be somewhere in ChatGPT's website/product to be in compliance with the MIT license. If it is not, the MIT license is void and you are violating the license. The end result is you are training on copyrighted material. I am more than happy to be corrected if you can find me a single OSS license attribution shown somewhere for training the OpenAI model.

Also, this can still be fixed by adding the attribution for the code that is trained on. THIS IS MY ARGUMENT. The absolute ignorance and arrogance is their motivation and agenda.

Which is why I am asking, WHAT IS STOPPING THEM FROM VIOLATING THEIR OWN TERMS AND CONDITIONS FOR CHATGPT ENTERPRISE?


Deterrence.

First offense could be excused as "blazing a trail and burning down the forest by accident".

But now they have a direct business contract with bigger companies that can lawyer up way better than open source foundations that live on donations and goodwill of code contributors.

Imagine they make a huge deal with Sony or Dell and either company can prove their "secure" enterprise plan was used for corporate espionage.

The legal and reputational repercussions could sink even a Fortune 100 company.


I thought attribution is required only if you redistribute the code. That’s why saas businesses don’t need to attribute when using open source code on their backend. Maybe a similar concept could be used for training data. I’m far from an expert so this is just a thought.


ChatGPT does redistribute the code. It's essentially the same issue as someone reading proprietary or GPL sources while working on a proprietary project: because they aren't abiding by the license, they are breaking the terms. There is no possibility of clean-room implementations with ChatGPT.


> ChatGPT does redistribute the code

My whole point is that I don't think that's legally true at the moment. There's enough difference in how generative AI works compared to pretty much anything before it that what ChatGPT legally does is up for debate. If a court rules that what ChatGPT does counts as redistribution then yes, I agree that they're likely violating copyright law, but AFAIK that ruling hasn't happened yet.


This is the wrong way to look at it. It comes from the same line of thinking as the "we need a new license for AI" argument. There is nothing stopping an LLM/AI from abiding by the license. The OSS license can be used by AIs or LLMs as long as they comply with the terms.

A license exists with terms. You can abide by the terms and use it. It doesn't matter whether an AI, a person or an alien from a distant planet is using it. They can follow the terms. This is not a technical challenge but an unwillingness to abide.

Also, are you saying a model like chatgpt can do so many complex tasks and so much text processing, but can't recognise an OSS license text of 20-ish lines?


I am not sure I can agree. What is stopping them from using only permissively licensed code and adding attribution for all the licenses on a single long page? Nothing. This is not a technical issue.


If they violate their own terms they will lose customers and money. If they violate OSS nothing happens.


If they violate their own terms, nothing happens until someone knows about it and is inclined to act against them.


US copyright/IP management is such a shitsh*w. On one hand you can get sued by patent trolls who own the patent for 'button that turns things on', or get your video delisted for recording at a mall where some copyrighted music is playing in the background; on the other hand, you get people arguing that scraping code and websites with proprietary licences is 'fair use'.


To me it’s been clear for a while: copyright is a fiction which has outlived its usefulness.


Taking this from a different perspective, let's say that ChatGPT, Copilot, or a similar service gets trained on Windows source code. Then a WINE developer uses ChatGPT or Copilot to implement one of the methods. Is WINE then liable for including Windows proprietary source code in their codebase, even if they have never seen that code?

The same would apply to any other application. What if company A uses code from company B via ChatGPT/Copilot because company B's code was used as training data? Imagine a startup database company using Oracle's database code through use of this technology.

And if a proprietary company accidentally uses GPL code through these tools, and the GPL project can prove that use, then the proprietary company will be forced to open source their entire application.


> the proprietary company will be forced to open source their entire application

Top 1 misconception about open source licenses.

GPL doesn't mean if you use the code your entire project will become GPL.

GPL means if you use the code and your project is not GPL-compatible, you are committing copyright infringement. As if you stole proprietary code. If brought to the court, it would be resolved just like other copyright infringement cases.


> What is actually stopping them? Most companies won't have the fire power to go against microsoft backed openai.

Microsoft/Amazon/Google already have competitors' data in their cloud. They could even fake encryption to get access to all the customers' disks. Also, most employees use Google Workspace or Office 365 cloud to store and share confidential files. How is it different with OpenAI in a way that makes it any more worrying?


To my understanding this train has left the station; it's gonna take much more than lousy gray laws to stop it.


    > For all enterprise customers, it offers:
    > Customer prompts and company data are not used for training OpenAI models.
    > Unlimited access to advanced data analysis (formerly known as Code Interpreter)
    > 32k token context windows for 4x longer inputs, files, or follow-ups
I'd thought all those had been available for non enterprise customers, but maybe I was wrong, or maybe something changed.


I think the real feature is this:

" We do not train on your business data or conversations, and our models don’t learn from your usage. ChatGPT Enterprise is also SOC 2 compliant and all conversations are encrypted in transit and at rest. "


Which part of that is new, because I was pretty sure they were saying "we do not train on your business data or conversations, and our models don’t learn from your usage" already. Maybe the SOC 2 and encryption is new?


They don't train on data when you either use the API or disable chat history, which is inconvenient.


yes, this is terrible. I want chat history, but I don't want them to use my data. Can't have both, even though I am paying $20/month!


Use a third party interface which uses the API directly like YakGPT or the OpenAI playgrounds, and you can save some costs that way along with a local chat history that’s not shared with OpenAI.
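Roughly like this, as a sketch (openai Python SDK v1+): API traffic is pay-per-token and not used for training by default, and the history lives in a local file you control rather than in OpenAI's web app.

    import json
    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "user", "content": "Draft a polite follow-up email about the Q3 report."}]

    resp = client.chat.completions.create(model="gpt-4", messages=history)
    history.append({"role": "assistant", "content": resp.choices[0].message.content})

    # Chat history stays on disk, under your control.
    with open("chat_history.json", "w") as f:
        json.dump(history, f, indent=2)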


Prior to releasing the chat history "feature" there was an opt-out form that could be submitted, which did not have any impact on the webapp's functionality. I'm not current enough to know if that form 1) ever had any effect, and 2) if a form-submitted opt-out is still valid given they now have the aforementioned in-app feature.


The best way of doing this I have found is using separate browser profiles.

So I have one primary profile logged in normally, and a separate tab where I turn off chat history.

So now I get the best of both worlds.


I don't see how that's the best of both worlds.

They want chat history and no training on the same conversation.


you can have different profiles on a single ChatGPT plus subscription?


Really? This seems like one Chrome extension away...


so that someone else gets your data?

Chrome extension is a no go.


Who says it can't save it to a local database?


It can, until the extension developer receives a tempting offer for it, as has happened countless times


Fork the extension and use your own then.


And you’re going to spend the time reviewing every single commit to make sure the dev didn’t sell out without telling anyone? Or risk running a potentially outdated and vulnerable extension?


And then you’ve gotta wake up in the morning and put your shoes on!

What’s with the argumentative tone? Do you think that the replier doesn’t know this?


The obvious point being that the “fork your own code and write your own kernel” attitude is simply unworkable for 99.999% of the population.

If we had to waste that much time re-inventing the loaf of bread, and then making sure that our neighbors didn’t decide to throw some raisins in our loaf, we would never get around to figuring out the next best thing: slicing it.


The argument you're making (you can't trust software whose code you haven't studied) applies to every software package ever made.


Exactly. So we have to choose: trust everything reasonably mainstream with the hope that someone is watching it, or stop functioning.


...which is a lot more work than "one Chrome extension away".


>" We do not train on your business data or conversations, and our models don’t learn from your usage. ChatGPT Enterprise is also SOC 2 compliant and all conversations are encrypted in transit and at rest. "

That's great. But can customer prompts and company data be resold to data brokers?


But, can they provide a comprehensive dump of all data it was trained on that we can examine? Otherwise my company may end up using IP that belongs to someone else.


It's exactly opposite. The entire point of an enterprise option would be that you DO train it on corporate data, securely. So the #1 feature is actually missing, yet is announced as in the works.


You probably wouldn't want that; you'd want to integrate with your data for lookups, but rarely for training a new model.


Can't believe the pushback I'm getting here. The use case is stunningly obvious.

Companies want to dump all their Excels in it and get insights that no human could produce in any reasonable amount of time.

Companies want to dump a zillion help desk tickets into and gain meaningful insights from it.

Companies want to dump all their Sharepoints and Wikis into it that currently nobody can even find or manage, and finally have functioning knowledge search.

You absolutely want a privately trained company model.


None of the use cases you are describing require training a new model. You really don't want to train a new model, that's not a good way of getting them to learn reliable facts and do so without losing other knowledge. The fine tuning for GPT 3.5 suggests something like under a hundred examples.

What you want is to get an existing model to search a well built index of your data and use that information to reason about things. That way you also always have entirely up to date data.

People aren't missing the use cases you describe, they're disagreeing as to how to achieve those.
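In other words, something like the rough retrieval-augmented sketch below, where search() stands in for whatever index you already maintain over your own documents (all names and snippets here are hypothetical):

    from openai import OpenAI

    client = OpenAI()

    def search(query: str) -> list[str]:
        # Stand-in for a real vector/keyword lookup over your wikis, tickets, spreadsheets, etc.
        return ["(top matching snippets from the company knowledge base would go here)"]

    def answer(question: str) -> str:
        context = "\n\n".join(search(question))
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Answer using only the provided company documents."},
                {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    print(answer("What did the Q3 help desk tickets complain about most?"))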


>>Companies want to dump all their Excels in it and get insights that no human could produce in any reasonable amount of time.

>>Companies want to dump a zillion help desk tickets into and gain meaningful insights from it.

>>Companies want to dump all their Sharepoints and Wikis into it that currently nobody can even find or manage, and finally have functioning knowledge search.

Mature organizations already have solutions for all of these things. If you can't mine your own data competently, you've got bigger problems than not having AI doing it for you. It means you don't have humans who understand what's going on. AI is not the answer to everything.


So these "mature" orgs are using something better than OpenAI? Can you explain?


I wish I lived in the same universe as you


I wonder if corporations would train it on emails/Exchange as well, since they are often technically company property and could contain valuable information not found in tickets/wikis.


I think those are examples of prompting, not modeling. You'd use the API to develop an app where the end user's question gets prefaced with that stuff. Modeling is more like teaching it how to sensibly use language, which can be centralized instead of each enterprise having experts in that. It would be like having in-house English teachers instead of sending people to school, based on a desire to have a corporate accent -- interesting but probably not useful in most cases.


ChatGPT doesn’t work for this. There is a huge GIGO problem here that it’s missing the organizational knowledge to disambiguate. Unless you’ve pre-told it which excel sheets are correct, this is DOA.

ChatGPT only works as well as it does because it’s been trained on a corpus of “internet accepted” answers. It can’t fucking reason about raw data. It’s a language model.


Coca Cola doesn’t want to train a model that can be bought by Pepsi.


But that's exactly the point: an enterprise offering should be able to provide guarantees like this while also allowing training, i.e. a model per tenant. I think the reality is they are doing multi-tenant models, which means they have no way to guarantee your data won't be leaked unless they disable training altogether.


I'm imagining some corporate scenario where Coca Cola or Pepsi are purposefully training models on poisoned information so they can out each other for trying to use AI services like ChatGPT to glean information about competitors via brute force querying of some type


Well, the idea is that you can't buy the training model of a competitor.


What are you talking about?


I think you missed this part:

ChatGPT Enterprise is also SOC 2 compliant and all conversations are encrypted in transit and at rest. Our new admin console lets you manage team members easily and offers domain verification, SSO, and usage insights, allowing for large-scale deployment into enterprise.

I think this will have a solid product-market-fit. The product (ChatGPT) was ready but not enterprise. Now it is. They will get a lot of sales leads.


Just the SOC2 bit will generate revenue… If your organization is SOC2 compliant, using other services that are also compliant is a whole lot easier than risking having your SOC2 auditor spend hours digging into their terms and policies.


“all conversations are encrypted … at rest” - why do conversations even need to _exist_ at rest? Seems sus to me


Chat history is helpful.


I believe the API (chat completions) has been private for a while now. ChatGPT (the chat application run by OpenAI on their chat models) has continued to be used for training… I believe this is why it’s such a bargain for consumers. This announcement allows businesses to let employees use ChatGPT with fewer data privacy concerns.


You can turn off history & training on your data


Note that turning 'privacy' on is buried in the UI; turning it off again requires just a single click.

Such dark patterns, plus their involvement in crypto, their shoddy treatment of paying users, their security incidents... make it harder for me to feel good about OpenAI spearheading the introduction of (real) AI into the world today.


> Such dark patterns, plus their involvement in crypto, their shoddy treatment of paying users, their security incidents... make it harder for me to feel good about OpenAI spearheading the introduction of (real) AI into the world today.

Interesting. My opinion is it is a great product that works well for me, I don't find my treatment as a paying user shoddy, and their security incident gives me pause.


  > I don't find my treatment as a paying user shoddy
I have never paid for a service with worse uptime in my life than ChatGPT. Why? So that OpenAI could ramp up their user-base of both free and paying users. They knowingly took on far more paying users than they could properly support for months.

There are justifications for the terrible uptime that are perfectly valid, but in the end, a customer-focused company would have issued a refund to the paying customers for the months during which they were shafted by OpenAI prioritizing growth.

That doesn't mean OpenAI isn't terrific in some ways. They're also lousy in others. With so many tech companies, the lousy aspects grow in significance as the years pass. OpenAI, because of all the reasons in my parent comment, is not off to a great start, imo.


They're not involved in crypto, just the CEO is.


That's an important correction. Thanks, I got a bit carried away with the comment. There's enough hearsay on the internet, and I don't want to contribute.

While we're at it, another exaggeration I made is "security incidents"; in fact, I am only aware of one.


If you turn off history and training, you as the user can no longer see your history, and OpenAI won't train with your data. But can customer prompts and company data still be resold to data brokers?


Yes, they bundled it under a single dark-pattern toggle so most people won’t click it.


Worse (IMO) than that is the fact that when the privacy mode is turned on, you can't access your previously saved conversations nor will it save anything you do while it's enabled. Really shitty behaviour.


What about prompt input and response output retention for x days for abuse monitoring? does it not do that for enterprise? For Microsoft Azure's OpenAI service, you have to get a waiver to ensure that nothing is retained.


>Customer prompts and company data are not used for training OpenAI models.

That's great. But can customer prompts and company data be resold to data brokers?


I'm going to see if the word "Enterprise" convinces my organization to allow us to use ChatGPT with our actual codebase, which is currently against our rules.


No copilot too?


I can't believe any organization (except open source projects) allows the use of co-pilot


Me too, but then I see everyone hosting their code on github and I’m not quite sure what the substantial difference is.


Loads of them do. If you are already using GitHub Enterprise, it doesn't meaningfully change anything from a security perspective.


Last I checked:

- GPT-4 (ChatGPT Plus): has max 4K tokens?

- GPT-4 API: has max 8K tokens (for most users atm)

- GPT-3.5 API: has max 16K tokens

I'd consider the 32K GPT-4 context the most valuable feature. In my opinion, OpenAI shouldn't discriminate in favor of large enterprises; it should be equally available to normal (paying) customers.


If you pick ChatGPT with GPT-4 and select the Plugins version I believe the context window is 8K.


Thanks. I'm currently using the API models (even GPT 3.5 16K) for things that require a larger context. So much for "Priority access to new features and improvements" as advertised with Plus.


It pretty much is if you use OpenAI via Azure, or if you're large enough to talk to their sales (the 2x faster is dedicated capacity, I'm guessing).


Everything but the 32K version and the 2x speed is the same as the consumer platform.


https://news.ycombinator.com/item?id=37298864

Having conversations saved to go back to, like in the default setting on Pro (which is disabled when a Pro user turns on the privacy setting), is another big difference.


Would be great to be able to search them.


32k is available via API


For everyone?


Nope, I don't know anyone who has access to the 32k model. The best that's widely available is GPT 3.5 16k.


> Customer prompts and company data are not used for training OpenAI models.

This is borderline extortion, and it's hilarious to witness as someone who doesn't have a dog in this fight.


Not really, they want some users to give them conversation history for training purposes and offer cheaper access to people willing to provide that.


This assumes the portion of the enterprise fee related to this feature is only large enough to cover the cost of losing potential training data, which is an absurd assumption that can't be proven and has no basis in economic theory.

Companies are trying to maximize profit; they are not trying to minimize costs so they can continue to do you favors.

These arguments crop up frequently on HN: "This company is doing X to their customers to offset their costs." No, they are a company, and they are trying to make money.


The fact that companies want to maximise profits doesn't prove the point you think it does.

Nobody is arguing that there's an exact matching of value to the company between 1 user giving OpenAI permission to use their chat history for future training and 1 user paying $20/month. But based on your simplistic view, no company would ever offer a free tier because it's not directly maximising revenue.

It's very obvious that getting lots of real-world examples of users using ChatGPT is beneficial for multiple reasons - from using them in future training runs (or fine-tuning), to analysing what users want to use LLMs for, to analysing what areas ChatGPT is currently performing well or badly in, etc.

So it's not about blankly and entirely "offsetting costs"; it's that both money into their bank account and this sort of data into their databases are beneficial to the long-term profitability of the company, even though only one of them is direct and instant revenue.

Before ChatGPT was released for the world to use, OpenAI were even paying people (both employees and not) to have lots of conversations with it for them to analyse. The exact same logic that justified that justifies allowing some users to pay some or all of the fee for the service in data permissions rather than money.

I'm speaking from experience making these sorts of business decisions, and to a company like OpenAI this is just basic common sense.


Exactly, there is an opportunity cost to NOT training on this data.


As long as they provide free Enterprise access for all those whose data they already stole...


I assume that means they don't train on company data that is sent through ChatGPT Enterprise.

I don't think they're removing all instances of your company from their existing data sources, which would make sense to call "borderline extortion".


Interesting, but I am a bit disappointed that this release doesn't include fine-tuning on an enterprise corpus of documents. This only looks like a slightly more convenient and privacy-friendly version of ChatGPT. Or am I missing something?


At the bottom, in their coming soon section: "Customization: Securely extend ChatGPT’s knowledge with your company data by connecting the applications you already use"


Great, now ChatGPT can train on outdated documents from the 2000s, create more confusion for new people, and give us more headaches.


On the other hand, there was a lot of knowledge in those documents that effectively got lost - while the relevant tech is still underpinning half the world. For example: DCOM/COM+.


I think this is actually of great value.


I saw it, but it only mentions "applications" (whatever that means) and not bare documents. Does this mean companies might be able to upload, say, PDFs, and fine-tune the model on that?


Pretty unlikely. Generally you don't use fine-tuning for bare documents. You use retrieval augmented generation, which usually involves vector similarity search.

Fine-tuning isn't great at learning knowledge. It's good at adopting tone or format. For example, a chirpy helper bot, or a bot that outputs specifically formatted JSON.

I also doubt they're going to have a great system for fine-tuning. Successful fine-tuning requires some thought into what the data looks like (bare docs won't work), at which point you have technical people working on the project anyway.

Their future connection system will probably be in the format of API prompts to request data from an enterprise system using their existing function fine-tuning feature. They tried this already with plugins, and they didn't work very well. Maybe they'll come up with a better system. Generally this works better if you write your own simple API for it to interface with which does a lot of the heavy lifting to interface with the actual enterprise systems, so the AI doesn't output garbled API requests so much.


When I first started working with GPT I was disappointed by this. Like the previous commenter, I thought I could fine-tune by adding documents and it would add them to the "knowledge" of GPT. Instead I had to do what you suggest: vector similarity search, adding the relevant text to the prompt.

I do think an open line of research is finding some way for users to just add arbitrary docs to the LLM in an easy way.
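For anyone who hasn't built this yet, the retrieve-then-prompt loop is smaller than it sounds. A minimal sketch (using the openai Python package, 0.x-style API; the model choices, helper names, and top-k are illustrative assumptions, not a recommendation from OpenAI):

  import numpy as np
  import openai

  def embed(texts):
      # One embedding vector per input string
      resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
      return [np.array(d["embedding"]) for d in resp["data"]]

  def top_chunks(question, chunks, k=3):
      # Rank chunks by cosine similarity to the question
      q = embed([question])[0]
      vecs = embed(chunks)
      sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in vecs]
      ranked = sorted(zip(sims, chunks), reverse=True)
      return [c for _, c in ranked[:k]]

  def answer(question, chunks):
      # Stuff the most similar chunks into the prompt as context
      context = "\n\n".join(top_chunks(question, chunks))
      resp = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          temperature=0,
          messages=[
              {"role": "system", "content": "Answer using only the provided context."},
              {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
          ],
      )
      return resp["choices"][0]["message"]["content"]

In practice you'd embed the chunks once and store the vectors (pgvector, FAISS, etc.) rather than re-embedding on every question, but the shape of the pipeline is the same.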


Yes, this would definitely be a game changer for almost all companies. Considering how huge the market is, I guess it's pretty difficult to do, or it would be done already.

I certainly don't expect a nice drag-and-drop interface where I can put in my Office files and then ask questions about them to arrive in 2023. Maybe 2024?


That would be the absolute game-changer. Something with the "intelligence" of GPT-4, but it knows the contents of all your stuff - your documents, project tracker, emails, calendar, etc.

Unfortunately even if we do get this, I expect there will be significant ecosystem lock-in. Like, I imagine Microsoft is aiming for something like this, but you'd need to use all their stuff.


There are great tools that do this already in a support-multiple-ecosystems kind of way! I'm actually the CEO of one of those tools, Credal.ai, which lets you point-and-click connect accounts like O365, Google Workspace, Slack, Confluence, etc., and then use OpenAI, Anthropic, etc. to chat/Slack/Teams/build apps drawing on that contextual knowledge - all in a SOC 2 compliant way. It does use a retrieval-augmented generation approach (rather than fine-tuning), but the core reason for that is just that it tends to offer better results for end users than fine-tuning on the corpus of documents anyway! Link: https://www.credal.ai/


What are the limitations on adding documents to your system? Your website doesn't particularly highlight that feature set, which it probably should if you support it!


Thanks for the feedback! Going to make some changes to the website to reflect that later today! Right now we support connecting a Google Doc, a Google Sheet, PDFs from Google Drive, a Slack channel, or a Confluence space. O365, Notion, and a couple of other source integrations are in beta. We don't technically have restrictions on volume; the biggest customers we have keep around 100 GB of data with us in total. If you were trying to connect a terabyte worth of data, that might be a conversation about pricing! :)


You can use https://Docalysis.com for that. Disclosure: I am the founder of Docalysis.


Your pricing seems to eliminate some use cases, including mine.

Rather than wanting to import N documents per month, I would want to import M documents all at once, then use that set of documents until at some future time I want to import another batch of K documents (probably a lot smaller than M) or just one document once in a while.

By limiting it to a fixed number of documents per month, it eliminates all the applications where you need to import a complete corpus before the service is useful.


Thanks, I'll have a look!


I believe the latest version of Elastic offers this.


Totally agree. Retrieval-augmented generation is still the preferred way to give the LLM more knowledge. Fine-tuning is mostly useful for adapting the base model to another task. I wrote about this in a recent blog post: https://vectara.com/fine-tuning-vs-grounded-generation/.

Does anyone know how this new capability works in terms of where the model inference is done? Would it still be on the OpenAI side, or is this going to be on the customer side?


In your opinion, is it an either or scenario? Or would fine-tuning on docs + RAG be even more powerful?


I've been wondering this myself lately.

After using RAG with pgvector for the last few months with temperature 0, it's been pretty great with very little hallucination.

The small context window is the limiting factor.

In principle, I don't see the difference between fine-tuning on a bunch of prompts along the lines of "here is another context section: <~4K-n tokens of the corpus>" and the same text appearing in a RAG prompt anyway.

Maybe the distinction of whether it is for "tone" or "context" is based on the role of the given prompts and not restricted by the fine-tuning process itself?

In theory, fine-tuning it on ~100k tokens like that would allow for better inference, even with the RAG prompt that includes a few sections from the same corpus. It would prevent issues where the vector search results are too thin despite their high similarity. E.g. picking out one or two sections of a book which is actually really long.

For example, I've seen some folks use arbitrary chunking of tokens in batches of 1k or so as an easy config for implementation, but that totally breaks the semantic meaning of longer paragraphs, and those paragraphs might not come back grouped together from the vector search. My approach there has been manual curation of sections allowing variations from 50 to 3k tokens to get the chunks to be more natural. It has worked well but I could still see having the whole corpus fine-tuned as extra insurance against losing context.
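For what it's worth, even a crude heuristic that packs whole paragraphs into chunks avoids most of the damage done by fixed 1K-token windows. A rough sketch (word count used as a stand-in for tokens; the thresholds are arbitrary assumptions, not from any particular library):

  def chunk_by_paragraph(text, max_tokens=500, min_tokens=50):
      # Greedily pack whole paragraphs into chunks instead of cutting mid-paragraph
      chunks, current, size = [], [], 0
      for para in text.split("\n\n"):
          n = len(para.split())  # crude proxy for token count
          if current and size + n > max_tokens:
              chunks.append("\n\n".join(current))
              current, size = [], 0
          current.append(para)
          size += n
      if current:
          tail = "\n\n".join(current)
          if size >= min_tokens or not chunks:
              chunks.append(tail)
          else:
              # Glue a tiny trailing chunk onto the previous one rather than keeping it alone
              chunks[-1] += "\n\n" + tail
      return chunks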


It's not impossible that fine-tuning would also help RAG, but it's certainly not guaranteed and it's hard to control. Fine-tuning essentially changes the weights of the model and might result in other, potentially negative outcomes, like loss of other knowledge or capabilities in the resulting fine-tuned LLM.

Other considerations: (A) would you fine-tune daily? weekly? as data changes? (B) Cost and availability of GPUs (there's a current shortage)

My experience is that RAG is the way to go, at least right now.

But you have to make sure your retrieval engine works optimally, getting the very most relevant pieces of text from your data: (1) using a good chunking strategy that's better than arbitrary 1K or 2K chars, (2) using a good embedding model, (3) using hybrid search, and a few other things like that.
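A toy illustration of the hybrid-search idea in point (3), blending a lexical score with a vector-similarity score. The naive keyword overlap and the 0.7 weighting are placeholders (a real system would use BM25 or similar); this is not Vectara's implementation:

  import numpy as np

  def keyword_score(query, doc):
      # Naive lexical overlap; a real system would use BM25 (e.g. Elasticsearch/OpenSearch)
      q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
      return len(q_terms & d_terms) / max(len(q_terms), 1)

  def vector_score(q_vec, d_vec):
      # Cosine similarity between precomputed embeddings
      return float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

  def hybrid_rank(query, q_vec, docs, doc_vecs, alpha=0.7):
      # alpha weights semantic similarity against exact keyword matches
      scored = [(alpha * vector_score(q_vec, v) + (1 - alpha) * keyword_score(query, d), d)
                for d, v in zip(docs, doc_vecs)]
      return [d for _, d in sorted(scored, key=lambda s: s[0], reverse=True)]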

Certainly the availability of longer sequence models is a big help

Sharing this relevant discussion from LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7101638...


Yeah, I'll be curious to see what it means by this. Could be a few things, I think:

- Codebases

- Documents (by way of connection to your Box/SharePoint/GSuite account)

- Knowledgebases (I'm thinking of something like a Notion here)

I'm really looking forward to seeing what they come up with here, as I think this is a truly killer use case that will push LLMs into mainstream enterprise usage. My company uses Notion and has an enormous amount of information on there. If I could ask it things like "Which customer is integrated with tool X" (we keep a record of this on the customer page in Notion) and get a correct response, that would be immensely helpful to me. Similar with connecting a support person to a knowledgebase of answers that becomes incredibly easy to search.


Azure-hosted GPT already lets you "upload your own documents" in their playground; it seems to be similar to how ChatGPT GPT-4 Code Interpreter handles file uploads.


You don't fine-tune on a corpus of documents to give the model knowledge, you use retrieval.

They support uploading documents for that via the code interpreter, and they're adding connectors to applications where the documents live, so I'm not sure what more you're expecting.


Yes, but what if they are very large documents that exceed the maximum context size, say, a 200-page PDF? In that case won't you be forced to do some form of fine-tuning, in order to avoid a very slow/computationally expensive on-the-fly retrieval?

Edit: spelling


Typical retrieval methods break up documents into chunks and perform semantic search on relevant chunks to answer the question.


Fine-tuning the LLM in the way that you're mentioning is not even an option: as a practical rule, fine-tuning the LLM will let you do style transfer, but your knowledge recall won't improve (there are edge cases, but none apply to using ChatGPT).

That being said you can use fine tuning to improve retrieval, which indirectly improves recall. You can do things like fine tune the model you're getting embeddings from, fine tune the LLM to craft queries that better match a domain specific format, etc.

It won't replace the expensive on-the-fly retrieval but it will let you be more accurate in your replies.

Also, retrieval can be infinitely faster than inference, depending on the domain. In well-defined domains you can run old-school full-text search and leverage the LLM's skill at crafting well-thought-out queries. In that case retrieval runs at the speed of your I/O.
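That pattern is easy to prototype: ask the model for a terse query, run it against whatever full-text index you already have, and answer from the hits. A sketch (run_search is a hypothetical stand-in for your existing index, not a real API):

  import openai

  def craft_query(question):
      # Let the model turn a natural-language question into a short keyword query
      resp = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          temperature=0,
          messages=[
              {"role": "system",
               "content": "Rewrite the question as a short full-text search query. "
                          "Return only the query."},
              {"role": "user", "content": question},
          ],
      )
      return resp["choices"][0]["message"]["content"].strip()

  def answer_from_index(question, run_search):
      hits = run_search(craft_query(question))  # plain inverted-index lookup, I/O-bound
      context = "\n\n".join(hits[:5])
      resp = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          temperature=0,
          messages=[
              {"role": "system", "content": "Answer using only the provided context."},
              {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
          ],
      )
      return resp["choices"][0]["message"]["content"]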


We have >200-page PDFs at https://docalysis.com/ and there's on-the-fly retrieval. It's not more computationally expensive than something like searching one's inbox (I'd imagine you have more than 200 pages' worth of emails in your inbox).


Retrieval Augmented Generation would be something to check out. There was a good intro on the subject posted here a week or 3 ago.


This is one of the reasons we decided to go with Databricks. Embed all the things for RAG during ETL.


Well the message in this video certainly did not age well: https://www.youtube.com/watch?v=smHw9kEwcgM

TLDR: This might have just killed a LOT of startups


Haha, I also thought about that Y Combinator video. Yep, their prediction didn't age well, and it's becoming clear that OpenAI is actually a direct competitor to most of the startups that are using their API. Most "chat your own data" startups will be killed by this move.


Yeah like, if OpenAI can engineer chatGPT, they can sure as hell engineer a lot of the apps built on top of chatGPT out there.


No different than Apple, then. A lot of value is provided to customers by providing these features through a stable organization not likely to shutter within 6 months, like these startup "ChatGPT Wrappers". I hope that they are able to make a respectable sum and pivot.


I think almost every startup is focusing on enterprise because it sounds lucrative, but selling to an enterprise can be painful enough to qualitatively offset the benefits.

Personally I love what Evenup Law is doing. Basically find a segment of the market that runs like small businesses and that has a lot of repetitive tasks they have to do themselves and go to them. Though I can't really think of other segments like this :)


In 2023 you get to pay to be the (premium) product.


If your entire startup was just providing a UI on top of the ChatGPT API, it probably wasn't that valuable to begin with and shutting it down won't be a meaningful loss to the industry overall.


There's a typical presumed business intuition that any large company will confer business on a host of "satellite companies" who offer some offshoot of the product's value proposition catered to a niche sector. Most of these, however, are just "OpenAI API + a prefix prompt + user interface + marketing". The issue (which has been brought up since the release of the GPT-3 API three years ago) is that no startup can offer much more value than the API alone offers, and thus it's comparatively easier for OpenAI to capitalize on this business than it was for larger companies in analogous situations in the past.


This has been the weirdest part of the current wave of AI hype, the idea that you can build some kind of real business on top of somebody else's tech which is doing 99.9% of the work. There are hard limits on how much value you can add.

If you want to build something uniquely useful, you probably have to do your own training at least.


Any startup that is using ChatGPT under the hood is just doing market research for OpenAI for free. The same happened when people started experimenting with GPT-3 for code completion, right before being replaced by Copilot.

If you want to build an AI start-up and need an LLM, you must use Llama or another model that you can control and host yourself; anything else is basically suicide.


>Any startup that is using ChatGPT under the hood is just doing market research for OpenAI for free

It's not free if you have paying clients.

> If you want to build an AI start-up and need a LLM, you must use Llama or another model than you can control and host yourself, anything else is basically suicide.

You're still doing market research for OpenAI. Just because you aren't using their model doesn't mean they can't copy your UX. Prompts are not viable trade secrets after all.


> It's not free if you have paying clients.

No early stage start-up has revenues covering their expenses. But in fact you're right, it's not even “free”, it's “investor-subsidized” market research.

> You're still doing market research for OpenAI. Just because you aren't using their model doesn't mean they can't copy your UX. Prompts are not viable trade secrets after all.

Prompts aren't viable trade secrets, but fine-tuning datasets and strategies, customer data[1], customer habits, user feedback, etc. are. And if you're using OpenAI, you're giving all that to them. Also, given their positioning, they cannot address any use case that involves deploying your model inside your customer's infrastructure, so this kind of market research is completely irrelevant for them.

[1]: And don't get fooled by wordings saying that they don't train on customer data; they are still collecting much more info than you'd like them to. For instance, even just knowing the context size that users like to work with in different scenarios is really interesting data for them, and you can be sure that they collect it and adapt to it.


"Unlimited access to advanced data analysis (formerly known as Code Interpreter)"

Code Interpreter was a pretty bad name (not exactly meaningful to anyone who hasn't studied computer science), but what's the new name? "advanced data analysis" isn't a name, it's a feature in a bullet point.


Also I'd heard anecdotally on the internet (Ethan Mollick's twitter I think) that 'code interpreter' was better than GPT 4 even for tasks that weren't code interpretation. Like it was more like GPT 4.5. Maybe it was an experimental preview and only enterprises are allowed to use it now. I never had access anyway.


I still have access in my $20/m non-Enterprise Pro account, though it has indeed just updated its name from Code Interpreter to Advanced Data Analysis. I haven't personally noticed it being any better than standard GPT4 even for generation of code that can't be run by it (ie non-Python code).


I've been using it heavily for the last week - hopefully it doesn't become enterprise only... it's very convenient to pass it some examples and generate and test functions.

And it does seem "better" than standard 4 for normal tasks


Ah I'd better start using it more again and see if I find it better too


I also have a pro account, and I’ve looked for and not seen code interpreter in my account. Am I just missing it?


You may have to go into settings and enable it under beta options.


Thank you, that worked!


GPT 4.5 concept was from latent space! https://www.latent.space/p/code-interpreter#details


In my account it now says "Advanced Data Analysis" instead of "Code Interpreter". Looks like it is the new name.


I had the old name, reloaded the page, and got the new name.

What a terrible name! They should have asked ChatGPT for suggestions.

