PdfGptIndexer: Indexing and searching PDF text data using GPT-2 and FAISS (github.com/raghavan)
311 points by raghavankl on July 8, 2023 | 137 comments



The most frustrating thing about the many, many clones of this exact type of idea is that pretty much all of them require OpenAI.

Stop doing that.

You will have way more users if you make OpenAI (or anything that requires the cloud) the 'technically possible, but you'll have to jump through hoops to make it happen' option, instead of the other way around.

The best way to make these apps IMO is to make them work entirely locally, with an easy string that's swappable in a .toml file to any huggingface model. Then if you really want OpenAI crap, you can make it happen with some other docker secret or `pass` chain or something with a key, while changing up the config.

The default should be local-first: do as much as possible locally, and then, if the user /really/ wants to, have the collated prompt send only a small set of tokens to OpenAI.
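
A minimal sketch of that pattern (config keys and file names here are hypothetical; any Hugging Face model id would work):

    # config.toml (hypothetical layout):
    #   [model]
    #   provider = "local"                                # or "openai", strictly opt-in
    #   name = "sentence-transformers/all-MiniLM-L6-v2"   # any Hugging Face model id
    import tomllib  # stdlib in Python 3.11+; the `toml` package works on older versions

    with open("config.toml", "rb") as f:
        cfg = tomllib.load(f)["model"]

    if cfg["provider"] == "local":
        from sentence_transformers import SentenceTransformer
        embedder = SentenceTransformer(cfg["name"])
    else:
        # only touch the cloud when the user has explicitly configured it
        ...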


It's difficult to compete. A small business might answer 10,000 requests to their chat bot. The options are

- Pay OpenAI: less than $50/mo

- Manage cloud GPUs and hire ML engineers: > $1000/mo

- Buy a local 4090 and put it under someone's desk: no reliability, ~$1500 fixed

Any larger business will need scalability and you still can't compete with openai pricing.

Maybe one of you startup-inclined people can make an openllama startup that charges by request and allows for fine-tuning and vector storage.


I’ve got an expensive GPU at home I’m not even using because there aren’t that many things to do with it. Give me more local options.





Let other people pay you to run their stuff on your hardware with Vast.ai.


Even if you are not into coding there are many good AI tools that run locally. Two very easy examples:

I've had great fun with the "Easiest 1-click way to install and use Stable Diffusion on your computer."

https://github.com/easydiffusion/easydiffusion

And while Whisper is from OpenAI, it is trivial to use locally and extremely useful.

https://github.com/chidiwilliams/buzz


It depends heavily on the use case, not org size. I consult for a ~70-person org that needs to process ~1M tokens per day. That costs $30K per day on the OpenAI ChatGPT API. I'm sure this is not an extraordinary case.


Each person in the org needs 1M GPT-4 tokens, and semantic search can't be used to trim queries? I would be super curious to know more about this use case.


The data doesn't scale according to employee size. If they manage to cut the headcount in half, they'd still need to process the same amount of info.

The use case is based on public information on the internet. News articles, PRs, social media posts, etc.

LLMs are used to extract info from text in a structured format. It used to take several classification and NLP models to do the job, but now a single LLM can do it faster and with better accuracy.


I have a 4080, let’s do a startup. #cancode #hashomelab


> Maybe one of you startup inclined people can make an openllama startup that charges by request

I'm currently building www.lalamon.us specifically to provide a fully hosted open source model experience. One slight difference is that I'm providing a private chat instance for each user, so charging based on hours of active chat usage seemed to make more sense. Per-request charging seems more unpredictable for users, but I'd be interested in hearing the case either way.

Feel free to reach out with more questions if interested; my email is in my profile.


Doing this. We soft-launched yesterday with a paid Falcon-40B playground - three models for now: Falcon 40B instruct, uncensored, and base. Adding API and per-token pricing this week.

https://api.llm-utils.org/

And more models coming soon.

Vector storage isn’t on the roadmap (what stops using a separate vector store from working well? Could add it to the roadmap, but want to understand more first), and we could add fine-tuning if it’s a common request.


Lots of people using LLMs to make chat bots from their existing datasets: customer service troubleshooting, FAQs, billing, scheduling. Being able to upload their own pdfs, spreadsheets, docx, crawl their home page, lets the chat bot become personalized to their use case. While you could locally query your own vectordb before prompting, people buy paid service so they won't have to manage any of the technical details.

If people can drag and drop some files from their NAS and you parse them with Apache Tika or similar (https://tika.apache.org/), they can start using personalized, branded bots. It also lets you do things like refusing to answer if the vector database returns nothing and the use case requires a specific answer from the docs only (not letting the LLM make stuff up).
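
A rough sketch of that parsing step with the tika Python bindings (the file name is hypothetical; tika-python needs a local Java runtime and spins up a Tika server on first use):

    from tika import parser  # pip install tika

    def extract_text(path):
        parsed = parser.from_file(path)        # handles pdf, docx, xlsx, html, ...
        return (parsed.get("content") or "").strip()

    text = extract_text("uploads/billing-faq.pdf")  # hypothetical uploaded file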


For those use cases the “custom ChatGPT” tools I linked here might be better https://news.ycombinator.com/item?id=36649777


Shouldn't you use a .com tld?

Will your pricing be competitive with Replicate?


Not secure... NET::ERR_CERT_COMMON_NAME_INVALID Subject: *.safezone.mcafee.com

Issuer: McAfee OV SSL CA 2

Expires on: Aug 3, 2023

Current date: Jul 8, 2023

PEM encoded chain: -----BEGIN CERTIFICATE----- MIIGfzCCBWegAwIBAgIQKt9VNrFtaozA1bILX1OcfzANBgkqhkiG9w0BAQsFADBk MQswCQYDVQQGEwJVUzELMAkGA1UECBMCQ0


FastChat-T5 can work for such a use case and it runs on (beefy) CPUs. With a $700/month instance, it can do 4 conversations simultaneously, without needing GPUs.

The instant a company has sensitive data, this becomes very viable.


Wait until winter time and heat your house!


Good double use of that low entropy energy. Heat pumps excepted.


People don't scale. This is personal. Only option 3 is a good choice for people on a site with "hacker" in the name.


What do you (or anyone else, feel free to chime in) do with other LLMs that makes them useable for anything that is not strictly tinkering?

Here is my premise: We are past the wonder stage. I want to actually get stuff done efficiently. From what I have tested so far, the only model that allows me to do that halfway reliably is GPT-4.

Am I incompetent or are we really just wishfully thinking in HN spirit that other LLMs are a lot better at being applied to actual tasks that require a certain level of quality, consistency and reliability?


I still wonder what makes GPT-4 so much better than its contemporaries. That's why I find it distasteful when I see tons of people trying to explain how GPT-4 works starting from a simple neural network; tons of people already know and do that, but none of them has built anything that comes close to GPT-4.


> I still wonder what makes GPT-4 so much better than its contemporaries.

OpenAI have had many years to craft their dataset down from the noisy public datasets, and GPT-4 is (supposedly) a mixture of 8 "expert models", each of which is 220B (5x+ larger than Falcon 40B), with a total of 1.7T parameters (3x+ Google's huge 540B PaLM). The hardware and software to train networks of that scale is also a deep moat. Relatively speaking, the model architecture ("gpt from scratch") is the easiest piece.


From my understanding, GPT-4 is the biggest, or one of the biggest. It was trained on low-quality internet datasets, like the others. What makes it different is post-training on custom data with human supervision; we know they even outsourced that to Africa. Second, they integrated it with external tools, like a Python interpreter and an internet browser. But the first is most important. Also, most likely they have experimented and found some tricks which make it a bit better.


They pay tons of people to type out conversations that they can feed into it. It's just a lot of people doing a lot of work.


This line of thinking only works if it's impossible to imagine a world where OpenAI isn't the leader. In 2 years, if the non-OpenAI models are better, then it will serve us much better to allow these tools to work with other models as well.


Since OpenAI is all just APIs with simple interfaces, I don't think that plugging a different, capable model in whatever tool you are building is going to be an issue.


You are correct in this assessment. A majority of individuals and startups playing around with turning LLMs into products aim to be prepared for the arrival of the subsequent generation of models. When that occurs, they'll already have a product or company in place and can simply integrate the new models.

Models are getting commoditized, well executed ideas are not.


> is that pretty much all of them require OpenAI.

They're not here to release an actual product. They're here to release part of a CV proving they have "OpenAI" experience. I'm assuming this is the result of OpenAI not actually having any homegrown certification program of their own.


> OpenAI not actually having any homegrown certification program

A bit off topic but where are certifications (e.g. Cisco, Microsoft) useful? I am sure they are useful (both to candidates and companies) because people go to the effort to get these certs, and if they were useless everyone would have stopped long ago. I don't assume people do it for ego satisfaction.

But I've never worked anywhere where it has come up as an interview criterion (nobody has ever pointed it out when we are looking at a resume, for example). Is it a big business thing? Is it just an HR thing?


Years ago, companies could get discounts if they were a “certified gold partner” or whatever.

To be a partner, the company would need a certain number of certifications among their employees, so there was tangible value to companies who either used a ton of Microsoft licensing or Cisco/Dell hardware, or resold those to their own clients (better discount equating to higher margin).

In some cases, getting the higher level certifications like Cisco CCIE was a virtual guarantee of a good job.

I feel like this has become less of a thing in recent years, but I’m not involved in that space anymore.


Definitely still a thing with Azure and Atlassian


Could you elaborate or drop some links? Are people getting discounts on Azure cloud bills with certifications?


It is a consulting / business partner thing. Different levels in the business partner programs require minimum number of certified employees in your consulting firm. So if you work in that slice of the industry, certifications matter. Outside of that... not so much.


Mainly when applying for a corporate job where you have 0 referrals. It's a guidepost that you at least have some idea what you're doing and are worth interviewing when people can't find someone who knows you and your previous work.


I've only ever seen it as a people who don't have a job thing.



This is a well-written tutorial and it's exactly what I was looking for! Thanks so much.


The only OpenAI 'crap' being used here is to generate the embeddings. Right now, OpenAI has some of the best and cheapest embeddings possible, especially for personal projects.

Once the vectors are created tho, you're completely off the cloud if you so choose.

You can always swap out the embedding generator too, because LangChain abstracts that for your exact gripes.
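
For instance, with the LangChain APIs of the time, that swap is roughly a one-line change (the model name here is just one option, not what the repo uses):

    from langchain.embeddings import HuggingFaceEmbeddings  # instead of OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    chunks = ["first chunk of extracted PDF text", "second chunk"]  # illustrative
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = FAISS.from_texts(chunks, embeddings)  # fully local index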

Everything else is already using huggingface here and can be swapped out for any other model besides GPT2 which supports the prompts.


> Once the vectors are created tho, you're completely off the cloud if you so choose.

Ehr, no? You'll also need to create an embedding of your query, which makes you totally dependent on OpenAI. If you swap out the embedding algorithm you will have to regenerate all the embeddings as well; they might not even be the same size.


Ah, I see what the top comment was implying. I was a bit short sighted on that side. Yes, you'd be tied to OpenAI for any new queries you need to generate. There could be some ways to offload that (vector of a vector) but it is a cloud dependency. I'd argue not a cost dependency based on how cheap these are.


Do you have citations on OpenAI embeddings being some of the cheapest and best? The signs I've seen point almost in the opposite direction.


The only embeddings I currently see listed on https://openai.com/pricing are Ada v2, at $0.1/million tokens.

Even if the alternative is free, how much do you value your time, how long will it take to set up an alternative, and how much use will you get out of it? If you're getting less than a million tokens and it takes half an hour longer to set up, you'd better be a student with literally zero income, because that cost of time matches the UN abject-poverty level. This is also why it's never been the year of Linux on the desktop, and why most businesses still don't use LibreOffice and GIMP.

I can't speak for quality; even if I used that API directly, this whole area is changing too fast to really keep up.


If you look at a embeddings leaderboard [1], one of the top competitors called InstructorXL [2] is just a pip install away. It's neck and neck with Ada v2 except for a shorter input length and half the dimensions, with the added benefit that you'll always have the model available.

Most of the other options just work with the transformers library.

[1] https://huggingface.co/spaces/mteb/leaderboard

[2] https://github.com/HKUNLP/instructor-embedding
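
If I remember the package right, usage is roughly this (the instruction string is whatever fits your retrieval task):

    # pip install InstructorEmbedding sentence-transformers
    from InstructorEmbedding import INSTRUCTOR

    model = INSTRUCTOR("hkunlp/instructor-xl")
    vectors = model.encode([
        ["Represent the document for retrieval:", "FAISS is a library for similarity search."],
    ])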


Setting up an embedding alternative out of Hugging Face sentence transformers is fairly easy. The magical thing OpenAI does is that they will create an embedding of 8192 tokens at a time, while most other embedding models will force you to chunk your documents into 512-token sequences, losing a lot of context, multiplying your query results, search times, etc.


If you've never coded or used Python before, yeah, go with OpenAI. Otherwise, generating embeddings with SentenceBERT takes 5 minutes.

And from my personal experience Ada embeddings are not the best. They are large (which makes approximate searching harder), are distributed weirdly, and, simply put, other embeddings give better results for retrieval.

Another advantage is that you are not at OpenAI's whim: they just announced the deprecation of some previous models. What are you going to do when they deprecate Ada v2 and you've built a huge system on top of it? You'll have to regenerate embeddings and hope everything still works just as well.


Yes, exactly this. I also want to say I'm not someone who generally thinks open models are better; I think embeddings just haven't been a focus for OpenAI and it shows. Maybe in the future they will focus on it.


Yes, when new technology comes you may need to upgrade. OpenAI aren't heroes, but they are covering the cost to move people from old to new embedding models.


Running models on your own hardware isn't just about cost, there are privacy concerns too.


No, I have anecdotal evidence using it. How does it compare in your usage of OpenAI and competitors?


I'm interested if you have specific options you think are better and/or cheaper.


Can you elucidate on what those signs are? Thanks in advance


As per the Massive Text Embedding Benchmark (MTEB) Leaderboard maintained by Huggingface, OpenAI's embedding models are not the best.

https://huggingface.co/spaces/mteb/leaderboard

Of course, that's far from saying that they're the worst, or even headed that way. Just not the best (those would be a couple of fully opensource models, including those of the Instructor family, which we use at my workplace).


What? It's only one file, and it definitely looks like it's using openAI to make the actual queries.

    qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())


You can change the arguments to from_llm() to point to a local model instead. Example here: https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML/discuss...
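
Sketch of what that can look like with LangChain's Hugging Face wrapper (model id and kwargs are illustrative, not from the repo; a GGML build like the one linked would instead go through a llama.cpp-style wrapper such as LlamaCpp or CTransformers):

    from langchain.llms import HuggingFacePipeline
    from langchain.chains import ConversationalRetrievalChain

    local_llm = HuggingFacePipeline.from_model_id(
        model_id="mosaicml/mpt-7b-instruct",      # any local causal LM from the Hub
        task="text-generation",
        model_kwargs={"trust_remote_code": True},
    )
    # `db` is the existing FAISS vector store from the script
    qa = ConversationalRetrievalChain.from_llm(local_llm, db.as_retriever())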


ClosedAI has freaked me out with how much power they have, and how irresponsible they are with it.

I'm so horrified that they are going to take away the ability to ask medical questions when the AMA comes knocking at their door.


I don't like ClosedAI either, but this seems like the first tech I've played with in a long time that was great on day 1 and seems to get progressively less great over time.


Amen, local first should be the default for anything that sucks all my data.

Although until these things can do my laundry none of them deserve any of my compute time either.


txtai makes it easy to use Hugging Face embeddings and Faiss, all local and configurable. https://github.com/neuml/txtai
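
A minimal local index over already-extracted text looks roughly like this (the model choice is just one option):

    from txtai.embeddings import Embeddings

    pages = ["text extracted from page 1", "text extracted from page 2"]  # illustrative
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
    embeddings.index([(i, text, None) for i, text in enumerate(pages)])
    print(embeddings.search("what does page 2 say?", 3))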

paperai is a sub-project focused on processing medical/scientific papers. https://github.com/neuml/paperai

Disclaimer: I am the author of both


Honestly, for me, getting good quality results matters way more than keeping my searches private. And for that, nothing compares with GPT4.


Have you seen PrivateGPT. It's quite good and free.


It's not nearly usable. It's functional in that it spits out a response. Can that response be used for anything useful? No.


what hardware do you need for that?


Consumer-grade, AFAIK it's GPT4All with LLaMA.


That doesn't really answer the question.

Neither does the github "System requirements" section, which I find disappointing. Ideally, it should give minimum memory requirements and rudimentary performance benchmark table for a sample data set, across a handful of setups (eg, Intel CPU, Apple M1, AMD CPU, with/without a bunch of common GPUs). With that information I would know whether or not it's worth my time even trying it out on my laptop.

Edit: lol, I went through 2 pages of the issues on the github page and most of them could be avoided by putting this basic information into the system requirements:

https://github.com/imartinez/privateGPT/issues/174 https://github.com/imartinez/privateGPT/issues/179 https://github.com/imartinez/privateGPT/issues/104 https://github.com/imartinez/privateGPT/issues/141 https://github.com/imartinez/privateGPT/issues/282 https://github.com/imartinez/privateGPT/issues/316 https://github.com/imartinez/privateGPT/issues/333

... And lots more! Some of these people have 128GB of memory and 32 cores, and still find it "very slow". Others are having memory pool errors. Some of the answers hand-wave at needing "a more better computer".

I reckon a lot of these issues could be closed and linked to a single ticket for proper hardware requirements in the readme.


Link?



It runs (slowly) on my 6 year old i5 laptop.


> Stop doing that.

This commanding attitude on HN seems to be getting worse lately. Not a fan.


#Chaxor: I fully agree with this. I am not keen on this being a one-horse race, and for privacy reasons I would like to deploy these models locally. However, it seems for many programmers it is somewhat easy to build something that can query OpenAI so they can put it on their resume.

Do you know of any FAISS / open-source / one-click-install Windows app I can use to search my PDFs via vectors? I can see Secondbrain.sh will have the function in the future, but currently it does not.

I have around 500 documents I want to be able to search in.


When I was creating this tool, I made sure to abstract out the reliance on just one LLM or vector db. Instead, I focused on using langchain/huggingface for tokenization, embedding, and conversational modeling. This was done purposely so that it would be simple to replace the OpenAI dependencies with any other LLMs if needed.


Even GPT-3.5-turbo-16K isn't good enough for most retrieval augmented generation tasks.

Locally run LLMs are far worse.

I don't like it either, but for now, if you want good RAG, you have to use GPT-4


Gpt4all does exactly that. You can choose between local model or bring your own openai token.


Does it provide a uniform interface that includes: retries, caching, streaming?


Keep your data private and don't leak it to third parties. Use something like privateGPT (32k stars). Not your keys, not your data.

"Interact privately with your documents using the power of GPT, 100% privately, no data leaks"[0]

[0] https://github.com/imartinez/privateGPT


It’s significantly worse than OpenAI's offerings, and I’m tired of people pretending as though these models are totally interchangeable yet. They are not.


Is this robust enough to feed all your emails and chat logs into it and have convos with it? Will it be able to extract context to answer questions about recent logs, etc.?


In theory, yes.

I've not got it to work yet, though; it ends up hallucinating answers to all the questions about documents I feed it.


How does this run on an Intel Mac? I have a 6-core i9. Haven't been able to get an M-series yet, so I'm wondering if it would be more worth it to run it in a cloud computing environment with a GPU.


> Mac Running Intel: When running a Mac with Intel hardware (not M1), you may run into "clang: error: the clang compiler does not support '-march=native'" during pip install.

If so, set your ARCHFLAGS during pip install, e.g.:

    ARCHFLAGS="-arch x86_64" pip3 install -r requirements.txt

https://github.com/imartinez/privateGPT#mac-running-intel


I’m curious about the response times, though; I imagine they will be quite slow on an Intel Mac.


Having something that could be used with Confluence would be so nice. Having documentation written and just asking questions about it.


[flagged]


Oh, not this again. No, it's not a net negative in ALL forms. And if you really were concerned about downsides, AI has a ton more potential downsides than Web3 ever did, including human extinction, as many of its top proponents have come out and publicly said. Nothing even remotely close to that is the case for Web3 at all:

https://www.nytimes.com/2023/05/30/technology/ai-threat-warn...

https://www.theguardian.com/technology/2023/may/10/ai-poses-...

https://www.theguardian.com/technology/2023/may/30/risk-of-e...

https://www.theverge.com/2023/5/30/23742005/ai-risk-warning-...


" many of its top proponents have come out and publicly said." You don't have to uncritically accept that, it's far more likely that they're just self aggrandizing in a "wow I'm so smart my inventions can destroy the world, better give me more money and write articles about me to make sure that doesn't happen".


I see, so when it comes to the top experts in AI screaming “we made a mistake, please be careful” we should use nuance and actually conclude the opposite — that we should press ahead and they’re wrong.

But with Web3, we should just listen to a bunch of random no-name haters say “there are NO GOOD APPLICATIONS, trust us, period, stop talking about it”, use no nuance or critical thinking of our own, and simply stop building on Web3.

Do you happen to see the extreme double standard here you’re employing, while trying to get people to see things your way?


The crypto group had a lot of time and even more money to make a compelling product that took off and so far they've failed. We've watched fraud after fraud as they've shown themselves to just be ignorant and arrogant ideologues who don't understand how the "modern" finance system came to be, what the average user wants out of financial or social products, or just outright scammers. We can keep sinking money into a bottomless pit or we can move on and learn from their mistakes.

I didn't say to dismiss any concerns out of hand, but the whole idea of "x-risk" or "human extinction" from ai is laughable and isn't taken seriously by most people. Again if you think critically about the whole idea of "human extinction" from any of the technology being talked about you should see it as nonsense.


The AI crowd has been working for multiple decades and only now has made progress that people care about. Also, I’m pretty sure the “eye-watering amounts of money” Sam Altman referred to exceed what the developers of even the biggest crypto projects had when they built, e.g., Bitcoin or Ethereum. That’s what it took to make AI turn heads. Until AlphaGo you could also yell that AI has no real applications.

The personal computer crowd had hobbyists like Wozniak coming to meetups for decades while computers were the province of nerds; now everyone is addicted to them. That took decades.

You are like a person yelling at the video game industry: “pong and space invaders are a stupid waste of time with ugly graphics!! Don’t play or make video games!!” Until a few decades later we have Halo, Call of Duty etc.


I dunno, on the crypto side stablecoins are pretty compelling for hassle-free cross-border transfers—there’s $125bn in circulation, which to me means it’s taken off.

On the AI side, I mean for example it’s not laughable to think anybody on the planet could just feed a bunch of synthetic biology papers to a model and start designing bioweapons. It’s not hard to get your hands on secondhand lab equipment…


100% private? Hmm. With the amount of paranoia that the folks in power have about local LLMs, I wouldn’t be the slightest bit surprised if Windows telemetry were reporting back what people are doing with them. And anyone who thinks otherwise is, in my view, just absolutely naive beyond hope.


Don't have so much pride in yourself. Nobody actually cares what you're doing. Well, China might.

And this is probably illegal in several countries besides that since queries might have medical information or other protected data.


Is it going to send my personal data to OpenAI? Isn't that a serious problem? Does not sound like a wise thing to do, not at least without redacting all sensitive personal data from the data. Am I missing something?


By default, data sent to the OpenAI API is never used for training and is deleted after a maximum of 30 days (mostly).

Data usage policies: https://openai.com/policies/api-data-usage-policies

Data usage policies by model: https://platform.openai.com/docs/models/how-we-use-your-data


A few weeks ago GitHub made a strong statement about code in repos not being viewed by humans, that was very liberating.

If OpenAI could offer similar privacy statements it would immediately be much more useful. E.g. if they simply add a 'private' option, I'd pay double or triple for it.

OpenAI's tools are incredibly good and so easy to use, it's just that I simply cannot use them for most the things I want to do with them because of the privacy considerations, and that sucks.

I suspect OpenAI value the insights they get from looking at the data more than they do the extra revenue they'd receive if they could ensure privacy.


This is my question as well. Is there a more nuanced way to tell how personal data is used other than confirming that an OpenAI key is or is not needed?


This readme is very confusing. It says we're going to use the GPT-2 tokenizer, and use GPT-2 as an embedding model. But looking at the code, it seems to use the default LangChain OpenAIEmbeddings and OpenAI LLM. Aren't those text-embedding-ada-002 and text-davinci-003, respectively?

I don't understand how GPT-2 enters into this at all.


The embedding model used is the default OpenAI API embedding which is text-embedding-ada-002. GPT2 is only used during the tokenization process to efficiently calculate token lengths.


Is there a company that makes a hosted version of something like this? I quite want a little AI that I can feed all my data to to ask questions to.


https://libraria.dev/ offers this and more as a service. It has added conveniences like integration with your google drive, youtube videos, and such


If you subscribe to ChatGPT plus, you can use ChatWithPDF (https://plugins.sdan.io) which has 50k+ daily active users!




Depending on the size of your data, chiseleditor.com is a free option.


I don't get it, GPT-2 is (one of the few) open models from OpenAI, you can just run it locally, why would you use their API for this? https://github.com/openai/gpt-2


It's not using GPT-2 - the README is incorrect.

It's using "from langchain.embeddings import OpenAIEmbeddings" - which is the OpenAI embeddings API, text-embedding-ada-002

The only aspect of GPT-2 this is using is GPT2TokenizerFast.from_pretrained("gpt2") - which it uses as the length function to count tokens for the RecursiveCharacterTextSplitter() langchain utility.

Which doesn't really make sense - why use the GPT-2 tokenizer for that? May as well just count characters or even count words based on .split(), it's not particularly important how the counting works here.
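
For reference, the pattern being described is roughly this (a sketch, not the repo's exact code; chunk sizes are made up):

    from transformers import GPT2TokenizerFast
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    pdf_text = "...text already extracted from the PDF..."  # illustrative
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=24,
        length_function=lambda text: len(tokenizer.encode(text)),  # token count instead of len()
    )
    chunks = splitter.split_text(pdf_text)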


The embedding model used is the default OpenAI API embedding, text-embedding-ada-002. GPT2 is only used during the tokenization process to calculate token lengths efficiently. I have updated my readme to reflect this information correctly.


I am assuming GPT 4 will provide better answers to your queries compared to GPT 2.


Am I the only one who doesn't need to search across my data? What are the use cases here?


Example use case:

We have a group at work that meets and discusses various investment topics. The guy organizing it is fairly well connected and every week he tries to get an external speaker to come and present. Very educational.

I have raw notes for each of these presentations. My goal has always been to go through those notes, and properly organize the knowledge in there into a wiki of sorts. It's been 3 years since this all started and I still haven't found the time to do it. If I want to be realistic, I should accept that it'll never happen.

How do I go about finding information that I have in those notes? I could use text search but it's too sensitive to my search string - I'll often fail to find what I need. Also, the information may be scattered across several files, and I'd have to open all the hits and scan to find what I need.

With technology like this, I can put all my notes into some vector DB, and then use AI to ask in plain English what I need. Locally the system interprets my query and finds the most relevant documents in the DB. It then sends my query and those hits to OpenAI to interpret my question, and find the answer amongst my notes. A while ago I used Langchain to set it up and I got it working as a proof of concept. An Aha moment was when I asked it something and it gave me a response with information that was scattered over two different presentations. My challenge is that there are so many parameters I could play with, and I haven't yet thought of a way/metric to assess the performance of my system (any pointers would be appreciated!)

There's nothing personal in these notes, so no privacy concerns. I did want to set a similar thing up with over 20 years of emails, but didn't due to privacy. Also, I use a mail indexer (notmuch) which is fairly good so the need to use AI is not as strong.

But for other (non-personal) notes? If I can get this system working fairly well, it'd be a life saver. I've made so many notes on so many topics over the years, and it's worth real money not to have to organize it well. Just let me write my notes, and use an AI to retrieve what I need.


You're creating additional hardship for yourself. Why create a PDF only to convert it out of PDF again? Just insert all your notes into the LLM model.


Because that requires retraining the model every time you take new notes. And this way you also still have the raw notes as similarity matches from the vector db, rather than them "disappearing" into the LLM model.


I see thanks for the insight.


You are probably replying to the wrong thread - I didn't say anything about a pdf.


Apologies


Sometimes I have the data, but I'm not sure where it is.

Sometimes I know where the data is, but there's a lot of it and all I'm looking for is a quick explanation of something.

Sometimes I have a lot of data from a lot of sources, but what I want in the end is a summary based on what most/all of them agree on, or possibly a summary of how they differ.

There's a lot of use-cases here, many of which I think people don't get a "lightbulb moment" about their usefulness until they've dug in and seen what is possible, because we are so used to how we approach these tasks normally.

But the range of uses is quite broad. A project I'm working on for myself is a variation of this, where I've ingested years and years of my own notes and journals, and make queries for the purposes of my own introspection and personal growth. (I think there's a lot of potential in this arena in general.)


Anyone know how milvus, quickwit, pinecone compares?

I've been thinking about seeing if there are consulting opportunities for local businesses around LLMs, fine-tuning/vector search, and chat bots. Also making tools to make it easier to drag and drop files and get personalized inference. Recently I saw this one pop into my LinkedIn feed: https://gpt-trainer.com/ . There have been a few others for documents I've found:

https://www.explainpaper.com/

https://www.konjer.xyz/

Nope nope, wouldn't want to compete with that on pricing. Local open source LLMs on a 3090 would also be a cool service, but wouldn't have any scalability.

Are there any other finetuning or vector search context startups you've seen?


Pinecone and Milvus would be alternatives to their use of FAISS for the vector store and search component. I think more of the difference would come from what’s used for creating the embeddings (eg the ones here https://news.ycombinator.com/item?id=36649579 instead of the OpenAI embeddings API they used), rather than from the embedding store/search alternatives, where I can’t think of what the difference would be other than maybe performance at large scale, cost, and personal preference / developer experience.

Hadn’t heard of Quickwit but from a quick glance at their site it doesn’t look like a vector store, seems perhaps unrelated.

For tools for making custom ChatGPTs see my list: https://llm-utils.org/List+of+tools+for+making+a+%22ChatGPT+...

Fine tuning as a service there’s Lamini AI, aimed at enterprises.

Other embeddings startups there’s Weaviate.


I am working on a simple vector db just with numpy: https://github.com/sdan/vlite

I think milvus, quickwit, and pinecone are geared more towards enterprise and are hard to use.


qdrant is better in my opinion


Why have the OpenAI dependency when there's local embeddings models that would be both faster and more accurate?


Which ones?


all-MiniLM-L6-v2 from SentenceTransformers is the most popular one as it balances speed and quality well: https://www.sbert.net/docs/pretrained_models.html
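
Minimal sketch of generating embeddings with it locally:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(["first chunk of a pdf", "second chunk"], normalize_embeddings=True)
    print(vectors.shape)  # (2, 384) - 384-dimensional embeddings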


I’m working for a company that works as a security layer between any sensitive enterprise data and the LLMs. Regardless of the model (HF, ChatGPT, Bard), and regardless of the medium - conversational data, pdf, knowledge bases like Notion etc. It hides the sensitive data, preventing risky use and fact checking at the same time. Happy to make an intro if that’s what you’re looking for! tothepoint.tech


Also what does this do that llamaindex doesn't?


gpt4all has this truly locally. I recommend those with a decent GPU to give it a go.


I assume this is the link: https://github.com/nomic-ai/gpt4all ?


Don't build a personal ChatGPT, and don't let OpenAI, Microsoft and their business partners (and probably the US government) have a bunch of your personal and private information.


By default, data sent to the OpenAI API is never used for training and is deleted after a maximum of 30 days (mostly).

Data usage policies: https://openai.com/policies/api-data-usage-policies

Data usage policies by model: https://platform.openai.com/docs/models/how-we-use-your-data


You may want to read about National Security Letters:

https://www.eff.org/issues/national-security-letters/faq


So, they don’t promise they won’t look at it - just that they won’t use it for training.


So avoid all Microsoft products too?


Does Microsoft have an AI opt out?

AWS has an AI opt out at the organizational level that prevents them from using your data to “improve” their other services.

(I personally recommend everyone opt out now in AWS if you haven’t already…)

https://docs.aws.amazon.com/organizations/latest/userguide/o...


Please provide this reference in your readme / blog as it is the original source for your work... and provides the background for the tradeoff between the 2 approaches: 1) fine-tuning vs 2) Search-ask

https://github.com/openai/openai-cookbook/blob/main/examples...


I respect OpenAI for creating a comprehensive cookbook, and my tooling uses OpenAI for embeddings and chat completion, which I have mentioned in the Readme. However, it was not built from a single reference or code example; rather, it is a combination of ideas from Hugging Face, OpenAI, and LangChain documentation.


The author has a demo of this here: https://www.swamisivananda.ai/


I appreciate you finding and sharing this demo. I have also written a blog post on the vision of building a personal ChatGPT here: https://devden.raghavan.studio/p/chatgpt-using-your-own-data



