I used GPT to build a search tool for my second brain note-taking system (reasonabledeviations.com)
291 points by abbabon on Feb 6, 2023 | 98 comments



These "augmented intelligence" applications are so exciting to me. I'm not as interested in autonomous artificial intelligence. Computers are tools to make my life easier, not meant to lead their own lives!

There's a big up-front cost of building a notes database for this application, but it illustrates the point nicely: encode a bunch of data ("memories"), and use an AI like GPT to retrieve information ("remembering"). It's not a fundamentally different process from what we do already, but it replaces the need for me to spend time on an automatable task.

I'm excited to see what humans spend our time doing once we've offloaded the boring dirty work to AIs.


In chess we had a tiny window of a few years when humans could use the help of computers to play the world's best chess. By the mid-2000s, computers were far better than humans, and the gap has only widened. Chess players are now entertainers, which is what all of us humans are destined to spend our time doing.


The story of what happened with chess deserves a lot more elaboration, because it's fun and interesting (and may also foretell outcomes in other scenarios)!

In chess, the first "new" (unplayed in a high-level game) move in a game is called the novelty, or theoretical novelty. In the days before computers this would not infrequently be an objectively strong move that simply had not been played in a given position before. And this continued for some time after computers became quite strong, with players using computers to find interesting, strong ideas in all sorts of positions. Each time these sorts of novelties were sprung, positions would become redefined and our broader knowledge of the game continued to stretch outward.

But then something fun happened - the metagame shifted. Now it's no longer really about finding some really strong move as your novelty, but often about finding a technically mediocre, if not simply bad, move that gives you good practical chances. So you're looking for moves that your opponent probably has not considered because they look bad (and the computer would agree that they're bad) but that you're much more prepared for and comfortable in than he is.

The big difference now also is that instead of a novelty redefining a position in a positive way, it's often something you spring once or maybe twice - and then never touch again. And this is now happening regularly at the absolute highest levels of chess. So rather than having humans just desperately trying to emulate machines, those machines became yet another tool to exploit and improve our practical results with.

It's kind of funny watching a game when this happens: less experienced players will immediately begin shouting "BLUNDER!" when the computer evaluation of a position suddenly drops, without realizing the player who just "blundered" is still well within his preparation. But the other guy is now probably out of his. Even from the players themselves, there's often a sort of "u srs?" type response. This [1] is a fun one from the always emotive Ian Nepomniachtchi during the most recent world championship candidates event. He is now playing for the world championship. In any case, it's at that point that the game begins!

[1] - https://youtu.be/AwgIksw1go0?t=30


While reductive, isn't that true in many professional sports though? I have a wide variety of tools I can use to travel 100m faster than Usain Bolt, but it's incredible to watch him do it on his own.


I agree that professional athletes are entertainers, but I’m not sure what you’re getting at with that point.


The way I read this, in a world with machines that can travel at high speeds, people still watch professional runners because they are interesting to watch, and we’re inspired by human achievement.

For similar reasons, it doesn’t really matter that computers are better at chess than us.


The cool thing, I think, is that we watch both: running and car racing.


Also, I can drive my car to get places quickly, but sometimes I enjoy cycling or walking instead because it allows me to take in my surroundings more fully at the slower pace, and the exercise makes me feel good.


Yes but humans are required (as runners or drivers) to make it interesting to watch. Humans are human-centered by nature (of course)


Well, isn't that true for most skills which have been supplanted by technology? In fact, if you look at the handcrafting business, hand-made is now a selling point. Just as some people prefer a product to be handmade, others will prefer watching a game of chess between two humans. Just because machines are better at something doesn't mean that humans have become useless. It's just a question of point of view, IOW, of how depressing you want the world to look.


It’s interesting that this point is always made from a competitive context, that is to say, from a survival point of view. I mean, nobody really wants AI just because it could be fun. We as a species are really inept at moving beyond our survival idioms, I feel.


> like all us humans are destined to spend our time doing

Automation has been here for quite a long time now; if it took people out of the work pool for them to become entertainers, we'd know about it.

It's always the same issue, in fact: replacing workers with machines is good, but if your goal is still to have a "full employment" society, you have to make them work somewhere else. It's not even a new concept, but we seem to rediscover it every now and then.

> Automation, the most advanced sector of modern industry as well as the model which perfectly sums up its practice, drives the commodity world toward the following contradiction: the technical equipment which objectively eliminates labor must at the same time preserve labor as a commodity and as the only source of the commodity. If the social labor (time) engaged by the society is not to diminish because of automation (or any other less extreme form of increasing the productivity of labor), then new jobs have to be created. Services, the tertiary sector, swell the ranks of the army of distribution and are a eulogy to the current commodities; the additional forces which are mobilized just happen to be suitable for the organization of redundant labor required by the artificial needs for such commodities.

Guy Debord, 1967


If you think the people who own the machines will be happy to support everyone else just sitting around "entertaining" themselves, you're in for a rude shock.


In case folks are interested in trying it out, I just released the Obsidian plugin[1] for Khoj (https://github.com/debanjum/khoj#readme) last week.

It creates a natural language search assistant for your second brain. Search is incremental and fast. Your notes stay local to your machine.

There's also a (beta) chat API that allows you to chat with your notes[2]. But that uses GPT, so notes are shared with OpenAI if you decide to try that.

It is not ready for prime time yet, but it may be something to check out for folks who are willing to be beta testers. See the announcement on Reddit for more details[3].

Edit: Forgot to add that Khoj works with Emacs and Org-mode as well[4]

[1]: https://obsidian.md/plugins?id=khoj

[2]: https://github.com/debanjum/khoj#chat-with-notes

[3]: https://www.reddit.com/r/ObsidianMD/comments/10thrpl/khoj_an...

[4]: https://github.com/debanjum/khoj/tree/master/src/interface/e...


This looks great! I definitely plan on checking it out.


One of my biggest dreams is a self-hosted AI that always listens through my phone and automatically takes notes, puts events in my calendar, sets reminders, and templates journal entries. A true personal assistant to keep my increasingly-complex life in order.

I'd love a system where I can just point a search engine at my brain. I tried really hard for a while, but I just didn't have the discipline or memory to exhaustively document everything.

An AI that can do this kind of thing in the background would be an absolute godsend for ADHD and ASD people.


I feel like this was the promise of voice assistants like Siri many years ago; it turns out they still have issues turning lights on and off reliably.


I was thinking the other day that Amazon and others are probably going to either use ChatGPT or their own version to replace the core of many of their voice assistants. Siri is useless.

Alexa used to be better, but only yesterday I asked it to flip a coin.

"I've added flip a coin to your basket"

???

I asked again, and she actually flipped a coin.


I'll believe an AI model can update my calendar accurately only after it can correctly tell if I just said 15 or 50.


Just say "Meet Jane tomorrow at ten to four PM" and you're gold :)


Meeting with Jane scheduled from 10am to 4pm tomorrow ;)


"Here's what I could find about Mahjong Tournaments 1024"


The issue with modern voice "assistants" is that it's impossible to monetize them and still keep them useful. Something like this will probably never get built by a capitalist corporation because there's no way to turn a profit.

That's why Siri and Google are driving their assistants into the ground with "by the way..."


I don't think this makes sense; it's a feature that makes their platform more valuable even if it's not directly monetized. There are many features like that.

If someone were to nail voice assistants, that would be a big selling point for their platform, and it's a good way to get people into their paid services too.

Not really sure what you are referring to with driving the assistants into the ground; from what I can tell, it's just not very good tech ("one moment", "working on it", "there was a problem answering this query", ...) and not related to any monetization strategy.


You'd think they could probably support this with the ad revenue they're getting...


MS is very close to this. They have ChatGPT listening to meetings and taking notes, then adding tasks to attendee calendars...


Would be useful if it could insert notes like “no agenda planned” and “this meeting could have been an email”


Tactiq does it already (you can sign up for the beta)


Link? Because I’m gonna need this.


They lightly cover it in this blog post: https://www.microsoft.com/en-us/microsoft-365/blog/2023/02/0...

But I've seen more detailed capabilities demoed... I can't remember if it was under NDA though. I can't seem to find it with a quick search.



Rewind is trying to do this: https://www.rewind.ai/

(I am not affiliated.)


Wasn't this a Black Mirror episode?


And reports you to police if you happen to do something illegal?


What do you think you're adding to this conversation?


It’s a genuine concern. If you are constantly recording your life, and sharing it with a third party, you’re building an incredible paper trail of every minor infraction.

There’s a whole bunch of “crimes” that society just kind of ignores at scale, e.g. underage drinking in college. People knowingly and willingly speed when it’s against the law. Imagine if your smart car automatically recorded and stored its speed and GPS coordinates at all times. It’d be so easy for the government to automatically subscribe to that data and start sending automated tickets… never mind all the worse things that could happen.

This data can be manipulated and abused by stalkers and hackers, abusive partners controlling their wives or kids, churches trying to guilt you into behaving differently, etc.


> If you are constantly recording your life, and sharing it with a third party

Which is why the very first sentence of my post includes the phrase "self-hosted"


Slight overkill to use GPT, though it works for the author, and I can see that it’s the low-hanging fruit, being available as an API. But this can also be done locally, using SBERT, or even (faster, though less powerful) fastText.

Also, it’s helpful not to cut paragraphs into separate pieces, but rather to use a sliding window approach, where each paragraph retains the context of what came before, and/or the breadcrumbs of its parent headlines.
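
To make the local route concrete, here is a minimal sketch of both ideas (sliding-window chunks plus SBERT search) using the sentence-transformers library. The model name, window sizes, and sample text are illustrative assumptions, not taken from the article:

    from sentence_transformers import SentenceTransformer, util

    def sliding_chunks(paragraphs, window=2, stride=1):
        # Each chunk keeps the preceding paragraph(s) as context.
        return [
            " ".join(paragraphs[i : i + window])
            for i in range(0, max(len(paragraphs) - window + 1, 1), stride)
        ]

    paragraphs = [
        "Distributed systems fail in surprising ways.",
        "Network partitions are a common failure mode.",
        "Retries without backoff can amplify outages.",
    ]
    chunks = sliding_chunks(paragraphs)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)  # embed once, then cache

    query_embedding = model.encode("how do systems break?", convert_to_tensor=True)
    for hit in util.semantic_search(query_embedding, chunk_embeddings, top_k=3)[0]:
        print(f"{hit['score']:.3f}  {chunks[hit['corpus_id']]}")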


I just put my notes into Elasticsearch, use its text analysis features, and get a lot of the same functionality.


This is interesting. How effective was this?


Elasticsearch's text analysis isn't semantic, but it does stemming and a few other transformations. Combined with the ranking of results by match, I can type in a couple of words or a short phrase and get back a reasonable set of notes. That, combined with the tags I used and linking, means I can typically find the most relevant notes pretty quickly. It might be nice to have some ability for it to match synonyms of the search terms given, e.g. 'movie' and 'cinema'. But I worry about overzealous collapsing of synonyms; for example, 'film' can mean 'movie', but mostly they are different concepts.
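
For anyone curious, the setup is roughly like the following sketch (written with the official Elasticsearch Python client; the index name, mapping, and use of the built-in english analyzer are illustrative stand-ins, not my exact configuration):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The built-in "english" analyzer handles lowercasing, stopwords, and stemming.
    es.indices.create(
        index="notes",
        settings={"analysis": {"analyzer": {"default": {"type": "english"}}}},
        mappings={"properties": {"body": {"type": "text"}, "tags": {"type": "keyword"}}},
    )

    es.index(index="notes", document={"body": "Watched a great film about chess", "tags": ["movies"]})
    es.indices.refresh(index="notes")

    # "films" stems to the same token as "film", so this matches despite the different form.
    resp = es.search(index="notes", query={"match": {"body": "films"}})
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["body"])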


When using SBERT instead of GPT for this use case, is it paired with some sort of vector database, or is it all done in code/memory?


You’d want persistence, since the embedding process takes some time. But you don’t need to go all Pinecone on this. There is FAISS, and there is hnswlib, for example. Like SQLite for vector search.
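
A rough sketch of the hnswlib route (the dimension, file name, and random stand-in vectors are all assumptions; in practice the vectors would come from SBERT):

    import hnswlib
    import numpy as np

    dim = 384  # e.g. the output size of a small SBERT model
    embeddings = np.random.rand(1000, dim).astype(np.float32)  # stand-ins for note embeddings

    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=10_000, ef_construction=200, M=16)
    index.add_items(embeddings, np.arange(len(embeddings)))
    index.save_index("notes.hnsw")  # persisted to a single file, like SQLite

    # Later: reload and query without re-embedding the whole corpus.
    index2 = hnswlib.Index(space="cosine", dim=dim)
    index2.load_index("notes.hnsw", max_elements=10_000)
    labels, distances = index2.knn_query(embeddings[:1], k=5)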


Friendly reminder that we (Pinecone) have a free tier that holds up to ~5M SBERT embeddings (x768 dimensions). For quick projects, going "all Pinecone on this" could turn out to be the easier and faster option.


Point taken ;-)

I like to stand up for the little guy. I hear Pinecone this and Pinecone that. And nobody seems to pay any attention to the awesome dude who made hnswlib.


Who, Yury Malkov? He won’t be offended… He’s an advisor to Pinecone. :)

And yes, both he and HNSW are awesome.


What about Postgres with pgvector?


I wonder whether your individually trained chat bot will be allowed to assert the Fifth Amendment right against self-incrimination to stop it from talking when the police interview it. And if it's allowed, do you or it decide whether to assert it? What if the two of you disagree?

Similar questions for civil trials, divorce proceedings, child custody....


Think about what kinds of data an AI assistant would store about you vs what your phone stores.

Your phone is already more or less an extension of your brain, and whether or not you can be forced to unlock and surrender it for inspection is already a contentious topic.

IANAL, but phone privacy would probably set the precedent for AI assistant privacy.

Always keep your phone encrypted and be aware what your local laws are. Some places will force you to provide biometric authentication, but not provide a PIN or password. Check if your phone has a duress lockdown mode: some phones lock and/or wipe if you press the power button five times or something like that.


All this advice isn't worth much when people will install any app that'll give them a dollar discount at their favorite shop or restaurant.


This is a topic which really deserves a lot more attention, as in: from magazines to newspapers to talk shows. Seems like an appropriate time to get it on the agenda before governments opt to decide on their own.


How could it have a right not to self-incriminate when it can't be tried for a crime? An AI can't be indicted or convicted.

Humans can be required to testify too if they're immunized.


Apologies for the ambiguity. I imagine that my future AI-based "second brain" will be derived from my own brain, including its personality, memories, and preferences. Anyone in an adversarial position to me would be very interested in talking to it. The question I'm posing is whether the term "self-incrimination" should include one's second brain as part of one's self. The question was not whether police would put a PC in jail.


With limitations. I cannot be compelled to testify against my wife (and probably not against my kids, though I’m unsure of that [edit: that seems to vary by state currently]), even if I personally am granted immunity.


Wait is this a serious concern or a joke? I think asking a chat bot whether it’s admissible is the same as asking someone’s handwritten diary if it’s admissible.


Fortunately, US courts take these sorts of questions quite seriously. See, e.g., State v. Smith (Ohio 2009): "Even the more basic models of modern cell phones are capable of storing a wealth of digitized information wholly unlike any physical object found within a closed container." https://en.wikipedia.org/wiki/Carpenter_v._United_States is also a good read, as is https://en.wikipedia.org/wiki/Riley_v._California: "Modern cell phones are not just another technological convenience. With all they contain and all they may reveal, they hold for many Americans “the privacies of life". The fact that technology now allows an individual to carry such information in his hand does not make the information any less worthy of the protection for which the Founders fought."

If a cell phone is recognized as having a higher expectation of privacy than a mere passive document, then it stands to reason that courts will also recognize one for a personalized machine that is even more eager to help than my infernal "by the way..." Alexa home device.


Those are all Fourth Amendment cases about warrantless searches violating privacy.

The Fifth Amendment has very different criteria: with a proper warrant, your most private things are admissible evidence. You can't be compelled to testify, but all your most private notes in a safe can be "interrogated"; in many situations you can't be compelled to testify against your spouse, but any writings or recordings of what you said about him/her are valid evidence. There is no debate that all the contents of your computer or phone can be used; they definitely can. The cases you quote were disputed only over whether those contents are "worthy of the protection for which the Founders fought", which is the requirement for a warrant.

I'd say that the Fifth Amendment is not a privacy law (that's the Fourth, stating that you have all the privacy without a warrant and no privacy with one); the Fifth is essentially an "anti-torture" law to prevent coerced confessions, and there is no reason why it would apply to a piece of physical evidence like the data for a trained model on your device.


Thanks for the analysis. Something still sticks in my craw.

You can read a letter. You can play an audio recording. You can look at a photograph. You can print a word-processing document. Those feel like static records. Each tells a single story, no matter how many times it's read.

An ML model, on the other hand, is just a long sequence of floating-point numbers. There is no meaningful model "viewer" that spits out a text file or JPEG. The only thing a non-engineer human can do with an ML model is interact with it when it's running. Thus, any output is the product of both the human and the model. It doesn't feel like a record. It feels more like a performance -- and a performance by the interrogator, at that.

If an ML-model "record" is a special kind of record that needs manipulation to produce human-readable output, then it feels like we're back in the 5th amendment department. No, chatting with a chatbot is not torture. But it produces a kind of evidence that is an unwelcome collaboration between the interrogator and the record (unwelcome by the owner of the record). It's one thing to let a jury look at a screen full of numbers. It's another thing to say that an expert witness used those numbers in a chat session that printed "Yes, it was my owner, in the living room with the candlestick."

If you lock up a person and ask them the same question over and over again, eventually you'll get the answer you want. The same will probably turn out to be true for that person's chatbot. Should society allow that second case?


Imagine a model that decides on its own to assert the Fifth on your behalf. But now imagine that AI decides to lock you out of your own system...


We already have those. They're called Google, Microsoft and Apple.


Would be nice to see some indication of how well it works in his case.

I worked on a ‘Semantic Search’ product almost 10 years ago that used a neural network to do dimensionality reduction. It had inputs to the scoring function from both the ‘gist vector’ and the residual word vector, which was possible to calculate in that case because the gist vector was derived from the word vector and the transform was reversible.

I’ve seen papers in the literature which come to the same conclusion about what it takes to get good similarity results with older models, as a significant amount of the meaning in a text is in pointy words that might not be captured by the gist vector. Maybe you do better with an LLM, since the vocabulary is huge.


I'd honestly argue that he might not have even needed OpenAI embeddings—any off-the-shelf Huggingface model would've sufficed.

Because of attention mechanisms, we no longer depend so heavily on the existence of those "pointy words," so generally, Transformers-based semantic search works quite well.


I actually tried this last year, before OpenAI released their cheaper embeddings v2 in December. From my experiments, when compared to BERT embeddings (or a recent variation of the model), the OpenAI embeddings are miles ahead when doing similarity search.


Interesting. Nils Reimers (SBERT guy) wrote on Medium that he found them to perform worse than SOTA models. Though that was, I believe, before December.


Most of the practitioners I see attempting this are running text through an embedding and then using cosine similarity or something similar as a metric.
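
For concreteness, that pattern is roughly the following (placeholder vectors standing in for real embeddings):

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = np.array([0.2, 0.9, 0.1])
    docs = {"doc_a": np.array([0.1, 0.8, 0.2]), "doc_b": np.array([0.9, 0.1, 0.0])}

    # Rank documents by similarity of direction to the query vector.
    ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
    print(ranked)  # doc_a first: it points nearly the same way as the query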

Nils has written a lot of papers

https://www.nils-reimers.de/

and I think the Medium post you are talking about is

https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

and that SBERT is a Siamese network over BERT embeddings

https://arxiv.org/abs/1908.10084

which one would expect to do better than cosine similarity if it was trained correctly. I'd imagine the same Siamese network approach he is using would work better than cosine similarity with GPT-3.

There's also the issue of what similarity means for people. I worked on a search engine for patents where the similarity function we wanted was "Document B describes prior art relevant to Patent Application A". Today I am experimenting with a content based recommendation system and face the problem that one news event could spawn 10 stories that appear in my RSS feeds and I'd really like a clustering system that groups these together reliably without false positives.

I'd imagine a system that is great for one of these tasks might be mediocre for the other, in particular I am interested in some kind of data to evaluate success at news clustering.


I was thinking RoBERTa, Longformer, or Big Bird would be a good choice for this, though having any limit on the attention window is a weakness.


I've been thinking of using GPT or similar LLMs to extract flashcards to use with my spaced repetition project (https://github.com/trane-project/trane/). As in, you give it a book and it creates the flashcards for you, along with the dependencies between the lessons.

I played around with ChatGPT and it worked pretty well. I have a lot of other things on my plate to get to first (including starting a math curriculum), but it's definitely an exciting direction.

I think LLMs and AI are not anywhere near actual intelligence (ChatGPT can spout a lot of good-sounding nonsense ATM), but the semantic analysis they can do is by itself very useful.


I've seen a number of projects around using GPT to generate curriculum and also flashcards in the past three months, I think this is one of the most popular ones: https://autolearnify.com

It's a very good idea in theory, but it takes almost as much work to verify that the flashcards and curriculum it generates are accurate and not a hallucinatory nightmare.

The biggest danger is that the target audience are not experts in the desired subject domain, so they have no way of sanity checking the generated curriculum.


When I played with it, I made it output a JSON file so that it would be easier to handle the output. And I specifically gave it the text to use. It did a pretty good job, but I ran into output size limits.
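
Roughly along these lines (a simplified sketch using the current OpenAI Python SDK; the prompt, card schema, and model name are illustrative, not exactly what I used):

    import json
    from openai import OpenAI

    client = OpenAI()
    text = "The mitochondrion is the powerhouse of the cell..."  # the source text to use

    prompt = (
        "Extract flashcards from the text below. Respond with only a JSON array of "
        'objects of the form {"front": ..., "back": ...}.\n\n' + text
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )

    # Parsing is trivial because the model was told to emit JSON only.
    cards = json.loads(resp.choices[0].message.content)
    for card in cards:
        print(card["front"], "->", card["back"])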

I agree that using the training data would probably generate more garbage. But it's the semantic analysis part that I think is useful. In general, I think VCs and OpenAI are overhyping it by calling it "intelligent" and obscuring the very good use cases of the technology. AFAIK, no one involved has explained how a statistical model running on a Turing machine magically develops agency and awareness, which are requirements for actual intelligence (under my definition, at least).


Curious as to how you came to this impression?

I'd say the most popular applications are Knowt in the US for now, and Saveall.ai + Revision.ai (my company) in the UK, all of which had been around with BERT/T5 etc. long before this GPT trend.

The flashcard accuracy varies wildly amongst current solutions, that's for sure.


Every now and again I come across a real gem on here. Love trane and the idea behind it, can't wait to play with it after work.


Thanks. I am working on simplifying how new material is added (for simple cases, editing a single JSON file and running a build command will be enough to build all the exercises). That should make it easier to create more stuff for it, which so far has been my bottleneck.

For now, this is the most general way to create the exercises: https://trane-project.github.io/generated_courses/knowledge_.... The JSON file thingy is just a script that automates creating these files given the specification, so they will be interchangeable.


This is fascinating.

Can I train it on 5 years of stream of consciousness morning brain dumps and then say "write blah as me"?

Before I do that, I'd love to know if training data becomes part of the global knowledge base available to everyone...


This is not a fine-tuning example. It's an embedding search example. You use the embeddings to search for relevant knowledge-base chunks and then include them in the prompt, which goes to the original model, not a model that you have trained further.
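
In code, the pattern looks roughly like this (a simplified sketch with the OpenAI Python SDK; the model names and prompt template are illustrative, and in practice you would precompute and cache the note embeddings):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(text):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
        return np.array(resp.data[0].embedding)

    notes = ["First note chunk...", "Second note chunk..."]
    note_vecs = [embed(n) for n in notes]

    query = "What did I write about failure modes?"
    q = embed(query)
    scores = [float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q))) for v in note_vecs]
    context = notes[int(np.argmax(scores))]  # the most relevant chunk

    # The retrieved chunk is pasted into the prompt of the *unmodified* model.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Using these notes:\n{context}\n\nAnswer: {query}"}],
    )
    print(completion.choices[0].message.content)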

This is popular because it's much, much easier to do effectively than fine-tuning, and the OpenAI model is very capable of integrating KB snippets into a response. What I have heard is that it's easy to overdo fine-tuning with OpenAI's model, and that it makes more sense when you want a different format of response rather than just pulling in some content.

Having said all of that, they do have a fine-tuning endpoint and I am guessing if you find the right parameters and give it a lot of properly formatted training data then it will be able to do an okay job. I have the impression it is not easy to do either of those things quite right though.

As far as privacy goes, no, they will not share your data when you use the API. ChatGPT is different; there, they ARE using the inputs to train the model.


> Having said all of that, they do have a fine-tuning endpoint and I am guessing if you find the right parameters and give it a lot of properly formatted training data then it will be able to do an okay job.

Unfortunately, the fine-tuning API cannot be used to add knowledge to the model. It only helps condition the model to a certain response pattern using the knowledge it already has.


Would you be able to do something similar with non-text data (e.g. tabular data)? For example, could you give it a bunch of Excel files and ask it to give you the total units sold for an e-commerce site?


What I think would make sense would be to do it in two stages, and embeddings probably aren't really what you want. You would want to parse the Excel files into a certain data structure, and quite possibly text completions could help with that.

Put that in a database or some files that you could use Python data science tools with or something.

Then use text completions to translate the natural language query into some short Python program or SQL query etc.
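
A sketch of that second stage (the schema, prompt, and model name are placeholder assumptions):

    import sqlite3
    from openai import OpenAI

    client = OpenAI()
    schema = "CREATE TABLE sales (sku TEXT, units INTEGER, sale_date TEXT);"
    question = "What is the total number of units sold?"

    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Given this SQLite schema:\n{schema}\n"
                       f"Write a single SQL query that answers: {question}\nReply with SQL only.",
        }],
    )
    sql = resp.choices[0].message.content.strip()

    conn = sqlite3.connect("sales.db")  # the parsed Excel data, loaded in stage one
    print(conn.execute(sql).fetchall())  # in practice, validate the SQL before running it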

There are already data-focused tools that use OpenAI's newest models for this. Search for 'ChatGPT/GPT/OpenAI' plus data query, SQL, datatable, etc. Also see the #api-projects channel on the OpenAI Discord; I have seen one or two projects like that there.


Thanks for the response. Unfortunately, the Discord is full :sad:


These privacy considerations are highest-priority for any extended roll-out of LLM-based products.

Privacy on the side of model servers would be good. Open source models that can be run locally would be better.


I personally think anything server-side is unacceptable. Only open source and local will fly.


I have considered training a model on about a year's worth of conversations from my little community's Discord server and asking it to synthesise sentences as if I were writing them.


Did the author show how this system outputs results? I see an example of a lexical search and the technical implementation, but no example of semantic output showing how it's relevant to the lexical search string without containing that string. The author used the literal search string "failure mode" as their example. I was wondering if ChatGPT would bring up results relevant to the lay person's interpretation of failure mode, a technical interpretation, or something in between.


Umm, the only thing that stops me from doing this is uploading my notes to OpenAI's servers.


Exactly. They should be paying you for the training data you've given them, not the other way around.


Does anybody know how search engines apply semantic search with embeddings? To my knowledge, no practical algorithms exist that find exact nearest neighbors in high-dimensional space (such as the spaces word/sentence/document vectors are embedded in), so those wouldn't give you any benefit compared to an iterative similarity search as applied here, which is obviously totally impractical for real search engines. There are approximate nearest neighbor algorithms such as locality-sensitive hashing, but even they seem impractical for real-world usage at the scale of the indexes that search engines use. So how can, e.g., Google make this work?


There actually are practical algorithms for finding approximate nearest neighbors (ANN) at large scales. Some of them are open source like HNSW [1] and Faiss [2], and some are even offered inside managed services like Pinecone.[3]

[1] https://www.pinecone.io/learn/hnsw/

[2] https://www.pinecone.io/learn/faiss/

[3] https://www.pinecone.io/
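
To make it concrete, here is a tiny Faiss example of approximate nearest-neighbor search with an HNSW index (random vectors standing in for real embeddings):

    import faiss
    import numpy as np

    dim = 128
    xb = np.random.rand(10_000, dim).astype(np.float32)  # database vectors
    xq = np.random.rand(5, dim).astype(np.float32)       # query vectors

    index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (HNSW's M parameter)
    index.add(xb)

    # Approximate top-10 neighbors per query, in sublinear time per lookup.
    distances, ids = index.search(xq, 10)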


So if you keep adding notes furiously every day for years, do you asymptotically get your consciousness on a chip?


I’d love to have a ChatGPT that was also trained on all of the pages from my “second brain,” Roam Research.

Imagine, I could ask it questions about myself, my friends, and my business. It would in many ways know me better than me from reading all my journal entries.

How many years are we away from something like this?


Zero. I’ve done this with Google Drive and GPT-3 (thus quite limited in prompt length). The biggest hurdle for your requirements is the nonexistent public API for ChatGPT and Roam Research.


Obsidian is just plain text, and it's only a matter of time before ChatGPT is accessible via an API (there are some libraries that have reverse-engineered the API; I put ChatGPT in VSCode when it first came out). Until then, GPT-3 has an API.


I was tempted to do something like this but couldn't get over the idea of sending a bunch of personal data off to OpenAI (or any third party, really).


Not sure if this is still supported, but Roam used to have an in-app query interface. You could use the JS console to run Datomic-style Datalog queries.


I don't imply any judgment (like good/bad), but I tend to suspect that the major (not necessarily intentional) reason for all these "second brains" (which I use too) to exist, in the grand scheme of things, is to be high-quality input for AIs to learn from.


Obsidian is just offline, static markdown files.

If offline, static markdown exists to be input for AIs to learn from, then OK.

But for me, long before Obsidian, they're just notes, because the act of writing them makes you process them "outbound", which is the same process you need to recall them later.


Wow, this was super interesting as someone using a Second Brain daily. Thank you so much for digging into it, putting in the work, and sharing with us all! Much appreciated. I will follow you for more.

I am very excited to do more with my Second Brain, but one concern, as you point out, is that to use ChatGPT or similar, we'd need to upload all our private and sometimes sensitive notes, which is a no-go for me. So I'm happy that you do everything locally. I wonder what the equivalent would be of training the model to search and answer questions based on our second brain (on top of the information it was already trained on). That's also where Obsidian will win in the long run, as other tools do not keep the data locally. Obviously, if the data is already in the cloud they could train on it, but training on customers' sensitive data would be a big problem. Something I will follow closely.


Desktop search feels like it has stagnated for at least a decade. Yet it's an obvious way to enhance privacy, improve relevance, and even open up entirely new capabilities.


I think letting a language model make an outline of a topic you want to take notes on, and then writing in the details yourself, might not be such a bad thing.


I wanted to do this! :D



