Show HN: I built an open source AI video search engine to learn more about AI (avse.vercel.app)
145 points by yoeven 10 months ago | 49 comments
Hi there! When Supabase announced their recent hackathon, I thought it was a good time to build something to learn more about the many new AI models and techniques out there, from the different ways of embedding documents to the future of RAG.

With the rise of short-form content on TikTok and YouTube, a lot more knowledge is in videos than ever before. Finding specific answers within millions of videos is more than any one person can go through. So the question is: if there is a Google that indexes text on websites, making it easier to find based on the context of your question, why is there no Google that indexes video content, making it easier for users to find answers within them?

So I built this to showcase that it's very much possible with the technology and infrastructure that is readily available.

I've indexed thousands of videos from YouTube and will be adding more. Some of the things coming soon:
- Index TikTok videos
- Using Whisper to transcribe audio in videos that don't have captions
- Auto-scraping both YouTube and TikTok every day to add new content

The tech stack:
- Supabase (PostgreSQL, PG_Vector, Auth)
- Hasura (GraphQL layer, permissions)
- Fly (Hosting of Hasura)
- JigsawStack (Summary AI, Chat AI)
- Vercel (NextJS hosting, Serverless functions)

The code is open source here: https://github.com/yoeven/ai-video-search-engine

Would love some feedback and thoughts on what you would like to see?




The problem with this project is that it doesn't solve the valuable problem, which is ranking/relevance.

It's effectively just a very simple "indexer", if you can even really call it that because it's only chunking audio for semantic search rather than actually indexing it.

Search engines are hard because of ranking and scale. This project does not solve either of those problems.

Personally if I was going to build this I would exclusively focus on the data side and use a pre-built traditional search engine like meilisearch, typesense, elasticsearch, etc to handle the indexing and search side. Adding semantic search into the mix upfront makes your life so much harder.


Yeah, I agree to an extent. Using a traditional search engine would be simpler and easier to implement, but it wouldn't be able to accurately contextualise the actual content of the video based on the user's question, which is the focus of the tool. However, I do agree that there is a lot of space for growth, and adding a traditional form of full text fuzzy search which will help with some of the ranking problems is part of the plans to mix the best of both worlds :)

Ranking is a huge topic by itself, where beyond similarity/text matching, other topics like SEO, popularity, etc. play a huge part, and those are aspects that I'm looking forward to understanding better and seeing how the community can contribute to as well!


> but wouldn't able to accurately contextualise the actual content of the video based on the users question which is the focus on the tool

Are you sure this is what users want? Semantic search is not appropriate for all situations.

> adding a traditional form of full text fuzzy search which will help with some of the ranking problems

Full text fuzzy search is NOT a performant search engine and is NOT related to ranking or relevance. Ranking is an independent process after finding matching results.

Semantic search would make more sense in the context of ranking rather than pure search. E.g. you use traditional search to find matching documents then semantic search on matches to rank them.
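
A minimal sketch of that two-stage idea, assuming plain term overlap for the matching stage and cosine similarity for the ranking stage. The documents and the two-dimensional "embeddings" are made up for illustration; a real setup would use a proper search engine and model-generated vectors.

```python
import math

def keyword_match(query, docs):
    """Stage 1: keep documents that share at least one term with the query."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d["text"].lower().split())]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, query_vec, docs):
    """Stage 2: rank only the stage-1 matches by semantic similarity."""
    candidates = keyword_match(query, docs)
    return sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)

# Hypothetical corpus with hand-written vectors standing in for embeddings.
DOCS = [
    {"id": "a", "text": "heat sink thermal calculation walkthrough", "vec": [0.9, 0.1]},
    {"id": "b", "text": "review of a heat sink for gaming PCs", "vec": [0.2, 0.8]},
    {"id": "c", "text": "mario kart speedrun tips", "vec": [0.1, 0.9]},
]

results = search("heat sink calculation", [1.0, 0.0], DOCS)
print([d["id"] for d in results])  # prints ['a', 'b']
```

Document "c" never reaches the ranking stage because it shares no term with the query, which is the point of matching first and ranking second.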

> Ranking is a huge topic by itself, which beyond similarity/text matching, other topics like SEO, popularity, etc plays a huge part

Based on my mediocre understanding of ranking, basic ranking is generally not about any of these factors. Presumably because they are too slow and computationally intensive. Maybe there are multiple layers of ranking for these different features though.

My understanding is that basic ranking is/was more about metrics like TF-IDF. I’m sure there are more advanced modern techniques, but also likely more complicated.
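
For reference, a rough sketch of one common TF-IDF variant (term frequency normalised by document length, idf = log(N / df)). Whitespace tokenisation and the three sample documents are assumptions for illustration only; production engines use more refined formulas.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores.append(score)
    return scores

# Hypothetical mini-corpus.
docs = [
    "heat sink thermal calculation for a heat sink",
    "heat sink review",
    "java course for beginners",
]
scores = tf_idf_scores("thermal calculation", docs)
print(scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A term that appears in every document would get an idf of zero and contribute nothing.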

Search is a ridiculously big and complex topic. If you want this to be more than a toy project I think it would be wise to focus on a much smaller sliver and have a much clearer value prop.

You are currently trying to tackle multiple big problems simultaneously.


This is great, I've been looking for exactly that! Often times there are "hidden gems" of knowledge in longer videos, for example how some calculations for a specific problem are done when the rest of the video has a different topic. I found it very hard to look for these hidden gems.

My test example for such a solution would be: "does it find the chapter about thermal calculations in a video about electronics": https://youtu.be/8xX2SVcItOA?feature=shared&t=758

How do I add a video? I'd love to test your solution with this video.

Thanks for making the source available too! I've been trying to build something similar with jupyter notebooks, but I keep getting stuck at this "hidden gem" problem.


Hey there! Thanks a lot :) I've gone ahead and indexed the video you shared and you can try searching something like: "how does heat sink thermal calculation work", as it tends to work better with questions.

You could also add more videos by clicking on the hamburger menu on the right of the search bar and click on "Index videos". However, you'll need to do a quick email sign up and you can index as many videos as you like!

Yeah the "hidden gem" problem is a difficult one to solve and I think with the current tech out there, we are getting a step closer with better solutions. The search engine still has a lot of work to be done especially for broader queries & improving accuracy.

Would love to see how the community can build on this!


Thank you for answering and indexing the video so quickly.

I'm blown away by how well this works! I searched for the query just like you said and it really found exactly the relevant section of the video. Very impressive!

You're right, the query seems to be very sensitive to the exact phrasing. Searching for "heat sink thermal calculation" returns nine videos (the wanted one in fifth place at least), while your query seems more targeted. "calculate heat sink power dissipation" works reasonably well too.

So again - wow. This is really great and exactly what I've been trying to build for myself too.

One really useful feature for me would be inclusion/exclusion of channels in the search results. For this example, when I'm specifically looking for calculations, I can probably skip channels like LTT from the search results.
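
Something like this filter is what I have in mind, sketched under the assumption that each result carries a "channel" field. The field name, the function and every channel except LTT are hypothetical, not taken from the project.

```python
def filter_by_channel(results, include=None, exclude=None):
    """Keep results from `include` channels (if given) and drop `exclude` channels."""
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for result in results:
        channel = result["channel"]
        if include and channel not in include:
            continue  # an allow-list was given and this channel is not on it
        if channel in exclude:
            continue  # channel explicitly blocked
        kept.append(result)
    return kept

# Hypothetical search results.
results = [
    {"title": "Heat sink maths", "channel": "ExampleElectronics"},
    {"title": "Cooler roundup", "channel": "LTT"},
]
print(filter_by_channel(results, exclude=["LTT"]))
```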

Thanks for this amazing tool!


It doesn't sound ideal that it's so phrasing dependent - 5th on a list of 9 videos is poor if you're searching without knowing which video you're looking for and without knowing the best search phrase to use - unless those 4 above it actually also answer the same thing?

Is it random luck which phrases work best, or is it in a way that a frequent user of the service could learn the good and bad ways to structure a query (and/or can it be tweaked so that all queries work well)?

(Questions aimed at the world generally, not specifically at the person I'm replying to.)


Technically the other videos do talk about the above query; it's just that not enough videos on the topic have been indexed to produce a better result.

While yes, prompting/querying tends to work better with questions because of how it's built, I've been exploring more ways by mixing full text fuzzy search along with the current method to allow for broader queries as well :)
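
One possible way to mix the two result lists is reciprocal rank fusion (RRF). To be clear, this is an illustrative sketch of a well-known technique, not what the project currently does; the video ids are made up and k=60 is just the commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical ranked results from the two search methods.
semantic = ["v3", "v1", "v2"]
full_text = ["v1", "v4", "v3"]

fused = reciprocal_rank_fusion([semantic, full_text])
print(fused)  # prints ['v1', 'v3', 'v4', 'v2']
```

Videos that both methods agree on ("v1", "v3") rise to the top, while videos found by only one method are still kept further down.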


Had a quick read of your Github page for this project: https://github.com/yoeven/ai-video-search-engine

Are you sure about the below?...How did you test this? My experience is the opposite.

Edit: For example, Youtube will return a video result of a video with a transcript containing the same quoted text that I've searched. Wouldn't that imply that Youtube has indexed the transcript?

>>FAQ

Doesn't youtube do this?

    Not really, Youtube doesn't search the transcribed audio of the video but instead relies on the written content of the uploader such as title, description, tags. While all the audio content goes unindexed.
(I've tried emailing you about this but not sure if you've seen my email.)


Hey! I've just emailed back :) Would be great if you can send a short recording of what you're referring to when searching on Youtube so I can better reply


I hope you have the appropriate limits in place. I once copied the transcript of a YouTube video (a 5-hour podcast) to chat with it in the GPT-4 API playground, and after 3 chats it had used up all my credits ($5).


Yeah that's a scary thing! Hopefully the price cap doesn't get hit.

Also, the way it's built, you can chat with a 5-hour video without breaking the bank because only relevant chunks of that video will be passed as context based on the question.
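
Roughly, the selection step looks like the sketch below. This is a simplified illustration rather than the actual project code: the chunk vectors are hand-written stand-ins for embeddings, and the budget is counted in characters instead of tokens.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_context(question_vec, chunks, max_chars):
    """Pick the most similar chunks that fit the budget, then restore video order."""
    ranked = sorted(chunks, key=lambda c: cosine(question_vec, c["vec"]), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        if used + len(chunk["text"]) > max_chars:
            continue  # does not fit in the remaining budget
        picked.append(chunk)
        used += len(chunk["text"])
    picked.sort(key=lambda c: c["start"])  # chronological order reads better
    return picked

# Hypothetical transcript chunks; "start" is the offset in seconds.
chunks = [
    {"start": 0, "text": "intro and sponsor read", "vec": [0.0, 1.0]},
    {"start": 600, "text": "thermal resistance of the heat sink", "vec": [1.0, 0.1]},
    {"start": 1200, "text": "junction temperature math", "vec": [0.9, 0.3]},
]

context = select_context([1.0, 0.0], chunks, max_chars=70)
print([c["start"] for c in context])  # prints [600, 1200]
```

The cost per question then depends on the budget, not on the length of the video.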


How did you decide the chunking size? I’m working on a similar project and it seems our sweet spot was around 800. But would really love to hear what others are doing here for RAG chunk sizes.
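
For comparison, here is a bare-bones sketch of fixed-size chunking with overlap. It assumes the size is counted in whitespace-separated words (the 800 above could just as well be tokens or characters), and the overlap of 100 is an arbitrary illustrative value.

```python
def chunk_words(text, size=800, overlap=100):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

# Synthetic 2000-word transcript: "w0 w1 ... w1999".
transcript = " ".join(f"w{i}" for i in range(2000))
pieces = chunk_words(transcript)
print(len(pieces), [len(p.split()) for p in pieces])  # prints 3 [800, 800, 600]
```

The overlap is there so a sentence that straddles a boundary still appears whole in at least one chunk.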


Context size is a huge limitation with current LLM design, but there are already a few open-source attempts at compressing LLM input/output to reduce costs.


I'm using the tiny mistral 7b for those 1hr long transcripts internally at my company. I was surprised that even the quantized 7b version easily chomped my 3090's vram - the context takes a lot. I think it goes up to 32k tokens (I go up to 20k). It hallucinates once every few sentences, but it's surprisingly a non-issue for my use cases (mostly for automated meeting notes, where I'm going through material anyway). 60 T/s is also great.

EDIT: of course GPT4 blows mistral out of the water for those very specific "needle in a haystack" or "sharp deductive reasoning needed" cases. Sometimes it makes people go wow, when I present that
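
A back-of-the-envelope estimate of why the context takes so much VRAM, assuming Mistral 7B's published shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit KV cache. Actual usage in a given runtime will differ, since weights and scratch buffers come on top.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer, head and token."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

GIB = 1024 ** 3
for n in (20_000, 32_768):
    print(f"{n} tokens: {kv_cache_bytes(n) / GIB:.2f} GiB")
# prints:
# 20000 tokens: 2.44 GiB
# 32768 tokens: 4.00 GiB
```

That is about 128 KiB per token under these assumptions, so the cache alone grows to several GiB at long context lengths before counting the model weights.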


Which 7B model are you using? 4bit I assume?


This specifically is:

TheBloke / Mistral-7B-Instruct-v0.2-GGUF / mistral-7b-instruct-v0.2.Q8_0.gguf

and I'm running it in LMStudio with the config:

  {
  "name": "Exported from LM Studio on 21.12.2023, 14:57:43",
  "load_params": {
    "n_ctx": 32768,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 1,
    "n_gpu_layers": 100,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [
      0
    ],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true
  },
  "inference_params": {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "top_p": 0.95,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "[INST]",
    "input_suffix": "[/INST]",
    "antiprompt": [
      "[INST]"
    ],
    "pre_prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    "pre_prompt_suffix": "",
    "pre_prompt_prefix": "",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
  }


Also check out www.askYouTube.ai

Which does essentially this but requires no indexing of videos!


I'd not heard of www.askYouTube.ai, thanks! I got a much better set of results for the question "What's the fastest way to learn Flexbox CSS?"

Bookmarked.


This is wonderful. A fantastic step towards indexing video...

I have some ideas I'd like to share - something to improve this project:

1: Are you only scraping the video id and text content? (the meta-data and description text is still useful to have searchable maybe - at least as an option)

2: In addition to audio - you can capture video content as well into tags (which for videos without sound, poor sound, or non-speaking would make those searchable). audio content is only part of the information I personally want to search. being able to ask for videos "about"/"with" a topic.

3: defined search keywords: searching through different stacks could be sped up by specifying which stack to search (if you take the above suggestion about video tags). as a user I would want to specify if I am looking for audio content, or video content (such as showing a mechanical process of some kind, or how a specific design of dress looks vs just talking about it).

4: dark mode.

I really like this project and hope it becomes the best possible version of itself!


Thanks for sharing!

1: Nope, I'm also taking the title, description and tags as part of the content as well.

2: Agreed! This is something I'm aiming to put out in the next few updates where the AI tries to understand what's going on in the video along with the transcribed audio. This would be great for video demos with no audio etc.

3: That's pretty interesting! pre-processing the search query and trying to understand the category of that search could narrow down the specific tags in that sense. Would be interesting to try something like that!

4: Yes!! hahaha this will be priority


This is amazing...

UX Feedback:

The vid bubbles are HUGE. https://i.imgur.com/TjHhtRr.png

It would be great to have TopicTabs... I search for "X" and it gives me one of the topic bubbles (as seen on the landing) - under my search bar. I enter a new search, the vids change, but the previous search results bubble is still there.

This is fantastic though.

--

It would be really interesting to package it in such a way where one can connect all their CCTV cams/footage to this. Then have all their video searchable:

"Show me any white vehicles that drove by in the last hour"

As seen on TV...


Thanks for the feedback! Always love UX feedback, will be checking it out :)

Also yes, sometimes when you search for a topic that hasn't been indexed, it will try to find the closest next best. If you're looking for specific video categories, you could index them by clicking on the hamburger menu on the right of the search bar.

Yeah agreed! This tech has tons of use cases like the one you shared with CCTV. The code is open source, so anyone can take it and implement it for their use case :)


It strikes me as interesting that videos pretty much never change. With web pages it is easy to replace the blog/forum/etc with a bunch of spam. For the initial owner it might be tempting, but eventually it is even expected, as if it were the whole point of websites. People die, the domain expires, the next owner generates a spam page for it. If it doesn't expire it probably already has a ton of spam on it.

If people could actually find things the whole purpose is gone. (lol)


I like this. Another idea maybe is to build a podcast search engine too. The descriptions are there but many times the content is built up to ensure a particular podcast running time.

One thing I am struggling with is adding videos. I was trying to see if I can add Huberman's podcast videos on YT and then find specific parts on say light therapy rather than going through the whole set of videos or even the video on sleep. I get the OTP but nothing works.


Have a very similar idea to what you mentioned about podcasts.

Had the same result in relation to the OTP.

What did you mean by this "the content is built up to ensure a particular podcast running time."?

Btw, the website on your profile didn't work for me.


Have you tried dexa.ai? Here is an example with Hubermanlab Podcast - https://dexa.ai/huberman


This is interesting. However, it seems it's only trained on summaries rather than the full video transcript, right? I asked it to give a list of questions asked by a host in a particular podcast episode — it couldn't provide them.

e.g. Q) "can you give me the complete list of questions asked by Lex in the Jeff Bezos episode"

A) gives me one paragraph and then says, "Please note that this is not a complete list of all questions asked in the episode, but only those present in the provided podcast chunks. For a complete list, you would need to listen to the full podcast episode."


I haven't been able to get a good search result yet. I tried "how do i improve my lap times at bathurst in assetto corsa competizione" and got a couple of irrelevant results about Mario Kart. I tried "how do i plant a tree in clay soil" and got only one result, about a tree planting cannon. Pretty bad.


the next step would be to produce descriptions of each scene in the video and search on that.

this is already possible for photos/images, so getting this for video is just a matter of time, but i don't know how far along the technology is at this point and how much resources it will take.


You read my mind! Beyond indexing the audio of the video, taking snapshots of the video content and indexing that would be pretty interesting as well. I've seen what GPT-4 vision can do for images, I'll be looking for an open source alternative that I can build around


for image analysis there are these two options:

https://www.adept.ai/blog/fuyu-8b

https://github.com/THUDM/CogVLM


Oh nice!! Thanks, this helps :)


There is a service that produces descriptions of each scene:

https://www.videogist.co/

But I don't know about searchability.


This is cool. Looks like a very basic version of https://netflixtechblog.com/building-in-video-search-936766f....


This is cool! Learned about JigsawStack.

Curious, can you share your decision making for adding Hasura on top of Supabase? I used Hasura a couple of years ago, recently started using Supabase and it seems to cover my Hasura use case in a simpler way.


Yeah! I'm the founder of JigsawStack, happy to help if you have any feedback or questions.

I've been using Hasura generally with vanilla Postgres DBs and I've mainly used it for three reasons: client-side permission management, DB dashboard and GraphQL.

Supabase has done an amazing job trying to cover all those aspects with their own solutions, but I still find Hasura hits the sweet spot for those aspects a little better and is more fine-tuned.

Client-side permissions: Hasura has an amazing UI to build permissions, compared to Supabase's SQL-based row level permissions where I still need to write code and then verify it.

DB dashboard: managing the db, adding/removing columns, changing types, creating triggers, indexes etc. works/feels a lot better to me on Hasura, or maybe it's just something I've gotten used to.

GraphQL: I love writing in GraphQL, and Supabase's GraphQL layer isn't production standard at this stage.

Again, a lot of this is preferences and my personal opinion. You can build amazing products with just Supabase for sure, just a lot faster for me with Hasura right now :)


It does not work at indexing anything else than AI? Tried searching for something like zero knowledge proofs and it showed me courses in Java. Interesting approach though


Hi, I've only managed to index thousands of videos from specific channels right now; however, you could index videos you would like to see by clicking on the hamburger menu on the right of the search bar and clicking "Index video". You can add as many videos as you like from YouTube and they'll become searchable.

If there are specific channels with tons of videos you'd like to see, share them here and I can add the channel videos directly in the backend :)


This is great! Can imagine it as a great start to collect training data for a multimodal LLM that can generate short educational videos on demand!


Very true! Using LLMs to build tools which LLMs will use to train and the loop goes on!


Pretty Cool! Why doesn't Youtube, Insta etc provide a service like this?

And how are you getting funds for expenses like whisper etc?


I think it's a matter of time; I hope to see these kinds of services built into YT, Insta etc. Also, that's why startups tend to exist, for when the bigger guys don't do it :)

That's why I love open source tech, because whisper is open source, people have made it cheaper, faster and better. Check this out: https://replicate.com/vaibhavs10/incredibly-fast-whisper


I've been itching to build something in this context. I'm so happy someone actually did it!

Some videos are a real treasure


Yeah, and it's open source so you can help make it better


So this is RAG for video - VRAG?


Yeah kinda but I might be adding aspects of typical fuzzy full text search along with the RAG element for better search accuracy


Typo, "learn suapbase"


Thanks! Pushed a fix, should be up in a min



