Show HN: YouTube Full Text Search – Search all of a channel from the commandline (github.com/notjoemartinez)
540 points by notjoemartinez on May 20, 2023 | 165 comments
yt-fts is a simple Python script that uses yt-dlp to scrape all of a YouTube channel's subtitles and load them into an SQLite database that is searchable from the command line. It lets you query a channel for a specific keyword or phrase and generates timestamped YouTube URLs pointing to the videos that contain it.
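
For readers who want to see the general idea, here is a minimal sketch of the approach described above. It is not yt-fts's actual code: the `subtitles` table, its columns, the sample rows, the video ID, and the helper names `to_seconds` and `search` are all assumptions made up for illustration. It stores subtitle lines in SQLite, searches them with LIKE, and turns each hit into a timestamped URL.

```python
import sqlite3


def to_seconds(timestamp: str) -> int:
    """Convert an 'HH:MM:SS.mmm' subtitle timestamp to whole seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))


def search(conn: sqlite3.Connection, phrase: str) -> list[tuple[str, str]]:
    """Return (subtitle text, timestamped URL) pairs for lines containing phrase."""
    rows = conn.execute(
        "SELECT video_id, start_time, text FROM subtitles WHERE text LIKE ?",
        (f"%{phrase}%",),
    ).fetchall()
    return [
        (text, f"https://youtu.be/{video_id}?t={to_seconds(start)}")
        for video_id, start, text in rows
    ]


# Hypothetical schema and data; a real tool would fill this from downloaded subtitles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subtitles (video_id TEXT, start_time TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO subtitles VALUES (?, ?, ?)",
    [
        ("abc123XYZ_0", "00:00:05.000", "welcome back to the channel"),
        ("abc123XYZ_0", "00:22:28.500", "a distribution that is radially symmetric"),
    ],
)

results = search(conn, "radially symmetric")
for text, url in results:
    print(url, "-", text)
```

The `?t=` query parameter is what makes the link jump to the matching moment in the video.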



I love that a third party is stepping up here, but it's incomprehensible to me that Google doesn't do this themselves. They're a search company, and they own YouTube. The YouTube data — including the subtitle files — is already sitting there on their servers; they don't have to scrape it, they just have to index it. What are they even doing?

Fun thing to try: do a Google search with "site:youtube.com" in it. You get basically nothing, no matter what keywords you use. It seems that Google actually entirely ignores/excludes YouTube from their regular HTML indexing, and instead only relies on the YouTube backend to actively push content into (a special, separate part of) the search index. Which gets you "results from YouTube" and "video search" — but doesn't get you the ability to search youtube videos pages qua web pages. (Consider: you can find a post in a Reddit comment thread on Google. Can you find a post in a YouTube video comments section on Google?)

Heck, when I first heard about YouTube's autogenerated captions, my first thought was "oh, so this is Google building deep indexing of video through audio transcription, because they can't trust externally-provided subtitles, right?" But it's been 10 years, and I couldn't have been more wrong.


I would posit that google has determined that it’s more profitable to keep viewers on YouTube via controlling the viewing experience vs whatever additional ad revenue they’d gain by making videos more easily indexable. I’m basing this hypothesis on YouTube’s consistent trend of removing features related to controlling your own viewing experience. For example, removing subscription collections.


Actually, letting people search video text would enable less watching and that is probably the reason they aren't interested in it.


Exactly! Google is in the attention business, the more you stay the more they make


Best case for Google is you click through and watch 10 videos without ever finding what you're after, so you try again the next day.


I get it now. It's just like some grocery stores used to operate. They purposely moved stuff around so people have to wander in order to find things. Or they put common items far away so people have to interact with other stuff. Theory is that it increases sales.


Watch 10 videos and then fall down a rabbit hole and forget what you were even there to look for - the even more best case!


So they say, attention is all you need.


> Google is in the attention business

This is probably the best description of their business model.


attention abuse would be a better one, imo.


How would making videos more indexable move people off of YouTube? Once you're there, you'd stay there, with all the current recommender algorithms still in effect; all the external indexability would do is give Google (and Bing, and everyone else) more reasons to lead you into that labyrinth.


The same way they did a few weeks ago, when they changed the interface to only show you the options "Latest" and "Popular" for a channel, instead of the drop down box that was present before. You could get a list of most recent or older videos.

They are about keeping you engaged and keep suggesting to users what they think will keep you engaged with the platform. It's an entertainment platform not a library...


I don’t understand, can you explain more?


I mean this:

"Tell HN: YouTube's web UI just got even worse" - https://news.ycombinator.com/item?id=33371268

They used to have a drop-down box, for oldest or most recent, when you clicked on the Videos tab for a YouTube channel. Now, you will notice, it has only "Latest" and "Popular", something that of course the algorithm will decide upon...


We seem to be locked into an ever more blatant "cycle of addiction". As with (other) drugs, the addict and the dealer are something like "two sides of the same coin".

Briefly, new tech / markets / apps ... i.e., 'innovation' ... often 'creates' tremendous 'value'*. Money (aka capital) rushes in - there's a boom, everyone's excited: consumers are getting new products, services, conveniences, improved prices; engineers, technologists, scientists, machinists, etc. are excited - building the new tech, exploring capabilities, using it to advance research, improving precision / throughput / etc. Dopamine hits all around, vibrant 'ecosystem', everything's lively, feelings of 'changing the world', etc.

But, then, the space reaches a comparative plateau. The low-hanging fruit is exhausted. The lucky/skilled 'VCs' have made their 50x+ returns (that far more than cover the various failed ventures), the technologist innovators are moving into some other space, etc.

What's left are basically the 'landlords' and their crew. There's nothing inherently wrong with that. Those are fine 'occupations' with essential (enough) functions etc. The problem is, there are pressures, with our current culture, towards unlimited growth. There's this need ... this addiction to / normalization of 'neverending exponential curves'. So, you get rent-seeking, forms of 'maximum tolerable extraction' - a sort of exploitative feudalism... Locking up of all the resources now developed in the hands of a comparative few.

There's something of a spectrum, when it comes to economic sectors / businesses, as to how naturally monopolization forms. In general, unchecked, just about any sector / business area can fairly easily trend towards one 'company' controlling everything ... there's more that could be written here, but I'm out of time right now ... [this part should probably be edited / made into a more suitable transition]

Ultimately, every time a given market reaches a certain degree of concentration, the businesses left, particularly these days, increasingly act like "Polyergus" ants, or possibly Cuckoos, or other parasites of various types - organisms that use some degree of parasitic strategies. They 'farm' the rest of us for ever-increasing amounts of 'value'. But, not like responsible farmers, more like the human species as described by A. Smith in The Matrix - "Every mammal on this planet instinctively develops a natural equilibrium with the surrounding environment but you humans do not. You move to an area and you multiply and multiply until every natural resource is consumed and the only way you can survive is to spread to another area."

That pattern is woven throughout this way overly long comment (that also desperately needs editing) ... sadly, gotta run now.

* Some of these words / phrases are so clichéd / co-opted, it's hard not to put quotes of some type around them - partly calling attention to that aspect, partly calling attention to more fundamental meaning


Sounds like you might enjoy reading Doctorow's essay on "enshittification".

"Here is how platforms die: first, they are good to their users; then they abuse their users to make things better for their business customers; finally, they abuse those business customers to claw back all the value for themselves. Then, they die."

-- Cory Doctorow, https://pluralistic.net/2023/01/21/potemkin-ai/#hey-guys


I can imagine that training people to snipe the exact content they want based on external search could reduce minutes watched.

If I was searching for how to replace part X in appliance Y or car Z, better search might get me the answer with less watch time. That’s good for me, but might hurt YT.


If you get information by reading text on your terminal you didn't see the ads - I think the thinking stops there.


Whenever I notice an "obvious" potential feature that could improve user experience in a Google product (there are MANY), I automatically assume it's because of one of two things:

a. Doing it won't get anybody a promo.

b. They've considered it and determined that it will lose revenue to a degree that is not justified by the usability improvement.

This is one of the pitfalls of having an ad-based product instead of a fee-based product. User experience is just no longer the top priority.


> I would posit that google has determined that it’s more profitable to keep viewers on YouTube via controlling the viewing experience vs whatever additional ad revenue they’d gain by making videos more easily indexable

This is exactly it. It's the same business logic that is employed in the regular Google Search.

Google doesn't make money from giving you correct search results. It makes money by keeping you searching for the results you want.


I despise YouTube purely for inflicting mandatory Shorts on the user.


This extension does a good job of removing them from your subscription feed. If you, like me, guard your feed because you don't want to watch random stuff from the home page, this is the only way to keep it from being cluttered with random Shorts thumbnails. The very fact that you cannot disable Shorts in your feed is outrageous, but obviously not surprising.

https://chrome.google.com/webstore/detail/youtube-shorts-blo...


I’ve noticed a similar problem with Tumblr’s app. Last year they added a “live” feature, apparently trying to carve out some minor slice of egirl pie, and proceeded to put it all over the UIs. Also popups asking you to pay money. Needless to say UX has dramatically devolved on their app.


Tik tok effect


I’m willing to bet that the features you’ve seen removed are taken out because their utilization is low, and there’s some associated maintenance cost. I’d also like to know what other features were removed. I had never heard of subscription collections, for example, and a search returned [this video](https://youtu.be/qGSHPhR8k8g) (I wish markdown worked on here) that says collections were a test feature (I’m guessing it didn’t do well enough to make it to prod).


Have you tried searching through your call history on a Google phone? It's awful. You'd think they'd have a solid search built in but it's nearly useless. Especially considering you're usually searching for that number you only dialed once and isn't in your contacts and is for some reason excluded from your recents, so you go into your call history (strangely hidden behind a menu) and...there's no search function. WTF? You'd think you could filter by area code, date or a general time which the call took place, Google is a search company after all, right? Or am I subtly being nudged to use some of their more profitable products to try and find it again?

FOSS dialer recommendations are welcome btw.


As much as I like to generally, I wouldn't blame Google for this, rather I blame the entire field of "UX".


I have this same exact problem. I thought I was crazy. The call history on Android/Pixel is absolutely terrible.


It works as good as Windows search, no better


Beyond conspiracy theories it's interesting to speculate why Google is not providing native search-in-subtitles and search-in-comments. The easiest explanation is that they don't trust the quality. They probably tried it and reached the conclusion that it doesn't improve search in any meaningful direction.

I know from experience that search in user reviews is very hard. Unless you really understand the review (which was tried via sentiment analysis) you cannot rank results well. But now with the new LLM models I think it would work better.


I’m pretty sure google uses the captions in search. As long as you search from within YouTube. I regularly search for keywords and find hourlong videos where it’s mentioned somewhere in the middle and nowhere in the description.


I still have doubts. Google uses anchors heavily in search, so it's hard to prove if a result is due to matches in captions or in a website linking to the video. Would be good to find proof one way or another.


FWIW I tried to find the Norm Macdonald clip where he sings about the "impossible bat" and YouTube was spot on.


I suspect what you see is the search "learning" from the users -- someone is looking for a specific video that they've seen before but need to refine the search terms a few times before they find it. Thereafter, Google remembers the entire sequence of search terms and associates them with the video even though none of those terms might actually exist in the indexed video metadata. I've noticed that Google Search at least does this very obviously.


I mean Google search does it obviously using page rank which is that if someone links to the page with that word it uses it. Me I was searching for arcane words that I remembered hearing in the middle of a podcast. I doubt anyone actually linked or searched for the same with relevance to that podcast. Also the solution you're saying needs far more intuitive and intricate development than just indexing captions lol.


I think Google is now thoroughly infected with Big-Company-itis. A couple departments would like to and know how to comprehensively use AI across many services, which the consumers would love (though some would be confused by it). But legal, marketing, and some guy in a department called "Annex B?" are preventing it. So then the people in those departments get bored and go somewhere else and their perspectives and skills are lost.


At one point Google claimed their mission was to "organize the world's information and make it accessible". That's proven to be just as much of a joke as "Don't be evil".


For all I care, they lost that claim with [Deleted video] - and by that I don't mean that they remove videos, for which there are countless valid reasons, but that there is no way to see what it was you liked, and that the lists you curate just deteriorate. There are many other, maybe more valid reasons they fail this mission, but that's the one that has been plaguing me the longest.


It's quite intriguing that Google doesn't offer full-text search capabilities for YouTube, considering its position as a leading search company. However, I think there are several reasons for this, some of which may not be immediately apparent.

Firstly, if Google did offer this feature, it would likely be targeted by Search Engine Optimization (SEO) exploits. In essence, any time a new search parameter is introduced, there's a risk of it being manipulated to prioritize certain content—especially by those interested in gaming the system for increased visibility or monetary gain. If YouTube's search feature were to be plagued by such spamming, it could severely degrade the user experience and lead to Google having to strip it away. While not a guarantee, it's a probable outcome given the history of SEO misuse.

Secondly, YouTube's primary focus is on its recommendation algorithm rather than search functionality. With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue. The search feature, while useful, is not as integral to this objective. Further, offering full-text search could provide yet another avenue to manipulate the algorithm, which YouTube surely wants to avoid.

Finally, implementing and maintaining such a feature would require substantial resources. It would necessitate hiring teams of high-salaried employees to moderate and ensure fair use of the feature, adding considerable operational costs. Considering these factors, it seems that Google has made a strategic decision to avoid this feature for now.

That said, the fact that third-party solutions are emerging, such as the one shared here, shows that there's a demand for full-text search capabilities. It also underscores the potential that these solutions have when unencumbered by the constraints faced by a tech giant like Google. This provides a fascinating insight into the dynamic relationship between third-party developers and tech corporations and the way they can complement each other.


> With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue

Maybe for some users. I just use youtube to find a specific video I need (because people have stopped writing useful how-to's now that they can just make a 10 minute video covering about 1 minute worth of text), and a full text search would be so, so useful.


Regarding your second point... I think it's still important because recommendation algorithms work better when users can find content they enjoy outside of the recommended content. If they can't then the recommendations will become stale.


Google already does this themselves. If you search for rare words (e.g. try "indubitably") it will absolutely pull up videos that have the word in the auto-generated transcript and nowhere else (not in descriptions, not in comments).

Also, using "site:youtube.com" on Google works perfectly for me. If I look up "site:youtube.com david letterman" it gives me the David Letterman channel, followed by a seemingly infinite number of Letterman clips. Precisely what I'd expect.

The only thing I can reproduce that you're complaining about is that Google (and YouTube) search don't seem to index YouTube user comments, in contrast to Reddit. But Google doesn't seem to index comments-attached-to-content anywhere on the internet -- not even comments on articles at mainstream publications like the New York Times. Which is probably more of a feature than a bug -- comments on both YouTube videos and news articles tend to be a lot of emotional reactions and repeated opinions which aren't worth searching at all. In contrast, many (not all) Reddit threads are often very informative and the "main content", so it makes sense Google indexes them.

So I don't really see anything to complain about here, from my perspective.


What I'd expect is that

"having a distribution that's both radially symmetric" site:youtube.com

would return 3b1b's "Why π is in the normal distribution" video, which has that in subtitles at 22:28.

Even without the site: term, all I get is an allreadable.com page that's scraped the subtitles for that video. Allreadable appears at first glance to be a site owned by someone in China and hosted on liquidweb.


I just checked and it returns that video as the first result if you put your phrase into the YouTube search box in quotes, so that's good news.

I wonder why YouTube indexes the transcripts for YouTube search, but Google chooses not to include them in its index for Google search? Seems like an intentional decision since it's the same company after all.


Google doesn't even do a lot simpler things like searching by language or location. And the search is garbage. I am trying to learn Italian so that's what I am interested in, but even when I enter a search term spelled correctly with its accents and everything I get anything from Brazilian Portuguese to French. They do a very helpful translation of the term and return results that are unfortunately useless to me. (I would have loved to speak every language but I don't)


The default behaviour for multiple languages is bad, but the settings page for both region, and search results language work well. In case you haven’t seen the settings, yet.


I wish. I have set English and Portuguese only, in that order, and it keeps showing me Spanish, Italian, French.

Oh, and how about setting the default account among the many I’m logged in. Or the disable auto play setting that keeps getting reset every week.


On web search, you can append &lr=lang_it to the URL. Maybe make it into a Chrome extension.
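
A minimal sketch of building such a URL, assuming the `lr` parameter behaves as described above (the helper name `google_search_url` and the example query are made up for illustration):

```python
from urllib.parse import urlencode


def google_search_url(query: str, lang: str) -> str:
    """Build a web search URL with an lr=lang_<code> language restriction."""
    # Assumption: lr=lang_<code> restricts results to that language,
    # as described in the comment above.
    params = {"q": query, "lr": f"lang_{lang}"}
    return "https://www.google.com/search?" + urlencode(params)


url = google_search_url("ricetta carbonara", "it")
print(url)
```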


This does not seem to be doing anything. It's also being removed from the url bar when the page loads.


Weird. It works for me (on google.com), including from an incognito window.


YouTube doesn't have to be good. It just needs to do the minimum it needs to keep users from switching to a different platform... which, because of the network effect of so many people and videos being on it, is not much at all.

If they had serious competition, they'd have to do more to keep users, but no such competition exists.


There’s Twitch and I think Rumble is going to give it a run for its money.

I grant Rumble is only 1% the size of YouTube by viewership, but I think that’ll shift fairly rapidly and we can see 10% on rumble in 3 years.

My analysis https://austingwalters.com/an-analysis-on-rumble-nasdaq-rum/


Twitch is a completely different demographic and rumble is very small.

Close to no competition


Rumble will perish due to its controversial niche, which mind-numbingly is also its main moat


Hardly “controversial”, considering it’s just the completely legal views that YouTube censors.

What will really kill Rumble is the fact that they aren’t a YouTube competitor, more a small island off the coast that is mainly centered on said niche and lack the means of true innovation.

Personally I find Odysee/LBRY much more interesting. Fully open source protocol, more varied creators, transparent moderation due to a public blockchain, and P2P file distribution are all incredibly enticing features to me.


I kind of agree about the moat and niche, but think you’re wrong about the conclusion.

That niche has ~50% of Americans and a large portion of the globe fine with it.

Put simply, as YouTube pushes out niches (guns, comedy, pundits, etc). They move to rumble. This increases the network effects on rumble and will help it grow rapidly. The more niches on rumble, the less of a need to go to YouTube. YouTube effectively kicked out the niches that half the country likes, so you’re going to get two networks. Rumble will let anyone profit off anything, provided it’s 1st amendment protected (supposedly, I have my doubts tbh). So it can build a bigger network.

More importantly, as I point out in the analysis, they’re well positioned as the large media outlets fade to take market share.

If people come to rumble for the pundits or Gun channels and stay for the sports, w.e. That’ll work well for them.


> supposedly, I have my doubts tbh

I do too. The same people I see pushing Rumble were the same people pushing Parler, which for all intents and purposes was a honey pot. Not to mention anything rooted in politics, regardless of side, will be endorsing the “right content” and punishing the “wrong content” in various ways.


I wouldn't be so sure, a big part of growing big as an organization is becoming extremely sensitive to the political winds, and Youtube's "content moderation" is really a function of politics in the West. Were Rumble to grow to a comparable size it too would face the same political risks and at that point it could only remain as it is if there was a sizable political force backing.

Someone already mentioned Odysee and I too am more hopeful of that platform - a decentralized platform is the best positioned to avoid political interference and censorship.


> remain as it is if there was a sizable political force backing.

Exactly my point, they have half the support. YouTube has the other. It seems reasonable to assume a split of the networks so lots of room for rumble to grow.

Odysee is going to be attacked by both political sides, imo; we have seen that already.


I've not come across Rumble yet, is this comparable to Twitter vs. Truth Social (minus the drama)?


It’s closer to something between YouTube and twitch; they’re trying to grow by targeting the niches kicked off of youtube and twitch

To put in perspective: there’s a ton of attorneys on there, right and left wing pundits that had their channels kicked off YouTube.

They also have UFC, sports, gaming, gun channels, etc. and that seems to be where they’re trying to grow most.


There’s a company founded by a friend of mine called MediaDistillery, who are doing awesome stuff in this area. Real-time searchable massive video archives, with contextual understanding (e.g. “a WW2 fragment with a Jewish mother holding her baby”). Super useful for so many purposes.

And then there’s YouTube where you can’t even search subtitles. It makes me shake my head. Google seems to be at the forefront of AI, but doesn’t seem to be able to turn all that expertise into relevant products. Maybe the recent disruptions will shake them awake?


I mustn't be searching it right, I can only get this company: https://mediadistillery.com/ , the feature you mentioned is not accessible, and has that Ad-tech smell.


Social media companies are in a constant tug-of-war against the end users when it comes to controlling what the users see. The ideal is that the user has absolutely no control over on what they see and the social media company can fully dictate content. That is what makes the money.

Allowing users to freely query content in their own websites is completely antithetical to what they are trying to do. YouTube is also very aggressive in preventing scraping and limiting the usage of the official API. Which is quite ironic considering the history of the company.


Google: appeared open to start things off, then went whole hog on the MS embrace-and-extend philosophy, aiming to crush the life out of the entire web.


I think you're exaggerating a little when you say "site:youtube.com" doesn't work. If I search 'site:youtube.com apple watch' I get 143 000 000 results, and if I search something more specific like 'site:youtube.com "Featuring Dr James Grime"' I'll find exactly what I'm looking for. But you're correct that it doesn't seem to search video comments, only titles and descriptions.


The problem I ran into is that, like I said, YouTube doesn't seem to get indexed by hypertext inward-edges from other sites like a regular website does — and so you can't search by how you recall a video being described in pages that link to it; instead, you have to remember how the video describes itself. Which it may not always do well.


Some of us recall that when Google+ launched it lacked any search whatsoever.

That this was the case with a company whose name is synonymous with online search was ... simply mind-boggling.

The platform eventually did get search (actually a few different implementations), which varied between mostly useless to actually reasonably functional, though I'll note that HN's Algolia-based search is vastly more useful on an ongoing basis.

G+'s content, to the extent it survives at all, is largely on the Internet Archive's Wayback Machine which ... lacks search.


Conflict of interests. They want you faffing around provisioning all those valuable clicks for them to sell.


Google does it, Youtube does not.

Google will find a video when you search for a phrase that was said in it (as long as its bad speech recognition got it right), it will find a video with a text that shows up on an object with enough clarity to OCR (for example electronic component name on a pcb in the foreground). There is one plot twist - Google will not always do it when you search for VIDEO specifically :) but will gladly give you videos when searching text/images :)

Youtube search on the other hand will try

- suggest something you liked that has nothing to do with the search term entered

- popular videos at this moment

- videos mildly related to proper results. One of them had a horse in it? clearly you want more horses!

- videos with title mildly related to search term

- to ignore upload date filter when they panic (Christchurch mosque massacre).

For example YT search for "Si5351A" limited to this month will give 11 somewhat proper results mostly with "Si5351" (no A postfix) in title/description AND some dude DXing in Indonesian "Menerima Modif Radio Yaesu FTC 1540A Ke DDS System" because "Si5351A" is a "DDS" so it's the same thing, right? It's like when I'm looking for "NSR Ro80" you should show me plenty of other cars because Ro80 is a car :). Searching for Si5351A without quote marks will show one additional video with Si5351B in the title.

Gets better, searching Google Video for Si5351A last month also gives ~11 results, but only 4 of those are direct YT links :]


Probably because YouTube != Google Search, while YouTube is still a subset of Google. So, going an extra mile for YouTube and not for others might put Google Search in anti-competition issues.

Then again, I also find it absurd. YouTube is one of the most valuable parts of the Internet. And its lack of searchability is criminal. At least the YT search itself should make up for it. It's a shame it doesn't.


Google doesn't necessarily have to do anything special for YouTube, though. Google could "just" index YouTube videos as if they were any other web pages, in a standard way. It would then be YouTube's job, to make the data inside those video pages legible to Google's indexer. Where Google could enable this, by pushing for web standards to increase machine-legibility of video in HTML — e.g. standardized ARIA-accessible captions sources for the <video> element, etc.

If they got it set up such that in theory any web spider could come along and index a YouTube video — then there would be no anti-trust reason that Google couldn't just directly ingest the subtitle files off their own servers; it'd just be a bandwidth-saving optimization over the scraping process that they could otherwise do.


> e.g. standardized ARIA-accessible captions sources for the <video> element, etc.

YouTube could literally be a minimal web forum with a video tag in the first post of each thread, but likely due to DRM and related motivations, they instead wrap everything in thick layers of obfuscated JS.

Comments were easily indexable too, before that was also removed: https://news.ycombinator.com/item?id=11053204

For a while there were various shady-looking sites that seemed to scrape YouTube video pages (including comments) and I could sometimes find them through Google (then going back to YouTube for the original video), but within the past few years those have unfortunately also either been delisted/censored from search results or died out.


They do have search in video but the launch was kinda miffed

https://www.socialmediatoday.com/news/google-tests-text-sear...


How do you think monopoly regulators would like it if YouTube videos were indexed with higher accuracy and detail in google searches than Vimeo video?

So sure, google can say ‘here’s a standard way to provide subtitles for a video which we’ll index’, but then that becomes a complete SEO side channel - google needs to validate that the subtitles actually match the content. And that means their bots downloading the video itself. And google really doesn’t want to go out there and argue that video needs to be downloadable by bots, because that’s the whole YouTube-dl case right there.


> Google search with "site:youtube.com" in it. You get basically nothing

That is not my experience! I regularly resort to this when the crappy inbuilt youtube search, which prefers to throw out algorithmic recommendations over returning actual search results, fails to come up with the goods.

Do you really get no results for, say: https://www.google.com/search?&q=intitle%3A%22thomas+brinkma... ?


> Heck, when I first heard about YouTube's autogenerated captions

Off-topic, but since I don't use YouTube and you do, in your experience, how are the auto-generated captions? Are they accurate?

I've been unimpressed by speech-to-text engines in the past, so I'm interested to hear if this is a problem that Google's managed to solve.


As far as I can tell, Google (but not YouTube?) does search YouTube transcripts.

I have successfully Googled text in a video's transcript and found that video.

The transcripts themselves are pretty bad though (Google's using old tech).

They're usually good enough for auto-summarization though.


> but it's incomprehensible to me that Google doesn't do this themselves.

How is it incomprehensible that they don't give a shit about what you want to see and only care about what's profitable to them for you to see?


i used to imagine similar opportunities with google books. but they have done basically nothing with it. and that's been like 20 years.

if anyone could have disrupted the corrupt and unfair academic publishing world, it was Google. they just found it an uninteresting task. they preferred to work on G+, Stadia, Google Code, etc, https://killedbygoogle.com/


Rule #50 The better the Catalog, the Worse the Interface

Spotify and YouTube are the leading examples but there are definitely others.


YouTube profits from people scrubbing videos. Why on Earth would they want to offer full text search instead?


It's Google; the obvious seems to elude them even when it's sitting in front of them.


I believe Bard has the capability of searching youtube transcripts


Very nice! FYI: sqlite ships with a full text search engine featuring a Boolean query language, highlight(), snippet() and scoring:

https://www.sqlite.org/fts5.html

I’ve not used it with enough content to know how much faster it is than LIKE ‘%my query%’ but it should be a lot quicker.

(Also, in most cases you don’t need to create an id column — every table has one already in the form of rowid.)
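
For anyone who wants to try it, here is a minimal sketch of FTS5 from Python's built-in sqlite3 module. The table name `subtitles_fts`, its columns, and the sample rows are made up for illustration and are not the schema yt-fts uses. It also assumes your Python's SQLite was compiled with FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per subtitle line. UNINDEXED columns are
# stored but not searched.
conn.execute(
    "CREATE VIRTUAL TABLE subtitles_fts "
    "USING fts5(video_id UNINDEXED, start_time UNINDEXED, text)"
)
conn.executemany(
    "INSERT INTO subtitles_fts VALUES (?, ?, ?)",
    [
        ("vid1", "00:00:05", "welcome back to the channel"),
        ("vid1", "00:01:10", "today we talk about full text search in sqlite"),
        ("vid2", "00:02:30", "searching text with the like operator scans every row"),
    ],
)


def search(query):
    """Run a boolean FTS5 query and return matches, best ranked first."""
    sql = """
        SELECT video_id, start_time,
               snippet(subtitles_fts, 2, '[', ']', '...', 16)
        FROM subtitles_fts
        WHERE subtitles_fts MATCH ?
        ORDER BY rank
    """
    return conn.execute(sql, (query,)).fetchall()


results = search("sqlite AND search")
for video_id, start_time, fragment in results:
    print(video_id, start_time, fragment)
```

Note that the default tokenizer does no stemming, so `search` does not match `searching` here; the porter tokenizer is the usual way to get that.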


Not sure if this includes fuzzy search, but having it will make this much more usable.


In what cases is it unwise to rely on rowid over an id field?


I don’t think there are any. They are one and the same — if you create an integer primary key named id it is aliased to rowid:

https://www.sqlite.org/rowidtable.html
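
A quick way to see the aliasing for yourself, as a small sketch with throwaway table names (`with_id` and `without_id` are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table with an explicit INTEGER PRIMARY KEY, one relying on the implicit rowid.
conn.execute("CREATE TABLE with_id (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE without_id (title TEXT)")
conn.execute("INSERT INTO with_id (title) VALUES ('first video')")
conn.execute("INSERT INTO without_id (title) VALUES ('first video')")

# In the first table, id and rowid are the same value under two names.
aliased = conn.execute("SELECT id, rowid FROM with_id").fetchone()

# In the second table, rowid still exists even though no id column was declared.
implicit = conn.execute("SELECT rowid, title FROM without_id").fetchone()

print(aliased, implicit)
```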


This is safer-- relying on implicit rowids can break things if you use a plaintext database dump (those don't have rowids), and having a column "id integer primary key" is clearer in the schema.


I think the integer autoincrement primary key is more explicit / less mysterious than the implicit rowid. Even if most of us have run into that explanation in the sqlite manual.


> yt-fts is a simple python script that uses yt-dlp to scrape all of a youtube channels subtitles and load them into an sqlite database that is searchable from the command line.

Critically, this is per channel. I wonder if we can optionally configure this to share the downloaded transcripts to a central repository so eventually a good proportion of youtube's transcripts could be downloaded as one big text file.


> share the downloaded transcripts to a central repository

Sure, are you willing to host it and handle the absolutely inevitable legal issues?


I wish IPFS was better. It'd be an obvious solution to this. Content hash the YouTube ID and then distribute hosting.


There is something similar to a central repository at https://filmot.com/


It looks like you're running searches using LIKE: https://github.com/NotJoeMartinez/yt-fts/blob/050981c0519a96...

SQLite has a really powerful full-text search mechanism built in - FTS5. It can handle things like stemming and stop words and relevance ranking.

My sqlite-utils Python library includes helper methods for setting that up: https://sqlite-utils.datasette.io/en/stable/python-api.html#...


Thank you! I was able to integrate this into the project[1]. I'm also looking into using your openai-to-sqlite[2] library for semantic search.

[1] https://github.com/NotJoeMartinez/yt-fts/pull/25

[2] https://github.com/simonw/openai-to-sqlite


You're right. Thanks for sharing the link to your full-text search helper, really neat.


This will come in handy. I’ve always wanted to count how many times Lex Fridman has referred to something as a beautiful dance.


I want to do a word count on the word "love".


> yt-fts search "love" --channel "Lex Fridman" | grep "love" | wc -l

> 7060


Update, I couldn't get this to work. It returns 0 for me.

Running it without the grep etc says channel not found.


Update, I had to download all the video subtitles and then query the sqlite tables directly.

Below are the top 10 from just the podcast.

https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuK...

First column is "love"s per episode.

  Total "love" count in 376 episodes = 7,614
  Average "love"s per episode        = 20.25

  -----------------

  107 | Sarma Melngailis: Bad Vegan, Fraud, Pris | iZjby1LkTWQ
   98 | Andrew Huberman: Focus, Stress, Relation | lvh3g7eszVQ
   92 | Bishop Robert Barron: Christianity and t | WgytXF0SPh0
   80 | David Buss: Sex, Dating, Relationships,  | sndW9hzX-wA
   79 | Duncan Trussell: Comedy, Sentient Robots | jdIyNMkusLE
   76 | Rana el Kaliouby: Emotion AI, Social Rob | 36_rM7wpN5A
   75 | Edward Frenkel: Reality is a Paradox - M | Osh0-J3T2nY
   75 | Todd Howard: Skyrim, Elder Scrolls 6, Fa | H9AAnV59ddE
   74 | Travis Oliphant: NumPy, SciPy, Anaconda, | gFEE3w7F0ww
   74 | Kelsi Sheren: War, Artillery, PTSD, and  | PbN3HzKkW4M

  -----------------

  SELECT count(s.video_id) AS love_count, substr(v.video_title, 1, 40), s.video_id
  FROM Subtitles s, Videos v
  WHERE s.video_id = v.video_id
  AND s.video_id IN (
    SELECT v.video_id FROM Videos v
    WHERE v.video_title LIKE '%Podcast%'
    AND v.video_title NOT LIKE '%Podcast Clips%'
  )
  AND s.text LIKE '%love%'
  GROUP BY s.video_id
  ORDER BY love_count DESC
  LIMIT 10
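
One caveat on the query above: it counts subtitle rows that contain "love" anywhere, so a row with the word three times counts once, and "glove" or "lovely" also match. A rough sketch of counting whole-word occurrences instead, assuming the same `Subtitles(video_id, text)` columns used in the query; the sample rows and the `count_word` helper are invented for the demo:

```python
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Subtitles (video_id TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO Subtitles VALUES (?, ?)",
    [
        ("a", "I love, love, love this"),
        ("a", "a glove and a lovely day"),
        ("b", "Love is all you need"),
    ],
)


def count_word(conn, word):
    """Count whole-word, case-insensitive occurrences of word per video."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = Counter()
    for video_id, text in conn.execute("SELECT video_id, text FROM Subtitles"):
        counts[video_id] += len(pattern.findall(text))
    return counts


counts = count_word(conn, "love")
print(counts.most_common(10))
```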


This comment blew my mind.


Legend


I think I'll start to use CLI tools exclusively for discovering and downloading YT content. The entire experience which starts from typing "youtube.com" in the address bar and pressing enter is obnoxiously unbearable.


self-promo, but you might find my extension helpful. https://lawrencehook.com/rys/


Downloading? Do you also not believe creators should be compensated for their content?


I already block advertisements on the web, so I see none when on a Desktop web version of YouTube. But I do not use Sponsor Block so those creators still get to show me their ExpressVPN ads or whatever the flavor is today. Also I use Patreon.


I think they should be compensated, but I don't think Google should be.


I hate being the poo pooer who says that subtitles are available via the API and wish the tool went that route.

I'm all for stuff being archivable with tools like youtube-dl, but I much prefer to see tools like this use the API despite its quotas because it goes beyond archiving a copy for reference. Tools that (ab)use scraping only justify anti-scraping efforts that journalists and the like use and escalate that arms race. I think one could still scrape a channel or two per day within API usage limits [1]--50 units per list, 200 units per download; quota 10,000 units per day.

[1]: https://developers.google.com/youtube/v3/docs/captions/downl...
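
Back-of-the-envelope version of that budget, as a sketch. It assumes every video needs exactly one list call and one download call at the costs quoted above, which may not match how a real client would batch requests:

```python
LIST_COST = 50        # units per captions list call (figure quoted above)
DOWNLOAD_COST = 200   # units per caption download (figure quoted above)


def videos_per_day(daily_quota):
    """Videos whose captions fit in one day's quota, at one list + one download each."""
    return daily_quota // (LIST_COST + DOWNLOAD_COST)


default_budget = videos_per_day(10_000)
print(default_budget)
```

Under those assumptions the default quota covers 40 videos a day, so "a channel or two per day" only holds for fairly small channels.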


Agree that using the API is likely the nicer route. You can also apply for a quota increase: I recently applied for a YouTube API quota increase to 100,000 units and it was approved for my app (https://cmdcolin.github.io/ytshuffle/). I was concerned they wouldn't like that the app downloads so much data, but it was approved without much question; they just wanted the terms of service prominently displayed to end users.


For videos without subtitles one could chain Whisper to auto-generate transcripts, though that would require downloading the audio and processing it


This is exactly what I’ve built. Nothing fancy just ytdlp + whisper + ripgrep + fzf and I’ve got a pretty interesting way to ctrl+f my YT history.


Mind sharing? I'd like to try this out


Not OP, but I too wrote something nearly identical with whisper so I could creep on old EthosLab videos. Here's the gist:

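    from pytube import Channel
    import whisper

    # Placeholder: the original gist did not show where channel_url comes from.
    channel_url = "..."

    # Grab the first video listed for the channel and download a 720p mp4 stream.
    channel_yt = Channel(channel_url)
    video_yt = channel_yt.videos[0]
    video_yt_stream = video_yt.streams.filter(mime_type="video/mp4").filter(res="720p").first()
    video_yt_video_file_path = video_yt_stream.download()

    # Decode the audio track and transcribe it with the smallest Whisper model.
    audio = whisper.load_audio(video_yt_video_file_path)
    model = whisper.load_model("tiny")
    transcript = model.transcribe(audio)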


This is great!

I had made something like this for my own use, but it was way more complicated. It took a user account, downloaded all the liked videos, ran it through some model and vectorized it, and then you could use natural language search to describe a scene in your history of liked videos and it would show you the timestamp and the thumbnail (with the link to start watching from the timestamp).

I ended up taking it offline because I didn't use it much, it was expensive and I couldn't see a path to monetization.

This solution being posted however, is really elegant, in particular because it is very resource-efficient.


Is it as effective as what you did? Did you need an API key for yours to work? What was your cost per video?


Nice. Combine this with an "ascii-art" video converter in the terminal? There are some existing tools, a brief search yields this UNIX StackExchange discussion: https://unix.stackexchange.com/questions/160212/watch-youtub...


Apropos of the post: sometime back I had wondered if it would be possible to search videos for content by text keywords, but not to match text occurring in the video title, chapter titles or comments. Instead, if an app or library could somehow search for the given keywords by matching them with the spoken words in the video. That would be a potentially cool and useful feature.

I realize this may be technically impossible [1] or very difficult, but thought of mentioning the idea here.

[1] On further thought, speech recognition (as seen in mobile phones at least) has progressed to quite a good level (speaking as a layman for this part), so maybe the idea is not wholly infeasible. If an app or lib could somehow "internally" play the video, and speech-recognize the spoken words into text, then the problem would reduce to normal text pattern matching.

Putting the idea out here to see what devs think of it.


>somehow "internally" play the video,

Analogous to how headless browsers are used for automated testing or scraping of web apps.


Does this rely on an API key?

There have been earlier tools which permitted command-line / terminal access to Youtube, one of the best being mps-youtube:

<https://github.com/mps-youtube/mps-youtube>

That permitted searches for terms and channels (though not subtitle text within channel AFAIK), for music specifically, and compilation of either temporary or saved playlists, with the option to play through a full selection of videos. It also offered either full-video or audio-only playback.

Google killed it by throttling API-key access.

Discussed previously: <https://news.ycombinator.com/item?id=32919545> <https://news.ycombinator.com/item?id=28571421>


It seems to just use yt-dlp, which supports fetching subtitles.


Thanks.


    $ python yt_fts.py download 'https://www.youtube.com/@ycombinator/videos'
    [...]
      File "/app/yt_fts.py", line 176, in get_channel_id
        channel_id = re.search('channelId":"(.{24})"', html).group(1)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'group'

    $ python yt_fts.py download 'https://www.youtube.com/@ycombinator/videos' --channel-id UCj089h5WsDdh1q8t54K3ZCw
    [...]
      File "/app/yt_fts.py", line 191, in get_channel_name
        data = json.loads(script.string)
                          ^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'string'
works great


The tool itself does work great.

This behavior is due to the YouTube cookies consent page.

I opened an issue about this specific problem: https://github.com/NotJoeMartinez/yt-fts/issues/1

Glad if you want to help.
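
For illustration, the crash in the traceback comes from calling `.group(1)` on a failed `re.search`. A minimal sketch of guarding against that, using the regex shown in the traceback; the `extract_channel_id` name and the stand-in page strings are hypothetical, and this is not necessarily how the project fixed it:

```python
import re


def extract_channel_id(html):
    """Return the 24-character channel id found in the page, or None if absent."""
    match = re.search(r'channelId":"(.{24})"', html)
    return match.group(1) if match else None


# Stand-ins for real responses: a consent interstitial has no channelId in it.
consent_page = "<html><body>Before you continue to YouTube</body></html>"
channel_page = '{"channelId":"UCj089h5WsDdh1q8t54K3ZCw"}'

for page in (consent_page, channel_page):
    channel_id = extract_channel_id(page)
    if channel_id is None:
        print("No channel id found; try passing --channel-id explicitly")
    else:
        print("Found channel id:", channel_id)
```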


Fixed.


As a workaround you can manually set the channel_name in line 82


Reminds me how much I dislike python error messages, not as much as java but still so much noise to signal by default.


This reminded me that Google Podcast search function clearly uses some kind of transcription index for its search. However, I wish they included at least partial matches as part of the search results (similar to how Google Books search works).


>yt-fts is a simple python script that uses yt-dlp to scrape all of a youtube channels subtitles and load them into an sqlite database that is searchable from the command line

"Scrape all of YouTube subtitles"

So if that data is not good...then how is this useful. The captions in YouTube have been pretty bad. Are these the same thing? I'll test it out.


Another avenue for this is to use Tube Archivist, which takes the approach of locally mirroring videos and serving them up in a web interface, complete with comments, subtitles, and an index containing all the above. Definitely overkill if you just want to do a couple text searches though.


I have actually been working on a full-blown full-text search engine for youtube by transcripts as a web application: https://clipbase.xyz

I'd love to know what y'all think!


Nice. Possible (?) improvement: provide a bit of context with each clip. For example: 2 seconds before the searched bit, and 2 seconds after.
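
Roughly what I mean, as a sketch: pad the matched subtitle's start and end by a couple of seconds before building the link. The function names are made up, and it assumes timestamps in the `HH:MM:SS.mmm` form used by subtitle files and the `?t=` seconds parameter on youtu.be links.

```python
def to_seconds(timestamp):
    """Convert an 'HH:MM:SS.mmm' subtitle timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_window(start, end, padding=2):
    """Return (begin, finish) in whole seconds, widened by padding on both sides."""
    begin = max(0, int(to_seconds(start)) - padding)
    finish = int(to_seconds(end)) + padding
    return begin, finish


def clip_url(video_id, start, end, padding=2):
    """Build a link that starts playback slightly before the matched text."""
    begin, _ = clip_window(start, end, padding)
    return f"https://youtu.be/{video_id}?t={begin}"


print(clip_url("iZjby1LkTWQ", "00:01:23.450", "00:01:26.000"))
```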


Nice work. You could encode the text, load this into a vector database and allow semantic search.


Pardon my ignorance as I have not worked on Vector DBs yet, could you come up with an example of how it'd be different than a full text search?


Here's a (kinda) ELI5: you would use a language model to create "embeddings" of the text, which you can think of as a set of numbers representing the "meaning" of a set of characters.

These numbers can be plotted as points in a space, and embeddings of things with similar meanings are plotted close to each other. So things like "exam preparation" would have embeddings close to things like "top study tips".

Say you have created embeddings for a large corpus of text (in this case all youtube captions) once. If you create embeddings for a user query, you can search for embeddings close to it, and these will be "semantically" similar to the query.

The advantage is that unlike traditional full-text search, the user doesn't need a query that includes words present in the text.
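
A toy sketch of the "close to each other" part. The three-number vectors below are hand-made stand-ins, not output from any real model (real embeddings have hundreds or thousands of dimensions), and `nearest` is just an illustrative helper:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Pretend embeddings of three caption snippets.
captions = {
    "top study tips": [0.9, 0.1, 0.0],
    "how to bake sourdough": [0.0, 0.2, 0.9],
    "fixing a flat bike tire": [0.1, 0.9, 0.2],
}

# Pretend embedding of the query "exam preparation", which shares no words
# with any caption above.
query_vector = [0.8, 0.2, 0.1]


def nearest(query, corpus, k=1):
    """Return the k caption texts whose vectors are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda text: cosine_similarity(query, corpus[text]),
        reverse=True,
    )
    return ranked[:k]


print(nearest(query_vector, captions))
```

A vector database does essentially this lookup, but with an index so it does not have to compare the query against every stored vector.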


Do you have any resources that might guide one on doing something like this from scratch?



Here's a 6 minute speed run of something like that on Weaviate https://youtu.be/mBcBoGhFndY




Something like this? https://dexa.ai


Yes in theory although they are pretty expensive. I am doing something like this at work as I wanted to unlock the wealth of information we have in our tutorials, webinars etc.


https://weaviate.io/ Looks interesting. I was just reading about it.


If you're getting started with Weaviate, these two are probably what you need:

1. Wizard to create a docker-compose file: https://weaviate.io/developers/weaviate/installation/docker-... (e.g. choose the embedding model)

2. Sample notebook showing how to index items using the python library: https://github.com/weaviate-tutorials/vector-provision-optio...


   python3 yt_fts.py download https://www.youtube.com/@PerspectiveArts/videos --channel-id UCUCN8V_pO0xOFKLL4XG1tshnw
   Downloading channel
   Saving vtt files to /tmp/user/1000/tmpm4xoskpo
   Traceback (most recent call last):
   File "/home/user/src/yt-fts/yt_fts.py", line 273, in <module>
     cli()
   File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
     return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1055, in main
     rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
   File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
     return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
     return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/user/src/yt-fts/yt_fts.py", line 31, in download
     download_channel(channel_id)
   File "/home/user/src/yt-fts/yt_fts.py", line 82, in download_channel
     channel_name = get_channel_name(channel_id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/user/src/yt-fts/yt_fts.py", line 191, in get_channel_name
     data = json.loads(script.string)
                      ^^^^^^^^^^^^^
   AttributeError: 'NoneType' object has no attribute 'string'


Probably due to the YouTube cookies consent page.

I opened an issue:

https://github.com/NotJoeMartinez/yt-fts/issues/1

Hope it helps.


Fixed.


Loved the idea, yet to try it out. I would definitely download all videos of Lex and run some text analyzer/text cloud generator to learn about things being discussed


Nice work highlighting that "Life In The Big City" classic from the Ben Avery days


What I really would like to see on youtube is a full text search on video content, at least for videos with subtitles.


Youglish [1] is a website that allows you to search videos by transcript text, with timestamps

[1] https://youglish.com/


For what it is worth, we are working on a tool[0] that indexes all local videos and images and then lets you query them using natural language. It is based on CLIP, which was trained on image-text pairs but seems to work well for videos after applying some naive heuristics.

[0] https://github.com/ramanlabs-in/hachi


I stumbled across a ShowHN that did not get to the front page but seems to fit here:

https://clipbase.xyz/


Perhaps I'm wrong but how is this full text search? It's just using the LIKE operator


> how is this full text search?

It allows searching the full text, instead of just title, description, or keywords.


That is full text search. It just doesn't maintain an index because it's a small enough dataset per channel that it can brute force it in memory.


Put the subs into a vector db instead and enable semantic search. :)


Can you give some additional insight as to what this enables? Maybe some additional links for research? How does one store/format data on such a database?


Was thinking the same


Next step is to prettify subtitles into sentences using one of the LLMs.


Wondering if this could be expanded to also search for comments


Thank you for sharing this with us. Looks good to me.


Why not fold this into an LLM interface?


nice



