yt-fts is a simple Python script that uses yt-dlp to scrape all of a YouTube channel's subtitles and load them into an SQLite database that is searchable from the command line. It allows you to query a channel for a specific keyword or phrase and will generate time-stamped YouTube URLs to the videos containing the keyword.
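A minimal sketch of the idea described above, assuming Python's built-in sqlite3 module was compiled with the FTS5 extension (common, but not guaranteed). The table layout, the function names, and the VIDEO_ID_A / VIDEO_ID_B placeholders are made up for illustration and are not taken from yt-fts itself:

```python
import sqlite3


def build_index(rows):
    """Load (video_id, start_seconds, text) subtitle rows into an in-memory FTS5 table."""
    con = sqlite3.connect(":memory:")
    # UNINDEXED columns are stored alongside the text but are not tokenized.
    con.execute(
        "CREATE VIRTUAL TABLE subtitles "
        "USING fts5(video_id UNINDEXED, start_seconds UNINDEXED, text)"
    )
    con.executemany("INSERT INTO subtitles VALUES (?, ?, ?)", rows)
    return con


def search(con, phrase):
    """Return (timestamped_url, subtitle_text) pairs for an exact phrase match."""
    # Wrapping the phrase in double quotes makes FTS5 treat it as one phrase.
    fts_query = '"' + phrase.replace('"', '""') + '"'
    cur = con.execute(
        "SELECT video_id, start_seconds, text FROM subtitles "
        "WHERE subtitles MATCH ? ORDER BY rank",
        (fts_query,),
    )
    return [
        (f"https://youtu.be/{video_id}?t={int(start)}", text)
        for video_id, start, text in cur
    ]


if __name__ == "__main__":
    # Hypothetical sample rows; real video IDs would come from the scraper.
    con = build_index([
        ("VIDEO_ID_A", 1348, "having a distribution that's both radially symmetric"),
        ("VIDEO_ID_B", 10, "welcome back to the channel"),
    ])
    for url, text in search(con, "radially symmetric"):
        print(url, "|", text)
```

The `?t=` parameter takes whole seconds, so a subtitle starting at 22:28 becomes `?t=1348`.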
I love that a third party is stepping up here, but it's incomprehensible to me that Google doesn't do this themselves. They're a search company, and they own YouTube. The YouTube data — including the subtitle files — is already sitting there on their servers; they don't have to scrape it, they just have to index it. What are they even doing?
Fun thing to try: do a Google search with "site:youtube.com" in it. You get basically nothing, no matter what keywords you use. It seems that Google actually entirely ignores/excludes YouTube from their regular HTML indexing, and instead only relies on the YouTube backend to actively push content into (a special, separate part of) the search index. Which gets you "results from YouTube" and "video search", but doesn't get you the ability to search YouTube video pages qua web pages. (Consider: you can find a post in a Reddit comment thread on Google. Can you find a post in a YouTube video comments section on Google?)
Heck, when I first heard about YouTube's autogenerated captions, my first thought was "oh, so this is Google building deep indexing of video through audio transcription, because they can't trust externally-provided subtitles, right?" But it's been 10 years, and I couldn't have been more wrong.
I would posit that Google has determined that it’s more profitable to keep viewers on YouTube via controlling the viewing experience vs whatever additional ad revenue they’d gain by making videos more easily indexable. I’m basing this hypothesis on YouTube’s consistent trend of removing features related to controlling your own viewing experience. For example, removing subscription collections.
I get it now. It's just like some grocery stores used to operate. They purposely moved stuff around so people have to wander in order to find things. Or they put common items far away so people have to interact with other stuff. Theory is that it increases sales.
How would making videos more indexable move people off of YouTube? Once you're there, you'd stay there, with all the current recommender algorithms still in effect; all the external indexability would do is give Google (and Bing, and everyone else) more reasons to lead you into that labyrinth.
The same way they did a few weeks ago, when they changed the interface to only show you the options "Latest" and "Popular" for a channel, instead of the drop down box that was present before. You could get a list of most recent or older videos.
They are about keeping you engaged, and they keep suggesting whatever they think will keep you engaged with the platform. It's an entertainment platform, not a library...
They used to have a drop down box, for oldest or most recent, when you clicked on the Videos tab for a YouTube channel. Now you will notice it has "Latest" and "Popular" instead, something that of course the algorithm will decide upon...
We seem to be locked into an ever more blatant "cycle of addiction". As with (other) drugs, the addict and the dealer are something like "two sides of the same coin".
Briefly, new tech / markets / apps ... i.e., 'innovation' ... often 'creates' tremendous 'value'*. Money (aka capital) rushes in - there's a boom, everyone's excited: consumers are getting new products, services, conveniences, improved prices; engineers, technologists, scientists, machinists, etc. are excited - building the new tech, exploring capabilities, using it to advance research, improving precision / throughput / etc. Dopamine hits all around, vibrant 'ecosystem', everything's lively, feelings of 'changing the world', etc.
But, then, the space reaches a comparative plateau. The low-hanging fruit is exhausted. The lucky/skilled 'VCs' have made their 50x+ returns (that far more than cover the various failed ventures), the technologist innovators are moving into some other space, etc.
What's left are basically the 'landlords' and their crew. There's nothing inherently wrong with that. Those are fine 'occupations' with essential (enough) functions etc. The problem is, there are pressures, with our current culture, towards unlimited growth. There's this need ... this addiction to / normalization of 'neverending exponential curves'. So, you get rent-seeking, forms of 'maximum tolerable extraction' - a sort of exploitative feudalism... Locking up of all the resources now developed in the hands of a comparative few.
There's something of a spectrum, when it comes to economic sectors / businesses, as to how naturally monopolization forms. In general, unchecked, just about any sector / business area can fairly easily trend towards one 'company' controlling everything ... there's more that could be written here, but I'm out of time right now ... [this part should probably be edited / made into a more suitable transition]
Ultimately, every time a given market reaches a certain degree of concentration, the businesses left, particularly these days, increasingly act like "Polyergus" ants, or possibly Cuckoos, or other parasites of various types - organisms that use some degree of parasitic strategies. They 'farm' the rest of us for ever-increasing amounts of 'value'. But, not like responsible farmers, more like the human species as described by A. Smith in The Matrix - "Every mammal on this planet instinctively develops a natural equilibrium with the surrounding environment but you humans do not. You move to an area and you multiply and multiply until every natural resource is consumed and the only way you can survive is to spread to another area."
That pattern is woven throughout this way overly long comment (that also desperately needs editing) ... sadly, gotta run now.
* Some of these words / phrases are so clichéd / co-opted, it's hard not to put quotes of some type around them - partly calling attention to that aspect, partly calling attention to more fundamental meaning
Sounds like you might enjoy reading Doctorow's essay on "enshittification".
"Here is how platforms die: first, they are good to their users; then they abuse their users to make things better for their business customers; finally, they abuse those business customers to claw back all the value for themselves. Then, they die."
I can imagine that training people to snipe the exact content they want based on external search could reduce minutes watched.
If I was searching for how to replace part X in appliance Y or car Z, better search might get me the answer with less watch time. That’s good for me, but might hurt YT.
Whenever I notice an "obvious" potential feature that could improve user experience in a Google product (there are MANY), I automatically assume it's missing because of one of two things:
a. Doing it won't get anybody a promo.
b. They've considered it and determined that it will lose revenue to a degree that is not justified by the usability improvement.
This is one of the pitfalls of having an ad-based product instead of a fee-based product. User experience is just no longer the top priority.
> I would posit that Google has determined that it’s more profitable to keep viewers on YouTube via controlling the viewing experience vs whatever additional ad revenue they’d gain by making videos more easily indexable
This is exactly it. It's the same business logic that is employed in the regular Google Search.
Google doesn't make money from giving you correct search results. It makes money by keeping you searching for the results you want.
This extension does a good job of removing them from your subscription feed. If you, like me, guard your feed because you don't want to watch random stuff from the home page, this is the only way to not totally clutter the feed with random shorts thumbnails. The very fact that you cannot disable shorts from your feed is outrageous, but obviously not surprising.
I’ve noticed a similar problem with Tumblr’s app. Last year they added a “live” feature, apparently trying to carve out some minor slice of egirl pie, and proceeded to put it all over the UIs. Also popups asking you to pay money. Needless to say UX has dramatically devolved on their app.
I’m willing to bet that the features you’ve seen removed are taken out because their utilization is low, and there’s some associated maintenance cost. I’d also like to know what other features were removed. I had never heard of subscription collections, for example, and a search returned [this video](https://youtu.be/qGSHPhR8k8g) (I wish markdown worked on here) that says collections were a test feature (I’m guessing it didn’t do well enough to make it to prod).
Have you tried searching through your call history on a Google phone? It's awful. You'd think they'd have a solid search built in but it's nearly useless. Especially considering you're usually searching for that number you only dialed once and isn't in your contacts and is for some reason excluded from your recents, so you go into your call history (strangely hidden behind a menu) and...there's no search function. WTF? You'd think you could filter by area code, date or a general time which the call took place, Google is a search company after all, right? Or am I subtly being nudged to use some of their more profitable products to try and find it again?
Beyond conspiracy theories it's interesting to speculate why Google is not providing native search-in-subtitles and search-in-comments. The easiest explanation is that they don't trust the quality. They probably tried it and reached the conclusion that it doesn't improve search in any meaningful direction.
I know from experience that search in user reviews is very hard. Unless you really understand the review (which was tried via sentiment analysis) you cannot rank results well. But now with the new LLM models I think it would work better.
I’m pretty sure google uses the captions in search. As long as you search from within YouTube. I regularly search for keywords and find hourlong videos where it’s mentioned somewhere in the middle and nowhere in the description.
I still have doubts. Google uses anchors heavily in search, so it's hard to prove if a result is due to matches in captions or in a website linking to the video. Would be good to find proof one way or another.
I suspect what you see is the search "learning" from the users -- someone is looking for a specific video that they've seen before but need to refine the search terms a few times before they find it. Thereafter, Google remembers the entire sequence of search terms and associates them with the video even though none of those terms might actually exist in the indexed video metadata. I've noticed that Google Search at least does this very obviously.
I mean Google search does it obviously using page rank which is that if someone links to the page with that word it uses it. Me I was searching for arcane words that I remembered hearing in the middle of a podcast. I doubt anyone actually linked or searched for the same with relevance to that podcast. Also the solution you're saying needs far more intuitive and intricate development than just indexing captions lol.
I think Google is now thoroughly infected with Big-Company-itis. A couple departments would like to and know how to comprehensively use AI across many services, which the consumers would love (though some would be confused by it). But legal, marketing, and some guy in a department called "Annex B?" are preventing it. So then the people in those departments get bored and go somewhere else and their perspectives and skills are lost.
At one point Google claimed their mission was to "organize the world's information and make it accessible". That's proven to be just as much of a joke as "Don't be evil".
For all I care, they lost that claim with [Deleted video] - and by that I don't mean that they remove videos, for which there are countless valid reasons, but that there is no way to see what it was you liked, and that the lists you curate just deteriorate. There are many other, maybe more valid reasons they fail this mission, but that's the one that has been plaguing me the longest.
It's quite intriguing that Google doesn't offer full-text search capabilities for YouTube, considering its position as a leading search company. However, I think there are several reasons for this, some of which may not be immediately apparent.
Firstly, if Google did offer this feature, it would likely be targeted by Search Engine Optimization (SEO) exploits. In essence, any time a new search parameter is introduced, there's a risk of it being manipulated to prioritize certain content—especially by those interested in gaming the system for increased visibility or monetary gain. If YouTube's search feature were to be plagued by such spamming, it could severely degrade the user experience and lead to Google having to strip it away. While not a guarantee, it's a probable outcome given the history of SEO misuse.
Secondly, YouTube's primary focus is on its recommendation algorithm rather than search functionality. With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue. The search feature, while useful, is not as integral to this objective. Further, offering full-text search could provide yet another avenue to manipulate the algorithm, which YouTube surely wants to avoid.
Finally, implementing and maintaining such a feature would require substantial resources. It would necessitate hiring teams of high-salaried employees to moderate and ensure fair use of the feature, adding considerable operational costs. Considering these factors, it seems that Google has made a strategic decision to avoid this feature for now.
That said, the fact that third-party solutions are emerging, such as the one shared here, shows that there's a demand for full-text search capabilities. It also underscores the potential that these solutions have when unencumbered by the constraints faced by a tech giant like Google. This provides a fascinating insight into the dynamic relationship between third-party developers and tech corporations and the way they can complement each other.
> With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue
Maybe for some users. I just use youtube to find a specific video I need (because people have stopped writing useful how-to's now that they can just make a 10 minute video covering about 1 minute worth of text), and a full text search would be so, so useful.
Regarding your second point... I think it's still important because recommendation algorithms work better when users can find content they enjoy outside of the recommended content. If they can't then the recommendations will become stale.
Google already does this themselves. If you search for rare words (e.g. try "indubitably") it will absolutely pull up videos that have the word in the auto-generated transcript and nowhere else (not in descriptions, not in comments).
Also, using "site:youtube.com" on Google works perfectly for me. If I look up "site:youtube.com david letterman" it gives me the David Letterman channel, followed by a seemingly infinite number of Letterman clips. Precisely what I'd expect.
The only thing I can reproduce that you're complaining about is that Google (and YouTube) search don't seem to index YouTube user comments, in contrast to Reddit. But Google doesn't seem to index comments-attached-to-content anywhere on the internet -- not even comments on articles at mainstream publications like the New York Times. Which is probably more of a feature than a bug -- comments on both YouTube videos and news articles tend to be a lot of emotional reactions and repeated opinions which aren't worth searching at all. In contrast, many (not all) Reddit threads are often very informative and the "main content", so it makes sense Google indexes them.
So I don't really see anything to complain about here, from my perspective.
"having a distribution that's both radially symmetric" site:youtube.com
would return 3b1b's "Why π is in the normal distribution" video, which has that in subtitles at 22:28.
Even without the site: term, all I get is an allreadable.com page that's scraped the subtitles for that video. Allreadable appears at first glance to be a site owned by someone in China and hosted on liquidweb.
I just checked and it returns that video as the first result if you put your phrase into the YouTube search box in quotes, so that's good news.
I wonder why YouTube indexes the transcripts for YouTube search, but Google chooses not to include them in its index for Google search? Seems like an intentional decision since it's the same company after all.
Google doesn't even do a lot simpler things like searching by language or location. And the search is garbage. I am trying to learn Italian so that's what I am interested in, but even when I enter a search term spelled correctly with its accents and everything I get anything from Brazilian Portuguese to French. They do a very helpful translation of the term and return results that are unfortunately useless to me. (I would have loved to speak every language but I don't)
The default behaviour for multiple languages is bad, but the settings for both region and search results language work well, in case you haven’t seen the settings yet.
YouTube doesn't have to be good. It just needs to do the minimum it needs to keep users from switching to a different platform... which, because of the network effect of so many people and videos being on it, is not much at all.
If they had serious competition, they'd have to do more to keep users, but no such competition exists.
Hardly “controversional”, considering it’s just the completely legal views that YouTube censors.
What will really kill Rumble is the fact that they aren’t a YouTube competitor, more a small island off the coast that is mainly centered on said niche and lack the means of true innovation.
Personally I find Odysee/LBRY much more interesting. Fully open source protocol, more varied creators, transparent moderation due to a public blockchain, and P2P file distribution are all incredibly enticing features to me.
I kind of agree about the most and niche, but think you’re wrong about the conclusion.
That niche has ~50% of Americans and a large portion of the globe fine with it.
Put simply, as YouTube pushes out niches (guns, comedy, pundits, etc.), they move to Rumble. This increases the network effects on Rumble and will help it grow rapidly. The more niches on Rumble, the less of a need to go to YouTube. YouTube effectively kicked out the niches that half the country likes, so you’re going to get two networks. Rumble will let anyone profit off anything, provided it’s 1st amendment protected (supposedly, I have my doubts tbh). So it can build a bigger network.
More importantly, as I point out in the analysis, they’re well positioned as the large media outlets fade to take market share.
If people come to Rumble for the pundits or gun channels and stay for the sports or whatever, that’ll work well for them.
I do too. The same people I see pushing Rumble were the same people pushing Parler, which for all intents and purposes was a honey pot. Not to mention anything rooted in politics, regardless of side, will be endorsing the “right content” and punishing the “wrong content” in various ways.
I wouldn't be so sure, a big part of growing big as an organization is becoming extremely sensitive to the political winds, and Youtube's "content moderation" is really a function of politics in the West. Were Rumble to grow to a comparable size it too would face the same political risks and at that point it could only remain as it is if there was a sizable political force backing.
Someone already mentioned Odysee and I too am more hopeful of that platform - a decentralized platform is the best positioned to avoid political interference and censorship.
> remain as it is if there was a sizable political force backing.
Exactly my point, they have half the support. YouTube has the other. It seems reasonable to assume a split of the networks, so there is lots of room for Rumble to grow.
Odysee is going to be attacked by both political sides, imo; we have seen that already.
There’s a company founded by a friend of mine called MediaDistillery, who are doing awesome stuff in this area. Real-time searchable massive video archives, with contextual understanding (e.g. “a WW2 fragment with a Jewish mother holding her baby”). Super useful for so many purposes.
And then there’s YouTube where you can’t even search subtitles. It makes me shake my head. Google seems to be at the forefront of AI, but doesn’t seem to be able to turn all that expertise into relevant products. Maybe the recent disruptions will shake them awake?
I mustn't be searching it right, I can only get this company: https://mediadistillery.com/ , the feature you mentioned is not accessible, and has that Ad-tech smell.
Social media companies are in a constant tug-of-war against the end users when it comes to controlling what the users see. The ideal is that the user has absolutely no control over what they see and the social media company can fully dictate content. That is what makes the money.
Allowing users to freely query content in their own websites is completely antithetical to what they are trying to do. YouTube is also very aggressive in preventing scraping and limiting the usage of the official API. Which is quite ironic considering the history of the company.
Google: appeared open to start things off, then went whole hog on the MS embrace-and-extend philosophy, aiming to crush the life out of the entire web.
I think you're exaggerating a little when you say "site:youtube.com" doesn't work. If I search 'site:youtube.com apple watch' I get 143 000 000 results, and if I search something more specific like 'site:youtube.com "Featuring Dr James Grime"' I'll find exactly what I'm looking for. But you're correct that it doesn't seem to search video comments, only titles and descriptions.
The problem I ran into is that, like I said, YouTube doesn't seem to get indexed by hypertext inward-edges from other sites like a regular website does — and so you can't search by how you recall a video being described in pages that link to it; instead, you have to remember how the video describes itself. Which it may not always do well.
Some of us recall that when Google+ launched it lacked any search whatsoever.
That this was the case with a company whose name is synonymous with online search was ... simply mind-boggling.
The platform eventually did get search (actually a few different implementations), which varied from mostly useless to actually reasonably functional, though I'll note that HN's Algolia-based search is vastly more useful on an ongoing basis.
G+'s content, to the extent it survives at all, is largely on the Internet Archive's Wayback Machine which ... lacks search.
Google will find a video when you search for a phrase that was said in it (as long as its bad speech recognition got it right), it will find a video with a text that shows up on an object with enough clarity to OCR (for example electronic component name on a pcb in the foreground). There is one plot twist - Google will not always do it when you search for VIDEO specifically :) but will gladly give you videos when searching text/images :)
YouTube search, on the other hand, will:
- suggest something you liked that has nothing to do with the search term entered
- show popular videos at this moment
- show videos mildly related to proper results. One of them had a horse in it? Clearly you want more horses!
- show videos with a title mildly related to the search term
- ignore the upload date filter when they panic (Christchurch mosque massacre).
For example, a YT search for "Si5351A" limited to this month will give 11 somewhat proper results, mostly with "Si5351" (no A postfix) in the title/description, AND some dude DXing in Indonesian, "Menerima Modif Radio Yaesu FTC 1540A Ke DDS System", because "Si5351A" is a "DDS" so it's the same thing, right? It's like when I'm looking for "NSU Ro80" you should show me plenty of other cars because Ro80 is a car :). Searching for Si5351A without quote marks will show one additional video with Si5351B in the title.
Gets better, searching Google Video for Si5351A last month also gives ~11 results, but only 4 of those are direct YT links :]
Probably because YouTube != Google Search, while YouTube is still a subset of Google. So, going an extra mile for YouTube and not for others might land Google Search in anti-competition trouble.
Then again, I also find it absurd. YouTube is one of the most valuable parts of the Internet. And its lack of searchability is criminal. At least the YT search itself should make up for it. It's a shame it doesn't.
Google doesn't necessarily have to do anything special for YouTube, though. Google could "just" index YouTube videos as if they were any other web pages, in a standard way. It would then be YouTube's job, to make the data inside those video pages legible to Google's indexer. Where Google could enable this, by pushing for web standards to increase machine-legibility of video in HTML — e.g. standardized ARIA-accessible captions sources for the <video> element, etc.
If they got it set up such that in theory any web spider could come along and index a YouTube video — then there would be no anti-trust reason that Google couldn't just directly ingest the subtitle files off their own servers; it'd just be a bandwidth-saving optimization over the scraping process that they could otherwise do.
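To make the "any web spider could index it" idea concrete, here is a rough sketch of how a crawler might discover caption files if a video page used the existing HTML `<track>` element inside `<video>`. It uses only Python's standard html.parser; the class name, function name, and sample markup are hypothetical, and nothing here reflects how YouTube pages are actually built:

```python
from html.parser import HTMLParser


class CaptionTrackFinder(HTMLParser):
    """Collect (srclang, src) pairs from <track> elements that carry captions."""

    def __init__(self):
        super().__init__()
        self.tracks = []

    def handle_starttag(self, tag, attrs):
        if tag != "track":
            return
        attributes = dict(attrs)
        # Only text tracks meant for reading along are interesting to an indexer.
        if attributes.get("kind") in ("captions", "subtitles") and attributes.get("src"):
            self.tracks.append((attributes.get("srclang"), attributes["src"]))


def find_caption_tracks(html):
    """Return the caption/subtitle tracks declared in an HTML document."""
    finder = CaptionTrackFinder()
    finder.feed(html)
    return finder.tracks


if __name__ == "__main__":
    # Hypothetical page markup, invented for this example.
    page = (
        '<video controls><source src="talk.mp4">'
        '<track kind="captions" srclang="en" src="talk.en.vtt">'
        '<track kind="chapters" src="chapters.vtt">'
        "</video>"
    )
    print(find_caption_tracks(page))
```

A crawler would then fetch each `src` (typically a WebVTT file) and index its text like any other document.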
> e.g. standardized ARIA-accessible captions sources for the <video> element, etc.
YouTube could literally be a minimal web forum with a video tag in the first post of each thread, but likely due to DRM and related motivations, they instead wrap everything in thick layers of obfuscated JS.
For a while there were various shady-looking sites that seemed to scrape YouTube video pages (including comments) and I could sometimes find them through Google (then going back to YouTube for the original video), but within the past few years those have unfortunately also either been delisted/censored from search results or died out.
How do you think monopoly regulators would like it if YouTube videos were indexed with higher accuracy and detail in google searches than Vimeo video?
So sure, google can say ‘here’s a standard way to provide subtitles for a video which we’ll index’, but then that becomes a complete SEO side channel - google needs to validate that the subtitles actually match the content. And that means their bots downloading the video itself. And google really doesn’t want to go out there and argue that video needs to be downloadable by bots, because that’s the whole YouTube-dl case right there.
> Google search with "site:youtube.com" in it. You get basically nothing
That is not my experience! I regularly resort to this when the crappy inbuilt youtube search, which prefers to throw out algorithmic recommendations over returning actual search results, fails to come up with the goods.
i used to imagine similar opportunities with google books. but they have done basically nothing with it. and that's been like 20 years.
if anyone could have disrupted the corrupt and unfair academic publishing world, it was Google. they just found it an uninteresting task. they preferred to work on G+, Stadia, Google Code, etc, https://killedbygoogle.com/
This is safer-- relying on implicit rowids can break things if you use a plaintext database dump (those don't have rowids), and having a column "id integer primary key" is clearer in the schema.
I think the integer autoincrement primary key is more explicit / less mysterious than the implicit rowid. Even if most of us have run into that explanation in the sqlite manual.
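A small sketch of the failure mode being discussed, using only Python's standard sqlite3 module. The table names and the `roundtrip_ids` helper are invented for illustration. It dumps a database to plain SQL text and restores it, then compares the row identifiers before and after:

```python
import sqlite3


def roundtrip_ids():
    """Return (implicit_before, implicit_after, explicit_before, explicit_after)
    for the row titled 'b' across a plaintext dump and restore."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE implicit_ids (title TEXT)")
    con.execute("CREATE TABLE explicit_ids (id INTEGER PRIMARY KEY, title TEXT)")
    for table in ("implicit_ids", "explicit_ids"):
        con.executemany(
            f"INSERT INTO {table} (title) VALUES (?)", [("a",), ("b",), ("c",)]
        )
        # Deleting the first row leaves a gap, so 'b' keeps identifier 2.
        con.execute(f"DELETE FROM {table} WHERE title = 'a'")

    implicit_before = con.execute(
        "SELECT rowid FROM implicit_ids WHERE title = 'b'"
    ).fetchone()[0]
    explicit_before = con.execute(
        "SELECT id FROM explicit_ids WHERE title = 'b'"
    ).fetchone()[0]

    # A plaintext dump only contains declared columns, not the hidden rowid.
    dump = "\n".join(con.iterdump())
    restored = sqlite3.connect(":memory:")
    restored.executescript(dump)

    implicit_after = restored.execute(
        "SELECT rowid FROM implicit_ids WHERE title = 'b'"
    ).fetchone()[0]
    explicit_after = restored.execute(
        "SELECT id FROM explicit_ids WHERE title = 'b'"
    ).fetchone()[0]
    return implicit_before, implicit_after, explicit_before, explicit_after


if __name__ == "__main__":
    print(roundtrip_ids())
```

In the restored copy the implicit rowid is renumbered from 1, while the declared `id` column survives unchanged, so any other table that stored the old rowid as a reference would now point at the wrong row.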
> yt-fts is a simple Python script that uses yt-dlp to scrape all of a YouTube channel's subtitles and load them into an SQLite database that is searchable from the command line.
Critically, this is per channel. I wonder if we can optionally configure this to share the downloaded transcripts to a central repository so eventually a good proportion of youtube's transcripts could be downloaded as one big text file.
Total "love" count in 376 episodes = 7,614
Average "love"s per episode = 20.25
-----------------
107 | Sarma Melngailis: Bad Vegan, Fraud, Pris | iZjby1LkTWQ
98 | Andrew Huberman: Focus, Stress, Relation | lvh3g7eszVQ
92 | Bishop Robert Barron: Christianity and t | WgytXF0SPh0
80 | David Buss: Sex, Dating, Relationships, | sndW9hzX-wA
79 | Duncan Trussell: Comedy, Sentient Robots | jdIyNMkusLE
76 | Rana el Kaliouby: Emotion AI, Social Rob | 36_rM7wpN5A
75 | Edward Frenkel: Reality is a Paradox - M | Osh0-J3T2nY
75 | Todd Howard: Skyrim, Elder Scrolls 6, Fa | H9AAnV59ddE
74 | Travis Oliphant: NumPy, SciPy, Anaconda, | gFEE3w7F0ww
74 | Kelsi Sheren: War, Artillery, PTSD, and | PbN3HzKkW4M
-----------------
-- Count subtitle lines containing "love" per podcast episode (clips excluded).
SELECT count(s.video_id) AS love_count, substr(v.video_title, 1, 40) AS title, s.video_id
FROM Subtitles s
JOIN Videos v ON s.video_id = v.video_id
WHERE v.video_title LIKE '%Podcast%'
  AND v.video_title NOT LIKE '%Podcast Clips%'
  AND s.text LIKE '%love%'
GROUP BY s.video_id
ORDER BY love_count DESC
LIMIT 10;
I think I'll start to use exclusively CLI tools for discovering and downloading of YT content. The entire experience which starts from typing "youtube.com" in the address bar and pressing enter is obnoxiously unbearable.
I already block advertisements on the web, so I see none when on a Desktop web version of YouTube. But I do not use Sponsor Block so those creators still get to show me their ExpressVPN ads or whatever the flavor is today. Also I use Patreon.
I hate being the poo pooer who says that subtitles are available via the API and wish the tool went that route.
I'm all for stuff being archivable with tools like youtube-dl, but I much prefer to see tools like this use the API despite its quotas because it goes beyond archiving a copy for reference. Tools that (ab)use scraping only justify anti-scraping efforts that journalists and the like use and escalate that arms race. I think one could still scrape a channel or two per day within API usage limits [1]--50 units per list, 200 units per download; quota 10,000 units per day.
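The back-of-the-envelope arithmetic behind that estimate, as a sketch. The unit costs and the daily quota are the figures quoted in the comment above and have not been checked against the current API documentation; the assumption of one list call plus one download per video is mine:

```python
# Figures as quoted in the comment above (not verified against current API docs).
DAILY_QUOTA = 10_000
LIST_COST = 50        # units per captions list call
DOWNLOAD_COST = 200   # units per caption download


def videos_per_day(list_calls_per_video=1, downloads_per_video=1, quota=DAILY_QUOTA):
    """How many videos' captions fit into one day's quota under these assumptions."""
    cost_per_video = (
        list_calls_per_video * LIST_COST + downloads_per_video * DOWNLOAD_COST
    )
    return quota // cost_per_video


if __name__ == "__main__":
    print(videos_per_day())               # 10,000 // 250
    print(videos_per_day(quota=100_000))  # with a tenfold quota increase
```

At 250 units per video that works out to 40 videos per day on the default quota, so "a channel or two per day" only holds for small channels.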
agree that using the API is likely the nicer route. you can also apply for a quota increase, I recently applied for youtube API quota increase to 100,000 units and it was approved for my app (https://cmdcolin.github.io/ytshuffle/) I was concerned they wouldn't like that the app downloads so much data but it was approved without much question, they just wanted terms of service prominently displayed to end users
I had made something like this for my own use, but it was way more complicated. It took a user account, downloaded all the liked videos, ran it through some model and vectorized it, and then you could use natural language search to describe a scene in your history of liked videos and it would show you the timestamp and the thumbnail (with the link to start watching from the timestamp).
I ended up taking it offline because I didn't use it much, it was expensive and I couldn't see a path to monetization.
This solution being posted however, is really elegant, in particular because it is very resource-efficient.
Apropos of the post: sometime back I had wondered if it would be possible to search videos for content by text keywords, but not to match text occurring in the video title, chapter titles or comments. Instead, if an app or library could somehow search for the given keywords by matching them with the spoken words in the video. That would be a potentially cool and useful feature.
I realize this may be technically impossible [1] or very difficult, but thought of mentioning the idea here.
[1] On further thought, speech recognition (as seen in mobile phones at least) has progressed to quite a good level (speaking as a layman for this part), so maybe the idea is not wholly infeasible. If an app or lib could somehow "internally" play the video, and speech-recognize the spoken words into text, then the problem would reduce to normal text pattern matching.
Putting the idea out here to see what devs think of it.
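Assuming the speech has already been turned into timestamped lines of text (by a speech recognizer, or by YouTube's own captions), the remaining matching step might look like the minimal Python sketch below. `Segment` and `find_keyword` are hypothetical names, and the transcript and video ID are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: int  # offset of the spoken line from the start of the video
    text: str           # transcribed words for that line


def find_keyword(video_id, segments, keyword):
    """Return (timestamped URL, text) pairs for segments containing keyword."""
    needle = keyword.lower()
    hits = []
    for seg in segments:
        # Case-insensitive substring match on the transcribed text.
        if needle in seg.text.lower():
            url = f"https://www.youtube.com/watch?v={video_id}&t={seg.start_seconds}s"
            hits.append((url, seg.text))
    return hits


# Made-up transcript for a placeholder video ID.
segments = [
    Segment(0, "welcome back to the channel"),
    Segment(42, "today we look at full text search in SQLite"),
    Segment(97, "and why Full Text indexes are fast"),
]

for url, text in find_keyword("VIDEO_ID", segments, "full text"):
    print(url, "-", text)
```

A plain substring scan like this is fine for one video; for a whole channel, a full-text index (as the posted tool builds in SQLite) avoids rescanning every line on each query.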
That permitted searches for terms and channels (though not subtitle text within a channel, AFAIK), for music specifically, and compilation of either temporary or saved playlists, with the option to play through a full selection of videos. It also offered either full-video or audio-only playback.
This reminded me that Google Podcast search function clearly uses some kind of transcription index for its search. However, I wish they included at least partial matches as part of the search results (similar to how Google Books search works).
>yt-fts is a simple python script that uses yt-dlp to scrape all of a youtube channels subtitles and load them into an sqlite database that is searchable from the command line
"Scrape all of YouTube subtitles"
So if that data is not good, then how is this useful? The captions on YouTube have been pretty bad. Are these the same thing? I'll test it out.
Another avenue for this is to use Tube Archivist, which takes the approach of locally mirroring videos and serving them up in a web interface, complete with comments, subtitles, and an index containing all the above. Definitely overkill if you just want to do a couple text searches though.
Here's a (kinda) ELI5: you would use a language model to create "embeddings" of the text, which you can think of as a set of numbers representing the "meaning" of a set of characters.
These numbers can be plotted as points in a space, and embeddings of things with similar meanings are plotted close to each other. So things like "exam preparation" would have embeddings close to things like "top study tips".
Say you have created embeddings for a large corpus of text (in this case all youtube captions) once. If you create embeddings for a user query, you can search for embeddings close to it, and these will be "semantically" similar to the query.
The advantage is that unlike traditional full-text search, the user doesn't need a query that includes words present in the text.
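As a toy illustration of that lookup, the Python sketch below ranks a few captions by cosine similarity to a query. The three-dimensional vectors are written by hand purely for illustration; in a real system they would come from an embedding model and have hundreds of dimensions. The names `cosine_similarity` and `search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written stand-ins for embeddings (assumption: not produced by any model).
corpus = {
    "top study tips": [0.9, 0.1, 0.0],
    "how to bake sourdough": [0.0, 0.2, 0.9],
    "fixing a flat bike tire": [0.1, 0.9, 0.1],
}

# Pretend this is the embedding of the query "exam preparation".
query_embedding = [0.8, 0.2, 0.1]


def search(query_vec, corpus, top_k=2):
    """Return the top_k corpus texts ordered by similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


print(search(query_embedding, corpus))
```

With these made-up vectors, "top study tips" ranks first even though it shares no words with "exam preparation", which is the point of the approach.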
Yes, in theory, although they are pretty expensive. I am doing something like this at work, as I wanted to unlock the wealth of information we have in our tutorials, webinars, etc.
python3 yt_fts.py download https://www.youtube.com/@PerspectiveArts/videos --channel-id UCUCN8V_pO0xOFKLL4XG1tshnw
Downloading channel
Saving vtt files to /tmp/user/1000/tmpm4xoskpo
Traceback (most recent call last):
File "/home/user/src/yt-fts/yt_fts.py", line 273, in <module>
cli()
File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/src/yt-fts/.env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/src/yt-fts/yt_fts.py", line 31, in download
download_channel(channel_id)
File "/home/user/src/yt-fts/yt_fts.py", line 82, in download_channel
channel_name = get_channel_name(channel_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/src/yt-fts/yt_fts.py", line 191, in get_channel_name
data = json.loads(script.string)
^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'string'
Loved the idea, yet to try it out. I would definitely download all videos of Lex and run some text analyzer/text cloud generator to learn about things being discussed
For what it is worth, we work on a tool[0] that indexes all local videos and images and then lets you query them using just natural language. It is based on CLIP, which was trained on image-text pairs, but it seems to work great for videos after applying some naive heuristics.
Can you give some additional insight as to what this enables? Maybe some additional links for research? How does one store/format data on such a database?