
Show HN: FluidDATA API – Create your own audio search engine with FluidDATA - mathatoms
https://github.com/BitPlatter/fluiddata_flask_example
======
bmelton
Oh hi!

Despite the green user, I've actually worked with these guys and seen this
project up close and in person, and the amount of data they're harvesting here
is pretty overwhelming.

There are a few other places doing transcription nowadays, but they're _just_
doing that, and this is a bit richer an API for getting more tailoring done
against your source data.

Either way, if you're looking for a way to add audio transcription to your
podcast or vlog, this is a cool service. If you're looking to make that audio
searchable in the fewest steps, this is probably the coolest
service around.

------
Nadya
This is actually really cool! I'm guessing it is English only though? I don't
see examples of any other languages and due to the complexity of word->audio
matching I imagine other languages aren't supported.

I think a better title would be "Audio search engine for Podcasts using
FluidDATA". It had briefly gotten my hopes up that I'd be able to make a
search engine for my music, based just on the title.

~~~
mathatoms
Thanks. It's English only for now. We could support other languages if there is
a large enough demand.

------
chatmasta
Cool project, and a mammoth undertaking in terms of scraping and data
processing. Would you be able to share any details on what your ingestion
infrastructure looks like?

~~~
mathatoms
We were planning on writing up a blog post to go over what our backend looks
like. But essentially we have written a crawler to discover audio on the
internet and a distributed processing framework to download, extract metadata,
and transcribe the audio.

We've iterated through a few storage solutions and have settled on using
GlusterFS+ZFS running on Storinators. So far we have about 350TB of data
indexed in our collection.
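
For anyone trying to picture the download / extract metadata / transcribe flow described above, here is a minimal single-process sketch. It is not FluidDATA's code: every name in it (`AudioItem`, `download`, `extract_metadata`, `transcribe`, `run_pipeline`) is hypothetical, the stage bodies are placeholders, and a real distributed setup would presumably run each stage on separate workers with queues in between.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class AudioItem:
    """One discovered audio URL and whatever the stages have attached to it."""
    url: str
    audio: bytes = b""
    metadata: dict = field(default_factory=dict)
    transcript: str = ""


def download(item: AudioItem) -> AudioItem:
    # Placeholder: a real worker would fetch item.url over HTTP.
    item.audio = b"fake-audio-for:" + item.url.encode()
    return item


def extract_metadata(item: AudioItem) -> AudioItem:
    # Placeholder: a real worker would read tags, duration, codec, etc.
    item.metadata = {"size_bytes": len(item.audio)}
    return item


def transcribe(item: AudioItem) -> AudioItem:
    # Placeholder: a real worker would run a speech recognizer here.
    item.transcript = "(transcript placeholder)"
    return item


# Stages run in order; each takes an item and returns it with more data.
STAGES: List[Callable[[AudioItem], AudioItem]] = [
    download,
    extract_metadata,
    transcribe,
]


def run_pipeline(urls: Iterable[str]) -> List[AudioItem]:
    """Push every URL (as a crawler might emit them) through all stages."""
    results = []
    for url in urls:
        item = AudioItem(url=url)
        for stage in STAGES:
            item = stage(item)
        results.append(item)
    return results
```

Keeping the downloaded bytes on the item mirrors the idea of holding on to the audio so it can be transcribed again later; in this sketch that would just mean calling `transcribe` on a stored item a second time.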

~~~
dandancanfly
That's pretty neat. After you download the audio and process it, do you delete
the data, or store it for safekeeping? 350TB is a healthy chunk of data.

~~~
mathatoms
We have enough storage to hold on to the data. We keep the data around so we
can retranscribe files as we update our language models.

------
shirman
Btw, what is the best way to search through the audio files? For example, if I
know what sound I am searching for?

~~~
dandancanfly
Are you looking for a specific sound, or is it a word in the English language?
These cats can help if it's a word in the English language: I'm not sure if
they can search for specific sounds, although I'm sure it's possible down the
road.

------
Uffizi
I like the ability to skip through the audio stream to your search term
locations within it, pretty cool. Nice work!
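
Jumping to a search term like this generally needs a transcript with per-word timestamps. A rough sketch of the lookup, assuming the transcript is a list of `(word, start_seconds)` pairs; that format and the `find_phrase_offsets` helper are made up for illustration and are not the FluidDATA API:

```python
from typing import List, Tuple

# Assumed transcript format: (word, start time in seconds).
Word = Tuple[str, float]


def find_phrase_offsets(words: List[Word], phrase: str) -> List[float]:
    """Return the start time of every place the phrase occurs in the transcript."""
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [w.lower() for w, _ in words]
    hits = []
    # Slide a window the length of the phrase across the transcript.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i][1])
    return hits
```

A player could then seek to each returned offset, e.g. `find_phrase_offsets(transcript, "the show")`.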

~~~
dandancanfly
Yeah! This looks like a promising way to quickly search many podcasts/audio
streams at once. I wish I had this tool for my college research.

~~~
nibbleshift
If you haven't stumbled across it yet, you can check out the FluidDATA web
search that lets you search millions of podcasts by phrase or mention here
[https://fluiddata.com/](https://fluiddata.com/)

You can register here
[https://accounts.bitplatter.com/?next=https://fluiddata.com/](https://accounts.bitplatter.com/?next=https://fluiddata.com/)
to get 100 free searches per month.

~~~
Uffizi
Thanks, I just messed around with it a bit and enjoyed the discovery. You all
seem to have a ton of content across the web processed, it's very interesting.

