
Boxfish: Realtime Index of Every Word Spoken on TV - speek
http://techcrunch.com/2013/05/10/boxfish-android-api/
======
amitparikh
Closed captioning has always seemed to me to be a notoriously bad data set due
to misspellings and misphrasing. Has anyone tried to do (a better) speech to
text of a cable news channel, for instance?

Doing frequency and sentiment analysis on this dataset would be pretty
interesting.

~~~
mikebennett
Boxfish scientist & dev here.

Closed captioning does have a lot of noise in it, but we've done a lot of work
to tidy that up. We also have the benefit of capturing so much data that the
noise doesn't matter as much.

In real-time our systems extract and generate lots of information from the
closed captions. Our NLP system identifies entities (e.g. whitehouse, chris
brown, amanda berry), does frequency counts and we have some very large graphs
of entity co-occurrences (leads to our statistical learning), e.g. Rihanna is
commonly associated with Chris Brown.
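
A minimal sketch of the co-occurrence counting described above, in Python. It is an illustration and not Boxfish's actual code: the `cooccurrence_counts` helper, the idea of one entity list per caption segment, and the sample data are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(segments):
    """Count how often each unordered pair of entities shares a segment."""
    counts = Counter()
    for entities in segments:
        # Deduplicate and sort so every pair has one canonical ordering.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical caption segments, already reduced to entity lists.
segments = [
    ["rihanna", "chris brown"],
    ["chris brown", "rihanna", "grammys"],
    ["white house", "obama"],
]
counts = cooccurrence_counts(segments)
```

The pair counts could then serve as edge weights in an entity graph.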

We do a bunch more analysis on our graphs, including Latent Semantic Indexing
(LSI), which helps drill down into quantifying the relationships between
entities. Related to that we generate TF-IDF scores for all identified
entities, which give a sense of how "important" an entity is.
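
For readers unfamiliar with the scoring, here is a small sketch of one common TF-IDF variant (term frequency times log of inverse document frequency). The `tfidf` function, the choice of weighting, and the sample "documents" are assumptions for illustration; the exact formula used in production may differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {entity: score} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Document frequency: count each entity once per document.
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


# Hypothetical programs, each reduced to the entities mentioned in it.
docs = [
    ["obama", "white house", "obama"],
    ["obama", "rihanna"],
    ["obama", "chris brown", "rihanna"],
]
scores = tfidf(docs)
```

An entity that appears in every document (here "obama") scores zero, while one that is concentrated in a single document scores higher.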

By combining our large scale entity graphs (both Frequency and LSI) with
streams of closed captions we also do real-time topic extraction at multiple
time scales, e.g. what is the topic of conversation on CNN for the last minute
of conversation, for the last 5 minutes, for the whole program, etc.
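
As a rough sketch of the multiple-time-scale idea, the snippet below keeps timestamped entity mentions and ranks them by raw count over windows of different lengths. The `WindowedTopics` class and the sample mentions are hypothetical, timestamps are assumed to arrive in order, and plain counting stands in for the graph-based scoring described above.

```python
from collections import Counter, deque


class WindowedTopics:
    """Track entity mentions and report the most frequent ones per window."""

    def __init__(self, max_window_seconds):
        self.max_window = max_window_seconds
        self.mentions = deque()  # (timestamp, entity), oldest first

    def add(self, timestamp, entity):
        self.mentions.append((timestamp, entity))
        # Drop mentions older than the largest window we will ever query.
        while self.mentions and self.mentions[0][0] <= timestamp - self.max_window:
            self.mentions.popleft()

    def top(self, now, window_seconds, n=3):
        cutoff = now - window_seconds
        counts = Counter(e for t, e in self.mentions if t > cutoff)
        return counts.most_common(n)


topics = WindowedTopics(max_window_seconds=300)
for t, entity in [(0, "budget"), (30, "budget"), (100, "budget"),
                  (250, "syria"), (280, "syria"), (290, "budget")]:
    topics.add(t, entity)

last_minute = topics.top(now=300, window_seconds=60)
last_five_minutes = topics.top(now=300, window_seconds=300)
```

With this sample data the top topic over the last minute differs from the top topic over the last five minutes, which is the point of looking at several windows.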

Feel free to ask me more.

~~~
apaprocki
What about STT and/or machine translation for foreign channels? It is really
frustrating when foreign stations don't even have CC if you're trying to learn
the language and need more spoken input.

~~~
mikebennett
While we could use STT (some of our team have backgrounds in it) we sought to
use the cleanest existing signal, i.e. closed captions.

Part of the motivation for taking a statistical NLP approach is that it gives
us more flexibility for processing foreign stations / languages (we don't yet
do that).

I wonder could you time and geo-shift closed captions, i.e. show closed
captions in two languages at once on the same TV program? That could make an
interesting language learning tool and an interesting training set for machine
translation.

------
philsnow
One of the original incarnations of Google Video was something somewhat
similar to this (an index of closed-captioning data from a lot of different tv
streams). What they chose to do with it was different though: they allowed you
to search closed-captioned content and it would show you a few thumbnails and
the time of day when those words were said on air.

This memory is kind of hazy, ISTR it's from 2005 or so.

~~~
sc00ter
If we're waxing lyrical, there was a research paper out of Ireland from a
small telecoms research outfit, circa 1996 describing pretty much this, but
using teletext subtitles.

In addition to capturing and indexing the subtitles, it also captured the
video, and so allowed the captions to be used as an index to the video.

I don't doubt someone can come up with an even earlier incarnation!

~~~
philsnow
Everything old is new again :)

------
tibbon
I personally think there's a great deal that can be done with this data.

A few years ago, someone documented how to use an Arduino + Video Experimenter
Shield to easily log closed captioning data
([http://blog.makezine.com/2011/08/16/enough-already-the-ardui...](http://blog.makezine.com/2011/08/16/enough-already-the-arduino-solution-to-overexposed-celebs/)).
Never got around to messing with it, but I can imagine 100 interesting things
to do with that data.

Very cool company. I'm glad someone's doing this.

------
nutmeg
Seems like scraping all closed captioning would be very valuable data indeed.
Is there anyone else doing something like this that provides an API or data
feed?

~~~
dangrossman
Copyright law would make such a feed of closed caption transcripts illegal to
distribute.

~~~
nutmeg
That makes sense. Would it be legal to capture the data and present it similar
to a search engine? I'm guessing there is some sort of precedent for that sort
of thing?

------
quan
I'd love to see this used on Fox News to fact check everything they say in
real time.

------
mrilhan
I think the potential here is immense.

Boxfish, Twitter, YouTube, Siri, and now with Ray Kurzweil @ Google...
thinkers are converging on doing to every other form of content what Google
did for structured documents.

The NLP trend is going to be amusing to watch at least (Siri, Summly), and
whether its time has come in the next 5 years or not I'm not certain. But I
know Ray Kurzweil knows this technology is inevitable.

--

As for Boxfish, I think this is a good example of a neatly executed, well
funded startup with experienced founders and a solid space. No drama, no demo
day, no immediate fires to put out, cool $3m in the bank, Deutsche Telekom AG
subsidiary negotiating their deals for them, and "Yahoo just bought a kids
startup for 17m" - the topic is hotter than others. This is the type of
startup I for one daydream of having stock of or working at. Has high
potential to be worth $mmms or $bn in the future - you know, that all depends
and what not. But the makings are clearly there. Excellent work guys!
Congratulations.

------
thereallurch
Won't this just amplify existing trends instead of exposing new ones?

~~~
hayksaakian
Precisely. Prerecorded (the majority of) TV has never created trends.

~~~
ChrisSalij
(Boxfish dev here)

The majority of TV is pre-recorded or repeated content (think of all the
repeats of the Simpsons, Real Housewives of X etc), but we know whether a show
is recorded or live and the broad categories that a given show falls into. We
also break up our trending calculations into different groups by category
(News, Sports etc) and treat the data differently (as seen in the apps).

Also, bear in mind that while a show might be prerecorded it still may show
useful data. For instance The Colbert Report and The O'Reilly Factor are
usually recorded shows, however they can talk about drastically different
things from show to show, and even between segments in shows.

I grant that useful trends are more difficult to extract from sitcoms and
other things like that, but just because a show isn't live doesn't mean that
no useful trending information can be extracted.

We look at trending data over various periods of time, from minute length to
longer so we can gather sentence level, show level, series level and even
channel level topics.

~~~
MisterBastahrd
You might be able to use indexing to show potential bias. If you have access
to the data, how often do the major news networks use the word "Obama" versus
"President" over the course of the day? Say... Fox News, CNN, CNBC, MSNBC.

~~~
ChrisSalij
We actually had a very interesting page up for this in the run up to the 2012
election, comparing and graphing mentions of Obama and Romney on each network
along with sentiment analysis of what they said about them.
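
To give a feel for the counting half of that comparison, here is a small Python sketch that tallies mentions of chosen terms per network. It is not the code behind the page mentioned above: the `term_counts_by_network` helper, the placeholder network names, and the sample captions are assumptions, and the sentiment analysis is left out entirely.

```python
import re
from collections import Counter, defaultdict


def term_counts_by_network(captions, terms):
    """Count case-insensitive whole-word mentions of each term per network."""
    wanted = {t.lower() for t in terms}
    counts = defaultdict(Counter)
    for network, text in captions:
        # Splitting on letters only means "Romney's" still counts as "romney".
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in wanted:
                counts[network][word] += 1
    return counts


# Hypothetical caption snippets; network names are placeholders.
captions = [
    ("Network A", "Obama spoke in Ohio. Romney replied to Obama."),
    ("Network B", "Romney's campaign stopped in Florida."),
]
counts = term_counts_by_network(captions, ["Obama", "Romney"])
```

Dividing each count by the network's total word count would make networks with different amounts of airtime comparable.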

The page has since been taken down but here are two of our blog posts about
the analysis.

* [http://blog.boxfish.com/post/30997338037/obama-vs-romney-who...](http://blog.boxfish.com/post/30997338037/obama-vs-romney-whos-beating-who-on-tv)
* [http://blog.boxfish.com/post/32880728776/tvs-thoughts-on-our...](http://blog.boxfish.com/post/32880728776/tvs-thoughts-on-our-presidential-candidates)

------
skram
Here's one endpoint that seems to work and not require an API key:
<http://api.boxfish.com/v2/v3/trending/topics/?fields=count>

------
RK
Sounds similar to SnapStream. "Monitor everything said on TV"

<http://snapstream.com>

------
uptown
Reminds me a little of Bluefin Labs (acquired by Twitter). Just hook up this
data with a sentiment-engine of Twitter and you can come up with some
interesting correlations to how people react to television.

<https://bluefinlabs.com/>

------
krazykringle
Also: <http://archive.org/details/tv>

------
slifty
For those interested in a real time API of caption streams you should be sure
to check out Opened Captions: <http://openedcaptions.com:3000/>

Currently only for C-SPAN but that may change!

------
Finster
With the new Federal regulations stipulating that anything that originates on
TV must be captioned when streamed over the internet, Boxfish will be able to
get a fairly comprehensive picture of what's going on.

------
bravura
Is this only for US television? Or is it global?

What is the reach? I know several people who would be interested in this for
smaller countries.

I couldn't find this information on the homepage.

~~~
ChrisSalij
We currently only have US TV channels. We've experimented with others but
given the limited resources of a startup we haven't had the time to expand
out. We're definitely interested in it though.

------
e3pi
`HN DDOS' again? Still spinning after five minutes on:

<http://boxfish.com/#!search/Klinger>

------
deepinsand
Do they have a massive number of cable/satellite subscriptions? I've always
wondered how they and IntoNow get their signals.

~~~
kburkitt
[Boxfish:] We have lots of set top boxes :)

