
Once Nearly Invisible To Search Engines, Flash Files Can Now Be Found And Indexed - nickb
http://www.techcrunch.com/2008/06/30/once-nearly-invisible-to-search-engines-flash-files-can-now-be-found-and-indexed/
======
dmix
_I've been waiting about 10 years for this!_

Not really, I would never make a site in flash. It's not 1998 anymore and I
don't plan on selling cars or clothes, so this is a little late.

Good for what's left online, but I really hope this doesn't cause people to
revive all-out flash sites.

~~~
wallflower
Indexing of SWF/FLA files opens up some interesting searches of the now-not-
so-deep web:

Flash + game (mostly silly Flash games):
[http://www.google.com/search?hl=en&q=filetype%3Aswf+OR+f...](http://www.google.com/search?hl=en&q=filetype%3Aswf+OR+filetype%3Afla+%2B+game&btnG=Search)

Flash + algorithm (finds some interesting CS and _non-CS_ presentations):
[http://www.google.com/search?hl=en&q=filetype%3Aswf+OR+f...](http://www.google.com/search?hl=en&q=filetype%3Aswf+OR+filetype%3Afla+%2B+algorithm&btnG=Search)

> Not really, I would never make a site in flash.

There are a lot of photography portfolios online that use Flash (presumably to
protect the photos and for presentation - e.g. first impressions for potential
wedding clients). It would be nice if Yahoo/Flickr or Google image search
could auto-tag (caption proximity) and index their photos. Though not specific
to this announcement, it would be nice to have Flash-wrapped video
screencasts/vlogs indexed by their audio content.

------
Tichy
So they only provide that technology to Google and Yahoo!? That really sucks,
should be available for everyone.

~~~
marcus
Considering Adobe's track record with open standards I'm not really surprised.

Their purpose isn't to foster innovation, it is to remove an obstacle from the
decision path of people buying Adobe's products.

~~~
bdotdub
I completely agree. Their reason for releasing this isn't to allow people to
look into their SWFs, but for the two largest search engines to be able to
index them.

This is an attempt to bump up the number of flash sites because now "not
indexable by Google" is not an issue.

------
dkasper
Interesting that the Techcrunch article hit the front page here and the Google
blog article
[http://googleblog.blogspot.com/2008/06/google-learns-to-craw...](http://googleblog.blogspot.com/2008/06/google-learns-to-crawl-flash.html)
hit the front page of Reddit.

------
apgwoz
I never understood why they didn't just use `/usr/bin/strings` to pull out
ASCII, and keep the pronounceable text. This of course wouldn't help with data
that's being pulled in, such as XML and things, but it would have been a
start...
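
For reference, a minimal sketch of what a `strings`-style pass does: scan the raw bytes and keep runs of printable ASCII above a minimum length. The function name `ascii_runs` is made up for illustration, the 4-character threshold mirrors the GNU `strings` default, and the "pronounceable" filter is not attempted here.

```python
import re

def ascii_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_len bytes long."""
    # 0x20-0x7e covers space through tilde, i.e. the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical sample bytes, not taken from a real SWF.
sample = b"\x00\x01Hello Flash\x00\xff\x02ab\x00http://example.com/a.swf\x00"
print(ascii_runs(sample))
# ['Hello Flash', 'http://example.com/a.swf']
```

This only helps when the text sits in the file as plain bytes, which is the open question for SWF.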

~~~
jm4
That doesn't work even in the simplest cases. I know because it's the first
thing I tried when I needed to find child Flash files and external references
in Flash files. This is actually a tricky problem to solve. It's easy enough
to find external references if the entire URL is in a string or assigned to a
variable. These go in the constant pool and it's trivial to pull everything
from there and try to identify URLs. When URLs are built up using
concatenation they are much more difficult to find. If/else logic is even more
difficult. At this point you need to implement your own little Flash player.
The problem here is that it's not enough to only implement the few
instructions you're interested in. The Flash Player is a stack-based VM and if
you don't implement all the instructions your stack will get hosed.
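
To make the concatenation case concrete, here is a rough sketch of a tiny stack interpreter that understands just three actions, using the opcode values as I recall them from the SWF spec (Push 0x96, StringAdd 0x21, GetURL2 0x9A). The names `find_urls` and `push_str` and the sample bytes are made up for illustration. It deliberately raises on anything it doesn't know, because skipping an unknown instruction would leave the stack in an unknown state.

```python
import struct

ACTION_END = 0x00
ACTION_STRING_ADD = 0x21  # pops A, pops B, pushes B + A
ACTION_PUSH = 0x96        # payload is a list of typed values; type 0 = C string
ACTION_GET_URL2 = 0x9A    # pops the target, then pops the URL

def find_urls(actions: bytes) -> list[str]:
    """Run a block of action records and collect URLs passed to GetURL2."""
    stack: list[str] = []
    urls: list[str] = []
    pos = 0
    while pos < len(actions):
        code = actions[pos]
        pos += 1
        payload = b""
        if code >= 0x80:  # long actions carry a 16-bit little-endian length
            (length,) = struct.unpack_from("<H", actions, pos)
            pos += 2
            payload = actions[pos:pos + length]
            pos += length
        if code == ACTION_END:
            break
        elif code == ACTION_PUSH:
            i = 0
            while i < len(payload):
                kind = payload[i]
                i += 1
                if kind != 0:  # only null-terminated strings in this sketch
                    raise NotImplementedError(f"push type {kind}")
                end = payload.index(0, i)
                stack.append(payload[i:end].decode("latin-1"))
                i = end + 1
        elif code == ACTION_STRING_ADD:
            a = stack.pop()
            b = stack.pop()
            stack.append(b + a)
        elif code == ACTION_GET_URL2:
            stack.pop()               # target window, ignored here
            urls.append(stack.pop())  # the fully built URL
        else:
            # Unknown stack effect: continuing would corrupt the stack.
            raise NotImplementedError(hex(code))
    return urls

def push_str(s: str) -> bytes:
    """Encode a Push action holding one string (helper for the demo only)."""
    body = b"\x00" + s.encode("latin-1") + b"\x00"
    return bytes([ACTION_PUSH]) + struct.pack("<H", len(body)) + body

program = (
    push_str("http://example.com/")
    + push_str("child.swf")
    + bytes([ACTION_STRING_ADD])
    + push_str("_blank")
    + bytes([ACTION_GET_URL2]) + struct.pack("<H", 1) + b"\x00"
    + bytes([ACTION_END])
)
print(find_urls(program))
# ['http://example.com/child.swf']
```

A real file uses many more actions (variables, branches, function calls), which is where the work goes.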

The Flash files that pull in their content from XML are easy once you find the
external references. You simply crawl the references like you would any other
page. Text embedded in Flash is another story...

If you're curious to see how this works Describe SWF is a great little tool
and there's some good documentation on SWF at the Flasm site.

<http://www.flagstonesoftware.com/describe/index.html>
<http://flasm.sourceforge.net/>

~~~
apgwoz
Yeah, I realize it's not a trivial problem to solve, but I can't imagine that
many of the swf files on the web today are anything more than some keyframes
and a bit of embedded text. I would think in this case strings would be
suitable enough.

Also, since you can't really specify which spot in the swf to load (assuming
you're going to display the swf in its original context), I can't imagine why
it would matter whether you took conditionals into account. I mean, obviously
you have to interpret everything, but aren't you really just after all the
blobs of text so you can answer, for all intents and purposes, "does this
document contain this text?"

Thanks for the links!

~~~
jm4
Strings does not work at all. It spits out nothing but junk. There's not a
single chunk of legible content.
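
One likely contributor, which is my addition and not something established above: SWF files whose signature is `CWS` instead of `FWS` are zlib-compressed after the first 8 bytes, so a raw byte scan sees compressed data. A small sketch of undoing that before any further parsing; `swf_body` is a made-up name and the sample file is synthetic.

```python
import struct
import zlib

def swf_body(data: bytes) -> bytes:
    """Return everything after the 8-byte SWF header, decompressed if needed."""
    signature = data[:3]
    if signature == b"FWS":   # stored uncompressed
        return data[8:]
    if signature == b"CWS":   # zlib-compressed from byte 8 onward
        return zlib.decompress(data[8:])
    raise ValueError(f"not a recognised SWF signature: {signature!r}")

# Synthetic example: signature, version byte, 32-bit file length, then the body.
body = b"\x00\x00\x00\x00hello constant pool\x00"
header_tail = bytes([8]) + struct.pack("<I", 8 + len(body))
compressed = b"CWS" + header_tail + zlib.compress(body)
plain = b"FWS" + header_tail + body

print(swf_body(compressed) == swf_body(plain) == body)
# True
```

Even after decompression, displayed text is still stored as glyph references, so this alone does not make `strings` useful.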

If you want something quick and easy you can just look for instances of
'getURL' objects. These will always have the index of a string in the constant
pool as an argument. From there you simply grab the value at that index and
you have the URL.

Something even more primitive would only look for interesting strings in the
constant pool. The constant pool is found at the top of an action block and
you don't need to implement any instructions to get to it. By the way, the
constant pool is the only place where you will find any "blobs of text".
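
A rough sketch of that primitive approach, assuming you already have the raw bytes of one action block (getting there means walking the SWF tags first, which is not shown). It relies on the action record layout as I recall it from the SWF spec: action codes of 0x80 and above are followed by a 16-bit length, and the constant pool action (0x88) holds a 16-bit count followed by null-terminated strings. `constant_pool_strings` and `looks_like_reference` are made-up names, and the URL heuristic is just a guess at what counts as "interesting".

```python
import struct

ACTION_END = 0x00
ACTION_CONSTANT_POOL = 0x88

def constant_pool_strings(actions: bytes) -> list[str]:
    """Return the strings of the first constant pool found in an action block."""
    pos = 0
    while pos < len(actions):
        code = actions[pos]
        pos += 1
        if code == ACTION_END:
            break
        if code < 0x80:
            continue  # short action: no payload to skip
        (length,) = struct.unpack_from("<H", actions, pos)
        pos += 2
        payload = actions[pos:pos + length]
        pos += length
        if code == ACTION_CONSTANT_POOL:
            (count,) = struct.unpack_from("<H", payload, 0)
            raw = payload[2:].split(b"\x00")[:count]
            return [s.decode("utf-8", errors="replace") for s in raw]
    return []

def looks_like_reference(s: str) -> bool:
    """Crude filter for strings that might be external references."""
    lower = s.lower()
    return lower.startswith(("http://", "https://")) or lower.endswith((".swf", ".xml"))

# Synthetic action block: one constant pool followed by the end marker.
pool = ["title", "http://example.com/data.xml", "menu.swf"]
body = struct.pack("<H", len(pool)) + b"".join(s.encode() + b"\x00" for s in pool)
block = bytes([ACTION_CONSTANT_POOL]) + struct.pack("<H", len(body)) + body + bytes([ACTION_END])

print([s for s in constant_pool_strings(block) if looks_like_reference(s)])
# ['http://example.com/data.xml', 'menu.swf']
```

No instruction is executed here; every record is either skipped by its length or read as the pool.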

If a variable or concatenated strings are passed to a getURL function in
ActionScript then it's different. This is a 'getURL2' object. In this case you
get an index to a table that contains variable names. From there you would
have to look up the value of the variable, but you only have that if you've
actually implemented all the instructions and you've been keeping track.

Text is... complicated... It's not just chunks of text or even character
arrays. It's arrays of objects that contain all sorts of information, the most
interesting bit being the particular glyph in a given font that should be
displayed.

You can find a surprising number of external references just by looking in the
constant pool or for instances of getURL without actually implementing a VM,
but anything beyond that is a lot of work.

The JavaDocs for TransformSWF (which Describe SWF is built using) have quite a
bit of useful information too.

Anyway, your original point that there's a simple solution that's at the very
least a good start is still valid because a high number of external references
in Flash can be found with minimal effort. For whatever reason, the major
search engines chose to ignore Flash. Maybe they were waiting for a good
solution instead of a good enough solution, but my guess is no one cares
except Adobe.

[http://www.flagstonesoftware.com/transform/datasheets/index....](http://www.flagstonesoftware.com/transform/datasheets/index.html)

------
kajecounterhack
Hmm this reminds me of another tool I found on YC News just a few weeks ago...
<http://www.flashprobe.com>

They probably used the same methods.

