
Broken crawler behavior with my binary protofeed file - protomyth
http://rachelbythebay.com/w/2013/04/14/protofeed/
======
anonymouz
Can one really blame the crawler for trying to parse URLs out of an allegedly
text/plain document?

I'd argue that serving binary data with a MIME type starting with "text/" was
the real mistake. One could serve it as application/octet-stream or
application/x-protobuf or something like that.
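For illustration, here is a minimal sketch of how a server might choose between a text and a binary type before responding. The byte-sniffing heuristic and both helper names are my own invention for this sketch, not anything from the article or the crawler in question:

```python
def looks_binary(data: bytes, sample: int = 512) -> bool:
    """Crude sniff (similar in spirit to what file/grep do): NUL bytes
    or a high ratio of non-printable bytes suggests binary content."""
    chunk = data[:sample]
    if b"\x00" in chunk:
        return True
    text_bytes = bytes(range(0x20, 0x7F)) + b"\t\n\r"
    nontext = sum(b not in text_bytes for b in chunk)
    return len(chunk) > 0 and nontext / len(chunk) > 0.30

def content_type_for(data: bytes) -> str:
    # Hypothetical helper: prefer an explicit binary type over text/plain,
    # so crawlers have no excuse to scrape the body for URLs.
    return "application/x-protobuf" if looks_binary(data) else "text/plain"
```

A serialized protobuf message almost always contains NUL or other non-printable bytes, so a sniff like this would steer it away from text/plain.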

~~~
protomyth
My question is why a crawler would think there are URLs in a text/plain
document. More to the point, why would it parse it for any structured
information?

~~~
anonymouz
Why not? URLs are pervasive these days, and loads of plain text files contain
them. Maybe they are indexing plain text files, and while they are already
there, why not apply some heuristics to find URLs inside them?

Of course the result won't be perfect, but it's probably better than nothing.
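The kind of heuristic being described might look something like this. The regex and the punctuation-trimming rule are illustrative guesses, not any real crawler's documented logic:

```python
import re

# Permissive scan for http(s) URLs in a plain-text body.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_urls(text: str) -> list[str]:
    urls = []
    for match in URL_RE.finditer(text):
        # Trim trailing punctuation that usually belongs to the
        # surrounding sentence, not the URL itself.
        urls.append(match.group().rstrip(".,);"))
    return urls
```

Run against text "decoded" from a binary protobuf body, a pattern like this will happily pull out garbage URLs, which is presumably exactly what the article observed.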

~~~
protomyth
At this point, I would actually think nothing would be better. Any attempt to
parse plain text for structured information or URLs just adds noise to search
results. I can see using it as text for search terms, but not much else.

~~~
anonymouz
The crawler still hits the actual "URL" he found, so that provides sanity
checking. It's just another way to discover (potential) new URLs.
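That discover-then-verify flow could be sketched as follows; the helper is hypothetical (the flow is my guess at what such a crawler does, using only the standard urllib):

```python
from urllib.request import urlopen
from urllib.error import URLError

def verify_candidate(url: str, timeout: float = 5.0) -> bool:
    """A heuristically-extracted URL only counts as discovered if an
    actual fetch succeeds -- the sanity check described above."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, ValueError):
        # Malformed candidates and dead hosts are silently dropped.
        return False
```

Garbage strings pulled out of binary data would mostly fail this check, though at the cost of pointless requests to whatever hosts the garbage happens to resemble.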

~~~
protomyth
I see your point, and the sanity check is good. My problem with the whole thing
is that interpreting text is fraught with problems, and I just don't see the
value from a search perspective. Something served as text/plain is either a
server misconfiguration or just a plain text file. Either way, it seems like a
low-value thing to add to a search index.

