

Web image size prediction for efficient focused image crawling - boyter
http://blog.commoncrawl.org/2015/08/web-image-size-prediction-for-efficient-focused-image-crawling/

======
kitcar
Can't you just look at the HTTP response to get the image size in KB, and
ignore images below a certain size? Or alternatively, just grab the first
few bytes of the image, enough to extract the format header (which I believe
commonly carries dimension/size info)?

~~~
elektronaut
Yep, that's a viable solution. GIF and PNG include the dimensions in the first
few bytes. JPEG is a little more complex, but you don't need to download much
of the file to get dimensions.

I've used fastimage with success:
[https://github.com/sdsykes/fastimage](https://github.com/sdsykes/fastimage)
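
As a rough illustration of what fastimage does under the hood, here is a minimal sketch (function names are mine) that reads width and height straight out of the first bytes of a PNG or GIF, using only Python's standard `struct` module:

~~~
import struct

def png_dimensions(header: bytes):
    # PNG: 8-byte signature, then the IHDR chunk, whose data begins
    # with width and height as 4-byte big-endian integers.
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        return None
    return struct.unpack(">II", header[16:24])

def gif_dimensions(header: bytes):
    # GIF: 6-byte signature ("GIF87a" or "GIF89a"), then width and
    # height as little-endian 16-bit integers.
    if header[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    return struct.unpack("<HH", header[6:10])
~~~

So 24 bytes are enough for PNG and 10 for GIF; JPEG needs a scan through markers to find the SOF segment, which is why it's "a little more complex".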

~~~
Gigablah
For nodejs there's [https://github.com/netroy/image-
size](https://github.com/netroy/image-size)

From testing with 1800+ image URLs, you can almost certainly get the
dimensions within the first 64kb.
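
Grabbing just the first 64kb can be done with an HTTP Range request; a sketch using Python's standard library (function names are mine, and a real crawler would also check for a 206 response, since servers that ignore Range send the whole body):

~~~
import urllib.request

def build_range_request(url: str, nbytes: int = 64 * 1024) -> urllib.request.Request:
    # Ask the server for only the first nbytes of the body.
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})

def fetch_prefix(url: str, nbytes: int = 64 * 1024) -> bytes:
    # Cap the read ourselves in case the server ignores the Range header.
    with urllib.request.urlopen(build_range_request(url, nbytes)) as resp:
        return resp.read(nbytes)
~~~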

------
onion2k
This is a fascinating idea. It'd be interesting to see a graph mapping the age
of a page to the size of the linked images. I would bet there's a very clear
correlation: full-page background images as a design trend really blew up a
couple of years ago, so prior to that a high-resolution image is much more
likely to be the target of an anchor from a thumbnail ('click here to view')
or just a 'click to view' link. These days you'd need to drill down into the
CSS.

