Gmail blows up e-mail marketing by caching all images on Google servers (arstechnica.com)
601 points by shawndumas 960 days ago | 311 comments



> E-mail marketers will no longer be able to get any information from images—they will see a single request from Google, which will then be used to send the image out to all Gmail users. Unless you click on a link, marketers will have no idea the e-mail has been seen.

Absurdly wrong. Marketers already use a unique image URL for each email recipient, and Google has no way to know that all of those point to the same image. So they won't see "a single request from Google"; they'll see one request from Google per successful delivery to an inbox.
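A minimal sketch of the per-recipient URL idea, for readers who have not seen it. The secret key, base URL, and path layout are all hypothetical; real mailing systems differ in the details, and this only illustrates why every recipient's copy can point at a different address.

```python
import hashlib
import hmac

# Hypothetical values, for illustration only.
SECRET_KEY = b"campaign-secret"
BASE_URL = "http://example.com/img"


def tracking_url(campaign_id: str, recipient: str) -> str:
    """Build an image URL that is unique to one recipient of one campaign."""
    message = f"{campaign_id}:{recipient}".encode("utf-8")
    token = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]
    return f"{BASE_URL}/{campaign_id}/{token}.gif"


def build_lookup(campaign_id: str, recipients: list[str]) -> dict[str, str]:
    """Map each unique URL back to the recipient it was sent to."""
    return {tracking_url(campaign_id, r): r for r in recipients}
```

Any request for one of these URLs, whoever makes it, can be mapped back to a single address through the lookup table.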

Now, an open question is if Google will make that request when the email is actually opened, which would allow marketers to determine if and when the email was read by the user, or if Google will make the request as soon as the email is received. The latter would enhance users' privacy at the cost of bandwidth for Google, but early tests indicate that they don't actually do that, waiting for the user to click the email to make the request.

I'd like to add that there's no possibility the Gmail team is stupid enough to not have considered this. They must know full well what they're doing, and marketing this as a privacy enhancement when it's actually detrimental to privacy is willfully dishonest.


> Now, an open question is if Google will make that request when the email is actually opened, which would allow marketers to determine if and when the email was read by the user, or if Google will make the request as soon as the email is received. The latter would enhance users' privacy at the cost of bandwidth for Google, but early tests indicate that they don't actually do that, waiting for the user to click the email to make the request.

I've just tested this.

The image was retrieved when I viewed the email in Gmail.

The tracking info basically comes back as "anonymous" and viewed from an unknown location.

The image was retrieved twice even though I only viewed the email once.

Currently I'd say that seeing the image being viewed is still valuable. I'm sure Google could move to proactively fetching the images in future, destroying even that value.

The image is cached (via browser headers), but isn't aggressively cached (via reverse proxy). Fresh views on different browsers or a later session would still result in a request for the image back to the source server... registering the view again.

My personal view on all of this is that this is a bit Microsoft... you know, convenience and features over security and privacy.

For me, this "feature" leaks data about what I view to 3rd parties where today I block all images and do not leak that info.


Wait, do I understand you correctly that Google fetched the image even before you requested that images be displayed in the email? That seems like a boon to marketers, not a bane.


A dialog popped up when I logged into email, and before I'd seen this thread.

That dialog was much along the lines of "We'll now show you images in your email automatically." with a big "OK" button.

I don't recall whether there was a less prominent "No, thanks" as I was only logging on to reply to one question really quickly.

I suspect this is a UX anti-pattern. I've gone back into my settings and changed it back to "Don't display images".


The alternative to "OK" is "Settings," with a link that just dumps you into the main settings view… leaving you to find the Images section to change it back. Quite an opt-in.


Otoh, this makes it harder for email marketers and generally other people to see if I'm reading their mail.

+1


"Want to see if your users are opening their emails? Pay to display your message in gmail's promotions tab."


Now Google has the data, and they almost defaulted the option on for everyone. That gives them a big boost in the data they receive, which they could use for marketing: "Hey guys, remember you got × image loads? We got your images loaded by × more users. If you want the data, we've got it right here for the right price."


If you go to the trouble to figure out how to disable the "auto-load images" setting, then they're just as blocked as before.

If you accept the new default, it's far easier for them to track you, because Google will kindly be loading their tracking images on your behalf when you open the email, every time.


> I suspect this is a UX anti-pattern. I've gone back into my settings and changed it back

Gmail has done this enough times already for it to become, to me, a UX anti-pattern in itself.

It's weird to look back to the days people were begging for invites and Gmail was the best game in town. These days the few times I'm forced to use Gmail it almost makes me rage.

The "normal users" I know who are forced to cope with Gmail are constantly and consistently confused, asking me where things they used to know have gone, and why Google is ruining Gmail.

Glad I migrated to something better, somewhere where I know for a fact that I am the customer, not the product.


To what did you migrate?


I moved to fastmail which I found wildly refreshing. In the same sense I found Gmail refreshing when it first was launched.

Once you try using fastmail, you will be surprised by how incredibly lightweight it is. When you first get used to that, trying to use Gmail feels like wrestling a horribly bloated pig. The UX has just become terrible.

As for fastmail or other options... I was very keen on being able to host things 100% myself, preferably using FOSS, because of privacy concerns and being in control of my own data.

After trying out various FOSS (and non-FOSS) solutions I decided none of them were good enough/polished enough for my needs.

In that regard fastmail is a compromise for me compared to what I ultimately want, but it's a compromise I'm more than happy with.


According to the announcement, if you've already selected the option to ask before displaying external content, then when this feature rolls out to your account the new setting will also default to asking before displaying external images.


Good info, cheers for that!


The way I read the comment, I think you are correct about it being a boon and not a bane. However, it is because the request was timed at the same time as the message read, not the other way around.

It seems the commenter functionally tested an external URL being requested by Gmail, and found that the request was coincident with when the commenter opened the email. This essentially "leaks" that you've opened the email as the image URL could be unique to your email message.


This is correct. This makes unique open counts much more reliable as it guarantees that the tracking image will be fetched. Before this change, one could never be certain how many people actually opened their email in gmail because some percentage of recipients would block images which naturally blocks loading the tracking image. Typically the tracking image is unique to each individual email sent.


But if Google always fetches the images, then there's no way for the marketers to know if the email was actually opened or not.


Apparently Google does not always fetch the images.

But even if they did, this is still more information leakage than the old default (don't load images).

Spammers who email via botnets and the like, with false return addresses, don't get bouncebacks to clean their lists.

But if you (or Google, on your behalf) give them a hand by reliably loading their tracking image, that flags your email as a valid one.

If you weren't actually reading the email, that's still a false positive I don't think you'd benefit by giving.


An alternative could be to let the user decide if Gmail should prefetch images from a sender or not.

Email from familiar senders would have images prefetched (thus avoiding leaks of user data).

And DDOSing concerns would be reduced because those emails would not be from a familiar source.


There is no advantage of prefetching images from familiar senders. It's not about faster image loading.

The ideal thing would be to just prefetch all the images sent to existing and non-existent accounts. This way there is absolutely no way for a spammer to tell whether an address exists or not.


> There is no advantage of prefetching images from familiar senders.

The advantages I had in mind were:

1. No leak of user IP address, cookies, etc

2. No leak of timing information (when user opened the email).

It will however leak that the email address is valid, which might be a fine compromise with a selected subset of senders.


Sorry, I wasn't precise. I was responding to the suggestion to only proxy for familiar senders. But, assuming that you can correctly identify who is familiar and who's not (there is scam detection as well), the benefit is minimal.

There is a bigger benefit to proxying for non-familiar senders.

Google could avoid leaking whether an email address is valid by simply prefetching images even when email is sent to non-existent accounts.


Sounds like a great way to verify an email address to me. Send an email with a tracking image and if you get a hit, presuming they pre-fetch, you know it's a valid email.


> For me, this "feature" leaks data about what I view to 3rd parties where today I block all images and do not leak that info.

But that option to not see images is still there, and if you've defaulted it to off you still won't see images. Unless I misread the blog post.


> Convenience and features over security and privacy

I don't understand. It probably has to do with the Zeitgeist.

The images were originally blocked because of security and privacy concerns.

Rendering of images is potentially insecure because of bugs in the browsers. By proxying the images, Google, as a webmail provider, screens you from your browser's bugs. Solved.

Rendering of images is a privacy concern because of tracking. By fetching the images from another place, the attacker cannot know your location, your OS, language setting, etc... Solved.

Rendering the image allows checking when and if the email is ever opened, which can be useful in marketing^Wspam primarily because they can understand whether an email is active/exists or not. Not Solved.

The latter however is a general problem. There are many other ways to know whether an email account exists or not. Many mail servers respond with bounce emails anyway. They won't bounce on detected spam, but that's not the point; this feature is an additional barrier for image tracking for content that has passed through the spam filter already.


Consider this @codeflo:

1. Google may cache all images in all emails sent to gmail.com instantly and regardless of the existence of the address. This would remove the possibility for marketers to check user timestamp, remove user data from request and hide user email existence.

2. Google does _not_ need to save each image from each unique URL separately, all they need to do is fetch each image and check against an already existing (mega)array of images they've fetched. This greatly reduces storage needed, but doesn't do much for the bandwidth requirement, but they won't care about bandwidth in all their Googleness.

3. The single most important aspect of this change has been omitted in the article, and in your comment: This change completely eliminates the risk of CSRF attacks by spammers and the like. CSRF attacks are still number 8 on OWASP's list of top 10 attacks.

My three cents ;)
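Point 2 can be sketched as a content-addressed store. This is only an illustration under assumed names (`ContentAddressedCache`, `store`), not a description of what Google actually runs. It also makes the bandwidth caveat visible: each unique URL still has to be fetched once before its body can be compared.

```python
import hashlib


class ContentAddressedCache:
    """Store each distinct image body once, however many URLs point to it."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}          # content digest -> image bytes
        self.url_to_digest: dict[str, str] = {}    # URL -> content digest
        self.fetch_count = 0

    def store(self, url: str, body: bytes) -> str:
        """Record a fetched image; returns the digest used as the storage key."""
        self.fetch_count += 1  # every unique URL still costs one fetch
        digest = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(digest, body)
        self.url_to_digest[url] = digest
        return digest
```

Two URLs that return identical bytes share one stored blob, but `fetch_count` still grows by one per URL, so storage shrinks while the number of requests the sender sees does not.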


I thought that 99% of the images included in emails for tracking purposes are single pixel transparent GIFs - so no biggie in working out which ones those are...


In the e-mail campaigns we send every image is a tracking image. It all just goes into our log files and is then post-processed, so the additional cost of processing every image is minimal compared to the rest of the cost of the send.

Using a separate tracking pixel is pointless unless you for some reason want to let some third party track the opens (which some people might, e.g. to prove certain open rates)


Then they switch to two-pixel transparent GIFs as a workaround.


god damn marketing geniuses.


Also, no more cookies that were set when loading those images


As someone who co-invented email image bugs with a million other programmers over 15 years ago, you are absolutely correct. (for the curious, my reason for coming up with it was to tell when a customer who requested a car insurance quote from the company I worked at read their email which instantly initiated an outbound call to them).

The entire article is just plain wrong. For instance:

> This move will allow Google to automatically display images, killing the "display all images" button in Gmail.

Go ahead and do that, Google and you'll bring Web bugs back completely. How about this:

1. Marketer embeds a "jpg" file whose filename consists of a GUID that matches back to a user.

2. When you load that "jpg" file, it gives you an image unique to that user - maybe an MD5 hash of their GUID filename or some other thing unique to them that holds no value.

This would uniquely identify that a user reads their email. How would Google stop this? They can't cache it for multiple users. Both the filename and contents are unique to the user. If you think they could otherwise detect it for this particular situation, I could think of 100 other ways to do this that would not be able to be tracked in the same way.

There's no way Google or any other email provider can legitimately automatically load email images and not open the door to web bugs. No way.


There is ONE way: fetch every single image right away regardless of whether the email is even valid. And then don't store the cached version if there is no valid email. Otherwise store it and display to the user.

Since every result will be a positive - false or not - no information is revealed to the marketer AND the images are displayed.


> There is ONE way: fetch every single image right away regardless of whether the email is even valid. And then don't store the cached version if there is no valid email. Otherwise store it and display to the user.

Sure, in theory that works. In practice, you make an easy way to do a Denial of Service attack against Google or innocent third parties, so in actuality it would never work.

Send a million emails to Gmail accounts that each have fifty links to 1 MB JPG files hosted all over the Internet. The size of the file you send to Google in the email is what - 1k? The size you are making Google download is 50 MB. This is a 50,000:1 attack ratio. You could take Google down with a 56k modem.

You could also launch a denial of service attack against any other target, courtesy of Google. Send a million emails to Gmail users that downloads JPGs on a target web server. You can even make up the JPG names to be non-existent. Again, figure your email is costing you 1k in transmission and Google is putting tremendous strain on the target server downloading 404 error messages.
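The arithmetic behind the 50,000:1 figure, written out. The sizes are the round numbers from the comment above (decimal units, a 1 kB email with fifty links to 1 MB files); the function name is just for illustration.

```python
def amplification_ratio(email_bytes: int, links: int, bytes_per_link: int) -> float:
    """Bytes the mail provider downloads per byte the sender transmits."""
    return (links * bytes_per_link) / email_bytes


# Figures from the comment above: a ~1 kB email linking to fifty 1 MB files.
ratio = amplification_ratio(email_bytes=1_000, links=50, bytes_per_link=1_000_000)
print(f"{ratio:,.0f}:1")  # 50,000:1

# One million such emails, if every image were fetched eagerly.
total_download_tb = 1_000_000 * 50 * 1_000_000 / 1e12
print(f"{total_download_tb:.0f} TB")  # 50 TB
```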


Google does SPAM filtering and has no obligation to deliver those messages. It can throttle its acceptance of messages from the same IP block. How is this different than any other (distributed) DoS attack mitigation?


> It can throttle its acceptance of messages from the same IP block

There are literally tens of millions of computers infected with malware making them part of a botnet (ex: http://www.csmonitor.com/USA/2011/0629/Biggest-ever-criminal...). The cost to hire 1 million of these computers (all with unique IP addresses) to send emails out is trivial. You'd be shocked how cheap it is.

Now here's the thing... SPAM is pretty easy to spot because someone is trying to sell something. In this case, you're not trying to sell anything. You just need Google to download things. So you send an email with "Vacation Pics" in it. Sure, Google could filter them out but they're all from unique IP addresses from home computers across the country/world and they aren't trying to sell anything. They probably could filter out some of them at the cost of filtering out a lot of false positive legitimate mail.

Internet security is complex - you seem to think otherwise.


Snowshoe spam is pretty common now, with a lot of the larger networks mixing in spammers with legitimate traffic (lookin' at you, ovh.com).

As it is right now, spam is just mostly a drain on network and human resources.

This attack would also target storage. Google has enormous amounts of storage, so I've no idea whether it would be effective against them or not. But, it would be very effective against any smaller service providers that tried to do the same image caching thing (which so far seems to be a roundly bad idea, IMO).

It would also open up the possibility of using Google as an attacker to take down other sites. If Google's servers immediately requested the image for caching, then just send out a few million messages to Google addresses, each one with a unique URL pointing to some big image file on some site you want to take down. Most sites will accept a query string after an image URL, so ... bob's yer uncle. As it is now, this isn't an effective attack because you need to get people to actually open up the email message and then click "Display images", all within a short period of time.


Nope; still the same filtering; they are not going to cache things detected as SPAM; so an attacker would have to fool the anti-spam algorithm first... good luck with that!


I wasn't describing a new tactic for spamming, I was describing possible new tactics for targeted attacks. They're different things.


An attack that relies on the possibility of being able to send millions of emails without being detected as spam; so it is unlikely to succeed.


Gmail has probably the best spam filtering in the world, but you seem to think it's flawless, and it's not. I just now signed into my old Gmail account for the first time in ages. There's spam in there, and recent too. Roughly one per day by the looks of it -- messages that aren't newsletters, or from previous contacts, or anything that I signed up for.

And spammers aren't even that smart, they're just numerous and financially motivated.

If somebody wanted to annoy someone else with this, they could.

And, again, even if this doesn't work against Google directly, it certainly would work against other service providers who decide to follow Google's lead.

Heck, just look at the recent popularity of the WhatsApp spam that spread malware to tons of people (including Cryptolocker in some cases), or the "Secure Document" phish that made the rounds in Gmail in September.


Ok I see your point; but I want to add that one of the reasons your old account has spam is because you don't use it; I don't know what algorithm they are using but it certainly relies on usability data; I don't know exactly but could be: what language you usually speak in; what hours of the day you receive and read important email, what subjects you read about, etc etc.


Yeah, that could be. I have to spend a significant part of my time now dealing with spam (sysadmin), and Gmail's handling of spam is years ahead of most of the tools I have available.


But this is Google. They already want and try to load every url in existence. DoS attacks based on flooding actual content are largely irrelevant to them.

Attacks on third parties are trickier, but you can do the same sort of thing today; how often is this tactic used already? And can't you download 404 messages yourself for cheaper than 1k?


Google could check if GUID_1.jpg is the exact same image as GUID_2.jpg.

If the same, just show a cached version. No more tracking.

(of course, email marketers would thwart this by making each image slightly different. google would respond by checking if the images are almost exactly the same. repeat. whack-a-mole ad infinitum).


Eh. When they fetch GUID_2.jpg to check its content, it's already too late at that point; the marketer has been notified that this particular user has opened the email.

Google might save themselves some storage space, but not anyone's privacy.

If Google changed their policy to fetch all images when the email is delivered, that basically delivers a false positive to all marketers/spammers/etc. -- which is better than more accurate info, but it's still worse than just not loading the images.

It's still a confirmation that the email address is valid, plus... who wants to give spammers (or marketers, for that matter) a false positive? The unsophisticated ones won't realize that it's a Google change; they'll just put you on the "interested" list and move on.


But you can't do that without requesting both images from the host.


Please (re)read point #2 I originally made which addresses your point.


Google could retrieve and cache a random sample of the images and hash them. Then they could note which image links go to the same images. And they could identify image links by noting that a bunch of emails have the same structure except for one image link that's different for every recipient.

So I think they actually could take a pretty good stab at deduping images with unique tracking URLs. It might not be perfect, but even if it works 95% of the time they could still kill the profitability of the unique tracking image technique.
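The structural comparison described here can be sketched roughly as follows. It assumes the provider has already grouped several copies of the same mailing and extracted their image URLs (the `messages` mapping is hypothetical input), and it needs at least two copies to say anything. It is not a claim about what Gmail does.

```python
from collections import Counter


def per_recipient_urls(messages: dict[str, list[str]]) -> dict[str, list[str]]:
    """Given recipient -> image URLs for copies of one mailing, return the URLs
    that appear in only one copy (candidate tracking images)."""
    counts = Counter(url for urls in messages.values() for url in set(urls))
    return {
        recipient: [u for u in urls if counts[u] == 1]
        for recipient, urls in messages.items()
    }
```

URLs shared by every copy (a logo, say) drop out, and whatever is left per recipient is the part that differs from message to message.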


Easy: Make the "Dear <username>" part of the email an image. Boom, deduplication fooled.

(Also, they don't currently seem to do anything like you're suggesting.)


What if google just immediately fetches images for all received messages, regardless of deliverability? They can dedupe wherever possible, but the sender still ends up with no information about whether the message was delivered, much less read.


> What if google just immediately fetches images for all received messages, regardless of deliverability?

I'm trying to work out whether this is more useful as a way to get Google to DoS themselves or as a way to get them to DoS arbitrary web sites of others. Either way, isn't this a gift to trouble-makers?

Of course Google would probably develop an automated defence against such attacks quickly if they happened in practice, but it seems any such defence would necessarily involve not caching all the images in advance, which would defeat the original point.


I'm fairly sure sending an email is more expensive than sending a GET, so it should be more effective for an attacker to make the requests directly than trying to use this to get google to proxy an attack.

I also strongly suspect that google's crawling infrastructure is more than capable of fetching a bunch of images for every single message gmail receives.

But even if I'm wrong about the above, google is perfectly capable of throttling their fetching to mitigate. (The problem really ends up looking an awful lot like crawling the internet, which is an area that google seems to have a bit of experience)
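For what the throttling might look like, here is a small sliding-window limiter keyed by target host. The class name, the limit, and the window are assumptions made up for the example; nothing here reflects Google's crawler.

```python
class HostThrottle:
    """Allow at most `limit` fetches per host within any `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.history: dict[str, list[float]] = {}  # host -> recent fetch times

    def allow(self, host: str, now: float) -> bool:
        """Return True and record the fetch if `host` is under its limit."""
        recent = [t for t in self.history.get(host, []) if now - t < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.history[host] = recent
        return allowed
```

The caller passes in the current time (for example from `time.monotonic()`), which keeps the sketch easy to test.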


In reality, I'd be less worried for Google and more worried for whoever is hosting moderately large images that get linkjacked in numerous variations (http://www.example.com/largeimg.png?randomnumber=72435).

Google can't tell, a priori, whether or not a series of similar e-mails sent to many thousands of people with Google Mail addresses and containing similar but different image links like the above is a genuine mail going out to someone's list or a DDoS of www.example.com in which Google is about to become an unwitting participant.

By the time they've worked out whichever trick is being used this time (in the same way that they adapt to changing black hat SEO tactics, but probably only make major changes every few months) it's not hard to see a hostile party busting the bandwidth cap for anyone on a basic, low-volume hosting plan.


Why involve Google? Aren't sites on basic, low-volume hosting plans easy to knock over, without resorting to DDoS tactics? And if you're trying to knock over bigger sites, it doesn't seem like Google would make a very good DDoS platform in any case, since the requests would be originating from a relatively small range of IPs that a bigger site could just ban. Presumably the only reason they wouldn't want to ban the requests is if they're actually the ones sending the emails in the first place, so the problem sort of solves itself.


This is an old problem with an old solution. If you have an expensive-to-generate resource that you don't want automatically retrieved en masse, you use robots.txt to deny access to it.


AND it could create a disincentive to load up an email with unique images, since as soon as you send the email out all of those gmail addresses are coming right back at your server to request the images.


And then you can cause Google to DDoS someone else's site by sending out spam containing lots of image URLs.


You can somewhat do this with the current system. I had no problems sending an email with ten 10 MB images. Google happily fetched all 10 of the images off my server.

Not sure if they limit it at some point, but if a server accepts urls such as:

http://i.imgur.com/9Y5FDz7.jpg
http://i.imgur.com/9Y5FDz7.jpg?1
http://i.imgur.com/9Y5FDz7.jpg?2
etc...

Google would fetch each separately. Send this out to a bunch of people, and it seems problematic. I'm going to be optimistic, and assume they built in some sort of limiting, but who knows.
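To make the cache-key point concrete: a cache keyed on the full URL sees the variants above as three different resources. Stripping the query string collapses them, though that is only safe for servers that ignore it, which a proxy cannot know in general. `strip_query` is a helper written for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


urls = [
    "http://i.imgur.com/9Y5FDz7.jpg",
    "http://i.imgur.com/9Y5FDz7.jpg?1",
    "http://i.imgur.com/9Y5FDz7.jpg?2",
]
print(len(set(urls)))                        # 3 distinct cache keys
print(len({strip_query(u) for u in urls}))   # 1 after stripping the query
```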


Users would have to have previously agreed to "Always load images from domain.com"


The cost of a new domain per e-mail campaign would be trivial.

(and "normal users" do click the show images links)


We manage opt-in mailing lists for customers of restaurant chains, and for well-timed, well-targeted campaigns they get open rates in the 30%-40% range with the majority of opens within 15-30 minutes of the send anyway. It takes more resources for us to handle the outbound mail load than to handle the inbound image requests, as for the image requests the URLs contain enough info to do a trivial regexp-based rewrite and fetch the images from a cache. I don't think handling a 100% open rate as soon as the mail was delivered would even remotely be a challenge.
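A guess at what such a regexp-based rewrite could look like. The `/img/<campaign>/<token>/<filename>` layout and the `/cache/` prefix are invented for the example; the commenter's actual URL scheme is not given.

```python
import re

# Hypothetical URL layout: /img/<campaign>/<recipient-token>/<filename>
TRACKED_PATH = re.compile(
    r"^/img/(?P<campaign>[\w-]+)/(?P<token>[\w-]+)/(?P<name>[\w.-]+)$"
)


def rewrite_to_cache(path: str):
    """Split a per-recipient image path into (shared cache path, recipient token).

    Returns None for paths that do not match the assumed layout.
    """
    match = TRACKED_PATH.match(path)
    if match is None:
        return None
    cache_path = f"/cache/{match['campaign']}/{match['name']}"
    return cache_path, match["token"]
```

The token goes to the log for later processing, and every recipient's request is served from the same cached file.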


Just curious, how big are those lists? There is a big difference in a restaurant with 10,000 customers and an e-commerce site with 3,000,000.


Yeah, good luck with that. Instantly, every hacker in the world would use Gmail as a DDoS amplification tool.


This is the correct answer. If google does not do this, well, we know whose side they're on.


PG recommended this method years ago for fighting spam. Puts a load on the spammers. Not sure if anyone has ever tried it at scale.


With google's machine learning brain trust, I think they could still do a pretty good job deduping. Maybe not perfect, but I'd bet on them to win an arms race.

Edit: ah, codeflo and EGreg are right. I was just thinking about the task of determining that the images serve the same role in each message (which I'm sure google could do a good job of). But (as they point out) in the "Dear <user>" case they'll still have to show the right image to the right user. Although, as Nacraile and jaxn say, if they load all those images eagerly they'll remove the value of those unique tracking images, and impose a cost on the sender.


There's a huge difference between something that they might do, in theory, at some point, using vaguely magic machine learning technology, and what they are currently doing, right now, to address the privacy concerns over a change that they already rolled out to millions of users.


Are you saying Google will try to guess and reconstruct "Dear Marge" in the same font as "Dear John" instead of requesting both from the origin server?


Getting the font right is the easy part. Figuring out the formerly occluded pixels in the background image is the hard part.


> But (as they point out) in the "Dear <user>" case they'll still have to show the right image to the right user.

They could do a "This sender appears to be attempting to track you. We have disabled images as a precaution. Click here to load."

Scary enough to the average user it'd probably kill the technique very quickly.


> Scary enough to the average user it'd probably kill the technique very quickly.

I'm not sure about that. It's nothing that desktop clients such as Thunderbird haven't been doing for years. I don't see any remote images in any e-mail until I click to say load them, and this works in much the same way that plug-in elements like Flash and Java are now click-to-show in various browsers. Numerous marketers and mailing list services still use the technique to track an approximation of reader numbers, though.


Nah, nobody pays attention to warnings attached to Gmail messages after having seen so many of them.

A particular example of crying wolf that comes to mind is the yellow box that says, "HEY! THIS SENDER ISN'T WHO THEY SAY THEY ARE!", which usually means that someone just forwards their .edu address to Gmail.


Or just cache unique images on delivery...


Although I'm sure it's possible to fool their image hashing algorithm, I doubt this will. Image hashing algorithms are designed to be resistant to small changes in the image and more advanced ones can generate hashes that determine how similar one image is to another. I haven't tried this, but you can probably see a proof of this using google image search. Add some text to an image, and see if Google image search can find the original.
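One common family of such algorithms is the "average hash". The sketch below works on an already-downscaled grayscale grid (a plain list of lists, 0-255) so that it needs no imaging library; a real pipeline would first shrink the image to something like 8x8. Whether Google uses anything like this is not known from the thread.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of a small grayscale grid (values 0-255): one bit per
    pixel, set when the pixel is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small values suggest near-duplicate images."""
    return bin(a ^ b).count("1")
```

Small pixel-level noise usually leaves most bits unchanged, while a structurally different image flips many of them.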


Procedurally generated fractal backgrounds with random seed, that might work?

Anyway, I think it would be perfectly fine if Google matched up emails with similar content and where there was one image that was unique for everybody just remove it, maybe with a note to the user in that case.

Do NOT have a "click to load images" button. If users can't ever see them then it completely destroys spammers' ability to use them even for a rough sampling.

I would love to see spam as an advertising method be completely destroyed. It won't be, because even without tracking it is still easy and useful to spam out lots of ads, but this would help.


And Google could just mark any mail from sources who pull these shenanigans as spam


Yup. Though I suspect my university's alumni newsletter also has tracking images in it. I haven't checked, but if I were them I'd use a tracking image.


They don't even have to edit the image. They can just put a tracking ID in the image metadata.


What if the images are slightly modified for each user? A mixture of image processing and hashing can be used for deduping though.


E-mail marketing companies would make a fortune coming up with "image composition plugins" for their systems. E.g. manipulating background patterns; alter the positioning of image elements; change text positioning, size, fonts; change objects in the images.

It'd be a gold mine.


I'm pretty sure this already exists -- I've gotten spam with obviously random noise, lines, etc. added to the background.

My assumption is that it's making the images unique.


In fairness, it's a partial privacy improvement - it masks IP and user agent, as well as repeat opens.


If no actual http request between the email recipient and the sender happens, doesn't that also imply that the sender has less opportunity to do all the regular http user tracking stuff to associate the browsing session with that email address? That seems vaguely beneficial.


It doesn't even mask repeat opens: as buro9 found in the sister comment, the image does not get cached, Google's proxy will request it again each time it is viewed.


Google was already image caching external email images, IIRC. So as far as amount of raw information being leaked, I think today's feature launch represents a step backwards.


> Google has no way to know that all of those point to the same image.

Try again. :)


Yeah, I am not sure where the original claim is coming from. It isn't that hard for Google to simply follow the links and compare the files as part of the caching process. So unless marketers start customizing the images in addition to the links, there isn't any reason why Google can't cache the images together.

And even if marketers do start customizing images, hasn't Google gotten pretty good at comparing very similar files? Isn't that how Google Music works without having a copy of every single individual upload?


The two of you are confused. The image link doesn't even have to return an image in order to successfully send information to marketers.

An image link in the email to:

http://thisalways404s.com/404/slg_read_my_email

will work just fine. So, Google will follow the link and compare the files and... wait a second, the information we care about - that slg read my email - has already been transmitted.
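
A minimal sketch of that point, assuming a hypothetical server where the last path segment is the tracking token (the `token_from_path` helper and the in-memory `seen` list are made up for illustration, not anything a real mailer is known to use). The token is recorded before the response is even chosen, so answering 404 loses nothing:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

seen = []  # tokens observed so far; in-memory, for illustration only


def token_from_path(path):
    """Return the last path segment, ignoring any query string."""
    return path.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]


class Always404(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record the token first; the status code sent back is irrelevant.
        seen.append(token_from_path(self.path))
        self.send_error(404)


def serve(port=8080):
    """Hypothetical entry point: block and answer every GET with a 404."""
    HTTPServer(("", port), Always404).serve_forever()
```

Calling `serve()` and requesting `/404/slg_read_my_email` should leave `slg_read_my_email` in `seen`, even though the client only ever receives a 404.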


Unless they did it at the time of mail arrival in order to better compare and save on loading times. In which case the information would be useless.


They are unlikely to do that though: it would waste a lot of bandwidth requesting images that users are never going to see because they'll delete the mail before opening it. Google may have bandwidth and server resources to cobble dogs with, but they are not going to waste it like that, I assume. Also, if they did, you could easily perform a DoS attack (or just give someone a big bandwidth bill) by sending out a pile of email with an image tag pointing to a large object on a competitor's web servers.


I don't understand how that helps. If I send an email to user_x@gmail.com with an embedded image at http://example.com/my_spam_images/image_for_user_x.jpg and google makes a request to that, I know the mail has been delivered and the user in question exists and is seeing my ads, because a request for image_for_user_x.jpg showed up in my logs.

Now, when user_y also receives my spam and I get a request for image_for_user_y.jpg and I just serve the same file, Google is probably gonna deduplicate them on their cache or cdn or w/e, but only after they've sent me the request and confirmed that someone read my email.

I'm not trying to overload google's storage capability here (lol), I'm just interested in the information leak.


Similar to how they detect spam by comparing similar emails (eg, same sender, largely the same content, etc). So the first maybe 1,000 or so see the spam in their inbox, but after enough people report it, everybody else gets it in their spam folder.

So they could learn that for all these emails with largely the same content, this one image has a slightly different URL, but the image is always the same (or similar). So as with spam, the marketers might see the first few "opens", but once Google learns that they are all similar anyway, they won't see any more.

Now, I don't know if that's what they are doing, but it's certainly possible.


"I know the mail has been delivered and the user in question exists and is seeing my ads"

No, you just know that the email was delivered to the user's inbox. You don't know if the user looked at it or just trashed it.


Here's the social workaround to that: Send a few e-mail campaigns with "time sensitive offers" with a timer that starts on retrieval of the image, plus an "oh, by the way, please note that for Gmail users it starts on delivery as Gmail loads the image right away".

People love their e-mail offers. The type of users e-mail marketers want the most - namely the ones that respond to their offers very well - would be up in arms if Gmail makes them start missing out on offers.


That is way more information than they should ever get.


And even if marketers do start customizing images, hasn't Google gotten pretty good at comparing very similar files?

For image tracking pixels, I'd just start returning PNGs of random dimensions with completely random pixel data. Ta-da, unique images that aren't similar at all (unless they wanna start doing wavelet transforms or something).


hasn't Google gotten pretty good at comparing very similar files?

Sure, but they'd have to download the file first. At which point the tracking has succeeded.


...the tracking has succeeded.

Succeeded in what, confirming that Google is still in operation? Please note that Google doesn't even have to confirm the receiving email address is valid in order to get the image.


I just checked and Google rejects an e-mail after the RCPT TO: stage if the recipient address doesn't exist so they would not receive the message content.

    $ telnet gmail-smtp-in.l.google.com 25
    Trying 74.125.142.26...
    Connected to gmail-smtp-in.l.google.com.
    Escape character is '^]'.
    220 mx.google.com ESMTP nh2si25383829icc.26 - gsmtp
    EHLO myhostname.mydomain.com
    250-mx.google.com at your service, [my.ip.was.here]
    250-SIZE 35882577
    250-8BITMIME
    250-STARTTLS
    250-ENHANCEDSTATUSCODES
    250 CHUNKING
    MAIL FROM: <myusername@gmail.com>
    250 2.1.0 OK nh2si25383829icc.26 - gsmtp
    RCPT TO: <non-existant-address@gmail.com>
    550-5.1.1 The email account that you tried to reach does not exist. Please try
    550-5.1.1 double-checking the recipient's email address for typos or
    550-5.1.1 unnecessary spaces. Learn more at
    550 5.1.1 http://support.google.com/mail/bin/answer.py?answer=6596 nh2si25383829icc.26 - gsmtp
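
The same exchange can be sketched with Python's `smtplib`, assuming the server answers RCPT TO honestly (some servers accept everything and bounce later, so a 2xx reply is only a hint). The function names are my own; the SMTP object is passed in so the logic can be tried against a stub:

```python
import smtplib


def rcpt_accepted(smtp, sender, recipient):
    """Return True if the server answers RCPT TO with a 2xx code."""
    smtp.ehlo()
    smtp.mail(sender)
    code, _message = smtp.rcpt(recipient)
    smtp.rset()  # abandon the transaction; no message data is ever sent
    return 200 <= code < 300


def check(host, sender, recipient):
    """Hypothetical wrapper: open a connection and run the probe once."""
    with smtplib.SMTP(host, 25, timeout=10) as smtp:
        return rcpt_accepted(smtp, sender, recipient)
```

In the session above the server replied 550 at the RCPT TO stage, which this sketch would report as `False`.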


"early tests indicate that they don't actually do that, waiting for the user to click the email to make the request."


It'd be nice to see these "early tests".


You can easily try it yourself.

python -m SimpleHTTPServer 8080

creates a webserver serving the current directory, you can then create an email linking to a file in that directory and observe when it gets queried.


Easily is a bit of a stretch, because most users are on NAT setups and they would need to go into their router settings and know how to set them up to allow the external request to get through. So, yeah easily if (a big if) you know how to do that, or if (another big if) you are on a machine that is directly visible on teh interwebs.


Easy to do with ngrok.


That's not the point though. The point is most users don't have the knowhow.


Of course, if Google instantly downloaded every single image reference in every email accepted by their MX, there is no longer any useful information in that fact.


Wait, are you saying Google will try to guess and reconstruct "Dear Marge" in the same font as "Dear John" instead of requesting both from the origin server? What if the image they didn't download includes a picture of a baby instead?


Ok, but then Google's still making a request to, here, a user-specific URL. The spammer may not know where you are, but they now know that your address exists.


But they already knew that because they didn't get a bounce response.


They don't normally get bounce requests, though -- a tracking image is easier.

You've probably noticed that most spam comes from "borrowed" email addresses, not ones the spammer actually controls. If anyone ever sends a ton of spam with your email address on it (this has happened to me) it really drives the point home.


I believe the claim was more of "there is no way for google to know those all point to the same image without following the link."


Once they followed the links, the tracking has already happened. Any deduplication after that only helps to reduce Google's storage costs.


Well of course it has a way. It can fetch the image. Then it will see it's the same - but now that means that the mail is 'reported' as delivered, possibly before the person receiving it has ever opened the message. This is an issue both for individuals (I don't want this information reported) and marketers (false positives).


... before making the requests. And then it's too late.

(Also, they might not actually point to the same image, it's very easy to make the images themselves unique if required.)


Do you know more about this? Are they, say, MD5ing images that are served, or can this be circumvented by changing a pixel for each request?


Even if they hashed every image returned, they'd have still made the request.


How could they know prior to fetching the image? Sure, after fetching they could determine it but that is too late for privacy.


De-duplication is already a "big deal" in lots of hosting situations, such as box.net, dropbox, etc. I doubt it's outside of the skillset of engineers at google to address the problem. Especially given that google already has the technology to do image searches based on other images.


De-duplication does not allow Google to know

    http://marketing.example.com/tracking/some_user.png
in cache is identical to

    http://marketing.example.com/tracking/some_other_user.png
without passing the second URL — a.k.a. the tracking data — to marketing.example.com.
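
To make that concrete, here is a toy content-addressed cache. It is only a sketch: `DedupingCache` and the injected `fetch` callable are hypothetical and say nothing about how Google's proxy is actually built. It stores one blob for identical bytes, yet the origin still sees every distinct URL once:

```python
import hashlib


class DedupingCache:
    def __init__(self, fetch):
        self.fetch = fetch         # callable: url -> bytes (hits the origin)
        self.by_url = {}           # url -> content hash
        self.blobs = {}            # content hash -> bytes, stored once
        self.origin_requests = []  # every URL the origin server has seen

    def get(self, url):
        if url not in self.by_url:
            # The hash cannot be known without fetching, so the origin
            # learns this URL before any deduplication can happen.
            self.origin_requests.append(url)
            body = self.fetch(url)
            digest = hashlib.sha256(body).hexdigest()
            self.blobs.setdefault(digest, body)
            self.by_url[url] = digest
        return self.blobs[self.by_url[url]]
```

Fetching `some_user.png` and `some_other_user.png` through this cache leaves one stored blob but two entries in `origin_requests`, which is exactly the tracking data.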


I just did a quick test using Mailgun and recorded the results [1].

The TL;DW is that Google does give you back the open event, and they also give you the email address of the open, since this data is encoded in the URL.

The data that is not accurate is what you would expect: IP address, geo-location and user-agent string.

We'll hopefully write a more extensive blog post shortly.

[1] http://blog.mailgun.com/post/gmail-open-tracking-test-at-mai...


> The data that is not accurate is what you would expect: IP address, geo-location and user-agent string.

Also any cookies present in the user's local environment from other actions (images from the same ad network in other emails or from visiting web pages that use the same ad network) are not going to be sent, so tracking you between locations is going to be neutered somewhat.


Google gets huge benefits from it, but I think in the end this helps advertisers as well, since they are still able to track via the image URL and a unique URL/ID. Good deals usually help both parties with some give and take.

Pros:

- Since all images flow through google, phishing and other malware attacks could be subverted.

- Images will be hosted faster in many cases (possibly less cost to run newsletters).

- Fewer connections on your server/cdn: not from multiple sources, but from google alone.

- Still able to identify users legitimately but new users and newsletters will have more trouble getting your information initially.

Cons:

- Later re-views will not be tracked: the second time, the image will come from google's cache if it hasn't been purged (re-views are not big on newsletters anyways)

- Google getting all this data as well as your company

- The obvious 'national security' reasons

- Limited location and meta information


> 'national security' reasons

Which?

The email already has the image URL. What extra leak is there if Google fetches the image?


Fewer destination IPs is not a pro.

Also, google already has this data so why is that a con?

There is no national security reason.

Bottom line: Christmas came early for spammers this year.


> Now, an open question is if Google will make that request when the email is actually opened, which would allow marketers to determine if and when the email was read by the user, or if Google will make the request as soon as the email is received.

I reply to this question here. Sadly, a request is done each time and only when a user loads the image.

https://news.ycombinator.com/item?id=6898087

Which makes the Ars article even more wrong.


If google cached them as soon as they were received wouldn't they, in some cases, effectively perform a DoS attack on whatever was hosting the image? I.e. if a spammer sent out 1,000,000 emails w/ images they were hosting, they would immediately receive 1,000,000 requests for the image.


I think such a behavior is really easily spotted and filtered, if not already blocked by spam filters.

Google has the technology to do that, the question is whether they want to put in the investment of storage required to actually guarantee users' privacy, or if they just want to spend the least amount they can get away with...


it does, indeed, send the request when it is actually opened. see: https://news.ycombinator.com/item?id=6895876

/me goes to go find and turn on the "block third party images" feature.


Are we sure they won't programmatically/statistically detect these tracking URLs and start skipping the request after the first dozen or so retrievals out of 10,000,000?


this is what the original announcement blog post seems to imply indeed


If they wait until the user opens the email to cache the image, then it's actually BETTER for marketers.

This essentially allows the marketer to track whether the email was opened by default.


> This essentially allows the marketer to track whether the email was opened by default.

And the spammer. Unless they decide to not load images by default from untrusted senders.


There is a simple way around that. You are saying that the fact that Google retrieves images means that the mail was delivered. But this is only true if Google retrieves images only from delivered mail.

If Google retrieves every image in every mail sent to @gmail.com, @googlemail.com etc, then the only thing the retrieve tells you is that you spelled "gmail.com" correctly - nothing about whether there is a mailbox there or whether it was delivered.


> If Google retrieves every image in every mail sent to @gmail.com, @googlemail.com etc, ...

But they don't. If they retrieve the image, the account exists (they reject mail after the RCPT TO: stage for non-existent accounts).


and presumably bounce it in other cases? I mean this could be a way of ensuring that certain addresses exist without needing to be able to receive the bounced mail if they don't. But how is that an interesting or useful vector?

If Google does this with every mail regardless of the inbox it goes to (spam etc) then it doesn't tell you any more information than you learn from not receiving a bounce. However, I could imagine a scenario where the bounce address is wrong anyway (spoof) - is this really that useful for anything?

I mean, presumably most combinations of common first and last name plus two digits go to a registered mailbox. How does being sure that it is registered (but knowing nothing else) without having to be in a position to receive the bounce, mean a compromise?

I'm open to the possibility that it does - but I'm not seeing it.

EDIT: another possible area of concern is that you can get Google to visit an address just by sending mail to johnsmith@gmail.com and calling the link an image. But can't you already do the same thing with the Google bot by including a link causing it to probably visit? This could be more instantaneous and hide the actual referring source of the visit behind an email, but I don't really see how this can be used for anything. For example if an extremely malformed server performs actions on the basis of a simple http GET then I guess you could craft that command into an image url, send it to any gmail address, and then Google will do your dirty work of actually visiting that link. But, really, is this a vector that is dangerous for anything? Don't URL's already get random Google traffic?


And?

Anyone can check for the existence of gmail accounts in this manner without screwing around with images.


Google could technically do a similarity measure between images coming from similar links (or similarly worded emails) and then provide the cached version. But then advertisers could hide random information in the image (steganography), making two visually similar looking images with dissimilar enough content that Google will have to up the ante. And so on and so forth.

But you are spot on them being "willfully dishonest". The way I look at it, they are at this point trying to push the barrier and see where users will protest enough that they need to roll-back.


> Absurdly wrong, marketers already use a unique image URL for each email recipient, and Google has no way to know that all of those point to the same image.

wrong.

you can of course do simple image processing and identify similar images. if the marketers only change the name (url) of the file and not one bit of the content, you can trivially compare the hash of the file... even if the marketers change the content, assuming the marketer sends an image that is 90% the same and 10% customized per person, you can borrow techniques from the image compression domain to compress this humongous data very efficiently.


And how do you get the image to compare it? By requesting it, which means you have to ask for it from a server, thus identifying that the image has been loaded.


The behaviour might change in the future, but I think an article in the Gmail help centre has some answers on the initial implementation of this:

https://support.google.com/mail/answer/145919?hl=en

"In some cases, senders may be able to know whether an individual has opened a message with unique image links" suggests Google (at least for now) fetches the images upon opening of the email.


A few years back I shared office space with an "email marketer", and even then he would tell me that each image URL contained a hash as part of the file name. This hash is linked to the email address. How would they be able to prevent that? Even if Google pre-fetches it, it would still (possibly harmfully) confirm they had seen it, even if they hadn't.
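
I never saw his actual code, so this is only a guess at the general shape: a keyed hash of the address becomes the file name, and a lookup table maps it back. `SECRET`, `BASE`, and both function names are assumptions for illustration:

```python
import hashlib
import hmac

SECRET = b"example-secret"                  # hypothetical signing key
BASE = "http://example.com/my_spam_images/"  # hypothetical image host


def tracking_url(address):
    """Derive a per-recipient image URL from the email address."""
    digest = hmac.new(SECRET, address.encode("utf-8"), hashlib.sha256)
    return BASE + digest.hexdigest()[:16] + ".jpg"


def build_lookup(addresses):
    """Map each generated URL back to the address it was sent to."""
    return {tracking_url(address): address for address in addresses}
```

Any request that shows up in the logs for one of these URLs can be mapped back to a recipient, no matter whose IP address made the request.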


> and Google has no way to know that all of those point to the same image.

They could do content hash-based caching rather than URL-based caching. It would be more private, as the email senders would have to generate a unique image for each recipient.


It's impossible to know if the content for a different URL is the same without fetching it first, your point won't work.


> They'll see one request from Google per successful delivery to an inbox

But if the images are pre-cached before the user opens it, there is still no way of knowing the E-mail was read.

Unless the image is cached upon opening, which is a bit counter-intuitive.


"Absurdly wrong, ....,and Google has no way to know that all of those point to the same image."

I think the guys who implemented image search have better ways to figure this out.


I would hope that they cache the images before delivery to any inbox regardless of whether the inbox exists or not.


I think any email marketers who want to get around this easily can. Just change robots.txt to not give Google permission to fetch the images. Copyright laws will prevent them from wilfully ignoring that, presumably.


robots.txt is for crawlers, it would not stop an email client from rendering the email on behalf of its user upon receipt.


So then what gives Google the legal right to fetch, store, and re-serve the images?


They're just acting as an email client. Alice sends email to Bob, Bob is granted rights to fetch and view images. Bob uses gmail, so he passes the rights onto google to do part of that serverside.


I'm not a lawyer, but I don't think copyright law allows for Bob to pass on that right.


He certainly passed on a significant portion of the rights or that server wouldn't be authorized to receive and permanently store the emails in the first place.

Do you think copyright law disallows running your desktop in the cloud?


I came here to say basically the same thing, so I'm going to chime in in support of what you're saying. If I'm a marketer and I send an email with a link to the image, and Google caches it in order to resend it for their own purposes (even if that's to shield their users from spam), how is that not a copyright violation? In this case, caching is no different than copying; and the kind of caching Google is doing in this instance is different than the caching an individual does via his browser, since the individual downloaded the image in the original instance.


If I had to guess, I would guess that sending an email with an image link is a rather obvious declaration of intent to share the image.


Yes, but with who? I think it's a declaration of intent to share with the person specified in the "to:", "cc:" and "bcc:" fields, but not with Google.


Unless they've somehow overlooked something, surely their terms of service will?


The terms of service are an agreement with their users, not with third parties who are emailing their users.


I see where you're coming from, and IANAL etc, but this surely must be a solved problem? For at least the last 15 years (longer?) we've had web mail where mail you send doesn't go directly to the user but via some other servers. If there is an actual issue here it's going to be a hell of a legal battle.


So they strip the images and show you a text only email and tell you to deal with it.


True, that would close off this avenue. Would be interesting to know if that's how they handle this currently.


There seems to be a lot of misinformation flying around. Here's [0] Google's support doc that clears up some of it.

The most important part is at the end:

"In some cases, senders may be able to know whether an individual has opened a message with unique image links. As always, Gmail scans every message for suspicious content and if Gmail considers a sender or message potentially suspicious, images won’t be displayed and you’ll be asked whether you want to see the images."

So Google apparently does not see read receipts as a problem. The privacy and security protections are about preventing other information (like ip, browser headers, cookies) from leaking, rather than read notifications.

If you care about maintaining your privacy, I would recommend disabling the new functionality.

[0] https://support.google.com/mail/answer/145919?hl=en&ctx=mail


Wow, I'm amazed by this. I was convinced that Google wouldn't have rolled out this new feature unless they had a way to avoid this sort of tracking. Isn't this exactly the privacy issue that led clients to adopt "Don't display images automatically by default" in the first place?

(Why wouldn't they let users combine this behavior with the old one? That is, don't display images by default, but if you choose to display them anyway, get the file from Google's proxy server.)


They do combine those behaviors. The proxying is on for everybody, regardless of whether you disable showing images automatically.


Actually, the most serious reason (and the reason why not displaying images was introduced in the first place for unknown senders) was the potential for exploits triggered via embedded images, e.g. http://news.netcraft.com/archives/2004/09/17/exploit_for_mic... or even just using it to bypass firewalls by having someone inside an organization execute a GET request (via img src=) from a browser inside the firewall. (P.S. This is one reason why write operations should always use POST...)
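
As a toy illustration of the P.S., assuming a made-up `/delete/<name>` endpoint and a plain set standing in for real state: a request with a safe method is refused before anything changes, so an image fetch cannot trigger the write.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def handle(method, path, accounts):
    """Toy dispatcher: `/delete/<name>` only mutates state on POST."""
    if not path.startswith("/delete/"):
        return 404
    if method in SAFE_METHODS:
        # An <img src=...> fetch is a GET, so it lands here and changes nothing.
        return 405
    accounts.discard(path[len("/delete/"):])
    return 200
```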


I thought "Don't display images automatically by default" was a response to porn images.


These two statements of yours seem at odds with each other:

"The privacy and security protections are about preventing other information (like ip, browser headers, cookies) from leaking, rather than read notifications."

"If you care about maintaining your privacy, I would recommend disabling the new functionality."


"Disabling the new functionality" refers to maintaining the old behavior of hiding images by default. This means that you're leaking nothing unless you explicitly decide to. This does not appear to disable the proxying, though, so you're still covered by these additional protections when you do explicitly show images.


What he meant is to disable the new functionality and revert to the default blocking of any images, which is definitely the most privacy-preserving option.

The new functionality seems to by default enable read notifications as Google seems to load all images by default. Unless that's false. Then it should have no impact on privacy.


The options are "Display images via this proxy mode that still leaks some privacy" and "don't display images. 100% privacy".


When I just tried disabling it, the images that were displayed after I clicked 'Display Images Below' all still went through the proxy.


It's a delivery receipt, not a read receipt. Delivery receipts already exist, in the sense that Google will tell (non-suspicious) senders if the message is deliverable.

Gmail is pre-fetching the image as part of the message.


I have an inbox crammed with unread messages. I okayed the new image behavior, checked a few messages and noticed how google loaded images without asking me, which is fine, I did agree on that.

I then reverted back to the old image settings but when I open unread messages now, they also have their images loaded automatically and without asking me.

Either the "revert back to original settings" option is broken, or the caching of images is done when I enter my inbox, for all unread messages.


I'm not yet seeing this option available in my account. :(


So instead of my work mailing lists having accurate stats via ye olde tracking pixel, now only google does, which they will sell to their own chosen marketers and clients paying google's rate.

This isn't the privacy and common-sense win you think it is.


Your work mailing lists only have accurate stats if all of the recipients currently enable loading of images by default.

I don't think you're being sincere in your concern.


No one in the industry believes that open rate is 100% accurate, or even 50% accurate. It doesn't need to be. It's a directional measure: Which subject lines have better open rate, how does open rate decline as you send more frequently, which mailing lists or demographics have better open rates to the same email, etc.


"opens" for email messages have never been super accurate, since downloading images isn't on by default in more than just Gmail - your desktop mail client may have the same settings - who knows?

Because of this, any real interaction with an email message - where the user has to interact with the content, could then be counted as an, "open". If you get any other interaction with the message - say a user clicks a URL (that is also tracked), you just look and see if an, "open" was also recorded for this message, for this user - and, if not: record an open.

You'll still get instances when opens happen in messages that aren't tracked, since no other interaction is done - there's that inaccuracy. But, if no other interaction is done, you might as well not count it as an open, anyways. Perhaps we should rename the "open" track to "was this person, at all engaged, in the slightest?"

You can also then just give that a ranking: opened, clicked x number of links, followed through with a sale, DIDN'T unsubscribe - good rank!
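
A small sketch of the "a click implies an open" rule, assuming recipients are identified by plain strings and that `delivered`, `opens` and `clicks` are just lists pulled from wherever the tracking data lives (all names are hypothetical):

```python
def engaged_recipients(opens, clicks):
    """Treat any tracked click as an implied open."""
    return set(opens) | set(clicks)


def open_rate(delivered, opens, clicks):
    """Fraction of delivered recipients with a recorded or implied open."""
    delivered = set(delivered)
    if not delivered:
        return 0.0
    return len(engaged_recipients(opens, clicks) & delivered) / len(delivered)
```

With four delivered messages, one pixel hit and two tracked clicks (one of them from the same person), this counts two engaged recipients rather than one.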


Not sure why your "work mailing list" needs open tracking. Unless by "work mailing list" you mean "spam".


> which they will sell ...

Citation? I was not aware of this.


Google's an advertising company. Selling eyeballs -- or information on how to get more eyeballs -- is what they do.


> Selling eyeballs -- or information on how to get more eyeballs -- is what they do.

Selling eyeballs: yes.

Selling information: oh good lord no, so very, very no. That information is never ever leaving Google's servers.


If it leaves or doesn't leave Google servers we can't really know, but the facts are: Google has this information, and they are a marketing company with incentives to monetize it.

Imo it's enough to be a bad thing. We can't rely on good will of companies to fight incentives and money in the name of privacy of their users.


You don't have to rely on "good will". Even if you are super pessimistic "companies only want money", Google will never sell or give up that data to other companies. That data is literally Google's lifeblood. That data is why companies use AdSense to advertise with instead of Bing Ads. That data is Google's competitive edge.

Even the evilest of companies isn't going to sell off its competitive edge, that's just idiotic.

Also you are wrong, Google is not a marketing company. Google is a middle man company. Google matches marketers to consumers. THAT is Google's business. Google is paid to be a match maker, nothing more.


Until the NSA gets to it.


Or until, years and years from now, someone buys Google.


when gmail will be as quaint as a model-T


He is only assuming. Since this is new, and the info was free to begin with, there might not be anything against this in the privacy policy? but I don't want to read the whole policy to find out.


Why aren't you asking for read-receipts if that's what's desired? If it's because people wouldn't send them, shouldn't that be their choice?


The vast majority of people don't care at all about read receipts.

Marketers don't care who _isn't_ reading their mail, they care how to send mail that more people will read.

Who subscribes to a direct email campaign (and doesn't unsubscribe, and doesn't flag as spam), and is still offended by the thought of the sender knowing it was read?

Those people can flip the setting back to prompt-to-show-images.


Not too different from them removing actual referrer URLs from Google Analytics.


Basically the same thing they do with keyword (not provided). You can get the referring keyword via adwords but not analytics with organic search. So if you pay for adwords you can see what keywords your users are clicking but if you don't pay then no data for you. Google only cares about user privacy when they can sell the data.


Google is starting to emulate the bullying tactics of Microsoft during its heyday. "Don't be evil"? Not so much, I say.


Devil you know.


"Google will now be digging deeper than ever into your e-mails and literally modifying the contents."

This is a silly fear, all email clients already do this, as you don't display raw unmodified HTML from emails, you have to scrub it. They are just adding one new kind of scrubbing to the list of things they already must do.


It seems Gmail was already caching the images in emails (before today's change to display them by default).

Read Mailchimp's post (December 6th):

"Image caching still lowers our ability to track repeat opens, but turning those images on means we’ll be more accurate when tracking unique opens. At least, theoretically it should work that way"

http://blog.mailchimp.com/how-gmails-image-caching-affects-o...


I wonder whether this applies only to the web UI.

I've seen a couple startups that were working on dynamic email marketing - they fed in the content as an image, e.g. a "one-day promotion", but would change the image content server-side for future email opens to reflect current details. I guess that this breaks that functionality.


Such functionality should be broken - my email archive should be a permanent record of what you sent, not something the sender is able to tweak afterwards.


Google isn't new to this whole "caching" thing. I would expect them to respect whatever cache expiration directives the web server provides.
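
For what "respecting the directives" might mean in practice, here is a simplified sketch that reads a `Cache-Control` header value and returns how long a cached copy could be reused. This is my own reduction of the header rules (it treats `no-cache` like `no-store` and ignores everything except `max-age`), not a description of what Google's proxy does:

```python
def reuse_seconds(cache_control):
    """Return seconds a response may be reused, 0 for never, None if unstated."""
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')
    # Simplification: both directives are treated as "refetch every time".
    if "no-store" in directives or "no-cache" in directives:
        return 0
    value = directives.get("max-age", "")
    return int(value) if value.isdigit() else None
```

A sender who wants every view to hit their server would presumably answer with something like `no-store`, which this sketch maps to 0.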


This raises an important point.


Regardless of any caching or deduplication, the fact that Google is acting as a proxy means an end to e-mail remarketing. E-mail remarketing requires the request come from the visitor's own browser so that the reply can set a cookie to identify that browser for later advertising. I wouldn't be surprised to see this show up alongside Google AdWords'/DoubleClick's web remarketing product now that Google's shut out the rest of the industry.


Seems pretty trivial for marketers to work around by making each image request a unique URL per-recipient. Assuming Google's proxy fetches the image when the user opens the email, you could use that technique to find out when a user reads the sent mail.

You wouldn't get the IP address like you would with conventional bugging, but you could still find out how many users read the mail and what time they did so.


Mmm, it sounds like Google might deduplicate the images....


I thought that at first too, but to deduplicate, they still need to first issue a request for each, download, and compare. So the requests have already been made, and the marketer received their information.


If you request every image (regardless of whether the email was opened) and then dedup, then I don't think the marketers get much info.


Dedup on image data has no bearing on the information advertisers can collect. On image URLs it will cause advertisers to append query strings to image paths. Google can't win that particular arms race, but grabbing the images on email receipt instead of on email viewing neutralises any gains from that win.

One point nobody has raised yet, though, is that there could be valid use-cases for the sneaky stuff people were doing before. Images generated on the server that reflect current (updated or updating) info could be handy. It might even be worth serving different images to different clients based on user-agent strings. I'm skeptical on both counts, though.


I think it caches them the moment the email is sent?


That is only possible if the URL is identical. If senders use unique URLs per recipient, Google could guess at which images are identical based on surrounding e-mail content, patterns in URL components, and/or statistical sampling of some of the URLs, but a 100% accurate deduplication would either require identical URLs or a lot of bandwidth to read 100% of URLs.


Well Google already crawls a ton of URLs so it's hard to imagine bandwidth being a major issue, right?

Even the same URL can return a different image. That wouldn't be super useful for tracking, but they can only truly dedupe if they read every response.


Holy smokes, Google is going to further grow their monopoly and try to take over the advertising market with this.

They'll cache and own even more of your data and keep it out of the hands of spammers - in turn spammers will have to buy into google to get data about you.

This isn't for us, this was done to make money off of us.


Google is a corporation. Everything they do is to make money.

The images they're caching aren't mine, anyway, and in many cases they're unsolicited. Sure there's the evil aspect to this (they own advertising), but there is the potential good of obfuscating your actually private data - the IP you check your mail from, when you check it, anything you send back with an HTTP request - from marketers. On solely that note, I'm all for it. But I'm also one of those who like the new Tabs setup, and rarely loaded images for emails from people I don't know.


But in many cases they are solicited, and people want the behavior that gets triggered by knowing about opens:

You always open our offer e-mails? We'll send you more of what you open, and less of what you don't, increasing the chance you'll find something you like.

Stopping web bugs from the spammers will improve things, but stopping them from legitimate opt-in marketing mails will make the experience worse and less targeted for people.

The company I work for sends millions of e-mails on behalf of customers. All opt-in, and I spend far more time than I'd like making sure we comply with all expectations of the mail providers and ISPs...

But I'm all for Google proxying and hiding IP, cookies etc - I wrote a webmail solution back in the day, and co-founded a company to run it, and frankly I pretty much assumed Google did this already; we did that back in '99 because it was the obvious thing to do.


While this will, indeed, take away some information from email marketers, it may also give them more, different information. Now, instead of getting a lot of personally identifiable information from some users (some=the percentage which still used clients that don't block image requests), they will probably get less information, but from a wider set of users - those who use the default settings in Gmail.

On the one hand, their proxy solution has a positive effect for privacy, but on the other hand, the load-by-default setting has a negative effect.

Either way, Google already knows which e-mails you're opening, so using an image proxy is not going to give them "even more of your data" that they didn't already have.


> They'll cache and own even more of your data

As a Gmail user, that's what I want.

> keep it out of the hands of spammers - in turn spammers will have to buy into google to get data about you.

Again, that's what I want. Spammers depend upon an incredibly low cost of sending emails with almost no accountability through botnets and foreign servers. If they need to open Google advertising accounts, provide a credit number, have to get their ads approved, and get charged market rates for their impressions ... I'm all for it.

But none of them will actually do it. Google won't make a dime from them.


When will Google scripts begin to edit the content of a lover's letter in order to protect its users from psychological harm?


Holy crap. If I'm reading this right, this basically destroys all pixel tracking, so no more email open stats, no nothing.


Not quite. According to testers in the other thread[1], Google pulls images on mail open. This means marketers will get more accurate email open stats because now the default behavior is to load images. However, they don't get cookies, IP, etc so they lose some of that capability. Not all bad for marketers and spammers.

I tried to opt-out of external content in Settings > General but unfortunately it's still loading images.

[1] https://news.ycombinator.com/item?id=6895606


Very interesting. And seems like it doesn't affect most mobile users since they use a default client, for the most part.


You have no right to know what your users are doing after you send them a message. The privacy invading "tracking pixels" shouldn't exist in the first place.


Why shouldn't content providers be free to monitor usage if it's legal and easy to do? Would you next start to argue SaaS providers shouldn't have any Web server logs at all since it's a way of "tracking" users?


For the same reason that sending me a postcard doesn't entitle you to enter my house and take an inventory of my fridge. To get that information about a user, you should need informed consent, not just a marketer's wistful greed. That same "why not?" attitude is exactly the same bullshit user-contemptuous line of thinking that the NSA and the FBI use with their "intercept it all and let God sort it out" eavesdropping schemes.


If I choose to visit a website hosted by a server operated by someone else, that server is involved and can do whatever it wants.

If a piece of content is delivered to me via the mail, I should be able to open a cached version as many times as I want without any request to the remote server.

And the cached version can be built for me by my mail system, which by ALWAYS fetching the resources protects me.

As a question of what SHOULD happen, I think read receipts should be voluntary.


> If a piece of content is delivered to me via the mail, I should be able to open a cached version as many times as I want without any request to the remote server. And the cached version can be built for me by my mail system, which by ALWAYS fetching the resources protects me.

Agreed, and not even the server has to do that, any good e-mail client could (as it seems Gmail is now starting to do).


Web server logs are okay. Any javascript-based tracking and cross-site usage aggregation is not.


Yes, and as a consumer, I couldn't be happier. All that shit is customer-hostile privacy-invading bullshit and I'm ecstatic that Google is making a move that kills it. I vigorously object to things like tracking bugs that leak information about me without my consent, so I'm in favor of measures like Google's that get us further towards people being unable to go on data fishing expeditions such as were possible with inserting arbitrary tracking bugs in emails. HTML email is a giant attack surface not just for Outlook Express-style remote code execution, but for invasions of privacy. If a marketer can't do their job without invading my privacy, I have no sympathy for them and I hope that this technical change stabs their business model in the belly.


Email open stats have been bogus since most email clients switched to not loading external images by default.


Yup.

I'd really hate to be in the bulk email business today. Perhaps google will sell them back the data they used to create themselves.


The future of marketing is forming cohesive communities with your customers, anyways, not spamming the shit out of them.


What does a "cohesive community" look like and is anyone doing this today?


Indie/small/medium-sized videogame studios with forums.

Alternatively, if you're not the sort of company that naturally engenders an active community, Twitter. It's not perfect, but it's a far better experience for customers who may want to keep up with what you're doing but don't want a new email to delete every week.


No idea, and I don't think so. Just spitballing.


unique urls for each image still work.

httpurl/image.jpeg?email=your.email@gmail.com


:) !


This could be anything from wildly good to wildly terrible for e-mail marketers, depending on how this ends up working. If Google makes a request for an image when the user opens the e-mail, this will be really good for e-mail marketers since now they will get accurate open stats, whereas before they only got stats for people who hit "Display Images" (surely a minority).

On the other hand, if Google (either now or in the future, crucially) alters the behavior to be smart about pre-caching images, then e-mail marketing is screwed. It will likely make sense at Google to do this, since it will improve the user experience to have the images be pre-fetched to the proxy server before they open an e-mail.

In other words, e-mail marketing vis-à-vis Gmail is now in a Schrödinger's cat-like situation. We can't know whether Google pre-caches images fully, partially, or not at all, or whether it will start doing so in the future, so for all intents and purposes e-mail marketing data is both highly accurate and completely worthless at the same time :)


AdSense. This is all about pushing dollars to AdSense.

First they started filtering marketing messages into separate tabs, which I'm assuming dramatically cut readership. Now they're going to make it impossible to "bug" emails for read receipts. The only metric left is the "click".

Email marketing just became a whole lot less valuable.


So there is something I'm not seeing discussed anywhere. What about the idea of only storing one image for any and all recipients? Yes, I know that the file name is changed based on recipient to try to track each user's viewing of the image. But a simple md5sum of the file will indicate that it is potentially the same file as others being cached. Google would only need to store the references and the one file. Thus, the first person to view the file (per maximum cached time-frame) would indicate a viewing of the file, but subsequent users would never be identified as having viewed the file.

I recognize that there are a couple of potential downsides to this idea. 1) The time/processing it takes to determine the md5 could be problematic on such a large scale. 2) I have no idea how easy it is to change an image to be unique for each user.
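
A rough sketch of the storage idea, assuming an in-memory store with made-up names (`DedupStore`, `add`). It is not how Google does it, just the md5-keyed layout described above: one copy of the bytes, many URL references. Note that `add` needs the image bytes in hand, so each URL has to be downloaded at least once before it can be hashed:

```python
import hashlib


class DedupStore:
    """Keep one copy of each distinct image, keyed by the MD5 of its bytes."""

    def __init__(self):
        self.blobs = {}   # digest -> image bytes, stored once
        self.by_url = {}  # url -> digest, one entry per recipient URL

    def add(self, url: str, data: bytes) -> str:
        """Register already-downloaded bytes for a URL and return the digest."""
        digest = hashlib.md5(data).hexdigest()
        self.blobs.setdefault(digest, data)  # only the first copy is kept
        self.by_url[url] = digest
        return digest

    def get(self, url: str) -> bytes:
        """Serve the shared copy for any URL that pointed at the same bytes."""
        return self.blobs[self.by_url[url]]
```

Two recipient-specific URLs that return identical bytes end up as two entries in `by_url` but a single entry in `blobs`.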


But this only addresses the storage side. A request must still be issued for each unique URL in order to de-dupe.


Damn, I didn't think about that...


In some cases, senders may be able to know whether an individual has opened a message with unique image links.

https://support.google.com/mail/answer/145919?p=display_imag...


Speculation: how many steps away are we from some sort of AdWords for mass emailing?


Sounds perfect; you've got to pay if a user reads your email. Kinda like Facebook charging $1 for you to message someone who is not your friend.


Gmail already delivers paid advertisements through the promotions tab. I'm sure Google gives those advertisers plenty of metrics


True. A lot of marketing emails also tend to be sorted there.

I wonder if the Promotions tab gets a lot of attention from users? I archive everything in there as fast as possible.


It's important to keep in mind that in the past, viruses have successfully propagated through bugs in image decoders. Proxying the images provides an opportunity to remove malicious ones.


Give it a while and we'll probably get embedded Adwords ads inside e-mails replacing existing third party ads in order to "protect" you.


[deleted]


That's not the case. The only case when browsers send no referer header is when an HTTP request is made from a page loaded over HTTPS. Cross-origin has nothing to do with that.


In my opinion, Google doesn't care about the tiny tracking images. It does care about 2M jpegs. So, while email marketers might be up in arms, it appears this won't affect them at all. If anything, they should be thanking Google for reducing their bandwidth.


They'll still have to serve the image to google though.


if you had a Gmail folder named "Ars Technica" and loaded e-mail images, the referral URL would be "https://mail.google.com/mail/u/0/#label/Ars+Technica"—the folder is right there in the URL

That just isn't true on gmail. The whole service is served over https and won't pass referral information.

Not to mention your IP and whatever other information they feel like embedding in links will still be passed along when you click. So there's still some tracking going on, but they miss out on opened-without-action emails (which is of course useful information to marketers).


All tracking cookie/email marketing/whatever other fluffy stuff aside, I'd love to hear about technical hurdles & challenges from Google engineers on this. Sounds like a very damn interesting/challenging task to undertake.


Some other technical aspects of this:

GMail serves all images from a datacenter in Mountain View, CA, so if your email's images were served from multiple datacenters or a CDN, there is a good chance they will load more slowly, depending on your caching headers. They optimize images on the fly, which may introduce more latency. Their optimizer doesn't take into account whether the optimized image is smaller than the original, so the image they serve is occasionally larger (and/or looks worse) than the original. The maximum image size seems to be about 10MB.


> GMail serves all images from a datacenter in Mountain View, CA

Citation needed.


A clarification: GMail's proxy is in two pieces; your email client connects to GMail's CDN, but in order to load content from the origin it passes through a pool of servers in Mountain View. My guess is that the transcoding servers are all located there.


Why do you think the servers are in Mountain View?


I set up some servers in the US and Asia and had some images proxied to them through GMail's proxy. The traceroute paths and latencies from the requesting IPs lead me to believe that the servers were in the US, most likely Mountain View.

I probably shouldn't have asserted Mountain View, as it's more of an educated guess.

Edit: here's a traceroute from Hong Kong to the Google server that made the proxy request. Is there a flaw in this method?

  @hongkong:~# traceroute 66.249.88.203
  traceroute to 66.249.88.203 (66.249.88.203), 30 hops max, 60 byte packets
   1  119.9.72.2 (119.9.72.2)  1.168 ms  1.130 ms  1.122 ms
   2  119.9.64.64 (119.9.64.64)  1.092 ms  1.058 ms  1.052 ms
   3  vl902.edge3.hkg1.rackspace.net (120.136.47.19)  1.307 ms  1.205 ms  1.304 ms
   4  RHI-0001.gw2.hkg3.asianetcom.net (203.192.178.65)  1.169 ms  1.148 ms  1.123 ms
   5  google.gw2.hkg3.asianetcom.net (203.192.178.30)  1.782 ms  1.755 ms  1.741 ms
   6  209.85.248.62 (209.85.248.62)  6.716 ms 209.85.248.60 (209.85.248.60)  17.183 ms 209.85.248.62 (209.85.248.62)  2.105 ms
   7  66.249.94.31 (66.249.94.31)  81.019 ms  80.999 ms  80.966 ms
   8  64.233.175.1 (64.233.175.1)  53.256 ms  53.077 ms  53.060 ms
   9  209.85.245.206 (209.85.245.206)  72.528 ms  72.488 ms 72.14.239.55 (72.14.239.55)  80.903 ms
  10  209.85.242.89 (209.85.242.89)  150.251 ms  148.859 ms 64.233.174.176 (64.233.174.176)  149.147 ms
  11  72.14.239.80 (72.14.239.80)  215.363 ms  215.368 ms 72.14.239.82 (72.14.239.82)  201.955 ms
  12  209.85.249.45 (209.85.249.45)  223.694 ms 72.14.237.119 (72.14.237.119)  211.817 ms  212.567 ms
  13  64.233.174.117 (64.233.174.117)  212.505 ms 216.239.48.103 (216.239.48.103)  215.874 ms  213.109 ms
  14  * * *
  15  google-proxy-66-249-88-203.google.com (66.249.88.203)  211.877 ms  212.484 ms  212.371 ms


Second edit: d'oh, I meant wherever Google's main west coast datacenter is, maybe Oregon? The point being that it appears that all of the traffic goes through a single geographic location.


Okay, you've earned a couple upvotes.


For the same reason Starbucks grows all its coffee in Seattle.


I understand why this feature was implemented. This feature is less private than not showing images. However, the general population cares more about convenience than security or privacy. What I don't understand is why they couldn't give me a "No thanks" option. Why am I automatically opted in? A "no thanks" button would not hinder the usability for those who care more about convenience than security, and it would allow people who are concerned about privacy to continue browsing as they always have.


Please explain why you seem to think that this is less private than not showing images.


Longer discussion here: https://news.ycombinator.com/item?id=6895606


Hmm - so we can flood Google's cache by sending an email full of links to procedurally generated images?


You'd never win that race.


Who said that you have to submit links to your own server? Just force it to load large images from other servers. You can even use Google search to help find them.


you'd flood your email server before you'd flood Google's cache.


But your email server would only need to send links to images that are gigabytes apiece, and that are generated procedurally, no?

The true bottleneck would be the pipe between your procedurally generated image server & Google's server.


Whose pipe is the one clogged completely by this? Yours or Google's? Congratulations, even if Google does this stupidly you just DOSed yourself. I'd imagine there are a ton of protections in place for their proxies to not take stuff offline, such as max-length limits, per-host limits, etc.


Google could impose a spam scoring penalty.

Penalize mass emails containing unique identifying image URLs for identical images.

Where identical means virtually identical.


What is e-mail marketing? This is a serious question. As someone who has nothing to do with marketing, advertising and all that stuff, for me "e-mail marketing" is just spam. Or what is it? Newsletters? Does it mean that you buy mail addresses from someone, send your stuff to them, and see if someone opens it?


It's when people subscribe to your email lists and you send them marketing messages. For example, you might sign up to an email newsletter from a shop you go to, and they will send you emails with marketing messages (sales/new products etc.)


From the google post about why they don't display images by default: "We did this to protect you from unknown senders who might try to use images to compromise the security of your computer or mobile device."

Surely it's this same technology google themselves use more than anyone else to identify users?


Oh please Google, I know you're listening: Transcode the cached images into WebP for Chrome and compatible mobile apps. Seems like a very, very good opportunity to evangelize the format and push it another notch toward ubiquity (while saving lots of bytes and improving email experiences)...


The article doesn't mention it, but the proxy IS transcoding the cached images. However, there are many instances where the resulting images are larger and/or look worse than the originals.


Yes, this is a good idea, hotmail should do the same to promote whatever MS is promoting ATM.


To be fair, MS's "new image format" (JPEG XR) actually seems quite a bit better than Google's "new image format" (WebP)...

JPEG XR actually adds significant features like OpenEXR- and radiance-compatible HDR encoding, whereas WebP is basically the same old 1980s functionality with better compression.

So while there's something slightly sketchy about doing this, I'd say the world would benefit more from MS doing it than from Google doing it...

[I use gmail and other Google stuff, and have an Android phone, and generally hate MS, but it's very hard to be enthusiastic about WebP...]


Depending on Google's caching strategy and deduping capabilities, I wonder if longer term any embedded IMG link will start to count towards your quota...

In order to maintain privacy it's been well discussed they would have to cache always and forever. So large images will definitely add up over time.

I also wonder, even if they have a persistent cache, you might still want to check the Last-Modified and Etag of the URI. I don't think many people embed dynamic images like this, and I'm not sure how most clients would handle it, but it's an interesting corner case.
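
For what that revalidation could look like, here is a small sketch of standard HTTP conditional requests. The cache-entry layout and function names are assumptions for illustration, and nothing here reflects what Gmail's proxy actually does:

```python
def revalidation_headers(cached: dict) -> dict:
    """Build conditional request headers from a cached entry's validators."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def apply_response(cached: dict, status: int, headers: dict, body: bytes) -> dict:
    """Keep the cached copy on 304 Not Modified, otherwise store the new one."""
    if status == 304:
        return cached
    return {
        "etag": headers.get("ETag"),
        "last_modified": headers.get("Last-Modified"),
        "body": body,
    }
```

A server that changes the image for later opens would answer 200 with a new body, so a proxy that revalidates would pick up the change; a proxy that never revalidates would keep serving the first copy.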

Seems like a safer first step would have been turning on the Silk-like proxy and keeping the image display logic the same. Then you get only the benefits of cookie, referrer, and IP masking, and could also look for corrupt images which are actually JavaScript and that sort of thing. This wouldn't have been a shot across the bow of the golden goose of open tracking, a shot that calls Google's true motives into question.

Saying that the proxy is enough to require everyone to opt-out of auto-images may be a bridge too far, especially when there are ways to register your domain so that inline-images ARE automatically displayed, which IMO is what they should be encouraging.

Another way at this would be to find a UI widget which helped users actually understand the possible tracking info they would be giving up to the sender.

Putting still more control in the hands of the sender would be a data tag on the IMG which told Google it should cache, and in exchange would result in wider image viewership. Opens, actions, and conversions are the most important metrics for getting feedback to improve copy; it's devious for a display ad company to fuck with this on shaky privacy grounds. I guess at least they do provide an opt-out, which will be used by ~0.1% of users...


This is what I'm seeing when testing it: all my tracking images (tracking pixels) get hits like the one below.

Remote address: 66.249.x.x [any google ip]

Referer: [not set]

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (via ggpht.com)
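
If you want to separate these hits in your own logs, a rough sketch follows. It assumes the `via ggpht.com` marker shown above is present on every proxied request, which is only what this one test suggests; the string could change or be spoofed, and the function names are made up:

```python
from collections import Counter

# Marker taken from the User-Agent observed above; treat it as an assumption.
PROXY_MARKER = "via ggpht.com"


def is_gmail_proxy_hit(user_agent) -> bool:
    """Return True when the User-Agent carries the observed proxy marker."""
    return PROXY_MARKER in (user_agent or "")


def count_proxy_opens(log_entries) -> Counter:
    """Count hits per URL for entries that look like proxy fetches.

    log_entries is an iterable of (url, user_agent) pairs.
    """
    counts = Counter()
    for url, user_agent in log_entries:
        if is_gmail_proxy_hit(user_agent):
            counts[url] += 1
    return counts
```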


Browsers do not send referer over https, I don't know where they got the idea.


I'm not a technical person so forgive me if this is a silly question but would X-Forwarded-For help maintain tracking with this change?

http://en.wikipedia.org/wiki/X-Forwarded-For


I guess you could say that email marketers are getting Scroogled by Microsoft's ads. ;)


So how does this affect various email template applications such as Tout or Hubspot's Signals? IIRC Tout drops a pixel in each email and uses that as a primary tracking source for opens/CTRs etc...


I'm pretty sure this kills Tout's tracking when sending to Gmail users.


curious about this too. guessing they will not work...


Excellent, I've always wanted e-mail clients to do it this way.

You send me a mail - you've no business being able to track if/when/how I open the envelope, unless I explicitly wish to inform you.


I think it's the opposite. With Google's change, the sender can more accurately find out if you've opened the email, since a request to download the image will be made as soon as you open the email. However, what the sender doesn't get now is IP, location, and browser data.


Well, an appropriate solution would be to 'open, package and store' the email at the time of receipt, not reading - that may cost some resources of extra bandwidth and storage, but would provide a better functionality and permanence in case of opening that email two years later when the sending company and their servers may be out of business.


If you use django: https://github.com/juanriaza/django-tempus allows you to generate unique links


Does anyone know if this is just on the web client, or if it also affects emails retrieved from Gmail via POP/IMAP?

I'm not too keen on the idea of Gmail modifying the body of emails sent to me.


A side effect of this is you will never be able to send images to someone as a cut/paste if they are behind http auth since google can't log in and see them.


Aside from the negative effects, at least we will no longer see HTTPS warnings on the web UI due to any non-ssl images being served by the email sender.


I know Google doesn't like to give numbers, but I would be very curious to know how much storage/bandwidth they use to implement this.


Why aren't images and other content included in the email itself? It seems silly to send a link to an image rather than the image itself.


Images are the "ping" marketers use to track opens. So you'd have an image link back to `http://mysite.com/tracking?email=joe@stuff.com` which will alert them that Joe actually opened the email. This can give them a conversion rate: for the 5000 people we emailed, 230 of them opened the email, and 67 actually followed the link back to the site.
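
To make the arithmetic concrete, here is a tiny sketch using the example numbers above. The function and key names are made up for illustration:

```python
def funnel_rates(sent: int, opened: int, clicked: int) -> dict:
    """Compute basic e-mail funnel ratios from raw counts."""
    return {
        "open_rate": opened / sent,         # opens per e-mail sent
        "click_rate": clicked / sent,       # clicks per e-mail sent
        "click_to_open": clicked / opened,  # clicks per opened e-mail
    }


rates = funnel_rates(sent=5000, opened=230, clicked=67)
```

With those counts the open rate is 4.6%, and about 29% of the people who opened went on to click. Without the tracking image, only the click count (67 of 5000) is observable.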


Cool. So it used to be the case that my AdBlock configuration got to filter the image loads in gmail, and now... not so much. Thanks, Google!


or you could just go to settings and disable image loading.


Just encode the random information in the domain :)


Aren't people just going to start fuzzing the images? This sounds like the beginning of another arms race.


As a user of email, I welcome this move. If I were an email-marketer, though, I would be livid.


unless they are implementing this for fake email addresses, this is TERRIBLE!

it makes it super simple to enumerate valid email addresses.

patch it to fetch the images on valid+invalid email addresses, then we'll talk


As others have pointed out, I don't think it's that difficult for you to enumerate valid gmail email addresses at the moment.


Great, now spammers will know my email account is legitimate.


The EU Antitrust Ogre is surely grumbling about this...


So now only Google will know everything about you...


This is going to affect companies like yesware.


Can it be opted out by the user?


According to reports, the inverse question is now asked on the email header.


Good.


Good riddance, Google!


Facebook has been doing this for decades.


Facebook isn't one decade old yet.


I know; what I mean is that it has been doing it for a very long time.



