Edit: You can invoke it something like this:
soffice --headless --convert-to html file.doc
Just pointing it out...
you've already determined that it's running on an ec2 instance, but it's somehow "suspicious" that the user-agent is libreoffice? and you're a "security researcher" but "curious if this is an automated process"? please.
sure, dropbox might owe an explanation (even though you certainly gave them permission to do this in their TOS), and you can call me cynical and jaded, but this seems like pretty shameless FUD that appears to be tied to an effort to shill a new product.
EDIT: first i thought this was written by the HoneyDocs founder. now i'm actually unsure who the author is.
Yes, because humans use LibreOffice over SSH/X11 from an EC2 instance. Probably LibreOffice is being used for the parsing/rendering on a server. Probably for something innocent like generating thumbnails or text-only previews.
Open/LibreOffice with Python bridge is quite handy in converting documents to PDF format and can be run in headless mode (using virtual frame buffer like xvfb) on a server.
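For anyone wanting to try the server-side conversion, here is a minimal sketch that shells out to the soffice command line rather than using the UNO Python bridge. The function names are made up for illustration, and it assumes soffice is on the PATH and that target_format is a plain extension such as "pdf" or "html".

```python
import subprocess
from pathlib import Path


def build_convert_command(source, target_format="pdf", outdir="."):
    """Return the argv list for a headless LibreOffice conversion."""
    return [
        "soffice",
        "--headless",                   # no GUI, suitable for a server
        "--convert-to", target_format,  # e.g. "pdf" or "html"
        "--outdir", str(outdir),        # where the converted file is written
        str(source),
    ]


def convert(source, target_format="pdf", outdir="."):
    """Run the conversion and return the expected output path.

    Assumes soffice is installed; raises CalledProcessError on failure
    and TimeoutExpired if the conversion hangs.
    """
    cmd = build_convert_command(source, target_format, outdir)
    subprocess.run(cmd, check=True, timeout=120)
    return Path(outdir) / (Path(source).stem + "." + target_format)
```

Calling convert("file.doc", "html") runs the same command as the one-liner at the top of the thread, with an explicit output directory and a timeout.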
If you don't want your cloud storage provider reading the data you give them, then _encrypt_ that data _before_ you upload it.
Edit: Okay I see it's based on FLOSS and that's great, but as far as I can tell they're still asking you to install binary blobs, which makes the whole thing pointless.
I had a little trouble getting it to run on one of my older Mac OSX machines, but I'm pretty sure that was because I had the remains of a previous installation of MacFUSE messing things up -
There's also a MacOSX/iOS/Android/Windows commercial "wrapper" around EncFS which is fully compatible with the compiled-from-source versions of EncFS I've got running on Mac OSX and Linux (ARM and x86) - it's a "binary blob", but if your security/convenience tradeoff lets you consider that, have a look at BoxCryptor Classic:
For me, the tradeoff of having secured/encrypted files available on iOS is worth the decrease in security by relying on Secomba GmbH not backdooring me at the request of the NSA or ASIO (my local security agency) – or anybody further down the security agency or law enforcement foodchain. I'm not actually trying to protect myself against targeted surveillance by any sufficiently powerful nation-state, but I feel good about knowing I'm not quite so readily caught up in "dragnet" surveillance…
And yeah, I see what you mean, but if you don't have access to the source, you don't know what they're making you install. I'm a huge FLOSS advocate, but in this specific instance it's more my paranoia talking. I believe I can trust them now, but how many clients will they need to have before the NSA blackmails them?
It's still a step forward, somewhat, but I find it hard to believe that there could be a successful product based on putting the user in full control (which is needed for real security).
But if you're the only user / potential accessor of the files, single-user strong file-system based encryption works.
This kinda reads like an ad for HoneyDocs...
Be concerned with the content and only the content. If the article has it, it's legit.
(Posted in Galician:) I hate this: every time an article mentions a service or drops an
affiliate link, someone's verdict is that the article reads like an
ad. Would you prefer your reading material to be devoid of any
mention of products or brands? Should bloggers never make a cent
off affiliate links?
Be concerned with the content and only the content. If the article has it, it's
And hey, maybe I am interested in learning that language. Assuming your forum was appropriate (say, at a Galician convention), that could be a useful thing.
I'm not the most pragmatic person ever, but let me pose this: If you can't tell the difference, do you honestly care?
Wrong. Context is everything. You cannot look at data in a vacuum. You need to look at where, why, when and how - especially when it's sensational; i.e. something that may cause someone to take action.
I never said you could. We're not discussing the philosophy of objective statements that don't require context, we're discussing whether true facts are tainted by subjective elements around them. By definition, they can't be.
If the content is good, it doesn't really matter why it's there. But if you believe the context implies that the data is incomplete (aka, biased) or actually wrong, that's obviously relevant.
Especially if something was an ad, who cares? The data speaks for itself. You just need to verify the ad is correct and isn't mis-representing itself. In this case, the data is fairly objective, they aren't comparing their product to someone else's, just pointing out an action that a product was able to help perform, and that action produced interesting data about another service.
The article didn't come off as an ad, either. My main gripe was that some people complain over any mention of affiliated services. (People also gripe about non-labeled affiliate links by independent-millionaire, popular, respected bloggers.)
There's little doubt that somebody will think up a way to take advantage of this to "leak" information.
I wonder if this happens before or after Dropbox's dedupe step? I wonder if that provides an avenue to extract useful data?
Content is not modified by the context; a fact is either true or it is not. Everyone has a motive. It reminds me of how people call into question research sponsored by corporations, as if people who work in government-sponsored research are somehow automatically saints with no ulterior motive.
To trust someone based on affiliate links is a quite silly line of deductive reasoning.
From the information provided it seems simple enough to verify, embed an image via URL into a doc file, upload to dropbox, see if the URL is accessed. No need to argue about motive.
Calling some work into question because of the authors' motives isn't a claim that some other group has no ulterior motives at all. Certainly everyone has some motives, otherwise we would never get out of bed. But some of those motives will change the discussion more than others. E.g. when HP sponsors a study that finds that their own ink works out cheaper than buying remanufactured cartridges, it's perfectly sensible to be more suspicious of that than if a study by a consumer organisation found the same thing.
There's a saying that likely applies here:
"When you hear hoofbeats, think of horses not zebras"
The only thing to see here is that Dropbox is potentially opening themselves up to a vulnerability; it would be interesting to see if GET file:///etc/passwd worked...
On the other hand, given how many files are uploaded to Dropbox every hour, it's inconceivable that a human, whether through deliberate management direction or mischief, is opening all these documents. I would be more concerned about human intervention if, occasionally, a document triggered a buzz some days after it had been uploaded.
If all documents are showing as opened within 10 minutes, then surely it is just an anti-duplication automated agent at work.
Certainly it's automated.
Paranoid part of me says it's NSA keyword scanning. I feel a little insane suggesting that, but it's certainly conceivable these days.
The other possibility is Dropbox is indexing the files for search?
Anyway, using Dropbox unencrypted is a terrible idea. EncFS has user-friendly frontends like Boxcryptor.
Honey Docs doesn't actually explain what the callback looks like in the doc file, but it doesn't look like it has anything to do with images.
I downloaded the credit card Honeydoc. The content looks like:
<html>Nicole Davis 4556062729618215<br />
Brian Baker 4556767839126624<br />
Patrick Jones 4916615717158539<br />
It could perhaps be from generating a thumbnail... But dedup wouldn't work like that.. I would be very concerned about a dedup algo that requires interpreting the contents of a file, and dedup'ing based on that.
Understanding the nature of the DropBox access would start with understanding how a "HoneyDoc" does what it claims it does.
It's a byte comparison, so you still wouldn't use LibreOffice to compare files.
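To make the point concrete, here is a rough sketch of byte-level dedup keyed on a content hash. This is an illustration only, not a description of how Dropbox actually does it; DedupStore and content_key are invented names. Note that nothing in it needs to parse or render the file.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Identify a blob purely by its bytes; the file format is irrelevant."""
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    """Toy in-memory store that keeps one copy per distinct byte sequence."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = content_key(data)
        # Only store the bytes the first time this key is seen.
        self._blobs.setdefault(key, data)
        return key

    def __len__(self):
        return len(self._blobs)
```

Uploading the same bytes twice yields the same key and a single stored copy, whether the bytes are a .doc, a JPEG or random noise.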
The tracking behavior depends on a tracking pixel which may not always be processed by the client.
For example, with the credit_cards sample, the xls file is actually an HTML file with an img at the end (url linking to https://honeydocs.herokuapp.com/img/xls/...) and a client that only reads the plaintext (there are a boatload of command line utilities that fit the bill) won't fetch the image.
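A small sketch of why this matters: the snippet below scans HTML-in-disguise for remote img URLs using only the standard library. The sample markup and the tracker.example URL are placeholders, not the real HoneyDocs payload. A tool that only extracts text would never touch these URLs, while a renderer that loads external resources would request each one.

```python
from html.parser import HTMLParser


class ExternalImageFinder(HTMLParser):
    """Collect the src of every <img> that points at a remote host."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src.startswith(("http://", "https://")):
                self.urls.append(src)


def find_external_images(markup: str) -> list:
    finder = ExternalImageFinder()
    finder.feed(markup)
    finder.close()
    return finder.urls


# Hypothetical sample shaped like the "xls that is really HTML" case.
SAMPLE = (
    "<html>Jane Doe 0000000000000000<br />"
    '<img src="https://tracker.example/img/pixel.gif" /></html>'
)
```

Running find_external_images(SAMPLE) lists the one URL a rendering client would fetch; a preview service could use the same kind of scan to strip or refuse remote resources before rendering.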
While I like the approach a lot more than Dropbox (that fights to obfuscate its own algorithm), I still don't feel safe. Anyone with access to the server could intercept your keys, and thus have access to your data.
TrueCrypt over some cloud-based solution is still the ideal option, but the lack of support for sparse images makes me hesitant.
EDIT: no affiliation with Sync.com (or Dropbox, for the matter). Just trying to find a decent cloud-based storage solution that fixes the exact problem exposed by the OP.
I am not sure how LibreOffice handles active content, and furthermore I am not sure if there is a way to generate a ping back from LibreOffice without some kind of active content embedded. But to me at least, it somewhat implies that Dropbox, or whoever, runs LibreOffice in a not maximally locked down configuration.
The wisdom of executing "active" content embedded in such files is of course doubtful and something Dropbox should investigate. But if you want your files to be safe, you should instead use a service that encrypts them client side, which has the downside of losing the web interface that Dropbox offers (as this requires it to be able to access the decrypted files in order to serve them to you).
The article is written in such a way that they are saying a lot by playing dumb... so it's hard to say it's misleading... but I know few security people who'd write something up with this tone.
Want some free EC2 time? Wrap your workload in a .doc and have Dropbox foot the bill.
Crocodoc is likely generating web previews of your documents.
Well played, HoneyDocs... Well played.
We do exactly this on our eCommerce platform, before wanging stuff into s3 or glacier and just keeping a reference kicking around.
On the other hand, you have just discovered an information disclosure (host IPs) vulnerability in dropbox.
I don't see a .doc file getting small enough to outsize a HTTP request inside of it, even if you used some funky compression, but I'm willing to hear otherwise.
One question would be if you could upload the document once and then somehow trigger a very tiny edit that causes them to rescan it.
We do use LibreOffice to render previews of Office documents for viewing in a browser, and have permitted external resource loading to make those previews as accurate as possible. While this could theoretically be used for DDoS, we haven’t seen any such behavior. However, just to be extra cautious we’ve temporarily disabled external resource loading while we explore alternatives.
It may be that you are big enough that even the limited bandwidth you need for normal operations is enough to take out smaller hosts, so you'd need to measure and monitor to see how well this works.
Could Dropbox perhaps let me disable this feature? I almost never use the web interface so I wouldn't miss it and I prefer that my documents are not opened after being synched.
That does seem likely - dropbox tries to only upload diffs, when a file gets changed: https://www.dropbox.com/help/8/en
They could not fetch it and have a little blank bit in the thumbnail.
Chances are they're using a library they didn't develop and did not think of the possibility of external resources being loaded.
Edit: The most secure way I can think to handle preview generation is to have a virtual machine firewalled from the internet that previews a single document and is then reverted.
I'd set up a Docker container to accept a single HTTP POST with the document, and to return the thumbnail. The container can then be shut down and a new instance spun up to wait for the next document to process.
It might be wasteful to spin up a new container for each document, but it's the only way to prevent some exploit in LibreOffice that might leak information somehow. A leak could be as terrible as embedding an entire document in the next thumbnail, or as simple as returning the wrong thumbnail (like from a previous request).
LibreOffice was the user-agent that phoned home in the article.
If you're thinking of egress filtering except for the proxy, you can just HTTP tunnel right through it.
"How did this HTTP GET go through to my 'firewalled' PHPmyadmin site?"
You have to treat all user input as if it's toxic.
Also docx files are zip files which opens the possibility of a zipbomb. I wonder if LibreOffice has protection for zipbombs.
It wouldn't surprise me one could engineer a docx bomb that would consume gigs of memory.
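I don't know what LibreOffice itself does here, but a pre-check before handing a docx to any renderer could look roughly like this. The thresholds and function names are arbitrary assumptions. It only reads the sizes declared in the zip directory, which a hostile file can lie about, so a real service would also cap the bytes actually read during decompression.

```python
import io
import zipfile


def declared_uncompressed_size(data: bytes) -> int:
    """Sum the uncompressed sizes listed in the zip central directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return sum(info.file_size for info in archive.infolist())


def looks_like_zip_bomb(data: bytes,
                        max_total: int = 100 * 1024 * 1024,
                        max_ratio: float = 100.0) -> bool:
    """Flag archives that claim to expand far beyond their size on disk.

    max_total and max_ratio are illustrative limits, not recommendations.
    """
    total = declared_uncompressed_size(data)
    ratio = total / max(len(data), 1)
    return total > max_total or ratio > max_ratio
```

A docx whose document.xml is megabytes of a repeated character compresses by roughly a thousand to one, so it trips the ratio check without anything being extracted.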
Firewall unexpected outbound connections on machines doing their processing.
<img src="https://news.ycombinator.com/y18.gif" />
Article: Uploaded Documents to Dropbox Personal Account with Private Folders
So Dropbox indexes and creates thumbnails for private documents? This is because the NSA gives a bounty for friendly UIs, perhaps?