My wild guess is that watermarking is done on the client. Doing it at the server stage requires running an encoder for each user connected to the meeting, which increases broadcast costs immensely for Zoom. It would make sense for security reasons, but the trend with them seems to be profit instead.
If watermarking is therefore done at the client stage just before being heard/seen at each endpoint, then there is a good chance that it is hackable and watermarking code could be patched or audio/video extracted before watermarking occurs.
It would still require whistle-blowers to take this more involved step before leaking a meeting though.
They probably do run an encoder on the server. When in a many-users meeting, everyone but the one speaking has reduced resolution and bitrate. This suggests they are encoding a low bitrate and high bitrate stream for each user, and switching as needed.
One can easily verify this by looking at their bandwidth utilization. It is highly unlikely they are uploading multiple streams from every participant and just muxing the right-sized stream for the other participants to download. Bandwidth costs might be more expensive than the transcoding costs. Remember, you can use ASIC-accelerated transcoders, which can be quite cost-effective. If they really want to cut corners, they can simply command-and-control every participant client to send a low-res stream and restream it to every other participant without full transcoding, at least for the free tier.
I don't believe you would be able to see it in your bandwidth utilization. I suppose you would be able to estimate which stream you're receiving based on number of callers in a conference but your outgoing stream is always going to be the same, assuming you're not network constrained.
The stream is a single H264/SVC stream which contains layers of varying quality. It's sent to the Zoom "cloud" where the appropriately sized layer is extracted and forwarded to clients. It's called selective forwarding and it's essentially how all of the WebRTC style video conferencing solutions work once there are more than 2-3 participants.
> It would be incredibly cost prohibitive to run servers with live transcoding at their scale with a widely used free tier.
From the article, it looks like video watermark is only available for a meeting with only signed in users and locked to a particular domain. I don't know their pricing model, but speculating - I wonder if those features mean it's only available on some sort of premium tier and they could live encode only those meetings?
There are more reasons than just cost to offer it only to signed-in users on a given domain.
- Privacy (expected for consumers, debatable for domains).
- Willingness to pay (if your stuff is worth protecting, might as well charge for it).
- Decoding the watermarks: I assume this is only done on demand by authenticated users at a very low rate (otherwise if it's freely available to anyone, you give leakers an oracle that lets them determine whether their media scrubbing / fuzzing has indeed removed the watermarks, and these systems are more effective when they have no way to check / must live in fear that they haven't scrubbed it right)
What you say is plausible but encoding multiple streams on weaker clients would be challenging I think (doesn't it run on tablets and the like?).
Also I don't know if it's a reasonable comparison point but YouTube does transcode server-side into a whole bunch of different resolutions and video standards, going as high as 8K these days I think. Given the amount of video they receive every single second it's pretty wild how much processing power that must represent.
This isn't challenging. It is done for a majority of conference software. Essentially the client uploads the streams in such a way that they "depend" on each other. You can imagine there is a base stream at 480p 15fps, then you can add on some packets to get 480p 30fps, and further to get 720p.
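To make the "layers that depend on each other" idea concrete, here is a minimal sketch of how a forwarding server might pick layers for one receiver. The layer names and bitrates are made-up numbers for illustration, and `layers_to_forward` is a hypothetical helper, not anything Zoom documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str   # label for the quality this layer unlocks
    kbps: int   # extra bitrate this layer adds (illustrative numbers)


# Ordered base-first: a layer is only decodable if all earlier ones are present.
LAYERS = [
    Layer("480p15", 300),
    Layer("480p30", 200),
    Layer("720p30", 700),
]


def layers_to_forward(budget_kbps: int) -> list[Layer]:
    """Return the longest base-first prefix of LAYERS that fits the budget."""
    chosen = []
    total = 0
    for layer in LAYERS:
        if total + layer.kbps > budget_kbps:
            break  # higher layers depend on this one, so stop here
        chosen.append(layer)
        total += layer.kbps
    return chosen
```

The point is that the server only drops packets; it never decodes or re-encodes, which is why this is so much cheaper than per-user transcoding.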
I suggest the server tells the client which bitrate/compression to send, but watermarks probably don’t need a full re-encode on the server, could easily be in metadata or only key frames which would be fine. Scary though, I’m sure this will happen in all media at some point. The best way to get around this of course is use someone else’s account to leak from...
I imagine the only truly secure way to do it would be to transcribe the meeting (automatically or manually) and leak the transcription. The video/audio could be shared with one or two trusted journalists to prove it's legitimate (barring sophisticated deepfakes), and the journalists could post the transcript.
Maybe in theory even a transcription could be unsafe if the server actually manipulated the words in real-time in some way so that different listeners would hear different word synonyms at certain points, but that seems very convoluted and unlikely.
That would cost actual resources server side. My guess is that their entire architecture is designed to do as much as possible on the clients, because otherwise they could not offer a free service.
I would never want to record in-software on the same computer Zoom was running. I would be way too paranoid about who knows what software or introspection the Zoom app is running to identify this sort of stuff.
I would, hypothetically, record using my phone, being sure to make it not visible from the camera.
You could also apply the watermarking on the sending user's side. During the meeting "handshake", the watermark could be issued to each participant. The watermark could then be encoded before pushing out to the server. This would then ensure that all video is encoded when it is received by any viewer, all while removing any work needed to be done by the server.
In theory you could test this by using the web client and seeing if the watermarking occurs and even examining the web client code directly. Of course at that point you could just decompile the native client to the same ends.
Some articles found by googling [1] [2] from two years ago describe this capability as "ultrasonic watermark" so it is not new. I think this is coming to light as Zoom has become popular with the pandemic. For a journalist wanting to sanitize audio I would think they need to remove anything higher than 15kHz.
Audio watermarking is old hat, and it’s FAR easier for Zoom than for say a music service, because people are used to imperfections/stuttering/blurring in their Zoom calls which can just be encoded watermarks.
Pasting a comment I found interesting & funny from one of the commenters of that article:
"...It is a strange thing that the real quality audio is now reserved for the pirates. This industry really knows how to hit a target."
Listening to the samples (I got a nice BeO over-ears headset that has very good performance), I also realized that Spotify gives me some noise, I also thought it's a codec/digital thing.. little do I know..
I'm a Spotify subscriber but I'd be the first to admit that Spotify's audio quality isn't great to begin with, even when set to high quality streaming. It's noticeably worse than uncompressed CD quality (ignoring CDs that were mastered from sources that were lossily compressed to begin with - what a great trend that was).
This isn't a complaint, more an observation: Spotify works really well if you're outside, in a car, or even in an office environment with plenty of low level background noise. It's not so great when you run it through a half-decent hifi in your own home. Still, good enough for casual listening. However, if you're paying attention, you'll notice the flaws easily.
So, some of that noise is probably just that: noise. But some of it will also no doubt be the watermark.
Fortunately I didn't pay that much for my setup. Amp and speakers are about 30 years old and were given to me by my stepdad about 20 years ago. Pretty much everything else is second hand from eBay and 25 - 40 years old (CD player, tuner, tape, EQ).
The biggest expense is the subwoofer, which I did buy new because used prices for a decent subwoofer are still pretty high, especially when you factor in the cost of petrol to go and collect the thing (most people don't want to post because they weigh a lot).
The only other new components are an inexpensive Bluetooth 5.0 receiver, the speaker cable (Bassface, which I want to say was about £2/metre - super-cheap by audiophile standard) and gold-plated banana plugs from RS components. All the interconnects are I think Amazon Basics.
So my total expenditure for the whole system is less than £1,000. Fully half of that is the subwoofer. Admittedly, that's still probably a fair bit by most people's standards, especially when it's perfectly possible to get very good sound from a hifi separates system for £250 or so (see Techmoan's video series on the topic, for example: https://www.youtube.com/watch?v=lSY1iZqH118), but it's chickenfeed for most audiophiles. Still, I'm definitely not one of those guys: it sounds more than good enough to me and I've no desire to fall any further into that particular black hole.
Except for one thing... I don't have a turntable. So what I'm probably going to do is buy a pair of SL1210s and a mixer to plug in to the system. I'm lucky enough to have a fair number of 12" singles from a freecycle "barn find" type situation a few years ago, and another time-consuming hobby to get through the rest of this pandemic will be no bad thing.
There is both a danger and a satisfaction to mostly cobbling together a nice sounding system from lots of second-hand parts though. The temptation for me is to do the same again with one or two of the other rooms in the house.
At first I thought they meant Bang & Olufsen, a high-end brand that prefixes all their products with Beo [0]. But I guess the industry is making Beryllium drivers now [1].
This is very poor opsec advice. Robust audio watermarking is standard technology for many years now, and can be licensed from multiple vendors. If Zoom (or any other actor) cares enough to watermark their audio, you must assume that it may be hard to detect and remove.
A vocoder is probably the best off-the-shelf technology. Of course a challenge with making invasive changes to the audio (in order to defeat watermarking), is that people may claim that the audio is fake/misrepresented. Vocoded audio will not sound like the original speakers, and may have artifacts. Lipsync may also be slightly off. So one would have to be careful to communicate these limitations. Which the general public may not have much interest in understanding... Adversarial opponents may latch on to these things and use them to discredit the recordings.
A more conservative approach would be to transcribe the audio into text, and only offer the audio to (more) trusted parties for verification.
Reasonably effective stream watermarking happens every day and is done in the human vocal range with almost no listener impact.
In radio, Arbitron has a system working well within the lower audio range, even AM radio. AM is typically 5 kHz bandwidth.
They use a spectral masking technique able to encode ID bits into streams that can be decoded with portable devices.
PPM: Portable People Meter
Frankly, this kind of thing would go unnoticed by pretty much all listeners.
From the PDF I linked:
> [...]all watermarking technologies use the well-known perceptual principle of “masking,” which was first reported in the early 20th century and is a core technical basis for mp3, AAC, and a host of data-rate reduction schemes.
> In simple language, a loud burst of energy at one frequency will deafen the human auditory system to certain other audio components at nearby frequencies for a period of time before, during, and after the loud signal.
> Consider the following illustration: A tone burst at 1.1 kHz with an intensity of 0 dB will hide (make imperceptible) an added signal at 1.11 kHz with a level of -30 dB for a period of about 10 ms before the burst and as much as 50 ms after the burst. However, modern signal-processing techniques can still detect the existence of this added 1.11 kHz component even though the ear cannot.
> This is the basis of PPM and other similar watermarking technologies that use masking for determining the frequencies and intensity of the data that can be added for the station-identifying watermark.
> The PPM system constructs 10 spectral channels in the region from 1.0 kHz to 3.0 kHz. The original program audio energy in each channel is evaluated for its ability to mask an added component. If that masking energy is insufficient, nothing is added. Conversely, if the energy in a channel is large enough, a tone is injected, chosen from one of four possible frequencies within the channel. For example, the channel centered at 1058 Hz might have one of the following four frequencies injected: 1046, 1054, 1062, or 1070 Hz.
> Each of the four frequencies represents 2 bits of information. If we assume that this process repeats at a 500 ms rate, using all channels provides 40 bits per second or 2400 bits per minute of watermark code. Let’s further assume that a radio station is credited for a listener if any code is correctly detected within a 3-minute interval. With the very large number of encoded bits generated in 3 minutes (2400 x 3 = 7200 bits) and a station’s identification data needing perhaps only 50 bits, there is massive excess capacity for redundancy, error correction, and for audio that does not have enough high-frequency content for masking.
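The capacity arithmetic in the quoted passage is easy to reproduce. The sketch below uses only the figures given in the quote; the order in which 2-bit values map onto the four tones is my assumption, since the PDF excerpt does not say.

```python
CHANNELS = 10            # spectral channels between 1.0 and 3.0 kHz
BITS_PER_TONE = 2        # one of four frequencies per channel -> 2 bits
SYMBOLS_PER_SECOND = 2   # the process repeats every 500 ms

bits_per_second = CHANNELS * BITS_PER_TONE * SYMBOLS_PER_SECOND
bits_per_minute = bits_per_second * 60
bits_per_credit_window = bits_per_minute * 3   # 3-minute crediting interval

# How many times a ~50-bit station ID fits in one crediting window.
redundancy_factor = bits_per_credit_window // 50

# The four candidate tones quoted for the channel centered at 1058 Hz.
TONES_1058 = [1046, 1054, 1062, 1070]


def tone_for_bits(bits: int) -> int:
    """Map a 2-bit value (0..3) to a tone; the ordering is an assumption."""
    return TONES_1058[bits]


def bits_for_tone(freq_hz: float) -> int:
    """Decode by picking the nearest candidate tone."""
    return min(range(4), key=lambda i: abs(TONES_1058[i] - freq_hz))
```

With 144 copies' worth of room for a 50-bit ID, it is not surprising that the system tolerates long stretches where nothing can be injected.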
So if masking is used, I assume compressing the audio with any modern compression scheme from mp3 up should defeat that, shouldn't it (because they drop masked signals to save bandwidth)?
Depends. The Arbitron system works through the HD Radio codec, which is a wavelet codec. It is basically hybrid mp3 type coupled with high frequency reconstruction on the receiver side.
Interestingly, that literally means fake signals on the receiving end above 8 to 10 kHz! It was as low as 5 kHz when used for AM, and may still be. I have not kept up.
I could tell early on. It has improved a lot since then.
The Arbitron system appears robust. Noise, low signal quality, etc... do not generally impact it much. The effective bitrate needed is very low.
Given a larger sample of audio, it is likely to work.
A robust watermarking system will include some sort of error correction, so the answer is that it might, it depends on how much error it introduces.
A purpose built algorithm designed to thwart watermarking however is far more likely to be successful than a compression algorithm that is designed to maintain the integrity of the audio.
The phenomenon described by the quoted comment is called "temporal masking". There is "pre-masking", where a sound is rendered imperceptible by a sound that _follows_ it (your "forgetting" case). And there is post-masking, where a sound is imperceptible because of a masking sound that preceded it. And yes, this is due to inherent slowness / lack of temporal resolution in the auditory system.
Temporal masking is widely exploited in all kinds of lossy audio compression (MP3, AAC etc), to remove the data that cannot be perceived anyway.
We just don't resolve detail to that temporal degree. You can't really "listen between" the periods of a 100 Hz sound, so being unable to recognize a 10 ms event preceding a much louder one is expected.
This is an entirely fair comment. And it's typical of my experience as well, and I have a fair amount related to audio, though not as extensive as yours.
My mind works differently when it comes to language and the scope of possible meanings is something I always consider relevant.
What concerned me here was someone taking the colloquial definition of "ultrasound" literally, and making assumptions that are not valid in this context at all.
What the word actually conveys is both a matter of subtlety and frequency.
Turns out, having read the entire discussion, both are relevant in terms of threat assessment, and thinking about what is said more deeply can have a positive impact on a discussion of this nature.
All of which is why I chose to point out what "ultrasound" actually does mean linguistically.
Edit: In my experience, such uses can and do happen. I personally allow for it and use context to parse. Where there is ambiguity, I generally won't dismiss it out of hand.
Subsonic comes to mind here. As does the question why the word did not appear regarding these watermarks.
The answer may just be someone with far less domain expertise attempting to communicate.
No, it means "beyond." Like you point out, "across" means something else. The Latin for "above" is supra or super.
> ultra-, prefix:
> 1. Signifying ‘lying spatially beyond or on the other side of’
> 2a. With adjectives, signifying ‘going beyond, surpassing, or transcending the limits of’ (the specified concept).
> Etymology: Latin ultrā beyond, employed as a prefix in the post-classical ultrāmundānus ultramundane, and the later ultrāmarīnus ultramarine, and ultrāmontānus ultramontane.
Your quick trip through the etymology triggered an opsec thought:
It may be worth a suggestive talk to expand how people take words.
A pop culture reference would be Daniel Jackson from the series, "SG-1"
We may often be constrained in our ability to understand and assess by our own preconceptions relating to language.
"Ultrasonic" was interpreted very differently by any number of us having this watermarking discussion. How often do we make assumptions about the possible field of play based on language basics?
How often do those fail to be sufficiently inclusive?
I bet it happens more than we realize.
Seems like a good basis for a DEFCON talk. "Where is Daniel Jackson when your team needs him?"
I was focused on the etymology; the actual usage of "ultrasonic" is generally confined to high pitches, not low.
Worthwhile point still, though I wouldn't have responded had the commenter not stated a specific incorrect definition. How does this connect to OPSEC though?
The relation goes right to threat and solution scopes. In this case, someone working from an incomplete definition may well also work within an incomplete set of greater assumptions.
There is what it could mean, what we take it to mean, and what it does mean.
Where those overlap or not could have a significant impact on behavior.
I wonder if this is another marketing gimmick similar to the end-to-end encryption controversy they got into. I hope by ultrasonic they just mean beyond hearing and not really that the watermark lives exclusively in the ultrasonic frequency range.
Do they also talk about the process for identifying the participant who leaked the content based on the leaked recording? Do they need to retain the original copy of the recording to be able to extract the watermark?
This is fascinating and really made me think. The article is pitched at leakers but it could just as likely be pitched at journalists. If you consume news from an outlet that doesn’t follow The Intercept’s advice then complain immediately to the editor.
When you leak something it needs to be credible. Removing watermarks also reduces the fidelity and therefore the credibility. If 99% of the screen is blurred and the audio has been transcribed then how does the receiver know this is a real leak at all?
The answer lies in reputation. Leak high fidelity material to a trusted third party, usually a journalist. This can include just showing them the material though that involves meeting in person. They will verify the source material, summarise and down sample it to conceal the source actor, and maybe even destroy the source material itself.
The economics are simple: if you get a reputation for revealing sources then people will stop leaking secrets to you. Newsrooms that rebroadcast Zoom caps verbatim are revealing sources and need to clean up their act if we are to continue to rely on what’s left of The Fourth Estate.
I can think of other ways this sort of thing could occur, with the document being ‘real’. Some photocopiers use text recognition and so presumably just attempt a font match when printing. This has caused issues in the past and could cause this sort of confusion too (although this seems a smear attempt).
The article says CBS received downsampled copies. They failed to do due diligence, but they would have outed the hoax had they seen the “original” documents?
Well, CBS wanted a story to be true. But I assume it would have been much easier for experts to determine if something was typed a number of decades ago vs. laser printed recently. (As opposed to finding typographic discrepancies in a relatively low fidelity copy.)
Yes, and while they never properly owned up to this, it seems (from this article and others) that they learned from it and are trying to avoid a repeat.
Though I haven't seen them post an article describing in detail how watermarks on color laser printers work, that would be a bit too on-the-nose.
Not really. Imagine you want to leak a one page printed document.
If the covert signal (the employee ID of the leaker) is encoded with tiny dots then it could be filtered in or out with a band pass filter. Filtering out the data can be done with a fax machine if you can find one.
But if the covert data is embedded in the original signal at frequencies close to the signal “frequency” then you don’t stand a chance. The information band of written English provides a huge number of options for hiding other signals: spelling, choice of words, spacing, key phrases, capitalisation, the ordering of items in a list.
The government agency from where you are leaking — your adversary — has the original signal from which they adapted your slightly different copy that you leaked. All they have to do is hide 6 to 16 bits of data in 2kb of English language. This is trivial because they have the original plaintext with which to encode and later extract the signal.
How do you effectively tackle this, as a leaker? A journalist is the one type of filter capable of obscuring covert traffic at these frequencies. They read the document, summarise the implications, and then write that up. I probably wouldn’t let them keep the original text though.
Like another commenter mentioned, I doubt that the watermark here is super sophisticated, but the fact that it exists and is “unknown” creates a higher degree of risk for a would-be leaker. And that might be enough to stop some people from leaking.
That said, if you don’t work for a three letter government agency, in finance (especially at an investment bank) or at a corporation with tens of thousands of employees (and ideally, a tech company), there are plenty of non-technical reasons that leaking can be considered a relatively low-risk activity. The biggest reason is that the IT person tasked with finding a leaker, assuming it was from a meeting that many people attended, often isn’t paid enough and has a lot more valuable things to do than to try to play audio forensics. I know of several instances where companies have threatened to release the hounds, so to speak, to find a leaker, only for those hounds to be people who are either about to be laid off or who have just lost a sizable portion of their team. Not a lot of motivation for those people to really care, so they just tell the angry executive they tried but couldn’t figure it out, the executive is placated by trying, and everyone moves on to another crisis.
And of course, many of these leaks only matter if the recording itself is widely shared or published. If something is recorded but given to a news organization that is instructed not to publish the recording (or that does its own due diligence and decides not to publish the audio/video/document) and uses it only as a source, well, good luck. In the US, shield laws typically protect a news organization from being compelled to reveal its sources.
It’s like with screeners for the Oscars. The screeners will be watermarked with your name and that’s usually enough to keep them off of torrent sites, but that doesn’t mean you don’t have a Dropbox or Plex account full of them that you share with your close friends and family. Like, sharing my WGA screeners with my mom is about as low-risk as it gets.
Zoom has a less than stellar reputation now for security. Meanwhile this entire feature seems like a checkbox for customers, just one more feature to add to the list. There's only going to be a very small segment of customers going to look into the details. It may not be worth it to them yet to build something more sophisticated.
They got that reputation 6+ months ago at the beginning of the pandemic. Since then their revenue has skyrocketed, giving them the capacity to hire, contract, and license to improve their security. They haven't shed the reputation, but reputation management is a different problem than security. Certainly for something high-risk, I would not put much weight on "Well, lots of people think they are bad at security, so the watermarking probably isn't very good".
6-9 months to find people to hire, hire them, and get them onboarded enough to contribute meaningfully? That isn’t much time at all. Definitely wouldn’t rely on that if I was leaking something, but if they didn’t have it implemented well before the pandemic I doubt they would be there yet.
Maybe they bought a high quality, off-the-shelf watermarking tool? Maybe the person hired to do the watermarking is smarter than the group in charge of security?
Problems with security engineering can also be an issue of correctly managing priorities more than not having pure technical skill. Spending a ton of effort on watermarking instead of getting the basic security stuff working wouldn’t be that crazy a story in the industry.
Journalists/whistleblowers have had to deal with the same set of issues for digital images and other documents & media for a while now. Visible & invisible watermarks, custom metadata and even non-standard binary manipulation means that shared files are pretty much fully trackable, and complete anonymization is out of reach for everyone but the most technical users.
Dumb question. Could one take additional measures with media to disguise watermarking (e.g. rather than uploading an image, take a crap photo of the image on a screen using a physical camera)?
You can take additional measures, but you can never be sure that you've avoided all forms of watermarking. As an extreme example, suppose a company distributed 4 versions of a document to employees, each printed in red, blue, green, or black ink (a variant of steganography). Even though you've taken a crap photo, you still have the color in evidence, and they know which subset of the employee pool the leak came from.
Any number of techniques like this could be used, such that you can't be sure the "watermark" is gone without making the document functionally useless - especially if you don't know what the watermark is. e.g. maybe you take a black and white photo of the document above, but they also changed some of the wording on the page in each version.
+1 to my sibling comment's point about how each digital device you use to cover your trail will do its best to leave a new trail.
Another technique is variation of language (synonyms), orthotypography, variations in the white space of the document... so you can take a blurry B&W picture but they still got you because your version was the one saying "amazing" and not "fantastic" in the third paragraph and using the Oxford comma in the fourth paragraph, etc.
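For anyone who wants to see how little machinery the synonym trick needs, here is a toy sketch. The template sentence, the synonym pairs, and the `embed`/`extract` helpers are all invented for illustration; a real scheme would presumably be far less obvious.

```python
# Hypothetical synonym slots: each slot carries one bit (index 0 or 1).
SLOTS = [
    ("amazing", "fantastic"),
    ("begin", "start"),
    ("large", "big"),
    ("quickly", "rapidly"),
]

TEMPLATE = "The results were {0}. We {1} with a {2} sample and moved {3}."


def embed(recipient_id: int) -> str:
    """Render the template, choosing each synonym from one bit of the ID."""
    words = [pair[(recipient_id >> i) & 1] for i, pair in enumerate(SLOTS)]
    return TEMPLATE.format(*words)


def extract(text: str) -> int:
    """Recover the ID by checking which synonym of each pair appears."""
    tokens = set(text.replace(".", " ").lower().split())
    recipient_id = 0
    for i, (_zero, one) in enumerate(SLOTS):
        if one in tokens:
            recipient_id |= 1 << i
    return recipient_id
```

Four slots already distinguish 16 recipients, and the ID survives retyping, photographing, or changing the font, because it lives in the wording rather than in the pixels.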
One I vaguely remember reading about years ago (not as an example, as an actual instance) included where the word wrap occurred, so on some copies a word ended a line and on others the word started the next line.
Going to analog and back by e.g. photographing the screen or printing then re-scanning a document is a good way to ensure you've removed all document metadata, but brings its own challenges (see the article, where it mentions source camera identification), not to mention that the camera/scanner itself may add back its own metadata.
Watermarking may or may not survive that kind of process - depending on the kind of watermark it might be designed to still be detectable even in low-quality copies.
The more coarsely you filter the data (reducing resolution is essentially a low-pass filter on the image data), the more you reduce the bandwidth for a watermarking signal, but using spread spectrum and forward error correction techniques, the ratio of watermark to data can be brought arbitrarily low. There's no amount of obfuscation that will defeat watermarking if the algorithm/key is unknown and a huge amount of data needs to be released.
That is, maybe you can use video and audio filtering/manipulation to push the watermark bandwidth down to one bit of watermark per 1 GB of data, but with 100 participants and 7 GB of data, 7 bits is enough to identify the leaker.
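The arithmetic behind that claim, as a quick sketch. The helper names are mine; the 1 bit per GB rate is just the hypothetical figure from the comment above.

```python
def bits_needed(participants: int) -> int:
    """Minimum number of bits to give every participant a distinct ID."""
    return (participants - 1).bit_length()


def surviving_bits(data_gb: int, bits_per_gb: int) -> int:
    """Watermark bits that survive, given a residual embedding rate."""
    return data_gb * bits_per_gb
```

For 100 participants, `bits_needed(100)` is 7, which is exactly what 7 GB at one surviving bit per GB provides.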
Depends on the sophistication of the watermarking. Watermarks embedded in the media itself (audio or images) can often be more robust than the information in the media itself (i.e. even if you blur it so as to be unreadable the watermark can be extracted). It basically boils down to encoding a very small amount of data in the media as robustly as possible, and modern signal processing is very good at that.
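A toy illustration of why a tiny payload can be so robust: the sketch below spreads one bit across many samples using a keyed pseudo-random sequence and recovers it by correlation. This is a generic spread-spectrum idea, not a claim about what Zoom or any vendor does; the function names and parameters are assumptions for the example.

```python
import random


def chip_sequence(key: int, length: int) -> list[int]:
    """Pseudo-random +1/-1 sequence derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(length)]


def embed_bit(samples: list[float], bit: int, key: int,
              strength: float = 1.0) -> list[float]:
    """Add (bit=1) or subtract (bit=0) a faint keyed sequence."""
    sign = 1 if bit else -1
    chips = chip_sequence(key, len(samples))
    return [s + sign * strength * c for s, c in zip(samples, chips)]


def detect_bit(samples: list[float], key: int) -> int:
    """Correlate with the keyed sequence; the sign of the sum is the bit."""
    chips = chip_sequence(key, len(samples))
    correlation = sum(s * c for s, c in zip(samples, chips))
    return 1 if correlation > 0 else 0
```

Each sample is nudged by an amount far below the signal level, yet the correlation sum grows with the number of samples while the host signal and any added noise average out. Without the key, an attacker does not know which direction each sample was nudged.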
My statement was pretty bad, since it is very simplified and can be misunderstood (since the "they" is ambiguous). I posted the relevant links below, these explain the whole story way better.
> First Look Media’s decision to fire me after I raised concerns about source protection and accountability – rather than to demote or seek the resignation of anyone responsible for the journalistic malpractice, cover-up, and retaliation – speaks to the priorities of The Intercept’s Editor-in-Chief Betsy Reed and First Look Media’s CEO Michael Bloom.
If you don't know the story, it sounds like they left the boat because of their own mistakes.
Two founding members have left because of the paper's direction failing to protect sources out of carelessness.
They did much worse: when asking the government for comment on the leaked material, they sent a copy to the government. Result: Reality Winner was arrested even before they published the story.
An "in a hurry" hack might be to run it through a bad phone line? Still understandable, but you can probably guess that the phone provider is going to crush the spectrum and significantly reduce the bit depth.
In January 2018, a YouTube uploader who created a white noise generator received copyright notices about a video he uploaded which was created using this tool and therefore contained only white noise.
People are used to artifacts in Zoom calls, the sound quality ain't always the best.
I also noticed that Zoom corrects for lag in my connection by speeding up and slowing down speech. (I noticed that when my counterpart was running a metronome.)
Zoom could artificially introduce micro-lags to encode data. E.g. as a human I can't really tell whether some sound starts on an even or an odd millisecond.
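A minimal sketch of that even/odd idea, purely hypothetical: each sound onset is delayed by at most 1 ms so that the parity of its timestamp carries one bit. Nothing here is based on how Zoom actually works, and a real scheme would need to survive resampling and re-encoding, which this one would not.

```python
def embed_bits(onsets_ms: list[int], bits: list[int]) -> list[int]:
    """Delay each onset by at most 1 ms so its parity matches the bit."""
    marked = []
    for onset, bit in zip(onsets_ms, bits):
        if onset % 2 != bit:
            onset += 1
        marked.append(onset)
    return marked


def extract_bits(onsets_ms: list[int]) -> list[int]:
    """Read the bits back from the parity of each onset time."""
    return [onset % 2 for onset in onsets_ms]
```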
An earlier comment[0] says there's a reliable method for including ids and things in AM audio at 5 kHz bandwidth.
POTS is a few kHz. So less, but without looking into Nielsen's fingerprinting methods I wouldn't say you could assume that the fingerprint wouldn't be preserved running it through a phone line first.
Because users have attendees leaking meetings and asking Zoom if there is any way to identify the leaker. This in turn informs Zoom that the ability to identify leakers is a desired feature for users. This might make the product seem more "secure" and "safe".
Since when is protecting privacy shady? There are a lot of confidential relationships that previously relied on meetings behind closed doors that now rely on teleconferencing: therapists, healthcare, courts, lawyers, students, etc. Those who are exposing their private information in confidence absolutely deserve to be protected.
Any of them, given the tools exist to do so. Also, the organization that is attempting to protect from the leak might not be the same organization that would be recovering from one.
It’s a strange and peculiar concept to most people. Even if you knew how you might approach removing such a watermark, you wouldn’t know how sophisticated it is, so you wouldn’t be able to know whether you’d succeeded or not. I’d guess most zoom users wouldn’t even know where to start with removing such a watermark.
I personally doubt it’s particularly sophisticated at all. But the fear of getting caught it creates would be enough to deter a significant portion of potential leakers.
Why is any watermarking necessary at all? Because DLP (which includes anti-leaking control) is a huge concern for most businesses, and working from home makes the problem even more serious. Zoom is trying very hard to position themselves in this market (and doing a rather good job of it), so in that context the feature makes perfect sense.
Example: Quarterly results for a publicly traded company before they are published. If those are leaked before the official time / date, whoever gets it first has an unfair advantage. They may get charged with insider trading, and the leaker as well.
>Quarterly results for a publicly traded company before they are published. If those are leaked before the official time / date, whoever gets it first has an unfair advantage. They may get charged with insider trading, and the leaker as well.
If this kind of thing is now done on a Zoom call, how is it any different than prior to Zoom being in the meeting, hearing the information and passing it on via another means, like a phone call? What value would leaking the call as in your example have?
I'm struggling a bit to think of any examples of private information being leaked that have really changed because of Zoom. Trusted employees are able to leak private corporate information, always have been.
A leaked Zoom meeting is a lot harder to deny. If a reporter gets a call saying “The CEO of BigCorp said a racial slur in the cafeteria” it’s hard to prove. A leaked Zoom meeting is more concrete evidence.
Have you ever had a meeting where sensitive information was shared?
Zoom meetings are like those, but with the sharing of sensitive information transmitted over the Internet. Someone could easily record their screen and audio and capture said sensitive information for subsequent sharing -- or "leaking"-- with someone else.
Would be pretty clever if Zoom disguised the audio watermarking as the kind of distortion you get when someone's audio / video is laggy and turns into a choppy mess for a fraction of a second. You wouldn't suspect it unless you knew!
They could easily jitter audio/video frame delay, or the codec noise in the video (main and delta frames, quantization, colour gamut) per receiver. If it is sharing a document view it could jitter the pixel colour (like a printer dither) or the position of characters.
It's basically impossible to be sure, though if you had multiple endpoint recordings you could probably identify likely deltas.
Examining the state of the zoom client might also be useful, because it is free to do this client side. Unless the IDs are cryptographically generated server side, you could tweak the client side to show someone else's client ID.
> if you had multiple endpoint recordings you could probably identify likely deltas.
I would think this to be the case for blurring the watermarks in a video feed, but since every Zoom feed on a single call is so different, that might not work. Most of the deltas are not intentional.
They could do a lot of clever stuff, maybe a different compression ratio for each participant. Even the arrangement of the participants' thumbnails on the screen: if I were working on the "leaks mitigation" team, I would suggest that the server store the arrangement per user and, if the user switches the faces around in the client app, record that order too!
I'm super confused: if this is a feature for corporate zoom accounts, surely someone on Hacker News (or at The Intercept) has access and can mess with recorded audio to test what sort of manipulation can defeat the watermark. Unless you have to ask Zoom to process the watermark every time?? (If this is widespread, why has no one with knowledge of the process commented?)
Do not take this as advice on removing watermarking, but it sure would confuse the decoding team at Zoom to encounter a file that had been sent over a watermarked Zoom connection more than once.
I wonder if the watermark can only be decrypted by Zoom and is unique every time; if not, you could fake it and blame someone else intentionally. From the article, it seems so, but I hope it’s done properly.
techniques that i've seen in the past are indistinguishable from noise unless you have the correct key. that is, they use the fact that a key is a pseudorandom bitstream and that audio streams often have pseudorandom noise, so ciphertext is ideal for adding into the noise.
i think i presented this paper, from two decades ago, at a course journal club on the topic a decade ago:
The whole issue seems to be dedicated to watermarking, but relevant to this discussion is also this article:
"The basic principle borrows from spread spectrum communications. It consists of addition of an encrypted, pseudo-noise signal to the video that is invisible, statistically unobtrusive, and robust against manipulations."
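A minimal sketch of the principle in the quoted description, using noise as a stand-in for audio. The key strings, the `strength` value, and the function names are all assumptions made for illustration; this is not the scheme from the paper or from Zoom.

```python
import random


def pn_sequence(key, length):
    """Pseudo-noise chips (+1/-1) derived deterministically from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed(samples, key, strength=0.01):
    """Add the key's pseudo-noise sequence at a level far below the content."""
    pn = pn_sequence(key, len(samples))
    return [s + strength * c for s, c in zip(samples, pn)]


def correlate(samples, key):
    """Average product of the signal with the key's pseudo-noise sequence."""
    pn = pn_sequence(key, len(samples))
    return sum(s * c for s, c in zip(samples, pn)) / len(samples)


# Stand-in for audio: Gaussian noise ten times louder than the watermark.
rng = random.Random(0)
host = [rng.gauss(0.0, 0.1) for _ in range(200_000)]
marked = embed(host, key="receiver-42")

right_key = correlate(marked, "receiver-42")    # close to the embed strength
wrong_key = correlate(marked, "someone-else")   # close to zero
```

Without the key, the added sequence is just more noise; with it, the correlation stands out clearly because the host content averages away over many samples.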
Given that watermarking is so prevalent, journalists may have to begin treating leaked documents and files in the same way they treat anonymous sources. They can reword and refer to a document's contents, but they can't share copies any more than they can share a source's driving license.
Is zoom or anyone else obligated to tell us about all the means of identifying the original viewer/leaker?
Shouldn't we assume there are all kinds of steganography being employed? It feels like enough information could easily be hidden in PDFs, images, videos, and you wouldn't know unless you're doing bit-comparison to others' versions of the same.
They might also be using a small range of methods, and you won't know which methods you haven't seen yet, so you can't make assumptions about the next file based on the previous one.
Seriously though, I doubt what Zoom is adding (specifically for audio) is anything that new. Does anyone have experience removing this type of stuff? Would something like a bandpass filter for say, 100 Hz-15 kHz work?
From my perspective, the only foolproof way of removing all audio watermarks from a conversation is to run individual speaker detection, STT detection, and voice cloning algorithms to "recreate" the conversation from scratch.
Even things like background electrical hum have been used in audio forensics.
I've seen the demos of AI systems that can be trained on an individual's voice, and generate new speech in the same voice. If I was waterprinting a meeting, I would train a system like this on the fly on the people in the meeting and use it to dynamically insert filler words (um...) in a unique pattern into the audio stream for each member of the meeting. That would defeat any audio filter tricks or recreating the meeting from scratch, so you would want to summarize/rephrase the meeting too.
> I've seen the demos of AI systems that can be trained on an individual's voice, and generate new speech in the same voice. If I was waterprinting a meeting, I would train a system like this on the fly on the people in the meeting and use it to dynamically insert filler words (um...) in a unique pattern into the audio stream for each member of the meeting. That would defeat any audio filter tricks or recreating the meeting from scratch, so you would want to summarize/rephrase the meeting too.
But wouldn't doing something like that make the recording seem like a deepfake, potentially reducing its credibility? To an outside observer, it may make a genuine recording appear to be fake.
I'm reminded of the controversy over the faked Bush Texas ANG documents. They were discovered by randos on the internet realizing the text looked more like the output of a modern version of Word than of the 70s typewriter they should have been written on if genuine. Imagine a similar but genuine document where the content was deliberately retyped using anachronistic equipment to obscure the leaker's identity.
I don't remember the details, but there was a lawsuit (in India IIRC) over some contractual documents, which were proven to be fake since they used Microsoft's Calibri font, but were supposed to be from before Calibri et al. was released.
you could let your recreation algorithm do the same thing: add and remove filler words randomly. This way, in the end you can't be sure whose audio it was :)
Also need to quantize all of the pauses between speakers and the time of each new speaker starting, and the rate of speaking, since as others have pointed out, Zoom varies these anyway for delay compensation.
The journalist, editor, or their lawyer might need something genuine to be comfortable publishing. If that journalist and editor have a good reputation, however, the general public shouldn't need that. "The Intercept" may not have that good reputation? They seem at least as trustworthy as the typical USA war media firm to me.
I mean, should be as easy as recording your own zoom call with the Watermark enabled and a well known audio track (a metronome, dead silence, etc). Rip audio from the recording and examine it for anything outside of the metronome.
Probably need to do this several times for different participants and meetings to get an idea of what the watermark looks like and where in the spectrum it sits.
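A rough sketch of that experiment, with a simulated capture instead of a real recording. The 19 kHz component stands in for a watermark and is purely hypothetical, as are the sample rate and gain; a real capture would also need time alignment and would carry codec artifacts in the residual.

```python
import math

RATE = 48_000  # samples per second (assumed)


def tone(freq_hz, amplitude, n=RATE):
    """One second (by default) of a sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n)]


def best_gain(reference, captured):
    """Least-squares gain that best maps the reference onto the capture."""
    num = sum(r * c for r, c in zip(reference, captured))
    den = sum(r * r for r in reference)
    return num / den


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Known test track: a 440 Hz tone.
reference = tone(440, 0.5)

# Simulated capture: the track at lower volume plus a faint 19 kHz component.
hidden = tone(19_000, 0.001)
captured = [0.8 * r + h for r, h in zip(reference, hidden)]

# Remove the known track; whatever is left was added along the way.
gain = best_gain(reference, captured)
residual = [c - gain * r for r, c in zip(reference, captured)]
residual_rms = rms(residual)
```

Repeating this for several participants and meetings, then comparing the residuals, would show which part of the leftover signal changes per receiver.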
That is my first thought. I have a few specialty plugins in Wavelab that I would be curious to run a Zoom capture through. If it's ultrasonic, then as you say a low pass filter should suffice, but there's a million ways to encode data...
Is a third-party recording software detection capability likely to be implemented in the future? And how successful would it be on Windows vs Unix & Unix-like systems?
Not necessarily. Watermarking an audio stream like this wouldn't require that high of a bit rate. It could easily be hidden "under" the content using coding techniques like direct-sequence spread spectrum.
A real world example is GPS, which uses a spreading code to provide about 30 dB of gain. GPS signals aren't directly observable relative to the noise floor in many receivers. It's not until after the signal is "de-spread" that it becomes observable in a spectrogram. This process requires prior knowledge of the signal structure.
In short, if you don't need to send data at high rate there are many ways to hide your signal.
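The gain figure follows from the spreading factor: processing gain in dB is 10·log10 of the number of chips per symbol. A small worked calculation, where the GPS C/A code length of 1023 chips is a known value and the audio numbers (one chip per sample at 48 kHz, a 32-bit ID repeated once a minute) are made-up assumptions:

```python
import math


def processing_gain_db(chips_per_symbol):
    """Spread-spectrum processing gain in dB for a given spreading factor."""
    return 10.0 * math.log10(chips_per_symbol)


# GPS C/A code: 1023 chips per code period, about 30.1 dB.
gps_gain_db = processing_gain_db(1023)

# Hypothetical audio watermark: one chip per sample at 48 kHz,
# a 32-bit receiver ID repeated once per minute.
chips_per_bit = 48_000 * 60 / 32
audio_gain_db = processing_gain_db(chips_per_bit)
```

Under those assumptions the audio case gets roughly 49.5 dB of gain, which is why a low-rate ID can sit well below the audible content and still be recoverable.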
We can thank Michael Ossmann for his DC25 talk about detecting and pulling DSSS signals out of noise. He also posts proof of concept code to do just that, including also detecting the chip sequences and chiprate. https://hackaday.com/2017/07/29/michael-ossmann-pulls-dsss-o...
We're also dealing with only 48 kHz, which is 2-3 orders of magnitude less than what SDR people are accustomed to dealing with.
And we can also make our own detection systems by having a 3 person "meeting" with fake audio, with this fingerprinting on, and then comparing the 3 recordings.
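A toy version of that three-recording comparison. The per-receiver mark here is a simple keyed ±1 sequence, which is an assumption for illustration only. Real recordings would also differ for unintentional reasons such as packet loss and jitter.

```python
import math
import random


def mark(samples, receiver_id, strength=0.005):
    """Add a receiver-specific pseudo-noise sequence (hypothetical scheme)."""
    rng = random.Random(receiver_id)
    return [s + strength * rng.choice((-1.0, 1.0)) for s in samples]


def diff(a, b):
    return [x - y for x, y in zip(a, b)]


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# The same "fake audio" is sent to all three participants.
rng = random.Random(1)
shared = [rng.gauss(0.0, 0.1) for _ in range(50_000)]
recordings = {name: mark(shared, name) for name in ("alice", "bob", "carol")}

# Subtracting two recordings cancels the shared content and leaves
# only the difference between the two marks.
delta_ab = rms(diff(recordings["alice"], recordings["bob"]))
delta_bc = rms(diff(recordings["bob"], recordings["carol"]))

# Control: two captures for the same receiver are identical in this toy model.
delta_aa = rms(diff(recordings["alice"], mark(shared, "alice")))
```

The pairwise deltas are small compared with the content but clearly non-zero, while the same-receiver control is exactly zero, which is the signature one would look for.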
Maybe, maybe not. You can probably make statistical changes to the lowest bits in certain samples or something that would be quite difficult to detect.
Such small changes in sample value would probably get lost quite easily after being encoded with some codec. And I don't see them streaming un-encoded audio. Right?
Yeah that's probably true but you can do something similar with the encoded audio (change the bits of the quantized frequency domain representation in whatever codec they use). But that would get lost on reencoding in a different format, probably.
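A small sketch of the least-significant-bit idea from this exchange, on 16-bit PCM sample values. The sample values and the "lossy" step (dropping the low four bits as a crude stand-in for a codec) are assumptions for illustration.

```python
def embed_lsb(pcm, bits):
    """Overwrite the least significant bit of the first len(bits) samples."""
    out = list(pcm)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | b
    return out


def extract_lsb(pcm, count):
    """Read back the least significant bit of the first `count` samples."""
    return [s & 1 for s in pcm[:count]]


def crude_requantize(pcm, dropped_bits=4):
    """Stand-in for lossy coding: throw away the lowest bits of each sample."""
    return [(s >> dropped_bits) << dropped_bits for s in pcm]


pcm = [1200, -873, 15, 30211, -4, 998, -12045, 77]
receiver_bits = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_lsb(pcm, receiver_bits)
clean_read = extract_lsb(marked, len(receiver_bits))
lossy_read = extract_lsb(crude_requantize(marked), len(receiver_bits))
max_change = max(abs(m - p) for m, p in zip(marked, pcm))
```

The mark changes each sample by at most one step and reads back perfectly from the untouched samples, but even this crude requantization wipes it out, which matches the point that a naive LSB mark would not survive a codec.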
The audio watermark seems trivial to work around, unless there's more to it than they're disclosing. A low- and high-pass filter may be all it takes to block it.
The visual watermark is more tricky, but thanks to streaming video piracy, we have a bunch of out-of-the-box watermark removal techniques.
It would be fairly trivial to encode information in single frame swaps (both audio and video) in such a way that these swaps are imperceptible and irreversible. There are many compression artifacts that could be used similarly (e.g. a 1-bit increase in average screen color is rarely going to be perceptible).
Regarding hi/low pass, theoretically this should be fairly simple to defeat by spreading the information across multiple frames.
Note: I know next to nothing about watermarking, my comment is just a purely hypothetical attack on your trivial assumption. The trivial work around has trivial work around work around... ;) <insert that scene from the big hit>
This is DRM. I actually think that DRM has its place, it's just inaccessible to the people that need it. Imagine if you could employ these techniques when sharing secrets with a friend? Intimate photos? Wouldn't you want to be able to trace the provenance of a leak?
Hmm, I record meetings with Camtasia instead of fumbling around with the built-in functionality (especially when it's a meeting you are not hosting). I wonder to what degree that gets rid of these invisible watermarks.
How do you remove an "invisible" watermark from an audio file? Is it just a frequency outside of normal hearing that can be removed by re-compressing the audio to remove the high and low range sounds?
I bet the system in use looks a lot like Arbitron's system.
It is near impossible to hear. And when someone does, it basically sounds like very minor league codec artifacts.
Someone very familiar with a given voice, or other content, may possibly be able to tell. But an A / B test of this VS. codec artifacts will likely prove inconclusive. It is hard given clean audio to compare against.
Once things have gone through a codec, all bets are essentially off.
This is not to say a software detector could not be used. The average phone should do fine.
I wonder whether using external cameras and microphones to record the screen directly would deal with all these kinds of invisible / inaudible watermarks?
A few years ago I was in a team and one of our coworkers was (perhaps due to sexism) continually stuck with charliework like keeping notes of meetings with clients. She eventually started recording all meetings in her cellphone and summarizing key points at a later leisurely pace so she could also effectively participate in the meeting.
This was like before the Olympics in Brazil, so it must have been 2015, maybe earlier. Since then, I've always assumed that someone is secretly recording all meetings where I have to wear a suit.
Much, much earlier, during the 2008 meltdown, meetings sometimes ended with loosen-up remarks that one wouldn't want recorded (one client was a big corp whose CEO was known to have an extremely attractive wife). These were valuable bonding moments, but maybe they underscored a corporate culture that had its downsides (like making the pretty girl in the team do all the charliework).
Also remember you may record your own reflection in the screen when using a camera to capture the video. Film in a dark, quiet room would be my approach.