
The two cropped files are different. They are built deterministically, so you know whether the redaction was of a red circle or of a blue circle.

Typically that applies to things like a redacted PDF lossily compressed as an image, for which you already have a few candidate words (so you can brute-force). You try them one at a time and see whether the compression artifacts match.
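A minimal sketch of that candidate test, under loud assumptions: `toy_jpeg` is a 1-D, single-block stand-in for JPEG (DCT plus uniform quantization), not a real encoder, and `Q`, `BACKGROUND` and `CANDIDATES` are hypothetical values chosen for illustration. The redaction is modeled as simply hiding the last four samples after compression, with no second lossy re-encode.

```python
import math

N = 8   # one 8-sample block, a 1-D stand-in for JPEG's 8x8 blocks
Q = 20  # hypothetical quantization step (larger = lossier)

def basis(k, i):
    """Orthonormal DCT-II basis function k evaluated at sample i."""
    scale = math.sqrt((1 if k == 0 else 2) / N)
    return scale * math.cos(math.pi * (i + 0.5) * k / N)

def toy_jpeg(block):
    """Transform, quantize and decode one block (deterministic and lossy)."""
    coeffs = [sum(x * basis(k, i) for i, x in enumerate(block)) for k in range(N)]
    quantized = [Q * round(c / Q) for c in coeffs]
    decoded = [sum(c * basis(k, i) for k, c in enumerate(quantized)) for i in range(N)]
    return [max(0, min(255, round(v))) for v in decoded]

BACKGROUND = [200, 200, 200, 200]  # visible half of the block (paper white)
CANDIDATES = {                     # hypothetical contents of the hidden half
    "blank": [200, 200, 200, 200],
    "ink":   [0, 0, 0, 0],
    "half":  [0, 0, 200, 200],
}

def visible_artifacts(hidden):
    """Compress background + hidden content, then keep only the unredacted half."""
    return toy_jpeg(BACKGROUND + hidden)[:4]

def guess(leaked):
    """Return the candidate whose simulated artifacts are closest to the leaked ones."""
    def mismatch(name):
        simulated = visible_artifacts(CANDIDATES[name])
        return sum(abs(a - b) for a, b in zip(simulated, leaked))
    return min(CANDIDATES, key=mismatch)

leaked = visible_artifacts(CANDIDATES["half"])  # all the attacker gets to see
print(leaked, "->", guess(leaked))
```

The visible half started out as a flat 200 in every case, so any difference between the leaked samples comes only from quantization noise that depends on what was hidden. This works here because the candidate list is tiny and the encoder settings are known exactly.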

The uncropping algorithm is pretty straightforward in theory: remove the JPEG artifacts, fill the cropped region with a candidate image portion x, compress, and compare the produced artifacts to the cropped image's artifacts; then try a neighboring candidate x + eps*N(0,1) and optimize (aka random search).
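The loop described above could be sketched roughly as follows. This is only an illustration under assumptions: `toy_jpeg` is a 1-D single-block stand-in for a JPEG encoder, the artifact-free background is assumed known, and `Q`, `eps`, `steps`, the starting point and the `secret` values are all hypothetical.

```python
import math
import random

N, Q = 8, 20  # block length and a hypothetical quantization step

def basis(k, i):
    """Orthonormal DCT-II basis function k evaluated at sample i."""
    scale = math.sqrt((1 if k == 0 else 2) / N)
    return scale * math.cos(math.pi * (i + 0.5) * k / N)

def toy_jpeg(block):
    """Transform, quantize and decode one block (deterministic and lossy)."""
    coeffs = [sum(x * basis(k, i) for i, x in enumerate(block)) for k in range(N)]
    quantized = [Q * round(c / Q) for c in coeffs]
    decoded = [sum(c * basis(k, i) for k, c in enumerate(quantized)) for i in range(N)]
    return [max(0, min(255, round(v))) for v in decoded]

BACKGROUND = [200, 200, 200, 200]  # known, artifact-free visible samples

def loss(hidden, leaked):
    """Squared distance between simulated and observed artifacts in the visible half."""
    simulated = toy_jpeg(BACKGROUND + hidden)[:4]
    return sum((a - b) ** 2 for a, b in zip(simulated, leaked))

def random_search(leaked, steps=2000, eps=25.0, seed=0):
    rng = random.Random(seed)
    best = [128, 128, 128, 128]          # arbitrary starting candidate x
    best_loss = loss(best, leaked)
    for _ in range(steps):
        # Neighbor candidate x + eps * N(0, 1), clamped to the valid pixel range.
        trial = [max(0, min(255, round(v + eps * rng.gauss(0, 1)))) for v in best]
        trial_loss = loss(trial, leaked)
        if trial_loss <= best_loss:      # keep moves that do not make things worse
            best, best_loss = trial, trial_loss
    return best, best_loss

secret = [30, 220, 90, 160]              # hypothetical hidden content
leaked = toy_jpeg(BACKGROUND + secret)[:4]
start_loss = loss([128, 128, 128, 128], leaked)
found, final_loss = random_search(leaked)
print("start loss:", start_loss, "final loss:", final_loss, "candidate:", found)
```

The loss can only go down or stay flat, but a loss of zero does not mean `found` equals `secret`: several hidden inputs can produce the same visible artifacts, which is why a prior on the hidden content matters.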

The artifacts come from quantized DCT (Fourier-like) coefficients, so the distance between artifacts isn't too irregular.

The "remove JPEG artifacts" step can range from really simple to really hard depending on the class of problem you have.

If the background image is digitally generated (like the red and blue circles in our example) or comes from a PDF file, you can get the uncompressed version without mistakes.

If the background image is something like a photo, then you need a little finesse: you need to estimate the compression level, then run a neural network on the areas outside the cropped region (4 different images: above, below, left and right of the cropped region) that removes the artifacts (something like https://vanceai.com/jpeg-artifact-removal/ but fine-tuned to your specific compression level), so you can estimate the artifacts. Then you search for the image inside the region (possibly with a neural net prior, but that increases the probability of data hallucination) such that the JPEG artifacts of the compressed reassembled image are close to the JPEG artifacts of your cropped image.




There is far less information in the artifacts than in the part of the image that was redacted. And of course, you need to replicate the process of redacting the original image, which requires you to know the algorithm that encoded the image. It also, if I'm not misunderstanding, requires you to have a version of the redacted image without the artifacts from the redacted section (or a second identical version with a different redacted graphic, like the red and blue circles).

Do you have a proof of concept of the technique you're describing? Otherwise I remain skeptical.


I get the approach. I'm just saying your demonstration doesn't show any proof of actually uncropping the image; rather, it shows that if you have the original image and encoding settings, the artifacts will match. This alone does not imply the approach can be used to uncrop; it just shows the JPEG encoder used is deterministic. It's like if I used the original 2048-bit RSA key to prove RSA is insecure by saying "and then you can just brute force it". I haven't actually shown that brute-force generating the original key to create a matching output is something anyone can be expected to do, is something to be worried about, or means using RSA keys is a "classic mistake beginners make"; rather, I've just demonstrated that RSA is deterministic, and I look a bit alarmist.

One important piece overlooked by not doing an actual demonstration of blind reconstruction is that lossy compression is naturally not a 1:1 mapping of original sources to compressed sources. If it were, it'd be lossless compression! This means your theorized recovery process ends early, as it assumes that the first match is the only match and therefore the original. Had the claims been fully demonstrated instead of assumed to be the same as recreation, this would have been accounted for.
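The many-to-one point can be seen even in a toy setting. The sketch below is not JPEG: `toy_lossy` is a hypothetical two-pixel transform (quantized sum and difference) with an assumed step `Q`, used only to count how many distinct inputs collapse onto each output.

```python
from collections import defaultdict

Q = 16  # hypothetical quantization step

def toy_lossy(a, b):
    """Encode a pixel pair as quantized (sum, difference), a crude 2-sample transform."""
    return (round((a + b) / Q), round((a - b) / Q))

# Group every possible 8-bit pixel pair by the code it compresses to.
preimages = defaultdict(list)
for a in range(256):
    for b in range(256):
        preimages[toy_lossy(a, b)].append((a, b))

total_inputs = 256 * 256
distinct_outputs = len(preimages)
largest_bucket = max(len(pairs) for pairs in preimages.values())
print(total_inputs, "inputs ->", distinct_outputs, "outputs; up to",
      largest_bucket, "inputs share one output")
```

Any search that stops at the first input matching a given output has only found one member of that output's bucket, not necessarily the original.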

In practice a JPEG block represents an 8x8x3 = 192 byte / 1536 bit original source. From that we take the lossy compression and get some small number of representing bits back (which is encoder- and settings-dependent), let's say 154 (1/10th size) for discussion. Of that, some number of bits is going to be used to accurately encode the black square; let's say 1/2, just to be generous to the extraneous noise, and let's also be generous and say the re-encode from original JPEG to obfuscated JPEG was nearly lossless (i.e. "100" quality) to help the numbers further. We're now at 154/2 = 77 bits representing 1536, in the generous case. 2^1536/2^77 = 2^1459, or about 1.6 * 10^439 possible original matches. Even assuming the image is largely losslessly compressible, the numbers still don't come out in good favor. For most cases, like the pictures example, this rules out finding the original being practical without some additional guidance that hasn't been demonstrated either (and maybe there is some other guidance! It's just not shown what that would be or why we should assume it exists).
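A quick check of that arithmetic, using the same assumed figures as the paragraph above (154 compressed bits, half of them spent on the black square). The variable names are just for illustration.

```python
import math

source_bits = 8 * 8 * 3 * 8          # one 8x8 RGB block: 192 bytes = 1536 bits
compressed_bits = 154                # assumed ~1/10th of the source size
leaked_bits = compressed_bits // 2   # assumed half left after encoding the black square

# Originals per observable pattern: 2^1536 / 2^77 = 2^1459, expressed in base 10.
log10_count = (source_bits - leaked_bits) * math.log10(2)
exponent = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent)
print(f"2^{source_bits - leaked_bits} ~= {mantissa:.2f} * 10^{exponent}")
```

The count comes out near 1.6 * 10^439 under these assumptions; the exact constant matters far less than the size of the exponent.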

What would be a really good demo is focusing on something minimal (e.g. text instead of images) and regenerating the source material without knowing it beforehand. E.g. it's safe to assume the typeface of a document that was screenshotted and later obfuscated, so all you need to do is show that one understandable string of text has artifacts that match a short whiteout/blackout on the document. This is still an extremely hard problem, but information theory tells us it might be (not that it is) at least reasonably possible.


You can use a well-known approach like https://github.com/bishopfox/unredacter

I am just showing that JPEG compression casts a digital shadow on itself that can be outside of the cropped area, which many people find unintuitive.

Matching the shape of this digital shadow (which is not very small: all the highlighted pixel regions contain some info), combined with an educated prior on the missing portion (one that must provide a way to represent the hidden portion in fewer bits than you can get from the shape of the shadow), can help you recover the data.


You're getting downvoted a lot for some reason, but I just thought I'd say that I think this technique has potential.

It'll be fiddly to exploit (like the "unredacter" you linked), but I think you could get a text-recovery PoC going. Try to craft an "ideal scenario" (for an attacker, i.e. ideal censor-bar placement and font size), see if you can exploit it in practice, and build from there.


> You can use a well-known approach like https://github.com/bishopfox/unredacter

If it's so easy and already done, why not show it actually reconstructing real examples like that page? The reality is there is no existing well-known approach for recovering data in the exact way you laid out. Similar approaches, sure, but they rely on different data sources (such as downsampled versions of text) instead of artifacts. That it is feasible in the easy case does not mean it is feasible in the hard case; why should the two be expected to hold the exact same amount of information about the source? Even if they do, the method of reconstruction will surely be different.

> I am just showing that JPEG compression casts a digital shadow on itself that can be outside of the cropped area, which many people find unintuitive.

That I agree with 100%, and it's a great thing to show; it's just a bit overreaching to then name that "jpguncrop" with a description referring to aCropalypse. Maybe it could lead to something like that, but what's in the repository does not match the name or description.

> Matching the shape of this digital shadow (which is not very small: all the highlighted pixel regions contain some info), combined with an educated prior on the missing portion (one that must provide a way to represent the hidden portion in fewer bits than you can get from the shape of the shadow), can help you recover the data.

Great, you have a theory - demonstrate! I'll be the first to comment on how clever, persevering, and well executed the work was if you post a thread showing it works. Until that point, though, it isn't so just because you think it would work.

Here is a great test case: https://i.imgur.com/sAEpfQ6.jpg. The font size should be small enough that data from the characters leaks out of the obfuscated area. I'm optimistic this blind reconstruction case is feasible given the additional constraints compared to the circle test, but it would probably take a decent chunk of new work.



