`vis`: display non-printable characters in a visual format
`cat -e`: Display non-printing characters and display a dollar sign (`$') at the end of each line.
`hexdump -c`: Display the input offset in hexadecimal, followed by sixteen space-separated, three column, space-filled, characters of input data per line.
`od -a`: Output named characters.
Or `cat -v`, the namesake of http://cat-v.org/. :P
It really is a treasure. Take the time to browse; it's well worth it if you're somewhat interested in Unix history and not overly familiar with the subject already.
(They give off rays or something ... took a cursory look to see if I was being unfair, and booom - https://www.gkayaalp.com/blog/20170206_emacs-argument.html )
TBH I do actually like Unix and Plan9, but having read a big part of cat-v website in the past, most of the views there are fanatical, and are more bitter, cynical and disingenuous than I can ever force myself to be. Especially the way incredibly useful and important OSS projects are denigrated and misrepresented is really annoying (see the harmful stuff section, for example).
One part is a matter of taste, and having been around both styles, I prefer the 'keep it simple' approach. Too many functions/options overload my simple brain ;-)
I'll also add my favorite companion to `od` and hexdumps: `man ascii`.
P.S. May we please have a basic Markdown parser for HN?
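Along the same lines as `od -a` and `man ascii`, here's a minimal Python sketch (the behaviour and naming are my own, nothing standard) that prints the Unicode name of every code point on stdin, which makes zero-width characters stand out immediately:

    # Print each code point of stdin with its Unicode name (zero-width chars become obvious).
    import sys
    import unicodedata

    for ch in sys.stdin.read():
        name = unicodedata.name(ch, "<unnamed/control>")
        print(f"U+{ord(ch):04X}  {name}")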
$ apt-cache show xxd | grep Source:
Source: vim (2:8.0.1453-1)
I think DOM mutation observers would be a huge hit on performance.
As others have mentioned, I could see this being used regularly by someone who plays Eve
If you're an English speaker who doesn't often handle languages that benefit from zero-width characters, you could just have a listener scan your clipboard for zero-width characters, silently strip them, and then re-populate the clipboard.
I've actually been thinking about building a clipboard filter for Windows a lot lately; I'm getting tired of copying text to the address bar or Notepad just to strip text formatting.
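Here's a rough sketch of that clipboard idea in Python, assuming the third-party pyperclip package for clipboard access and my own, non-exhaustive pick of characters to strip:

    # Poll the clipboard and silently drop zero-width characters (a sketch, not production code).
    import time
    import pyperclip  # third-party: pip install pyperclip

    # Non-exhaustive set of zero-width / invisible format characters.
    ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

    last = None
    while True:
        text = pyperclip.paste()
        if text != last:
            cleaned = "".join(ch for ch in text if ch not in ZERO_WIDTH)
            if cleaned != text:
                pyperclip.copy(cleaned)
            last = cleaned
        time.sleep(0.5)

One caveat that comes up later in the thread: stripping U+200D unconditionally will also break ZWJ emoji sequences.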
1. Photo and video evidence was a game-changer for establishing facts and chronologies. As we've seen, it's becoming harder and harder to distinguish fake photos and videos from real. It's not a stretch to predict a time when we'll only be able to make probabilistic statements about the veracity of photos or videos. (E.g., "60% likelihood of being undoctored.") This is probably already true about photos, although the expense of making a perfect fake is still pretty high, in terms of expertise.
I'd argue the "natural state" is one where word-of-mouth and first-hand accounts are the most authoritative evidence we can have (other than physical evidence like DNA left behind). And even physical evidence left behind can't tell us what the person did while there or how an event transpired.
2. Tracking communications. In the digital age, some people have come to assume that all digital text is untrackable and anonymous. Your "11001110" is the same as mine. Historically, it was pretty difficult to transcribe information without leaving traces of its origin. These zero-width characters, plus all the other text-fingerprinting methods and ubiquitous tracking in communication logs, make it nearly impossible, again, to communicate with others without leaving a trail. And then there's writing-style analysis, which makes it tough to write anything without leaving telltale fingerprints.
So, I'm proposing that we are returning to the "natural state" of things. Probably overstating things a bit, but still an interesting thought to consider.
First-hand accounts aren't always reliable.
I'm not sure there's enough benefit to justify spending processing time, effort and resources on cryptography on any of these fronts.
Now, smartphone apps that take "secure" pictures have been around for a while, for use cases like crash-site photos for insurance claims. They do fill that niche, I think.
(from trivial to more complex ones)
1. Using Invisible Characters
The linked article basically describes a very basic version of it, which should be enough to get the idea. Of course, more sophisticated techniques will use more than just two characters, and will take the position of each invisible character into consideration while encoding & decoding watermarks, ensuring it's uniformly distributed throughout the paragraphs, etc.
Can be defeated by simply removing invisible characters.
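For what it's worth, a toy Python sketch of that basic two-character scheme, using ZERO WIDTH NON-JOINER for 0 and ZERO WIDTH SPACE for 1; the function names and the choice of hiding the payload after the first word are my own assumptions for illustration:

    # Encode a small per-recipient id as invisible characters (toy example).
    ZERO, ONE = "\u200c", "\u200b"  # ZWNJ = 0, ZWSP = 1

    def embed(text, bits):
        # Append one invisible character per bit right after the first word.
        payload = "".join(ONE if b == "1" else ZERO for b in bits)
        first, _, rest = text.partition(" ")
        return first + payload + " " + rest

    def extract(text):
        return "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))

    marked = embed("leaked internal memo", format(42, "08b"))
    print(extract(marked))  # -> "00101010", i.e. recipient id 42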
2. Using Unicode Characters That Look Alike
The same working mechanism as an IDN homograph attack.
Can be defeated by simply removing "out-of-place" letters/characters, after determining the language of a given text/paragraph/sentence etc.
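A crude sketch of that defeat in Python: flag alphabetic characters whose Unicode name isn't Latin in otherwise-Latin text. (Real detection would use Unicode script properties; checking the character name is just a shortcut for illustration.)

    # Crude mixed-script detector: report non-Latin letters in otherwise-Latin text.
    import unicodedata

    def suspicious_chars(text):
        out = []
        for ch in text:
            if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
                out.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "?")))
        return out

    # The 'а' below is CYRILLIC SMALL LETTER A, not the Latin one.
    print(suspicious_chars("pаypal"))  # -> [('а', 'U+0430', 'CYRILLIC SMALL LETTER A')]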
3. Using Unicode Equivalence
> Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Can be defeated by simply normalising the text (and line endings)!
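In Python that's essentially a stdlib one-liner; the helper name is mine, and I've folded in the line-ending normalisation mentioned above:

    import unicodedata

    def normalise(text):
        # NFC collapses canonically equivalent sequences (e.g. n + combining tilde -> ñ)
        # and the replace() folds Windows line endings into Unix ones.
        return unicodedata.normalize("NFC", text).replace("\r\n", "\n")

    assert normalise("man\u0303ana") == "ma\u00f1ana"  # "mañana" either way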
(Now it's getting harder!)
4. Changing the Layout of Documents
You can change (the rendering of):
(a) the margins
(b) the ligatures
(c) the space between
(i) specific characters [kerning]
(ii) consecutive words/lines/paragraphs
Can be defeated by copying the plain text, pasting it into a text editor, and applying steps 1, 2 and 3.
Also, bear in mind that you can still (unintentionally) leak information through:
(a) the document editor you use (LibreOffice Writer, Microsoft Word), as the layout engines might act differently depending on your software version, platform, file format, etc.
(b) the paper size (A4, US Letter, ...)
5. Substituting with Synonyms
If characters can be replaced by their equivalents, why not replace words, or even whole sentences? Words can be substituted with their synonyms by an algorithm that builds fingerprints accordingly (a toy sketch follows below), and I presume even sentences can be rephrased with the latest advancements in AI/ML.
Also, some of the substitutions will be unintentional: a scribble on a leaked document can also leak information about its leaker (different spellings in American and British English, ways of writing dates & times, decimal separators, etc.).
Can be defeated by paraphrasing.
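A toy Python sketch of the synonym idea above: pick between synonym pairs according to the bits of a per-recipient id. The word pairs and the function name are made up for illustration; a real system would have to be far more careful about tone and grammar:

    # Toy synonym-based fingerprinting: each pair encodes one bit of the recipient id.
    PAIRS = [("begin", "start"), ("big", "large"), ("quick", "fast")]  # hypothetical

    def fingerprint(words, recipient_id):
        bits = format(recipient_id, f"0{len(PAIRS)}b")
        out = []
        for word in words:
            for bit, (a, b) in zip(bits, PAIRS):
                if word in (a, b):
                    word = b if bit == "1" else a
                    break
            out.append(word)
        return " ".join(out)

    print(fingerprint("we will begin the big quick review".split(), 5))
    # recipient 5 = 0b101 -> "we will start the big fast review"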
The list can probably be extended even further, but this was all I could remember off the top of my head. =)
[x]: Not that I'm working in a related field, but I researched watermarking & fingerprinting techniques for a similar but much more extensive project for journalists/whistle-blowers to detect fingerprinting/watermarking in documents.
Let me know if you are interested and we can collaborate!
At the age of 5, Hawking visited Oxford, incidentally passing through Fakenamingham-on-Watermarkshire.
Dictionaries/encyclopedias have been known to insert entirely fake entries as a way of proving ownership (http://articles.chicagotribune.com/2005-09-21/features/05092...). In the age of ebooks and print-on-demand, those could be tailored to the individual licensee.
Just like non-existent places/roads on maps, so-called "trap streets":
I second the recommendations of vis, cat -e (or cat -vte), and od -a. I didn't know about hexdump; I always used xxd.
Also I disagree with the following:
> If you're working on a website there's a good chance that someone will want to copy/paste it, automate using it, etc. Zero-width spaces will create no end of hassle to those people.
I could work around that with literally just 1 line of code:
newString = replace(originalString, "\u200B", " ")
It's up to you to decide what to do with that information.
> newString = replace(originalString, "\u200B", " ")
If I could have registered "zerowidth.character" I would have. You've got to handle different characters and such.
I should say that the site is mainly a joke because we had to use a particular tool which had a bunch of zero width spaces in it, and it got on our nerves.
Ahhh, I didn't get that from the site. I thought it was focused on one specific type of zero-width character.
> I should say that the site is mainly a joke because we had to use a particular tool which had a bunch of zero width spaces in it, and it got on our nerves.
That's fair enough. :)
Suppose, for example, that you want the emoji "Man Facepalming Medium Skin Tone". There is no single Unicode code point for that. Instead, it uses five code points:
U+1F926 FACE PALM
U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
U+200D ZERO WIDTH JOINER
U+2642 MALE SIGN
U+FE0F VARIATION SELECTOR-16
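(You can check this from Python directly; the string below is exactly that five-code-point sequence:)

    facepalm = "\U0001F926\U0001F3FD\u200D\u2642\uFE0F"  # man facepalming, medium skin tone
    print(len(facepalm))                   # 5 code points
    print([hex(ord(c)) for c in facepalm])
    # ['0x1f926', '0x1f3fd', '0x200d', '0x2642', '0xfe0f']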
Here's a list of emoji which use the ZWJ this way:
EDIT: also what jedanbik said
I know it's pretty basic stuff but it does the job. The encoder outputs both to stdout and to a file so you can copy/paste it more easily (from Sublime Text, for example).
Apply an autoload rule with these settings:
Name: zero length replacement
HTML: Output only
URL: <entering no url matches all sites>
Type: Regular expression
With: <span style="background-color:#F00 !important">😳</span>
Not really, JFYI ;):
~WHITE SMILING FACE~ ~THUMBS UP SIGN~