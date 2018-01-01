At first look it doesn't pass the search engine test.
Set region to Austria, and Ritchie is result no. 1. Set region to "All Regions" and he's result no. 10.
e.g. replacing one letter with three that sound similar but are different adds 4 to the distance - which is quite a significant distance already.
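To make the arithmetic concrete, here is a rough sketch of an edit-distance function with a configurable substitution cost. The word pair "box" / "bocks" (one letter swapped for three) is my own illustration, not taken from the site. A count of 4 corresponds to charging a substitution as a delete plus an insert; with plain unit-cost Levenshtein the same change comes out as 3.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance with unit-cost inserts/deletes and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Illustrative pair: "x" replaced by the similar-sounding "cks".
print(edit_distance("box", "bocks"))              # 3
print(edit_distance("box", "bocks", sub_cost=2))  # 4
```

Either way, a handful of such swaps in one word quickly pushes it outside the small distance that typical fuzzy matching tolerates.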
What if you want to use it in the opposite direction? Let's call it the CIA/NSA direction, where you know what unobfuscated phrase you want to find, but not how someone may have obfuscated it. This is arguably much harder, especially if you do not store some sort of representation of the content "as sounds/phonemes".
Something I've had in the back of my mind for a while is the idea of swapping out a regular Lucene tokenisation & analysis for one that treats phonemes as tokens instead of (stemmed/analysed) words being the tokens... with a similar arrangement on the query side. I think it'd have interesting capabilities... this being one.
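A very rough sketch of what that could look like, in plain Python rather than Lucene. The rewrite rules below are invented for illustration and are nowhere near a real phoneme model; the point is only that the same analysis runs on both the index side and the query side, so an obfuscated spelling and the plain spelling can land on the same token.

```python
import re

# Hypothetical, deliberately crude rules: (regex, replacement), applied in order.
RULES = [
    (r"ph", "f"),
    (r"ck", "k"),
    (r"c(?=[eiy])", "s"),   # soft c
    (r"c", "k"),            # hard c
    (r"q", "k"),
    (r"x", "ks"),
    (r"z", "s"),
    (r"[aeiouyw]+", "a"),   # collapse every vowel run to one symbol
    (r"(.)\1+", r"\1"),     # squeeze doubled letters
]


def phonetic_key(word):
    """Map a word to a crude sound-alike key."""
    key = word.lower()
    for pattern, repl in RULES:
        key = re.sub(pattern, repl, key)
    return key


def analyze(text):
    """Tokenize into sound-alike keys instead of words."""
    return [phonetic_key(w) for w in re.findall(r"[a-zA-Z]+", text)]


def build_index(docs):
    """Inverted index from sound-alike key to the set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for key in analyze(text):
            index.setdefault(key, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every key of the query."""
    keys = analyze(query)
    if not keys:
        return set()
    result = None
    for key in keys:
        hits = index.get(key, set())
        result = hits if result is None else result & hits
    return result


docs = {1: "fone kall", 2: "phone call", 3: "email"}
index = build_index(docs)
print(search(index, "phone call"))  # documents 1 and 2
```

The obvious cost is precision: collapsing vowels this aggressively will also merge plenty of unrelated words, so a real version would need a much finer sound representation.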
What's the advantage to this?
Slavik pipl shud fajnd dis kvajt ridebl.
Edit: and now I feel like Petter Solberg. [1]
On a related note:
For my Russian friends it was obvious that Chewbacca is Chelovek (man) + Sabaka (dog), while I would have never made that connection.
łud = would
Translation: Lovely concept. You could even try using an even more distant language, Hungarian that is. I wonder how much of this is comprehensible to others.
Edit:
όλσο, ουάν κούντ υιούζ σούπερ γουίερντ Γκρήκ κάρακτερς.
(also, one could use super weird Greek characters)
Edit: also a little German knowledge, but I don't know how much that helped.
> Thing is though, Spud, whin yir intae skag, that's it. That's aw yuv goat tae worry aboot. Ken Billy, ma brar, likes? He's jist signed up tae go back intae the fuckin army. He's gaun tae fucking Belfast, the stupid cunt. Ah always knew that the fucker wis tapped. Fuckin imperialist lackey. Ken whit the daft cunt turned roond n sais tae us? He goes: Ah cannae fuckin stick civvy street. Bein in the army, it's like being a junky. The only difference is thit ye dinnae git shot at sae often bein a junky. Besides, it's usually you that does the shootin.
Scots is a fellow-descendant from Middle English (alongside modern English), with some minor grammatical, and more substantial vocabulary divergence.
Ser10us14 th0u6h, 18 n0t t51s 0b4u8cat10n be44er a6a1ns4 c0mpu4er8?
PS. Kudos for the domain name! Vivid and apt nouns are what English is best at. Though I may have used spelfakr.com
If you want obfuscation you really need something that can do heavy simplification.
It would look completely normal, except the letters and their symbols would be swapped. When I type "a" it would show as "z" and so on.
The scrambled webfont could be embedded, and the scramble could differ per font. OCR or some reverse engineering could decipher the page, but as far as Google indexing and all the modern "reading web content" is concerned, it would all read as random text.
Call it pagefucker or something. You'd do it to a page, and the result would be the modified text and the webfont to render it.
Just a thought!
So URYYB would read HELLO. Or rather, the reader reading HELLO will have no idea the document actually says URYYB as the contents of that text because the font renders URYYB as HELLO.
Presentation detached from substance.
Without the "key" font it'll just read URYYB.
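A small sketch of the text side of this idea, assuming the font is simply a glyph table that draws each stored letter as a different letter. Building the actual webfont is not shown; the helper names are made up for illustration. ROT13 is the fixed special case from the example above, and `make_scramble` shows the per-font random variant.

```python
import random
import string

LETTERS = string.ascii_uppercase


def make_scramble(seed):
    """Return a random letter-to-letter mapping (one per hypothetical font)."""
    rng = random.Random(seed)
    shuffled = list(LETTERS)
    rng.shuffle(shuffled)
    return dict(zip(LETTERS, shuffled))


def apply_map(mapping, text):
    """Apply a mapping letter by letter, leaving other characters alone."""
    return "".join(mapping.get(ch, ch) for ch in text)


def invert(mapping):
    """The glyph table the font would need: stored letter -> drawn letter."""
    return {stored: drawn for drawn, stored in mapping.items()}


# ROT13 as a fixed mapping: each letter shifted 13 places.
ROT13 = {c: LETTERS[(i + 13) % 26] for i, c in enumerate(LETTERS)}

stored = apply_map(ROT13, "HELLO")        # what sits in the page source
shown = apply_map(invert(ROT13), stored)  # what the font would draw
print(stored, shown)                      # URYYB HELLO

# Per-font variant: a different seed gives a different scramble.
font_map = make_scramble(seed=42)
source_text = apply_map(font_map, "HELLO WORLD")
print(apply_map(invert(font_map), source_text))  # HELLO WORLD
```

Note that this is a plain substitution cipher, so anything that does letter-frequency analysis on the page text could undo it without ever seeing the font.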
The upper bound for obfuscation is that the obfuscated text should still be readable by a human with minimal effort. To read, the average reader will look for patterns like "replace j with the y sound." Once these patterns are determined, coding them into your NLP AI is trivial.
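As a toy illustration of how little code such patterns need: the sketch below tries every combination of a few reader-discovered rewrites and keeps the first result found in a word list. Both the pattern list and the three-word example are assumptions for illustration; a real attempt would need a full dictionary and far more patterns.

```python
from itertools import product

# Hypothetical patterns a reader might notice: obfuscated spelling -> original.
PATTERNS = [("j", "y"), ("sc", "s"), ("ie", "ee")]

# Tiny stand-in vocabulary; a real system would load a full word list.
VOCAB = {"you", "see"}


def candidates(word):
    """Yield every spelling reachable by applying any subset of the patterns."""
    for choice in product([False, True], repeat=len(PATTERNS)):
        out = word
        for use, (src, dst) in zip(choice, PATTERNS):
            if use:
                out = out.replace(src, dst)
        yield out


def decode(word):
    """Return the first candidate found in the vocabulary, else the word unchanged."""
    for cand in candidates(word.lower()):
        if cand in VOCAB:
            return cand
    return word


print(" ".join(decode(w) for w in "I scie jou".split()))  # I see you
```

Words that rely on homophones rather than letter swaps (like "gnaw" for "know") would not fall to this; those need a pronunciation dictionary instead of string rewrites.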
> The goal of the project is to make text hard to read for computers yet fairly easy to read for humans
it has to be close to bulletproof. Humans should be able to decipher obfuscated text while computers should never be able to decipher the same obfuscated text. This is going to be impossible since AI is a fast follower to human ingenuity.
Think of it this way: the security measures on most homes are completely inadequate to stop a determined attacker. Nevertheless, they work because most attacks are opportunistic.
So, using your analogy, when it comes to AI every attacker is a determined attacker so your security needs to be ~100% bulletproof.
The ML goal would be to be able to do the same thing.
I found it impossibly difficult to read and I'm a native English speaker. I'd wager people who studied English as a second language might find it harder.
Lat kan nu'gruk da blah kos lat ar ash dafft pyn.
I think the issue I have is I tend to parse words visually rather than phonetically.
The difficulty in learning it lay, as you point out, in that people tend to parse words visually rather than phonetically. The challenge was to alter your mindset while reading, to try to parse the words phonetically rather than visually. Once you learned to do that, it was fairly easy to use it - but yes, it took some practice.
Of course I understand that these examples are not interchangeable. I just wanted to offer my observation after actually having spent some time doing something similar to what this site tries to do "for real" (well, if you can consider an RPG real:)
The only time I had to check the original lyrics was for the pass<->pes substitution, since I couldn't figure it out and the only candidate I arrived at was just silly ("piss") ^^
>'kause jou gnaw jusd wuaed thoe sai
The problem, very simply, is that's just not what the text says.
E.g.: "I soumetymese scie jou pes outcyde mi dour"
In my mind, "soumetymese" reads as "s-how-meh-teemesseh".
"Scie" in Italian means trails. That automatically reads as "she-eh" (not "see"), as it's supposed to be pronounced in Italian.
"Jou" reminds me of the joule energy measurement unit, therefore automatically comes out as "jaw-ou" instead of "you".
And so on.
Yes, and "pes" surely doesn't remind me of "pass" but of something entirely different and quite probably offensive.
Almost smells like a bug. English phonetics don't read J as a Y sound in any word I know of.
English is my 2nd language btw, but my practice reading Chaucer in high school helps with this. English before spelling standardized was ... fun.
'cause you know just what they say
Of which, only 'they' was wrong. Doesn't seem that unreadable.
To you, maybe not. But my point was that myself (specifically) and many others who aren't as comfortable with the written English language would find that text impossible to parse compared to other human-readable obfuscation techniques.
In my case I am dyslexic. I don't like disclosing that as I try my best not to let it affect me (I don't see it as an excuse, I see it as a weakness I need to try harder at - but that's a digression). If I struggled then I would bet that a lot of others who equally have to work harder at reading would struggle to read that as well.
So your comment would translate to:
>A high, high percentage o--- the translated text is e--- readable. A few edge c--- don't make it unreadable.
Does that make this scheme okay? It's 80% easily readable!
Again, to you specifically. Different people have differing abilities at reading the English language. What you might find easy others would not. In my case I could not read about 80% of the example text. Which was why I made the point to discuss specific people (like myself) in my example rather than assuming everyone would be able to read that text equally competently (like you are doing).
We are already used to deciphering English spelling, which is arbitrary at best.
I found that I am usually better than my British colleagues at understanding broken English. I suspect it has a lot to do with how you listen to a language.
https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/
Plugging those words into Google search, it's not able to make any sense of them, besides linking to that one article. Wouldn't this be sufficient?
"The goal of the project is to make text hard to read for computers yet fairly easy to read for humans (like bbboing, just differently)."
(it says that right there on the page)
Can someone explain it in a straightforward manner, please. Scusi if already done above.
