Spellfucker (spellfucker.com)
299 points by a3n on Apr 28, 2017 | 138 comments



"The goal of the project is to make text hard to read for computers yet fairly easy to read for humans"

At first look it doesn't pass the search engine test. https://duckduckgo.com/?q=I%27ve+bien+aloune+whyth+jou+eensi...


Obscuring "I've" sneaks Lionel past the Duck.

https://duckduckgo.com/?q=eyeve+bien+aloune+whyth+jou+eensid...


Check the 3rd result, it's the lyrics.


Sorry, this might be OT, but I found it interesting that the 3rd result for you is the lyrics and for me it is something completely different (Lionel is just the tenth result). I thought DuckDuckGo returns the same results for everyone and that the results are not personalized. Does anyone know what kind of personalization DuckDuckGo uses? In settings, region and language can be selected, but I have them set to 'All Regions' and 'Browser preferred', which is en-US. Does DuckDuckGo track me after all and customize the results that way? Or am I missing something here?


It's probably just the region/language settings.

Set region to Austria, and Richie is result nr 1. Set region to "All Regions" and he's result nr 10.


That seems reassuring. I think I have panicked too soon.


Interesting. Yes, number 10 for me (All regions; Browser preferred lang.) The Instant Answer (IA) is lyrics to "Inside Out" by Eve 6.


I do get a result for lyrics at number 10. I meant to imply Lionel was wearing a hat and dark sunglasses to sneak by the Duck with only one entry of the first 40 results. Not so bad compared to the original query! If only he had removed the fake mustache...


I suspect there is some part of their algorithm using Levenshtein distance behind the scenes (or something similar). Spellfucker would not help with this.


Actually, it does deal with this particular metric somewhat in a bunch of cases.

E.g. replacing one letter with three that sound similar but are different adds 3 to the distance (one substitution plus two insertions), which is already quite significant.
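For anyone who wants to check numbers like this, here is a minimal sketch of plain unit-cost Levenshtein distance. It assumes the standard definition (insert, delete and substitute each cost 1); whatever DuckDuckGo actually uses is unknown, and the function name is just illustrative. Under this definition, swapping one letter for three different ones is one substitution plus two insertions.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insert, delete and substitute each cost 1."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Word pairs taken from the obfuscated lyrics discussed in this thread.
    for plain, obfuscated in [("you", "jou"), ("know", "gnaw"),
                              ("sometimes", "soumetymese")]:
        print(plain, obfuscated, levenshtein(plain, obfuscated))
```

Short words end up a large fraction of their own length away from the original, which is where a purely edit-distance-based corrector would start to struggle.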


So perhaps an algorithm that combines soundex with Levenshtein distance would handle this?
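A rough sketch of what that combination could look like: reduce both words to American Soundex codes, then take the edit distance between the codes. This is only one possible way to combine the two, not a claim about any search engine's internals, and `phonetic_distance` is a made-up name.

```python
# American Soundex digit groups.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def phonetic_distance(a: str, b: str) -> int:
    """Edit distance between the Soundex codes of two words."""
    return levenshtein(soundex(a), soundex(b))
```

With this, "outside"/"outcyde" and "sometimes"/"soumetymese" collapse to identical codes. Soundex keeps the first letter verbatim, though, so "you"/"jou" and "know"/"gnaw" still differ by one; that is the gap the edit-distance step papers over.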


That's actually very impressive!


Actually, it fails on Google. DuckDuckGo just apparently has an excellent spelling correction system.


I think it uses Bing's spell correction.


Good test! But this is only one direction: you already know the obfuscated text.

What if you want to use it in the opposite direction? Let's call it the CIA/NSA direction, where you know what unobfuscated phrase you want to find, but not how someone may have obfuscated it. This is arguably much harder, especially if you do not store some sort of representation of the content "as sounds/phonemes".


The point is that if you are the CIA, searching through a bunch of obfuscated text, just like you are (hopefully) already compensating for basic spelling errors, it apparently would not be very difficult to expand the scope of that same mechanism (in the same way that Duck Duck Go has) to support searching through and understanding this obfuscated text.


Absolutely.

Something I've had in the back of my mind for a while is the idea of swapping out a regular Lucene tokenisation & analysis for one that treats phonemes as tokens instead of (stemmed/analysed) words being the tokens... with a similar arrangement on the query side. I think it'd have interesting capabilities... this being one.
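A toy version of that idea, in Python rather than Lucene: the same phonetic analysis runs at index time and at query time, and the inverted index is keyed on the phonetic code instead of the word. Soundex stands in for real phonemes here, which is a big simplification, and the function names and two-document corpus are invented for illustration.

```python
import re
from collections import defaultdict

CODES = {ch: digit
         for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6"))
         for ch in letters}


def soundex(word: str) -> str:
    """American Soundex, used here as a crude stand-in for phonemes."""
    digits, prev = [], CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]


def analyze(text: str) -> list:
    """Tokenise on letters and map every token to its phonetic code."""
    return [soundex(token) for token in re.findall(r"[a-z]+", text.lower())]


def build_index(docs: dict) -> dict:
    """Inverted index: phonetic code -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for key in analyze(text):
            index[key].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Ids of documents containing every phonetic code in the query."""
    postings = [index.get(key, set()) for key in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "I sometimes see you pass outside my door",
    2: "Hello, is it me you're looking for",
}
index = build_index(docs)
print(search(index, "soumetymese outcyde dour"))  # obfuscated query
```

The obfuscated query finds document 1 because every query token shares a code with a word in it. The price is precision: lots of unrelated words share codes too.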


"The goal of the project is to make text hard to read for computers yet fairly easy to read for humans"

What's the advantage to this?


Defeating censorship, frustrating attempts to analyze or index communications, etc.


Spamming people and making it impossible for blind people to use the internet?


Maybe provide a challenge for NLP researchers?


Spam comes to mind.


I don't think anybody would have the patience for a spam message that had gone through this.


Ju kud tejk ej lengvich from ej diferent lengvich femili tu erajv et samfink similr. For furdr obfaskejshn ju kud juz diferent transliterejshn.

Slavik pipl shud fajnd dis kvajt ridebl.


Æss a nårvidsjæn itt vassent tu hard tu riid jur vraiting laik dis. Aj vånder håo diffikult itt iss får a slav tu ønderstænd nårvinglisj....

Edit: and now I feel like Petter Solberg. [1]

[1] https://youtu.be/Kaeh8FRPANs?t=4s


Pole here. Trivial. Then again, I studied in Denmark, so I know how to read those fancy letters.


Russian here. Was easy to read. Took a few seconds to figure out the sounds for the letters I didn't know, and, of course, it looks odd so just glancing over the phrase doesn't work - but no issues otherwise. Don't know any Nordic languages.


룼키 야짘 나 한굴 볼체 틀룯나 폰얕.


Ez e törk ay ken sey dat dis fred iz may feyvrıt on eyçen soğ fağ. Ay uandır if a fing layk dı OP'yz iz possibıl uif avır languicis dat hev regular and fonıtik spelling rûls?


Translation: As a Norwegian, it wasn't too hard to read your writing like this. I wonder how difficult it is for a Slav to understand Norwenglish...


African here. Was readable in second pass. Pretty cool trick though.


English here. Easy enough to understand too...


Ju kud mejk jt horrorshow jf ju fjlly wjs vocabulary ala Anthony Burgess' Nadsad.

On a related note:

For my Russian friends it was obvious that Chewbacca is Chelovek (man) + Sabaka (dog) while I would have never made that connection.


"The writing of Hungarian is largely phonetic" So this program wouldn't work for Hungarian. Oh well. Your comment would look like this pronounced by me on an elementary level: Jú kud ték a lengvidzs from ö diffrönt lengvidzs femili tu örrájv et számszing szimilár. For fördör ábföszkésön jú kud júz diffrönt trenszlésön.


Akszualli dis is en obsfuskejszon juzing an akszual slawik lenguydz (uidaut juzing speszal slawik karakters but that // łud potenszalli bi iwen more obfuskejted). Ol klir?


I find it amusing how big the differences are between the Slavic languages. So, krige is obviously Polish (not even trying to hide it with the ł, but also using "sz" for "sh"). My guess would be xixixao is Czech or Slovak. Correct?


Czech, correct. I didn't use our letters for the sounds easily replicated in English, so sh and ch would be š and č in Czech.


All clear, yes except for 'uidaut' and 'łud' -- is the latter "would"?


uidaut = without

łud = would


Lávly konszept. Ju kud ívön tráj úzing en ívön mór disztant lengvidzs, Hángérien det iz. Áj wándör háu mács of disz iz komprihenziböl tu ádörz :D


Your digraphs and trigraphs make my head hurt...

Translation: Lovely concept. You could even try using an even more distant language, Hungarian that is. I wonder how much of this is comprehensible to others.


Nicely done! So it works. I badly want to see the software, however, which can do the same translation.


I find the comparison with Slavic languages quite funny :) Just keep in mind, simple transliteration is easily reversible. Spellfucker - not.


It's not a straight transliteration -- it's not like we are writing IPA phonetic marks by-the-by in two different languages. For example your "th" in "thing" doesn't exist in Hungarian so I will use "sz" which by the way is one of the common pronunciation mistakes we make: replacing /θ/ with /s/. There are sounds in Hungarian which do not exist in English and yet if I wanted to "encode" the word "duke" I would likely use "gyúk" thus replacing /djuːk/ with /ɟu:k/.


As soon as I started to read it with a Slavic accent (after reading the last line) it clicked.


I could also read it OK, with concentration (not Slavic). Ai theenke thet eefe ay red laikh theiss four ey uayl eye wood gette phassterh kueekh-Lee.

Edit: όλσο, ουάν κούντ υιούζ σούπερ γουίερντ Γκρήκ κάρακτερς. (also, one could use super weird Greek characters)


Brazilian here with a somewhat good English proficiency. Could understand almost everything. Thanks for that experiment :)

Edit: also a little German knowledge, but I don't know how much that helped.


I could read it fairly well, and I'm Swedish. Now I feel Slavik.


Funny, am Swede and thought this was written by a Scandinavian!


Aj dident nou der vr sou meni Slavik pipol araunt dis plejs. Veri hepi tu si felou Slavs komentink on Hekr Njuz


The Indonesian/Malaysian language is also phonetic, with almost the same pronunciation as Slavic languages, maybe except for the letters C and J. In Malaysia they literally write "bas" (bus), "kaunter" (counter), "kolej" (college), etc.


Russian here. Didn't have any trouble. Why would non-Slavic people find it harder?



Only a couple of words caught me out. Not Slavik. ^^


This is what it's like for an English person to read Scots:

https://sco.wikipedia.org/wiki/Main_Page


Ah yes, time to go and read some Irvine Welsh books again.

> Thing is though, Spud, whin yir intae skag, that's it. That's aw yuv goat tae worry aboot. Ken Billy, ma brar, likes? He's jist signed up tae go back intae the fuckin army. He's gaun tae fucking Belfast, the stupid cunt. Ah always knew that the fucker wis tapped. Fuckin imperialist lackey. Ken whit the daft cunt turned roond n sais tae us? He goes: Ah cannae fuckin stick civvy street. Bein in the army, it's like being a junky. The only difference is thit ye dinnae git shot at sae often bein a junky. Besides, it's usually you that does the shootin.

https://www.goodreads.com/quotes/809012---thing-is-though-sp...


I've probably spoken with 3 or 4 Scots in my entire life, and I've heard a few speak in various news/radio spots. Listening to a Scottish person speak really sounds like someone speaking English with a heavy accent. Seeing it written is nothing short of amazing.


The Scots language[0] is not to be confused with Scottish English[1] (although it's easy to see the elements of the Scottish accent in English in Scots orthography).

Scots is a fellow-descendant from Middle English (alongside modern English), with some minor grammatical, and more substantial vocabulary divergence.

[0]: https://en.wikipedia.org/wiki/Scots_language

[1]: https://en.wikipedia.org/wiki/Scottish_English


Neb, lug and mooth are my favourites :)


The output reminds me of the writings of Bascule in Feersum Endjinn [0] but it's somehow not quite as legible!

[0] https://en.wikipedia.org/wiki/Feersum_Endjinn#Writing_style


Aj kud ríd it kvajt ízili. Hel, Aj ívn nou lotz of pípl hír hú spík end rajt lajk tat ál dze tajm end sink itz inglyš. But mejbí tatz cóz Aj em Slavík :)

Ser10us14 th0u6h, 18 n0t t51s 0b4u8cat10n be44er a6a1ns4 c0mpu4er8?

PS. Kudos for the domain name! Vivid and apt nouns are what English is best at. Though I may have used spelfakr.com


They should make a version of this that replaces words with homophones. This way spellcheckers would also not pick up that the document has been messed with.
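A minimal sketch of the homophone idea. The table below is a tiny, hand-picked assumption just to show the mechanism; a usable version would need a real homophone list, and `homophonize` is a made-up name.

```python
import re

# Hypothetical mini-table; every value is a correctly spelled English word,
# so a plain spellchecker has nothing to flag.
HOMOPHONES = {
    "know": "no",
    "see": "sea",
    "for": "four",
    "to": "two",
    "you": "ewe",
    "by": "buy",
    "write": "right",
}


def homophonize(text: str) -> str:
    """Swap each known word for its homophone, keeping initial capitals."""
    def swap(match):
        word = match.group(0)
        replacement = HOMOPHONES.get(word.lower())
        if replacement is None:
            return word  # no homophone known, leave the word alone
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", swap, text)


print(homophonize("You know just what to say"))
```

Every output word passes a dictionary check, so only a grammar- or context-aware checker would notice that anything is off.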


Not sure if I've missed the point, but good forensic software can parse grammatical structure and use that. The English language has enough flexibility for every person to have a unique grammatical flavor. And that probably extends to each document as well.

If you want obfuscation you really need something that can do heavy simplification.


That's really cool. It reads surprisingly like Chaucer.


You say that like it's a good thing.


I generally find Chaucer easier to understand than Shakespeare. Not sure if I'm alone in that.


Couldn't we scramble the font?

It would look completely normal, except the letters and their symbols would be swapped. When I type "a" it would show as "z" and so on.

The scrambled webfont could be embedded, and the scramble could happen per font. OCR or some reverse engineering could decipher the page, but as far as Google indexing and all the modern "reading web content" is concerned, it would all read as random text.

Call it pagefucker or something. You'd do it to a page, and the result would be the modified text and the webfont to render it.

Just a thought!


"When I type "a" it would show as "z" and so on". Someone mentioned http://www.rot13.com/ - is it what you loukyngue phor? :)


Ah yes! But the shift or scramble would be at the font level.

So URYYB would read HELLO. Or rather, the reader reading HELLO will have no idea the document actually says URYYB as the contents of that text because the font renders URYYB as HELLO.

Presentation detached from substance.

Without the "key" font it'll just read URYYB.
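To make the font idea concrete, here is a sketch that models the "font" as a lookup table from stored characters to displayed glyphs. No real webfont is generated; `make_font_mapping` and `apply_mapping` are hypothetical names, and only uppercase A-Z is handled. ROT13 is the special case where the table is a fixed 13-place shift.

```python
import codecs
import random
import string


def make_font_mapping(seed: int):
    """Return (encode, decode) tables for one page.

    encode: what the author typed -> what gets stored in the HTML
    decode: stored character -> glyph the scrambled font would draw
    """
    rng = random.Random(seed)
    shuffled = list(string.ascii_uppercase)
    rng.shuffle(shuffled)
    encode = dict(zip(string.ascii_uppercase, shuffled))
    decode = {stored: shown for shown, stored in encode.items()}
    return encode, decode


def apply_mapping(mapping: dict, text: str) -> str:
    """Map every character through the table, leaving others untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


encode, decode = make_font_mapping(seed=2017)
stored = apply_mapping(encode, "HELLO WORLD")  # what a crawler would index
shown = apply_mapping(decode, stored)          # what the reader would see
print(stored, "->", shown)

# The ROT13 example from the comment above, via the standard library codec.
print(codecs.encode("HELLO", "rot_13"))
```

A different seed per page gives a different table, which is the "scramble per font" part of the proposal.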


Frequency analysis makes this kind of cipher very easy to crack...
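A small demonstration of the point for the ROT13-style case: try all 26 shifts and keep the one whose letters look most like English. The frequency table is approximate, and this only cracks shift ciphers. A fully scrambled font is a general substitution cipher, which needs a longer search (still driven by letter and n-gram statistics) and is not shown here.

```python
# Approximate English letter frequencies, in percent.
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}


def shift_text(text: str, k: int) -> str:
    """Rotate every ASCII letter k places forward, keeping case."""
    result = []
    for ch in text:
        if "a" <= ch <= "z":
            result.append(chr((ord(ch) - 97 + k) % 26 + 97))
        elif "A" <= ch <= "Z":
            result.append(chr((ord(ch) - 65 + k) % 26 + 65))
        else:
            result.append(ch)
    return "".join(result)


def english_score(text: str) -> float:
    """Higher when the text is dominated by common English letters."""
    return sum(ENGLISH_FREQ.get(ch, 0.0) for ch in text.lower())


def crack_shift(ciphertext: str):
    """Return (shift, plaintext) for the best-scoring of the 26 shifts."""
    best = max(range(26),
               key=lambda k: english_score(shift_text(ciphertext, k)))
    return best, shift_text(ciphertext, best)


plain = ("the goal of the project is to make text hard to read "
         "for computers yet fairly easy to read for humans")
cipher = shift_text(plain, 13)  # ROT13
print(crack_shift(cipher))
```

No key and no dictionary are needed, only the statistics of the language.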


On the face of it, this seems like a fool's errand.

The upper bound for obfuscation is that the obfuscated text should still be readable by a human with minimal effort. To read, the average reader will look for patterns like "replace j with the y sound." Once these patterns are determined, coding them into your NLP AI is trivial.


It doesn't have to be bulletproof to serve its purpose, as anybody who remembers downloading "Boon Joovi" MP3s from late Napster can attest. And if some transformations are ambiguous between multiple original spellings, so much the better.


Since the stated goal of the project is

> The goal of the project is to make text hard to read for computers yet fairly easy to read for humans

it has to be close to bulletproof. Humans should be able to decipher obfuscated text while computers should never be able to decipher the same obfuscated text. This is going to be impossible since AI is a fast follower to human ingenuity.


That's a pretty narrow reading of making it "hard for computers" to do something.

Think of it this way: the security measures on most homes are completely inadequate to stop a determined attacker. Nevertheless, they work because most attacks are opportunistic.


There are far, far, far more incompetent home burglars than competent ones. On the other hand, only a handful of AIs exist and, by definition, they are getting better and better.

So, using your analogy, when it comes to AI every attacker is a determined attacker so your security needs to be ~100% bulletproof.


I think you're making a mistake in assuming sophisticated AI is going to be used everywhere. This will easily defeat simple word filters, for instance, which are still commonly used.


I think the idea is that if you have random obfuscation rules, a human can generalize almost immediately to read it. Like, you could string together paragraphs of different obfuscation rules and your brain would be able to switch between rules fairly quickly.

The ML goal would be to be able to do the same thing.


Interesting idea though I don't agree with his statement:

> The goal of the project is to make text hard to read for computers yet fairly easy to read for humans

I found it impossibly difficult to read and I'm a native English speaker. I'd wager people who studied English as a second language might find it harder.


I find that this is very much a mindset thing. I once "learned" a toy language used for communication in a role playing game (Dym'Yak, of the Dimday Tribe, of Defias Brotherhood EU server in World of Warcraft if anybody cares). The point was to make a language that seemed like gibberish to outsiders, but was parseable by members. The rules were simple - a small dictionary of special words (say a hundred game specific words or so), vague rules about spelling and then phonetic writing all the way. I.e. you pretty much had to subvocalize the text as you were reading it. It was quite hard in the beginning, but after learning to ignore your preconceptions about known words it became fairly easy.

Lat kan nu'gruk da blah kos lat ar ash dafft pyn.


I appreciate what you're saying but (if I understand you correctly) I think that example is quite a bit different from what's happening here, as what you're talking about is learning a completely new language rather than parsing intentionally misspelt English words. Unless you're saying the written form of your language was extremely loose on spelling?

I think the issue I have is I tend to parse words visually rather than phonetically.


Well, I should have been clear that the language was based on English, so it more or less ended up as "intentionally misspelt" English with the exception of the hundred word base dictionary.

The difficulty in learning it lay, as you point out, in that people tend to parse words visually rather than phonetically. The challenge was to alter your mindset while reading, to try to parse the words phonetically rather than visually. Once you learned to do that, it was fairly easy to use it - but yes, it took some practice.

Of course I understand that these examples are not interchangeable. I just wanted to offer my observation after actually having spent some time doing something similar to what this site tries to do "for real" (well, if you can consider an RPG real:)


Ahhh I see. That does sound pretty similar to the title project then. Interesting anecdote as well. Thank you for sharing.


Also a native English speaker. I wouldn't call it easy, but I had luck loosening my expectations for the letter sounds and imagining someone with a heavy accent saying the words. An exercise in fuzzy pattern matching.


I would say that it might be easier to read if you are native in some other language because it depends on sounds of letters that you might be unfamiliar with. I'm a Swede and I find it quite easy to read.


Or perhaps easier. Some native speakers may find it hard to imagine their (written) language looking any different, whereas folks learning it as a foreign language have a benefit of perspective.


I actually found it quite easy to read and English is my second language. Maybe it's the other way around and native speakers have a harder time, since the way things are written is more ingrained in their thinking.

The only time I had to check the original lyrics was for the pass<->pes substitution, since I couldn't figure it out and the only candidate I arrived at was just silly ("piss") ^^


Seconded. Definitely unreadable. For anyone who didn't click through, a line from their self-chosen sample:

>'kause jou gnaw jusd wuaed thoe sai


The trick is to run it through your "phonetic processor." Try pronouncing it out loud quickly and smoothly while listening to the sound of your voice. I couldn't quite read it directly, like I do with normal English text, but with a moment's practice I was able to do the phonetic step internally without actually vocalizing.


But "thoe" read that way is a word (i.e. "though"), and th is basically never pronounced as a hard t the way the line I quoted has it meant to be read. "wuaed thoe sai" is somehow meant to be deciphered as "what to say." I would say nobody reading using your method (or any other) can read it correctly. The other comment, for example, misread it (generously) as "they", to make it make sense grammatically. "you know just what they say."

The problem, very simply, is that's just not what the text says.


Of course you're right. The text is obfuscated, after all. It rarely is obvious for individual words, but I had very little trouble understanding it on the phrase and sentence level. It's very similar to that party game Mad Gab, although that one is intended to be much more challenging to understand.

https://en.m.wikipedia.org/wiki/Mad_Gab


Italian here with a quasi-native command of the English language. It's very hard to read. My mind tends to parse the words as if they were to be read literally.

E.g.: "I soumetymese scie jou pes outcyde mi dour"

In my mind, "soumetymese" reads as "s-how-meh-teemesseh". "Scie" in Italian means trails. That automatically reads as "she-eh" (not "see"), as it's supposed to be pronounced in Italian. "Jou" reminds me of the joule energy measurement unit, therefore automatically comes out as "jaw-ou" instead of "you". And so on.



> "I soumetymese scie jou pes outcyde mi dour"

Yes, and "pes" surely doesn't remind me of "pass" but of something entirely different and quite probably offensive.


Indeed, that was the only word I had problems with, so I had to check the original lyrics.


I can read that phonetically, but the J in jou and jusd being a different sound? That's just mean.

Almost smells like a bug. English phonetics don't read J as a Y sound in any word I know of.

English is my 2nd language btw, but my practice reading Chaucer in high school helps with this. English before spelling standardized was ... fun.


I read that as:

'cause you know just what they say

Of which, only 'they' was wrong. Doesn't seem that unreadable.


> Doesn't seem that unreadable.

To you, maybe not. But my point was that myself (specifically) and many others who aren't as comfortable with the written English language would find that text impossible to parse compared to other human-readable obfuscation techniques.

In my case I am dyslexic. I don't like disclosing that as I try my best not to let it affect me (I don't see it as an excuse, I see it as a weakness I need to try harder at - but that's a digression). If I struggled then I would bet that a lot of others who equally have to work harder at reading would struggle to read that as well.


Which means that it is unreadable - because that's not what it says! You couldn't read it.


A high, high percentage of the translated text is easily readable. A few edge cases don't make it unreadable.


Your statement would be true even if the translation consisted of replacing every fifth word with exactly the four characters x--- where x is the first letter of the word being replaced.

So your comment would translate to:

>A high, high percentage o--- the translated text is e--- readable. A few edge c--- don't make it unreadable.

Does that make this scheme okay? It's 80% easily readable!
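For the record, the hypothetical scheme described above fits in a few lines. This is a sketch of that thought experiment only (every fifth word becomes its first letter plus `---`), not anything Spellfucker does, and the function name is made up.

```python
def mask_every_fifth(text: str) -> str:
    """Replace every fifth word with its first letter followed by '---'."""
    words = text.split()
    for i in range(4, len(words), 5):  # indices 4, 9, 14, ... are words 5, 10, 15, ...
        words[i] = words[i][0] + "---"
    return " ".join(words)


comment = ("A high, high percentage of the translated text is easily "
           "readable. A few edge cases don't make it unreadable.")
print(mask_every_fifth(comment))
```

Four out of five words survive untouched, yet "o---", "e---" and "c---" are unrecoverable without guessing from context.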


> A high, high percentage of the translated text is easily readable

Again, to you specifically. Different people have differing abilities at reading the English language. What you might find easy others would not. In my case I could not read about 80% of the example text. Which was why I made the point to discuss specific people (like myself) in my example rather than assuming everyone would be able to read that text equally competently (like you are doing).


That's why I didn't respond to you specifically, and only the person asserting that it's straight-up unreadable. It isn't.


I actually appreciate the opinion that this kind of text is difficult for some people to read. I also acknowledge that some words are too fukt to be read properly :) However, given that this algo was written in one night and most people say it works, I see potential. It is all about polishing the lib of replacements, which is a part of the algo.


The problem with this line is 'gnaw' is actually a word. I find it very difficult to look past what is actually written and try to convert a correctly spelled word into something else.


I would say it is probably the other way around. I found it trivial, yet I have spent a total of 5 weeks in an English speaking country.

We are already used to deciphering English spelling, which is arbitrary at best.

I found that I am usually better than my British colleagues at understanding broken English. I suspect it has a lot to do with how you listen to a language.


Sorry, I cannot fathom the use for this. It is just harder to read for humans, CIA and NSA will hardly choke on this and to hide it from Google search there are far better ways. If you are serious about encryption then just use real encryption. If not then use ROT-13.


Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/

Plug those words into Google search and it's not able to make any sense of them, besides linking to that one article. Wouldn't this be sufficient?
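The scrambling in that paragraph is easy to reproduce. A minimal sketch, assuming the only rule is "keep the first and last letter, shuffle the rest"; how bbboing or the original meme text was actually generated is not known here, and the function names are illustrative.

```python
import random
import re


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters; first and last stay in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text: str, seed=None) -> str:
    """Scramble every run of letters, leaving punctuation and spaces alone."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+",
                  lambda m: scramble_word(m.group(0), rng), text)


print(scramble_text("According to research at Cambridge University", seed=1))
```

Since each word keeps its letters, length and endpoints, a machine can undo this by matching against a dictionary on (first letter, last letter, sorted middle), so it is weaker against computers than it looks.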


The implementation of this is called bbboing http://www.oik-plugins.com/oik_shortcodes/bbboing-obfuscate-.... I mentioned it on the website:

"The goal of the project is to make text hard to read for computers yet fairly easy to read for humans (like bbboing, just differently)."


Like bbboing, but in my opinion it's nowhere near as easy for a human to read. Yes, you can do it, but the text I pasted is nearly indistinguishable from normal text in reading speed.


I am not sure I understand, do you mean the Cmabridge method is easier to read? If yes - I agree with you. Spellfucker is just too young and needs a lot of polishing.


Try reading it while clicking the "obfuscate again" button. It's almost easier.


True. Because your brain keeps/interprets the easiest version possible. ;)


That's a bold claim!


Great for generating something that foreigners will have a hard time with.


Disclaimer: I am the creator of the Spellfucker. Please note, this project was written in one night by a non-native English speaker. Tweaking a library of replacements would definitely give better results. The algorithm needs improvements in terms of complexity, but it is not the top priority I think. I am glad some people actually liked the project and I would be happy if there are any contributions, especially to the replacement library, so we can work on it together :) Love, Igor.


AKA an English to Welsh translator.


Though the obfuscation of some common sequences appears to have randomization, it looks like infrequent patterns are constant, for example, the "obfu" in "obfuscation" seems to always be fucked as "hobphu". "ob", as in "object", always seems to turn into "hob", when "awb" and "ahb", plus a randomized sequence of consecutive `b` characters, would add even more obfuscation.


- Not all patterns have more than 1 replacements, which makes them still easily revertable.

(it says that right there on the page)
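What the suggestion above amounts to, as a sketch: give every pattern several candidate replacements and pick one at random each time. The table is invented for illustration and is not Spellfucker's actual replacement library; the real algorithm may match patterns differently.

```python
import random

# Hypothetical replacement table: pattern -> candidate spellings.
REPLACEMENTS = {
    "ob": ["hob", "awb", "ahb"],
    "f": ["ph", "f"],
    "c": ["k", "c"],
}


def obfuscate(text: str, rng: random.Random) -> str:
    """Scan left to right, preferring longer patterns, choosing randomly."""
    patterns = sorted(REPLACEMENTS, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for pattern in patterns:
            if text.startswith(pattern, i):
                out.append(rng.choice(REPLACEMENTS[pattern]))
                i += len(pattern)
                break
        else:
            out.append(text[i])  # no pattern matches at this position
            i += 1
    return "".join(out)


rng = random.Random(0)
print({obfuscate("object", rng) for _ in range(20)})
```

With a single candidate per pattern the mapping is a fixed substitution and trivially reversible; with several, the same input produces different outputs across runs.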



Hahahah, brilliant! :D


I'm a native English speaker who learned Spanish as an adult, and for some reason my brain jumps to pronouncing these unfamiliar words with a Spanish pronunciation, which made it a struggle to read.


Wow, finally a good orthography for English!


cool project, but why not spelphuckar.cum?


Maybe for the same reason English-Chinese dictionary covers are in English in the US.


I like this. I would love to see this being used in APT data exfiltration / DLP bypasses.


So this is what we'll use to encrypt our communications when Skynet takes over I guess.


If this thing ever needs a G-rated name, might I propose "Chaucerizer".


Remarkably like my own spelling when 8 years old.


This is how I feel reading Old English.


Does not work with Greek. :(


What's the use case?


Like a Captcha, I think. Text unparsable by robots. Cover your tracks. That sort of thing.

Can someone explain it in a straightforward manner, please. Scusi if already done above.


Creative. Nice


Wonder what "Creative" means in this context.


To the project creator: you override all the default Bootstrap fonts with "Lato", which, at least here, doesn't exist.


Thank you, fixed.


[flagged]


If your account is less than a year old, please don't submit comments saying that HN is turning into Reddit. It's a common semi-noob illusion, as old as the hills.

Please resist commenting about being downvoted. It never does any good, and it makes boring reading.


It says on his profile that his account is 1739 days old. Or am I missing something?



