The example is… not the best: if you’re checking whether two pages are identical, you can just compare them byte for byte. Hashing only adds an extra step: you’re going to have to read every byte anyway!
Unless the file is very small (in which case a simple download over HTTP(S), or a read via a network filesystem, may be faster than negotiating an SSH connection and running the remote hash command). In either case, accessing a pre-computed hash of the remote content will likely be faster than both options, assuming a stored hash exists and you are looking to verify your local copy.
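That verification workflow is easy to sketch. This is a minimal illustration (the "published hash" here is just computed locally to stand in for one fetched from a server):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some content as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Verifying a local copy against a published hash: only the short
# digest needs to travel over the network, not the whole file.
local_copy = b"the quick brown fox"
published_hash = sha256_hex(b"the quick brown fox")  # pretend this came from the server

print(sha256_hex(local_copy) == published_hash)  # True: contents match
```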
Yup. A better example might be something like assigning each letter a number (a=1, b=2, ...) and then summing them all. It's important that the analogy be simple to comprehend, but it should still be, well, analogous to the real process.
Hi js2! I do it in the immediate next section, "Improving the algorithm" (take all the letters and assign a number value). Do you think there's a way I can improve that part or say that better?
I try to ramp it up from a loose analogy to gradually model how it actually works (although I'm no expert on that!)
Given this is for a non-tech audience I think this is reasonable. It basically describes hashing as a lossy encoding that still represents the original. Also, if the words/chars were stored as an array or string already, you wouldn’t need to read every byte.
Hello Vore!
Yes, of course. I wrote this for a few non-technical friends - a simple example they can relate to (like how we skim docs) seemed like the best way to start off without scaring them away from the article :P
The webpage example does a good job of setting the stage for passwords. You definitely do not want to compare passwords that way unless you are trying to create a textbook timing vulnerability.
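For anyone wondering what the "textbook timing vulnerability" looks like: a naive comparison bails out at the first mismatched byte, so how long it takes leaks how much of the secret was guessed right. A rough sketch (Python's `hmac.compare_digest` is the stdlib fix):

```python
import hmac

def naive_equal(a: bytes, b: bytes) -> bool:
    # Returns at the FIRST mismatching byte, so the comparison time
    # depends on how long the common prefix is -- that's the leak.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def safe_equal(a: bytes, b: bytes) -> bool:
    # compare_digest examines every byte regardless of where the
    # first difference occurs, so timing reveals nothing.
    return hmac.compare_digest(a, b)
```

(And of course, in practice you compare password *hashes*, not passwords.)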
This is misleading because it only covers message-digest hashing, presenting that as the whole of "what hashing is".
The other important kind of hashing is the reduction of an object of any kind (including, but not limited to, text) used as a key, to a numeric code (usually fitting in a single machine word) that serves as an index for fast lookup in a "hash table" data structure.
This is related to testing whether two things are the same, but the security aspect isn't important. More important is that the hash code is used as an integer to jump to the right bucket or starting point in the hash table.
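A toy sketch of that bucket-jumping, with a deliberately simple hash function (real tables use much stronger mixing):

```python
def bucket_index(key: str, num_buckets: int = 8) -> int:
    # Toy hash: fold the character codes into one integer, then
    # reduce it into the table's range with a modulo.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h % num_buckets

# The integer lets you jump straight to one bucket; you only scan
# that bucket's short list, not the whole data set.
table = [[] for _ in range(8)]
for word in ["apple", "pear", "plum"]:
    table[bucket_index(word)].append(word)
```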
By the way, the credit for hashing (the indexing technique for quickly finding items in a data set, not message digests) belongs to some person in China.
I never see this mentioned. For instance, look at the Wikipedia page on hashing; mainly some 20th century white dudes are mentioned there, and one A. D. Linh (Vietnamese-looking name) is credited with open addressing.
However, decades before that, Chinese character dictionaries used step-by-step algorithms for finding Chinese characters in a dictionary, which are identical to hashing.
First you examine the character according to a list of precise rules, which have various cases. The cases assign numbers to parts of the character according to structural features, and those numbers are then combined to form a code, such as the "four corner code".
From the code, you can proceed directly to a page of the dictionary, where you search through a short list of characters that have the same code.
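The procedure maps neatly onto hash-table vocabulary: the code is the hash, the page is the bucket, and scanning the short list is collision resolution. Purely illustrative sketch (these codes are made up, not real four-corner codes):

```python
# Hypothetical page index: code -> short list of characters sharing it.
# Real four-corner codes are derived from a character's visual corners;
# these values are invented for illustration.
dictionary_pages = {
    "1010": ["字", "学"],
    "4040": ["林"],
}

def look_up(code: str) -> list[str]:
    # Step 1: the code (the "hash") selects a page (the "bucket").
    # Step 2: scan the short list on that page (collision resolution).
    return dictionary_pages.get(code, [])
```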
Unless there is prior art for that, that is where the credit lies.
"The Four-Corner Method was invented in the 1920s by Wang Yunwu, the editor in chief at Commercial Press Ltd., China. It was based on experiments by Lin Yutang and others."
The significant thing about the prior art is also that it hashed not textual but graphical objects. A Chinese character is text, of course, but the hashing method treats its features graphically. It's not looking at the etymologically correct radicals and components and using their individual code values, but at purely visual features like crossings, boxes and dots.
Before the Four Corner Method, there were existing numeric lookup methods for dictionaries, like using the 214 Kangxi radicals. That is also a kind of proto-hashing method. However, the hash function is difficult to compute (you have to already know which part of the character is considered the radical, and which radical it is under variation). The number of buckets is pretty large, resulting in long searches. The Four Corner Method is a refinement: you can hash any character without knowing its radical, and the code space is large enough to reduce searching through collisions.
Hello there! Author here. Just a small guide for non-technical people about hashing - a small, seemingly scary word, but one that, once understood, could potentially make so many other things click.
I've been hashing files for years, but I just recently heard an old radio geek use it as slang for noise, which suddenly made a lot of sense etymologically - a hash function takes structure and turns it into noise. Onions and potatoes optional.
The first thing that comes to mind is adding up all the characters in the input. It's easy to see that if the sums for two inputs are different, then the two inputs are necessarily different.
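A quick sketch of that idea, which also shows its weakness: equal sums do *not* mean equal inputs, since any anagram collides:

```python
def letter_sum_hash(text: str) -> int:
    # Sum the character codes: different sums guarantee different
    # inputs, but equal sums do NOT guarantee equal inputs.
    return sum(ord(ch) for ch in text)

print(letter_sum_hash("listen"))  # same letters...
print(letter_sum_hash("silent"))  # ...same sum: a collision
```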
Also, technically bcrypt is not a general hash algorithm, but rather a password-hash construction. Where SHA-1 and MD5 are primitives that can be used in arbitrary cryptographic constructions, bcrypt is already a construction specialized for passwords. You wouldn't (or at least shouldn't) hash a file with bcrypt.
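To make the distinction concrete, here's a sketch using PBKDF2 from the stdlib as a stand-in for bcrypt (bcrypt itself needs a third-party library): a password-hash construction wraps a primitive like SHA-256 with a per-user salt and a deliberately expensive iteration count.

```python
import hashlib
import os

def hash_password(password: str, salt=None):
    # PBKDF2 shown as a stand-in for bcrypt: salt prevents
    # precomputed-table attacks, the iteration count slows down
    # brute force. Plain SHA-256 has neither property.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

salt, stored = hash_password("hunter2")
# Verify by re-running the construction with the stored salt:
_, candidate = hash_password("hunter2", salt)
print(candidate == stored)  # True
```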