> Zing uses a different solution to keep it consistent despite object relocations, in which they delay storing the id. hash until the object relocates. At that point, it’s stored in a “pre-header”
IIRC, that's an old David Bacon invention [1]:
> For non-moving collectors such as mark-and-sweep, the system can define the default hash code of an object to be its address, or more generally, a function thereof. For moving collectors, we can use the tri-state encoding technique from Bacon et al. [7] (also developed independently by Agesen and used in the Sun EVM), where the states of an object are unhashed, hashed, and hashed-and-moved. For the first two states, the hash code of the object is the address. When the collector moves an object whose state is hashed, it changes its state to hashed-and-moved and copies the old address to the end of the new version of the object. If space is available, the system can encode hash code state in two bits in the object header .
Is it even still a hash code/table in that case, technically? I thought the whole point of a hash table was the hash function "randomly" but reproducably assigning a position to items. This way, it's more some indexed storage with randomized positions.
Can't use the object content since that can change.
It is not uncommon to use the state of an object to derive the hash code. It of course puts the burden to ensure that objects do not change in undesired ways while they are required to yield a stable hash code, for example while used as a key in a hash table, onto the developer. One common solution is to restrict the hash code calculation to immutable parts of the state, for example an ID.
The advantage of deriving the hash code from the state is that you can use the hash code to speed up object equality tests, i.e. if the hash codes of two objects do not match, they can not be equal.
I think what you're asking is "this getHashCode function always returns a random number within a large range, but a hash table has a small set of positions, so isn't this more like a sparse array rather than a hash table?"
The answer is that the "hash code" is indeed not the final position, instead it is the input to a function to compute the final position. So, lets say a hash table has 100 positions, then the hash table implementation will essentially do:
var position = object.getHashCode() % 100;
which converts the random number from getHashCode into a position within the hash table. Since the final "positions" are ints 0-99, the actual underlying storage would look a lot like a 100 element array.
> This way, it's more some indexed storage with randomized positions.
Summing up, basically yes, you could think of a hash table as indexed storage, where the positions are "randomised" by a modulus on the number of positions. Of course the "random" bit happens when the object hash code is generated, the rest is deterministic. In real life it looks like the minimum number of positions is at least 1022, and that number can change as entries get added to the hash table, but that's the gist of it.
The module operator is really slow. But a bit shifting operator is much faster. Having hash tables that grow in sizes of two, or more precisely where you can just shave off some bits of a random number to get a position will yield the best performance.
In JS, hash tables need to be growable Incase they get dense. In that instance you don’t want to compute object hash again since it’s expensive. You just want a cheap, here’s the new position.
> I think what you're asking is "this getHashCode function always returns a random number within a large range, but a hash table has a small set of positions, so isn't this more like a sparse array rather than a hash table?"
No, that was not really my question. FWIW, the same problem often exists for actual hash tables too: For example, the Java hashCode method returns a 32-bit number that needs to be mapped first too.
My question is whether a (non-deterministic) random "hash" function can still be called as such, and therefore the structure a hash table.
I believe this is ultimately a question of lexicography: what do the terms "hash code" and "hash table" mean?
The name "hash code" comes (UIAVMM!) from the idea of chopping an input into little bits and mashing them all together, just like you'd do to potatoes or beef to make hash browns or steak haché. I imagine the name "hash table" comes from the fact that is typically probed using a hash code. In fact, i would be interested to know if the first hash tables had separate steps of hashing and input and reducing it to a table index, as modern ones do, or whether the input was hashed directly to an index.
An "identity hash" like this is clearly not made that way.
However, lexicography is not etymology. If you want to know what a word means, you have to look at how people use it. In the Java world, at least, anything that comes out of hashCode() is a hash code, even if it's not made by traditional hashing. And universally, the data structure implementing a map by reducing keys to indices into a table and then dealing with collisions somehow is known as a hash table.
If that hasn't been answered elsewhere, then the important part is that the hash code for a particular object is always constant, every call to getHashCode for that object returns the same value, so in a sense it is deterministic (the same object always hashes to the same value, which is all the hash table cares about) even if the actual number was initially generated using a non-deterministic process.
> the hash function "randomly" but reproducably assigning a position to items
Right, which is a pretty good description of what this does. It's random because it literally generates a pseudo-random number. And it's reproducible because it only generates it once and then stores it with the object to reuse it later.
> This way, it's more some indexed storage with randomized positions.
That's a great one-sentence description of a hash table. :)
That's not reproducable. Hashes are reconstructed from the original data even if it no longer exists as a formal in memory object. This is just a random ID.
For mutable objects in an OOP language, the object's identity is "the original data". The values that may happen to be stored in any other fields at various points in time are not part of the object's state and cannot affect the hash code. (If they could, mutating the object would change its hash code, which would be Very Bad.)
Shouldn’t they just call the attribute “objectIdentity” then? The fact that it’s used to access a hash table in some use cases doesn’t make it a “hash code”. You’d just as well call it an “equality checking code”, or name it after any other use case you could think of for a unique identifier.
Ruby calls its equivalent construct “object_id” and that makes so much more sense.
They only need a unique id for the object, they can't hash based on the object contents because those can change and because two objects with exactly the same properties are not equal in JS:
There were several exploits against different languages (PHP, Python, Node) which allowed denial of service attacks by abusing the hash function to craft a special hash table where all elements have the same key.
Randomizing the hash function makes it unpredictible to an attacker.
I think you’re skipping the most contentious point which is that if it’s random it’s not a hash function any more. It’s more of just a unique ID each object has, and calling it a hash any more is misleading.
In the case of Python, for certain built-in types, __hash__() combines information about the object with a salt value. The salt can be manually forced by an environment variable, but if not supplied that way will be randomly chosen on interpreter startup.
The result is consistent hashing within a given interpreter process, but unpredictability across multiple processes/invocations, which is what's needed to avoid a denial-of-service vector created by predictable hash values for built-in types.
Salting isn’t what I mean by random, I mean literally having the hash value have nothing to do with the item you’re hashing.
Either way, terminology is weird here, but it seems like in V8’s case the word “hashing” better describes what happens to the hash code when the hashtable algorithm decides what bucket to place the object in (likely a modulus operator or a bit shift), at which point the “hash code” isn’t a hash value at all but just a unique ID that’s a seed value for the real hash function... but again, terminology can be misleading.
What are you saying here? That it does not classify as a hash function (determinism, uniformity, various forms of collision resistance, one-wayness) or that it is not a hash of the contents?
I’ve conceded elsewhere that “hashing” is an acceptable enough term for what’s happening here, but, for nomenclature I’m more familiar with, v8’s randomly-assigned “hash codes” are something I would be more comfortable calling “identifiers”, since they’re assigned values that exist as state on the underlying objects.
When I think of a hash value, I think of something that’s calculated via some properties of an object, not just a stored number.
If it is a random number it is not an identity hash function. It is just a position. And in my terminology this makes no sense here, as you normally hash the id/content of the thing you want to store, which is not happening if you return something unrelated to the object. What am I missing?
Edit:https://github.com/multiformats/multihash/issues/13 suggests this is a replacement for using the content of the object, which is not available, to replace the step of hashing something. Is that it? Then it would be a replacement for a replacement of a hash, and not a hash anymore.
A hash function takes some state and produces a nicely-distributed random-ish number from it. Given the same state, it produces the same hash.
I think you're OK with that definition of a hash function, right?
In a mutable object-oriented language like JavaScript, two objects with the same properties are not the same object. Conversely, an object before and after mutating a property is still the same object.
Thus, the "state" that matters for hashing is the object's identity, not its properties. Thus, V8's hash function produces a hash value given the identity of some object. Since there are no useful bits associated with "identity" that V8 can use to produce a hash value, it has to synthesize some at object creation time by calling rand.
(If V8 didn't move objects in memory during GC, then the object's address in memory would be a good candidate for its identity. Indeed, in language implementations where objects don't move, the address is often used as an input to the hash function.)
So I actually looked up in my text book from way back how the course I learned this stuff in defined hash functions. I wanted to understand why this felt so wrong. According to that book a hash function is merely a mapping of a String (could also be called a word) of arbitrary length to a fixed length String, no requirement for having a nice distribution or ranomdishness. I guess this helps the V8 terminology, if one remembers that the fixed length part can be somewhat ignored when looking at numerical values.
> Thus, V8's hash function produces a hash value given the identity of some object
So if I got that right V8 is using the identity of an object as hash function, but since the identity is a virtual concept it replaces the identity with a randomly generated number - or is it later on doing something with that identity value? If not still feels strange to call that a hash, but I think I have to agree that it is a valid hash function given the definition.
> a mapping of a String (could also be called a word) of arbitrary length to a fixed length String
Right. It takes a longer sequence of bits and reduces it to a shorter sequence of bits, with a resulting chance for collision.
> since the identity is a virtual concept it replaces the identity with a randomly generated number
It's not replacing anything. Here's another way to think of it. V8 needs this function:
int getHash(Object object) { ... }
Given a reference to some object, it returns a hash code for it, based on the identity of the object. The requirements are:
1. Returned values should be nicely distributed over the number space.
2. If you same object in, you must get the same hash code out. Even if the object is mutated between calls.
This sounds like a reasonable definition of a hash function, right?
If all this function had access to was the object's properties and it had to be pure (couldn't modify anything), it's actually impossible to implement this function. (This, strangely enough, touches on fundamental ideas around mutability, identity, and equality.)
So V8 relaxes the requirement that the function be pure. getHash() then generates a random number for the object, stuffs it in some hidden chunk of memory inside the object (what the linked article is about) and declares by fiat that that number is now the hash code for that object's otherwise-intangible identity.
> This sounds like a reasonable definition of a hash function, right?
Right.
> So V8 relaxes the requirement that the function be pure. getHash() then generates a random number for the object, stuffs it in some hidden chunk of memory inside the object (what the linked article is about) and declares by fiat that that number is now the hash code for that object's otherwise-intangible identity.
Thanks for writing it out like that, That's exactly how I understood it, it is good to have that confirmed.
I understand why that is called a hash in V8 now, but I feel exactly like ninkendo writes in a comment above:
> You’d just as well call it an “equality checking code”, or name it after any other use case you could think of for a unique identifier. Ruby calls its equivalent construct “object_id” and that makes so much more sense.
But your position is not wrong, this is only about what you expect to stand behind which word, and since I agree that the definition of a hash fits to that scheme this is only about connotations. Thinking of this hash as an object_id makes it clearer to me why this is done and how it could be used (and I'd absolutely use an object_id as input to a hash function, the usage does not have to differ, though that is not done here).
> You’d just as well call it an “equality checking code”, or name it after any other use case you could think of for a unique identifier. Ruby calls its equivalent construct “object_id” and that makes so much more sense.
The number V8 generates for each object is not guaranteed to be unique. Like any hash code, collisions between distinct objects may happen. That's another reason why I think it's useful to consider it a hash and not an "equality checking number" or "ID".
Objects in some programming languages (like JavaScript, but not like others such as Haskell) have an identity value. This is an abstract value. It's a theoretical thing. We pretend it's there but it's not like a usual value. It's not something you can actually see or calculate with. You can just compare them.
A hash function takes a value or a compound value and produces a simpler integer from it.
An identity hash function takes an identity value and produces a simpler integer from it.
Based on all these definitions, returning a random integer, as long as you store it and return it again the next time, is an identity hash function.
> you normally hash the id/content of the thing you want to store
That's what we're doing. You've said it yourself. We hash the id (identity) of the thing you want to store. You could call the identity the content of the object as well if you wanted to.
It's not about replacing the actual content on the object, because JavaScript maps are, by design and by definition (23.1.3.6, 7.2.10), explicitly identity maps. They don't want to consider the content of an object, except for this abstract identity content.
I'm not sure if you're arguing I'm mistaken, or the that the whole idea is wrong. But this is what many languages do and it's all very conventional terminology and practice.
> I'm not sure if you're arguing I'm mistaken, or the that the whole idea is wrong. But this is what many languages do and it's all very conventional terminology and practice.
Neither, I just felt the terminology not fitting, I'd have called it id generation and left out the hash. V8 is generating the hash/id in the hash function, not doing something directly with the input or computing it (as in, calculating) from the input, right?
Like if the hash terminology was added because one came from hash tables. I guess what I meant is that using and storing a random number here feels like a trick to fulfill the contract of a hash function without actually implementing one, since those usually look different (but I'm more used to see cryptographic hash functions). My definition of a hash function includes returning a word of fixed length, in practice that is not relevant here because the numerical representation is the same (001 = 01 = 1). But apart from that I can see now that the shown scheme fulfills the contract of a non-cryptographic hash function, the storing of the id and the virtual identity make it somehow fit.
> not doing something directly with the input or computing it (as in, calculating) from the input, right?
It is doing something directly with the input and calculating something, if you make the leap to consider that the input object has a hidden abstract value which represents the object's identity.
> An identity hash function takes an identity value and produces a simpler integer from it
I would expect an identity hash function to be exactly that: i.e. a trivial function that takes a value and returns that value unmodified (i.e. the identity function).
It happens to describe well this case because the input to the has function itself is the unique identifier and as it is already uniformly distributed, the has function can just return it unmodified.
In the simple case of using memory location as object identity (ie. in system without moving GC) and typical textbook hashtable implementation simply using the pointer as hash value is highly suboptimal, because heap pointers do not have remotely uniform distribution (low order bits are always zero due to allignment constraints, high order bits tend to be constant and usually also zero...). On the other hand you can use more advanced hashtable design that is explicitly tuned for non-uniform hash value distribution. See CPythons dict for cannonical example of that, although it's motivations are somewhat different (the major reason is that in CPython even real hash values of object's content are intentionally non-uniform to allow for simple way of causing integer-valued floats to have same hash value as ints of same value).
In this context, an 'identity hash function' means that it's a 'hash function for an identity value'. It doesn't mean 'a hash function that is an identity function'. It's confusing!
Hash codes aren't actually exposed in JavaScript. It's an implementation detail. The only requirement is that key lookup is consistent with equals [1].
For a string key, the hashcode field is probably a cached hash of the value, since that's how an equality comparison works. Most objects are only equal to themselves, are mutable, and get moved in memory, so storing a unique identifier is the only way to know if it's the same object that you saw before.
For a hash code, the identifier doesn't need to be unique since all that matters is that it's consistently placed in the same bucket. All you need is to store a sufficiently long random number.
No it isn't. This is not a unique id. Two different objects could easily have the same hash value.
People are mixing up the mathematical term "identity function", meaning a function that maps its input directly to its output, with object identity. I have never heard this usage of the phrase, as in a function that takes an object identity as input.
That's the usual approach for advanced object systems with a copying garbage collector. I do the very same. The class pointer is not unique because it might change, so just a random unique id is stored, with 0 reserved as unknown. The lookup could be done per hash, but mostly it's just praying. Similar for symbols and unmutable strings, the id's just need to be unique.
In this case it's just the class.
What would you use other than a random number? The memory address of the object? Well that may not be well distributed and it possibly leaks information.
I would use the memory address, if the GC is a non-moving collector. With a good hash function it should be fine. However, given most GCs are moving, including V8s, I can see why they chose the implementation they did.
Yeah I don't understand this. How can this possibly work for programs like this:
var s1 = readWholeFile('file.txt')
var s2 = readWholeFile('file.txt')
// now s1 and s2 are different strings, but with the same content
assert(s1 == s2)
assert(s1 is not s2)
var d = {}
d[s1] = 'xyz'
print(d[s2]) # should print xyz.
However if s1 and s2 really have random hash codes how does this work?
Is there a difference between "strings" and "objects" in v8? Mutable objects have one hash codes based on IDENTITY, whereas strings have a different hashing strategy based on VALUE? I didn't see that in the blog post.
This seems to be an abuse of the term "hash code"?
In Python, you can't hash (some) mutable objects:
mylist = []
d = {}
d[mylist] = 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
OK I guess that makes sense... I find it odd that such a random number is called a "hash code". I guess because it's an implementation detail and not exposed to the user?
When I think of user-facing hash tables, I think that the keys must respect object equality (and sometimes equality can be implemented with identity, for singletons).
But I guess JavaScript has no concept of user-definable hash functions for objects, like Python and Java do? So this is purely an implementation detail?
You can extend my example to something like:
var s1 = readWholeFile('file.txt')
var s2 = readWholeFile('file.txt')
var array1 = [0, s1]
var array2 = [0, s2]
var d = {}
d[array1] = 'xyz'
print(d[array2])
Hm I just tried this and I noticed that JavaScript decays the array1 key to a string!!! That is really bad.
But I guess there is some logic to it, because hashing with mutable keys isn't exposed to the user. Because it doesn't make sense!
But in many languages, if array1 and array2 are hashable, then they should retrieve the same value, even though they are not identical objects.
So their hash codes can't be random numbers -- they should be derived from their values. Overall I find this whole discussion pretty confusing :)
> When I think of user-facing hash tables, I think that the keys must respect object equality (and sometimes equality can be implemented with identity, for singletons).
It's hard to respect object equality when the language itself doesn't have consistent notions of what equality actually is.
Also wondering this. I found a comment on StackOverflow from 2015 [0] that might offer some insight:
> "It's probably worth mentioning that V8 goes to great lengths to avoid using a hash table for object properties whenever possible, preferring to compile known property references into direct index references rather than run-time hash lookups for performance reasons (though this is only possible sometimes - depends upon the code)."
Since this blog posts mentions they store the hash code as a property of the object, that might imply they are using "direct index references" instead of computing the hash at runtime.
It does seem like a couple layers of indirection, though. This is way outside my expertise, so I would love to hear some more comments on this.
The hash cannot be computed, because it's not really a hash at all. In JavaScript's Map and Set, Objects are considered equivalent if they are the same object instance, not if their contents are the same. So really the "hash" is a unique tag. I believe the object's memory location can not be used, as the object may be relocated, and/or a different object stored at the same location at a later time.
That's the most interesting point in the entire article, and there's no explanation associated. Why are they using random numbers for the hash code?