Hacker News
Ruby's hash is a Swiss-army knife (akshaykhot.com)
168 points by software_writer on Aug 20, 2023 | 77 comments



Since Ruby 3, the automatic coercion of keywords to a hash—the second example underneath "Passing Hash to Functions" in this post—is considered a legacy style and is generally frowned upon in new code. That is to say, code like the second call to `foo` here:

    def foo(kwargs = {})
      kwargs
    end
    
    foo({k: 1})  # ok: passing hash argument
    foo(k: 1)    # ok: keywords coerced to hash
One of the strongest arguments for avoiding this sugar is that it makes the code more brittle in the face of future changes. In particular, in the example above, if we add a new keyword argument to `foo`, then any call which omitted the curly braces will break, while calls which used them will keep working fine:

    # added a new keyword arg here
    def foo(kwargs = {}, frob: false)
      kwargs
    end
    
    foo({k: 1})  # still ok: `frob` defaults to false
    foo(k: 1)    # ArgumentError: unknown keyword: :k
This is touched on in the blog post describing the extensive changes made to keywords in Ruby 3: https://www.ruby-lang.org/en/news/2019/12/12/separation-of-p...
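For new code, one unambiguous alternative is to declare the keywords explicitly with a double splat; a minimal sketch:

```ruby
# Explicit keyword parameters: the call site is unambiguous, and adding
# another keyword later won't silently re-interpret existing calls.
def foo(**kwargs)
  kwargs
end

foo(k: 1)  # returns {k: 1}
```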


This might be me having huffed too many types lately, but I feel like I would want that to break.


Adding a parameter that has a sensible default value? I don't think I'd want that to break anything.


The inverse here - specifying {} directly - does not at a glance imply a default value is being constructed.

(Or perhaps it does and this is just some Ruby-ism I'm not exposed to)


That's kind of the point of default values, though. If you wanted to make a change to the method and you want it to break to alert you to all the calls, then you can add a keyword without adding a default value for it.

    def foo(kwargs = {}, frob:)
      kwargs
    end


You're right, thanks for the demo and sharing the link, really appreciated. I'll update the article to mention this.


My favorite little-known fact about Ruby hashes is that they respond to `to_proc` and can be used as procs. For example, you can do this:

    a = { 1 => 'a', 2 => 'b' }

    [1, 2, 3].map(&a)
    #=> ['a', 'b', nil]


One of the most beautiful things in Ruby that I have ever seen is this fibonacci code.

  fib = Hash.new do |k, v|
    next 1 if v == 0 || v == 1
    k[v-1] + k[v-2]
  end


With caching...

   fib = Hash.new do |k, v|
     next 1 if v == 0 || v == 1

     unless k.key? v
       k[v] = k[v-1] + k[v-2]
     end

     k[v]
   end


Wouldn't you get the same result using the idiomatic memoization syntax?

    k[v] ||= k[v-1] + k[v-2]


No, because that is equivalent to

    k[v] || (k[v] = k[v-1] + k[v-2])
And that first k[v] (unlike k.key?(v)) will trigger the Hash.new block again, so it'll recurse until it runs out of stack. But neither check is necessary, because the Hash.new block will only ever get called if k.key?(v) is false.

If you want a more compact version, you could do:

    fib = Hash.new do |k,v|
      next 1 if v == 0 || v == 1
      k[v] = k[v-1] + k[v-2]
    end


That was the first thing I tried, but it blew the stack :) You can still blow the stack if you pick a number that's too high, like k[20000] or something. But if you pick a lower number & cache that, then you can (eventually) call a higher number without blowing the stack. This recursive approach is a horribly inefficient algorithm anyway, so I don't think it's worth optimizing :)


But caching isn't required here. Hash.new calls the block only if the value isn't initialized.


I just did it for fun. This particular recursive approach is super slow for numbers of nontrivial size, so I was just curious if I could even make the caching work in the block. It's not worth optimizing a suboptimal algorithm when a more efficient option is available anyway.


You don't need the `unless k.key? v` guard. The `Hash.new` block only gets called when the key is not present in the hash.


The caching makes it faster, the trade-off being more memory use. I just wanted to see if it'd work.


Let’s make it a one liner :D

    fib = Hash.new { |hash, key| hash[key] = key < 2 ? key : hash[key-1] + hash[key-2] }

Example:

    fib[123]  #=> 22698374052006863956975682

Makes use of memoization.


Yes, I was going off of memory and thought that it was memoising it, but I made a mistake and fixed it.

  fib = Hash.new do |k, v|
    next 1 if v == 0 || v == 1
    k[v] = k[v-1] + k[v-2]
  end
I like yours better though, it seems a lot simpler with the less than 2 check.


You can also recursively define fib as a lazy sequence that is the pairwise sum of fib and fib shifted by one. (Clojure, from Rosetta Code).

    (def fib (lazy-cat [0 1] (map + fib (rest fib))))

    => (take 10 fib)
    (0 1 1 2 3 5 8 13 21 34)


    Explanation:

       0 1 1 2 3 5   ;  this is fib
    +  1 1 2 3 5 8   ;  this is (rest fib) 
    ---------------
       1 2 3 5 8 13  ;  this is (map + fib (rest fib))
                     ;  and the sequence needs to be initialized with (lazy-cat [0 1] ...


Or an Ackermann:

    A = Hash.new { |a,(m,n)| a[[m,n]] = m==0 ? n+1 : n==0 ? a[[m-1,1]] : a[[m-1, a[[m, n-1]]]] }

    A[[3,4]] #=> 125

    A.inspect #=> ... long
However, the application utility of a self-populating lazily-evaluated lookup structure goes further. A hash with a default function works great as a simple caching wrapper for all manner of results, for example when talking to slow or fine-grained APIs.
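A sketch of that caching pattern, with a hypothetical `expensive` lambda standing in for a slow API call:

```ruby
calls = 0
# Hypothetical stand-in for a slow or fine-grained API call.
expensive = ->(key) { calls += 1; key.to_s.upcase }

# The default block runs only on a cache miss, then stores the result.
cache = Hash.new { |h, key| h[key] = expensive.call(key) }

cache[:ruby]  # computes and caches "RUBY"
cache[:ruby]  # returns "RUBY" from the hash; `expensive` ran only once
```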


I don't quite understand how this code works. Where does the `nil` come from? What operation are we performing on 3 that causes it to return `nil`?


1, 2, and 3 are being passed as lookups to the hash `a`. 3 is not a key in the hash, hence nil.


Ah, that makes sense. Thank you!


That's cool, I didn't know that!

But now I'm wondering where that would be a better solution than just using the `Hash#values_at` method...?
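For a one-off lookup they're equivalent; the `to_proc` trick mainly pays off because the hash can stand in for a block anywhere, not just in `map`. A sketch:

```ruby
a = { 1 => 'a', 2 => 'b' }

a.values_at(1, 2, 3)  # => ["a", "b", nil]
[1, 2, 3].map(&a)     # => ["a", "b", nil]

# But &a composes with any block-taking method, e.g. dropping misses:
[1, 2, 3].filter_map(&a)  # => ["a", "b"]
```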


It is just a hash map with a few common functions defined; hash maps are occasionally useful, but what is all the praise about?

The mentioned "simple" bit is arguable: as a language's building block, a hash map is relatively complex and specific, since those can be built out of lists (or trees, though lists/arrays may be preferable for efficiency), which can be built out of pairs (product types, cons cells, tuples; unless going for that efficiency, though it can still be pretty efficient with trees). Maybe it is one of those "simple versus easy" mix-ups.


Well yeah, it's "simple" (or easy) to use, definitely not simple to implement. But having an easy-to-use hashmap is par for the course for most newer languages - not only Ruby, but also PHP (associative arrays), JS (objects), Go (built-in map type) etc.


This is a lovely overview. Hash is a great example of how delightful it can be to program in Ruby.

One more technique worth noting is the chained functional style of using Hash, which you can do in Ruby because Hash inherits from Enumerable. If you're prototyping a script to do some data-cleaning, this makes it easy to build up your pipeline and iterate on it. For example:

    foobar = { ...your data here... }
    foobar.map do |k, v|
      # ...
      # do some transformation here
      # ...
      # and then return the replacement key/value for this entry
      [k, new_value]
    end.select do |k, v|
      # do some filtering here, e.g.:
      is_foobarable?(k, v)
    end.map do |k, v|
      # ...
      # do some expensive transformation here on the smaller data set
      # ...
      [k, newer_value]
    end.to_h
(Note that you have to call #to_h at the end since the Enumerable functions will coerce the hash into an array of arrays.)

Now your code literally shows the pipeline that your data is falling through — and each of these steps can be side-effect-free, with no mutations to the original foobar structure.


Unless things have changed and Ruby has stream fusion now, this is bad advice at scale. You are iterating over a fat object multiple times. Even if it's uglier, it's much better in this case to create an empty array/hash, iterate once with #each, and append into it.

I worked at the largest Rails shop in the world and this would be rejected in code review.

Edited to add more detail: the only method you need to write to implement Enumerable is #each. Every step of your pipeline here is _another_ call to #each. Just do it once.


> I worked at the largest Rails shop in the world and this would be rejected in code review.

Not sure if this means GitHub or Shopify. Until earlier this year I worked at GitHub for a decade, leaving as a principal engineer, primarily writing Ruby.

This would not be rejected at code review there unless the Hash had e.g. millions of values and, even then, it might not be a meaningful performance problem in context.

If the Hash is very small and will always be: readability trumps Big-O "performance" when n is very small.

Am I being pedantic and appealing to authority? Yup but, well, you started it and I hate to see helpful "Ruby is nice" comments like the grandparent get crapped on for no good reason.


There's #lazy to turn things into a lazy enumerator, to be iterated over when you so desire with e.g #force.

If you're going to iterate over an accumulator variable, use the for keyword instead of each, it's faster.

Alternatively, one can use .reduce({}) { |h, (k, v)| ... h } or .each.with_object({}) { |(k, v), h| ... } which makes the block not close over an external variable, and makes the assignment to that variable "atomic" (wrt the hash construction, the variable will only contain the final result, that is if a final variable is needed at all, which it may not with implicit return of the last value)
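A sketch of the `each_with_object` variant, building the result hash in a single pass:

```ruby
h = { a: 1, b: 2, c: 3 }

# One pass over h: transform and filter into the accumulator hash.
result = h.each_with_object({}) do |(k, v), out|
  out[k] = v * 10 if v.odd?
end

result  # => {a: 10, c: 30}
```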


Yeah, that's a great demonstration of the technique.

One tiny tip.. :) you can pass a block to `.to_h`, so instead of using `.map` + `.to_h`:

    h.map { |k,v| [k, v] }.to_h
.. it can be simplified to:

    h.to_h { |k,v| [k, v] }


Ruby’s Hash is probably the handiest data structure I’ve ever encountered. Thanks Matz.


Definitely! As a matter of fact, this is the default data structure I use when writing Ruby ETL code (e.g. https://github.com/thbar/kiba/wiki).

Methods like "except" (https://docs.ruby-lang.org/en/3.2/Hash.html#method-i-except) or "fetch" (raising an error on missing key) are very convenient to write defensive data processing code!

Similarly, in Elixir, I use Maps a lot for the same type of jobs (https://hexdocs.pm/elixir/1.15.4/Map.html), with similar properties.


Kiba looks like a really cool framework, thanks for posting it!


I wish it was typed though. So many times I’ve seen a function that takes a hash of “options” or “config” and have no idea what that actually contains. Even for official rails methods it’s often complex to know what the possible options are. Some of them seem almost internal with how obscure they are.


RBS+steep to the rescue! We typed our configuration this way for ddtrace Ruby. On the external (set) side it makes it very easy to explore configuration in an editor, on the internal (get) side it makes us ensure we don't make mistakes.


I've been in Python/Django for about a year now and I really miss Ruby's hash vs the dict.

`.dig(:key, :key, :etc)` is so nice to find deeply nested data without blowing up.

One thing I don't miss is knowing whether a hash's keys are strings vs symbols. While it's easily solvable, I've definitely lost time only to smack myself that I need to use a str but was always feeding a sym and swore that this should be a sym based hash.
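A sketch of the `dig` behavior (hypothetical nested data):

```ruby
data = { user: { address: { city: "Pune" } } }

data.dig(:user, :address, :city)  # => "Pune"
data.dig(:user, :phone, :number)  # => nil, no exception on the missing branch
```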


I don't know ruby and if it's the same, but if you're not using a library like funcy or toolz for a nested get helper, you can do `dict.get('key', {}).get('key2', {}).get('key3')`. Not the prettiest, but can do in a pinch.


Thanks! i didn't know about these and will check them out for sure. Tired of `if key in hash:` nested layers.


In my opinion, a better alternative to nested `if key in hash`:

    try:
        value = data["foo"]["bar"]["baz"]
    except KeyError:
        value = None


I know it's been almost a year but I still haven't accepted this pattern.

Using errors as general flow control makes me uncomfortable. It shouldn't be an error or exception except in.... Actual problems.


It should be noted that the ** operator works like .merge, and also respects the order of key definition within a hash literal (a key defined earlier gets overwritten if the same key appears later in the same literal).
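A sketch of that ordering behavior:

```ruby
defaults = { retries: 3, timeout: 5 }

# Later entries win, exactly like Hash#merge: splat the defaults first
# so explicit keys override them...
{ **defaults, timeout: 10 }  # => {retries: 3, timeout: 10}

# ...or splat them last so they win.
{ timeout: 10, **defaults }  # => {timeout: 5, retries: 3}
```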


Hash is so powerful in Ruby that people often overuse them.

One of the most common issues I found on Ruby code-bases is to not create classes to represent their domain and simply use hashes everywhere.

The downside is that a hash has no shape. It can (and will) be anything you want it to be, often causing havoc once the system grows.

Checks for keys everywhere. Almost all statements use the safe navigation because you never know what shape you're dealing with. Multiple places performing the same map/reduce/filter/etc. All because people stick to hashes a bit too long.


Amen. This is an issue at the company I work at. Common typos when looking up hash keys will return nil - this has a tendency to silently keep working and blow up with a runtime error further down the chain. I am trying to insist on using .fetch to force an exception.

When the company switches to 3.2 I will insist on everyone using the new Data class for value objects rather than hashes.


Ruby's newly added type system can also help here. For starters, it'd be nice to know what type(s) the keys and values can be.


That's what POROs are there for. I truly don't want a faux type system only to then keep using hashes for everything...


This is perhaps the biggest killer feature of typescript: that your object literals (which are basically hashes) can have types. Or possibly interfaces.


> as a programmer building run-of-the-mill CRUD web applications, having so many choices can be really daunting and confusing. When do you choose which type?

I feel similarly after getting back into Java after years of Javascript. Java has a ridiculous number of arrays and lists: Object[], List, Iterator, Iterable, Stream, and I think I'm forgetting a few. Some functions want or return one, some want or return another. I'm constantly convertings these from one to the other, and it's never pretty. (What the hell is a Spliterator?!)

Meanwhile, in Javascript, there's just the Array. (It's not even really an array, but a hashmap that only takes integers as index, but nobody cares; it works.)

And, like Ruby, Javascript also has easy object literals that is basically the exact same as the hash in Ruby. If there are any meaningful differences, I'd love for someone to enlighten me. The only difference I'm aware of, is that the preferred type for keys in Ruby is the Symbol, which is a very lightweight string, whereas in JS it's just a regular string.

But, like the author, I strongly prefer to work in a language that just has a couple of sensible data types instead of making me choose from thousands that have no meaningful differences beyond their incompatibility.


For me the most useful bit about #fetch is not that it throws, it's that it allows setting a default value while distinguishing between a missing key and a key with nil as a value. If I want to handle key presence without a default I just use #key? as exceptions come with additional performance cost (building the stacktrace)
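A sketch of that distinction:

```ruby
h = { a: nil }

h[:a]               # => nil (key present, value is nil)
h[:b]               # => nil (key missing: indistinguishable from the above)

h.fetch(:a, :none)  # => nil   (key present, so the default is ignored)
h.fetch(:b, :none)  # => :none (key missing, default returned)
h.key?(:b)          # => false (presence check without the exception cost)
```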


Reminds me of https://steve-yegge.blogspot.com/2008/10/universal-design-pa...

with its observation that since hash table is O(1), it may mock almost any other data structure.


The only thing I really miss in the default Ruby hash is property access for keys like JavaScript has.

I know, I know, OStruct does this but it's not the thing that gets built from hash literals so it's less convenient.

It's a minor quibble


OpenStructs are ungodly slow anyway. If you want such accessors you probably want a plain Struct anyway and not mix hashness and method accessors. You can even define additional methods on structs:

    Foo = Struct.new(:a, :b, keyword_init: true) do
      def frobz
        ...
      end
    end
Structs are quicker to instantiate than classes but slower to call methods on, so if you instantiate a lot but seldom call they're great. Otherwise a class is better.
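A usage sketch of such a struct (the `sum` method here is hypothetical):

```ruby
Foo = Struct.new(:a, :b, keyword_init: true) do
  # extra method defined on the struct, just like on a class
  def sum
    a + b
  end
end

f = Foo.new(a: 1, b: 2)
f.sum   # => 3
f.to_h  # => {a: 1, b: 2}
```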


The new Data class in Ruby 3.2 also does the job (although it's intended for immutable data)

https://docs.ruby-lang.org/en/3.2/Data.html
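A sketch, assuming Ruby 3.2+ (`Data.define` and `#with`):

```ruby
Point = Data.define(:x, :y)

p1 = Point.new(x: 1, y: 2)
p1.x  # => 1

# Immutable: updates return a modified copy instead of mutating.
p2 = p1.with(y: 5)
p2.y  # => 5
p1.y  # => 2, the original is unchanged
```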


Also that it maintains insertion order (since Ruby 1.9) opens up more use cases.


Do you know any use cases, other than iteration? Would love to know any interesting ones. Thanks!


Yeah, you can use it to implement a simple (but not necessarily very performant) LRU: https://github.com/SamSaffron/lru_redux/blob/037ee594aded597...
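A minimal sketch of that idea (not the linked gem's implementation), relying only on insertion order plus `Hash#shift`, which removes the oldest entry:

```ruby
# A tiny LRU cache built on hash insertion order.
class TinyLRU
  def initialize(max_size)
    @max_size = max_size
    @store = {}
  end

  def [](key)
    return nil unless @store.key?(key)
    @store[key] = @store.delete(key)  # re-insert to mark as most recently used
  end

  def []=(key, value)
    @store.delete(key)  # avoid a stale position when overwriting
    @store[key] = value
    @store.shift if @store.size > @max_size  # evict the oldest entry
  end
end
```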


Very cool, thanks for sharing!


In my opinion Lua's equivalent, tables, are more of a Swiss army knife.

It's been a while since I've used them but afair they combine hash/dict, array and objects in one data structure.

They're hashes as usual. With a special constructor or int keys they turn into arrays, last I looked the Lua interpreter even optimizes contiguous ranges of keys as arrays. And by adding metatables and metamethods they also emulate JS-style objects.


My feeling with Lua is that it's lacking arrays due to minimalism, as in "we don't need arrays because tables will do". Probably the same reason why there is no "continue" in a for-loop.

In comparison, Ruby has arrays and hashes, and gives me the feeling "yes these do exactly what you expected, but wait there is more!" In that sense it does feel more like a Swiss army knife.


On the contrary, they acknowledged that hash tables only was pushing it a bit too far, so they implemented arrays sort of "on the side" of tables.


While I share the praise for Ruby's hash, I find the argument that it might be daunting for a developer (of any persuasion) to pick between a `SortedList` and a `Dictionary` worrying.


PHP's associative array is very similar yea?


Yes, hashes in Ruby, associative arrays in PHP, maps in Go[1], dictionaries in Python[2] and C#[3] represent the same concept, a collection of key-value pairs.

[1]: https://gobyexample.com/maps [2]: https://docs.python.org/3/tutorial/datastructures.html#dicti... [3]: https://learn.microsoft.com/en-us/dotnet/api/system.collecti...


The interesting thing is that it definitely feels different to me.

While the basic data structure is obviously identical to its counterparts, the way it interacts with the ecosystem as a whole makes it feel way more powerful. The interaction of hashes, symbols, and blocks lead to a language which feels like it is designed with DSLs as first-class citizens. This leads to things like Ruby on Rails, which at first glance feels like pure magic.

I get a somewhat similar feeling about lists in Python. Because of its very extensive list comprehension you start thinking in lists, leading to some very clean code which would be extremely awkward to implement in many other languages.


> The interaction of hashes, symbols, and blocks lead to a language which feels like it is designed with DSLs as first-class citizens.

Totally agree.


Yes. The OP's post is basically "baby's first API." Hashes are a very handy API, but relying on them at scale is much less efficient than structs or arrays. The post doesn't even go into the different performance characteristics of these APIs, which can be a major footgun.

I've noticed a lot of novice-level posts on here lately. Although I admit the post was very well written.


Novice level posts are great. Someone found something exciting to them, not new to us, but we should encourage them to keep exploring and writing.


Also, the votes imply interest!


I find it funny that he said he didn't know what to pick between 5 types of IDictionary in C#.

* Hashtable

* SortedList

* SortedList<TKey, TValue>

* Dictionary<TKey, TValue>

* ConcurrentDictionary<TKey, TValue>

I don't even know C# (just Java), yet I know that you'd probably want to use Dictionary<TKey, TValue> most of the time.


To be honest, I wrote that just to add a little drama ;)


My experience is that Struct only performs better than Hash in a pretty narrow set of circumstances, rarely enough to be a deciding factor unless profiling suggests it to be.


Ruby structs or c-family structs?


Selling it as associative array is nice. (And I know the docs do that, too)

It's a doubly linked list, array, hashmap all in one, with some weird magic when handling numeric indices.

When using another language, I often find I made a bad choice by picking an array, hash, or list, only to later need one of the other properties PHP would give me by default. The extra cost for most uses is negligible.


> It's a doubly linked list, array, hashmap all in one, with some weird magic when handling numeric indices.

In what way, exactly, is PHP's associative array (a hash table of course, like all these things) a doubly linked list ?

Sure, in a language which isn't very fast anyway, I can believe it's reasonable to use your hashtable as an array, O(1)~ or O(1)* [Expected or Amortized constant time] probably feels enough like 1 [a single operation] in this language that you shouldn't worry about it. Explaining to beginners why they can't write digits["four"] is more trouble than it's worth, so how about just make it work instead.

It is a hashmap, so, that makes sense. These are different words for essentially the same type.

But how is it a doubly linked list ?


The implementation does seem to have changed, and I haven't spent much time thinking through the current one.

But the old implementation had pointers to next and last (previous) element in the list: https://github.com/php/php-src/blob/PHP-5.3/Zend/zend_hash.h...


Oh I see. Yes, this is a very common way for C programmers - in particular - to make a hash table, it's the design that C++ enshrines as std::unordered_map and it's the data structure you'd have been taught in a University data structures course in the 1980s, and maybe even the 1990s [depending on who is teaching and where]

While internally this is indeed using linked lists, I don't think you can make use of that from PHP which is what I'd imagined was meant. Suppose I had a million integers that I found somewhere, and then I want to splice in a different million integers I have, so that in total there are two million integers. With linked lists that's very cheap, basically instant, but with most data structure, including a hash table, not so much.

Linked Lists are very rarely what you needed, even though when they are what you needed they're invaluable. They are really easy to implement in C, so it's sort of a "I've got a hammer, now all I see are nails" situation in C.


Well, the hash table also uses lists for handling collisions, but it additionally uses one to doubly link all elements.

And it's true, on modern hardware linked lists are often bad, since poor data locality makes iteration slow ... but that data structure was designed before that mattered (and, as said, it has been revamped since ...)

But it's quite fancy. If you use it with "proper" indexed keys, the data is packed into an array (or vector, if you prefer C++ terms; though most data stored in there is just pointers, taking away some benefit of the packing...) while also keeping the doubly-linked-list pointers. Once you have holes in it, or string keys, or something similar, it behaves as a hash and gives up some of the ordered storage ...

In the end that extra bookkeeping everywhere makes it slow and bloated for everything, but from a usage perspective it's quite nice.



