Hacker News
Pstore: Ruby Built-In Hash Persistence (github.com/ruby)
100 points by hstaab 3 days ago | 38 comments





I do a lot of ML and AI work nowadays... I miss Ruby a lot, especially its culture around ergonomics.

I recently needed to build an internal system that distributed workloads across many workers via a client/server model. I did the proof-of-concept using druby [1] and it turned out to be so simple and stable that we just ran with it. It'd been years since I had used that library, and I instinctively assumed we'd get the prototype out and then rebuild it as some sort of web service behind a high-concurrency web server, but druby just worked!

[1] https://github.com/ruby/drb
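
For anyone who hasn't seen druby, here is a minimal sketch of the client/server shape described above. The `WorkQueue` class, the `start_server` and `run_worker` helpers, and the "square the number" job are all made up for illustration; only `DRb.start_service`, `DRb.uri`, and `DRbObject.new_with_uri` come from the drb library. Port 0 asks the OS for a free port, so the real URI has to be read back from `DRb.uri`.

```ruby
require 'drb/drb'

# Hypothetical job queue exposed to workers over DRb.
class WorkQueue
  def initialize(jobs)
    @jobs = Queue.new
    jobs.each { |job| @jobs << job }
    @results = Queue.new
  end

  # Hands out the next job, or nil once the queue is empty.
  def next_job
    @jobs.pop(true)
  rescue ThreadError
    nil
  end

  # Workers call this to hand a result back to the server.
  def report(job, result)
    @results << [job, result]
    nil
  end

  # Drains and returns everything reported so far.
  def results
    Array.new(@results.size) { @results.pop }
  end
end

# Server side: publish the queue and return the URI workers should use.
def start_server(jobs)
  DRb.start_service('druby://127.0.0.1:0', WorkQueue.new(jobs))
  DRb.uri
end

# Worker side: pull jobs through the remote proxy until none are left.
def run_worker(uri)
  queue = DRbObject.new_with_uri(uri)
  while (job = queue.next_job)
    queue.report(job, job * job) # stand-in for real work
  end
end
```

In a real deployment the worker would run in a separate process (or on another machine) and be given the server's URI; the proxy object makes the remote calls look like local method calls.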


drb is awesome. I've had the good fortune to be able to use it once. The simplicity of it compared to anything else is amazing.

how have i never seen this!! :D

It keeps me coming back.

Ruby itself is just such an enabler.


There have been some interesting ML gems rolled in the past few years:

https://ankane.org/new-ml-gems

Any thoughts on what the Ruby community would need to build in order for it to become an attractive tool for AI work?


My guess is some kind of corporate sponsorship. Someone with deep pockets to maintain it, encourage new APIs that keep up with the latest papers, and make sure it works out of the box with the accelerator people want to use this month.

The web framework part is basically sponsored by 37signals https://37signals.com/32

Maybe that's why Ruby is best known for Ruby on Rails.


A huge cultural shift. People in scientific computing speak Python and R.

Something would need to happen that makes Ruby far more attractive. Say performance parity with Crystal or Nim.


I think it’s more than that: Julia exists and adoption is still slow. Lua and Torch were plenty fast and they were still replaced by PyTorch. I think to compete with Python you need at least a fraction of the de facto corporate sponsorship Python has in the ML space.

As a primarily Ruby dev I'd prefer the AI/ML ecosystem not be split-brained between two languages that are semantically 90% the same thing. Just learn Python and integrate the models into your Rails (or whatever) apps.

Have you tried Scala?

It's more of a cultural thing. People tend to write Ruby in a literate fashion and think critically about their APIs. Scala devs get a little over their skis sometimes playing with language features.

Don't use this. Marshal has too many issues. If you really need persistence and can't use something like Postgres, use the Ox gem instead. It's more reliable between versions of Ruby and easier to parse from other languages if you ever have to.

> use the Ox gem

The main thing is that it's part of the standard library. If you import a gem anyway, often you'd be well off with sqlite.

As for storage format, there's also:

https://ruby-doc.org/stdlib-3.1.2/libdoc/yaml/rdoc/YAML/Stor...
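
A minimal sketch of what using YAML::Store looks like, assuming a store that holds only plain hashes, strings, and integers (recent versions load the file in safe mode, so arbitrary objects may not round-trip). The `add_post` and `post_titles` helpers and the `'posts'` root key are made-up names for illustration.

```ruby
require 'yaml/store'

# Adds or replaces one post inside a read-write transaction.
def add_post(path, id, title)
  store = YAML::Store.new(path)
  store.transaction do
    store['posts'] ||= {}
    store['posts'][id] = { 'title' => title }
  end
end

# Reads all titles back inside a read-only transaction.
def post_titles(path)
  store = YAML::Store.new(path)
  store.transaction(true) do
    (store['posts'] || {}).values.map { |post| post['title'] }
  end
end
```

The interface is the same as PStore's; the difference is that the file on disk is human-readable YAML rather than Marshal output.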


I love the simplicity of YAML::Store. It was introduced in Ruby 1.8, almost 20 years ago (https://github.com/ruby/ruby/commit/55f4dc4c9a5345c28d0da750...).

I even created a little gem when I was starting with Ruby, 10 years ago, that was a very thin wrapper around it so that I could play around using an ActiveRecord-like syntax (https://github.com/brunnogomes/active_yaml). I used it in some pet projects so I could do stuff like:

  p = Post.new
  p.title = "Great post!"
  p.body = "Lorem ipsum..."
  p.save

  Post.all # => [#<Post:0x895bb38 @title="Great post!", @body="Lorem ipsum...", @id=1>]

  Post.find(1) # => #<Post:0x954bc69 @title="Great post!", @body="Lorem ipsum...", @id=1>

  Post.where(author: 'Brunno', visibility: 'public')
  # => [#<Post:0x895bb38 @author="Brunno", @visibility="public", @id=1>, #<Post:0x457pa36 @author="Brunno", @visibility="public", @id=2>]

And have access to the data directly in the YAML files.

Good times!


The problem with YAML is that meaningful whitespace means that the size grows quickly for highly nested documents. I don't love XML, but there is a reason I recommended Ox. I've used it for real projects and it never fell over like so many of the alternatives I've tried where databases were not in the cards.

The problem with XML is that angle bracket expressions take up too much space because you need to duplicate element names. I don't love JSON, but there is a reason I recommend OJ.

...

The problem with JSON is that the keys take up too much space because they are duplicated. I don't love BSON, but there's a reason why I recommend bson-ruby.

And I could keep going... ;)

The benefit of using YAML is precisely that there's meaningful whitespace. Different strokes for different folks.


I don't get the value of "it's in the standard library". Ruby has the amazing (for scripts) require "bundler/inline", which lets you keep code and Gemfile in a single file and auto-installs the dependencies, so sticking to the standard library doesn't seem to provide any practical value except offline support.

I used pstore for an ad-hoc monitoring service on an outdated Windows server running an outdated Ruby version. It was easy to set it up to run from Task Scheduler every five minutes and check the resident memory of an old Ruby service, logging the RAM and killing/restarting it if it was over 1 GB (all this on 32-bit Ruby with the limit of 4 GB address space per process).

Sure there are many things that "should" have been fixed above - but just having any old ruby version on hand was enough to help check for a memory leak and mitigate it - while taking the time to figure out if the leak could be plugged.

And offline support (a server in dmz/locked down wrt new software) is big too!
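
A rough sketch of the kind of script described above, using nothing outside the standard distribution. The helper names, the `:samples` key, and the kilobyte units are assumptions for illustration; how you actually measure resident memory (and restart the service) is platform-specific and left out.

```ruby
require 'pstore'

# Appends one memory sample to the log kept in the PStore file.
def record_sample(path, timestamp, rss_kb)
  store = PStore.new(path)
  store.transaction do
    store[:samples] ||= []
    store[:samples] << { at: timestamp, rss_kb: rss_kb }
  end
end

# True if the most recent sample exceeds the limit (false if there are none).
def over_limit?(path, limit_kb)
  store = PStore.new(path)
  store.transaction(true) do
    last = (store[:samples] || []).last
    !last.nil? && last[:rss_kb] > limit_kb
  end
end
```

A scheduled run would call `record_sample`, then restart the service when `over_limit?(path, 1_048_576)` returns true.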


Is Marshal still tied to Ruby version? Boy was this fun about ten years ago for a system I inherited that Marshaled huge complex objects into TokyoTyrant and back. You try migrating or upgrading a system where the runtime version is tied to EVERY object in a database.
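
One thing that can be checked directly: every Marshal dump starts with a two-byte format version, and the Marshal docs say data normally loads only with the same major version and an equal or lower minor version. The sketch below just reads those bytes; the helper names are made up. Note this covers only the wire format. Whether old dumps still load also depends on the classes they reference still existing with a compatible shape, which this check says nothing about.

```ruby
# Returns [major, minor] read from the first two bytes of a Marshal dump.
def marshal_version_of(bytes)
  bytes.unpack('CC')
end

# Applies the documented rule: same major, and minor no newer than ours.
def loadable_here?(bytes)
  major, minor = marshal_version_of(bytes)
  major == Marshal::MAJOR_VERSION && minor <= Marshal::MINOR_VERSION
end
```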

> too many issues

Such as?


Marshal is Ruby's version of pickle in Python: it serializes arbitrary objects, which means that correct deserialization requires arbitrary code execution.

This is bad enough on its own, but it also makes pivoting a file read/write primitive into code execution much easier.
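
A harmless way to see the point: `Marshal.load` calls hooks such as `marshal_load` on whatever classes the byte stream names, so the bytes decide which code runs. The `Tattler` class below is invented for the demo and only records that it was loaded; it is an illustration of the mechanism, not an exploit.

```ruby
# Hypothetical class whose load hook has a visible side effect.
class Tattler
  LOADED = []

  def initialize(note)
    @note = note
  end

  # Called by Marshal.dump to get the data to serialize.
  def marshal_dump
    @note
  end

  # Called by Marshal.load while deserializing; runs before the
  # caller ever sees the resulting object.
  def marshal_load(note)
    @note = note
    LOADED << note
  end
end

# Stand-in for code that reads a PStore file or any other Marshal data.
def load_untrusted(bytes)
  Marshal.load(bytes)
end
```

Simply calling `load_untrusted` on a dump containing a `Tattler` appends to `Tattler::LOADED`, even if the caller discards the result.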


Why the "don't use it"? Just say "use it with caution" or, since we are being rude telling people what to do whenever pickle or marshal comes up, just don't say anything and assume people know what they are doing.

I don't think I phrased that in a particularly rude way, but I'm sorry if it came across as rude.

The answer is that we have serialization techniques that are as good on all the dimensions that matter (speed, serialized size, etc.) and better in terms of security. Pickle and Marshal are, at best, footguns in otherwise very safe language ecosystems.
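
As one example of such a technique (my choice for illustration, not necessarily what the commenter had in mind), a JSON round trip with the standard `json` library yields only hashes, arrays, strings, numbers, booleans, and nil by default, so loading cannot instantiate arbitrary classes. The helper names are made up.

```ruby
require 'json'

# Serializes plain data to a JSON string.
def dump_settings(settings)
  JSON.generate(settings)
end

# Parses JSON back into plain hashes, arrays, strings, and numbers.
def load_settings(text)
  JSON.parse(text)
end
```

The trade-off is that custom objects have to be converted to and from plain data explicitly.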


https://github.com/ruby/psych defaults to only loading permitted classes since 4.0 so that seems less of a concern now?
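
For reference, a small sketch of what the restricted loading looks like when requested explicitly with `YAML.safe_load`. The `parse_config` name and the choice to permit only `Symbol` are assumptions for illustration.

```ruby
require 'yaml'

# Parses YAML while refusing to instantiate anything beyond basic
# types plus the explicitly permitted classes.
def parse_config(text)
  YAML.safe_load(text, permitted_classes: [Symbol])
end
```

Plain mappings and scalars parse normally, while a document tagged with something like `!ruby/object:Object` raises `Psych::DisallowedClass` instead of building the object.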

`psych`, used for YAML, is a different thing than Marshal. pstore uses Marshal. https://ruby-doc.org/core-2.6.3/Marshal.html. I don't believe psych will be involved with pstore.

I'm honestly not sure, though, how much I should be worried about the fact that someone who has write access to my database can maybe escalate that to an arbitrary code execution if I use pstore. Literally not sure. Write access to my DB seems pretty disastrous already...


Pickle is fine (in a pinch). It's not meant for untrusted data.

Anything is fine when the data is trusted. The problem is that the data is almost never actually trusted :-)

Interesting. Transactionality is implemented via a regular thread lock, which means that in a concurrent Rails app where this library is used in a hot path you might suffer some contention. It's best used for marshaling data in non-hot paths such as standalone scripts or app startup. I only say this because it's quite different from what you'd expect of transactions in the SQL sense.
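
A small sketch of what that serialization looks like in practice, assuming one PStore object shared between threads and created with the thread-safe flag (the second argument to `PStore.new`). The helper names and the `:hits` key are made up. Each transaction reads the whole file, runs the block, and writes the file back, so the threads take turns rather than running in parallel.

```ruby
require 'pstore'

# Increments a counter inside one read-write transaction.
def bump_counter(store, key)
  store.transaction do
    store[key] = (store[key] || 0) + 1
  end
end

# Runs several threads against one shared, thread-safe store and
# returns the final counter value.
def hammer(path, threads: 4, bumps: 25)
  store = PStore.new(path, true) # true enables the internal mutex
  workers = Array.new(threads) do
    Thread.new { bumps.times { bump_counter(store, :hits) } }
  end
  workers.each(&:join)
  store.transaction(true) { store[:hits] }
end
```

No increments are lost because every read-modify-write happens under the lock, but that same lock is the source of the contention mentioned above.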

Note, this is a wrapper around Ruby’s Marshal class.

Mentioned in the linked article.

I would think this would have limited usefulness for most web applications as the latest trend for web apps is to think of the deployed code as ephemeral, and local files are not something devs often rely on. I guess if you're mounting block storage or some other virtual file system that would be another thing. For non-web applications, this could be a simplistic replacement for what people often use sqlite for. The readme doesn't talk much about concurrent access to the store other than the transactions, so concurrent operations may also be a limitation.

pstore has been a built-in with Ruby stdlib for as long as ruby has existed, so _over_ 20 years.

I'm assuming it pre-dates Rubygems because it really should be a gem. I can't speak for Japan but few people in the Western world seem to use it.

There was a time when some stuff was being extracted (removed) from Ruby core and becoming gems, and I really thought PStore and YAML::Store were going to be among those, but no, they decided to keep them in core. So maybe there are some important enough use cases that justify them being there.

Or maybe it would be a hard task that didn't justify the effort.


Many parts of the stdlib are slowly being gemified. That's the case for `pstore` too, which is why it has its own repo.

It's now no longer technically stdlib, but a "default gem", a gem that is installed by default with ruby, see: https://stdgems.org/

For a few years now, every version has removed one or two rarely used default gems. The Ruby core team just doesn't like big breaking changes.


Pstore also uses Marshal behind the scenes, so I assume it has the same caveats you see in other comments on this thread.


