Avi Bryant: MagLev recap (avibryant.com)
30 points by toffer on June 1, 2008 | 21 comments


All this hype about MagLev, OODBs and Smalltalk that has come out of this Rails presentation has got me wondering about how some of this stuff works.

As I understand it, coding in Smalltalk involves developing code in an 'image' that is persisted to disk when you exit (sort of as if it dumps the entire memory contents to disk in a binary form that can be read back in). What sort of applications is Smalltalk suitable for? I know Seaside is a web framework, but a lot of the examples I can find show off more traditional GUI apps.

Does the image approach mean that Smalltalk is not really suitable for quick and dirty sys admin type scripts that you would use Perl or Ruby for?

I am fairly sure that HN doesn't use a traditional DB and, if I recall, PG said that Viaweb didn't either. With Viaweb, when a user logged into their admin console, it loaded their environment into a process (and process memory) and kept it there (I think I read that somewhere).

Does that mean that each user had their own Lisp process fired up and kept around, and that data was just kept in process memory (and important stuff persisted to disk somehow)?

What I was thinking then is that if you had a web-app like Basecamp for instance (i.e., accounts with no shared data between them), would it be technically feasible to have a different Smalltalk image for every single account that is loaded only when that account is being used, with basically all the account data inside the image? I am thinking memory could be a serious limiting factor here!

Sorry for the long rambling post - just trying to get my head around this stuff - hopefully someone on here knows something about it ...


As gnaritas posted below, yes, you could do that, yes, that's what Dabble DB does, and yes, memory is a serious limiting factor. That's where Gemstone comes in: unlike the Squeak VM, which we use, the Gemstone VM does not expect all objects to be in memory at all times, and can lazily load them in as needed.
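
To make the difference concrete, here is a minimal Python sketch of the lazy-loading idea. It is not how the Gemstone VM is implemented; `LazyStore` and its methods are hypothetical names, and a dict of pickled bytes stands in for the disk.

```python
import pickle


class LazyStore:
    """Toy object store: objects stay serialized until first use."""

    def __init__(self):
        self._on_disk = {}    # object id -> pickled bytes (stands in for disk)
        self._in_memory = {}  # object id -> live object
        self.loads = 0        # number of objects faulted into memory

    def put(self, oid, obj):
        self._on_disk[oid] = pickle.dumps(obj)

    def get(self, oid):
        # Deserialize only on first access; later calls reuse the live object.
        if oid not in self._in_memory:
            self._in_memory[oid] = pickle.loads(self._on_disk[oid])
            self.loads += 1
        return self._in_memory[oid]

    def resident(self):
        """Ids of the objects currently held in memory."""
        return sorted(self._in_memory)


store = LazyStore()
for i in range(1000):
    store.put(i, {"row": i, "payload": "x" * 100})

first = store.get(42)   # faults in a single object
again = store.get(42)   # already resident, no second load
```

Only the one object that was asked for becomes resident; the other 999 stay serialized, which is the property a Squeak-style "everything in RAM" image does not have.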


Avi, you guys are seriously going to open up the stuff that's above the VM? I had been thinking about again picking up my own project to put Ruby on top of Smalltalk. I'm envious of your approach with working with the VM hackers to get custom bytecodes. This would be great for getting Ruby to run fast.

What of parallel projects on other Smalltalk VMs? It would be good for the Ruby community to have options. Gemstone might want to keep some sort of competitive advantage, however. (Email this userid at yahoo.com. LinkedIn is not letting me send you a message, and so fulfilling their policy of keeping out riff-raff.)


I think and hope we seriously will, but it's not up to me. Certainly one side effect would be that, with a little work, you would be able to run Ruby on other Smalltalk VMs as well, and I think this would be a Good Thing. We'll see what happens.


Did you bootstrap a Ruby parser in Ruby into the image? Or did you take another approach?


I know nothing about the internals of any of the Ruby VMs so this may be a totally stupid question ...

How hard would it be to 'marshal out' the entire running Ruby VM to a binary image and load it back in again, the way the Smalltalk VMs seem to do? Or to add the ability to marshal out objects as a memory image of some sort, so they can be marshalled out and in really efficiently (i.e., stuff a true Ruby object into Memcached instead of a string representation of it)?


You should be able to marshal out everything in just about any VM. The reason why Ruby and the JVM don't do this is because it just hasn't been done yet. The whole point of VMs is that they are virtual machines. That virtual model is just data. The VM is just another program. So saving state, then restarting later from that state is quite doable.
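
A rough illustration of "state is just data", in Python rather than Ruby or Smalltalk: the standard `pickle` module can dump an object graph (cycles included) to bytes and rebuild it later. This covers only application objects, not a whole VM with its stacks and threads, and the `snapshot`/`restore` helpers are made-up names for the sketch.

```python
import pickle


def snapshot(root):
    """Serialize everything reachable from root into one blob of bytes."""
    return pickle.dumps(root)


def restore(blob):
    """Rebuild the object graph from a blob produced by snapshot()."""
    return pickle.loads(blob)


# A small object graph with a cycle: the project points back at its account.
account = {"name": "example", "projects": []}
project = {"title": "launch", "owner": account}
account["projects"].append(project)

blob = snapshot(account)   # could be written to a file and read back later
revived = restore(blob)
```

The revived graph is a separate copy, and the back-reference from the project to its account survives the round trip.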

Image-based development is a very different style, however. Lots of people are in love with their command lines and text editors.


So does that mean that if I had a Dabble DB instance that had many megabytes of data in it, you need enough RAM for the Dabble code plus all my data to be in memory at once (and so on for each customer), or can you do some lazy loading magic to only load part of my data?


So on for each customer who is active at any given time, yes. We do some lazy loading magic but not much.


Can you expand more on what makes that different from loading them in from swap?


For one thing: the second you have a full GC, you're going to need to swap in the entire object memory, dirty those pages, and then swap them back out. It doesn't take a very large image before this brings your machine to its knees.


Of course, that's only true if you're using a totally flat heap. I don't see any reason why you couldn't be a little more clever about using a hybrid GC approach (generational + thread-local allocation pools, for example) to minimize the size of the dataset that had to be paged in or out in a given sweep.
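
As a back-of-the-envelope sketch in Python (all sizes below are assumptions, not measurements, and the model ignores the old-generation roots a real minor collection also has to scan), compare how many pages a full sweep touches with how many a nursery-only sweep touches:

```python
PAGE_SIZE = 4096    # bytes per page (assumed)
OBJECT_SIZE = 64    # bytes per object (assumed, uniform)


def pages_touched(object_count):
    """Pages visited when scanning object_count contiguously stored objects."""
    total_bytes = object_count * OBJECT_SIZE
    return -(-total_bytes // PAGE_SIZE)  # ceiling division


old_generation = 10_000_000  # long-lived objects, mostly paged out (assumed)
nursery = 50_000             # recently allocated objects (assumed)

full_gc_pages = pages_touched(old_generation + nursery)
minor_gc_pages = pages_touched(nursery)

print(full_gc_pages)   # 157032
print(minor_gc_pages)  # 782
```

Under these made-up numbers the nursery sweep touches about 200 times fewer pages, which is the effect a generational scheme is after.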

RScheme and other persistent language implementations have certainly been able to use the underlying OS paging mechanism pretty efficiently, and the performance numbers for the Varnish HTTP cache seem to suggest that it can work well for some kinds of web application workloads, too.


How does Gemstone get around the garbage collection issue?


There was some talk about letting multiple images of VisualWorks Smalltalk share memory. That way, you could put the Class Library and all that jazz into shared memory, which would be shared by multiple instances of the VM and image which would hold runtime state and stuff like the account data you mention. If you then wrote a facility to save the shared and non-shared stuff separately and in two different files, then you'd have it.

Also, there was a variant of Smalltalk built in the days just before and after the ParcPlace/Digitalk merger that was designed to be usable for writing quick and dirty admin apps. The minimal image was tiny. Something like 45k.

In any case, there doesn't need to be any barrier to non-image based development in such a virtual machine. Just about any virtual machine has an image in memory. Image based development is about being able to code in the debugger -- about having absolute runtime power over all aspects of the development environment. If you don't need 100% of that, then you don't need image-based development in any virtual machine.

(Because Smalltalk went whole-hog into having this absolute runtime power over everything, even the dev environment, the image became a strange loop, and could not be rebuilt from source due to these chicken-egg paradoxes. But since you already have an image, it doesn't matter. Example: nil is an instance of UndefinedObject, a subclass of Object, whose superclass is nil.)
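
For comparison only, Python happens to have a very similar loop, which can be checked from any interpreter. This is an analogy in a different language, not Smalltalk code:

```python
none_class = type(None)           # NoneType, the class of None
parent = none_class.__bases__[0]  # the only base class of NoneType
top = parent.__base__             # the base of the root class

print(none_class.__name__)  # NoneType
print(parent is object)     # True
print(top is None)          # True
```

So None is an instance of NoneType, a subclass of object, whose base is None: the same circle as nil, UndefinedObject and Object.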


Great example about nil.


Actually, it's a so-so example. You'd have to build just a little into the VM to get around it. (All of the classes that are in the loop and nil)

Another one is MetaClass. Class and MetaClass are both (eventual) subclasses of Behavior. But the Class of Class is an instance of MetaClass. (I'm confused as to why the Rubinius people have MetaClass as a subclass of Class.)
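
A loosely comparable loop exists in Python, shown here only as an analogy. Python has a single built-in metaclass, `type`, rather than Smalltalk's per-class metaclasses, so the shape is simpler, and `Widget` is just a throwaway example class:

```python
class Widget:
    pass


print(type(Widget) is type)      # True: a class is an instance of the metaclass
print(type(type) is type)        # True: the metaclass is an instance of itself
print(issubclass(type, object))  # True: the metaclass inherits from the root class
print(isinstance(object, type))  # True: the root class is an instance of the metaclass
```

Neither `type` nor `object` can be defined first in ordinary source code, so the interpreter has to build both in by hand, which is the same kind of bootstrap problem.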


> Does the image approach mean that Smalltalk is not really suitable for quick and dirty sys admin type scripts that you would use Perl or Ruby for?

No, it's not, but that's what we have Ruby for. :-)

> would it be technically feasible to have a different Smalltalk image for every single account that is loaded only when that account is being used, with basically all the account data inside the image?

I think that that is how DabbleDB works today. They have a different image for every account.

Avi's post from March about how Gemstone works has a good overview of things: http://www.avibryant.com/2008/03/ive-had-a-numbe.html


I've spoken to Avi and that's exactly how DabbleDB works. He front-ends the system with Apache and uses a rewrite map to launch an external Ruby script that checks to see if the image for that request is running. If it is, it returns the port it's running on; if it's not, it launches the image, records what port it's on, and proxies the request to it. Images time out after X amount of time and shut down on their own.
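
The lookup logic described above might be sketched like this. It is in Python rather than Ruby, it is not Dabble DB's actual script, and every name in it (`ImageRegistry`, `port_for`, the `launch` callback, the port numbers) is hypothetical. The real images shut themselves down; here the registry simply forgets an image once the same idle timeout has passed.

```python
import time


class ImageRegistry:
    """Tracks which per-account image is running and on which port."""

    def __init__(self, launch, idle_timeout=600.0, clock=time.monotonic):
        self._launch = launch          # hypothetical: starts an image, returns its port
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._running = {}             # account -> (port, time of last request)

    def port_for(self, account):
        """Return the port of the account's image, launching it if needed."""
        self.expire_idle()
        entry = self._running.get(account)
        port = self._launch(account) if entry is None else entry[0]
        self._running[account] = (port, self._clock())
        return port

    def expire_idle(self):
        """Forget images that have been idle longer than the timeout."""
        now = self._clock()
        for account, (_port, last_used) in list(self._running.items()):
            if now - last_used > self._idle_timeout:
                del self._running[account]


# Usage with a fake launcher and a fake clock, so nothing is really started.
launched = []


def fake_launch(account):
    launched.append(account)
    return 9000 + len(launched)


now = [0.0]
registry = ImageRegistry(fake_launch, idle_timeout=60.0, clock=lambda: now[0])
p1 = registry.port_for("acme")  # not running yet: launches the image
p2 = registry.port_for("acme")  # already running: same port
now[0] = 120.0                  # two minutes pass with no requests
p3 = registry.port_for("acme")  # timed out in the meantime: launches again
```

A front-end proxy would then forward the request to whatever port `port_for` returns.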

Each customer gets his own image, though the customer's data is stored outside the image using an image segment. This lets them upgrade the Dabble image without messing with the customer's data. When an image fires up, it loads up any image segments into it.

Basecamp is also exactly the kind of app that fits well with this architecture. Lots of small independent databases.

It's not that different from how Hacker News uses load-table to store the customer data in files and loads everything into RAM when the web server starts up. Both systems are basically homebrew object databases that work very well for the limited load placed on them. Gemstone is the granddaddy ODB that can handle anything you can throw at it.
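
The "files on disk, everything in RAM" pattern can be sketched in a few lines of Python. This is not Arc's load-table or the actual HN code; `FileBackedTable`, the one-JSON-file-per-record layout, and the sample record are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class FileBackedTable:
    """Keeps every record in memory and writes each one through to its own file."""

    def __init__(self, directory):
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)
        # Load every record into memory once, at startup.
        self.records = {
            path.stem: json.loads(path.read_text())
            for path in self._dir.glob("*.json")
        }

    def save(self, key, record):
        # Update the in-memory copy, then persist just this one record.
        self.records[key] = record
        (self._dir / f"{key}.json").write_text(json.dumps(record))


with tempfile.TemporaryDirectory() as tmp:
    table = FileBackedTable(tmp)
    table.save("alice", {"karma": 7})
    reloaded = FileBackedTable(tmp)  # simulates a server restart
    karma_after_restart = reloaded.records["alice"]["karma"]
```

Reads never touch the disk after startup, which is why this works well until the data outgrows one machine's memory.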



As an undercover Java guy at RailsConf and someone who worked with Gemstone/J, I am starting to think that Gemstone has finally found the passionate niche that it's always been looking to find: Rails and Ruby. The only thing Rails coders dislike more than Java is SQL.


"The only thing Rails coders dislike more than Java is SQL"

Familiarity breeds contempt, doesn't it?



