
What's wrong with the Viaweb/Arc/HackerNews/Mailinator approach of just using in-memory data structures (hashtables, linked lists) and then journaling out changes to the filesystem as records that are read in on startup? It's incredibly simple and blindingly fast as long as you stay on one server, and you can get several thousand QPS of capacity on that one server (vs. like 10 with a Django/Rails + SQL database solution).
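For concreteness, here's a minimal sketch of the journaling idea in Python (the names and log format are mine, not Arc's or Mailinator's): everything lives in a dict, every mutation is appended to a log file as a JSON line, and the dict is rebuilt by replaying the log on startup.

  import json, os

  class JournaledStore:
      def __init__(self, path='journal.log'):       # hypothetical file name
          self.path = path
          self.items = {}
          if os.path.exists(path):
              with open(path) as f:
                  for line in f:                     # replay the journal on startup
                      self._apply(json.loads(line))
          self._log = open(path, 'a')

      def _apply(self, rec):
          if rec['op'] == 'put':
              self.items[rec['key']] = rec['value']
          elif rec['op'] == 'del':
              self.items.pop(rec['key'], None)

      def put(self, key, value):
          rec = {'op': 'put', 'key': key, 'value': value}
          self._log.write(json.dumps(rec) + '\n')
          self._log.flush()                          # os.fsync() too if you care about crash safety
          self._apply(rec)

      def get(self, key):
          return self.items.get(key)

Reads never touch the disk, which is where the speed comes from; the obvious costs are startup time and the requirement that the whole dataset fit in RAM.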

Another highly underrated solution is using MySQL/Postgres as a key-value store. Just create one table for each entity type, with the primary key as the key and a JSON or protobuf blob as the value. You're using completely battle-tested software, you've got bindings in basically every language, you're doing basically the same work (at the same speed) as the NoSQL solutions, but you have a lot more flexibility to add additional indices and can rely on far more pre-existing functionality than you could with a MongoDB or CouchDB solution.
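The pattern is only a few lines; a rough sketch (using sqlite3 so it runs anywhere, but with Postgres you'd use a JSONB column and the shape is identical):

  import json, sqlite3

  db = sqlite3.connect(':memory:')
  db.execute('CREATE TABLE users (id TEXT PRIMARY KEY, data TEXT NOT NULL)')

  def put_user(user_id, user):
      db.execute('INSERT OR REPLACE INTO users VALUES (?, ?)',
                 (user_id, json.dumps(user)))

  def get_user(user_id):
      row = db.execute('SELECT data FROM users WHERE id = ?',
                       (user_id,)).fetchone()
      return json.loads(row[0]) if row else None

  put_user('u1', {'name': 'alice', 'karma': 42})
  print(get_user('u1'))

  # Adding an index later is ordinary SQL; with a Postgres JSONB column, e.g.
  #   CREATE INDEX ON users ((data->>'name'));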




> What's wrong with the Viaweb/Arc/HackerNews/Mailinator approach of just using in-memory data structures

https://news.ycombinator.com/x?fnid=cjVXpi8HxVR5TTze3bqSCa

  Unknown or expired link.
Oh I remember now...


That's caused by using closures to create dynamically generated "callbacks" on the server, not by keeping data structures in RAM. If you ask for some old item that's not in memory, it just gets lazily loaded.


Sure, you have full permalink support, but why do you have to rely on closures to do pagination?

My guess: because by relying on in-memory data structures you can't do what any half-assed PHP forum does: ad hoc queries.


I suspect he doesn't have to rely on closures to do pagination: they're a programming convenience that means you don't have to do things like think about what state persists between pages.

Anything you can do with SQL you can do with in-memory data structures. If you're interested, I'll be happy to take any SQL query and convert it to some Python list comprehensions on arrays of dicts.
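For instance, with made-up data, a filter-plus-join translates roughly like this:

  # SELECT u.name, p.title FROM users u
  # JOIN posts p ON p.user_id = u.id
  # WHERE p.score > 10
  users = [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]
  posts = [{'user_id': 1, 'title': 'Show HN', 'score': 15},
           {'user_id': 2, 'title': 'Ask HN', 'score': 3}]

  by_id = {u['id']: u for u in users}
  result = [(by_id[p['user_id']]['name'], p['title'])
            for p in posts if p['score'] > 10]
  # [('alice', 'Show HN')]

Aggregations fall out the same way with collections.Counter or itertools.groupby.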


> Anything you can do with SQL you can do with in-memory data structures.

Sure, but unless you also do some indexing manually, you can't really query your whole dataset once it starts to become too big.
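To be concrete, the manual indexing in question is usually just a second dict maintained alongside the primary one; a rough sketch with a made-up schema:

  posts_by_id = {}        # primary store: id -> post
  posts_by_author = {}    # secondary index: author -> list of post ids

  def add_post(post):
      posts_by_id[post['id']] = post
      posts_by_author.setdefault(post['author'], []).append(post['id'])

  def posts_for(author):  # O(k) lookup instead of scanning every post
      return [posts_by_id[i] for i in posts_by_author.get(author, [])]

The annoyance isn't writing that, it's keeping every such index in sync by hand as the write paths multiply.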


Some statements like self-joins become relatively compact in SQL though...

BTW, do you miss Java's more advanced structures (say MultiSet) when programming in Python/Go?


Pretty rarely, at least in Python. I don't miss MultiSet, because Python has that (collections.Counter). Ditto LinkedHashMap (collections.OrderedDict). Those are the two "extended" collections that I most often use. I do occasionally miss having balanced binary trees, since sometimes it's useful to have an associative container with a defined iteration order, but sorted(dict) is usually good enough where performance is critical. And Python's heapq module is a bit harder to use than Java's PriorityQueue, but all the functionality is there.
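For illustration, the stdlib equivalents mentioned above:

  from collections import Counter, OrderedDict
  import heapq

  Counter('abracadabra')['a']          # multiset-style counting -> 5
  OrderedDict([('x', 1), ('y', 2)])    # insertion-ordered mapping
  h = [5, 1, 3]
  heapq.heapify(h)
  heapq.heappop(h)                     # min-heap on a plain list -> 1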

I think I'd miss these a bit more in Go because the built-in datatypes are privileged in some of the language statements, but I haven't written enough Go code to really feel their absence.


Because it's much cleaner and more powerful if the thing that generates the next page is a closure rather than just an index. Among other things it lets you show each user a different set of items (depending on whether they have showdead turned on for example).
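Roughly, the shape is (a Python sketch of the idea, not the actual Arc code): the fnid in a "More" link is just a key into a table of closures, and each closure already knows which items come next for this particular user.

  import secrets

  fnids = {}   # fnid -> closure; entries expire, hence "Unknown or expired link."

  def more_link(items, start, user):
      def next_page():
          return [i for i in items[start:start + 30]
                  if not i['dead'] or user['showdead']]
      fnid = secrets.token_urlsafe(16)
      fnids[fnid] = next_page
      return '/x?fnid=' + fnid

  def handle(fnid):   # what serves GET /x?fnid=...
      f = fnids.get(fnid)
      return f() if f else 'Unknown or expired link.'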


> It's incredibly simple and blindingly fast as long as you stay on one server and you can get several thousand QPS of capacity on that one server (vs. like 10 with a Django/Rails + SQL database solution).

Wait, what? Even if vertical scaling was a good idea, scaling is far from the only reason you should have more than one server for anything serious.


Isn't this thread about non-serious use? Pretty much everything I see here is about how MongoDB is only suitable for prototypes, how it doesn't even guarantee writes, how they just want something quick & dirty to build a MVP with. The parent poster asked for something to replace MongoDB with - if the use-case is prototypes and "web scale" startups that don't have users or a product yet, I think a single server with in-memory data structures is a perfectly adequate starting point.

If you do get to the point where you need some redundancy (and don't yet need to scale horizontally), you can proxy all writes to a second server running the same codebase, have it update its in-memory data structures in the background, and hot-swap it over if the master dies.


I would say that an MVP should be written in such a way that you don't have to waste time rewriting it from scratch once the concept is validated. If you're writing an MVP to be disposable, you aren't necessarily launching it on a real server with persistent storage anyway; more likely Heroku, or at the very least AWS. But in either case you're well equipped to do the right thing from the outset rather than being forced to totally rewrite your app to enable a real architecture later on.


You will have to rewrite anyway. Multiple times. If you pick an RDBMS you will have to rewrite it to scale, if you pick MongoDB you will have to rewrite it for reliability, and if you pick Heroku or App Engine you will have to rewrite it to avoid paying them a good chunk of your profits.

That's probably the biggest surprise I learned from working in a fast-growing, well-functioning engineering organization. The half-life of code in a market that's actively growing and changing is roughly one year, i.e. 50% of the code you write now will have been removed a year from now. And attempts to optimize for problems you're going to have in a year, rather than the ones you have now, actively make things worse, because you inevitably have a different product direction in a year, and baking in last year's speculative assumptions just means there's more code to work around.


If it takes you a year from persisting serialized data on the hard drive of your one server to using a real data store, you're fucked either way. As low as that half-life might be, that's no reason to make deliberately short-sighted engineering decisions to make it even worse, especially when all the quick and easy ways of shipping an MVP effectively preclude that strategy. You're gonna go through all the effort of shipping your MVP to a real server but you're not going to go through the effort of setting up a database? Are you kidding me? Setting up Heroku with shared Postgres is not only much quicker to ship to, but it gives you a software and data architecture that you can much more easily improve in the future.


You understand that Hacker News uses precisely this persistence strategy (in-memory data structures, with persistent state written to the filesystem on the hard disk of the server), and has been running that way for going on 6 years now?

You also understand that most of the advice easily accessible on the Internet comes from people trying to sell you something, and so they have a vested interest in you adding many layers into your software stack that you don't need?

If you work in an actual engineering organization that has a clue what they're doing, mmap() is your best friend, and the more layers you can cut out of the stack, the better off you are.
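For what it's worth, cutting layers with mmap() is also only a couple of lines; a minimal read-only sketch in Python (the file name is made up, and the file has to exist and be non-empty):

  import mmap

  with open('data.bin', 'rb') as f:
      buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      header = buf[:16]       # slices read straight out of the OS page cache
      buf.close()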


If anything it's this fantasy of vertical scaling that's perpetuated by "people trying to sell you something". If you're going to go with "Hacker News does it, therefore it's okay", I guess that means it's sensible for any web app to use a Lisp dialect of their own invention implemented on top of Scheme, for URLs to be generated pseudorandomly and time out, and so forth.


As of MySQL 5.6, MySQL also supports the key-value use case via a memcached API.

https://blogs.oracle.com/MySQL/entry/nosql_memcached_api_for...


> What's wrong with the Viaweb/Arc/HackerNews/Mailinator approach of just using in-memory data structures (hashtables, linked lists) and then journaling out changes to the filesystem as records that are read in on startup?

That works for some things. However, it's no more a foolproof magical solution than MySQL or MongoDB or Cassandra or Oracle or... It just has different tradeoffs (non-primary-key queries will tend to be a problem, you'll have to build your own replication, sharding will be a problem, etc.).


Well, that's of course true. All engineering systems face trade-offs.

The nice thing about doing the dead simple solutions first is that they give you time to focus on the things all startups have to do (getting users, building product) and only fall down at the things that very few startups have the luxury of needing to deal with (scaling, fault tolerance, reporting, alternative views of data).

Throughout the lifetime of my first startup, I was obsessed with the question of "What are we going to do when we need to scale?" It failed because it had a daily userbase measured in the dozens. Then I went to Google to learn how to scale things. And it turned out the biggest lesson I learned at Google was not how to scale things (though I did learn that too), but that you shouldn't scale things, not until you need to. Because the process of designing for scale slows you down significantly, and makes it much harder to develop a system that's usable and performs well under small workloads. Google products take forever to launch, because they have to scale to millions of users from day 1. As a result, their product decisions are very often questionable in early versions. Most startups don't have the luxury of Google's brand name and billions in cash to tide them over that learning process, and need to hit the ground running.

Focus on the problems you have, not the problems you hope to have in the future.


Leaving aside the issue of scalability (generally, by the time you find out that you need to scale up, it's already almost too late if you haven't been keeping the ability to scale in the back of your mind all along), there are other reasons you don't necessarily want to commit to a solution that makes it difficult to use more than one machine; availability is the obvious one.


I think most people who have never worked for a very fast-growing company grossly underestimate the number of rewrites that it will require anyway. You are not committing to a solution that makes it difficult to use more than one machine; you are trying to get to the point where you need to replace that architecture. Pretty much all your other architectural choices will be bad ones at that point anyways.



