
Only one thread can access the data at any given time, so it seems like most of the things you'd expect to be guaranteed by a single thread still are. I found this comment particularly interesting:

   Unlike most databases the core data structure is the
   fastest part of the system. Most of the query time
   comes from parsing the REPL protocol and copying data
   to/from the network.
I wonder if anyone in the Redis ecosphere has explored a binary client-server protocol, something that could be parsed/compiled on the client and then executed without parsing on the server. If the above is really true, it seems like that might offer even more perf gain than multithreading on the server.
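
(For context, the existing text protocol already length-prefixes every argument, so the "parsing" cost is mostly turning those strings into a command dispatch rather than tokenizing free-form text. A simple SET looks roughly like this on the wire; the key/value here are just illustrative.)

    *3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n     (request: an array of 3 bulk strings)
    +OK\r\n                                           (reply)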



From having played with/worked on profiling and optimizing Redis in the 2.6 timeframe, I can confirm that, at least for small/simple operations, this is true: the data structure access is a small fraction of the cost.

One related choice that Redis makes (or made at the time) is to rely extremely heavily on the malloc implementation, rather than doing work to manage its memory internally. Even a very trivial, naive free list provided a modest speed-up, for example.
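
By "naive free list" I mean something roughly like this (a hypothetical sketch, not the actual experiment): keep freed fixed-size blocks on a singly linked list and hand them back out before going to malloc again.

    #include <stdlib.h>

    /* Hypothetical sketch of a trivial free list for one fixed object size
     * (assumes the object is at least as large as a pointer). */
    typedef struct free_node { struct free_node *next; } free_node;
    static free_node *free_list = NULL;

    void *obj_alloc(size_t size) {
        if (free_list) {                     /* reuse a previously freed block */
            void *p = free_list;
            free_list = free_list->next;
            return p;
        }
        return malloc(size);                 /* otherwise fall back to malloc */
    }

    void obj_free(void *p) {
        free_node *n = p;                    /* push the block back on the list */
        n->next = free_list;
        free_list = n;
    }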

There are a lot of these choices in the code base, largely owing to maintainability concerns (though antirez can surely speak for himself). Given how easy it is for an otherwise uninitiated C programmer such as myself to hack on it, I struggle to disagree with the prioritization. :)


The excerpted comment in a format mobile readers can see without left/right scrolling:

"Unlike most databases the core data structure is the fastest part of the system. Most of the query time comes from parsing the REPL protocol and copying data to/from the network."


The human-readable/writable protocol is one of my favorite things about Redis, tbh.
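
You can debug against a live server with nothing but netcat, e.g. (assuming the key foo was set to "bar" earlier):

    $ nc localhost 6379
    PING
    +PONG
    GET foo
    $3
    bar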

I can see cases where a really optimized system could benefit from a binary protocol, but I suspect it'd be a loss for most people.


Why not just offer both?


That was my thinking as well, though taking a peek at the actual code suggests that there's a pretty deep expectation that the client is speaking strings, e.g. in code that handles the ZRANGE command[1] I see

    if (c->argc == 5 && !strcasecmp(c->argv[4]->ptr,"withscores"))
and a quick grep suggests that's a common pattern

    % grep argv src/*.c | grep -c -e 'str\(case\)*cmp'
    482
I guess this means someone would have to tackle creating an intermediate binary format first, rewriting the command handlers to expect that format, and then making client libraries that can produce the format. Perhaps still worth it in the end, but not trivial.
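
Sketching what such an intermediate binary format might look like (entirely hypothetical, not anything in the Redis tree): numeric opcodes and flag bits the handlers could switch on, instead of strings they strcasecmp at dispatch time.

    #include <stdint.h>

    /* Hypothetical intermediate binary command format -- not real Redis code. */
    typedef enum { CMD_GET = 1, CMD_SET, CMD_ZRANGE /* ... */ } cmd_opcode;

    typedef struct {
        uint16_t opcode;   /* e.g. CMD_ZRANGE instead of the string "zrange"    */
        uint16_t flags;    /* e.g. bit 0 = WITHSCORES instead of a strcasecmp() */
        uint32_t argc;     /* number of arguments that follow                   */
        /* arguments would follow as length-prefixed byte strings */
    } binary_cmd_header;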

[1] https://github.com/antirez/redis/blob/unstable/src/t_zset.c#...


Is this really "unlike most databases"? I remember MySQL posting profiling data years ago showing that for looking up by primary-key, 3/4 of the time was spent parsing SQL. (They went on to introduce support for querying with the Memcached protocol to address this)


That's really surprising if true, considering the SQL should only need to be parsed once.

    SELECT foo FROM Table WHERE key = @mykey;
Then you bind the parameter to whatever you're interested in.
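
With MySQL server-side prepared statements, the parse-once/execute-many flow looks roughly like this (statement name and key value are just placeholders):

    PREPARE get_foo FROM 'SELECT foo FROM `Table` WHERE `key` = ?';
    SET @mykey = 'some-key';
    EXECUTE get_foo USING @mykey;     -- re-executed without re-parsing the SQL
    DEALLOCATE PREPARE get_foo;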


Prepared statements are per-connection, and a lot of the time you want to use connections from a single pool that's shared across all your different queries, so you can't really use them.


Even with that, the SQL would only be parsed once per connection? So the costs should be de minimis, unless the benchmark was very short?


> Even with that, the SQL would be parsed once per connection?

In a webserver-like context it's once per query one way or another: the server process is stateless-ish between page loads, so each page load either opens a from-scratch connection or takes one from a pool. Even if you're pooling, you can't use prepared statements in practice: you can't leave a prepared statement on a connection you return to the pool (you'll eventually exhaust the database server's memory that way), and you'd have to resubmit the prepared statement every time you took a connection out of the pool anyway, because there's no way to know whether this connection has run this page's query already or not.

If you assume a page that's just displaying one database row, which is not the only use case but a common one, then each page load is one query and that query will have to be parsed for each page load, short of doing something like building a global set of all your application's queries and having your connection-pool logic initialise them for each connection.


In a database product I'm familiar with, the prepared statements are cached according to their content and those cached objects are shared between connections. Only if they fall out of the cache do they have to be re-parsed. I had assumed that's how all databases worked.

I'm somewhat surprised at the mechanism you're describing, but now that I read the documentation it does seem to be the case. I wonder if a small piece of middleware might be sufficient to replicate the behavior I'm describing on top of a connection pool, and whether that would be desirable.


See my post for essentially a "binary" interface: for my key-value store (which I wrote because I thought that writing the code would be faster than understanding Redis :-)), a client uses just standard TCP/IP sockets to send a byte array. The array is the serialization of an instance of a class. My key-value store receives the byte array and deserializes it to get a copy of the client's instance of the class. So, with the byte array, maybe the interface counts as "binary"? I'm unsure of the speed of de/serialization.



