Cacheman - a fast distributed hashtable for Windows


Cacheman is a fast, distributed hashtable for Windows, implemented purely in managed code. This is a personal side-project which I started and abandoned several months ago and picked up a few weekends ago. If you just want the binaries, scroll to the bottom of the post to get the raw, early bits (and I do mean raw and early!)

It all started with my fascination with memcached and how well it works for several heavy-traffic sites. I set off to create a bare-bones hashtable with a set of requirements in mind:

  • Really fast and really scalable. I didn't have a firm numerical target in place, but I wanted to do at least a few thousand requests per second. I wound up doing a lot more than that (details in the how-stuff-works section below).
  • Client agnostic, i.e., anyone should be able to look at the protocol and implement a client in their language/environment of choice. Though Cacheman can't work with current memcached clients, I've deliberately kept the wire protocol similar so that, with some work, I can make it compatible.
  • Ops-friendly. Put simply, it shouldn't be a pain in the rear to deploy and maintain on hundreds of production machines. James Hamilton has written extensively on this subject - and I would be overjoyed if I get around to doing half the things he talks about.
  • Windows and .NET friendly. I wanted something that felt 'native' to Windows. I don't have a good justification for this - just that I work on Windows and managed code a lot and happen to like them :)

Note that memcached meets a lot of the requirements above. My biggest reason for starting from scratch was to just see 'whether I could do it' :-).

I'll walk you through a little demonstration before digging into how it works and why certain design decisions were made.

A quick tour

If you grab the bits below, you'll find three binaries - the server, a console/client, and a library that you can link into your own apps.

  • Server:
    Run CachemanServer.exe with the /? argument to see the various arguments you can pass. By default, it'll try to bind to the machine's first IP address, listen on port 16180, and cap the items it holds at 100 MB of memory. You need to run it with administrative privileges the first time, as it installs a few perf counters on first launch.

[screenshot: CachemanServer.exe running]

  • Console:
    Since it was a pain to write client code to test out the server, I wrote a bare-bones console. Run CachemanConsole.exe and, at the prompt, type 'connect the-ip-address-the-server-is-listening-on'. From there, you can do basic get/set/delete operations as well as run a few homegrown stress tests. Note that the times in the screenshot below are not representative of server performance - the 5 ms on the first set is due more to the client being run for the first time than to the server.

[screenshot: a CachemanConsole session]

  • Client library:
    The console is actually a thin layer over CachemanAPI.dll, which is where the meat of the client lives. Frankly, I haven't had much time to work on the client API, and it needs quite a bit of work and polish before it can be used in a production system.

Dare Obasanjo has written extensively on what typical code using memcached from .NET looks like, and the same pattern applies here too. Here's some sample code using the client library. Ignore the IPEndPoint stuff - the next release will let you put this in a nice config file and never deal with IP addresses and ports in code.

[screenshot: sample code using the client library]
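The gist of it is the standard cache-aside pattern. The CachemanClient type and its Get/Set/Delete methods below are stand-in names, not necessarily what CachemanAPI.dll exposes - treat this as a sketch of the shape of the code rather than the literal API.

    // Sketch only: 'CachemanClient' and its method names are assumptions.
    using System.Net;

    class Demo
    {
        static void Main()
        {
            // Point the client at the server endpoints (16180 is the default port).
            // The next release should move this into a config file.
            var client = new CachemanClient(
                new IPEndPoint(IPAddress.Parse("10.0.0.1"), 16180),
                new IPEndPoint(IPAddress.Parse("10.0.0.2"), 16180));

            // Classic cache-aside: check the cache, fall back to the real store,
            // then populate the cache for the next caller.
            string value = (string)client.Get("user:42");
            if (value == null)
            {
                value = LoadFromDatabase("user:42");   // the expensive path
                client.Set("user:42", value);
            }

            client.Delete("user:42");
        }

        static string LoadFromDatabase(string key) { return "..."; }
    }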

How stuff works

  • In terms of general architecture, Cacheman is similar to memcached as opposed to other distributed hash tables (e.g., EHCache from the Java world is an in-memory cache with async replication to the server). The server listens on a socket, parses commands from the client and responds appropriately. The client itself is pretty dumb - it's just a thin layer over some networking code. This means that Cacheman is neither a read-through nor a write-through cache at the moment.

  • When you do a GET/SET/DELETE operation for a specified key, the client first needs to figure out which Cacheman server instance to talk to. To do that, it takes a quick FNV hash of the key and mods that by the number of servers to pick the one to talk to (there's a small sketch of this below).

The disadvantage of this approach is that when you add or remove a server node, the cache needs to be repopulated. The fix for this is consistent hashing, which I haven't gotten around to implementing (and from the little I know, there are not too many implementations of it out there).

The other choice is to use a central 'master cache server' which stores a lookup table of server nodes and which the clients can poll. I'm not sure I like this much - it seems like a lot of added complexity and brings its own set of problems.
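To make the hash-and-mod step concrete, here's a minimal sketch of the selection logic. I'm showing the 32-bit FNV-1a variant purely for illustration - the exact variant and bit width Cacheman uses may differ.

    using System.Text;

    static class NodeSelector
    {
        // 32-bit FNV-1a (the variant chosen here is an assumption).
        static uint Fnv1aHash(byte[] data)
        {
            const uint offsetBasis = 2166136261;
            const uint prime = 16777619;
            uint hash = offsetBasis;
            foreach (byte b in data)
            {
                hash ^= b;
                hash = unchecked(hash * prime);   // overflow is expected and fine
            }
            return hash;
        }

        // Index of the server node that owns this key.
        public static int PickServer(string key, int serverCount)
        {
            uint hash = Fnv1aHash(Encoding.UTF8.GetBytes(key));
            return (int)(hash % (uint)serverCount);
        }
    }

The modulo at the end is exactly why changing the server count reshuffles almost every key onto a different node - which is the repopulation problem consistent hashing would solve.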

  • The client talks to the server using a simple network protocol. My protocol looks similar to memcached's text protocol but is stripped down and simplified. I looked at a bunch of options, and my choice was influenced by a few things:
    • Binary protocols are a pain to debug - text protocols are simple and clean, especially if you are up at 3 AM poring over a Wireshark trace, IMHO :)
    • The protocol I have is small and lightweight enough that it doesn't carry much parsing or network overhead.
    • I wanted to make my server work with the huge set of memcached clients out there today. This is going to be a challenge since the internal implementation is vastly different.

The networking layer also takes care not to do small writes. This, along with the protocol design, lets me get a perf win by setting TCP_NODELAY on my sockets.
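To illustrate both points (this isn't Cacheman's actual networking code), the idea is simply to assemble each reply into one buffer so there's a single send per response, which is what makes disabling Nagle's algorithm safe:

    using System.Net.Sockets;
    using System.Text;

    static class ResponseWriter
    {
        public static void SendReply(Socket socket, string header, byte[] payload)
        {
            socket.NoDelay = true;   // TCP_NODELAY: don't wait around to coalesce small packets

            // One buffer, one Send() - no trickle of tiny writes on the wire.
            byte[] headerBytes = Encoding.ASCII.GetBytes(header + "\r\n");
            byte[] buffer = new byte[headerBytes.Length + payload.Length];
            headerBytes.CopyTo(buffer, 0);
            payload.CopyTo(buffer, headerBytes.Length);

            socket.Send(buffer);
        }
    }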

  • The most interesting part of all this was the implementation of the server. Being new to scalable servers, I wrote a few naive implementations which just didn't scale. My first implementation dedicated a thread to each socket. This scaled quite poorly and just led to a lot of thrashing threads.

The model I have now is built around NT's I/O completion ports. Unlike the previous model, there is an M:N relationship between sockets and threads (rather than 1:1).

When data is queued to the completion port (from the client), one of the waiting threads is woken up and goes to work on it. Once it's done processing (either handling the request or deciding that it needs more data), it goes back to waiting by calling GetQueuedCompletionStatus internally. This lack of thread affinity for client sockets lets me multiplex many clients across a few active threads.
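For the unfamiliar, here's roughly what that model looks like in managed code - the CLR's async socket calls ride on I/O completion ports and run their callbacks on the thread pool's completion-port threads. This is an illustrative sketch, not the actual server source:

    using System;
    using System.Net.Sockets;

    class Connection
    {
        readonly Socket _socket;
        readonly byte[] _buffer = new byte[8192];

        public Connection(Socket socket)
        {
            _socket = socket;
            PostReceive();
        }

        void PostReceive()
        {
            // Queues an overlapped read against the completion port and returns
            // immediately - no thread sits blocked on this socket.
            _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None,
                                 OnReceive, null);
        }

        void OnReceive(IAsyncResult ar)
        {
            // Runs on whichever completion-port thread picked up the completion.
            int bytesRead = _socket.EndReceive(ar);
            if (bytesRead == 0) { _socket.Close(); return; }   // client went away

            // Handle the request, or decide we need more data...
            ProcessBytes(_buffer, bytesRead);

            // ...then go back to waiting on the completion port.
            PostReceive();
        }

        void ProcessBytes(byte[] data, int count) { /* parse and respond */ }
    }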

On my home machine (2.4 GHz Intel Core 2 with 2 GB RAM), I can push the server to around 16,000 requests per second (clients and server running on the same machine). I believe I can get it to around 25K requests per second on the same box with a bit of work, but anything above that is going to mean some serious work.

Cacheman comes with a nice set of perf counters so you can get at these numbers at any time.

[screenshots: Cacheman's perf counters in perfmon]
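If you'd rather pull these numbers from code than watch them in perfmon, System.Diagnostics.PerformanceCounter will read them. The category and counter names below are my placeholders - check perfmon for the exact names Cacheman registers:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class CounterWatcher
    {
        static void Main()
        {
            // "Cacheman" / "Requests/sec" are guessed names, not confirmed ones.
            var requestsPerSec = new PerformanceCounter("Cacheman", "Requests/sec", true);

            while (true)
            {
                Console.WriteLine("{0:F0} requests/sec", requestsPerSec.NextValue());
                Thread.Sleep(1000);
            }
        }
    }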

However, my code isn't entirely async. When the server sends data back to the client, it blocks on the send. This was a deliberate choice - whenever I did perf tests with async sends, I saw a noticeable dip in speed (measured in requests processed per second). My theory is that the hit comes from the context switch between the 'request-processing' thread and the 'socket send' thread.

  • Internally, the cache items are stored in one giant dictionary. I plan on moving to a better model in the future because, right now, I'm forced to take a lock over the entire store for any destructive action. Cache expiration is pretty naive at the moment - you have a choice of either LRU or LFU to expire items from the cache (apart from the items that get thrown out for living past their lifetime). I plan on adding more cache expiration algorithms in the future and looking into a generational model. But the current model is good enough to ensure that you don't run out of memory on your server :)
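To give a flavor of what that means (a simplified picture, not the actual internals): one dictionary, one lock around anything destructive, and a recency list so LRU eviction knows what to throw out first.

    using System.Collections.Generic;

    class CacheStore
    {
        class Entry
        {
            public string Key;
            public byte[] Value;
        }

        readonly object _lock = new object();
        readonly Dictionary<string, LinkedListNode<Entry>> _items =
            new Dictionary<string, LinkedListNode<Entry>>();
        readonly LinkedList<Entry> _lru = new LinkedList<Entry>();

        public void Set(string key, byte[] value)
        {
            lock (_lock)   // every destructive operation takes the store-wide lock
            {
                LinkedListNode<Entry> node;
                if (_items.TryGetValue(key, out node))
                    _lru.Remove(node);
                node = _lru.AddFirst(new Entry { Key = key, Value = value });
                _items[key] = node;
            }
        }

        public byte[] Get(string key)
        {
            lock (_lock)
            {
                LinkedListNode<Entry> node;
                if (!_items.TryGetValue(key, out node)) return null;
                _lru.Remove(node);       // bump to most-recently-used
                _lru.AddFirst(node);
                return node.Value.Value;
            }
        }

        // Called when the store goes over its memory budget.
        public void EvictOldest()
        {
            lock (_lock)
            {
                if (_lru.Count == 0) return;
                _items.Remove(_lru.Last.Value.Key);
                _lru.RemoveLast();
            }
        }
    }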

The Bits

There is a lot of work left to be done (a better client, wrapping the server into an NT service, making things more ops-friendly, lock-free internal data structures, etc.) but I wanted to try the 'release early, release often' approach for once. Things are quite busy at work (some kick-a** Popfly features in the pipeline, as usual) so only expect bug fixes over the weekend :)

Be warned - these are really early, really raw bits. Stuff will crash or not work. The next version will change everything. Demons will be pulled out of your nose. Use at your own risk! Have fun and send feedback through the comments or to mail@sriramkrishnan.com or sriramk@microsoft.com

Acknowledgements

I don't usually have an 'acknowledgements' section but I just had to have one this time. A shout out to my friends who saw very little of me these past weekends. :) And to Brad Fitzpatrick and the rest of the Memcached folks for their awesome work.
