Load-balancing Socket.io without Redis

Back story

We recently updated our multiplayer HTML5 game, Jetstream Riders, with a few new features, namely bots and sounds. The game is written using the ImpactJS library and uses SmartFox2 for the multiplayer server – or at least it did. While updating the game to support bots, there was discussion of replacing SmartFox with a NodeJS equivalent to improve the replication and stability of the multiplayer server as we moved our deployment infrastructure into Docker containers.

The actual multiplayer logic was relatively simple to reimplement in Node. On socket connection, we place users into a lobby, where they are informed every 30 seconds of what the current level is. Once users hit the Play button, they leave the lobby and join a game room. Once 10 players have joined or 8 seconds of idling have passed (whichever comes first), the race begins and standard multiplayer IO occurs: we receive client data at arbitrary rates and broadcast all player data at fixed 100ms intervals.
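As a rough sketch of that flow – room names, event names and the port here are illustrative assumptions, not our actual protocol:

var io = require('socket.io')(8000);
var players = {};

io.on('connection', function (socket) {
  socket.join('lobby'); // everyone starts in the lobby

  socket.on('play', function () {
    socket.leave('lobby');
    socket.join('game-room');
  });

  // Client data arrives at arbitrary rates; just record the latest.
  socket.on('player-data', function (data) {
    players[socket.id] = data;
  });

  socket.on('disconnect', function () {
    delete players[socket.id];
  });
});

// Broadcast everyone's latest state to the room at a fixed 100ms tick.
setInterval(function () {
  io.to('game-room').emit('tick', players);
}, 100);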

Before release it was calculated that this new single-threaded multiplayer server would need two to three machines to cope with peak loads, so three VMs were brought online to run the new app. To get users from the same location playing together, load-balancing was done by hashing the client IP to target a particular VM. This was great under high load – people could see their friends alongside them – but during quiet times the load-balancing worked against the “fun” element: rooms seemed to be ghost towns with few or no other humans around. This was a damp towel on what had been an exciting game update; something had to be done.

NodeJS Clustering

The obvious solution was to swap in a different adaptor for Socket.IO – however, we had no working knowledge of Redis, nor infrastructure in place to create Docker images for it. A more involved solution was needed, and on a tight timescale. We did have Zookeeper available to us, but after a couple of hours trying to plug in the Socket.IO Kafka library we had to abandon it.

Enter Node’s cluster module, which, while it wouldn’t solve load-balancing between multiple machines, would allow us to split work across multiple processes on a single machine and, as mentioned, was going to be a short-term solution until a centralised data store could be implemented. Here’s what we ended up creating:

[Animated GIF: workers being spawned and retired as each reaches capacity]


So, what’s occurring…

When the process is first started, a child worker is spawned; it’s the children that create Socket.IO instances and listen for Websocket connections. While the worker is under capacity, things occur as per the single-process app. It’s only when we hit the worker’s concurrent-user cap that things change. Firstly, the worker informs the master process it has reached its cap; the master then spawns a new child worker, and once that emits its “online” event, the previous worker is disconnected from the cluster. The old worker will continue to serve its existing connections but will no longer be handed new ones. This ensures that the Websocket port is only exposed to a single worker during the handshake process.
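A minimal sketch of that spawn-and-retire logic – the cap value and message name are assumptions, not our production code:

var cluster = require('cluster');
var USER_CAP = 1000; // assumed per-worker concurrent-user cap

if (cluster.isMaster) {
  var spawn = function () {
    var worker = cluster.fork();
    worker.on('message', function (msg) {
      if (msg === 'at-capacity') {
        // Bring up a replacement, then retire this worker once the new
        // one is online, so only one worker accepts new connections.
        var replacement = spawn();
        replacement.once('online', function () {
          worker.disconnect(); // existing sockets live on; no new ones arrive
        });
      }
    });
    return worker;
  };
  spawn();
} else {
  var server = require('http').createServer();
  var io = require('socket.io')(server);
  var connections = 0;

  io.on('connection', function (socket) {
    if (++connections === USER_CAP) {
      process.send('at-capacity'); // tell the master we’re full
    }
    socket.on('disconnect', function () { connections--; });
  });

  server.listen(8000);
}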

We ran into some issues initially when Socket.IO was using XHR polling as well as Websockets. The worker would receive Socket.IO’s “connection” event on an XHR GET request; then, a second or two later, the transport upgrade would occur – but by that point a new worker had been spawned, so the upgrade would sporadically fail. We had to turn off the polling transport on the client:

var socket = io(protocol + '://' + host + ':' + port, {
  transports: ['websocket'] // connect over Websocket only; no XHR polling
});
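
For completeness, the server can also refuse the polling transport; with the Socket.IO 1.x server, the engine.io transports option does this (the setup shown is an assumption, not our exact code):

var server = require('http').createServer();
var io = require('socket.io')(server, {
  transports: ['websocket'] // never offer XHR polling
});
server.listen(8000);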

Testing

We ran this through some load-tests, pumping several hundred concurrent users through the server with only a couple of failed requests. There is a small window between a new worker being spawned and the existing worker being disconnected in which a Websocket could be initialised and fail, per the handshake scenario mentioned above, but Socket.IO is robust enough to reattempt connections on failure.
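That retry behaviour is tunable on the client if you need it; the Socket.IO 1.x client exposes reconnection options along these lines (the values shown are illustrative):

var socket = io(protocol + '://' + host + ':' + port, {
  transports: ['websocket'],
  reconnectionAttempts: Infinity, // keep retrying forever
  reconnectionDelay: 500,         // first retry after 500ms...
  reconnectionDelayMax: 5000      // ...backing off to at most 5s
});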

All told, this was around two days’ work and a good stand-in for short-to-medium-term usage. Our plan going forward is to provision a Redis container and then decide between running multiple single-process multiplayer servers or continuing to use the multi-process app.

2 thoughts on “Load-balancing Socket.io without Redis”

  1. Hi Daniel,

    Thanks for the interesting post. I have a couple of questions:

    1) Which tool did you use for socket.io load-tests?

    2) I implemented a NodeJS chat application using socket.io, hosted as a web app in Azure. There we use the Azure load balancer (2 instances). When some load is generated (and eventually the second instance is used), we get “400 Bad Request” back at the client. For some of these requests I can see “transport=polling” in the Chrome dev tools, so I was wondering whether you had the same or a similar issue.

    Many thanks in advance,
    Francesco

  2. Hiya,

    So if memory serves, we wrote a custom client that emulated the game client. It was a piece of JS we ran in multiple browsers. It would connect to the server and, once placed into a game room and given the “game-start” message, each instance of the client would send dummy data packets at the usual rate. We would then spin up 50, 100, 150, etc, clients per browser and get them connecting almost at the same time. We had to use multiple browsers, as each has an internal limit on max connections. One thing we had to be mindful of was ensuring each client created its own web socket connection, rather than Socket.io sharing one: a netstat alerted us that even though 50 clients were running, only a few sockets were actually open.
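
    Roughly, each dummy client looked something like this – a from-memory sketch with illustrative names; the forceNew flag is what stops Socket.io sharing one underlying connection:

    function startDummyClient() {
      var socket = io(protocol + '://' + host + ':' + port, {
        transports: ['websocket'],
        forceNew: true // give every client its own connection, not a shared one
      });
      socket.on('game-start', function () {
        setInterval(function () {
          socket.emit('player-data', { x: 0, y: 0 }); // dummy packet at the usual rate
        }, 100);
      });
    }

    // Spin up 50 clients in this browser, connecting almost at the same time.
    for (var i = 0; i < 50; i++) {
      startDummyClient();
    }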

    Then we kept ramping it up to see what the maximum concurrent capacity was for a given box. When things started capping out we saw sockets start closing and errors in the Network panel.

    We added some logging on the server side which counted incoming connections and ensured it matched our browser-side count, then left it running overnight to make sure we didn’t have any memory leaks.

    As for your second issue, I don’t recall any 400s – you might want to check you are forcing a Websocket transport (disallowing polling from the browser). A telltale sign that things aren’t working is errors during the handshake, as part of it takes place on one process and the rest on another. I have no experience with Azure, so I’d recommend getting things working locally on 2 separate machines / VMs to get a working baseline.
