Load-balancing Socket.io without Redis

Back story

We recently updated our multiplayer HTML5 game Jetstream Riders with a few new features, namely bots and sounds. The game is written using the ImpactJS library and uses SmartFox2 for the multiplayer server – or at least it did. While updating the game to support bots, we discussed replacing SmartFox with a NodeJS equivalent to improve the replication and stability of the multiplayer server as we moved our deployment infrastructure into Docker containers.

The actual multiplayer logic was relatively simple to reimplement in Node. On socket connection, we place users into a lobby, where every 30 seconds they are told what the current level is. Once a user hits the Play button, they leave the lobby and join a game room. When 10 players or 8 seconds of idling (whichever comes first) have passed, the race begins and standard multiplayer IO occurs: we receive client data at arbitrary rates and broadcast all player data at fixed 100ms intervals.
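For illustration, here is a minimal sketch of that flow; the port, the room names and the currentLevel, findRoom, storePlayerData and playerData helpers are all hypothetical stand-ins for the real game logic:

var io = require('socket.io')(8000); // assumed port

io.on('connection', function (socket) {
    // everyone starts in the lobby
    socket.join('lobby');

    // Play moves the user from the lobby into a game room
    socket.on('play', function () {
        socket.leave('lobby');
        socket.join(findRoom()); // fills rooms up to 10 players
    });

    // client data arrives at arbitrary rates
    socket.on('playerUpdate', function (data) {
        storePlayerData(socket, data);
    });
});

// the lobby is told the current level every 30 seconds
setInterval(function () {
    io.to('lobby').emit('level', currentLevel());
}, 30000);

// all player data is broadcast at fixed 100ms intervals
setInterval(function () {
    io.to('race').emit('players', playerData()); // 'race' room name assumed
}, 100);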

Before release it was calculated that this new single-threaded multiplayer server would need two to three servers to cope with peak loads, so three VMs were brought online to run the new app. To get users from the same location playing together, load-balancing was done by hashing the client IP to target a particular VM. This was great under high load, and people could see their friends alongside them, but during quiet times the load-balancing worked against the “fun” element; rooms felt like ghost towns with few or no other humans around. This was a wet blanket on what had been an exciting game update, and something had to be done.
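To give a flavour of the idea (our actual load-balancer configuration isn't shown here, and the server list is made up):

var crypto = require('crypto');

var servers = ['10.0.0.1', '10.0.0.2', '10.0.0.3']; // hypothetical VM addresses

// the same client IP always hashes to the same VM, so players from
// one location (a school, an office, a household) share a server
function pickServer(clientIp) {
    var digest = crypto.createHash('md5').update(clientIp).digest();
    return servers[digest.readUInt32BE(0) % servers.length];
}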

NodeJS Clustering

The obvious solution was to swap in a different adaptor for Socket.IO – however, we had no working knowledge of Redis, nor the infrastructure in place to create Docker images for it. A solution that was more hands-on, but quicker to deliver, was needed. We did have Zookeeper available to us, but after a couple of hours trying to plug in the Socket.IO Kafka library we had to abandon it.

Step in: Node’s cluster module. While it wouldn’t solve load-balancing between multiple machines, it would allow us to split work across multiple processes on a single machine and, as mentioned, was only ever going to be a short-term solution until a centralised data store could be implemented. Here’s what we ended up creating:

[Animated GIF of the new cluster in action]


So, what’s occurring…

When the process is first started, a child worker is spawned; it’s the children that create Socket.IO instances and listen for Websocket connections. While the worker is under capacity, things work just as in the single-process app. It’s only when we hit the worker’s concurrent-user cap that things change. First, the worker informs the master it has reached its cap; the master then spawns a new child worker, and once that emits its “online” event, the previous worker is disconnected from the cluster. The old worker will continue to operate, serving its existing sockets, but will no longer be handed new connections. This ensures that the Websocket port is only exposed to a single worker during the handshake process.
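A minimal sketch of that handoff, with an assumed per-worker cap and hypothetical helper names:

var cluster = require('cluster');

var MAX_USERS = 100; // assumed concurrent-user cap per worker

if (cluster.isMaster) {
    spawnWorker();
} else {
    startSocketServer();
}

function spawnWorker(previous) {
    var worker = cluster.fork();

    // once the new worker is online, retire the full one; it keeps
    // serving its existing sockets but takes no new connections
    worker.on('online', function () {
        if (previous) previous.disconnect();
    });

    // the worker tells the master when it hits its cap
    worker.on('message', function (msg) {
        if (msg === 'at-capacity') spawnWorker(worker);
    });
}

function startSocketServer() {
    var io = require('socket.io')(8000); // assumed port
    var users = 0;

    io.on('connection', function (socket) {
        if (++users === MAX_USERS) process.send('at-capacity');
        socket.on('disconnect', function () { users--; });
    });
}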

We ran into some issues initially because Socket.IO was using XHR polling as well as Websockets. The worker would receive Socket.IO’s “connection” event on an XHR GET request, then a second or two later the transport upgrade would occur – but by then a new worker had been spawned, so the upgrade would sporadically fail. We had to turn off the polling transport on the client:

var socket = io(protocol + '://' + host + ':' + port, {
  transports: ['websocket']
});

Testing

We ran this through some load tests and pumped several hundred concurrent users through the server with only a couple of failed requests. There is a small window, between a new worker being spawned and the existing worker being disconnected, where a Websocket could be initialised and fail in the handshake scenario described above, but Socket.IO is robust enough to reattempt connections on failure.

All told, this was around two days’ work and a good stand-in for short- to medium-term usage. Our plan going forward is to provision a Redis container and then decide between running multiple single-process multiplayer servers or continuing to use the multi-process app.

Student Experience Walkthrough

The challenge

To create a step-by-step tutorial to onboard new users of our new Student Experience.

The design goal

[Images: walkthrough portrait design and walkthrough landscape design]

The design called for screenshots of the website to be overlaid with an orange filter, with a white ring contracting inwards onto the focal point. It had to work in both landscape and portrait modes, and from a maintenance point of view we didn’t want to have to re-screenshot the website every time the Student Experience changed.

Concept #1

The team had originally envisaged using SVG filters overlaid onto the live site. However, after some prototyping this became a non-starter.

Concept #2

The idea was to use something similar to Google’s bug reporting tool – load the application in the browser but off-screen, screenshot it with an as-yet-unidentified third-party tool, and then apply the overlay effects using the Canvas API.

This is where work began, and it wasn’t long until we had a working prototype using HTML2Canvas, a pretty powerful tool that reads the DOM and renders those nodes into its own Canvas; essentially a browser inside a browser. There were some sticking points, each needing its own massaging to get working, but the screen-shotting phase was “done”.
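For reference, capturing a node with HTML2Canvas looks roughly like this (using its Promise-based API; the element selector and the next step are hypothetical):

html2canvas(document.querySelector('#student-experience')).then(function (canvas) {
    // canvas now contains HTML2Canvas's own rendering of the DOM node
    beginHighlighting(canvas); // hypothetical next step
});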

Highlighting

With an image (a Canvas element) as the starting point, it was time to recreate the design above, without animation. At Mangahigh we like to develop with an MVP mindset so that we aren’t using up valuable time on the bells and whistles. This was, after all, an exploratory project, so if animation couldn’t be done, at least we had solid static images as a releasable fallback.

The failure of the SVG filters was down to the design requiring two different effects to be layered to create the orange overlay. First, all colour had to be removed, so every pixel was converted to greyscale; only then was the orange overlay applied.

[Image: walkthrough-filters]

In the image above you can see the difference made by applying a greyscale filter before the orange overlay; without it, colours still bleed through.
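A sketch of the two-step tint (the exact shade of orange and the use of the 'multiply' composite mode are assumptions):

function tintScreenshot(screenshot) {
    var canvas = document.createElement('canvas'),
        ctx = canvas.getContext('2d');

    canvas.width = screenshot.width;
    canvas.height = screenshot.height;
    ctx.drawImage(screenshot, 0, 0);

    // step 1: strip all colour so nothing bleeds through the tint
    var image = ctx.getImageData(0, 0, canvas.width, canvas.height),
        data = image.data;

    for (var i = 0; i < data.length; i += 4) {
        var grey = 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
        data[i] = data[i + 1] = data[i + 2] = grey;
    }

    ctx.putImageData(image, 0, 0);

    // step 2: lay the orange over the greyscale image
    ctx.globalCompositeOperation = 'multiply';
    ctx.fillStyle = '#f90';
    ctx.fillRect(0, 0, canvas.width, canvas.height);
    ctx.globalCompositeOperation = 'source-over';

    return canvas;
}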

Next came the highlighted circle. The trick was to make a duplicate of the screenshot and keep it free from filters. A circular clipping path was cut at the desired coordinates and the resulting image was pasted on top of the orange-tinted canvas:

[Image: walkthrough-mask]
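One possible implementation of a drawFocalPoint-style helper (the same signature appears in the animation snippet further down; the stroke styling is an assumption):

function drawFocalPoint(colourCanvas, ctx, x, y, radius) {
    ctx.save();
    ctx.beginPath();
    ctx.arc(x, y, radius, 0, Math.PI * 2);
    ctx.clip();                        // only draw inside the circle
    ctx.drawImage(colourCanvas, 0, 0); // the unfiltered copy shows through
    ctx.restore();

    // the white ring around the circle
    ctx.beginPath();
    ctx.arc(x, y, radius, 0, Math.PI * 2);
    ctx.lineWidth = 4;                 // assumed stroke width
    ctx.strokeStyle = '#fff';
    ctx.stroke();
}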

The final touches were to add a stroke to the circle and encapsulate the screenshot inside a phone bezel. We demoed the process and the response from the rest of the team was great.

Great success?

No. Not even close.

While it performed amazingly well in Chrome on an i7 with 8GB of RAM, it took roughly one second to create each screenshot on a Nexus 5, and performance got worse the older the hardware. Chrome and Safari on iOS devices had rendering issues, with HTML2Canvas failing to render pseudo-elements. And on high-DPI screens, the image was blurry.

To the Node-mobile!

It became apparent we couldn’t take screenshots on the device; there were too many variables out of our control. That left either using static images or getting the backend to generate them – specifically with PhantomJS, which handily has its own WebKit renderer. So we dropped HTML2Canvas and the chunk of code we’d written to automate the screenshotting process, and created a Phantom script to navigate around the site and generate the images.
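A stripped-down sketch of such a script (the URL, viewport, zoom factor and output path are all made up):

var page = require('webpage').create();

page.viewportSize = { width: 320, height: 568 }; // assumed device size
page.zoomFactor = 2;                             // assumed pixel ratio

page.open('http://localhost:8080/student', function (status) {
    if (status !== 'success') {
        phantom.exit(1);
        return;
    }

    // give the app a moment to finish rendering before the shot
    setTimeout(function () {
        page.render('screenshots/student-2x.png');
        phantom.exit();
    }, 500);
});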

Blowing it up

Previously we had screenshotted the device at its native resolution and scaled the image down, but our beautiful, crisp, vector-based Student Experience looked horrible once Canvas got its hands on it. Our solution was to scale the whole site up by window.devicePixelRatio and down by a scaling factor (the screenshot occupies roughly 80% of the screen area). But since we had no access to browser information when the script ran on the server, we created a set of images for each device pixel ratio from 1 to 3.
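On the client, the right set can then be chosen at runtime; the file-naming scheme here is hypothetical:

// pick the pre-rendered set closest to this device's pixel ratio (capped at 3)
var dpr = Math.min(Math.round(window.devicePixelRatio || 1), 3);

var screenshot = new Image();
screenshot.src = 'screenshots/student-' + dpr + 'x.png';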

When compositing these screenshots we worked outwards: the screenshot was left at its large resolution, the bezel was scaled up to fit, and the canvas was created at the resulting size. With a simple bit of CSS, the canvas element was then scaled back down by the original device pixel ratio:

canvas {
    width: 320px
}

Now we had ultra-crisp screenshots, ones which could be regenerated whenever we wanted by running a Grunt task; no need to maintain a folder of hand-made images. If we choose to in the future, we can generate these screenshots as part of our build process. The effort of generating the images is offloaded from the user’s device, so we don’t have to deal with browser inconsistencies and under-powered hardware.
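The task itself can be as simple as shelling out to PhantomJS; a rough sketch, with made-up file names:

module.exports = function (grunt) {
    grunt.registerTask('screenshots', function () {
        var done = this.async();

        require('child_process').exec('phantomjs tasks/screenshot.js', function (err) {
            done(!err);
        });
    });
};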

Animation time

Adding the animation was trivial. By duplicating the original screenshot – one copy for tinting, one for the circular focal point – it was simply a matter of leveraging window.requestAnimationFrame to draw ever-decreasing circles and paste them onto the destination canvas:

var ctx = outputCanvas.getContext('2d'),
    now = new Date().getTime(),
    delta = now - (lastFrame || now);

lastFrame = now;
animationRunningTime += delta;

// shrink the radius, but never past the focal point's end radius
radius -= easeInOutQuart(animationRunningTime, 1, delta, duration);
radius = Math.max(radius, focalPoint.radius);

// clear the last frame
drawStartingFrame(ctx);

// draw the tinted screenshot
ctx.drawImage(tintedCanvas, 0, 0);

// draw the masked colour circle on top
drawFocalPoint(colourCanvas, ctx, focalPoint.x, focalPoint.y, radius);

// queue another frame until the circle reaches its end radius
if (radius > focalPoint.radius) {
    window.requestAnimationFrame(animateFocalPoint.bind(this));
}

Final thoughts

There are some optimisations that could be made to the final code, but at the time of writing, the walkthrough is being prepped for release. We will soon find out if this is a success story. From a technical point of view, this was challenging but incredibly rewarding to work on – rewarding to investigate new technology and rewarding to overcome obstacles, which is the whole point of programming, isn’t it?