Scaling WebSockets beyond 23K connections
I already wrote about how I reached a 23K ceiling with my WebSocket server. For some reason, I was absolutely unable to have more connections than this magic number, even with different configurations.
Now I’m rolling up my sleeves and taking another stab at this. Let’s break the 23K ceiling!
WebSocket Scaling Bottlenecks
There are two main bottlenecks when scaling WebSockets: “file descriptors” and “CPU/Memory”.
What are File Descriptors, and why are they relevant to WebSockets?
People have told me something about “File Descriptors” being a bottleneck when scaling WebSockets.
A file descriptor is an integer (1000, 1001, 1002, etc.) that represents an open file (or socket, pipe, etc.) on UNIX systems.
This is relevant to WebSockets because of TCP connections: each WebSocket connection runs over a TCP connection, and each TCP connection uses a single file descriptor.
You can never have more WebSocket connections than there are file descriptors.
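As a sanity check, you can watch the server process’s file descriptor count grow as connections come in. On Linux, a rough sketch of how you could check it from inside a Bun/Node process:

```typescript
import { readdirSync } from "node:fs";

// On Linux, /proc/self/fd has one entry per open file descriptor
// (sockets included), so this roughly tracks the number of open connections.
const openFds = readdirSync("/proc/self/fd").length;
console.log(`open file descriptors: ${openFds}`);
```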
However, is this really usually a bottleneck?
I’ve read in places that the OS limit on the max number of file descriptors is generally 256-1024 by default. But on a newly deployed droplet, running the command ulimit -n gives me a number of more than 1M. In my case, I don’t think this has been a bottleneck, but it’s something to keep in mind, for sure.
Note: I run most of my apps through Docker. Thankfully, it seems like Docker inherits the limit from the host. (I just tried running docker run --rm ubuntu:18.04 bash -c "ulimit -n".)
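To double-check what limit the app process itself actually sees (inside or outside Docker), one option is to shell out from the process. A small sketch:

```typescript
import { execSync } from "node:child_process";

// Print the soft limit on open file descriptors as seen by this process.
// `ulimit` is a shell builtin, so it needs to run through a shell.
const limit = execSync("ulimit -n", { shell: "/bin/sh" }).toString().trim();
console.log(`fd limit for this process: ${limit}`);
```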
Are TCP connection limits relevant? (no)
In addition to the “file descriptor” piece of the puzzle, are there other constraints on opening lots of TCP connections?
I often see the number 65536 thrown around. This is the number of ports defined by the TCP protocol (port numbers are 16-bit, so 0-65535). You can expose port :3000 and port :65535, but never :65536 or :100000 (not possible).
Each TCP connection has a unique (source-ip, source-port, dst-ip, dst-port) tuple. Usually, the server has a fixed (dst-ip, dst-port), and the client has a fixed source-ip. This means the only variable we can change is source-port, which gives a limit of 65536 connections for every source-ip. (In practice it’s more like 64000-ish, since some ports are reserved and some are privileged.)
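To make the 4-tuple concrete, here’s a tiny illustration (just for intuition; the addresses are placeholders):

```typescript
// A TCP connection is uniquely identified by this 4-tuple.
type TcpConnKey = {
  srcIp: string;   // fixed for a single client machine
  srcPort: number; // 0-65535: the only part that can vary for that client
  dstIp: string;   // fixed: the server's IP
  dstPort: number; // fixed: the server's listening port, e.g. 443
};

// Two connections from the same client to the same server can only differ
// in srcPort, so one client IP tops out at roughly 64K connections to one server.
const a: TcpConnKey = { srcIp: "203.0.113.9", srcPort: 50001, dstIp: "198.51.100.7", dstPort: 443 };
const b: TcpConnKey = { srcIp: "203.0.113.9", srcPort: 50002, dstIp: "198.51.100.7", dstPort: 443 };
```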
However, it is important to understand that this is not a limit on how many incoming WebSocket connections a server can accept; it’s a limit on how many outgoing connections a single client can have with the same server.
In other words, this limitation is not relevant in my context.
CPU/Memory
Keeping many WebSocket connections open and alive requires memory, CPU, and network bandwidth.
I should keep an eye on CPU and Memory graphs when doing the tests to see if those are causing any issue.
Load testing different setups
For the load tests, I set up a mechanism that makes real visitors on our website open WebSocket connections to a URL. Since the site has a lot of traffic, this should be a good way to create many connections from real clients.
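On the client side, the rough idea is a snippet like the one below running in visitors’ browsers (illustrative only; the URL and the keep-alive interval are placeholders):

```typescript
// Runs in a visitor's browser: open one WebSocket to the load-test endpoint
// and keep it alive. The URL is a placeholder, not the real endpoint.
const ws = new WebSocket("wss://loadtest.example.com/ws");

ws.addEventListener("open", () => {
  // Send a tiny keep-alive message every 30 seconds so idle connections
  // don't get closed by intermediaries.
  setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) ws.send("ping");
  }, 30_000);
});
```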
First, I’ll try running the test on an Ubuntu droplet.
Ubuntu Digital Ocean (4 GB Memory / 2 AMD vCPUs)
The test goes well, and I get around 16K connections. Then suddenly, all connections reset to 0. I wonder if this is the Bun script crashing or something. Nothing visible in logs, though.
Could it be that it ran out of memory? Looking at the graphs, it doesn’t look like we hit 100%, but memory usage is indeed pretty high, so it’s possible.
Ubuntu Digital Ocean (16 GB Memory / 8 AMD vCPUs)
To test if the specs in the last test were a limiting factor, I try a machine with 4x higher specs.
This time, I hit a limit at around 26K.
Looking at the metrics, it actually looks like we hit a limit on the CPU this time.
Now I try the same test, but this time using a much simpler WebSocket echo server, which should use fewer resources. Watching the connection count increase:
This setup went up to around 53K before it flattened out. CPU and memory usage did increase and eventually hit the limit. PS: The inflection point you see at 35K is because I adjusted the load test config.
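For reference, a minimal echo server along these lines can be just a few lines with Bun’s built-in WebSocket support. A sketch (not necessarily the exact code used in the test):

```typescript
// Minimal Bun WebSocket echo server: upgrade every request and echo back
// whatever the client sends.
const server = Bun.serve({
  port: 3000,
  fetch(req, server) {
    if (server.upgrade(req)) return; // handled as a WebSocket from here on
    return new Response("Expected a WebSocket connection", { status: 400 });
  },
  websocket: {
    message(ws, message) {
      ws.send(message); // echo
    },
  },
});

console.log(`echo server listening on :${server.port}`);
```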
At this point it is pretty clear to me that the limits we are seeing are not some magic WebSocket or network-layer limit, but rather a limit of the server’s resources (CPU and memory).
Ubuntu Digital Ocean (32 GB Memory / 16 AMD CPUs)
Trying one more time on a beefed-up $500/month Digital Ocean instance. Getting to around the same level…
Scaling Horizontally
Last time, when I hit the 23K ceiling, I tried to scale horizontally, only to see the 23K connections distribute over the two instances. This led me to think there was something else going on in the network layer.
However, now that I have run more tests, this seems weird. The 23K limit seems to be about CPU and memory, which should scale without a problem across multiple instances.
I’ll take another look at the 2-instance setup and see if I can replicate it.
render.com (4 CPU 16 GB — 1 instance)
It’s kind of hard to read the metrics. Given the tests I did with Ubuntu above, I doubt the memory is actually at 0.5%.
render.com (4 CPU 16 GB — 2 and 4 instances)
Now I’ll scale the render deployment up to 2 instances.
Aaand again, we go up to 22K something… Super weird…
Ok, let’s try putting the instances up to 4. There is something weird going on with the graph, but it for sure only goes up to about 23K aggregated.
I did not get to the bottom of this yet.
Ideas for other things to test:
- What about two droplets behind cloudflare load balancing?
- What about some other platform like DO app scaling?
- What about multiple instances of the bun app in a beefed up server?