
Boosting Server Stability with E2E & Load Testing for Reliable Deployment

– Overview

Hello, I’m a lead developer at MeFriend.ai. Our server is mainly implemented in Python, but we encountered stability issues when a large amount of traffic arrived at once.

Because we use various logging and analytics services such as Metabase and Mixpanel, we could investigate where these bottlenecks occur. Through API log tracking, latency checks, and other analyses, we identified that our chat server struggled to handle high traffic.

MeFriend.ai – Click Here!

In this post, I’ll walk you through our process of identifying and resolving these bottlenecks in MeFriend’s chat server, using E2E testing and load testing to simulate real conditions and monitor performance.

Now, let’s look at how we dealt with this issue!


– Steps

  • E2E Testing Flow
  • Define Testing Traffic
  • Define Testing Goals
  • Results

– E2E Testing Flow

There are three main types of tests:

  • Unit Tests
  • Integration Tests
  • E2E Tests

Although we conducted all of these tests during development, unexpected issues arose under heavy traffic. Since resolving these issues quickly was crucial, we decided to combine E2E testing with load testing.


E2E (End-to-End) Testing: A testing method that simulates the real service flow, replicating how an actual client would use the service.


By creating E2E test code that follows the exact flow of real user interactions, we could monitor all incoming logs across the servers in real-time, allowing us to identify issues fast.

The diagram above shows the main flow of our chat server, broken down into key parts. On the left is the flow for users who chat without logging in, and on the right is the flow for users who are logged in.

Empirically, we noticed that the chatroom entry process for logged-in users was less stable. Therefore, we decided to separate these two cases and test them as closely as possible to real-world conditions.

There are various load-testing tools, such as Grafana k6 and Apache JMeter.

However, for speed and familiarity, we wrote our own WebSocket clients in Python for quick and efficient testing.
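
To give a concrete feel for this, here is a minimal sketch of such a client. The websockets package, the endpoint URL, and the message formats are assumptions for illustration only; our real test client follows the actual production flow, covering both the guest and logged-in cases and with more detailed assertions.

```python
import asyncio
import json
import time

import websockets  # third-party: pip install websockets

# Hypothetical endpoint and message formats, for illustration only.
CHAT_WS_URL = "wss://chat.example.com/ws"


async def chat_session(user_id: int, token: str | None = None) -> float:
    """Simulate one user entering a chatroom and sending a message.

    Returns the end-to-end latency of the first reply, in seconds.
    """
    start = time.monotonic()
    async with websockets.connect(CHAT_WS_URL) as ws:
        # Enter the chatroom; logged-in users attach their auth token, guests send None.
        await ws.send(json.dumps({"type": "enter", "user_id": user_id, "token": token}))
        await ws.recv()  # wait for the "entered" acknowledgment

        # Send a chat message and wait for the first reply.
        await ws.send(json.dumps({"type": "chat", "text": "hello!"}))
        await ws.recv()
    return time.monotonic() - start


async def main() -> None:
    # A tiny smoke run: 10 concurrent guest sessions.
    latencies = await asyncio.gather(*(chat_session(i) for i in range(10)))
    print(f"avg first-reply latency: {sum(latencies) / len(latencies):.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

Because the whole flow lives in one coroutine, a driver can open as many concurrent sessions as needed, which is what the traffic schedules below rely on.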


– Define Testing Traffic

We had been considering moving our chat server from Python to a more stable and resource-efficient language, because we believe Python, while powerful, may not be ideal for this kind of deployment (this is our team’s personal opinion).

To explore alternatives, we prepared an equivalent implementation in Golang, which is very fast and resource-efficient.


If our tests show that Golang handles traffic better, we’re ready to migrate our chat server to Golang immediately.


We decided to gradually increase traffic, starting from a lower level and scaling up to a load higher than our usual service capacity.

As resource usage increases, our Kubernetes cluster scales automatically according to our autoscaling policy, so we focused on the server’s processing capacity under load.


If you’re interested in how MeFriend automated server scaling and built a flexible, cost-effective infrastructure, feel free to check out the article below.

Cost Effective and Flexible Provisioning

Additionally, we varied the number of concurrent requests and intervals to cover a wide range of scenarios, using configurations such as:

  • 1 request every 10 seconds, total 10 requests
  • 1 request per second, total 50 requests
  • 10 requests every 10 seconds, total 100 requests
  • 20 requests every 10 seconds, total 100 requests
  • 50 requests every 10 seconds, total 500 requests
  • 100 requests every 10 seconds, total 1000 requests
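
To illustrate how such a schedule can be driven, the sketch below encodes each configuration as plain data and launches batches on a timer. It assumes the chat_session coroutine from the earlier sketch; the chat_client module name is made up for this example.

```python
import asyncio

from chat_client import chat_session  # E2E coroutine from the earlier sketch (hypothetical module)

# Each scenario: (requests per batch, seconds between batches, total requests),
# mirroring the configurations listed above.
SCENARIOS = [
    (1, 10, 10),
    (1, 1, 50),
    (10, 10, 100),
    (20, 10, 100),
    (50, 10, 500),
    (100, 10, 1000),
]


async def run_scenario(batch_size: int, interval: float, total: int) -> tuple[int, int]:
    """Launch `batch_size` sessions every `interval` seconds until `total`
    sessions have been started, then wait for all of them and count
    successes versus failures."""
    tasks: list[asyncio.Task] = []
    while len(tasks) < total:
        batch = min(batch_size, total - len(tasks))
        tasks += [asyncio.create_task(chat_session(len(tasks) + i)) for i in range(batch)]
        if len(tasks) < total:
            await asyncio.sleep(interval)
    results = await asyncio.gather(*tasks, return_exceptions=True)
    failed = sum(isinstance(r, Exception) for r in results)
    return total - failed, failed


async def main() -> None:
    for batch_size, interval, total in SCENARIOS:
        ok, failed = await run_scenario(batch_size, interval, total)
        print(f"{batch_size} req / {interval}s x {total}: {ok} ok, {failed} failed")


if __name__ == "__main__":
    asyncio.run(main())
```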

– Define Testing Goals

Our testing goals were:

  1. Ensure the processing and waiting queues function correctly.
  2. Confirm that all requests are handled successfully.

To maximize cost efficiency, we introduced multiple queues due to the limitations of inference resources. This setup is designed to handle high user traffic without service disruptions, ensuring all requests are managed appropriately through the waiting and processing queues.


Ultimately, even when many requests arrive simultaneously,

  • users who arrive first start chatting
  • users who arrive later wait in the queue
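
As a rough sketch only (our production implementation is more involved), the behaviour the tests need to verify boils down to a fixed pool of processing slots in front of a FIFO waiting queue. The slot count and timings below are illustrative assumptions.

```python
import asyncio

PROCESSING_SLOTS = 4  # illustrative cap on concurrent inference slots


async def worker(waiting: "asyncio.Queue[int]", name: str) -> None:
    """Pull the next waiting user and run their chat turn."""
    while True:
        user_id = await waiting.get()
        print(f"{name}: user {user_id} starts chatting")
        await asyncio.sleep(1.0)  # stand-in for model inference
        waiting.task_done()


async def main() -> None:
    waiting: asyncio.Queue[int] = asyncio.Queue()

    # The "processing queue" is a fixed pool of workers, one per slot;
    # everyone else sits in the FIFO waiting queue until a slot frees up.
    workers = [asyncio.create_task(worker(waiting, f"slot-{i}"))
               for i in range(PROCESSING_SLOTS)]

    for user_id in range(20):   # 20 users arrive at once
        waiting.put_nowait(user_id)

    await waiting.join()        # return once every user has been served
    for w in workers:
        w.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```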

– Results

We set up automated tests, with each run’s results sent directly to a Discord webhook.
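
Posting a run’s summary to a Discord webhook is a single HTTP call. A minimal sketch, with a placeholder webhook URL and an invented message format, looks roughly like this:

```python
import requests  # third-party: pip install requests

# Placeholder URL; the real webhook URL is kept in our secrets store.
DISCORD_WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"


def report_to_discord(scenario: str, succeeded: int, failed: int) -> None:
    """Send a one-line summary of a load-test run to our Discord channel."""
    total = succeeded + failed
    rate = 100 * succeeded / total if total else 0.0
    content = (f"load test `{scenario}`: "
               f"{succeeded}/{total} succeeded ({rate:.1f}%), {failed} failed")
    resp = requests.post(DISCORD_WEBHOOK_URL, json={"content": content}, timeout=10)
    resp.raise_for_status()


# Example: report_to_discord("50 req / 10 s x 500", succeeded=495, failed=5)
```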

However, starting from the test with 50 requests every 20 seconds, the success rate of the Python server dropped sharply.

Since this load was lower than our actual service traffic, the test results indicated that the Python server was highly unstable under such conditions.

Since the server could not handle even this level of traffic, we didn’t consider it ready to be deployed.

On the other hand, the Golang server handled even higher levels of traffic successfully.


Analysis

The Python test results showed failed cases only for authenticated users, leading us to suspect an issue within the authentication logic specific to these users.

We discovered that some authentication API requests weren’t reaching their destination within our cluster. Despite multiple adjustments to the Python logic, we still couldn’t determine why the Python server failed under heavy traffic.

Ultimately, Golang showed very successful test results, so we decided to migrate our chat server to Golang.

While switching to Golang has resolved the issue for now, we’re committed to pinpointing the exact reason these problems occurred in the socket server specifically.


After switching to Golang, the server began operating very stably, and resource usage dropped to less than 10% of what it had been with the previous setup.

We’re incredibly grateful to our users who continued to trust and use our service despite the initial instability with the previous chat server.

We remain committed to improving the service and delivering an even better experience for our users. Thank you for reading this detailed post!


If you want to see how MeFriend.ai saves costs on its image server, click here! ↓

AWS Cost Optimization by Reducing Image Cost