In my experience, most performance improvement tasks take the following form: “This request is too slow. We have to make it faster.” Sound familiar? In general, finding and removing the mistakes and nonsensical parts of the code means thoroughly examining the request itself and all related actions. On this occasion, however, we had to come up with a completely different approach.
The Rails app in question was a calendar app used by various groups of users to create events that repeat according to a variety of simple rules (e.g., “every third Thursday of each month”). The events contain some data and a list of participating members. Our primary performance improvement task was to perform a global analysis of our backend and improve it so that it could serve at least 200,000 active users on the same configuration. Our initial fact-finding was disheartening. How many active users did it support to begin with? We didn’t have a clue. Which methods were the slowest ones? We didn’t know. In the name of all that is good and holy, which element should be optimized first? We had no idea. Clearly, we were off to a stellar start.
One of the users was connected to a proxy server, so we used its request logs (which appeared after some finagling) as the scenario for our load testing. That scenario was transferred to JMeter, and we started stepping through Requests per Second (RpS) values to find the maximum the server could handle.
We made our first important discovery when we heard a great deal of loud cursing directed at the development team and the server itself. Remember: the server used for load testing must have no other users or processes apart from the load test itself. That way, you get more reliable data, and other team members can continue their work. So, after making apologies and moving to a separate server, we obtained a value that we used as our baseline: 6 RpS. As we discovered later, once we managed to read the JMeter output correctly, that value wasn’t correct.
We then turned to the server log, where Rails saves virtually all the information we needed (e.g., SQL requests and “view” rendering reports with precise timings for each step, as well as other useful information produced by our application and related gems). This log was our main resource in our valiant fight for better performance.
Our first task was to get rid of N+1 requests to the database. Because Active Record loads related objects implicitly, on first access, it’s incredibly easy to miss them. I’ll use a database of cars as an example to explain the problem of N+1 requests. Suppose we first load all cars that have an owner:
SELECT * FROM public.cars WHERE owner IS NOT NULL
If we then want to check that none of the wheels on each car is flat, every iteration over the cars triggers one more request:
SELECT * FROM public.wheels WHERE car_id = ? AND NOT broken
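To make the cost concrete, here is a minimal plain-Ruby sketch of the same cars-and-wheels example. It does not use Active Record: the `FakeDb` class, its methods, and the sample data are hypothetical stand-ins that only count how many "queries" each approach issues. In a Rails app, the second approach roughly corresponds to eager loading with something like `Car.includes(:wheels)`, where `Car` and `wheels` are assumed model and association names.

```ruby
# Hypothetical in-memory stand-in for a database; it only counts "queries".
class FakeDb
  attr_reader :query_count

  def initialize(cars, wheels)
    @cars = cars
    @wheels = wheels
    @query_count = 0
  end

  # SELECT * FROM cars WHERE owner IS NOT NULL
  def cars_with_owner
    @query_count += 1
    @cars.reject { |car| car[:owner].nil? }
  end

  # SELECT * FROM wheels WHERE car_id = ? AND NOT broken
  def intact_wheels_for(car_id)
    @query_count += 1
    @wheels.select { |w| w[:car_id] == car_id && !w[:broken] }
  end

  # SELECT * FROM wheels WHERE car_id IN (...) AND NOT broken
  def intact_wheels_for_all(car_ids)
    @query_count += 1
    @wheels.select { |w| car_ids.include?(w[:car_id]) && !w[:broken] }
  end
end

CARS = [
  { id: 1, owner: "Ann" },
  { id: 2, owner: "Bob" },
  { id: 3, owner: nil },
  { id: 4, owner: "Eve" }
].freeze

# Four wheels per car; the first wheel of car 2 is broken.
WHEELS = CARS.flat_map do |car|
  (1..4).map { |n| { car_id: car[:id], broken: car[:id] == 2 && n == 1 } }
end.freeze

# N+1: one query for the cars, then one more query per car.
def n_plus_one(db)
  db.cars_with_owner.to_h do |car|
    [car[:id], db.intact_wheels_for(car[:id]).size]
  end
end

# Eager loading: one query for the cars, one for all wheels, grouped in memory.
def eager(db)
  cars = db.cars_with_owner
  wheels = db.intact_wheels_for_all(cars.map { |c| c[:id] })
             .group_by { |w| w[:car_id] }
  cars.to_h { |car| [car[:id], wheels.fetch(car[:id], []).size] }
end

slow_db = FakeDb.new(CARS, WHEELS)
fast_db = FakeDb.new(CARS, WHEELS)
slow_result = n_plus_one(slow_db)
fast_result = eager(fast_db)
puts "N+1: #{slow_db.query_count} queries, eager: #{fast_db.query_count} queries"
```

Both approaches return the same counts of intact wheels per car, but the first issues one query per car on top of the initial one, while the second always issues two.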
There’s a special gem available for Ruby on Rails called bullet. It monitors database requests and adds information about detected N+1 requests to the application log, including which lines of code need to be corrected. So that was our first step, because it’s relatively fast and easy. Later, we noticed that the rendering of several views was taking between 500 and 1,000 milliseconds. The rendering was that slow because the views themselves triggered database requests. Of course, bullet can’t find such cases, but that didn’t stop us. We simply had to check that the model contained all related objects before sending it to rendering.
We were unable to fix one of the N+1 requests until we found that ActiveRecord::Base#dup does not duplicate related objects. In that case, we chose ActiveRecord::Associations::Preloader, which can take an array of records (here, the duplicates) and load all their related objects at once. Furthermore, it was the only way to avoid the N+1 request when the data is fetched via ActiveRecord::Base.find_by_sql, because that method simply ignores .includes and all other methods for loading related objects.
Many modern applications contain operations that follow the “fire-and-forget” principle (e.g., sending a push notification or an email whose success or failure does not affect the app’s behavior). Such operations should be moved to a background job, or at least to a different thread, to avoid blocking responses. In the project at hand, all emails were sent in the main thread, so moving them to sidekiq was a relatively complex task. That was the second important discovery we made: some things must be done the right way from the beginning. If sending emails is not an application’s main task, those emails should not block server responses. After setting up asynchronous email sending, we saved 800 to 1,000 milliseconds on each action that sends an email. (Here’s a great article about sending asynchronous email using sidekiq and deliver_later.)
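The “fire-and-forget” idea can be illustrated without Rails or sidekiq. The following is only a sketch: a plain Ruby thread and `Queue` stand in for the background job system, and `create_event`, `MAIL_QUEUE`, and `SENT` are hypothetical names invented for this example. The request handler only enqueues the email and returns; the slow delivery happens elsewhere.

```ruby
# Hypothetical mail queue and worker; a plain thread stands in for sidekiq here.
MAIL_QUEUE = Queue.new
SENT = Queue.new # thread-safe record of "delivered" emails

worker = Thread.new do
  # Queue#pop blocks until a job arrives; :stop is a sentinel chosen for this sketch.
  while (job = MAIL_QUEUE.pop) != :stop
    sleep 0.05 # pretend that delivery over SMTP is slow
    SENT << job
  end
end

# Hypothetical request handler: enqueue the email and respond immediately.
def create_event(title, recipient)
  MAIL_QUEUE << { to: recipient, subject: "New event: #{title}" }
  { status: 201, title: title }
end

response = create_event("Team sync", "ann@example.com")

# Shut the worker down cleanly so the script can finish.
MAIL_QUEUE << :stop
worker.join
puts "responded with #{response[:status]}, delivered #{SENT.size} email(s) in the background"
```

In a real Rails app the queue, the worker, and the retry logic are provided by the job backend, so the handler would only need the enqueueing call.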
The next stage was to check the CACHE records in the log. Within a single request, such a record shows that we fetched some data from the database and then asked the database for the same data again instead of reusing the result we already had. Unfortunately, I can’t say that we managed to get rid of all such cases. However, if you’re not planning a massive refactoring, this method can still win you up to 50 milliseconds.
The JMeter results table started to make more and more sense. The basic DELETE and PUT requests were fast. However, the GET requests, which involve relatively complex rules for data collection, were taking too much time. While it makes sense that more complex tasks take more time, we still wanted to reduce the time they took.
Lucky for us, GET requests are cacheable, so that’s what we started to do. For this purpose, we selected redis for several reasons. First, it was already installed in the system for sidekiq. Second, according to several articles we found online, redis works no worse than Memcached. Third, it’s very convenient to work with keys because “wildcard” symbols are supported.
To enable proper caching, we needed to make two things happen: create the cache, and delete it when the information stored in it is no longer current. The first part is pretty easy: before performing the complex and time-consuming operation to fetch the data, check whether a cached result already exists, and after performing the operation, put its result into the cache. The process of cache invalidation, however, is relatively complex and can be invoked in unexpected places.
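As a rough illustration of both halves (filling the cache and invalidating it), here is a plain-Ruby sketch. It makes several assumptions: `FakeCache` is a hypothetical in-memory stand-in for redis, the key layout `events:group:<id>:<month>` is invented for the example, and `expensive_lookup` is a placeholder for the slow data collection. With a real redis server, pattern-based deletion would typically mean scanning for the matching keys and then deleting them.

```ruby
# Hypothetical in-memory stand-in for redis; only what the sketch needs.
class FakeCache
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
  end

  # Delete every key matching a glob pattern such as "events:group:7:*".
  def delete_matching(pattern)
    @data.delete_if { |key, _| File.fnmatch(pattern, key) }
  end
end

class EventFeed
  attr_reader :expensive_calls

  def initialize(cache)
    @cache = cache
    @expensive_calls = 0
  end

  # Read path: check the cache first, compute and store on a miss.
  def for_group(group_id, month)
    key = "events:group:#{group_id}:#{month}"
    cached = @cache.get(key)
    return cached unless cached.nil?

    result = expensive_lookup(group_id, month)
    @cache.set(key, result)
    result
  end

  # Write path: any change to a group's events drops all of its cached months.
  def event_changed(group_id)
    @cache.delete_matching("events:group:#{group_id}:*")
  end

  private

  # Placeholder for the slow expansion of recurrence rules.
  def expensive_lookup(group_id, month)
    @expensive_calls += 1
    ["group #{group_id} events for #{month}"]
  end
end

feed = EventFeed.new(FakeCache.new)
feed.for_group(7, "2016-12") # miss: computed and cached
feed.for_group(7, "2016-12") # hit: served from the cache
feed.event_changed(7)        # invalidates every "events:group:7:*" key
feed.for_group(7, "2016-12") # miss again after invalidation
puts "expensive lookups: #{feed.expensive_calls}"
```

The hard part in practice is not this code but making sure `event_changed` (or its equivalent) is called from every place that can alter the cached data.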
I haven’t talked about “view” caching intentionally, even though the standard erb engine and jbuilder support it. Here’s the main problem: if we had only one entity in our response, everything would be relatively easy. We could build the cache key from the entity’s “id” and “updated_at” values. We would get something like this: “event_42_2016-12-21 03:36:24.000.” However, everything becomes more complex when we discover that the “event” representation contains users with their profiles, as well as several other entities related to “event.” An update to any of those entities means that the cache is no longer valid. All this makes the process of cache key creation pretty complex. Keeping it up-to-date when associations are deleted or added is simply impossible.
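A small sketch shows how quickly the key grows. The `Event`, `User`, and `Profile` structs below are hypothetical plain-Ruby stand-ins for the real models, and `composite_key` is just one possible scheme, not the one used in the project. Even this version still misses some cases, for example swapping one participant for another whose timestamp is older than the newest one.

```ruby
# Hypothetical plain-Ruby records; in the app these would be Active Record models.
Profile = Struct.new(:id, :updated_at)
User = Struct.new(:id, :updated_at, :profile)
Event = Struct.new(:id, :updated_at, :users)

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%L"

# Single-entity key, in the spirit of "event_42_<updated_at>".
def simple_key(event)
  "event_#{event.id}_#{event.updated_at.strftime(STAMP_FORMAT)}"
end

# With associations, the key has to reflect every nested record as well.
def composite_key(event)
  stamps = [event.updated_at]
  event.users.each do |user|
    stamps << user.updated_at << user.profile.updated_at
  end
  "event_#{event.id}_#{event.users.size}_#{stamps.max.strftime(STAMP_FORMAT)}"
end

t0 = Time.utc(2016, 12, 21, 3, 36, 24)
profile = Profile.new(1, t0)
event = Event.new(42, t0, [User.new(1, t0, profile)])

before = composite_key(event)
profile.updated_at = t0 + 60 # only the nested profile changes
after = composite_key(event)

puts simple_key(event) # unchanged, so a cache keyed on it would now be stale
puts before == after ? "composite key unchanged" : "composite key changed"
```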
Discriminating readers will have noticed that in this situation, the process of creating the load testing scenario was relatively short and fast. The main reason was that proper load testing would have taken a lot of time, and its benefits wouldn’t have been worth the trouble. However, we are happy to share our strictly theoretical ideas of what we could have done if we’d had more time.
First, we could have identified the typical behavior patterns of the users. For example, we might have used the following categories (among others):
Second, by analyzing the “production” database, we could have obtained information about the proportions of the different user types and used it as a scenario. In addition to the more realistic testing procedure, we could have:
Unfortunately, we had to go through the entire process described above to realize a key finding: that it was necessary to transform the number of active users into a measurable metric (RpS). Fortunately, the performance improvement we achieved was sufficient, and we didn’t have to try anything more. Of course, there are undoubtedly several other avenues we could’ve tried to improve performance even further. The modern market requires not only impressive performance of currently available features, but also constant development of new ones.
In addition to tests at the target RpS, it can be useful to run tests at three to four times that value to find future bottlenecks or just potential weak spots. By using this method, we were able to discover the insufficient size of the database’s “connection pool”, as well as non-optimal memory usage, which caused errors in the most unforeseen situations. Need a performance boost for your application? Let us know!
Andrey Oleynikov has built his entire career within Distillery’s walls, starting out as a junior developer and rising to become head of the Web Department. Over the years he has mastered both the .NET and Rails frameworks, along with the C# and Ruby programming languages. Andrey is a big fan of beer and DevOps, though not both at the same time.