Tag Archives: performance

Percentiles aggregation with Redis

We all know that we should monitor pro-actively our applications and response is a natural choice when working on performance. Concerning timings, it’s important to keep this golden rule in mind “Average is bad !” Averages are used too often for in monitoring because they are quite easy/memory efficient to compute (sum of values/number of samples). There are hundreds of posts and articles that explain this and why you should not rely on it (like Anscombe’s quartet). Percentiles are much better, but less easy to compute especially on a large-scale distributed system (big data, multiple sources, …)

Aggregating percentiles is a common problem for everyone and it exists a lot of ways to do it. I highly suggest to read this article from baselab : this is just a must-read if you plan to work on this topic !  One section in this last article refers to “Buckets” and this is exactly the approach we will use here.  “Another approach is to divide the domain into multiple buckets and for each value/occurrence within the bucket boundaries increment this bucket’s count by one”

buckets

Increments ? Do you know an appropriate key-value store for this purpose ? Yes, Redis.  My implementation consists of storing all buckets into a redis hash, each bucket being a hash field. The field name is the upper bound and the value is the number of samples for this interval. We use HINCRBY (complexity O(1)) on client/emitter side to increment the number of samples for the target bucket. This will give us some sort of histogram in redis for that event. It could be incremented by one to many clients because commands are isolated and atomic. On the other side (supervisor, admin dashboard, …), in order to get the N-percentile for that hash, we need to get all fields, sort them (hash fields are not sorted by default) and find the exact field name where the aggregated number of samples (starting from the first) is beyond the target number of samples (total samples * N-percentiles).

One particular thing to notice is the bucket size. There is no rule concerning this and it could a constant, linear, exponential, … scale. The choice of the bucket size depends mainly on two factors : the desired accuracy (small size means better accuracy) and the domain (delta between  min/max). For example, a constant bucket size of 50 ( [0-50],[50-100],[100-150], …) is not accurate for small values (ex 130 => [100-150]) but much more accurate for higher values (like 1234 => [1200-1250]). Over the time, I generally use some kind of varying bucket size inspired by HDR Histogram.

Bucket Size/Round Factor Step Lower Value Upper Value
1 0 0 128
2 1 128 384
4 2 384 896
8 3 896 1920
16 4 1920 3968
32 5 3968 8064
64 6 8064 16256
128 7 16256 32640
256 8 32640 65408

Number of redis hash fields

For example, a sample value of 2040 is located in the interval [2032-2048]. This is just an example of a non-linear scale that provides a good accuracy, while keeping an acceptable number of hash fields

The number of hash fields is important because it will directly impact on memory or storage (if configured). Hashes have a hidden memory optimization using ziplist when the number of fields is below a configurable threshold (512 by default). So, we can’t have too many fields/buckets if persistence is required. In practice, depending on the number of samples and the distribution of values, some buckets are often missing. This is not a problem and give us more possibilities to select the bucket size.

Querying Redis

During my work on this topic, my initial intention was to create a LUA script to compute percentiles in redis : we don’t need to get all the fields+values on processing side, and percentile computation is just a matter of sorting/counting hash fields. After several test, I can easily say that it is a very bad idea ! The best version of my script is quite efficient –I suppose-  but taking several ms for the worst cases (eg. thousands of buckets). You should know that only one script can be executed at the same time on redis and I prefer to do more IO than blocking my redis instance. By the way, the complexity of HGETALL is O(1), so quite CPU-efficient.

Precision on client side

This implementation has a nice advantage: the precision is configured on client/emitter side, because the algorithm to compute the bucket is located there. You could change the algorithm at runtime or even have multiple implementations that you can apply to your components. On server-side, it doesn’t matter because you just have to sort and count. This is pretty good when you have multiple kind of apps and components, all having a unique domain (nano seconds, milliseconds…)

Smart Key Pattern

Since now, I’ve considered only one hash, but this is not enough if we want to create beautiful graphs (99th percentile for the last 7 days per hour). A common way –in NoSQL- to solve this problem is to use the smart key pattern. A Key can be much more that a simple constant string value. If we know how we will query the data model later (and we know in this case), let’s add dynamic parts into the key. We’re working with timeseries so a rounded timestamp is a natural choice. All the keys will follow the pattern ‘mykey:hour:roundedts’.

We can’t aggregate percentiles, and there is no “magical algorithm” to compute percentiles from percentiles. If we want to investigate response times during a few minutes, we need aggregations  for every minute. The best way to do this is to increment several smart keys at the same time (minute, 5 minute, hour, day, …) on client side. That’s not a real problem on client side if your binding implementents pipelining.

Limit the impact on client side

We’ve seen that buckets can be incremented quite easily on client side, I mean without spending too much cpu cycles. However, itis not always a good thing. Depending on your applications, a better approach could be to implement some kind of batch processing. If you receive 100 requests during the same second, there is a high probability that you could group your timings (10 requests < 10ms, 30 requests between 10 and 20, 15 between 20 and 30ms, …). This will reduce significantly the number of commands executed.

To illustrate the concept, I’ve created a small gist using nodejs. This example is not prod-ready and far from my current implementation (C#) but it should help you to understand the principles of this approach.

A few line of codes later, and with the help of your favorite chart library, you can visualize this kind of graph. Please take a moment to compare the average and percentiles. Percentiles are definitively muc better !

 

range

Performance Tips & Tricks with Xamarin for Android

I was recently involved on a project made with Xamarin. This new mobile app is not released yet, but it’s on a good track and I’m satisfied of the result. The initial objective was to create several Xamarin apps (Brands x Areas  on Android + iOS) with an important focus on code sharing & quality. I can’t talk too much on the software architecture but it’s built with MvvmCross.

This article targets Xamarin for Android (aka MonoDroid) only. I hope to find the time to write the same kind of article for iOS, but most of things listed here are pertinent for both platforms.

Watch out network requests

The primary data source for the app are public web apis. These days, it’s a fairly common architecture for a mobile application but don’t forget that latency on mobile networks matters. On a mobile network, we can’t really expect to get the response in less than 100 ms like on desktop browsers. To illustrate this, we’ve simply generated trace messages on every HTTP request ! (url, headers, and sometimes content) It may seem very basic, but it’s terribly important. This helped us to detect duplicated calls, lack of caching, ghost UI controls, duplicated events handlers …

Here are some HTTP logs at app’s startup. Same color = Same uri. Look at the timings, any problem here?

xam_iosI suggest you enable this kind of logging very early in your project, so everyone will be familiar and educated to mobile latency. Logging all IOs was certainly one of the best idea, but it’s only half of the jjob. To reduce the number of requests you could

  • Use Batch requests (combine several calls into a single request) or design your api following “Scenario Driven Design” principles
  • Use Abuse local caching, for example with Akavache or Sqlite
  • Use modernhttpclient and check gzip compression
  • Use pertinent speculative requests and background refreshe

Watch out memory allocation

My team and I am are used to develop server side code (web sites, web apis, workers, ..), running on servers having decent hardware. Unfortunately, you don’t have –yet- that luxury on a mobile app. I don’t think there are limitations but to be clear your app will have to run with only a few MBs. You should be more concerned on memory than ever. There are mainly two problems :

  • Excessive allocation adds pressure on GC

Garbage Collection & Memory and Performance Best Practices are must read to understand GC mechanisms on Xamarin. Why GC is so bad ? It just stops all running threads, including the UI thread L. Bye Bye, smooth animations and fast rendering. Minor collections are cheap (only a few ms, sometimes 50 ms) but Major collections are quite slow (from 100 ms to 1 sec). Don’t forget that Major GC collects Gen1 and large object space heaps (LOS). The LOS is where objects that require more than 8000 bytes are kept.  8000 is small, terribly small, especially when you have to consume web apis.

  • Memory leak increase major collections and leads to maximum GRefs reached

If you don’t pay attention on memory, your nursery, major heap or LOS will be rapidly full. Since Xamarin.Android 4.1.0, a full  -major- GC is performed when a gref threshold is crossed. This threshold is 90% of the known maximum grefs for the platform: 1800 grefs on the emulator (2000 max), and 46800 grefs on hardware (maximum 52000).

Here are some GC minor messages, visible via logcat. You should always try to understand these messages, and when they happen.

xam_gcs.pngObserve how the LOS size increases after HttpClient calls, a major GC is expected very soon…

Here is another typical example: Suppose translations contains 2000 items, and that this code is executed on each webview. Failed !

xam_ls.png

To detect excessive allocation and memory leaks, Android Studio are Xamarin Profiler are great. The produced mlpd file by the last tool can also be opened by Heap-Shot (mono).

Optimize your images

iOS has a built-in image optimization process but not Android. Like for Web performance Optimization, it’s better to have small & optimized images for both the GPU and your app size. Especially on Android, where are there are several folders, it is easy to include hundreds of images.

Many online services are available but you can also use directly pngout/opting for png and jpegtran for jpeg files. Just for your information, 25% was saved of the total assets size.

Note : please be sure to not include useless resources

Optimize your webviews

It’s also very common to have webviews in a mobile app. You can think of the WebView as a chromeless browser window that’s typically configured to run fullscreen. It’s sometimes required to explicitly NOT use native code for some topics (login, registration, content boxes with html, ….)  There are two well-known issues with them. First, webviews and their mobile browser counterpart don’t have similar performance profiles (Do u webview ?) ; it’s less accurate since latest versions, but that’s still something to take into account. Second, a webview is often focused on one feature and that’s all. Traditional SPAs has references to all scripts and styles in the main document, leading to a few seconds lost at startup (networks requests, parsing, layout …). That’s a terribly bad idea to integrate an SPA (made with any JS framework), and hope the page to be fully loaded in less than 5 secs. This simply means that your webviews should be optimized for your native app usage.

  • Log every request of your webview
  • Intercept HTTP requests and sometimes replace response
  • Use Javascript interfaces to bootstrap your JS
  • Check HTTP caching headers
  • Optimize your web resources (js, css, images, …) for a webview usage

Here is what you should AVOID (aka our initial attempt to integrate a SPA into a webview)

xam_webviews.png

Optimize your .NET code and keep your prod build clean

Developers often add debug messages, diagnostics code, local variables … don’t forget to cleanup that kind of code. Here is one of my favorite examples:

xam_traces.png

string.format and JsonConvert are executed even in a RELEASE build.

I will not talk about Asynchronous programming, which is just mandatory to create a responsive and professional app in 2015 but as a general perf recommendation, try to understand the “hot path” on your app/views. For example, the primary serializer used in this app is Newtonsoft.Json. It’s not the fastest .NET json serializer but certainly the more mature on Xamarin. There are some tips to make it go even faster. Using stream for deserialization is also great to avoid allocation on LOS, because the threshold could be easily reached with json format (see memory section, only 8KB).

Finally, like for any app, keep your production build optimized and clean.

  • Use a Release configuration
  • Remove dead code, unused variables & classes
  • Disable Android debugging
  • Review logging & diagnostics

Understand the Android Platform and controls

The app may be written in C#, using Mono/Xamarin … but it’s still running on Android. If you are familiar with Windows development, Xaml or WP, please try to forget everything because it’s a completely different platform. Is it better to use a listview or a recyclerview ? How to implement properly a Pager Adapter ? Is my layout optimized ? So yes, you will have to read a lot of articles here but it’s not a waste of time, because it is very close to the –official- Android development. Most of the articles, tips, SO answers that you can read on Android also apply to Xamarin.Android. I’ve already talked about webviews and we’ve done a lot of things that are specific to Android to get the fastest page load. At the middle of the project, a memory leak was identified on a recyclerview. There were simply too many instances of a custom controls.

xam_mvxframe.jpg

Our understanding of the inner PagerAdapter was just totally wrong, and we’ve fixed it after reading several SO answers.

Here the main tricks & tips that help me a lot. I wish I knew this at the beginning. My team and I did a lot of codes fixes & fine tuning that are mostly irrelevant without our  context, that’s why you have to find your way into Xamarin development.  I Hope this will help you.

Storing time series in Microsoft Azure Table Storage

In a recent project, I needed to store time series into Microsoft Azure Table Storage. As for most of NoSQL databases, the design is very important and you have to mainly think about your access patterns.

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Time series data often arise when monitoring an application or tracking business metric, but occur naturally in many application areas like finance, economics, medicine …

appinsights

Time series are very frequently plotted via –beautiful- charts. Here is an example on availability on Application Insight. In this example, the end user can view response time for several time ranges. Data is aggregated depending on the selected time range (every second for last hour, every minute for last day, every 5 minutes for last 48 hours, …) to keep it easy to understand.

If you’re unfamiliar with Table Service on Azure, I highly recommend you to read the Introduction then the Design Guide ; after reading both articles, I hope you will understand why storing time series is not so easy and needs this kind of article.

To illustrate following designs, I use random generated data, basically one point/second. Having one data point per second makes it very clear and easy to read. In your implementation, you can have more than one point per second; if doesn’t matter.

Basic design

A first design could be to create a basic entity for each data points, like this one

public class DataPoint : TableEntity
{
    public long Value { get; set; }
    public DataPoint() { }
    public DataPoint(string source, long timeStamp, long value)
    {
        this.PartitionKey = source;
        this.RowKey = timeStamp.ToString();
        this.Value = value;
    }
}

Why a timestamp for the time property? It’s simply a better choice because RowKey is a string (constraint by Table Service). Ticks could be an idea but I prefer a common value (Unix timestamp) for time axis, so any client could request this table via REST endpoint.

This will produce this kind of results

ts_basic

The RowKey equals here to the Unix timestamp in seconds (due to my test data) but it could be the traditional Unit timestamp (ms). Now If you want to query entities for a specific time range (two hours), you need to use a range query on the RowKey. This is the 2nd most efficient query type on table storage. Here is an example for the first two hours of 2015:

// time range : first two hours of 2015
var from = new DateTime(2015, 1, 1, 0, 0, 0);
var to = new DateTime(2015, 1, 1, 2, 0, 0);

// generate the range query
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "mysource"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, ToSecondsTimestamp(from).ToString()),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, ToSecondsTimestamp(to).ToString()))
        );

TableQuery<DataPoint> query = new TableQuery<DataPoint>().Where(filter);
var results = table.ExecuteQuery<DataPoint>(query);

The generated filter is $filter= (PartitionKey eq ‘mysource’) and ((RowKey ge ‘1420070400’) and (RowKey le ‘1420077600’))

This design seems to be good but it isn’t for real world scenarios. First, the rule #1 of Table Service (scalability) is not followed here. Performance is not the same for 1 000 rows and 1 000 000 000 rows. Having a hot partition (always the same PartitionKey) is an anti-scalability on Table Service. Second, the number of data points included in the desired time range affects performance : in my example, for 2 hours, we will receive 7200 entities (one entity for every second). The REST service limits the number of returned entities to 1000 per request. So, we will have to execute several requests to get all entities for the selected time range. Fortunately the .net will silently do this job for us but it’s not the case for all clients. And what about last 24 hours ? What about data every ms ?

Pros Cons
Natural representation No Scalability (hot partition)
Easy to query with Range queries Inefficient queries (limited to 1000 entities per request)
  Slow inserts (limited to 100 points by EGT)

A better approach could be to store row in reverse order. New data points are automatically added at the beginning of the table. Queries should be a little more efficient, but not so many in facts.

Advanced Design

The key idea with time series is that one dimension is well-known: the time. Every request contains a lower bound (from) and an upper bound (to). So, we will use the Compound key pattern to compute smart Partition & Row Keys thus enabling a client to lookup related data with an efficient query. A very common technique when working with time series is to round the time value with a magic factor. To fully understand these magic factors, you need one of my gists. ToRoundedSecondsTimestamp() is heavily used in the my examples. For example Tuesday, 28 April 2015 12:04:35 UTC could be rounded to

Factor Rounded timestamp Datetime
1 (no factor) 1430222735 Tuesday, 28 April 2015 12:04:35 UTC
60 (one hour) 1430222700 Tuesday, 28 April 2015 12:04:00 UTC
3600 (one hour) 1430222400 Tuesday, 28 April 2015 12:00:00 UTC
86400 (one day) 1430179200 Tuesday, 28 April 2015 00:00:00 UTC

In this design, each row contains more than one data points; basically all points included in the current step (between the current rounded timestamp and the next rounded timestamp). This is possible thanks to DynamicTableEntity in Table Service. Just to illustrate the concept, here is an example with PartitionFactor = 60 secs and RowFactor = 5 secs.

ts_advanced

Notice how PartitionKey/RowKey are now rounded timestamps. This is exactly the same data in my first design. In this example, each row contains 5 data points and each partition contains 12 rows. Of course, in a real scenario, our objective is to maximize the number of points in a single row.

To get the first two hours of 2015, the query is a little bit more complicated but not less efficient.

var from_ts = ToRoundedSecondsTimestamp(new DateTime(2015, 1, 1, 0, 0, 0), RowFactor);
var to_ts = ToRoundedSecondsTimestamp(new DateTime(2015, 1, 1, 2, 0, 0).AddSeconds(-1), RowFactor);

// generate all row keys in the time range
var nbRows = (to_ts - from_ts) / RowFactor + 1;
var rowKeys = new long[nbRows];

for (var i = 0; i < nbRows; i++)
{
    rowKeys[i] = from_ts + i * RowFactor;
}

// group row keys by partition
var partitionKeys = rowKeys.GroupBy(r => ToRoundedTimestamp(r, PartitionFactor));

var partitionFilters = new List<string>();
foreach (var part in partitionKeys)
{
    // PartitionKey = X and (RowKey=Y or RowKey=Y+1 or RowKey=Y+2 ...)
    string partitionFilter = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, part.Key.ToString());
    string rowsFilter = string.Join(" " + TableOperators.Or + " ", part.Select(r => TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, r.ToString())));
    string combinedFilter = TableQuery.CombineFilters(partitionFilter, TableOperators.And, rowsFilter);
    partitionFilters.Add(combinedFilter);
}

// combine all filters
string final = string.Join(" " + TableOperators.Or + " ", partitionFilters);
var query = new TableQuery<DynamicTableEntity>().Where(final);
var res = table.ExecuteQuery(query);

//do something with results...

The generated filter (with partition Factor = one hour, row Factor =4 minutes) is $filter= (PartitionKey eq ‘1420070400’) and (RowKey eq ‘1420070400’ or RowKey eq ‘1420070640’ or RowKey eq ‘1420070880’ or RowKey eq ‘1420071120’ or RowKey eq ‘1420071360’ or RowKey eq ‘1420071600’ or RowKey eq ‘1420071840’ or RowKey eq ‘1420072080’ or RowKey eq ‘1420072320’ or RowKey eq ‘1420072560’ or RowKey eq ‘1420072800’ or RowKey eq ‘1420073040’ or RowKey eq ‘1420073280’ or RowKey eq ‘1420073520’ or RowKey eq ‘1420073760’) or (PartitionKey eq ‘1420074000’) and (RowKey eq ‘1420074000’ or RowKey eq ‘1420074240’ or RowKey eq ‘1420074480’ or RowKey eq ‘1420074720’ or RowKey eq ‘1420074960’ or RowKey eq ‘1420075200’ or RowKey eq ‘1420075440’ or RowKey eq ‘1420075680’ or RowKey eq ‘1420075920’ or RowKey eq ‘1420076160’ or RowKey eq ‘1420076400’ or RowKey eq ‘1420076640’ or RowKey eq ‘1420076880’ or RowKey eq ‘1420077120’ or RowKey eq ‘1420077360’)

So, what are good PartitionKey/RowKey factors? It’s up to you and depends on your context but there are two constraints. First, a table service entity is limited to 1 MB with a maximum of 255 properties (including the PartitionKey, RowKey, and Timestamp). If you have data every second, 4 mins (60*4 = 240) seems to be a good RowKey Factor, but if you have several points per second it won’t be pertinent. The second problem is the filter length. Having too many partitions and rows will create a very long filter that can be rejected by the table service (HTTP 414 “Request URI too long”). In this case, you can execute partitions scans by removing row keys in the produced filter but this will be less efficient (depends on the number of rows in each partition)

Pros Cons
Highly scalable Point queries are not always possible (Uri length limits)
Fast inserts (if required) Granularity should be defined at the beginning (how many rows/partition ? how many points ?)
Dev tools can’t be used with dynamic columns

To conclude, Microsoft Azure Table Storage is a fast and very powerful service that you should not forget to use for your application, even if they are not hosted on Azure. However you should use proper designs for your tables, very far from traditional relational designs. Choosing a good couple of PartitionKey/RowKey is very important. We covered in this article a simple use case (time series) but there are so many scenarios… In terms of pricing, it’s very cheap ; but your choice should not be driven by costs here, but by scalability and performance.

Source code is available here : Basic and Advanced 

How to avoid 26 API requests on your page?

The problem

Create an applications relying on web APIs seems to be quite popular these days. There is already an impressive collection of ready-to-use public APIs (check it at http://www.programmableweb.com ) that we can consume to create mashup, or add features into our web sites.

In the meantime, it has never been so easy to create your own REST-like API with node.js, asp.net web api, ruby or whatever tech you want. It’s also very common to create your own private/restricted API for your SPA, Cross-platform mobiles apps or your own IoT device. The naïve approach, when building our web API, is to add one API method for every feature ; at the end, we got a well-architectured and brilliant web API following the Separation of Concerns principle : one method for each feature. Let’s put it all together in your client … it’s a drama in terms of web performance for the end user and there is never less than 100s requests/sec  on staging environment; Look at your page, there are 26 API calls on your home page !

I talk here about web apps but it’s pretty the same for mobile native applications .RTT and latency is much more important than bandwidth speed. It’s impossible to create a responsive and efficient application with a chatty web API.

The proper approach

At the beginning of December 2014, I attended at the third edition in APIdays in Paris. There was an interesting session –among the others- on Scenario Driven Design by @ijansch.

The fundamental concept in any RESTful API is the resource. It’s an abstract concept and it’s merely different from a data resource. A resource should not be a raw data model (the result of an SQL query for the Web) but should be defined with client usage in mind. “REST is not an excuse to expose your raw data model. “ With this kind of approach, you will create dump clients and smart APIs with a thick business logic.

A common myth is “Ok, but we don’t know how our API is consumed, that’s why we expose raw data”. Most of the times, it’s false. Let’s take the example of twitter timelines. They are list of tweets or messages displayed in the order in which they were sent, with the most recent on top. This is a very common feature and you can see timelines in every twitter client. Twitter exposes a timeline API and API clients just have to call this API to get timelines. Especially, clients don’t have to compute timelines by themselves, by requesting XX times the twitter API for friends, tweets of friends, etc …

I think this is an important idea to keep in mind when designing our APIS. generally,  We don’t need to be so RESTful (Wwhat about HATEOS ?). Think more about API usability and scenarios, that RESTfullness.

The slides of this session are available here.

Another not-so-new approach: Batch requests

Reducing the number of request from a client is a common and well-known Web Performance Optimization technique. Instead of several small images, it’s better to use sprites. Instead of many js library files, it’s better to combine them. Instead of several API calls, we can use batch requests.

Batch requests are not REST-compliant, but we already know that we should sometimes break the rules to have better performance and better scaliblity.

If you find yourself in need of a batch operation, then most likely you just haven’t defined enough resources., Roy T. Fieldin, Father of REST

What is a batch request ?

A batch request contains several different API requests into a single POST request. HTTP provides a special content type for this kind of scenario: Multipart. On server-side, requests are unpacked and dispatched to the appropriate API methods. All responses are packed together and sent back to the client as a single HTTP response.

Here is an example of a batch request:

Request

POST http://localhost:9000/api/batch HTTP/1.1
Content-Type: multipart/mixed; boundary="1418988512147"
Content-Length: 361

--1418988512147
Content-Type: application/http; msgtype=request

GET /get1 HTTP/1.1
Host: localhost:9000


--1418988512147
Content-Type: application/http; msgtype=request

GET /get2 HTTP/1.1
Host: localhost:9000


--1418988512147
Content-Type: application/http; msgtype=request

GET /get3 HTTP/1.1
Host: localhost:9000


--1418988512147--

Response

HTTP/1.1 200 OK
Content-Length: 561
Content-Type: multipart/mixed; boundary="91b1788f-6aec-44a9-a04f-84a687b9d180"
Server: Microsoft-HTTPAPI/2.0
Date: Fri, 19 Dec 2014 11:28:35 GMT

--91b1788f-6aec-44a9-a04f-84a687b9d180
Content-Type: application/http; msgtype=response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

"I am Get1 !"
--91b1788f-6aec-44a9-a04f-84a687b9d180
Content-Type: application/http; msgtype=response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

"I am Get2 !"
--91b1788f-6aec-44a9-a04f-84a687b9d180
Content-Type: application/http; msgtype=response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

"I am Get3 !"
--91b1788f-6aec-44a9-a04f-84a687b9d180--

Batch requests are already supported by many web Frameworks and allowed by many API providers:  Asp.net web api, a node.js module, Google Cloud platform, Facebook, Stackoverflow,Twitter …

Batch support in  asp.net web api

To support batch requests in your asp.net web API, you just have to add a new custom route  :

config.Routes.MapHttpBatchRoute(
routeName: "batch",
routeTemplate: "api/batch",
batchHandler: new DefaultHttpBatchHandler(GlobalConfiguration.DefaultServer)
);

Tip : the DefaultBatchHandler doesn’t provide a way to limit the number of requests in a batch. To avoid performance issues, we may want to limit to 100/1000/… concurrent requests. You have to create your own implementation by inheriting DefaultHttpBatchHandler.

This new endpoint will allow client to send batch requests and you have nothing else to do on server-side. On client side,  to send batch requests, you can use this jquery.batch, batchjs, angular-http-batcher module, …

I will not explain all details here but there is an interesting feature provided by DefaultHttpBatchHandler : the property ExecutionOrder allow to choose between sequential or no sequential processing order. Thanks to the TAP programming model, it’s possible to execute API requests in parallel (for true async API methods)

Here is the result of async/sync batch requests for a pack of three 3 web methods taking one second to be processed.

batch

Finally, Batch requests are not a must-have feature but it’s certainly something to keep in mind. It can help a lot in some situations. A very simple demo application is available here. Run this console app or try to browse localhost:9000/index.html. From my point of view here are some Pros/Cons of this approach.

Pros Cons
Better client performance (less calls) May increase complexity of client code
Really easy to implement on server side Hide real clients scenarios, not REST compliant
Parallel requests processing on server side Should limit batch size at server level for public API
Allow to use GET, POST, PUT, DELETE, … Browser cache may not work properly

Towards a better local caching strategy

Why Caching?

A while ago, I explained to some of my co-workers the benefits of caching. I’m always surprised to see how this technique is so misunderstood by some developers.

In computing, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere.

The thing is that caching is already present everywhere: CPU, Disk, Network, Web, DNS, … it’s one of the oldest programming techniques, available in any programming language and frameworks. You may think that it was mandatory with –only- 8 Ko of RAM two decades ago, but don’t too naive: it’s still a pertinent approach in our always-connected world : more data, more users, more clients, real-time, …

In this article, I will focus only on application caching through System.Runtime.Caching. Nothing really new here, but I just want to review 3 basics caching strategies, that you can see in popular and oss projects; it’s important to have solid foundations. Even if the language is C#, many concepts listed here are also valid in others languages.

Local Caching Strategies

By Local/InMemory cache, I mean that data is held locally on the computer running an instance of an application. System.Web.Caching.Cache, System.Runtime.Caching.MemoryCache, EntLib CacheManager are well-known local cache.

There is no magic with caching and there is a hidden trade-off: caching means working with stale data. Should I increase the cache duration? Should I keep short TTL value? It’s never easy to answer to these questions, because it simply depends on your context: topology of data, number of clients, user load, database activity…

When implementing a local caching strategy, there is an important list of questions to ask yourself :

  • How long the item will be cached?
  • Is data coherence important?
  • How long it takes to reload the data item?
  • Does the number of executed queries on the data source matter?
  • Does caching strategy impact the end user ?
  • What is the topology of data: Reference data, Activity data, session data, … ?

The –very- basic interface we will implement in those 3 following examples contains a single method.

 public interface ICacheStrategy
 {
 /// <summary>
 /// Get an item from the cache (if cached) else reload it from data source and add it into the cache.
 /// </summary>
 /// <typeparam name="T">Type of cache item</typeparam>
 /// <param name="key">cache key</param>
 /// <param name="fetchItemFunc">Func<typeparamref name="T"/> used to reload the data from the data source (if missng from cache)</param>
 /// <param name="durationInSec">TTL value for the cache item</param>
 /// <param name="tokens">list of string to generate the final cache key</param>
 /// <returns></returns>
 T Get<T>(string key, Func<T> fetchItemFunc, int durationInSec, params string[] tokens);
 }

Basic Strategy

The full implementation is available here.

        public T Get<T>(string key, Func<T> fetchItemFunc, int durationInSec, params string[] tokens)
        {
            var cacheKey = this.CreateKey(key, tokens);
            var item = this.Cache.Get<T>(cacheKey);
            if (this.IsDefault(item))
            {
                item = fetchItemFunc();
                this.Cache.Set(cacheKey, item, durationInSec, false);
            }
            return item;
        }

This is similar to Read-Through caching. The caller will always get an item back, coming from the cache itself or the data source. When a cache client asks for an entry, and that item is not already in the cache, the strategy will automatically fetch it from the underlying data source, then place it in the cache for future use and finally will return the loaded item to the caller.

Double-checked locking

The full implementation is available here.

        public T Get<T>(string key, Func<T> fetchItemFunc, int durationInSec, params string[] tokens)
        {
            string cacheKey = this.CreateKey(key, tokens);
            var item = this.Cache.Get<T>(cacheKey);

            if (this.IsDefault(item))
            {
                object loadLock = this.GetLockObject(cacheKey, SyncLockDuration);
                lock (loadLock)
                {
                    item = this.Cache.Get<T>(cacheKey);
                    if (this.IsDefault(item))
                    {
                        item = fetchItemFunc();
                        this.Cache.Set(cacheKey, item, durationInSec);
                    }
                }
            }

            return item;
        }

This version introduces a locking system. A global synchronization mechanism (a single object for every cache item) is not efficient here, that’s why there is a dedicated synchronization object per cache item (depending on the cache key). The double-checked locking is also really important here to avoid useless/duplicated requests on the data source.

Refresh ahead strategy

The full implementation is available here.

        public T Get<T>(string key, Func<T> fetchItemFunc, int durationInSec, params string[] tokens)
        {
            // code omitted for clarity

            // not stale or don't use refresh ahead, nothing else to do =&gt; back to double lock strategy
            if (!item.IsStale || staleRatio == 0) return item.DataItem;
            // Oh no, we're stale - kick off a background refresh

            var refreshLockSuccess = false;
            var refreshKey = GetRefreshKey(cachekey);

            // code omitted for clarity

            if (refreshLockSuccess)
            {
                var task = new Task(() =>
                {
                    lock (loadLock)
                    {
                        // reload logic
                    }
                });
                task.ContinueWith(t =>
                {
                    if (t.IsFaulted) Trace.WriteLine(t.Exception);
                });
                task.Start();
            }
            return item.DataItem;
        }

In this implementation it’s possible to configure a stale ratio, enabling an automatic and asynchronous refresh on any recently accessed cache entry before its expiration. The application/end user will not feel the impact of a read against a potentially slow cache store when the entry is reloaded due to expiration. If the object is not in the cache and If the object is accessed after its expiration time, it’s similar to the double-checked locking strategy.

Refresh-ahead is especially useful if objects are being accessed by a large number of users. Values remain fresh in the cache and the latency that could result from excessive reloads from the cache store is avoided.

Experimental Results

To see the impact of each strategy, I’ve committed code Github. This program simulates fake workers, that get the same item from the cache during 60 sec for each strategy. A fake reload method taking one sec is used to simulate access to the datasource.  All cache hits, cache misses and reloads are recorded. This may not be the best program of the world, but it’s fairly enough to illustrate this article.

Strategy Number of Gets/Reloads Avg. Get Time (ms) < 100 ms
Basic 3236 / 181 59.17 ms 94.15 %
Double-checked locking 3501 / 11 51.22 ms 94.10 %
Refresh Ahead 3890 / 11 0.20 ms 100%

Between the basic and the double-checked locking strategy, there is an important improvement in terms of reloads from the data source, with nearly the same avg. response time. Between the double-checked locking and refresh-ahead strategy, the number of reloads is exactly the same but the response time is greatly improved. It’s very easy to use a cache, but be sure to use the appropriate pattern that fits correctly to your use cases.

Bonus : Local Cache invalidation

One year ago, I’ve posted an article on Betclic Tech Blog about implementing local memory cache invalidation with Redis PubSub feature. The idea is fairly simple: catch invalidation messages at application level to remove or more items from a local cache. The current version is more mature and a nuget is available here.  This can be easily used in many ways:

  • Invalidate items, such as remove them from the cache
  • Mark items as stale, to use Background reloading with external event..

To conclude

We covered here only some variations of the cache-aside pattern. I hope that you’re now more aware of the possibilities and troubles you may have with a local cache. It’s very easy and efficient to use a local cache, so be sure to use the appropriate pattern that fits correctly to your use cases.

By the way, in an era of cloud applications, fast networks, low latencies, a local cache is –still- very important. Nothing, absolutely nothing is faster that accessing local memory. One natural evolution often cited for a local cache, is a distributed cache. As we’ve seen here, this is not always the unique solution, but there is another story.

The full source code is available on Github.

Book review : Writing High-Performance .NET Code

Cover-Tall-2000x2828-212x300
Ben Watson has been a software engineer at Microsoft since 2008. On the Bing platform team, he has built one of the world’s leading .NET-based, high-performance server applications, handling high-volume, low-latency requests across thousands of machines for millions of customers.

More recently Ben Watson posted two articles on codeproject, both are “Best C# article of the month”. So you can trust this guy.

There are so many books, so many articles & blogs, so many tweets talking about .net performance. Why do we need another book in 2014 and even why do we still need to understand .net internals today when our applications may be hosted on servers with 16 cores and 112 GB of RAM (Azure A9) ?

First, optimizing your code will scale much more better than scaling up/adding new hardware. For sure, you can add more cores, more RAM to your production servers for few €/$ per month, and this is may be a pertinent approach in some cases.  But don’t be naive, a bottleneck in your application is like the sword of Damocles “with great fortune and power come also great peril and anxiety”

More than ever, it’s difficult to predict the performance of a single line of code. By integrating nuget packages, OSS librairies, commercial components … everywhere (“don’t reinvent the wheel, use a framework, they all say”) … our code becomes some kind of magic. In an Agile world, our codebase changes very often and it’s the same of our dependencies: Any commit could potentially destroy your application performance. Good developers have full tests suites (unit, integration, e2e, …) Why not performance tests and benchmarks ?

More than 13 years after the first version of .net, it’s still there, very competitive & efficient in many areas. We can use .net to develop web sites, universal apps, android/iOs app, …. “One language to rule them all, and it’s C#”. Recent additions like async/await and the new features of C# 6 show us that it’s an active and dynamic language. Learning .net runtime internals is certainly not useless.

This is my feeling about dotnet performance today. That’s why I bought this book one month ago.

If you’re a bit concerned about performance, you may know that it’s a very broad topic. What is performance? response time, CPU on a front server, low available memory on a back service … The famous quote of Donald Knuth “Premature optimization is the root of all evil” is a really good argument for detractors “don’t spend time of code optimizations, it’s useless”

One simple rule, maybe the most important in this book is : Measure, Measure, Measure! Don’t try to guess, it’s impossible; don’t try to optimize everything, it’s also impossible. By measuring accurately, you will be efficient. Most of the time to reach your performance goal, you just have to change a few areas in your code, and not everything.

In this book, the author lists a wide range of tools that could help you when investigating performance issues: Visual Studio Profiler, PerfView, CLR Memory Profiler, MeasureIt, … This tool chain should be part of every Performance Chief Architect. “Optimizing performance is meaningless if you do not have effective tools for measuring it”.

This book is quite recent (summer 2014), so the content is quite up-to-date; For example, there are specific chapters on Windows Phone, TPL, .net 4.5 … By the way, it’s a delicious mix between .net internals, best practices and recommendations. It’s hard to talk about code optimization without talking about Garbage collection & Jit compilation.  It may not be the best explanation (in terms of details), but it’s certainly one of the easiest to understand. If you’re quite new in .net, it’s mandatory to understand these topics.

The middle of this book is like a catalog. What is a performance cost of throwing exceptions? Class or struct ? Why Enum.ToString() is so CPU-consuming ? How to work with strings efficiently ? Any difference between for & foreach ? …. You may have so many questions and there are so many answers in this book. Each topic is illustrated by code, IL, … and how to diagnose this problem with your new favorite tools (PerfView, CLR Profiler, Dump, …)

What I like also about this book is that it’s not only focused on code. The Performance effort should not be supported by only one guy (you or me) but by the whole team. Having always up-to-date production metrics, automated benchmarks & profiling, educated product owners, is very important for a successful performance strategy.

It will definitively be part of my library about performance, near to others famous books on this topic. I recommend it to every .net developer.

Writing High-Performance .NET Code

The importance of useless Micro-optimization

Micro-optimization is “the process of meticulous tuning of small sections of code in order to address a perceived deficiency in some aspect of its operation (excessive memory usage, poor performance, etc)”

We’ve all seen this kind of debates : Is X faster than Y? Should I have to replace A by B in my code? It’s not specific to any language, but it’s a real question for all developers. Programmers are generally advised to avoid micro-optimization, unless they have a solid justification. Yes, it depends

Most of the time, Micro-optimizations are confirmed by benchmarks: a very specific code section is executed thousands/millions times to illustrate the problem and to confirm initial hypothesis. A is X times slower than B. Of course, in real world applications, we rarely call one piece of code so many times, so this stuff may seem inappropriate. But the trouble is that the extra N-milliseonds is CPU time on the server. A web server could simply be idle waiting for request, processing other requests, … but instead it is busy executing the same inefficient methods over and over. Those N-ms can even become N-s during peaks load.

Improving performance of an application is an endless road. It’s a very time-consuming activity and don’t forget that’s it’s not the core of your Business : Your Boss wants you to deploy new features and not pointless optimizations. It’s very common to spend several hours (days?) to reach your performance goals.

By performance I mean something that limits scalability of your application. It could be CPU, network IO, Memory … You may know that all applications (systems) have throughput limits. Your Job, as a chief performance officer, is to be keep fast response times in all circumstances (unexpected system difficulties, light load/heavy load) and to be far from these limits.

The performance of a code section is a mix between frequency & efficiency. I don’t care to write an inefficient code if it’s rarely used but I’m really concerned by most called operations. Have you already count how many times ToString() is called in your .net project ? We can’t blame someone else for having written un-optimized code that we want to use in our –different- context. For example, .toString() has maybe not been coded to be used as a key for Dictionaries.

Since 10 years of professional programming, I’ve seen serious performance problems. “I don’t understand why it takes 5 sec to generate my page each evening. Everything is cached and I’ve run many benchmarks: all my methods take less than 5ms.” Yes, but these methods are called 100 times while generating your page. Simply speaking, 5 ms is too slow for your context !

Even with a clean architecture it’s sometimes difficult to determine the critical rendering path.  That’s why I’m a big fan of profiling tools like Visual Studio Profiler: they give you a complete overview without changing your application. Collected data during a load test session,  can be precious.

I often see Micro-optimization as a challenge: let me try to significantly improve the performance –and scalability- of your application by changing the fewest lines as possible. If I can’t do this, I’m either not good enough or there are no interesting optimizations.

Examples in .net

A few months ago, I already explained on the Betclic Techblog one bad experience we had with Structuremap and Web Api. To summarize, we’ve significantly reduced CPU usage of a web application, just by splitting IoC bindings into multiple containers.

More recently, I was mandated to do a performance review on a WCF service consuming too much CPU & memory. Without any doubts –thanks to my favorite profiler- , one method is causing several alerts & warnings.

buildmessage

This method is in a base class. Because the project follows the Onion architecture, dependency injection is heavily used here to manage dependencies (at core, service & repository layers). As a consequence to be in a pure stateless world, this method can be invoked several times during each request processing.  The average call rate is approx. 150/sec during peaks load, so this is a good candidate for a micro-optimization.

How to improve this method? It doesn’t look so bad at the first view ….

A first idea could be to optimize string operations. We all know the obvious beginner mistakes of string concatenation. Memory allocations are fast on modern PCs, but they’re far from free.  String.Join()/StringBuilder seems better in this case.

A Second better idea is to remove Enum.ToString(). The type of Legislation/Brand/Platform is enum  The method Enum.ToString() is called implicitly in this code section (during concatenation). An interesting article on codeproject explains the troubles with Enum.ToString().

A few minutes later, I finally produced this extension method (Gist)


public static class EnumExtensions
{
    public static class EnumCache&lt;TEnum&gt;
    {
        public static Dictionary&lt;TEnum, string&gt; Values = new Dictionary&lt;TEnum, string&gt;();
        static EnumCache()
        {
            var t = typeof(TEnum);
            var values = (TEnum[])Enum.GetValues(t);
            var names = Enum.GetNames(t);
            for (var i = 0; i &lt; values.Length; i++)
            {
                Values.Add(values[i], names[i]);
            }           
        }

        public static string Get(TEnum enm)
        {
            return Values[enm];
        }
     }

     public static string AsString&lt;TEnum&gt;(this TEnum enm) where TEnum : struct, IConvertible
     {
         return EnumCache&lt;TEnum&gt;.Values[enm];
     }
}

Let’s run benchmarks it on 50000 iterations (check the gist for the full source code)

vsprofilerenum

consoleoutputenum

Faster and Less allocated memory, Not so bad. (Note : It’s even a little better than the original codeproject article but with less possibilities). For sure, a .net Guru may find a more efficient way but It’s enough for me: I reached my performance goal. We now have an optimized ready-to-use extension method with less than 30 lines of code, that’s 40 times faster. Very well done!

But, We’re ALL WRONG !

Optimizing code means you have to think differently. Simply create a property (eventually lazy),  computed just once per request, and inserted somewhere in the scope. That’s it.

In every project, there are code conventions: this class to manage logging, this one to format date format, this one to translate string… We should be very careful with this kind of helpers, extension methods, utilities; we all have this kind of classes in our projects to factorize pure-technical stuff. Unsupervised usage may lead to serious performance issues.  Never do anything more than once.

Is this micro-optimization a complete waste of time? No at all. We’ve done a lot of interesting things: explore .net internals, run profiling sessions & benchmarks. The final solution is still valid but just not so important in our context. But it could be useful in the future…