This article will reflect on the tradeoff between readability and performance of MATLAB code. It will explain and motivate the splitapply construct in MATLAB by solving the following problem: What is the best way to calculate the average Height per Gender?

Loading data

The code in this article was written and tested in MATLAB R2019b. MATLAB comes with some dummy data in the file patients.mat that you can use to reproduce the code in this article.

load patients
data = table(Gender, Height);
data_cat = data;
data_cat.Gender = categorical(data_cat.Gender);

Tables are a very useful data type for data analytics in MATLAB. For this example, VersionBay converted a few variables into a table and ensured that the Gender column is of categorical type. The motivation is that the categorical data type is designed for variables that take a finite set of discrete categories, such as Gender.

Name        Size       Bytes
data        100 x 2    14292
data_cat    100 x 2    2426
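The sizes above can be inspected directly from the workspace. The following is a minimal sketch, assuming the patients.mat sample file that ships with MATLAB; the exact byte counts may differ between releases.

```matlab
% Build the two tables from the sample data shipped with MATLAB.
load patients                       % brings Gender and Height (among others) into the workspace
data = table(Gender, Height);       % Gender is stored as a cell array of character vectors
data_cat = data;
data_cat.Gender = categorical(data_cat.Gender);

% Compare the memory footprint of both tables.
info = whos('data', 'data_cat');
for k = 1:numel(info)
    fprintf('%-10s %d x %d  %d bytes\n', ...
        info(k).name, info(k).size(1), info(k).size(2), info(k).bytes);
end

% List the distinct categories and how many rows fall into each one.
disp(categories(data_cat.Gender))
disp(countcats(data_cat.Gender))
```

The categorical column stores each label once and keeps a small integer code per row, which is why data_cat is expected to need fewer bytes than data.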

Using for loop

GenderList = categories(data_cat.Gender);
for idx = 1:length(GenderList)
    idxHeight = data_cat.Height(data_cat.Gender==GenderList(idx));
    HeightAvg_WithFor(idx, 1) = mean(idxHeight);
end

The code above is probably what most MATLAB users would end up doing to calculate the average Height per Gender. The approach is:

  1. write a for loop for the gender types
  2. find the indexes per Gender (Male and Female)
  3. calculate the mean of the Height based on each set of indexes

The issue here is that such a trivial operation takes 5 lines of code, and not everyone can easily read the first line inside the for loop. Although this approach is common, it makes it hard for others to quickly grasp what is happening, and therefore hard to maintain. The issue is amplified if the developer does not add any comments.
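As an illustration of how much comments and naming matter here, below is one possible way to write the same loop with a preallocated result and a named logical mask. The variable name isThisGender is an assumption introduced for this sketch, not part of the original code.

```matlab
load patients
data_cat = table(categorical(Gender), Height, 'VariableNames', {'Gender', 'Height'});

GenderList = categories(data_cat.Gender);          % unique gender labels
HeightAvg_WithFor = zeros(numel(GenderList), 1);   % preallocate one mean per gender
for idx = 1:numel(GenderList)
    % Logical mask that is true for the rows of the current gender.
    isThisGender = data_cat.Gender == GenderList{idx};
    HeightAvg_WithFor(idx) = mean(data_cat.Height(isThisGender));
end
```

The result is the same as before; the loop is still longer than the alternatives, but the intent of each line is easier to follow.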

Using “splitapply”

byGroup = findgroups(data_cat.Gender);
HeightAvg_WithSplitApply = splitapply(@mean, data_cat.Height, byGroup);

There is a more elegant way of solving this problem with 2 lines of code. The approach is to leverage built-in functions in MATLAB: findgroups and splitapply.

  1. store the group index of every row (in this case by Gender)
  2. split the Height data per Group
  3. apply the mean function per Group

This is a simple and elegant way of solving the problem. However, not many people are aware of the findgroups command (see documentation), and the second line passes a function handle to mean using the @ sign. The nice thing about this approach is that even if you are not familiar with findgroups or function handles, the code above is easy to read. The difficulty is finding out about findgroups and splitapply on your own. MATLAB is not the only language with a split-apply construct, so anyone who knows the pattern from another language only needs to search the MATLAB documentation for it. As a side note, both functions have been in MATLAB since R2015b.
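To make the role of the function handle more concrete, here is a small sketch that also uses the optional second output of findgroups to label the results. The names GenderName, HeightSpread and summaryTable are chosen for illustration only.

```matlab
load patients
data_cat = table(categorical(Gender), Height, 'VariableNames', {'Gender', 'Height'});

% findgroups returns a group number per row and, optionally, the label of each group.
[byGroup, GenderName] = findgroups(data_cat.Gender);

% Any function handle works: a named function ...
HeightAvg = splitapply(@mean, data_cat.Height, byGroup);
% ... or an anonymous function (here: the spread between tallest and shortest).
HeightSpread = splitapply(@(h) max(h) - min(h), data_cat.Height, byGroup);

% Combine labels and results so that each row states which group it describes.
summaryTable = table(GenderName, HeightAvg, HeightSpread)
```

Keeping the group labels next to the numbers avoids having to remember which row of the output belongs to which Gender.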

Using “grpstats”

HeightAvg_WithGrpstats = grpstats(data, 'Gender', "mean")

If you have the Statistics and Machine Learning Toolbox, you might know that there is an even easier way: the grpstats function (see documentation), which has been available since R2014a. This single line of code is readable and covers the most common statistical operations: mean, min, max, count, standard deviation, variance, median and range. It is widely used in statistics and machine learning applications, which is why it ships with the Statistics and Machine Learning Toolbox from MathWorks.
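As a sketch of how several of these statistics can be requested in one call, the example below passes a cell array of statistic names. It assumes the Statistics and Machine Learning Toolbox is installed; the variable name stats is illustrative.

```matlab
load patients
data = table(Gender, Height);

% Request several statistics at once; each one becomes a column such as mean_Height.
stats = grpstats(data, 'Gender', {'mean', 'median', 'std'}, 'DataVars', 'Height')
```

The returned table also contains a GroupCount column, so the number of rows per Gender comes for free.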

Performance considerations

Up until now, the article has focused on the readability and elegance of the code. However, performance is also an important aspect to consider when comparing implementations. Code that is easier to read does not necessarily perform better.

For more accurate performance measurements, we made the dataset 100000 times larger with the following command: data = repmat(data,100000,1). Using runperf from the MATLAB performance testing framework, VersionBay ran each implementation 14 times (including 4 warm-up runs) and compared the medians of the 3 approaches. The table shows the time each approach took in different MATLAB versions.
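The exact test harness is not shown in this article, so the following is only a sketch of how such a comparison could be set up as a script-based test for runperf. The file name groupMeanPerfTest.m, the variable names and the smaller scale factor are assumptions made for this example; the grpstats section requires the Statistics and Machine Learning Toolbox.

```matlab
% groupMeanPerfTest.m (hypothetical file name; it should start or end with "test")
% Shared setup: everything before the first %% section is available to all sections.
load patients
scale = 1000;                                    % assumption: smaller than the article's 100000
data = repmat(table(Gender, Height), scale, 1);
data_cat = data;
data_cat.Gender = categorical(data_cat.Gender);

%% for loop
GenderList = categories(data_cat.Gender);
avgFor = zeros(numel(GenderList), 1);
for idx = 1:numel(GenderList)
    avgFor(idx) = mean(data_cat.Height(data_cat.Gender == GenderList{idx}));
end

%% splitapply
avgSplit = splitapply(@mean, data_cat.Height, findgroups(data_cat.Gender));

%% grpstats
avgGrp = grpstats(data, 'Gender', 'mean');
```

With the file saved on the MATLAB path, results = runperf('groupMeanPerfTest') measures each %% section as a separate test. The samples can then be stacked with vertcat(results.Samples) and summarized, for example with varfun(@median, ..., 'InputVariables', 'MeasuredTime', 'GroupingVariables', 'Name').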

[Table: median run time of each approach across nine MATLAB releases. Columns: Release (with update level), Median for loop, Median splitapply, Median grpstats.]

Looking at the table carefully there are 3 things worth mentioning:

  1. The for loop approach is just over 5.7 times faster than splitapply and 13.2 times faster than grpstats.
    • even though the for loop is the hardest to read, it outperforms the other two implementations
  2. splitapply improved its performance by 39% in R2018b.
  3. The for loop implementation became 16% faster from R2017a to R2019b.

If you found this analysis interesting and would like this done on your code feel free to contact us.