This article will reflect on the tradeoff between readability and performance of MATLAB code. It will explain and motivate the
splitapply construct in MATLAB by solving the following problem: What is the best way to calculate the average Height per Gender?
The code in this article was written and tested in MATLAB R2019b. MATLAB comes with some dummy data in the file
patients.mat that you can use to reproduce the code in this article.
    load('patients.mat');
    data = table(Gender, Height);
    data_cat = data;
    data_cat.Gender = categorical(data_cat.Gender);
Tables are a very useful datatype for data analytics in MATLAB. For this example, VersionBay converted a few variables into a table and ensured that the Gender column is of categorical type. The motivation is that the categorical datatype is designed for variables that take a finite set of discrete categories, such as Gender.
|Name||Size||Bytes|
|data||100 x 2||14292|
|data_cat||100 x 2||2426|
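One way to inspect the two tables is with whos. The sketch below assumes the numbers above are the byte counts that whos reports; the exact values may differ between releases.

```matlab
% Rebuild both tables from the sample data shipped with MATLAB.
load('patients.mat');
data = table(Gender, Height);
data_cat = data;
data_cat.Gender = categorical(data_cat.Gender);

% Compare the memory footprint of the cellstr and categorical versions.
info = whos('data', 'data_cat');
for k = 1:numel(info)
    fprintf('%-10s %d bytes\n', info(k).name, info(k).bytes);
end

% List the distinct categories stored in the Gender column.
disp(categories(data_cat.Gender));
```

The categorical column stores each category name once and keeps only a small integer code per row, which is why it tends to need less memory than a cell array of character vectors.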
Using a “for” loop
    GenderList = categories(data_cat.Gender);
    for idx = 1:length(GenderList)
        idxHeight = data_cat.Height(data_cat.Gender==GenderList(idx));
        HeightAvg_WithFor(idx, 1) = mean(idxHeight);
    end
The code above is probably what most MATLAB users would end up doing to calculate the average Height per Gender. The approach is:
- write a for loop over the gender types
- find the indexes per Gender (Male and Female)
- calculate the mean of the Height based on each set of indexes
The issue here is that such a trivial operation takes 5 lines of code, and not everyone can easily read the first line inside the for loop. So this approach, even though it is a common one, makes it hard for others to quickly grasp what is happening, which in turn makes the code hard to maintain. The issue is amplified if the developer does not add any comments.
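To illustrate how much comments and descriptive names help, here is a sketch of the same loop with a preallocated result and a named logical mask. The variable name isCurrentGender is introduced here for illustration and is not part of the original code.

```matlab
load('patients.mat');
data_cat = table(categorical(Gender), Height, ...
    'VariableNames', {'Gender', 'Height'});

GenderList = categories(data_cat.Gender);          % cell array of category names
HeightAvg_WithFor = zeros(numel(GenderList), 1);   % one slot per gender

for idx = 1:numel(GenderList)
    % Logical mask selecting the rows whose Gender matches the current category
    isCurrentGender = data_cat.Gender == GenderList{idx};
    % Average the heights of the selected rows
    HeightAvg_WithFor(idx) = mean(data_cat.Height(isCurrentGender));
end
```

This version is longer than the original, so it trades brevity for readability rather than solving the underlying verbosity.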
    byGroup = findgroups(data_cat.Gender);
    HeightAvg_WithSplitApply = splitapply(@mean, data_cat.Height, byGroup);
There is a more elegant way of solving this problem with 2 lines of code. The approach is to leverage built-in functions in MATLAB:
- store the indexes of all the Groups (in this case by Gender)
- split the table per Group
- apply the mean function per Group
This is a simple and elegant way of solving the problem; however, not many people are aware of the findgroups command (see documentation), or that the @ sign in the second line passes a function handle to mean. The nice thing about this approach is that even if you are not familiar with findgroups and function handles, the code above is easy to read. The difficulty is finding out about findgroups and splitapply on your own. MATLAB is not the only language with a split-apply construct, so anyone who knows the pattern from another language only needs to search for it in the MATLAB documentation. As a side note, both functions have been in MATLAB since R2015b.
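As a small extension of the two lines above, the sketch below uses the second output of findgroups to label the results and passes an anonymous function handle to splitapply. The names groupNames, HeightRange and summary are chosen here for illustration.

```matlab
load('patients.mat');
Gender = categorical(Gender);

% The second output of findgroups lists the group that each index refers to.
[byGroup, groupNames] = findgroups(Gender);

% Any function handle that reduces a vector to one value can be applied.
HeightAvg   = splitapply(@mean, Height, byGroup);
HeightRange = splitapply(@(h) max(h) - min(h), Height, byGroup);

% Collect the results in a table so each value is labelled with its group.
summary = table(groupNames, HeightAvg, HeightRange);
disp(summary);
```

Because splitapply accepts any function handle, the same pattern covers statistics that have no built-in name, without changing the structure of the code.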
    HeightAvg_WithGrpstats = grpstats(data, 'Gender', 'mean')
Now, if you have the Statistics and Machine Learning Toolbox, you might know that there is an even easier way of doing this: the grpstats function (see documentation), which has been available far longer than splitapply. This single line of code is readable and covers the most common statistical operations: mean, min, max, count, standard deviation, variance, median and range. It is widely used in statistics and machine learning applications, hence its place in the Statistics and Machine Learning Toolbox from MathWorks.
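For example, several statistics can be requested in one call by passing a cell array of names. This is a minimal sketch that assumes the Statistics and Machine Learning Toolbox is installed.

```matlab
load('patients.mat');
data = table(Gender, Height);

% Request several statistics at once by passing a cell array of names.
stats = grpstats(data, 'Gender', {'mean', 'std', 'min', 'max'});

% The result is a table with one row per group; a GroupCount column
% and one column per statistic (for example mean_Height) are added.
disp(stats);
```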
Up until now, the article has focused on the readability and elegance of the code. However, performance is also an important aspect to consider when looking at different implementations. If one writes code that is easier to read, it does not mean that it performs better.
For more accurate performance measurements we replicated the dataset 100000 times with the following command: data = repmat(data,100000,1). Using runperf from the performance testing framework in MATLAB, VersionBay ran each implementation 14 times (including 4 warm-up runs) and compared the medians of the 3 approaches. The results show the time each approach took in different MATLAB versions.
|Release||Median ||Median ||Median |
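For a quick comparison on your own machine, the sketch below uses timeit rather than the runperf setup described above, so it is not the measurement procedure behind the article's numbers. The helper meanWithFor is a hypothetical wrapper, and the replication factor is reduced to keep the run short.

```matlab
load('patients.mat');
data_cat = table(categorical(Gender), Height, ...
    'VariableNames', {'Gender', 'Height'});
data_cat = repmat(data_cat, 1000, 1);   % smaller factor than in the article

% timeit calls each handle several times and returns a typical run time.
tFor   = timeit(@() meanWithFor(data_cat));
tSplit = timeit(@() splitapply(@mean, data_cat.Height, ...
                               findgroups(data_cat.Gender)));
fprintf('for loop:   %.4f s\nsplitapply: %.4f s\n', tFor, tSplit);

% Local helper (hypothetical name) wrapping the for-loop approach
% so that timeit can call it through a function handle.
function avg = meanWithFor(t)
    groups = categories(t.Gender);
    avg = zeros(numel(groups), 1);
    for idx = 1:numel(groups)
        avg(idx) = mean(t.Height(t.Gender == groups{idx}));
    end
end
```

Timings depend on the release, the hardware and the data size, so the ratios you see may differ from the ones reported here.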
Looking at the table carefully, there are 3 things worth mentioning:
- the for loop approach is just over 5.7 times faster than splitapply and 13.2 times faster than grpstats; even though the for loop is the hardest to read, it outperforms the other two implementations
- splitapply improved its performance by 39% in R2018b
- the for loop implementation improved its performance by 16% from R2017a to R2019b
If you found this analysis interesting and would like this done on your code, feel free to contact us.