This article will reflect on the tradeoff between readability and performance of MATLAB code. It will explain and motivate the `splitapply` construct in MATLAB by solving the following problem: What is the best way to calculate the average Height per Gender?

The code in this article was written and tested in MATLAB R2019b. MATLAB comes with some dummy data in the file `patients.mat` that you can use to reproduce the code in this article.

```
load('patients.mat');
data = table(Gender, Height);
data_cat = data;
data_cat.Gender = categorical(data_cat.Gender);
```

Tables are a very useful data type for data analytics in MATLAB. For this example, VersionBay combined two variables into a table and converted the Gender column to the `categorical` type. The motivation is that the categorical data type is designed for variables that take values from a finite set of discrete categories, such as Gender. It also uses considerably less memory, as the comparison below shows.

| Variable | Size | Bytes |
|---|---|---|
| data | 100 x 2 | 14292 |
| data_cat | 100 x 2 | 2426 |
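The size comparison above can be reproduced with `whos`. The following is a minimal sketch that rebuilds both tables and prints their memory footprint; the variable `info` is introduced here for illustration, and the exact byte counts may differ between MATLAB releases.

```matlab
% Rebuild both tables from the sample data that ships with MATLAB.
load('patients.mat');
data = table(Gender, Height);
data_cat = data;
data_cat.Gender = categorical(data_cat.Gender);

% whos returns one struct per variable with its name, size and bytes.
info = whos('data', 'data_cat');
for k = 1:numel(info)
    fprintf('%-10s %d x %d  %d bytes\n', ...
        info(k).name, info(k).size(1), info(k).size(2), info(k).bytes);
end
```

The categorical version is smaller because each row stores a small integer code instead of a full character vector.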

### Using a “for” loop

```
GenderList = categories(data_cat.Gender);
for idx = 1:length(GenderList)
    idxHeight = data_cat.Height(data_cat.Gender==GenderList(idx));
    HeightAvg_WithFor(idx, 1) = mean(idxHeight);
end
```

The code above is probably what most MATLAB users would end up doing to calculate the average Height per Gender. The approach is:

1. write a `for` loop for the gender types
2. find the indexes per Gender (Male and Female)
3. calculate the mean of the Height based on each set of indexes

The issue is that such a trivial operation takes 5 lines of code, and not everyone can easily read the first line inside the `for` loop. Although this approach is common, it makes it hard for others to quickly grasp what is happening, and therefore hard to maintain. The problem is amplified if the developer does not add any comments.
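As an illustration of how comments and intermediate names can ease the readability problem, here is a sketch of the same loop with a preallocated result and a named logical mask. The names `gender`, `genderList`, `isThisGender`, `heightAvg` and `result` are chosen for this example and are not part of the original code.

```matlab
load('patients.mat');
gender = categorical(Gender);

% One entry per distinct gender, in category order.
genderList = categories(gender);

% Preallocate the result so the array does not grow inside the loop.
heightAvg = zeros(numel(genderList), 1);

for idx = 1:numel(genderList)
    % Logical mask that selects the rows belonging to the current gender.
    isThisGender = (gender == genderList{idx});
    heightAvg(idx) = mean(Height(isThisGender));
end

% Pair each average with its label for easier inspection.
result = table(genderList, heightAvg, 'VariableNames', {'Gender', 'MeanHeight'});
```

This is longer than the original, which is the tradeoff: the intent of each step is explicit, at the cost of more lines.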

### Using “splitapply”

```
byGroup = findgroups(data_cat.Gender);
HeightAvg_WithSplitApply = splitapply(@mean, data_cat.Height, byGroup);
```

There is a more elegant way of solving this problem with 2 lines of code. The approach is to leverage built-in functions in MATLAB: `findgroups` and `splitapply`.

1. store the group index of every row (in this case grouped by Gender),
2. split the Height column per group,
3. apply the `mean` function to each group.

This is a simple and elegant way of solving the problem. However, not many people are aware of the `findgroups` command (see documentation), or of passing a function handle, as is done with `@mean` in the second line. The nice thing about this approach is that the code is easy to read even if you are not familiar with `findgroups` or function handles. The difficulty is discovering `findgroups` and `splitapply` on your own. Other languages offer similar split-apply constructs, so anyone who knows the pattern from elsewhere only needs to search for it in the MATLAB documentation. As a side note, both functions have been in MATLAB since R2015b.
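To show how the function handle can be varied, the sketch below requests the group labels as a second output of `findgroups` and applies an anonymous function alongside `@mean`. The variables `genderNames`, `heightMean`, `heightRange` and `summary` are names made up for this example.

```matlab
load('patients.mat');

% The second output holds one label per group, in the same order as the group numbers.
[byGroup, genderNames] = findgroups(categorical(Gender));

% A handle to an existing function.
heightMean = splitapply(@mean, Height, byGroup);

% An anonymous function works the same way, as long as it returns one value per group.
heightRange = splitapply(@(h) max(h) - min(h), Height, byGroup);

summary = table(genderNames, heightMean, heightRange);
```

Keeping the labels next to the results avoids having to remember which row belongs to which group.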

### Using “grpstats”

`HeightAvg_WithGrpstats = grpstats(data, 'Gender', 'mean')`

If you have the Statistics and Machine Learning Toolbox, you might know that there is an even easier way of doing this: the `grpstats` function (see documentation), which has been available since R2014a. This single line of code is readable and covers the most common statistical operations: mean, min, max, count, standard deviation, variance, median and range. It is widely used in statistics and machine learning applications, which is why it ships with the Statistics and Machine Learning Toolbox from MathWorks.
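Several statistics can be requested in one call by passing a cell array of names. The following is a small sketch, assuming the Statistics and Machine Learning Toolbox is installed; the variable `stats` is a name chosen for this example.

```matlab
load('patients.mat');
data = table(Gender, Height);

% Request several summary statistics per Gender in a single call.
stats = grpstats(data, 'Gender', {'mean', 'std', 'median'});

% The result has one row per group; each statistic is stored in a variable
% named after the statistic and the data column, such as mean_Height.
disp(stats)
```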

### Performance considerations

Up until now, the article has focused on the readability and elegance of the code. However, performance is also an important aspect to consider when comparing implementations. Code that is easier to read does not necessarily perform better.

For more accurate performance measurements, we made the dataset 100,000 times larger with the following command: `data = repmat(data,100000,1)`. Using `runperf` from the performance testing framework in MATLAB, VersionBay ran each implementation 14 times (including 4 warm-up runs) and compared the medians of the 3 approaches. The table shows the median time each approach took in different MATLAB releases.

| Release | Median `for` loop (s) | Median `splitapply` (s) | Median `grpstats` (s) |
|---|---|---|---|
| R2015b Update 1 | 0.2631 | 2.4606 | 2.9780 |
| R2016a Update 7 | 0.2624 | 2.0914 | 3.5884 |
| R2016b Update 7 | 0.2717 | 2.0594 | 3.5869 |
| R2017a Update 3 | 0.2827 | 2.0668 | 3.8783 |
| R2017b Update 9 | 0.2687 | 2.2490 | 3.9078 |
| R2018a Update 6 | 0.2801 | 2.2325 | 3.8711 |
| R2018b Update 6 | 0.2594 | 1.3641 | 3.1833 |
| R2019a Update 6 | 0.2441 | 1.3626 | 3.1269 |
| R2019b Update 3 | 0.2375 | 1.3574 | 3.1418 |
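For readers who want a quick comparison on their own machine, here is a rough sketch that uses `timeit` instead of `runperf`, so it is not the setup that produced the numbers above. The function name `benchmarkHeightMean`, the helper functions and the replication factor of 1000 are assumptions made for this example; `grpstats` could be timed the same way if the toolbox is available.

```matlab
function t = benchmarkHeightMean()
% Hypothetical helper: times two of the approaches with timeit.
S = load('patients.mat', 'Gender', 'Height');

% Replicate the data (factor chosen arbitrarily for this sketch).
gender = repmat(categorical(S.Gender), 1000, 1);
height = repmat(S.Height, 1000, 1);

% timeit calls each handle several times and returns a typical run time in seconds.
t.forLoop    = timeit(@() meanWithFor(gender, height));
t.splitapply = timeit(@() meanWithSplitapply(gender, height));
end

function avg = meanWithFor(gender, height)
% Loop over the categories and average the matching rows.
names = categories(gender);
avg = zeros(numel(names), 1);
for k = 1:numel(names)
    avg(k) = mean(height(gender == names{k}));
end
end

function avg = meanWithSplitapply(gender, height)
% Group the rows, then apply mean to each group.
avg = splitapply(@mean, height, findgroups(gender));
end
```

Saved as `benchmarkHeightMean.m`, it can be called as `t = benchmarkHeightMean()` and returns a struct with one timing per approach.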

Looking at the table carefully there are 3 things worth mentioning:

1. In R2019b, the `for` loop approach is just over 5.7 times faster than `splitapply` and 13.2 times faster than `grpstats`.
   • Even though the `for` loop is the hardest to read, it outperforms the other two implementations in every release tested.
2. `splitapply` improved its performance by 39% in R2018b.
3. The `for` loop implementation got a 16% performance improvement from R2017a to R2019b.
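The figures in the list above follow directly from the medians in the table. A short sketch of the arithmetic, with variable names chosen for this example:

```matlab
% Medians copied from the table above.
forLoop_R2017a    = 0.2827;
forLoop_R2019b    = 0.2375;
splitapply_R2018a = 2.2325;
splitapply_R2018b = 1.3641;
splitapply_R2019b = 1.3574;
grpstats_R2019b   = 3.1418;

% How many times faster the for loop is in R2019b.
speedupVsSplitapply = splitapply_R2019b / forLoop_R2019b;   % about 5.7
speedupVsGrpstats   = grpstats_R2019b / forLoop_R2019b;     % about 13.2

% Relative improvements, in percent.
splitapplyGain = 100 * (splitapply_R2018a - splitapply_R2018b) / splitapply_R2018a;  % about 39
forLoopGain    = 100 * (forLoop_R2017a - forLoop_R2019b) / forLoop_R2017a;           % about 16
```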

If you found this analysis interesting and would like this done on your code feel free to contact us.