On MATLAB Answers a common question is:

“How can I parse multiple csv files in a folder and collect all the data in one MATLAB variable?”

MATLAB Answers

There are 2 parts to this question:

  • Reading Multiple Text Files
  • Parsing a CSV File Quickly

To set the scene for this technical article VersionBay used the following CSV file. This file has over 36 thousand rows and 18 different columns with different datatypes, all separated by commas.

Reading Multiple Text Files

To tackle the first part of the question most MATLAB users will start with:

listFiles = dir('*.csv');

This works as it returns a structure with a list of the text files. This leads to the following common MATLAB construct.

for idx = 1:length(listFiles)
    file_name = listFiles(idx).name;
    % use custom MATLAB import function, e.g. from Import Wizard
    data{idx} = importfile(file_name); 
end

An alternative could be to leverage the datastore function that is available in MATLAB since R2014b:

ds = datastore('*.csv');

MATLAB creates a datastore associated to the given files. For text files, properties for parsing the files are pre-populated, such as: FileEncoding, Delimiter and MultipleDelimitersAsOne. The magic happens when one performs:

all_data = ds.readall();

This returns the data from all the text files in the variable all_data.

Parsing a CSV File Quickly

MATLAB has several ways of doing this and to the beginner it can be overwhelming the number of ways to do the same thing.

There is the person who comes from C and will use fscanf

function data = my_parse_file_function(filename)

fp = fopen(filename);
fileFormat = '%d,%d,%C'; % this is customized according to the data
data = fscanf(fp, fileFormat);
fclose(fp);

Or there is the person who attended MATLAB training. He may try the Import Wizard and generate a function that will use textscan. This will produce something like:

function FLinsurancesample = importfile(filename, startRow, endRow)
%IMPORTFILE Import numeric data from a text file as a matrix.
%   FLINSURANCESAMPLE = IMPORTFILE(FILENAME) Reads data from text file
%   FILENAME for the default selection.
%
%   FLINSURANCESAMPLE = IMPORTFILE(FILENAME, STARTROW, ENDROW) Reads data
%   from rows STARTROW through ENDROW of text file FILENAME.
%
% Example:
%   FLinsurancesample = importfile('FL_insurance_sample.csv', 2, 36635);
%
%    See also TEXTSCAN.

% Auto-generated by MATLAB on 2018/12/14 16:58:46

%% Initialize variables.
delimiter = ',';
if nargin<=2
    startRow = 2;
    endRow = inf;
end

%% Read columns of data as text:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%[^\n\r]';

%% Open the text file.
fileID = fopen(filename,'r');

%% Read columns of data according to the format.
% This call is based on the structure of the file used to generate this
% code. If an error occurs for a different file, try regenerating the code
% from the Import Tool.
dataArray = textscan(fileID, formatSpec, endRow(1)-startRow(1)+1, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines', startRow(1)-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
for block=2:length(startRow)
    frewind(fileID);
    dataArrayBlock = textscan(fileID, formatSpec, endRow(block)-startRow(block)+1, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines', startRow(block)-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
    for col=1:length(dataArray)
        dataArray{col} = [dataArray{col};dataArrayBlock{col}];
    end
end

%% Close the text file.
fclose(fileID);

%% Convert the contents of columns containing numeric text to numbers.
% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));

for col=[1,4,5,6,7,8,9,10,11,12,13,14,15,18]
    % Converts text in the input cell array to numbers. Replaced non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            
            % Detected commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains(',')
                thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = textscan(char(strrep(numbers, ',', '')), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end


%% Split data into numeric and string columns.
rawNumericColumns = raw(:, [1,4,5,6,7,8,9,10,11,12,13,14,15,18]);
rawStringColumns = string(raw(:, [2,3,16,17]));


%% Replace non-numeric cells with NaN
R = cellfun(@(x) ~isnumeric(x) && ~islogical(x),rawNumericColumns); % Find non-numeric cells
rawNumericColumns(R) = {NaN}; % Replace non-numeric cells

%% Make sure any text containing <undefined> is properly converted to an <undefined> categorical
for catIdx = [1,2,3,4]
    idx = (rawStringColumns(:, catIdx) == "<undefined>");
    rawStringColumns(idx, catIdx) = "";
end

%% Create output variable
FLinsurancesample = table;
FLinsurancesample.policyID = cell2mat(rawNumericColumns(:, 1));
FLinsurancesample.statecode = categorical(rawStringColumns(:, 1));
FLinsurancesample.county = categorical(rawStringColumns(:, 2));
FLinsurancesample.eq_site_limit = cell2mat(rawNumericColumns(:, 2));
FLinsurancesample.hu_site_limit = cell2mat(rawNumericColumns(:, 3));
FLinsurancesample.fl_site_limit = cell2mat(rawNumericColumns(:, 4));
FLinsurancesample.fr_site_limit = cell2mat(rawNumericColumns(:, 5));
FLinsurancesample.tiv_2011 = cell2mat(rawNumericColumns(:, 6));
FLinsurancesample.tiv_2012 = cell2mat(rawNumericColumns(:, 7));
FLinsurancesample.eq_site_deductible = cell2mat(rawNumericColumns(:, 8));
FLinsurancesample.hu_site_deductible = cell2mat(rawNumericColumns(:, 9));
FLinsurancesample.fl_site_deductible = cell2mat(rawNumericColumns(:, 10));
FLinsurancesample.fr_site_deductible = cell2mat(rawNumericColumns(:, 11));
FLinsurancesample.point_latitude = cell2mat(rawNumericColumns(:, 12));
FLinsurancesample.point_longitude = cell2mat(rawNumericColumns(:, 13));
FLinsurancesample.line = categorical(rawStringColumns(:, 3));
FLinsurancesample.construction = categorical(rawStringColumns(:, 4));
FLinsurancesample.point_granularity = cell2mat(rawNumericColumns(:, 14));

This code is automatically generated and works however, one could also use:

all_data = ds.readall();

For each CSV file the importfile requires 48 seconds, the ds.readall only takes 0.2 seconds. This is over 220 times faster!

Read this article for updated performance in R2019a