On MATLAB Answers a common question is:
“How can I parse multiple
MATLAB Answerscsv files in a folder and collect all the data in one MATLAB variable?”
There are 2 parts to this question:
- Reading Multiple Text Files
- Parsing a CSV File Quickly
To set the scene for this technical article VersionBay used the following CSV file. This file has over 36 thousand rows and 18 different columns with different datatypes, all separated by commas.
Reading Multiple Text Files
To tackle the first part of the question most MATLAB users will start with:
listFiles = dir('*.csv');
This works as it returns a structure with a list of the text files. This leads to the following common MATLAB construct.
for idx = 1:length(listFiles) file_name = listFiles(idx).name; % use custom MATLAB import function, e.g. from Import Wizard data{idx} = importfile(file_name); end
An alternative could be to leverage the datastore function that is available in MATLAB since R2014b:
ds = datastore('*.csv');
MATLAB creates a datastore associated to the given files. For text files, properties for parsing the files are pre-populated, such as: FileEncoding, Delimiter and MultipleDelimitersAsOne. The magic happens when one performs:
all_data = ds.readall();
This returns the data from all the text files in the variable all_data.
Parsing a CSV File Quickly
MATLAB has several ways of doing this and to the beginner it can be overwhelming the number of ways to do the same thing.
There is the person who comes from C and will use fscanf
function data = my_parse_file_function(filename) fp = fopen(filename); fileFormat = '%d,%d,%C'; % this is customized according to the data data = fscanf(fp, fileFormat); fclose(fp);
Or there is the person who attended MATLAB training. He may try the Import Wizard and generate a function that will use textscan. This will produce something like:
function FLinsurancesample = importfile(filename, startRow, endRow) %IMPORTFILE Import numeric data from a text file as a matrix. % FLINSURANCESAMPLE = IMPORTFILE(FILENAME) Reads data from text file % FILENAME for the default selection. % % FLINSURANCESAMPLE = IMPORTFILE(FILENAME, STARTROW, ENDROW) Reads data % from rows STARTROW through ENDROW of text file FILENAME. % % Example: % FLinsurancesample = importfile('FL_insurance_sample.csv', 2, 36635); % % See also TEXTSCAN. % Auto-generated by MATLAB on 2018/12/14 16:58:46 %% Initialize variables. delimiter = ','; if nargin<=2 startRow = 2; endRow = inf; end %% Read columns of data as text: % For more information, see the TEXTSCAN documentation. formatSpec = '%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%[^\n\r]'; %% Open the text file. fileID = fopen(filename,'r'); %% Read columns of data according to the format. % This call is based on the structure of the file used to generate this % code. If an error occurs for a different file, try regenerating the code % from the Import Tool. dataArray = textscan(fileID, formatSpec, endRow(1)-startRow(1)+1, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines', startRow(1)-1, 'ReturnOnError', false, 'EndOfLine', '\r\n'); for block=2:length(startRow) frewind(fileID); dataArrayBlock = textscan(fileID, formatSpec, endRow(block)-startRow(block)+1, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines', startRow(block)-1, 'ReturnOnError', false, 'EndOfLine', '\r\n'); for col=1:length(dataArray) dataArray{col} = [dataArray{col};dataArrayBlock{col}]; end end %% Close the text file. fclose(fileID); %% Convert the contents of columns containing numeric text to numbers. % Replace non-numeric text with NaN. raw = repmat({''},length(dataArray{1}),length(dataArray)-1); for col=1:length(dataArray)-1 raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1)); end numericData = NaN(size(dataArray{1},1),size(dataArray,2)); for col=[1,4,5,6,7,8,9,10,11,12,13,14,15,18] % Converts text in the input cell array to numbers. Replaced non-numeric % text with NaN. rawData = dataArray{col}; for row=1:size(rawData, 1) % Create a regular expression to detect and remove non-numeric prefixes and % suffixes. regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)'; try result = regexp(rawData(row), regexstr, 'names'); numbers = result.numbers; % Detected commas in non-thousand locations. invalidThousandsSeparator = false; if numbers.contains(',') thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$'; if isempty(regexp(numbers, thousandsRegExp, 'once')) numbers = NaN; invalidThousandsSeparator = true; end end % Convert numeric text to numbers. if ~invalidThousandsSeparator numbers = textscan(char(strrep(numbers, ',', '')), '%f'); numericData(row, col) = numbers{1}; raw{row, col} = numbers{1}; end catch raw{row, col} = rawData{row}; end end end %% Split data into numeric and string columns. rawNumericColumns = raw(:, [1,4,5,6,7,8,9,10,11,12,13,14,15,18]); rawStringColumns = string(raw(:, [2,3,16,17])); %% Replace non-numeric cells with NaN R = cellfun(@(x) ~isnumeric(x) && ~islogical(x),rawNumericColumns); % Find non-numeric cells rawNumericColumns(R) = {NaN}; % Replace non-numeric cells %% Make sure any text containing <undefined> is properly converted to an <undefined> categorical for catIdx = [1,2,3,4] idx = (rawStringColumns(:, catIdx) == "<undefined>"); rawStringColumns(idx, catIdx) = ""; end %% Create output variable FLinsurancesample = table; FLinsurancesample.policyID = cell2mat(rawNumericColumns(:, 1)); FLinsurancesample.statecode = categorical(rawStringColumns(:, 1)); FLinsurancesample.county = categorical(rawStringColumns(:, 2)); FLinsurancesample.eq_site_limit = cell2mat(rawNumericColumns(:, 2)); FLinsurancesample.hu_site_limit = cell2mat(rawNumericColumns(:, 3)); FLinsurancesample.fl_site_limit = cell2mat(rawNumericColumns(:, 4)); FLinsurancesample.fr_site_limit = cell2mat(rawNumericColumns(:, 5)); FLinsurancesample.tiv_2011 = cell2mat(rawNumericColumns(:, 6)); FLinsurancesample.tiv_2012 = cell2mat(rawNumericColumns(:, 7)); FLinsurancesample.eq_site_deductible = cell2mat(rawNumericColumns(:, 8)); FLinsurancesample.hu_site_deductible = cell2mat(rawNumericColumns(:, 9)); FLinsurancesample.fl_site_deductible = cell2mat(rawNumericColumns(:, 10)); FLinsurancesample.fr_site_deductible = cell2mat(rawNumericColumns(:, 11)); FLinsurancesample.point_latitude = cell2mat(rawNumericColumns(:, 12)); FLinsurancesample.point_longitude = cell2mat(rawNumericColumns(:, 13)); FLinsurancesample.line = categorical(rawStringColumns(:, 3)); FLinsurancesample.construction = categorical(rawStringColumns(:, 4)); FLinsurancesample.point_granularity = cell2mat(rawNumericColumns(:, 14));
This code is automatically generated and works however, one could also use:
all_data = ds.readall();
For each
Read this article for updated performance in R2019a