Monthly Archives: April 2014

[Matlab] Import big file

The problem: I have a big txt file (well, relatively big, a little more than 200 MB) that I need to import into MATLAB and process. The file contains a table of characters and integers. The first row of the table is the column header. For my purposes, I only need data from specific columns (about 20 columns).

My initial solution: Since the file is not super huge and I have a pretty decent machine, I just went with the importdata function, since it generally handles delimited txt files that contain strings and integers pretty well.

The result: It took a long time, probably about 20 minutes, and the file was not imported completely. The header row was imported correctly, but only 4 rows of data were imported (I am still confused about why this happened, and why MATLAB did not even throw an error). But I did learn something useful: it turned out that I have about 7,000,000 columns (according to the size of my header row).

My thought: I am sure you are thinking the same thing: if I only need about 20 columns, it does not make much sense to read the whole file in. And apparently, reading the whole file in (at least with the importdata function) has failed me.

The new plan:
(1) import the header, figure out which columns need to be imported
(2) import the content of the table line by line, only store the columns needed.
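
For what it's worth, the same two-step plan can also be sketched with textscan, whose %*s specifier skips a field without storing it. This is only a sketch and not the approach used in the code further down: the tiny demo file (built from two rows of the sample data in this post) and the variable names fname, wanted, keep, fmt and cols are made up for illustration, and I have not checked how well a format string with millions of fields would hold up.

```matlab
% Hypothetical demo file (header plus two sample rows) so the sketch runs on its own.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'YKZ Timestamp Temp Humidity Wind Weather', ...
    '06-Sep-2013 01:00:00 6.6 89 4 clear', ...
    '09-Sep-2013 05:00:00 19.1 94 7 n/a');
fclose(fid);

wanted = {'YKZ', 'Temp', 'Wind'};        % columns to keep, by name

fid = fopen(fname);
header = strsplit(fgetl(fid), ' ');      % step (1): read only the header line
keep = ismember(header, wanted);         % logical mask of the columns to keep

% Step (2): build a format that skips (%*s) every unwanted field
% and reads the wanted ones as text (%s).
fmt = repmat({'%*s'}, 1, numel(header));
fmt(keep) = {'%s'};
cols = textscan(fid, strjoin(fmt, ' ')); % reading continues after the header line
fclose(fid);
delete(fname);                           % remove the demo file again

ids  = cols{1};                          % first kept column, still text
vals = str2double([cols{2:end}]);        % remaining kept columns as numbers
```

Skipped fields produce no output, so cols has one cell per kept column rather than one per column in the file.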

The code:
Alright, so here is how I did it. If you want to try this yourself, copy the following sample data (starting from the line that begins with YKZ) into a text editor, save it as a txt file (in my example code, the sample data is saved in a file named text_file.txt) and see how it goes.

=============== [Sample data (space delimited)] ===============

YKZ Timestamp Temp Humidity Wind Weather
06-Sep-2013 01:00:00 6.6 89 4 clear
06-Sep-2013 05:00:00 5.9 95 1 clear
06-Sep-2013 09:00:00 15.6 51 5 mainly_clear
06-Sep-2013 13:00:00 19.6 37 10 mainly_clear
06-Sep-2013 17:00:00 22.4 41 9 mostly_cloudy
06-Sep-2013 21:00:00 17.3 67 7 mainly_clear
09-Sep-2013 01:00:00 15.2 91 8 clear
09-Sep-2013 05:00:00 19.1 94 7 n/a
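
If you would rather not copy and paste, here is a small sketch that writes the same sample data to text_file.txt in the current folder. The variable name rows is just for illustration.

```matlab
% Sample table from this post, one string per line of the file.
rows = {
    'YKZ Timestamp Temp Humidity Wind Weather'
    '06-Sep-2013 01:00:00 6.6 89 4 clear'
    '06-Sep-2013 05:00:00 5.9 95 1 clear'
    '06-Sep-2013 09:00:00 15.6 51 5 mainly_clear'
    '06-Sep-2013 13:00:00 19.6 37 10 mainly_clear'
    '06-Sep-2013 17:00:00 22.4 41 9 mostly_cloudy'
    '06-Sep-2013 21:00:00 17.3 67 7 mainly_clear'
    '09-Sep-2013 01:00:00 15.2 91 8 clear'
    '09-Sep-2013 05:00:00 19.1 94 7 n/a'
    };

fid = fopen('text_file.txt', 'w');   % open for writing (overwrites an existing file)
fprintf(fid, '%s\n', rows{:});       % the format is reused for every row
fclose(fid);
```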

==== [code (copy the code below to your matlab editor and run it directly) ] ======

% (1) import the header, figure out which columns need to be imported
clear all;
selected_col = {'YKZ','Temp','Wind'}; % specify the columns to extract, by column name
fid = fopen('text_file.txt'); % create a file identifier. This does not load the content of the file
tline = fgetl(fid); % fgetl returns the next line of the file identified by fid, as a character string
% Now parse the returned string with strsplit to get the header row of the table.
% The first argument of strsplit is the string to be parsed, the second is the delimiter.
% The delimiter is optional; if it is not specified, strsplit splits on whitespace.
header = strsplit(tline,' ');
idx = ismember(header,selected_col); % logical array flagging the columns to be extracted

% (2) import the content of the table line by line, only store the columns needed.
tdata = {}; % initialize the variable that stores the data
while ~feof(fid) % repeat until end-of-file is reached
    tline = fgetl(fid); % each call to fgetl reads the next line of the file
    if ~ischar(tline) || isempty(tline), continue; end % skip blank lines (e.g. a trailing empty line)
    delim_line = strsplit(tline,' '); % parse the current line
    tdata(end+1,:) = delim_line(idx); % keep only the columns flagged by the logical array idx
end
fclose(fid); % close the original file

% (3) clean up
% Now the extracted data are stored in the cell array tdata (an 8×3 cell array).
% If you type tdata{1,2} in your command window, you should get:
% ans =
% 6.6
% It is really important to recognize that 6.6 is not a number here.
% Now type class(tdata{1,2}) in your command window, and you get:
% ans =
% char
% 6.6 here is of type char, and so are all the other "numbers" in tdata. It is quite easy to convert them to numbers with just one line of code.
num_data=cell2mat(cellfun(@(s) str2double(s), tdata(:,2:end), 'UniformOutput',false));
% cellfun applies str2double to every cell in tdata(:,2:end), converting each one to type double, and cell2mat converts the resulting cell array to a matrix.

% (4) ta da, here is the data
col_header = header(idx); % column header
row_header = tdata(:,1); % row header
data = num_data; % here is the table content
clearvars -except col_header row_header data % clear all other intermediate variables
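
A side note on the conversion step: str2double also accepts a cell array of strings directly and returns NaN for anything it cannot parse (such as the n/a entry in the Weather column of the sample data). A minimal sketch, using a made-up cell array raw rather than the real tdata:

```matlab
% Hypothetical stand-in for tdata(:,2:end), with one non-numeric entry.
raw = {'6.6', '4'; 'n/a', '7'};

num = str2double(raw);   % converts every cell; the result has the same size as raw
bad = isnan(num);        % true where the text could not be parsed as a number
```

So str2double(tdata(:,2:end)) would be a shorter alternative to the cellfun/cell2mat line, and isnan gives a quick way to spot entries that were not numeric.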

I hope you find this post helpful and feel free to leave a comment :)