MATLAB: How to pull only a portion of a CSV into memory


I have a very large CSV file that is continually updating to add new rows of data.
How can I pull only the last 300 lines of the CSV into memory so that they can be plotted?

Best Answer

  • There are a couple of ways to achieve this goal.
    The first approach is to use the "SelectedVariableNames" and "DataLines" import options with "readtable" to pull in only the desired columns and rows. This approach runs relatively quickly, but it has two limitations. You need to know how many rows are in the CSV file in order to set "DataLines" correctly. Additionally, the "DataLines" option for "readtable" was added in MATLAB R2018a, so you need to be running that release or a later one.
    Altogether, this approach would look something like this:
    opts = detectImportOptions('airlinesmall.csv');
    opts.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
    opts.DataLines = [123225 inf];
    T = readtable('airlinesmall.csv', opts);
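    If the row count is not known in advance, one possible way to get it is to count newline characters in fixed-size chunks, so the file is never fully loaded. The following is only a sketch under assumptions: the throwaway file, the variable names (fname, nKeep, nLines) and the 1 MB chunk size are hypothetical choices for illustration, and the file is assumed to use LF or CRLF line endings.

```matlab
% Build a small throwaway CSV so the sketch is self-contained (hypothetical data).
fname = [tempname '.csv'];
writetable(table((1:1000)', rand(1000, 1), 'VariableNames', {'Id', 'Value'}), fname);
nKeep = 300;                       % number of trailing rows to keep

% Count lines by counting newline bytes (10) in 1 MB chunks.
fid = fopen(fname, 'r');
assert(fid ~= -1, 'Could not open %s', fname);
nLines = 0;
lastByte = uint8(10);
while true
    chunk = fread(fid, 2^20, '*uint8');
    if isempty(chunk), break; end
    nLines = nLines + sum(chunk == 10);
    lastByte = chunk(end);
end
fclose(fid);
if lastByte ~= 10
    nLines = nLines + 1;           % last line has no trailing newline
end

% Start reading nKeep lines before the end, but never before the first data line.
opts = detectImportOptions(fname);
firstLine = max(opts.DataLines(1), nLines - nKeep + 1);
opts.DataLines = [firstLine Inf];
T = readtable(fname, opts);

delete(fname);                     % remove the throwaway file
```

    Because the end of "DataLines" is left as Inf, any rows appended between counting and reading are also returned, so T may hold slightly more than nKeep rows for a file that is still being written.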
    The second approach uses datastores and tall arrays to pull into memory only the parts of the CSV that you want to work with. With this approach, you do not need to know how many rows are in the CSV before running the script, but creating and evaluating the tall array takes somewhat longer. It would look something like this:
    ttds = tabularTextDatastore('airlinesmall.csv');
    ttds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
    ttds.TreatAsMissing = 'NA';
    tt = tall(ttds);
    val = 300;
    TT = gather(tail(tt, val));
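    Since the goal in the question is plotting, the steps above could be combined with a plot call roughly as follows. This is a sketch, not a definitive recipe: the "mapreducer(0)" line is an addition that is not part of the steps above (it asks MATLAB to evaluate tall arrays in the local session), and the choice of a scatter-style plot of the two delay columns is only an assumption about what is to be plotted.

```matlab
% Assumption: evaluate tall arrays in the local MATLAB session.
mapreducer(0);

% Datastore over the CSV, restricted to the two columns of interest.
ttds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ttds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};

% Only the last nKeep rows are gathered into an in-memory table.
nKeep = 300;
TT = gather(tail(tall(ttds), nKeep));

% Plot arrival delay against departure delay for those rows.
plot(TT.DepDelay, TT.ArrDelay, '.');
xlabel('DepDelay');
ylabel('ArrDelay');
```

    TT is an ordinary table after "gather", so any regular plotting function can be used on its columns.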
    Note: If you have Parallel Computing Toolbox, the "tall" function starts a parallel pool (parpool) by default. You can turn this behavior off in the Parallel Computing Toolbox Preferences panel by deselecting:
    "Automatically create a parallel pool (if one doesn't already exist) when parallel keywords (e.g. parfor) are executed."