MATLAB: How to make a MAT-file that can be used to create a Datastore for MapReduce

datastore, mapreduce, mat, parallel computing

This page explains how to Read and Analyze Data in KeyValueDatastore for MAT-File. However, it only "shows how to create a datastore for key-value pair data in a MAT-file that is the output of mapreduce." My question is: how do you create a MAT-file from which a datastore can be built?
I found the following reply by Rick Amos in another thread useful: Currently, the one very specific form of MAT-file that can be read by datastore is the output of another mapreduce call. An unofficial shortcut that creates such a MAT-file is the following code:
data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
This is nice to know, and it works well with one key-value pair. In the general case, how do you save multiple key-value pairs for datastore (so that readall(ds) produces multiple rows)? I have tried two alternatives without success: saving two same-sized cell arrays for keys and values, and saving one struct array of key-value pairs. Thank you!

Best Answer

  • As of R2014b there is no direct way to create a MAT-file datastore. However, there are several indirect ways to produce a MAT-file that datastore can read in R2014b.
    The first method is to use the output of a mapreduce operation. Create an input file 'input.txt' with the following contents:
    Filename
    myMatFile.mat
    mySecondMatFile.mat
    Then create a 'myMapper.m' with the following contents:
    function myMapper(data, ~, intermediateOutput)
    filenames = data.Filename;
    addmulti(intermediateOutput, filenames, filenames);
    end
    And a 'myReducer.m' with the following contents:
    function myReducer(filename, ~, finalOutput)
    % Adapt this to the contents of your input MAT-files.
    % It only converts a struct array into a cell array of structs for addmulti.
    data = load(filename);
    values = num2cell(data.myStructArrayVariable);
    keys = repmat({'SomeKey'}, size(values));
    addmulti(finalOutput, keys, values);
    end
    With all of this in place, run:
    ds = datastore('input.txt');
    mapFunction = @myMapper;
    reduceFunction = @myReducer;
    outputFolder = '/my/output/folder';
    resultDS = mapreduce(ds, mapFunction, reduceFunction, 'OutputFolder', outputFolder)
    This creates a collection of MAT-files in the given output folder that contain the original data and can be used with datastore.
    The second method is an unofficial shortcut to the same result:
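    The first method assumes that 'input.txt' and the listed MAT-files already exist. As a minimal sketch of that preparation, the code below writes hypothetical sample data (the Foo and Bar fields are made up for illustration) under the variable name myStructArrayVariable that myReducer loads, then writes the file list. It assumes everything lives in the current folder.

```matlab
% Hypothetical sample data: a 1x2 struct array stored under the variable
% name that myReducer reads (data.myStructArrayVariable).
myStructArrayVariable = struct('Foo', {1, 2}, 'Bar', {3, 4});
save('myMatFile.mat', 'myStructArrayVariable');
save('mySecondMatFile.mat', 'myStructArrayVariable');

% Write input.txt: a header line followed by one MAT-file name per line.
% fprintf reuses the '%s\n' format for each extra argument.
fid = fopen('input.txt', 'w');
fprintf(fid, 'Filename\n');
fprintf(fid, '%s\n', 'myMatFile.mat', 'mySecondMatFile.mat');
fclose(fid);
```

    Because myReducer calls load with the bare file name, the MAT-files need to be in the current folder or on the MATLAB path when mapreduce runs.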
    % Suppose keys and values are two arrays with the same number of elements, such as:
    keys = {'TestKey1'; 'TestKey2'};
    values = struct('Foo', {1,2}, 'Bar', {3,4});
    % Then this will store the data in such a way that it can likely be read by datastore:
    if ~iscell(keys)
        keys = num2cell(keys);
    end
    if ~iscell(values)
        values = num2cell(values);
    end
    data.Key = keys(:);
    data.Value = values(:);
    save('myMatFile.mat', '-struct', 'data');
    ds = datastore('myMatFile.mat');
    readall(ds)
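    Since this layout is unofficial, it may help to inspect the saved file before handing it to datastore. The sketch below rewrites the same two pairs and checks only what the shortcut above relies on: variables named Key and Value, each a column cell array, with matching lengths. The names kv, hasBoth, isColumnCells, sameLength and layoutOk are illustrative, and passing this check does not guarantee that datastore accepts the file in every release.

```matlab
% Write two key-value pairs using the layout from the shortcut above.
keys = {'TestKey1'; 'TestKey2'};
values = num2cell(struct('Foo', {1, 2}, 'Bar', {3, 4}));
kv = struct();
kv.Key = keys(:);      % force column shape
kv.Value = values(:);  % force column shape
save('myMatFile.mat', '-struct', 'kv');

% Inspect the file without going through datastore.
info = whos('-file', 'myMatFile.mat');  % lists stored variables
storedNames = {info.name};
loaded = load('myMatFile.mat');

hasBoth = all(ismember({'Key', 'Value'}, storedNames));
isColumnCells = iscell(loaded.Key) && iscolumn(loaded.Key) && ...
    iscell(loaded.Value) && iscolumn(loaded.Value);
sameLength = numel(loaded.Key) == numel(loaded.Value);
layoutOk = hasBoth && isColumnCells && sameLength;
fprintf('Layout looks as expected: %d\n', layoutOk);
```

    If the layout check passes but datastore still rejects the file, falling back to the first method (a real mapreduce run) is the safer route.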