MATLAB: Fastest Possible Way to convert a table containing Only 2 strings to numbers

Tags: convert strings to a number, MATLAB, Parallel Computing Toolbox

Hello all,
I have an NxM cell array of strings. Most of the cells are empty; of those that are not, each contains one of only two values, 'Het' or 'Hom', for heterozygous vs. homozygous.
I want to:
1. Create an NxM matrix.
2. Put a 1 into the matrix at position (i,j) for every instance of the string 'Het' at position (i,j) in the array.
3. Put a 2 into the matrix at position (i,j) for every instance of the string 'Hom' at position (i,j) in the array.
(The ones and twos should be numbers, not strings.)
EXAMPLE:
Array = 'Het' ''    ''    'Het' ''    'Hom'
        ''    'Het' 'Hom' ''    ''    ''
would become
Matrix = [1 0 0 1 0 2; 0 1 2 0 0 0] (could be NaN instead of 0, that doesn't matter to me)
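(For reference, the mapping described above can be written directly with STRCMP — this is just a sketch of the idea, using the example array:)

```matlab
% Sketch: STRCMP against a cell array of strings returns a logical
% matrix of the same size, so the requested 1/2 coding falls out
% directly (empty cells match neither string and stay 0).
Array = {'Het', '', '', 'Het', '', 'Hom'; ...
         '', 'Het', 'Hom', '', '', ''};
Matrix = strcmp(Array, 'Het') + 2*strcmp(Array, 'Hom');
% Matrix is a double matrix: [1 0 0 1 0 2; 0 1 2 0 0 0]
```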
Now, I can think of a bunch of workarounds for this.
I could call strfind a ton of times, or cast to uint8 and then divide that output by some set number, etc.
But all the workarounds I can think of are slow.
What is the fastest way to make this conversion on a very large array?
I do have the Parallel Computing Toolbox in principle, but I have never used it, so I would need clear instructions…
Thank you very much in advance!

Best Answer

  • You can do this with UNIQUE or ISMEMBER.
    values = {'Het', '', '', 'Het', '', 'Hom'; '', 'Het', 'Hom', '', '', ''}
    % use third output of UNIQUE directly:
    [u, ~, idx] = unique(values);
    % idx is the wrong shape, so reshape it:
    out = reshape(idx, size(values))
    % Or, using the second output of ISMEMBER:
    [~, idx] = ismember(values, u)
    You could next parallelize this by using PARFOR over the rows of values. For example, let's make a larger 'values' by replicating it:
    values = repmat(values, 100, 10)
    out = zeros(size(values)); % preallocate so 'out' is a sliced output variable
    parfor rowIdx = 1:size(values, 1)
        [~, out(rowIdx, :)] = ismember(values(rowIdx, :), u);
    end
    It's not clear to me whether applying PARFOR like this would make things faster though - in general, PARFOR works well when you are doing lots of work per amount of data transferred, and I'm not convinced that's the case here.
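One caveat worth noting with the UNIQUE/ISMEMBER approach: UNIQUE returns its values sorted, so with the empty string present, u comes back as {'', 'Het', 'Hom'} and idx codes '' as 1, 'Het' as 2, and 'Hom' as 3. If you want exactly the 0/1/2 coding asked for in the question, subtract 1:

```matlab
% u sorts as {'', 'Het', 'Hom'}, so idx runs 1/2/3; shift to 0/1/2:
out = reshape(idx, size(values)) - 1;
```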