MATLAB: Access to data without reading all of it

big data

Hi all,
I have two data sets. One is 10 GB and the other is 2 TB. Both are text files.
Say the small data set has the variables timestamp, ID and x, and the big one has timestamp, ID and y. The big data set has many more unique IDs and timestamps.
For each observation in the small data set, I want to find the row in the big data set with the same millisecond timestamp and ID, and then copy its value of y into the small data set.
Is it possible to find corresponding rows without reading 2TB of data?

Best Answer

  • "I am slightly confused about the comments about search algorithm. [...] Is find function in Matlab not very efficient?"
    I'm convinced that the MATLAB function find() is efficient given its requirements. In the demo below, however, the comparison A==key has to test every element of the vector before find() can return anything. For the special case of a large sorted vector, there are faster search algorithms. I made a little demo, which shows that simple m-code implementations of interpolation_search and binary_search are more than an order of magnitude faster than find() when searching a large sorted vector.
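
    The m-code for binary_search and interpolation_search is not shown in this post. As a rough idea of what such a function could look like, here is a minimal sketch of a binary search. The name binary_search_sketch is hypothetical, and treating the second output as an iteration count is an assumption, not something the demo confirms.

    ```matlab
    function [ix, n] = binary_search_sketch(A, key)
    % BINARY_SEARCH_SKETCH  Find one index of KEY in the ascending-sorted vector A.
    % Hypothetical helper for illustration, not the binary_search used in the demo.
    %   ix  an index with A(ix) == key, or [] if key is not in A
    %   n   number of loop iterations (assumed meaning of the second output)
        lo = 1;
        hi = numel(A);
        ix = [];
        n  = 0;
        while lo <= hi
            n   = n + 1;
            mid = lo + floor((hi - lo) / 2);   % midpoint of the remaining range
            if A(mid) == key
                ix = mid;                      % any matching position is accepted
                return
            elseif A(mid) < key
                lo = mid + 1;                  % key can only be in the upper part
            else
                hi = mid - 1;                  % key can only be in the lower part
            end
        end
    end
    ```

    For example, [ix, n] = binary_search_sketch([1 3 3 7 9], 7) returns ix = 4 after n = 2 iterations. Each iteration halves the remaining range, so a vector of 1e9 elements needs at most about 30 iterations, whereas A==key compares all 1e9 elements. If key occurs several times, this sketch returns whichever match it hits first, not necessarily the first or last one.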

    A = sort( floor( cumsum( 1+9*rand(1e9,1) ) ) ); % sample data: 1e9 sorted doubles, about 8 GB of RAM
    key = A( 2E8 );                                  % search for a value known to be in A
    tic; [ixIS,nIS] = interpolation_search( A, key ); toc
    tic; [ixBS,nBS] = binary_search( A, key ); toc
    tic; ixFL = find( A==key, 1, 'last' ); toc
    tic; ixFF = find( A==key, 1, 'first' ); toc
    [ key ; A([ixIS,ixBS,ixFL,ixFF]) ]               % check: each index found should point to key
    which outputs the following in the command window:
    Elapsed time is 0.002190 seconds.
    Elapsed time is 0.001413 seconds.
    Elapsed time is 0.916372 seconds.
    Elapsed time is 0.608332 seconds.
    ans =
    1.0e+09 *