MATLAB version: R2015a.

Say I have a function f(x). Run on its own, it takes about 0.001 s on the GPU. But inside my larger program (which may involve many other GPU operations), it becomes 100x slower and takes 0.1 s. All timings were measured with "tic" and "toc". The difference is noticeable because f(x) is called extremely often, so it slows down the entire program significantly.

Everything runs sequentially, with no multi-threading. In fact, I can put a breakpoint at the problematic lines of my program and run f(x) in debug mode. It is reliably 100x slower than in the stand-alone case. The input x is the same in both cases. The only way to restore the speed is to reset the GPU, which loses all GPU variables. After resetting the GPU in debug mode, f(x) returns to its normal speed (0.001 s), except for the first call.
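A minimal sketch of resetting the device without losing data, assuming the variables fit in host memory: gather each gpuArray to the CPU first, reset, then upload again. The variable `x` below is a placeholder, not my real input.

```matlab
% Placeholder input; the real x comes from the larger program.
x = gpuArray(single(randn(96, 56)));

xHost = gather(x);        % copy the data to host memory before the reset
reset(gpuDevice);         % reset the current device; existing gpuArrays become invalid
x = gpuArray(xHost);      % upload the saved data again
```

This costs two transfers per variable, so it is only a debugging workaround, not a fix for the slow-down itself.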

Yes, the first call to f(x) after resetting the GPU is still much slower than usual. This seems to be a common problem for most of the GPU operations. Does anybody know why?

One example:

    >> A = gpuArray(single(randn(9600, 5600)));
    >> B = gpuArray(single(randn(5600, 7600)));
    >> A2 = gpuArray(single(randn(96, 56)));
    >> B2 = gpuArray(single(randn(56, 76)));
    >> tic; C2 = A2*B2; toc;
    Elapsed time is 0.113910 seconds.
    >> tic; C2 = A2*B2; toc;
    Elapsed time is 0.000492 seconds.
    >> tic; C = A*B; toc;
    Elapsed time is 0.001034 seconds.
    >> tic; C = A*B; toc;
    Elapsed time is 0.000886 seconds.

This example may not be directly related to my problem, but it suggests that MATLAB may be doing something inefficient during GPU calls.
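One caveat about these numbers: gpuArray operations can return to MATLAB before the GPU has finished, so plain tic/toc may measure little more than the launch. A sketch of the same small multiply timed with explicit synchronization, using `wait(gpuDevice)` and `gputimeit`; whether this accounts for any of the slow-down in my program is an assumption, not something I have confirmed.

```matlab
% Sketch: time a gpuArray multiply with explicit synchronization.
gpu = gpuDevice;                          % handle to the currently selected GPU
A2 = gpuArray(single(randn(96, 56)));
B2 = gpuArray(single(randn(56, 76)));

wait(gpu);                                % drain any previously queued GPU work
tic;
C2 = A2 * B2;
wait(gpu);                                % block until the multiply has finished
tSync = toc;

% gputimeit calls the function several times and synchronizes internally.
tTimeit = gputimeit(@() A2 * B2);

fprintf('tic/toc with wait: %.6f s, gputimeit: %.6f s\n', tSync, tTimeit);
```

Without the second `wait`, the time of an unfinished operation can end up charged to whichever later call happens to block on the GPU.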

Any comments about how I can solve my problem? Matlab should probably have more transparent GPU calls. GPU algorithms take a long time to write. It is so frustrating to see it ruined by having some 0.1s costs just by calling it.

Also, the slow-down is proportional to the size of x. If x is small, the slow-down is smaller. I suspect that some redundant memory copy operations might be the problem.

By the way, f(x) is implemented as a MEX function in a *.cu file. GPU memory is sufficient; about 75% is still free.

Thank you so much.

Update: It can be partially reproduced by a standalone for-loop:

    for i = 1:100
        tic; f(x); toc;
    end
    Elapsed time is 0.000990 seconds.
    Elapsed time is 0.000514 seconds.
    Elapsed time is 0.000520 seconds.
    Elapsed time is 0.000512 seconds.
    Elapsed time is 0.000519 seconds.
    Elapsed time is 0.000514 seconds.
    Elapsed time is 0.000553 seconds.
    Elapsed time is 0.000516 seconds.
    Elapsed time is 0.000584 seconds.
    Elapsed time is 0.000536 seconds.
    Elapsed time is 0.000543 seconds.
    Elapsed time is 0.000547 seconds.
    Elapsed time is 0.000583 seconds.
    Elapsed time is 0.000386 seconds.
    Elapsed time is 0.000289 seconds.
    Elapsed time is 0.000304 seconds.
    Elapsed time is 0.000305 seconds.
    Elapsed time is 0.000310 seconds.
    Elapsed time is 0.000310 seconds.
    Elapsed time is 0.000316 seconds.
    Elapsed time is 0.000317 seconds.
    Elapsed time is 0.000324 seconds.
    Elapsed time is 0.000309 seconds.
    Elapsed time is 0.000362 seconds.
    Elapsed time is 0.000340 seconds.
    Elapsed time is 0.000335 seconds.
    Elapsed time is 0.035467 seconds.
    Elapsed time is 0.064322 seconds.
    Elapsed time is 0.069029 seconds.
    Elapsed time is 0.064246 seconds.
    Elapsed time is 0.057586 seconds.
    Elapsed time is 0.061643 seconds.
    Elapsed time is 0.056715 seconds.
    Elapsed time is 0.054350 seconds.
    Elapsed time is 0.052523 seconds.
    Elapsed time is 0.053982 seconds.
    Elapsed time is 0.049440 seconds.
    Elapsed time is 0.049568 seconds.
    Elapsed time is 0.049307 seconds.
    Elapsed time is 0.046428 seconds.
    Elapsed time is 0.044839 seconds.
    Elapsed time is 0.048271 seconds.
    Elapsed time is 0.045572 seconds.
    Elapsed time is 0.043718 seconds.
    Elapsed time is 0.046787 seconds.
    Elapsed time is 0.045258 seconds.
    Elapsed time is 0.043749 seconds.

After doing it repeatedly for a while, it slows down to ~0.05s, a 100x slow-down.
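A sketch of a variant of this loop that synchronizes around every call and stores the timings, so the iteration where the jump happens can be located. The anonymous function `f` is only a stand-in for my MEX function, and `x` is a placeholder input.

```matlab
% Stand-in for the real MEX function f; replace with the actual call.
f = @(x) x * x';
x = gpuArray(single(randn(96, 56)));

gpu = gpuDevice;
n = 100;
t = zeros(n, 1);
for i = 1:n
    wait(gpu);            % make sure earlier GPU work is not counted
    tic;
    y = f(x);
    wait(gpu);            % include the full execution of f in the timing
    t(i) = toc;
end

% Index of the first iteration more than 10x slower than the median (may be empty).
slowIdx = find(t > 10 * median(t), 1);
fprintf('median %.6f s, max %.6f s\n', median(t), max(t));
```

If the per-call times stay flat with synchronization in place, the jump in the unsynchronized loop may simply mark the point where queued GPU work starts to block the caller. That is a guess on my part.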

## Best Answer