Sql-server – Round robin vs Demand Parallelism partitioning

parallelismpartitioningsql server

This is related to having a common key driver on top of the query with a nested loops, and the parallelism for the rows coming out of driver being either demand type or round robin type. I would have assumed demand partitioning would perform better but I get opposite results.

I started the queries within the same try at the same time in SQL Server. When the queries running I monitored dm_exec_query_profiles dmv constantly. I noticed the Round robin versions start a lot faster, they are inserting a lot more rows a lot faster into the Table Insert portion in the dmv also they pick up a lot more rows quicker from the driver side parallelism portion. Thinking logically demand partitioning should be more advantageous since our SQL server is usually over 50-60% cpu, litespeed backups running, has 64 cores etc. I was able to balance rows processed on threads a lot better with round robin partitioning but also the data within partitions are so out of balance I noticed some threads in demand partitioning only processes 1 record from the driver whereas the average records from the driver is around 196. With the demand partitioning I order the rows within partitions descending vs in round robin I try to balance the rows better.

Should I always use round robin instead, why does round robin starts processing rows a lot faster than demand partitioning, can I do more optimization for demand partitioning?

The query plans are in the One Drive link, I couldn't find another way of posting them here (pastetheplan only accepts xml, which does not capture the extra information like wait stats & duration captured by Plan Explorer).


CompareRoundRobinToDemand_DM_2_5114.pesession
CompareRoundRobinToDemand_RRB_2_3046.pesession

started at the same time, Round Robin was considerably faster


CompareRoundRobinToDemand_DM_3_5228.pesession
CompareRoundRobinToDemand_RRB_3_4367.pesession

started at the same time, Round Robin was faster again


CompareRoundRobinToDemand_DM_4_4813.pesession
CompareRoundRobinToDemand_RRB_4_3577.pesession

started at the same time, Round Robin was faster again

Thank you in advance.

Monitoring Running Sessions
Query Plans

SQL Server Version and running Traces

I am able to balance round robin rows almost perfect whereas for demand I just order the driver in rows descending order, balancing the rows in demand is an option in my procedure but did not produce better results. In my observation demand performs better when overall CPU usage is less and Round robin performs better when server is busier. I noticed demand creates one extra thread compared to round robin and also overall cpu utilization in demand is slightly higher than similar RRB version.

i am aware of imbalance of row distribution in demand. the memory grant is also on purpose, it's not using that much memory at the end. the first 16 records passed from driver is the same in demand vs round robin, but round robin starts processing them a lot faster for some reason. I want to understand why. I also want to understand when it's more beneficial to use demand or round robin. It seems like when server is idle demand works faster and when server has existing load round robin is faster, thats an observation so far, does that have a basis I dont know.

Best Answer

Should I always use round robin instead

Only if you find it produces "better" results for your particular workload, or for particular queries within that workload.

In general, I would say no. My personal experience is that Demand (D) partitioning tends to have fewer problems than Round Robin (RR). The particular plan you shared does not seem to have any of the more common features that can cause performance problems with RR, but it is tough to assess at a distance.

why does round robin starts processing rows a lot faster than demand partitioning

If I were to hazard a guess, I would say the D partitioning threads are checking and yielding the processor more frequently. A thread that uses less of its quantum on average (by checking and yielding more often) is a good scheduling citizen, but it may be punished for that good behaviour (before SQL Server 2016). The screenshot you shared does seem to show the D plan getting less CPU time than the concurrent RR execution.

D pulls single rows across the exchange whereas RR pushes whole packets - and yield checks occur at exchanges (among many other places). A very detailed analysis would be required to confirm this (or any other) theory.

can I do more optimization for demand partitioning?

The plan seems reasonable enough from what I can see.

It is interesting that the screenshot shows a merge join whereas the uploaded plans all use hash join. Trying a co-located merge join was one of the first suggestions that came to mind - as explained in my article Improving Partitioned Table Join Performance.

You could also check on thread distribution across schedulers and perhaps experiment with different thread placement strategies. I would certainly be checking that the D plan was allocating threads as optimally as the RR version (e.g. no concurrent same-branch threads sharing the same scheduler).

Perhaps also look at presenting the demand rows in a different order. What you have now may be based on a sensible expectation, but that's not to say it is necessarily optimal.

In any case, I would be very wary of drawing general conclusions about D vs RR based on this one example.


I noticed demand creates one extra thread compared to round robin

This is an illusion due to a timing difference. In the RR plan, the Index Scan completes very quickly, with its rows stuffed into packets on the exchange. The thread running the scan thus terminates quickly.

In the D plan, individual rows are pulled on demand across the exchange and fully processed before a new row is pulled across. So, the Constant Scan thread hangs around for longer.

Both plans have two concurrent branches, with DOP (16) threads reserved for each. Both use 17 threads as reported in the ThreadStat showplan element (16 for the main branch, 1 for the serial index scan/constant scan).