Sql-server – SQL update satement taking a very long time / high disk usage for hours

sql serversql-server-2008-r2ssms

Yes it sounds like a very generic issue, but I have not been able to narrow it down very much yet.

So I have an UPDATE statement in a sql batch file:

UPDATE A
SET A.X = B.X
FROM A JOIN B ON A.B_ID = B.ID

B has 40k records, A has 4M records and they are related 1-to-n via A.B_ID, although there is no FK between the two.

So basically I am pre-calculating a field for data mining purposes.
Although I changed the name of the tables for this question, I didn't change the statement, it's really that simple.

This takes hours to run, so I decided to cancel everything. The DB got corrupted, so I deleted it, restored a backup I did just before running the statement and decided to go more into details with a cursor:

DECLARE CursorB CURSOR FOR SELECT ID FROM B ORDER BY ID DESC -- Descending order
OPEN CursorB 
DECLARE @Id INT
FETCH NEXT FROM CursorB INTO @Id

WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @Msg VARCHAR(50) = 'Updating A for B_ID=' + CONVERT(VARCHAR(10), @Id)
    RAISERROR(@Msg, 10, 1) WITH NOWAIT

    UPDATE A
    SET A.X = B.X
    FROM A JOIN B ON A.B_ID = B.ID
    WHERE B.ID = @Id

    FETCH NEXT FROM CursorB INTO @Id
END

Now I can see it running with a message with the id descending.
What happens is it takes around 5min to go from id=40k to id = 13

And then at id 13, for some reason, it seems to hang. The DB does not have any connection to it besides SSMS, but it's not actually hanged:

  • the hard drive is running continuously so it's definitely doing something (I checked in Process Explorer that it's indeed the sqlserver.exe process using it)
  • I ran sp_who2, found the SPID (70) of the SUSPENDED session then ran the following script:

    select *
    from sys.dm_exec_requests r
    join sys.dm_os_tasks t on r.session_id = t.session_id
    where r.session_id = 70

This gives me the wait_type, which is PAGEIOLATCH_SH most of the time but actually changes to WRITE_COMPLETION sometimes, which I guess happens when it's flushing the log

  • the log file, which was 1.6GB when I restored the DB (and when it got to id 13), is now 3.5GB

Other maybe useful information:

  • the number of records in table A for B_ID 13 is not big (14)
  • My colleague does not have the same issue on her machine, with a copy of this DB (from a couple of months ago) with the same structure.
  • table A is by far the biggest table in the DB
  • It has several indexes, and several indexed views use it.
  • There is no other user on the DB, it's local and no application is using it.
  • The LDF file is not limited in size.
  • Recovery model is SIMPLE, compatibility level is 100
  • Procmon does not give me much information: sqlserver.exe is reading and writing a lot from the MDF and the LDF files.

I'm still waiting for it to finish (it's been 1h30) but I was hoping that maybe someone would give me some other action I could try to troubleshoot this.

Edited: adding extract from procmon log

15:24:02.0506105    sqlservr.exe    1760    ReadFile    C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\TA.mdf  SUCCESS Offset: 5,498,732,544, Length: 8,192, I/O Flags: Non-cached, Priority: Normal
15:24:02.0874427    sqlservr.exe    1760    WriteFile   C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\TA.mdf  SUCCESS Offset: 6,225,805,312, Length: 16,384, I/O Flags: Non-cached, Write Through, Priority: Normal
15:24:02.0884897    sqlservr.exe    1760    WriteFile   C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\TA_1.LDF    SUCCESS Offset: 4,589,289,472, Length: 8,388,608, I/O Flags: Non-cached, Write Through, Priority: Normal

From using DBCC PAGE it seems to be reading from and writing to fields that look like table A's (or one of its indexes), but for different B_ID that 13. Rebuilding indexes maybe?

Edited 2: execution plan

So I cancelled the query (actually deleted the DB and its files then restored it), and checked the execution plan for:

UPDATE A
SET A.X = B.X
FROM A JOIN B ON A.B_ID = B.ID
WHERE B.ID = 13

The (estimated) execution plan is the same than for any B.ID, and looks fairly straightforward. The WHERE clause uses an index seek on a non-clustered index of B, the JOIN uses a clustered index seek on both PKs of the tables. The clustered index seek on A uses parallelism (x7) and represents 90% of the CPU time.

More importantly, actually executing the query with ID 13 is immediate.

Edited 3: index fragmentation

The structure of indexes is as follows:

B has one clustered PK (not the ID field), and one non-clustered unique index, which first field is B.ID – this second index seems to be used always.

A has one clustered PK (field not related).

There are also 7 views on A (all include the A.X field), each one with its own clustered PK, and an other index which also includes the A.X field

The views are filtered (with fields that are not in this equation), so I doubt there is any way the UPDATE A would use the views themselves.
But they do have an index including A.X, so changing A.X means writing the 7 views and the 7 indexes they have that include the field.

Although the UPDATE is expected to be slower for this, there is no reason why a specific ID would be so much longer than the others.

I checked the fragmentation for all the indexes, all were at <0.1%, except the secondary indexes of the views, all between 25% and 50%. Fill factors for all indexes seem ok, between 90% and 95%.

I reorganized all the secondary indexes, and reran my script.

It is still hanged, but at a different point:

...
(0 row(s) affected)

        Updating A for B_ID=14

(4 row(s) affected)

Whereas previously, the message log looked like this:

...
(0 row(s) affected)

        Updating A for B_ID=14

(4 row(s) affected)

        Updating A for B_ID=13

This is weird, because it means it's not even hanged at the same point in the WHILE loop.
The rest looks the same: same UPDATE line waiting in sp_who2, same PAGEIOLATCH_EX wait type and same heavy HD usage from sqlserver.exe.

Next step is to delete all indexes and views and recreate them I think.

Edited 4: deleting then rebuilding indexes

So, I deleted all the indexed views I had on the table (7 of them, 2 indexes per view including the clustered one).
I ran the initial script (without cursor), and it actually ran in 5 minutes.

So my issue originates from the existence of these indexes.

I recreated my indexes after running the update, and it took 16 minutes.

Now I understand indexes take time to rebuild, and I am actually fine with the complete task taking 20 minutes.

What I still don't understand is, why when I run the update without deleting the indexes first, it takes several hours, but when I delete them first then recreate them, it takes 20 minutes. Shouldn't it take about the same time either way?

Best Answer

  1. Stick with UPDATE command. CURSOR will be slower for what you are trying to do.
  2. Drop/disable all indexes, including those for indexed views. If you have foreign key on A.X, drop it.
  3. Create index which will contain only A.B_ID, and another one for B.ID.
  4. Even though you are using Simple Recovery model, the last transaction will be always in transaction log before it is flushed to disk. That's why you need to pre-grow your transaction log and set it to grow for some larger amount (eg. 100 MB).
  5. Also, set data file growth to some larger amount.
  6. Make sure you have enough disk space for further growth of log and data files.
  7. When update finishes recreate/enable indexes which you dropped/disabled in step 2.
  8. If you don't need them anymore, drop indexes created in step 3.

Edit: Since I can't comment on your original post, I'll answer here your question from Edit 4. You have 7 indexes on A.X. Index is a B-tree, and each update to that field causes B-tree to rebalance. It's faster to rebuild those indexes from scratch than to rebalance it each time.