Yes, it sounds like a very generic issue, but I have not been able to narrow it down much yet.
I have an UPDATE statement in a SQL batch file:
UPDATE A SET A.X = B.X FROM A JOIN B ON A.B_ID = B.ID
B has 40k records, A has 4M records and they are related 1-to-n via A.B_ID, although there is no FK between the two.
So basically I am pre-calculating a field for data mining purposes.
Although I changed the name of the tables for this question, I didn't change the statement, it's really that simple.
This takes hours to run, so I decided to cancel everything. The DB got corrupted, so I deleted it, restored a backup taken just before running the statement, and decided to dig into the details with a cursor:
DECLARE CursorB CURSOR FOR SELECT ID FROM B ORDER BY ID DESC -- Descending order
OPEN CursorB
DECLARE @Id INT
FETCH NEXT FROM CursorB INTO @Id
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @Msg VARCHAR(50) = 'Updating A for B_ID=' + CONVERT(VARCHAR(10), @Id)
    RAISERROR(@Msg, 10, 1) WITH NOWAIT
    UPDATE A SET A.X = B.X FROM A JOIN B ON A.B_ID = B.ID WHERE B.ID = @Id
    FETCH NEXT FROM CursorB INTO @Id
END
Now I can see it running with a message with the id descending.
What happens is that it takes around 5 minutes to go from id=40k down to id=13.
And then, at id 13, for some reason, it seems to hang. The DB has no connection to it besides SSMS, but it's not actually hung:
- the hard drive is running continuously so it's definitely doing something (I checked in Process Explorer that it's indeed the sqlserver.exe process using it)
I ran sp_who2, found the SPID (70) of the SUSPENDED session, then ran the following script:
SELECT *
FROM sys.dm_exec_requests r
JOIN sys.dm_os_tasks t ON r.session_id = t.session_id
WHERE r.session_id = 70
This gives me the wait_type, which is PAGEIOLATCH_SH most of the time, but occasionally changes to WRITE_COMPLETION, which I guess happens when it's flushing the log.
- the log file, which was 1.6GB when I restored the DB (and when it got to id 13), is now 3.5GB
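To see which file those PAGEIOLATCH_SH / WRITE_COMPLETION waits are actually hitting, the per-file I/O stats DMV can be polled while the update runs. A sketch (the values are cumulative since instance startup, so two snapshots need to be diffed to see current activity):

```sql
-- Per-file I/O and stall statistics for the current database.
-- Cumulative since startup: take two snapshots a minute apart
-- and diff them to see what the UPDATE is doing right now.
SELECT DB_NAME(vfs.database_id)  AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id;
```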
Other maybe useful information:
- the number of records in table A for B_ID 13 is not big (14)
- My colleague does not have the same issue on her machine, with a copy of this DB (from a couple of months ago) with the same structure.
- table A is by far the biggest table in the DB
- It has several indexes, and several indexed views use it.
- There is no other user on the DB, it's local and no application is using it.
- The LDF file is not limited in size.
- Recovery model is SIMPLE, compatibility level is 100
- Procmon does not give me much information: sqlserver.exe is reading and writing a lot from the MDF and the LDF files.
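Since the LDF more than doubled despite SIMPLE recovery, it's worth remembering that a single multi-million-row UPDATE is one transaction, so the log cannot truncate at checkpoints until it commits. Two standard checks (run in the context of this database) that would confirm this:

```sql
-- Current log size and percentage in use, per database
DBCC SQLPERF(LOGSPACE);

-- What, if anything, is preventing log truncation;
-- ACTIVE_TRANSACTION here would just confirm the big UPDATE is the cause
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = DB_NAME();
```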
I'm still waiting for it to finish (it has been running for 1h30 now), but I was hoping someone could suggest some other action I could try to troubleshoot this.
Edited: adding extract from procmon log
15:24:02.0506105 sqlservr.exe 1760 ReadFile C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\TA.mdf SUCCESS Offset: 5,498,732,544, Length: 8,192, I/O Flags: Non-cached, Priority: Normal
15:24:02.0874427 sqlservr.exe 1760 WriteFile C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\TA.mdf SUCCESS Offset: 6,225,805,312, Length: 16,384, I/O Flags: Non-cached, Write Through, Priority: Normal
15:24:02.0884897 sqlservr.exe 1760 WriteFile C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\TA_1.LDF SUCCESS Offset: 4,589,289,472, Length: 8,388,608, I/O Flags: Non-cached, Write Through, Priority: Normal
From using DBCC PAGE, it seems to be reading from and writing to pages that look like table A's (or one of its indexes), but for B_IDs other than 13. Rebuilding indexes, maybe?
Edited 2: execution plan
So I cancelled the query (actually deleted the DB and its files then restored it), and checked the execution plan for:
UPDATE A SET A.X = B.X FROM A JOIN B ON A.B_ID = B.ID WHERE B.ID = 13
The (estimated) execution plan is the same as for any B.ID, and looks fairly straightforward. The WHERE clause uses an index seek on a non-clustered index of B, and the JOIN uses a clustered index seek on both tables' PKs. The clustered index seek on A uses parallelism (x7) and represents 90% of the CPU time.
More importantly, actually executing the query with ID 13 is immediate.
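For comparing the standalone statement with the one running inside the cursor, SET STATISTICS IO shows the logical and physical reads per object in the Messages tab, including the indexed views that get maintained behind the scenes:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- The same statement the cursor executes for one ID
UPDATE A SET A.X = B.X
FROM A JOIN B ON A.B_ID = B.ID
WHERE B.ID = 13;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```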
Edited 3: index fragmentation
The structure of indexes is as follows:
B has one clustered PK (not the ID field), and one non-clustered unique index whose first field is B.ID; this second index seems to always be used.
A has one clustered PK (on an unrelated field).
There are also 7 views on A (all include the A.X field), each one with its own clustered PK, and another index which also includes the A.X field.
The views are filtered (with fields that are not in this equation), so I doubt there is any way the UPDATE A would use the views themselves.
But they do have an index including A.X, so changing A.X means writing to the 7 views and to the 7 indexes they have that include the field.
Although the UPDATE is expected to be slower because of this, there is no reason why one specific ID would take so much longer than the others.
I checked the fragmentation for all the indexes, all were at <0.1%, except the secondary indexes of the views, all between 25% and 50%. Fill factors for all indexes seem ok, between 90% and 95%.
I reorganized all the secondary indexes, and reran my script.
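For reference, the fragmentation check and the reorganize were along these lines (the view and index names below are placeholders, not the real ones):

```sql
-- Fragmentation of every index on table A; LIMITED mode is cheap
-- enough to run against a 4M-row table
SELECT OBJECT_NAME(ips.object_id)       AS object_name,
       i.name                           AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.A'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id;

-- Reorganizing one of the views' secondary indexes (hypothetical names)
ALTER INDEX IX_ViewOnA_X ON dbo.ViewOnA REORGANIZE;
```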
It is still hung, but at a different point:
...
(0 row(s) affected)
Updating A for B_ID=14
(4 row(s) affected)
Whereas previously, the message log looked like this:
...
(0 row(s) affected)
Updating A for B_ID=14
(4 row(s) affected)
Updating A for B_ID=13
This is weird, because it means it's not even hung at the same point in the data.
The rest looks the same: same UPDATE line waiting in sp_who2, same PAGEIOLATCH_EX wait type and same heavy HD usage from sqlserver.exe.
Next step is to delete all indexes and views and recreate them I think.
Edited 4: deleting then rebuilding indexes
So, I deleted all the indexed views I had on the table (7 of them, 2 indexes per view including the clustered one).
I ran the initial script (without cursor), and it actually ran in 5 minutes.
So my issue originates from the existence of these indexes.
I recreated my indexes after running the update, and it took 16 minutes.
Now I understand indexes take time to rebuild, and I am actually fine with the complete task taking 20 minutes.
What I still don't understand is why running the update with the indexes in place takes several hours, while deleting them first, running the update, and recreating them takes 20 minutes. Shouldn't it take about the same time either way?
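If anyone ends up in the same situation: a workaround that avoids scripting out and dropping the views entirely is to disable their clustered indexes before the update and rebuild afterwards. A sketch with hypothetical names (disabling an indexed view's clustered index drops its materialized data, so the UPDATE no longer maintains it row by row; the rebuild re-materializes it in one set-based pass):

```sql
-- Disable the indexed view's clustered index: its materialized data is
-- dropped, so the UPDATE no longer has to maintain it row by row.
-- Repeat for each of the 7 views.
ALTER INDEX CIX_ViewOnA ON dbo.ViewOnA DISABLE;

-- The big update now touches only table A and its own indexes
UPDATE A SET A.X = B.X
FROM A JOIN B ON A.B_ID = B.ID;

-- Rebuilding re-materializes the view (and re-enables its other indexes)
ALTER INDEX ALL ON dbo.ViewOnA REBUILD;
```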