I have a large table with a structure similar to this:
    Id        bigint primary key
    Sequence  int
    Parent    int foreign key
    Data1     varchar(20)
    Data2     varchar(20)
    Data3     varchar(20)
    Data4     varchar(20)
    Data5     varchar(20)
Data1-5 are text fields (mostly numbers and dashes) that are used to help prevent accidentally recording duplicates. Legitimate duplicates do occur occasionally, though, so we have to let the user confirm that yes, the duplicate is OK.
This duplicate check can be time consuming with millions of rows. I've been tasked with ensuring we can handle triple the amount of data we currently keep in this database. Previous developers have told me that the current process will not work with that much data (there is of course a lot more going on with the data than I'm including here, and though the actual query for finding duplicates is pretty simple, this part of the system is apparently sensitive to additional latency). I have not run experiments myself to prove this, but I trust their judgement, and regardless I'd like to reduce the impact as much as possible. Unfortunately, I won't have a realistic data volume to test with until fairly late in the process, so while I have time I'd like to prepare a few scenarios to test when I get a window on the test server.
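To make the discussion concrete, the current check is essentially this shape (dbo.BigTable and the @Data parameters are stand-in names; the real query lives inside a larger process):

    -- @Data1-@Data5 are the incoming values to check. Without an index
    -- covering Data1-Data5 this is a scan over millions of rows.
    SELECT t.Id
    FROM dbo.BigTable AS t
    WHERE t.Data1 = @Data1
      AND t.Data2 = @Data2
      AND t.Data3 = @Data3
      AND t.Data4 = @Data4
      AND t.Data5 = @Data5;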
All 5 fields must match for a row to count as a duplicate. I'm skeptical that this could never change, but I'm told that for the foreseeable future all 5 must match exactly. My thinking is that by adding an indexed column that holds a hash of the combined values, I can find potential duplicates much faster. I would still have to compare the individual values to account for hash collisions, though. Would I gain anything by combining the values themselves too? That is, a hash column plus a single column holding all 5 values in a delimited string, which I'd compare instead of comparing each value separately?
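For the hash idea, I'm picturing something like the sketch below (assuming SQL Server 2012 or later for HASHBYTES('SHA2_256') and CONCAT; names are illustrative, and '|' as a delimiter relies on the data being only numbers and dashes):

    -- Persisted computed hash over the five values. Note CONCAT treats NULL
    -- as an empty string, so NULL vs. '' would need extra handling if that
    -- distinction matters.
    ALTER TABLE dbo.BigTable ADD DataHash AS
        CAST(HASHBYTES('SHA2_256',
             CONCAT(Data1, '|', Data2, '|', Data3, '|', Data4, '|', Data5))
             AS binary(32)) PERSISTED;

    CREATE INDEX IX_BigTable_DataHash ON dbo.BigTable (DataHash);

    -- Seek on the hash first, then verify the individual values so a
    -- collision can't produce a false duplicate.
    SELECT t.Id
    FROM dbo.BigTable AS t
    WHERE t.DataHash = CAST(HASHBYTES('SHA2_256',
              CONCAT(@Data1, '|', @Data2, '|', @Data3, '|', @Data4, '|', @Data5))
              AS binary(32))
      AND t.Data1 = @Data1 AND t.Data2 = @Data2 AND t.Data3 = @Data3
      AND t.Data4 = @Data4 AND t.Data5 = @Data5;

The CAST to binary(32) is there because HASHBYTES returns varbinary(8000), which is awkward to use as an index key; SHA2_256 output is always exactly 32 bytes.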
It seems like combining the values would only pay off if the varchars were stored somewhere other than the rest of the row, and I had always assumed they were stored in-row. That assumption appears to hold: I'm having a hard time finding specific documentation, but it seems that as long as the row is under 8060 bytes, the varchar data is stored in the row itself. Only if the row exceeds 8060 bytes will some varchar columns be chosen to move to row-overflow pages.
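In case it's useful to compare, this is the combined-column variant I'm weighing (again with made-up names), where the delimited string itself is stored and compared in place of the five separate predicates:

    -- Single persisted column holding the delimited values. Comparing one
    -- column replaces five predicates, at the cost of up to ~104 extra
    -- bytes per row (5 x 20 characters plus 4 delimiters).
    ALTER TABLE dbo.BigTable ADD CombinedData AS
        CONCAT(Data1, '|', Data2, '|', Data3, '|', Data4, '|', Data5) PERSISTED;

    DECLARE @Combined varchar(104) =
        CONCAT(@Data1, '|', @Data2, '|', @Data3, '|', @Data4, '|', @Data5);

    SELECT t.Id
    FROM dbo.BigTable AS t
    WHERE t.DataHash = CAST(HASHBYTES('SHA2_256', @Combined) AS binary(32))
      AND t.CombinedData = @Combined;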
Any documentation that explains how small varchars are stored, and any advice on whether combining the columns into a single column would have a performance impact, would be much appreciated.
This table will have millions of rows (say 10 million as an estimate). Each row will definitely stay under the 8060-byte limit, though I can't give you an exact maximum right now.
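Once I have real data to look at, I'm planning to confirm the row sizes with something like this (sys.dm_db_index_physical_stats only reports record sizes in SAMPLED or DETAILED mode; dbo.BigTable is again a placeholder):

    -- Average/maximum record size in bytes for each index on the table;
    -- a max under 8060 means no row-overflow pages are in play.
    SELECT index_id, avg_record_size_in_bytes, max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.BigTable'), NULL, NULL, 'SAMPLED');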