Sql-server – Working out Duplicated Data in a Customer Data Base

duplicationsql serversql-server-2008

I have a database in SQL2008 of customers, There is an ID number for the customer. What im trying to do is to create a NEWID field.

We have duplicated customers that have different ID fields, I am combining the Surname and Postcode to find the duplicated customers.

As you can see in the image with an example, i need to do the following.

1.Check if PostSur field have a match,

2.Check if there is more than 1 IDNO,

3.Check the next IDNO against the First IDNO and if there is a unique entry in the Name field. Then populate that record with the First IDNO in a new field called NewCode.

4.Otherwise the Next IDNO populate the NewCode with 'Dupe'

5.Populate all other IDNO in the NewCode also.

So i can then hide the Dupes and i should then have unique customers using NewCode.

I have the following SQL which uses ROW_NUMBER but cant figure out how to do the above with it?

   select *
 from (
  select *,
      row_number() over (partition by PostSur  order by Postsur) as rowID 
 from Aron_Reporting.dbo.Customer_NewSort
   ) as rowID
 order by PostSUr

any help would be great

enter image description here

Best Answer

Here are your duplicates, then everything else is your non-duplicate.

with c as (
    select *,
    row_number() over (partition by PostSur, Name  order by ID) as rowID 
    from Aron_Reporting.dbo.Customer_NewSort
)
select c.*
    from c 
    inner join c c2 on c.PostSur = c2.PostSur AND c.rowID > 1 AND c2.rowID = 1 as rowID
     order by PostSur

This code partitions the row numbers by PostSur but also name, that way every person with a different name, or PostSur starts the numbering over. I joined the results back to the anchor, but you really don't need that necessarily.

I think I understood your requirements, hopefully this at least gets you close.

Related Solutions

Sql-server – Imitate ‘dynamic’ columns in SQL Server

I used your approach two years ago for a similar problem, it works fairly well, except a bit pain with linq to sql (or entity framework), plus WinForm:

We created a few updatable views for each type of entity (or customer as per your sample) but linq to sql could not cope well with updatable views, or probably it's not the case now.

Now we used a different approach:

In the main table, we will have a column named Descriminator (something like UserType),

then for each type of entity, we will have its delegated table (extended from the main table), which uses same PrimaryKey as the main table - so of course, a foreign constraint refer back to the main table (i.e. Primary Key and foreign key on the same field), and the delegated table will have the specific columns related to that user type.

it works well too, and with this, we can take the advantage of entityframework's inheritance feature.

that's just my experience, so my suggestion is to write some small test apps to test out your approach(es) and evaluate. then get the best that has good performance and also suits your framework for UI.

Sql-server – Composite Primary Key in multi-tenant SQL Server database

Having worked on a large-scale, multi-tenant system (federated approach with customers spread across 18+ servers, each server having identical schema, just different customers, and thousands of transactions per second per each server), I can say:

There are some folks (a few, at least) who will agree on your choice of GUID as the IDs for both "TenantID" and whatever entity "ID". But no, not a good choice. All other considerations aside, that choice alone will hurt in a few ways: fragmentation to start with, vast amounts of wasted space (don't say disk is cheap when thinking about enterprise storage — SAN — or queries taking longer due to each data page holding fewer rows than it could with either INT or BIGINT even), more difficult support and maintenance, etc. GUIDs are great for portability. Is the data generated in some system and then transferred to another? If not, then switch to a more compact data type (e.g. TINYINT, SMALLINT, INT, or even BIGINT), and increment sequentially via IDENTITY or SEQUENCE.
With item 1 out of the way, you really do need to have the TenantID field in EVERY table that has user data. That way you can filter anything without needing an extra JOIN. This also means that ALL queries against client-data tables are required to have the TenantID in the JOIN condition and/or WHERE clause. This also helps guarantee that you don't accidentally mix data from different customers, or show Tenant A data from Tenant B.
I'm thinking to add TenantId as primary key along with Id. And possibly add OutletId too. So primary key(s) in sales order table will be Id, TenantId, OutletId.

Yes, you should have your clustered indexes on the client-data tables be composite keys, inclusive of TenantID and ID ^**. This also ensures that TenantID is in every NonClustered index (since those include the Clustered Index Key(s)) which you would need anyway since 98.45% of queries against client-data tables will need the TenantID (the main exception is when garbage collecting old data based on CreatedDate and not caring about TenantID).

No, you would not include FKs such as OutletID to the PK. The PK needs to uniquely identify the row, and adding in FKs would not help with that. In fact, it would increase chances for duplicated data, assuming that OrderID was unique for each TenantID, as opposed to unique per each OutletID within each TenantID.

Also, it is not necessary to add OutletID to the PK in order to ensure that Outlets from Tenant A don't get mixed up with Tenant B. Since all user-data tables will have TenantID in the PK, that means TenantID will also be in the FKs. For example, the Outlet table has a PK of (TenantID, OutletID), and the Order table has a PK of (TenantID, OrderID) and an FK of (TenantID, OutletID) which references the PK on the Outlet table. Properly defined FKs will prevent Tenant data from getting intermixed.
Does the composite key order matter?

Well, here is where it gets fun. There is some debate as to which field should come first. The "typical" rule for designing good indexes is to pick the most selective field to be the leading field. TenantID, by its very nature, will not be the most selective field; the ID field is the most selective field. Here are some thoughts:
- ID first: This is the most selective (i.e. most unique) field. But by being an auto increment field (or random if still using GUIDs), each customer's data is spread out throughout each table. This means that there are times when a customer needs 100 rows, and that requires almost 100 data pages read from disk (not fast) into the Buffer Pool (taking up more space than 10 data pages). It also increases contention on the data pages since it will be more frequent that multiple customers will need to update the same data page.
  
  However, you typically do not run into as many parameter sniffing / bad cached plan issues as much since the statistics across the different ID values is fairly consistent. You might not get the most optimal plans, but you will be less likely to get horrible ones. This method essentially sacrifices performance (slightly) across all customers to gain the benefit of less frequent problems.
- TenantID first: This is very much not selective at all. There might be very little variation across 1 million rows if you only have 100 TenantIDs. But the statistics for these queries is more accurate since SQL Server will know that a query for Tenant A will pull back 500,000 rows but that same query for Tenant B is only 50 rows. This is where the main pain-point is. This method greatly increases the chances of having parameter sniffing issues where the first run of a Stored Procedure is for Tenant A and acts appropriately based on the Query Optimizer seeing those statistics and knowing it needs to be efficient getting 500k rows. But when Tenant B, with only 50 rows, runs, that execution plan is no longer appropriate, and in fact, is quite inappropriate. AND, since the data is not being inserted in the order of the leading field, this method creates a lot more page splits (hence more fragmentation) in a short period of time than the other approach.
  
  However, for the first TenantID to run a Stored Procedure, the performance should be better than in the other approach since the data (at least after doing index maintenance) will be physically and logically organized such that far fewer data pages are needed to satisfy the queries. This means less physical I/O, fewer logical reads, less contention between Tenants for the same data pages, less wasted space taken up in the Buffer Pool (hence improved Page Life Expectancy) etc.
  
  There are two main costs to getting this improved performance. The first is not so difficult: you must do regular index maintenance to counteract the increased fragmentation. The second is a bit less fun.
  
  In order to counteract the increased parameter sniffing issues, you need to separate the execution plans between Tenants. The simplistic approach is to use WITH RECOMPILE on procs or the OPTION (RECOMPILE) query hint, but that is a hit on performance that could wipe away all of the gains made by putting TenantID first. The method that I found worked best is to use parameterized Dynamic SQL via sp_executesql. The reason for needing the Dynamic SQL is to allow for concatenating the TenantID into the text of the query, while all other predicates that would normally be parameters are still parameters. For example, if you were looking for a particular Order, you would do something like:
```
DECLARE @GetOrderSQL NVARCHAR(MAX);
SET @GetOrderSQL = N'
  SELECT ord.field1, ord.field2, etc.
  FROM   dbo.Orders ord
  WHERE  ord.TenantID = ' + CONVERT(NVARCHAR(10), @TenantID) + N'
  AND    ord.OrderID = @OrderID_dyn;
';

EXEC sp_executesql
   @GetOrderSQL,
   N'@OrderID_dyn INT',
   @OrderID_dyn = @OrderID;
```
  The effect this has is to create a reusable query plan for just that TenantID that will match the data volume of that particular Tenant. If that same Tenant A executes the stored procedure again for another @OrderID then it will reuse that cached query plan. A different Tenant running that same Stored Procedure would generate a query text that was different only in the value of the TenantID, but any difference in the query text is enough to generate a different plan. And the plan generated for Tenant B will not only match the data volume for Tenant B, but it will also be reusable for Tenant B for different values of @OrderID (since that predicate is still parameterized).
  
  The downsides to this approach are:
  - It is a little more work than just typing in a simple query (but not all queries need to be Dynamic SQL, just the ones that end up having the parameter sniffing problem).
  - Depending on how many Tenants are on a system, it does increase the size of the plan cache since each query now requires 1 plan per TenantID that is calling it. This might not be an issue, but is at least something to be aware of.
  - Dynamic SQL breaks the ownership chain, which means read/write access to tables cannot be assumed by having EXECUTE permission on the Stored Procedure. The easy but less secure fix is just to give the User direct access to the tables. This is certainly not ideal, but that is usually the trade-off for quick and easy. The more secure approach is to use Certificate-based security. Meaning, create a Certificate, then create a User from that Certificate, grant that User the desired permissions (a Certificate-based User or Login cannot connect to SQL Server on its own), and then sign the Stored Procedures that use Dynamic SQL with that same Certificate via ADD SIGNATURE.
    
    For more information on module signing and Certificates, please see: ModuleSigning.Info
Please see UPDATE section towards the end for additional topics related to the issue of dealing with mitigating statistics issues resulting from this decision.

^** Personally, I really dislike using just "ID" for the PK field name on every table as it is not meaningful, and it is not consistent across FKs since the PK is always "ID" and the field in the child table has to include the parent table name. For example: Orders.ID -> OrderItems.OrderID. I find it much easier to deal with a data model that has: Orders.OrderID -> OrderItems.OrderID. It is more readable, and cuts down on the number of times you will get the "ambiguous column reference" error :-).

UPDATE

Would the OPTIMIZE FOR UNKNOWN Query Hint (introduced in SQL Server 2008) help with either ordering of the composite PK?

Not really. This option does get around parameter sniffing issues, but it merely replaces one problem with another. In this case, rather than remembering the statistical info for the parameter values of the initial run of the stored procedure or parameterized query (which is definitely great for some, but potentially mediocre for some, and potentially horrible for some), it uses a general statistic of data distribution to estimate row counts. This is hit-or-miss as to how many (and to what degree) queries will be affected positively, negatively, or not at all. At least with parameter sniffing some queries were guaranteed to benefit. If your system has Tenants with widely varied data volumes, this could potentially hurt performance for all queries.

This option accomplishes the same thing as copying input parameters to local variables and then using the local variables in the query (I have tested this but no room for that here). Additional info can be found in this blog post: http://www.brentozar.com/archive/2013/06/optimize-for-unknown-sql-server-parameter-sniffing/. Reading the comments, Daniel Pepermans came to a conclusion similar to mine regarding the use of Dynamic SQL that has limited variation.
If ID is the leading field in the Clustered Index, would it help / suffice to have a Non-Clustered Index on (TenantID, ID), or just (TenantID) to have accurate statistics for queries that process many rows of a single tenant?

Yes, it would help. The large system that I mentioned working on for years was based on an index design of having the IDENTITY field as the leading field because it was more selective and reduced parameter sniffing issues. However, when we needed to operations against a good portion of a particular Tenant's data, the performance did not hold up. In fact, a project to migrate all data into new databases had to be put on hold because the SAN controllers got maxed out in terms of throughput. The fix was to add Non-Clustered Indexes to all tenant data tables to be just (TenantID). No need to do (TenantID, ID) since ID is already in the Clustered Index so the internal structure of the Non-Clustered Index was naturally (TenantID, ID).

While this did solve the immediate issue of being able to do TenantID-based queries much more efficiently, they still weren't as efficient as they could have been if it were the Clustered Index that were in that same order. And, now we had yet one more index on every table. That increased the amount of SAN space we were using, increased the size of our backups, made the backups take longer to complete, increased the potential for blocking and deadlocks, decreased performance on INSERT and DELETE operations, etc.

AND we were still left with the general inefficiency of having a Tenant's data spread out across many data pages, intermixed with many other Tenant's data. As I mentioned above, this increases the amount of contention on these pages, and it fills up the Buffer Pool with lots of data pages that have 1 or 2 useful rows in them, especially when some of the rows on those pages were for clients that were inactive but hadn't been garbage collected yet. There is much less potential for re-use of the data pages in the Buffer Pool in this approach, so our Page Life Expectancy was pretty low. And that means more time going back to disk to load more pages.

Best Answer

Related Solutions

Sql-server – Imitate ‘dynamic’ columns in SQL Server

Sql-server – Composite Primary Key in multi-tenant SQL Server database

Related Question