Assuming you just want to delete duplicates in books
after merging duplicate authors.
BEGIN;

LOCK books, authors;

CREATE TEMP TABLE dupes ON COMMIT DROP AS
(SELECT 2 AS dupe, 1 AS org);

DELETE FROM books b       -- delete duplicate books
USING  dupes d
WHERE  b.author_id = d.dupe
AND    EXISTS (
   SELECT 1
   FROM   books
   WHERE  title = b.title
   AND    author_id = d.org
   );

UPDATE books b            -- now we relink all remaining books
SET    author_id = d.org
FROM   dupes d
WHERE  b.author_id = d.dupe;

DELETE FROM authors a     -- now we can delete all dupes
USING  dupes d
WHERE  a.id = d.dupe;

COMMIT;
The temp table can hold many rows to remove many dupes at once.
Repeat the first two steps (the DELETE and the UPDATE) for every table referencing authors.id. If there are many, I would create and execute the statements dynamically.
I lock the tables explicitly to avoid interference from concurrent transactions.
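For instance, the temp table could be filled from a query that identifies the duplicates. A sketch, assuming duplicate authors are recognized by an identical name and the row with the smallest id is kept as the original:

```sql
CREATE TEMP TABLE dupes ON COMMIT DROP AS
SELECT a.id AS dupe, m.org
FROM   authors a
JOIN  (SELECT name, min(id) AS org
       FROM   authors
       GROUP  BY name) m USING (name)
WHERE  a.id <> m.org;   -- every duplicate paired with its surviving original
```

Adjust the grouping criteria to whatever actually defines a "duplicate" in your data.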
Automation
A basic function could look like this:
CREATE OR REPLACE FUNCTION f_remove_dupe(_tbl text, _col text, _dupe int, _org int)
  RETURNS void AS
$func$
DECLARE
   _ftbl text;
   _fcol text;
BEGIN
   FOR _ftbl, _fcol IN
      -- table and column name behind all referencing FKs
      SELECT c.conrelid::regclass::text, f.attname
      FROM   pg_attribute  a
      JOIN   pg_constraint c ON a.attrelid = c.confrelid AND a.attnum = c.confkey[1]
      JOIN   pg_attribute  f ON f.attrelid = c.conrelid  AND f.attnum = c.conkey[1]
      WHERE  a.attrelid = _tbl::regclass
      AND    a.attname  = _col
      AND    c.contype  = 'f'
   LOOP
      EXIT WHEN _ftbl IS NULL;  -- skip if not found
      EXECUTE format('
         UPDATE %1$s
         SET    %2$I = $2
         WHERE  %2$I = $1'
       , _ftbl, _fcol)
      USING _dupe, _org;
   END LOOP;

   EXECUTE format('
      DELETE FROM %I WHERE %I = $1'
    , _tbl, _col)
   USING _dupe;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_remove_dupe('authors', 'id', 2, 1);
This simple version ...
- ... only works for a single dupe.
- ... ignores UNIQUE constraints in referencing tables.
- ... assumes all FK constraints only use the one column, ignoring multi-column FKs.
- ... ignores possible interference from concurrent transactions.
Adapt to your requirements.
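To handle many dupes, the function could be called once per pair. A sketch, assuming a hypothetical dup_map table holding (dupe, org) pairs:

```sql
SELECT f_remove_dupe('authors', 'id', d.dupe, d.org)
FROM   dup_map d;   -- hypothetical table of (dupe, org) pairs
```

Note that each call runs in its own statement context here; wrap the whole thing in a transaction if the pairs must be merged atomically.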
In SQL Server, I might accomplish this using a cross reference table between "companies" and "users" where only a single row can be an owner for a given company, via the presence of a unique non-clustered index.
For example:
USE tempdb; -- create this test in tempdb
IF OBJECT_ID(N'dbo.CompaniesUsers', N'U') IS NOT NULL
    DROP TABLE dbo.CompaniesUsers;
IF OBJECT_ID(N'dbo.Users', N'U') IS NOT NULL
    DROP TABLE dbo.Users;
IF OBJECT_ID(N'dbo.Companies', N'U') IS NOT NULL
    DROP TABLE dbo.Companies;
GO
CREATE TABLE dbo.Users
(
    UserID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Users
        PRIMARY KEY CLUSTERED
  , UserName varchar(100) NOT NULL
        CONSTRAINT UQ_Users_UserName
        UNIQUE
);

CREATE TABLE dbo.Companies
(
    CompanyID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Companies
        PRIMARY KEY CLUSTERED
  , CompanyName varchar(100) NOT NULL
        CONSTRAINT UQ_Companies_CompanyName
        UNIQUE
);

CREATE TABLE dbo.CompaniesUsers
(
    CompanyID int NOT NULL
        CONSTRAINT FK_CompanyUsers_CompanyID
        FOREIGN KEY
        REFERENCES dbo.Companies(CompanyID)
  , UserID int NOT NULL
        CONSTRAINT FK_CompanyUsers_UserID
        FOREIGN KEY
        REFERENCES dbo.Users(UserID)
  , CompanyOwner bit NOT NULL
        CONSTRAINT DF_CompanyUsers_Owner
        DEFAULT ((0))
);

CREATE UNIQUE NONCLUSTERED INDEX IX_CompanyUsers
ON dbo.CompaniesUsers(CompanyID, CompanyOwner)
INCLUDE (UserID)
WHERE CompanyOwner = 1;
Here I'll insert some test data to show how this works. First, two "users":
INSERT INTO dbo.Users (UserName)
VALUES ('owner 1')
, ('owner 2');
And, now, a "Company":
INSERT INTO dbo.Companies (CompanyName)
VALUES ('Company 1');
If I attempt to make both users an owner of the company, the statement fails:
INSERT INTO dbo.CompaniesUsers (CompanyID, UserID, CompanyOwner)
VALUES (1, 1, 1)
, (1, 2, 1);
Msg 2601, Level 14, State 1, Line 62
Cannot insert duplicate key row in object 'dbo.CompaniesUsers' with unique index 'IX_CompanyUsers'. The duplicate key value is (1, 1).
However, if I only allow one of the users to be the "owner", then the insert works:
INSERT INTO dbo.CompaniesUsers (CompanyID, UserID, CompanyOwner)
VALUES (1, 1, 1)
, (1, 2, 0);
SELECT *
FROM dbo.CompaniesUsers
╔═══════════╦════════╦══════════════╗
║ CompanyID ║ UserID ║ CompanyOwner ║
╠═══════════╬════════╬══════════════╣
║ 1 ║ 1 ║ 1 ║
║ 1 ║ 2 ║ 0 ║
╚═══════════╩════════╩══════════════╝
If you need to ensure a "user" can own at most a single "company", then this unique, non-clustered index should do the trick:
CREATE UNIQUE NONCLUSTERED INDEX IX_CompanyUsers_1
ON dbo.CompaniesUsers(UserID, CompanyOwner)
INCLUDE (CompanyID)
WHERE CompanyOwner = 1;
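With IX_CompanyUsers_1 in place, giving the same user a second ownership row would violate the index. A hypothetical demonstration (assumes the test data above, plus a second company):

```sql
INSERT INTO dbo.Companies (CompanyName)
VALUES ('Company 2');

-- user 1 already owns company 1, so this insert fails
-- with a duplicate-key error on IX_CompanyUsers_1
INSERT INTO dbo.CompaniesUsers (CompanyID, UserID, CompanyOwner)
VALUES (2, 1, 1);
```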
I originally missed the requirement that a "user" can only be a member of a single "company". With that in mind, you'd need to add this index (the "IX_CompanyUsers_1" index would be unnecessary):
CREATE UNIQUE NONCLUSTERED INDEX IX_CompanyUsers_2
ON dbo.CompaniesUsers(UserID);
Best Answer
The typical nomenclature for this type of table design would be "class-subclass". In your example of Books and Authors, Books is a subclass of the Author class. Any book automatically has an author, since no book can ever be written without an author to write it. The same goes for Sales and SalesItems; each SalesItem must have an associated Sales object. However, "parent-child" may be employed where it makes sense. As is so often typical, the answer is It Depends™.
Using nomenclature that makes sense for the specific domain in question, and is easily and readily understood by the stakeholders, ensures future comprehension.
One-to-many is the typical term used to describe the relationship between rows in each table. Depending on domain requirements, the relationship might instead be one-to-one, many-to-one, or even many-to-many.
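A minimal sketch of the one-to-many relationship between the two classes, as hypothetical Postgres DDL (table and column names assumed):

```sql
CREATE TABLE authors (
    id   serial PRIMARY KEY
  , name text NOT NULL
);

CREATE TABLE books (
    id        serial PRIMARY KEY
  , author_id int NOT NULL REFERENCES authors(id)  -- many books, one author
  , title     text NOT NULL
);
```

The NOT NULL foreign key is what encodes "no book without an author"; dropping NOT NULL would turn it into an optional relationship.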