Relational Algebra – How to Delete Duplicate Keys

relational-algebrarelational-calculusrelational-theory

Hi I am new to databases and relational algebra. I was wondering if there is a way to remove the tuples from a table using relational algebra that have the same keys but different value.

e.g. I want to keep only [1, 5] and [4, 9] but remove everything else.

Key    | Value
-------|-------
 1     | 5
 2     | 6
 2     | 7
 2     | 8
 4     | 9

Thanks.

Best Answer

There are many variants and extensions of relational algebra described in books. I assume that you have a group by operator, which is an extension of the classical relational algebra, and it is written as:

a₁... a_n γ f₁...f_m

where each a_i is a grouping attribute, and f_i is an aggregation function.

Using this operator, your query could be answered by the following expression (assuming that the name of your relation is R):

R ⨝ π_Key(σ_COUNT(*)=1(_Key γ _COUNT(*) (R)))

First we group by Key and keep only the groups with a unique value for it, and then we perform a natural join on R itself to maintain only the tuples with just one value of Key.

Related Solutions

Relational Theory – Practical Reasons for Learning Relational Algebra

When Codd defined the relational model he defined a set of operators which could be applied to relations. In specifying a relational algebra, much like specification of an integer algebra, we are able to use symbols in place of relations to solve queries. These operators are subject to the same algebraic properties that integer algebra operators (+, -, *, /) are. As a result, we can assume certain laws that always apply to a relation, any relation, undergoing that operation. For example, in integer algebra we know that addition and multiplication are associative in that we can change the grouping of operands and not change the result:

a + ( b + c ) = ( a + b ) + c

Similarly, in relational algebra we know that natural join is associative and thus know that A join B join C can be executed in any order. These properties and laws create the power to re-write query formulations and be guaranteed to get the same results. The book Applied Mathematics for Database Professionals provides significant detail on the various re-write rules you can use to precisely formulate the same query in different ways. In a perfect world any formulation producing the same result would have the same performance. A modern optimizer, while an amazing piece of software, isn't perfect however. Thus if you have formulated a query one way and are getting poor performance, you have the skills to formulate it a different way and know it has the same semantics. Another practical advantage to this is in the specification of database constraints. First, understanding the relational algebra enables you to determine the simplest way to formulate the constraint. Second, by formulating the constraint in formal logic, you can immediately clarify any ambiguity in intent from the business subject matter experts who formulated the business rule in loose English and avoid bugs.

It was Leonardo da Vinci who said:

He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.

In this same way, a data practitioner who doesn't understand the fundamentals of relational theory cannot be in as complete command of the technology as they can be with that understanding. Some great references on relational algebra are SIRA_PRISE's Introduction to the Relational Algebra page, and CJ Date's SQL and Relational Theory. Date's book shows the practicality in understanding relational algebra so that you can write much more accurate SQL queries. SQL has many quirks and pitfalls and having a sound grasp of how it works vs. the original relational algebra operators really helps realizing where the pitfalls are and avoiding them.

Relational Theory – Are Three Equalities on a Join Operation Legal?

The notation Producer=Actor=Director is wrong. Besides that, neither Π_Actor(Film) nor Π_Producer(Produce) has an attribute Director. The first projection has only an attribute Actor, the second projection has only an attribute Producer.

In the second answer what is meant if you divide two relations that have the same number of attributes. I think that does not make sense because the result is a relation with no attributes.

You first relation Film(Title,Director,Actor) is very strange. It defines a relation between three attributes actor, directors and title. But from my understanding of a film there does not exist such a relation. There is a relation between actor an title, if an actors plays in a movie with this title. And there is a relation between a director and a title if a person is a director of this movie. But I don't understand the tuple (actor, director, title). Does an actor have a director in a movie and another actor has another director in the same movie? If that is not the case then you should better use two relations: A relation Acts(Person, Title)and a relation Directs(Person, Title).

Here are the answers using only natural join, which makes them a little bit clumsy.

The first question

Which actors produce at least a Film they directed?

has the following answer, using the notation from wikipedia:

Latex of Formula: $$\pi_{\text{Actor}}(\text{Film}) \bowtie \pi_{\text{Actor}}\left(\rho_{\text{Actor}/\text{Director}}\left(\pi_{\text{Director},\text{Title}}(\text{Film}) \bowtie \rho_{\text{Director}/\text{Producer }}(\text{Produce})\right)\right)$$

Here is an equivalent SQL code (Oracle):

select F2.Actor
from Film F2 join (
  Film F1 join Produce P 
    on (F1.Director=P.Producer and F1.Title=P.Title)
  ) on F2.Actor=F1.Director

Example

For the following data the result is Donald Duck

Film

    Title              | Director    | Actor
    -------------------+-------------+-----------
    Mickey Mouse Revue | Donald Duck | Minnie Mouse
    Mickey Mouse Revue | Donald Duck | Mickey Mouse
    Duck Tales         | Walt Disney | Donald Duck

Producer

    Producer    | Title 
    ------------+------------------
    Donald Duck | Mickey Mouse Revue 
    Walt Disney | Duck Tales

The second question

Which actors produce every film they directed

uses additionally the set difference operator \:

Latex of Formula: $$\pi_{\text{Actor}}(\text{Film}) \bowtie \rho_{\text{Actor}/\text{Director}}(\pi_{\text{Director}}((\pi_{\text{Director},\text{Title}}(\text{Film}) \bowtie \rho_{\text{Director}/\text{Producer }}(\text{Produce}))\setminus \pi_{\text{Director}}(\pi_{\text{Director},\text{Title}}(\text{Film}) \setminus \rho_{\text{Director}/\text{Producer }}(\text{Produce}))))$$

These are all (Director,Title) pairs that were not produced by the director of the title:

and therefore these are all directors that haven't produced at least one of their titles:

Similar, these are all directors that produced at least one of their titles:

The difference are the directors that have produced all of the titles the directed. Now we have to join it to the set of all actors to filter out the directors that are also actors (but not necessary of the films they directed).

Here is an equivalent SQL Code (Oracle):

Select F2.Actor
from Film F2 join 
(
  (Select F1.Director,F1.Title
  from Film join Produce on (
      F1.Director=P.Producer 
      and F.Title=P.Filem))
  minus
  (Select Director, Title
  from Film
  minus
  select Producer,Title
  from Produce)
) F3
on(F2.Actor=F3.Director)