How to find the least optimised query

optimization · performance · query-performance

A library uses a relational database to store data about its books, its members, and lendings (books that are currently borrowed). The database relations of this system are given below.

The tables are shown in the following format:
Tablename(column(length),column(length),column(length))

Book(bookno(13), title(20), author(30), edition(7))

Member(memno(4), name(26), category(5), address(40), phone(15))

Lending(memno(4), bookno(13), duedate(8))

The library database currently has 10,000 book tuples and 1,000 member tuples. Assume that 400 book borrowings are on record and that 10 copies of the book with bookno "81-203-1257-0" are on loan.

The following SQL query has been written to retrieve the borrower's name and due date for book number "81-203-1257-0".

SELECT m.*, l.*
FROM Member m, Lending l
WHERE m.memno = l.memno AND l.bookno = '81-203-1257-0';

The following are relational algebra expressions to retrieve the information for the above query.

(i) JOIN (Member and Lending over memno) where bookno = "81-203-1257-0"

(ii) JOIN (Member and (RESTRICT Lending where bookno = "81-203-1257-0") over memno)

(iii) CARTESIAN (Member and Lending) where bookno = "81-203-1257-0"

(iv) RESTRICT (CARTESIAN (Member and (RESTRICT Lending where bookno = "81-203-1257-0"))) where Member.memno = Lending.memno

(v) PROJECT m.*, l.* (RESTRICT Member m and Lending l where m.memno = l.memno and bookno = "81-203-1257-0")

Which one is the least optimised, and how do I compare it with the others? I am new to databases and query optimisation.

Best Answer

My relational algebra is pretty rusty but maybe I can put you on the right track. Start with a simpler problem: What if you were looking for book number "ZZZZ81-203-1257-0" which had no matches in the tables? Which query would be the least optimized then?

Let's take the third relational algebra expression as an example. You are doing a Cartesian join between the Member and Lending tables, so you'll end up with 1000 x 400 = 400,000 rows as an intermediate result set. You then check the filter condition on each of those 400,000 rows and keep none of them. Does that seem efficient or inefficient to you? How does it compare to the others? Can you draw any parallels between the reasoning you used for this case and the problem you're actually trying to answer?
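To make the comparison concrete, here is a rough back-of-the-envelope sketch in Python using the row counts from the question. This is just counting intermediate rows, not a real optimiser's cost model, and the variable names (and the 10-row match count for the given bookno) come from the question's stated assumptions:

```python
# Row counts stated in the question.
MEMBERS = 1_000    # tuples in Member
LENDINGS = 400     # tuples in Lending
MATCHES = 10       # Lending rows for bookno "81-203-1257-0"

# (iii): Cartesian product first, filter afterwards.
# Every member is paired with every lending before any row is discarded.
cartesian_then_filter = MEMBERS * LENDINGS      # 400,000 intermediate rows

# (iv): restrict Lending first, then Cartesian product with Member.
filter_then_cartesian = MEMBERS * MATCHES       # 10,000 intermediate rows

# (i): join on memno first (each lending matches one member),
# then filter on bookno.
join_then_filter = LENDINGS                     # 400 intermediate rows

# (ii): restrict Lending first, then join the 10 matches on memno.
filter_then_join = MATCHES                      # 10 intermediate rows

print(cartesian_then_filter)   # 400000
print(filter_then_cartesian)   # 10000
print(join_then_filter)        # 400
print(filter_then_join)        # 10
```

The ordering that falls out of this little count is the usual heuristic: push restrictions down so they run before joins, and avoid materialising a Cartesian product when a join condition is available.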