Deriving formulas for input/output

database-design database-theory performance

I'm currently enrolled in a DBS class and am having a problem with an assignment. I've searched around and have been unable to understand what I'm meant to be doing to derive this formula.

A plant file with TREE-GENUS as the key field includes records with the following TREE-GENUS values: Tsuga, Ficus, Arbutus, Quercus, Melaleuca, Tristaniopsis, Cornus, Sequoiadendron, Lithocarpus, Liriodendron, Pittosporum. Suppose that records with these search field values are inserted into a random (heap) file with a maximum of 3 records per block. Derive a formula for the expected number of disk I/Os to scan these records and to search for a particular record.

I've been using some software that was supplied with the assignment, and it also asks for the maximum number of blocks allowed, which is not given in the brief above. I'm not really sure how to derive a formula for this. I've assumed that, because there are 3 records per block, 4 blocks are required, and that a random heap file costs 1 disk I/O per block read or write.
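For what it's worth, here is the standard textbook model for an unindexed heap file that I've been assuming (records never span blocks, one disk I/O per block touched). The Python below is just the arithmetic; the (b + 1)/2 average assumes the key value is equally likely to sit in any block, and textbooks often round this to b/2.

```python
import math

def heap_io_costs(n_records, records_per_block):
    """Textbook cost model for an unindexed heap file.

    Assumes records do not span blocks and each block access
    costs exactly one disk I/O.
    """
    b = math.ceil(n_records / records_per_block)  # blocks needed
    full_scan = b                                 # a scan reads every block
    # Equality search on the key field: on average about half the blocks
    # are read before the (unique) record is found; a miss reads all b.
    avg_key_search = (b + 1) / 2
    worst_case_search = b
    return b, full_scan, avg_key_search, worst_case_search

# The 11 TREE-GENUS records from the assignment, 3 per block:
b, scan, avg, worst = heap_io_costs(11, 3)
print(b, scan, avg, worst)  # 4 blocks, 4 reads to scan, ~2.5 average, 4 worst case
```

Under that model the 11 genera give b = 4 blocks, 4 reads for a full scan, about 2.5 reads on average to find a particular record, and 4 reads in the worst case (or when the record is absent). I don't know whether this is what the assignment software expects, which is partly why I'm asking.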

If this is a larger topic than is worth explaining here, a link to a few reliable pages would also be helpful.

Best Answer

This is obviously a homework question, but it seems to me a bit like a trick question from a database perspective. I don't think there is a simple answer, and here the answer may raise quite a number of additional questions. My recommendation for tackling it is to sketch out how the pages would be laid out and accessed, and the like. In essence, so much depends on the implementation (compare disk I/O in MySQL's InnoDB vs PostgreSQL and you will see massive differences, due partly to the supported access methods).

The key problem comes down to indexing. The heap table could in fact be implicitly indexed, the way MySQL's InnoDB does it, or it could be unindexed (the way PostgreSQL does it). If it is indexed, then you have additional issues. Do you support a physical scan of the first block (to determine where to start)? If so, how do you determine which starting point to use? (Collecting statistics on the data in the tables would help, but that is additional I/O too.)
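As a rough illustration only (the fan-out figure is made up, and real engines differ in many other ways), here is how the two extremes compare under a toy model: a linear scan of an unindexed heap versus an equality lookup through a B-tree-style index.

```python
import math

def heap_equality_search(n_records, records_per_block):
    """Unindexed heap (PostgreSQL-style): linear scan of the blocks."""
    b = math.ceil(n_records / records_per_block)
    return {"average_reads": (b + 1) / 2, "worst_case_reads": b}

def btree_equality_search(n_records, fanout):
    """Implicitly indexed heap (InnoDB-style clustered index, roughly):
    descend the internal levels, then read the leaf/data block.
    The fanout is an illustrative figure, not a real engine's value."""
    height = max(1, math.ceil(math.log(n_records, fanout)))
    return {"reads": height + 1}

print(heap_equality_search(11, 3))     # {'average_reads': 2.5, 'worst_case_reads': 4}
print(btree_equality_search(11, 100))  # {'reads': 2} for such a tiny file
```

Even in this toy model the cost of finding one record changes completely depending on whether an index exists, which is the point about implementation dependence.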

I could imagine cases where the number of disk block reads ranges from n/3 (where n is the number of records) down to n^(1/10)/3, and the number of writes ranges from 1 to n/3.

In the end, questions like this, particularly where you have software assigned for use in class, should probably be directed to your instructor. It isn't clear what the point of the exercise is, and so much depends on the implementation.