SQL Index – How It Works

index

My knowledge of databases and SQL is based in most on university classes. Anyhow, I spent few monts (almost a year) in a company, where I was working with databases.

I have read few books and I have taken part in few trainings about databases such as MySQL, PostgreSQL, SQLite, Oracle and also few nonSQL dbs such us MongoDB, Redis, ElasticSearch etc.

As well as I said, I am begginer, with a lot of lacks of knowledge but today, someone told something, what is totally against my begginer's knowledge.

Let me explain. Let's take SQL database and create simple table Person with few records inside:

id | name   | age
-----------------
1  | Alex   | 24
2  | Brad   | 34
3  | Chris  | 29
4  | David  | 28
5  | Eric   | 18
6  | Fred   | 42
7  | Greg   | 65
8  | Hubert | 53
9  | Irvin  | 17
10 | John   | 19
11 | Karl   | 23

Now, it's the part, I would like to focus on – id is the INDEX.

So far, I thought it works in this way: when a table is being created the INDEX is empty. When I am adding new record to my table the INDEX is being recalculated based on some alghortims. For example:

Grouping one by one:

1    ... N
N+1  ... 2N
     ...
XN+1 ... (X+1)N

so, for my example with size = 11 elements and N = 3 it will be like this:

id | name   | age
-----------------
1  | Alex   | 24     // group0
2  | Brad   | 34     // group0
3  | Chris  | 29     // group0
4  | David  | 28     // group1
5  | Eric   | 18     // group1
6  | Fred   | 42     // group1
7  | Greg   | 65     // group2
8  | Hubert | 53     // group2
9  | Irvin  | 17     // group2
10 | John   | 19     // group3
11 | Karl   | 23     // group3

So, when I am using query SELECT * FROM Person WHERE id = 8 it will do some simple calculation 8 / 3 = 2, so we have to look for this object in group2 and then this row will be returned:

8  | Hubert | 53

enter image description here

This approach works in time O(k) where k << size. Of course, an alghoritm to organise rows in groups is for sure much more complicated but I think this simple example shows my point of view.

So now, I would like to present another approach, which has been showed me today.

Let's take once again this table:

id | name   | age
-----------------
1  | Alex   | 24
2  | Brad   | 34
3  | Chris  | 29
4  | David  | 28
5  | Eric   | 18
6  | Fred   | 42
7  | Greg   | 65
8  | Hubert | 53
9  | Irvin  | 17
10 | John   | 19
11 | Karl   | 23

Now, we are creating something similar to Hashmap(in fact, literally it is a Hash Map) which maps id to address of row with this id. Let's say:

id | addr 
---------
1  | @0001
2  | @0010
3  | @0011
4  | @0100
5  | @0101
6  | @0110
7  | @0111
8  | @1000
9  | @1001
10 | @1010
11 | @1011

So now, when I am running my query: SELECT * FROM Person WHERE id = 8

it will map directly id = 8 to address in memory and the row will be returned. Of course complexity of this is O(1).

So now, I have got few questions.

1. What are the adventages and disadventages of both solutions?

2. Which one is more popular in current database implementations? Maybe different dbs use different approaches?

3. Does it exist in nonSQL dbs?

Thank you in advance


COMPARISON

               |      B-tree     |   Hash Table
----------------------------------------------------
----------------   one element   -------------------
----------------------------------------------------
SEARCHING      |  O(log(N))      | O(1) -> O(N)  
DELETING       |  O(log(N))      | O(1) -> O(N)
INSERTING      |  O(log(N))      | O(1) -> O(N)
SPACE          |  O(N)           | O(N)
----------------------------------------------------
----------------    k elements   -------------------
----------------------------------------------------
SEARCHING      |  k + O(log(N))  | k * O(1) -> k * O(N)
DELETING       |  k + O(log(N))  | k * O(1) -> k * O(N)
INSERTING      |  k + O(log(N))  | k * O(1) -> k * O(N)
SPACE          |  O(N)           | O(N)

N – number of records

Am I right? What about cost of rebuilding B-tree and Hash table after each insert/delete? In case of B-tree we have to change some pointers but in case of balanced b-tree it needs more effort. Also in case of Hash table we have to do few operation, especially, if our operation generate conflicts.

Best Answer

You're basically describing a B-tree index and a hash index. They both have a place, but both are best suited for different jobs.

Advantages and disadvantages

B-tree (and B+-tree) indexes are usually balanced. This means that looking for a value will always take the same amount of time no matter where in the tree it falls (O(log n)). Generally, the number of levels in the tree is limited, so it tends to get "wider" not "deeper". For small data sets, the cost of maintaining and using the B-tree, however, can be more than just reading all the rows. B-tree indexes are good for large data sets, data sets with low selectivity, or data sets where you intend to select a range of objects not just one object.

Hash tables are great for small data sets. Hash indexes have a predefined number of hash buckets, depending on the hashing algorithm used. This is because a given hash algorithm can only produce so many unique hashes, so it only gets "deeper" not "wider". Once the database engine finds the right bucket, it then walks through all the objects in that bucket to find the one you want. With small, highly selective data sets each bucket contains a very small number of objects and is resolved pretty quickly. With larger data sets, the buckets get much more crowded. So, if the object you need is in a small bucket or is near the beginning of the bucket, it returns pretty quick. If it's at the end of a large bucket, it takes longer. The index is not balanced, so the performance is anywhere from O(1) to O(n).

Popularity

In general, I've run across B-trees the most. Bitmap indexes are also another option for values with a low cardinality (think booleans or maybe gender). This is going to vary depending on your database engine as to what index types are available.

NoSQL

NoSQL databases definitely support indexes. Most support B-tree or a variation on B-tree. Most seem to support hashed indexes as well.