Postgresql – How does Postgres make its B-tree index

database-internalsindexpostgresql

I want to estimate how many reads the Partial index of PostgreSQL with the method B-tree requires, since I have not been able directly to alter the block size.
PostgreSQL manual about the index here and here about the block size, which is 100 for aux so having 3 reads.

The default memory taken by the block size is 8 KB i.e. 1 block normally but I am not sure if this can be possible because log_1(2) is infinity.
I got the idea that the number of reads is dynamically calculated possibly also in PostgreSQL here by considering B-tree where the block size determines the number of reads.
I would like to know how much this is in the block size: log_b n where b is the block size and n is the number of events.
I think it cannot be one mathematically.
I think Postgres B-tree is implemented the standard way as described in the Wiki page, also described by Cormen et al.

B-tree index contains only keys in PostgreSQL, based on this answer.
Data is then again in tables which are logical heaps.
The point of indexes is to store keys to begin with. Data lies in tables, which are logical heaps, based on here.
I am interested in how PostgresSQL make the entity called B-tree.
Based on here, the physical storage of B-tree indexes and tables are using the same data pages with mostly the same page-layout.
However, I am interested in how this entity works together.
Probably, the functioning of the index and data can be described by this:

B-tree grows from root, not from leaves.

but more precisely from Sumathi about Fundamentals of Relational Database Management Systems (Studies in Computational Intelligence):

In B-tree, nonleaf nodes are larger than leaf nodes. Pointers to data
records exist at all level of the tree.

In B+tree, pointers exists only at the leaves.
How can you evalute the pointer system of B-tree?
How can you describe the big-O space taken by PostgreSQL B-tree?
How does Postgres make its B-tree index?

Best Answer

Masi,

The PostgreSQL B-Tree index is very strongly based on the implementation by Lehman and Yao, which includes a lot of work oriented around multi-version concurrency control, but there's still great info in this paper.

Of course, PostgreSQL doesn't make a 100% accurate replica of the method in the paper, and to find the differences, there will be almost no way to do it other than to (1) find someone who understands the PostgreSQL B-Tree, and has the time to go through the intricate explanation, or (2) dig through the source code yourself.

Another possibility is for you to visit Bruce Momjian's excellent reference website, where he discusses PostgreSQL internals in more detail.

In this case, however, based on the nature of your questions, I feel like you may have a fundamental misunderstanding about how B-Tree indexes work. In this case, I think a little Google searching, or maybe reading through a portion of a textbook like Fundamentals of Database Systems by Elmasri & Navathe would do you some good.

Related Solutions

Mongodb – How do databases store index key values (on-disk) for variable length fields

You can store your index as a list of fixed-size offsets into the block containing your key data. For example:

+--------------+
| 3            | number of entries
+--------------+
| 16           | offset of first key data
+--------------+
| 24           | offset of second key data
+--------------+
| 39           | offset of third key data
+--------------+
| key one |
+----------------+
| key number two |
+-----------------------+
| this is the third key |
+-----------------------+

(well, the key data would be sorted in a real example, but you get the idea).

Note that this does not necessarily reflect how index blocks are actually constructed in any database. This is merely an example of how you might organise a block of index data where the key data is of variable length.

Postgresql – To minimize Cache misses in PostgreSQL

Keeping things cached with PostgreSQL is a balance between what you can have in shared_buffers and what you have in the filesystem cache, and avoiding both of them being pushed out by other processes.

One of the first things you can do on PostgreSQL 9.4 is set huge_pages=on in your postgresql.conf.

This also involves doing some OS changes to allow PostgreSQL to start with this enabled. The appropriate sysctl variable is vm.nr_hugepages

You can do a quick and dirty calculation of what you should set vm.nr_hugepages to with the following bit of shell scripting.

echo $(grep ^VmPeak /proc/$(cat /var/run/postgresql/9.4-main.pid)/status | sed -e 's/VmPeak://' | sed -e 's/kB//' | sed -e 's/[ \t]//g')/$(grep Hugepagesize /proc/meminfo | sed -e 's/Hugepagesize://' | sed -e 's/kB//' | sed -e 's/[ \t]//g') | bc -l | awk '{print int($1+1)}'

This line grabs the VmPeak from the running PostgreSQL process in /proc/$pid/status, which tells you the peak virtual memory size for the PostgreSQL process. This number is in kB. It cleans up the spacing and extra output, then gets the measurement of Hugepagesize from /proc/memstat (which in my case was 2048 kB), then does the appropriate division, rounds the output to the nearest integer, then adds one for a safety margin.

In my case, it was 2147.

Make it permanent by doing the following:

echo "vm.nr_hugepages=2147" >>/etc/sysctl.d/postgresql.conf

This is important for the following reasons, excerpted from the documentation:

/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge pages in the kernel's huge page pool. "Persistent" huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages by increasing or decreasing the value of 'nr_hugepages'.

Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.

So, pages grabbed by the postgres process for shared_buffers won't be swapped out under memory pressure. And you'll have less stress on the TLB for your processor too.

Following that, we adjust the OOMKiller, because something randomly murdering postgres processes is not a good citizen. These settings strongly encourage it to not fire.

echo "vm.overcommit_memory=2" >>/etc/sysctl.d/postgresql.conf

echo "vm.overcommit_ratio=100" >>/etc/sysctl.d/postgresql.conf

Next we set a couple of kernel scheduler sysctls:

echo "kernel.sched_autogroup_enabled=0" >>/etc/sysctl.d/postgresql.conf

echo "kernel.sched_migration_cost=5000000" >>/etc/sysctl.d/postgresql.conf

The migration cost is the total time the scheduler will consider a migrated process "cache hot" and thus less likely to be re-migrated.

Autogroup enabled basically groups tasks by setsid() so perceived responsiveness is improved. But on server systems, large daemons like PostgreSQL are going to be launched from the same setsid(), and be effectively choked out of CPU cycles in favor of less important tasks.

After that, we can adjust the swappiness, which is a knob that controls the degree to which a system swaps processes out of physical memory. A higher number more aggressively swaps out, and a lower number keeps things in the filesystem cache longer. Setting this to zero will increase the chance of OOM killing under memory and I/O pressure.

echo "vm.swappiness=10" >>/etc/sysctl.d/postgresql.conf

These control how often dirty data is synced back to disk from the filesystem cache. Since dirty objects in the pagecache can't be freed, more aggressively writing back can keep more pages available for use, and help avoid large bursts of disk IO.

echo "vm.dirty_ratio=2" >>/etc/sysctl.d/postgresql.conf

echo "vm.dirty_background_ratio=1" >>/etc/sysctl.d/postgresql.conf

Since your PostgreSQL version output shows GCC 4.9, I'm assuming that you're running Debian Jessie or a derivative, so the 3.16 kernel will avoid some truly dreadful I/O issues that the 3.2 kernel had, so that's most certainly good news.

Once you've adjusted all these settings, reboot the system. This will help make sure that the hugepages aren't fragmented when allocated. The best and most reliable way though is to put hugepages=2147 (the number I needed on my server), in /etc/default/grub under GRUB_CMDLINE_LINUX, and run update-grub, then reboot. But I would test the necessity of this, as it is much harder to change than the sysctl.

These should give you a reasonable amount of pages in the filesystem cache and in shared_buffers. After this, making sure the fewest number of processes are running on your server (because they all compete for resources), should give you much of the page cache miss minimization you're looking for.

Further reading on huge pages and their effects are in the hugetlbpage documentation.

The vm settings and their effects are in the vm documentation.

This is a very approachable treatment of the Linux Memory system Understanding and Optimizing Memory Utilization

The PostgreSQL documentation goes into detail on the effects of huge pages at the bottom of the kernel resources page

Hope that helps. =)

Best Answer

Related Solutions

Mongodb – How do databases store index key values (on-disk) for variable length fields

Postgresql – To minimize Cache misses in PostgreSQL

Related Question