HOT Updates – PostgreSQL

So you might be aware that PostgreSQL uses MVCC to handle transaction isolation and this has loads of advantages such as –

  • Very few locks are required, meaning readers don’t block writers and writers don’t block readers.
  • ROLLBACK can be very fast as it does not have to undo any changes.

So what is MVCC? It basically means that each time a row is to be modified, rather than updating the existing row a new version of the row is created, with the updated data. So yes, this means that you can have multiple versions of the same row in the database at any given time. The older rows are safe to be deleted when there is no transaction that exists any longer that can see still see that older row.

But back to what we are here to talk about today – HOT Updates.

So in fact, each update in PostgreSQL is essentially an insert followed some time later by a delete. This MVCC approach has two major drawbacks –

  • Tables that experience heavy update load can become bloated with old ‘Deleted’ rows that at some point need to be removed.
  • Each update causes a new row to be inserted which could mean all of the indexes need to be updated to be made aware of the new position of the row. (Remember in PostgreSQL, tables are heaps and so the physical position of the row is stored in a non clustered index as opposed to say SQL Server where you can make a table a Clustered Index, in which case a logical pointer to the row is stored so even if the physical position of the row changes the indexes do not need to be updated as they are pointing to the Clustered Index.

So a feature called HOT updates introduced in PosgreSQL 8.3 can help with this. Here’s how.

As already mentioned PostgreSQL tables are heaps, meaning that the physical pages are not stored in order and either are the rows within those pages. Instead PostgreSQL has an array of pointers at the start of each page, that points to a row on that page. Because of this it doesn’t matter how the rows are ordered within each page, any process trying to access a particular row will just go straight to the array pointer at the beginning of each page to find the row it is looking for. In fact a TID (Tuple Identifier) which is used to identify a row in a table consists of a page number and an offset number of what item in the array pointer to read in order to get the physical location of the row on the page.

HOT stands for Heap Only Tuple, meaning a tuple that is not referenced from outside the table block. When a new version of a row is added, a pointer is added to the old row version to point to this new row provided that both entries of the row are on the same page. So to access the heap only tuple, PostgreSQL follows the ‘hot chain’ within the page.

So why is this good?

So the external address of the row stays the same hence, as long as the column that was changed is not indexed then there is no need to make any change to any index. Instead an Index scan just follow the ‘Hot Chain’ to find the version of the row which is the most current for that transaction.

There is also the added advantage that dead rows can be removed without the need of a VACUUM, which means, update heavy tables don’t have to rely on VACUUM to clear out old rows. So any process that detects a ‘hot chain’, even a SELECT statement, will try and lock that page and reorganize it, removing all old tuples. This is made possible because that process knows that these tuples are only referenced internally within that page.

Ok then, so these HOT Updates sound great, what conditions are necessary for these to happen?

  1. The columns being updated must not be referenced by an index
  2. There must be enough space within the page for the addition of an updated row (Remember this only works if both the old and new version of the row reside on the same page.) So you need to set your fill factor accordingly.

Default fill factor for heaps is 100%, and for Non-Clustered Indexes is 90%

So what should I set the fill factor on my table to then?

As with most things in I.T. the answer is it depends, firstly on whether the table in question experiences a lot of updates, if so it then becomes a balancing act, set the fill factor too low and yes you gain from HOT tuples, but on the other hand, your table will take up more pages, which means less data can be fitted into memory, which could mean less efficient queries. Too little fill factor and you won’t be able to get the gains that can be achieved from HOT updates, of course it also depends on how heavily indexed that table is.

My advice would be the following:

  1. For sure change the fill factor from 100% (even start out and try 90%) for tables that are update heavy
  2. If you can run tests on your environment to determine which fill factor % gives you the best performance for you.

The reloptions field on the pg_class view is a text array of all of the option set on each table, if you have fill factor set it should show up here, if not then assume the default fill factor of 100% is used.

select reloptions,* from pg_class

Until next time, Slán!

DisclaimerNever just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

Published by sqlrebel

Data Engineer.

Leave a comment