Collations and Encoding – PostgreSQL

Ok, not the most exciting topic in the world, but before you blow off reading this months blog know this, the collations that you have set at a database level in PostgreSQL impact the storage engines ability to be able to use an Index on a wild card search. Say what? Yes you heard me right, read on for some more detail!

So what are Collations and Encodings?

A Collation is a SQL schema object that maps a SQL name to an operating system locale. (See PostgreSQL docs). It mainly consists of two parts:

  • LC_COLLATE – Concerned with sorting
  • LC_CTYPE – Concerned with Capitalization

An Encoding is concerned with how many bytes is used to store the data.

As we may know already from our blog on SQL Server Collations, the database system uses the collation to perform ordering, for example as you may guess when the ‘order by’ clause is used. i.e. It tells us in the English language that ‘a’ comes before ‘b’.

The ‘default’ collation is what you will be using unless you specify otherwise and this means that you are using the default collation of the database server that you are on and this may differ depending on which region the database exists in. This is set upon creation of the PostgreSQL cluster and inherited by any databases that are created after, but you can explicitly specify a collation when creating a database.

The C and POSIX collations both specify the programming language C behavior, which basically means:

  • The characters A-Z are treated as letters and nothing else
  • And that sort order is then based solely on byte comparison.

This would differ say for example if you used the collation for french, then the sort order would not be based on byte values but on cultural rules.

‘C’ is the North American/English friendly UNIX default.

You can see why just using ‘C’ collation would be popular but if storing text from a language other than English then it is not the appropriate collation to use.

The code below will show you what the default collation and encoding on the server is.

show lc_collate

Ok, that’s all very interesting but lets get to the point that you said this impacted the use of Indexes!

So how does a locale impact indexing?

  • An Index is an ordered data set – when PostgreSQL looks up an Index it has to make certain assumptions as to where it will find the data based on the sort order of the text. So Indexes only work because they are sorted.
  • So for example if the locale is set to C then ‘a’ will be found before ‘b’ but what of the character á (a fada in the Irish Language), if C is the default collation then PostgreSQL would not treat this as a letter and just sort it based on its byte value and so this could lead to your indexes becoming less efficient as data is not correctly sorted.
  • If you database is using any other collation other than ‘C’ as the default, then any ‘like’ searches that are performed will not be able to use the standard B-Tree indexes!
  • So the standard B-tree indexes that you create on a table can only be used in wild card searches if the ‘C’ collation is being used.

Why is this the case?

The ‘LIKE’ operator needs to compare on byte position and not on character rules. If you have a different collation specified to ‘C’ then sort order is determined by the cultural rules of that collation and not the byte value and hence this is why if a like operator is used, then the storage engine will not be able to use any indexes in the searching as it cannot rely on the sorting of the index.

So what’s the solution then?

If you know that you will need to do pattern searching, then create an additional index and during index creation you specify an operator class (This tells the storage engine what comparison functions to be used when sorting the data for storage).

The following operator classes are most useful:

  • text_pattern_ops for type text
  • varchar_pattern_ops for type varchar
  • char_pattern_ops for type char

The difference between these operator classes and the default operator class that is used by an index is that these compare values strictly on byte position rather than local specific collation rule.

create index text_index on test_table(col varchar_pattern_ops);

Until next time, Slán!

DisclaimerNever just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

Published by sqlrebel

Data Engineer.

Leave a comment