UTF-16 and UTF-8 Encoding – SQL Server – The Front-line Database Blog

When storing a string in SQL Server we use either char(n), varchar(n), nchar(n) and nvarchar(n). So what really is the difference?

So we might know that char and varchar store Ascii characters and use 1 byte per character to store this data, where as nchar and nvarchar store Unicode characters and use 2 bytes per character to store this data. While in part this is mostly true, lets dig a little deeper on this.

UTF-16 is the Unicode encoding that is used for nchar, nvarchar in SQL Server. As already mentioned it uses 2 bytes to store each character. So nvarchar(20) can store up to 20 Unicode characters and will take up 40 bytes of space.

** Important *

The ‘n’ refers to the string length in byte pairs and not character length, it just so happens that for unicode characters 0-65,536 one character can be stored per each byte pair. So for example if you had nvarchar(10), this is saying that as long as the characters stored fall within unicode 0-65,536, then you can store 10 characters here but this will cost you 20 bytes of storage. However, if you were storing a character(s) above the 65,536 range, you might not fit 10 characters in as they may need 2 byte pairs per character.

Prior to SQL Server 2019, if you wanted to store Unicode data then nchar and nvarchar were you only options. However, arriving with SQL Server 2019 is the support for UTF-8 Unicode encoding to be used on char and varchar fields. Say what? Does this mean we should never use nchar or nvarchar again then? Not so.

Ok lets firstly look at what is UTF-8?

So UTF-8 is widely used Unicode character encoding (For example its used in Postgres). It differs with UTF-16 in how many bytes it uses to store data. So characters that fall within the Ascii range (0-127) are encoding using 1 byte, above that range, and similar to UTF-16 2 bytes are used to represent characters, but only to a point, above a specific range 3 bytes can be used to represent a character and above that again 4 bytes. So in summary with UTF-8, depending on the character it can take anything from 1 byte to 4 bytes of storage, so you have a split of 4 ranges essentially.

With UTF-16 there is only a split of 2 ranges, the lower range will always use 2 bytes of storage and the upper range will always use 4 bytes of storage.

Ok, looks interesting so how do we use UTF-8 in SQL Sever 2019 then?

So UTF-8 is used in char and varchar data types and it is enabled when you change or create a columns/table collation to a collation that has a UTF8 suffix.

Changing 
LATIN1_GENERAL_100_CI_AS_SC 
to 
LATIN1_GENERAL_100_CI_AS_SC_UTF8.

So the UTF-8 character encoding is available only to Window Collations that allow additional characters.

The UTF-16 character encoding used by nchar and nvarchar will remain unchanged.

So lets looks and some examples.

Latin1_General_100_CI_AI_SC_UTF8 if this collation was used on a column that was say char(15), so this will store data using UTF-8 character encoding. So what this means is that if Ascii characters (in the range 0-127) are stored in this column then it will take up a max of 15 bytes of space. Above that range, 2 bytes, 3 bytes or even up to 4 bytes per character can be used to store a character.

Compare that to a column of type nvarchar(15) which will use UTF-16 character encoding, this has two ranges of which it can either store characeters using 2 or 4 bytes, BUT its ability to store characters only using 2 bytes is a much large range than that of UTF-8.

So moral of the story here is that, Before you choose whether to use UTF-8 or UTF-16 encoding for a database or column, consider the type of data that will be stored in it.

For example, if it is mostly ASCII characters that you are storing and these require 1 bytes in UTF-8 and 2 bytes in UTF-16, storing this data in a char or varchar column with a UTF8 collation as opposed to storing it in a nchar or nvarchar column using UTF-16 could lead to a 50% reduction in space savings.

Above the ASCII range however, such as Latin-based languages mostly require 2 bytes per character so there is not much saving to be gained.

However on the other side, if it was mostly Asian data that you were storing, where each character would typically require 3 bytes in UTF-8, this could be stored using UTF-16 much more effeciently.

So to sum, coming out with SQL Server 2019 you now have the ability to store Unicode characters in char and varchar data types provided the correct UTF-8 collations are used. On whether you use char/varchar or nchar/nvarchar depends on the data you are storing and into what character range it falls. If you have not upgraded to 2019 yet, and you have the need to store Unicode characters then nchar/nvarchar is your only option.

Until next time, Slán!

Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

UTF-16 and UTF-8 Encoding – SQL Server

Published by sqlrebel

One thought on “UTF-16 and UTF-8 Encoding – SQL Server”

Leave a comment Cancel reply

Share this:

Related

Published by sqlrebel

One thought on “UTF-16 and UTF-8 Encoding – SQL Server”

Leave a comment Cancel reply