Fun with AWS Availability Zone Names and RDS Cloning Across Accounts

Well, the above is a bit of a mouthful for a blog title, but hey, it best describes what I’m going to talk about today.

One of the very first things I remember learning about when I started my cloud journey into AWS, way back in 2018, was how AWS splits the world up into regions, and how each region is further split into at least three (and often many more) Availability Zones, from here on known simply as AZs, with a physical data center located in each one. I also remember reading that the names AWS gives to AZs, such as AZ1a and AZ1b, are just logical names, mapped randomly to the actual physical AZ locations. So AZ1a in my account will possibly reference a different physical AZ than AZ1a in your AWS account.
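If you want to see that mapping for a given account, the AZ IDs are exposed alongside the names. Here is a small Terraform sketch of how you might surface it; the data source is aws_availability_zones, the output name is just illustrative, and it assumes the names and zone_ids lists come back in matching order:

```hcl
# Surface how AZ names map to the underlying AZ IDs in the current account.
# Real AZ IDs look more like "euw1-az2"; the "az4" style used later in this
# post is a simplification.
data "aws_availability_zones" "current" {
  state = "available"
}

output "az_name_to_id" {
  # Zip the name and ID lists together to get a name -> ID map for this account
  value = zipmap(
    data.aws_availability_zones.current.names,
    data.aws_availability_zones.current.zone_ids
  )
}
```

Run against two different accounts, this output is where you would see that the same AZ name points at different AZ IDs.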

All good so far, and you the reader are probably already aware of this; to be fair, it makes sense. Let’s be honest: if, for example, AZ1a mapped to the same physical location for every account in AWS, the likelihood is that, despite recommendations from AWS, most people would still put most of their resources into AZ1a and then AZ1b, and the AZs at the beginning of the alphabet would get overloaded. This is all good and no complaints on my side. Where things get a little messy, and where I have some minor complaints with AWS, is when you are doing an RDS clone across accounts. So let me give a rundown of the issue we faced recently.

We have an Aurora PostgreSQL cluster in what we will call Account Prod. When we were creating this cluster, AWS required us to specify a minimum of three AZs, which are the AZs that our storage is going to be replicated across, thereby increasing data redundancy. So naturally we chose AZ1a, AZ1b and AZ1c, how original of us 🙂 We coded all of this up in Terraform and created our cluster in Account Prod.
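For context, a stripped-down sketch of what that Terraform might look like is below; the identifiers, engine version, region-style AZ names and credentials handling are all placeholders rather than our actual config:

```hcl
# Minimal sketch of an Aurora PostgreSQL cluster pinned to three AZ names.
# Every value here is illustrative.
variable "master_password" {
  type      = string
  sensitive = true
}

resource "aws_rds_cluster" "main" {
  cluster_identifier = "app-cluster"
  engine             = "aurora-postgresql"
  engine_version     = "14.6"

  # The three AZ *names* the cluster storage is replicated across
  availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  master_username     = "postgres"
  master_password     = var.master_password
  skip_final_snapshot = true
}
```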

We also have another account, which we will call Account Test, and we ran our Terraform code against this account as well to create the same cluster. So all good: we have two different clusters in two different AWS accounts, and they share the same code base.

A situation arises where we need to do some load testing (we want to test a minor version upgrade of the cluster) and the size of the data that we have in our cluster on our test account just will not do. So we have two possible solutions:

  1. Take a snapshot of our production cluster and restore it to the test account
  2. Take a clone of our production cluster and restore it to the test account

We decide to go with option 2. Firstly, it works out cheaper: we will not be paying for duplicate storage, because with a database clone the clone shares the same storage as the original cluster until writes are made, and this suits us as we will be writing very little data during our load testing. Secondly, cloning is also much faster than a restore since, as already mentioned, the data does not need to be copied over.

So we destroy our cluster in our test account, do a database clone of the cluster from our production account, and hey presto, we are good to start our load testing.
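For what it’s worth, a copy-on-write clone can also be expressed in Terraform, roughly along these lines. This is just a sketch: the cross-account sharing of the source cluster (via AWS RAM) is not shown, and the source cluster ARN, account ID and identifiers are placeholders:

```hcl
# Sketch of restoring a copy-on-write clone of an existing Aurora cluster.
# The source cluster ARN and all names here are placeholders.
resource "aws_rds_cluster" "clone" {
  cluster_identifier = "app-cluster-loadtest"
  engine             = "aurora-postgresql"

  restore_to_point_in_time {
    source_cluster_identifier  = "arn:aws:rds:eu-west-1:111111111111:cluster:app-cluster"
    restore_type               = "copy-on-write"
    use_latest_restorable_time = true
  }

  skip_final_snapshot = true
}
```

Note that, as discussed further down, there is no availability_zones argument here: a clone lands on whatever storage the source cluster already uses.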

This is where things get a bit weird. We then modify our Terraform code base to reflect the minor version upgrade of the cluster and run a terraform apply against our test account. What we expect to see is Terraform telling us it will upgrade the cluster; what we instead see is Terraform telling us it is going to destroy and recreate our cluster, replacing AZ1c with AZ1d. WHAT NOW? AZ1d? Where is this coming from, and why is it attempting to change the AZs of our underlying storage?

The answer, of course, is that, as mentioned at the start of this article, AZ names do not map to the same physical locations across accounts. So what has happened here is the following:

  • We have AZ1a, AZ1b & AZ1c specified in our code base. When we first deployed this to our test account, each AZ name was mapped to an underlying AZ ID, so for example we might have the following:
    • AZ1a – az4
    • AZ1b – az6
    • AZ1c – az1
  • When we deployed the same code base to our prod account, our AZ names were mapped to the following AZ IDs:
    • AZ1a – az2
    • AZ1b – az4
    • AZ1c – az6

So you are probably beginning to see the mismatch here, and it becomes a problem when you restore a clone from our prod account into our test account, because what the clone carries is the AZ IDs, not the AZ names. So when the clone is being restored, it will do the following:

  • Create storage in AZ ID az4, which maps to AZ1a in our test account; this does not cause us issues with Terraform as we have 1a specified in our code base.
  • Create storage in AZ ID az6, which maps to AZ1b in our test account; again, this does not cause us issues with Terraform as we have 1b specified in our code base.
  • Create storage in AZ ID az2, which maps to AZ1d in our test account. Oh no! We do not have a 1d in our code base, so the clone has basically created a cluster with storage in 1a, 1b and 1d. This, of course, is not what we have in our code base, and when we do a terraform apply, it tries to destroy the cluster and create a new one (a quick way to confirm what the clone actually reports is sketched just below).
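If you want to confirm this for yourself, a small Terraform sketch like the following can read back which AZ names the restored clone reports in the test account; the data source is aws_rds_cluster, and the cluster identifier is a placeholder:

```hcl
# Read back the restored clone and report which AZ names its storage sits in.
# The cluster identifier is a placeholder for whatever the clone is called.
data "aws_rds_cluster" "clone" {
  cluster_identifier = "app-cluster-loadtest"
}

output "clone_storage_azs" {
  # In our case this came back as 1a, 1b and 1d rather than the 1a/1b/1c in code
  value = data.aws_rds_cluster.clone.availability_zones
}
```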

In hindsight it might seem obvious that this was going to happen, but it can also leave you scratching your head for a few minutes before you get that light bulb moment.

If you think about it, it makes total sense that you cannot specify which AZs the storage goes in when you are creating a clone: the whole selling point of a clone is that it shares the same storage, so allowing a user to specify different AZs would mean AWS potentially having to move the data into a different AZ every time you did a clone, which defeats the purpose of a clone. I get that.

The issues I have here with AWS are the following:

  1. Why ask for a minimum of three AZs at all when creating a cluster? Why not just assign these automatically under the hood? Maybe I’m missing something here, but what is the gain for the customer in specifying these AZs? Remember, these are the AZs of the storage, not the instances, so you do not pay cross-AZ charges. I suppose there might be a slight performance gain in having the writer in the same AZ as your storage, but couldn’t AWS handle this for you?
  2. Could AWS ensure that all accounts within the same organization have the same mapping? Surely that would not cause too many issues.

Yes, I know these are minor complaints, and to be fair this is easily fixed: modify your code base so that AZ1d is included for the test account and AZ1c for the prod account. This is not difficult to implement in Terraform; I just feel that, on the whole, it is something that could be handled better by AWS.
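As a sketch of that fix, the AZ list can be driven from a per-account variable rather than hard-coded; the variable names, file names and AZ values below are illustrative and build on the earlier cluster sketch:

```hcl
# variables.tf - each account supplies its own list of storage AZ names
variable "storage_availability_zones" {
  description = "AZ names the cluster storage is pinned to in this account"
  type        = list(string)
}

variable "master_password" {
  type      = string
  sensitive = true
}

# main.tf - the cluster references the variable instead of a hard-coded list
resource "aws_rds_cluster" "main" {
  cluster_identifier  = "app-cluster"
  engine              = "aurora-postgresql"
  availability_zones  = var.storage_availability_zones
  master_username     = "postgres"
  master_password     = var.master_password
  skip_final_snapshot = true
}
```

The prod account’s tfvars would then carry something like ["eu-west-1a", "eu-west-1b", "eu-west-1c"], while the test account’s would carry ["eu-west-1a", "eu-west-1b", "eu-west-1d"] to match where the cloned storage actually landed.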

OK, rant over. I hope you found this blog helpful, and whether you agree or disagree with me, keep checking in on sqlrebel.org.

Until next time, Slán!

