AWS DMS Hidden Gotcha’s Part II

Back in March 2022 I wrote a blog post, AWS DMS Hidden Gotcha’s Part I; today’s post is a continuation of it, outlining some more quirks of DMS that I have become aware of from working with it on a daily basis.

Index creation locking a system table causes the DMS task to fail:

Ok, so this is a strange one, and certainly one I would not have thought of had it not happened to us on one of our production databases, so here is the scenario.

We have a SQL Server database, which we will call “MyFirstDatabase”, and in it we have many tables. Three of these tables (Table_A, Table_B and Table_C) we replicate over to another location using AWS DMS. We have another table in the database, Table_Z, which is not in any DMS task and has never been marked for replication. We would like to put an index on this table for performance reasons, and we are ok that this will be an offline operation, i.e. the table will be locked for the duration, as that table is currently not in use by our clients.

Now, since the index that we are putting on Table_Z is somewhat large (it’s 200 GB), we will change the recovery model of our database from “Full” to “Bulk_Logged” so that we reduce the amount of data written to the log. We are conscious that DMS will have to read through everything that gets written to the log, and this could potentially slow down the replication of our data by DMS.
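For reference, the change looks roughly like the below; the index definition and column name are illustrative, not the real ones.

-- Reduce logging for the index build, then switch back afterwards.
ALTER DATABASE MyFirstDatabase SET RECOVERY BULK_LOGGED;
GO

-- Illustrative index definition; the real one comes in at roughly 200 GB.
CREATE NONCLUSTERED INDEX IX_Table_Z_SomeColumn
    ON dbo.Table_Z (SomeColumn);
GO

ALTER DATABASE MyFirstDatabase SET RECOVERY FULL;
GO

-- Take a log backup once back in FULL; you cannot do a point-in-time restore
-- into the log backup that contains the bulk-logged work.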

Ok, so all good so far. We kick off the index creation, which we estimate should take approximately 25 minutes to complete. Things are all good to begin with, but after a few minutes our monitoring begins to pick up errors in the CloudWatch logs of our DMS task. Wondering what’s going on, we run sp_WhoIsActive on the source server, and what do we see only the below query being blocked by the index creation.

select count(*) from sysobjects o,sys.system_internals_partitions s
where o.id=s.object_id
and o.type=N'U'
and s.is_data_row_format=1
and s.partition_id=72057594205241344

So DMS is attempting to run this query and its wait type is “LCK_M_S”. Let’s have a closer look at what locks it is attempting to take out.
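(The lock detail below is the sort of thing you get back from sp_WhoIsActive, Adam Machanic’s community stored procedure, which needs to be installed on the server already. The exact parameters are down to preference, but a call along these lines is enough.)

EXEC sp_WhoIsActive
     @find_block_leaders = 1,  -- identify sessions at the head of a blocking chain
     @get_locks = 1;           -- include each session's locks as XML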

<Locks>
  <Lock request_mode="S" request_status="GRANT" request_count="1" />
</Locks>
<Objects>
  <Object name="sysrowsets" schema_name="sys">
    <Locks>
      <Lock resource_type="KEY" index_name="clust" request_mode="S" request_status="WAIT" request_count="1" />
      <Lock resource_type="OBJECT" request_mode="Sch-S" request_status="GRANT" request_count="1" />
    </Locks>
  </Object>
</Objects>

Ok, so it’s attempting to take a shared lock on sysrowsets, but it cannot be granted because, for some reason, the index creation is blocking it. In that case the next place to go is to see what kind of locks the index creation has taken out on the system tables. But why sysrowsets? I thought it was querying sysobjects and sys.system_internals_partitions? Well, sysrowsets is one of the underlying tables behind those views.

<Object name="sysrowsets" schema_name="sys">
  <Locks>
    <Lock resource_type="KEY" index_name="clust" request_mode="X" request_status="GRANT" request_count="1" />
    <Lock resource_type="OBJECT" request_mode="IX" request_status="GRANT" request_count="1" />
  </Locks>
</Object>

So the above is one of the sets of locks taken out by the index creation. As you can see, it holds an exclusive key lock on sysrowsets (along with an intent-exclusive lock on the table itself), hence why the query that the DMS task is trying to run is being blocked.
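If you want to cross-check this outside of sp_WhoIsActive, a quick look at sys.dm_tran_locks for the index-creation session shows the same picture; the session id below is just a placeholder.

-- 137 stands in for the session id of the CREATE INDEX statement.
SELECT resource_type,
       resource_associated_entity_id,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE request_session_id = 137;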

After a while the DMS task will fail and show a status of “ERROR”.

It’s not all bad news! Once the index had finished creating we could resume the DMS tasks without any issues. But still, this is far from an ideal situation. Why does DMS need to periodically run the query checking the internal partitions? I cannot find any AWS documentation on why it needs to do so; your guess is as good as mine.

So to sum up, we had been prepared for creating an index to lead to one large transaction being written to the log file, which could cause our DMS task to fall behind, but we never thought it would end up failing due to a locking issue on system tables. Going forward, if adding or rebuilding an index of any considerable size, we will have to ensure that our DMS tasks are stopped in advance. Just one final note: this will happen regardless of whether it is a “Full Load and Replicate” task or a “Replication Only” task.

The order in which a DMS task applies cached changes:

This next issue applies when you have a “Full Load and Replicate” DMS Task.

Ok, so we have an in-house process whereby we are replicating data from one of our Aurora PostgreSQL databases down to S3 in Parquet format, where it is then processed by a Glue job and upserts are done into a Hudi table. And yes, you guessed it, this is where the problems begin to arise.

Let’s go back a minute and talk about how DMS handles a “Full Load and Replicate” task. DMS initially takes a schema modification lock on the table (a note of caution here: if this lock request is kept waiting, it will make every other request queue up behind it). This lock is only held briefly, and then the data begins to be read from the data pages that make up the table. No snapshot or table locks preventing writes are taken on the table, but any writes that occur will be recorded in the transaction log, and these cached changes will be applied once the full load has completed.

Now imagine a scenario where we have Table_A, which has say 1 million rows. While the full load is running, an update is applied to a row that has already been read during the full load, changing the value in the name column from ‘John’ to ‘Bill’. This gets written to the transaction log, and DMS will read it once it has finished the full load. OK, so what’s the problem, you might ask? The core of the issue is how DMS applies timestamps to these records. The update that changed the name from ‘John’ to ‘Bill’ ends up with an older timestamp than the full-load read of the row, i.e. the ‘Insert’. So the “Update” is treated as older than the “Insert”. This is obviously a major issue: the current and correct value for that row is ‘Bill’, but instead the update will be applied first (the value ‘Bill’) and then the insert will be applied after it (‘John’), so now the Hudi table has the incorrect value.

Yes, I agree this is a very specific problem, and one that would not be an issue if we were replicating into a relational database, but to me this seems like a major flaw in DMS. Remember, these are all core AWS services: DMS, S3, Glue. Surely DMS should have the ability to handle this situation.

Are there ways around this? There sure are, but they all require extra work from the developer and could be tricky. Perhaps the easiest solution would be to add your own timestamp column, but that means having to change your source database just so that you can get the data working downstream. Another alternative would be to select the “Stop before applying cached changes” option, so that your DMS task will do the full load and then stop, without applying any of the data operations which happened during the full load. You would then manually change the timestamp of the insert records to some arbitrary earlier date and resume your DMS task; the cached changes will now get applied, and they will all have a timestamp after the full-load data. (h/t to my co-worker Tommy for thinking of the last solution.)
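For the first option, the change on the source side is fairly small. A rough sketch of what it could look like on the Aurora PostgreSQL source is below; the column, function and trigger names are made up for the example, and it assumes PostgreSQL 11 or later for the EXECUTE FUNCTION syntax.

-- Add a last-modified column so the downstream Glue job can order records itself,
-- rather than relying on the timestamps DMS assigns.
ALTER TABLE public.table_a
    ADD COLUMN last_modified timestamptz NOT NULL DEFAULT now();

CREATE OR REPLACE FUNCTION public.set_last_modified()
RETURNS trigger AS $$
BEGIN
    NEW.last_modified := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Keep the column up to date on every UPDATE.
CREATE TRIGGER trg_table_a_last_modified
    BEFORE UPDATE ON public.table_a
    FOR EACH ROW
    EXECUTE FUNCTION public.set_last_modified();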

As always, if you liked this blog please follow me on Twitter to be kept up-to-date with my latest blog releases and more fun stuff.

Until next time, Slán!

