Lock Statistics: Diagnose performance issues in Cloud Spanner

You want to investigate the root cause of this latency spike. You know that there’s no recent configuration or schema change to the database. The only thing you can think of is the new deployment a few hours ago that might have introduced a new access pattern. You think there might be something suspicious with the transactions that were rolled out with that deployment. But how can you find the smoking gun to prove it? How can you find out whether lock conflicts are causing the latency spike? If lock conflicts are indeed the culprits, then which table cells in the database are transactions trying to lock?

Cloud Spanner and Locking

The good news is that Cloud Spanner’s introspection tools and, in particular, the recently added Lock Statistics feature, can help you answer these questions.

Cloud Spanner – Google’s fully managed horizontally scalable relational database service, offers the strictest concurrency-control guarantees, so that you can focus on the logic of the transaction without worrying about data integrity. To give you this peace of mind, and to ensure consistency of multiple concurrent transactions, we use a combination of shared locks and exclusive locks at the table cell level. And, as you know, with locks comes the potential for lock conflicts. When several transactions try to take a lock on the same cell, lock conflicts can occur which can lead to a performance hit on your database.

What is Lock Statistics

Lock statistics is another introspection tool we recently added to the collection of features that help you analyze and solve issues in your Cloud Spanner database. As its name suggests, Lock Statistics exposes data about locks, lock wait times, which table cells are involved in locks, and so on. As with the other introspection tools, this data is exposed in Lock Statistics through a set of built-in statistics tables. 

  • Aggregated Lock Statistics Table. The aggregated table contains the total lock wait time for every 1-minute, 10-minutes, and 60-minutes period. The total lock wait time can be used to monitor overall application health and correlate with latency spikes.

  • Top Lock Statistics Table. The top statistics table contains sampled cells that incur the longest lock wait time for every 1-minute, 10-minutes, and 60-minutes period as well. The top statistics table helps to pin down specific data cells that incur the longest lock wait and identify the transactions that contend for locks.

Each row in the top lock statistics table represents the lock conflict which causes the longest wait time in the given time period. It reveals the start key of the locked key range, the wait time the transactions spent on the certain conflicts, and sample columns that had conflicts along with the lock mode. This helps you to pin down to specific data cells that incur the longest lock wait and identify what are the transactions which contend for locks.

You can find more details in the Introspection Tools section of the documentation, which covers query, transaction, read and lock statistics, as well as information on how to discover your oldest active queries. 

Let’s now look at a pattern that combines Lock Statistics with other statistics in order to solve the problem we discussed in the example scenario. 

How to use Lock Statistics

So, how can Lock Statistics help us in our quest to understand, and hopefully mitigate, the problem that led to the latency issues we discovered in our example scenario ? As you’ll see, the true power of the introspection tools collection comes out when we combine information from these tools over the time period during which our problems started to occur. 

Are lock conflicts the root cause of the latency spikes?

To answer this question, we can examine average latency numbers and lock wait times during the time period when our problems first occurred. 

Transaction Statistics provides commit latency information and Lock Statistics provides lock wait time information. If lock conflicts were the reason for the increase in commit latency, then you should be able to see a correlation when we join, for example, (TXN_STATS_TOTAL_10MINUTE) with lock statistics (LOCK_STAT_TOTAL_10MINUTE) as in the following example. The following sample query tells us the average commit latency and total lock wait time for every 10 minute interval.

What's your reaction?

In Love
Not Sure

You may also like

More in:GOOGLE

Leave a reply

Your email address will not be published. Required fields are marked *