How to Identify the Cause of Deadlocks and Lockwait Timeouts (And Still Have Time for Coffee Breaks)

Nearly every engineer working with relational database management systems has encountered deadlocks or Lockwait Timeouts, or let’s be honest, been haunted by the nightmare of them.

HealthKart also encountered a similar issue. Back-to-back deadlocks hampered our user experience, especially during Sale Seasons owing to the high concurrency. This kept us all night leaving us begging even for a coffee break.

While there are numerous blogs that help in understanding what deadlocks or Lockwait timeouts actually are and offer solutions to either avoid the issue or minimize it.
For example,

Make changes to the table schema, such as removing foreign key constraints to detach two tables, or adding indexes to minimize the rows scanned and locked.
Keep transactions small and short in duration to make them less prone to collision.
When modifying multiple tables within a transaction, or different sets of rows in the same table, do those operations in a consistent order each time. Then transactions form well-defined queues and do not deadlock

Such solutions can be applied to small applications with relatively light data entries in databases or applications that are being made from scratch.

But these solutions seem infeasible and difficult for an application like Healthkart, which has large backend services with numerous REST APIs, relatively long transaction blocks, numerous concurrent users, and relatively large data in databases.

Breaking the transaction blocks in heavy backend services without knowing the culprit transactions and altering the heavy database tables was relatively impossible for us.

So, it was clear that minimizing or avoiding the deadlocks will only make the monster more powerful. We had to figure out the crucial step that lies between understanding them and resolving them: Identifying the root cause/culprit transaction blocks participating in a deadlock and Lockwait Timeouts.

Problem Statement

Error Logged in Application Log

We came across the following error in our application logs.

2023-05-03 15:05:36.463 ERROR [demo,be80487cd442ff4e,b9e94dec7cb710f0,false, ] 13787 --- [io-8080-exec-45] .engine.jdbc.spi.SqlExceptionHelper(142) : Deadlock found when trying to get lock; try restarting transaction

2023-05-03 15:05:36.462 ERROR [demo,be80487cd442ff4e,be80487cd442ff4e,false, ] 10384 --- [io-8080-exec-45] .hk.rest.resource.cart.CartResource(361) : CRITICAL ERROR - org.springframework.dao.CannotAcquireLockException: could not execute statement; SQL [n/a]; nested exception is org.hibernate.exception.LockAcquisitionException: could not execute statement

The above error caused the API to terminate with a status code of 500, resulting in a poor user experience. With the help of application logs, We were able to identify the API in which the deadlock occurred. But we failed to understand what queries were involved in it. We had to dig deeper.

Error Logged in MySql Error Log file

We had the following MySQL error log at our disposal

2023-05-03 15:04:36 7f3e95dfd700

*** (1) TRANSACTION:

TRANSACTION 46429732, ACTIVE 22 sec starting index read

mysql tables in use 1, locked 1

LOCK WAIT 14 lock struct(s), heap size 2936, 15 row lock(s), undo log entries 2

MySQL thread id 1043926, OS thread handle 0x7f3e9ad35700, query id 29356871 updating

UPDATE test_user_cart SET value_addition = 1 WHERE id = 34699

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 4467 page no 68318 n bits 272 index `PRIMARY` of table `test_user_cart` trx id 46429732 lock_mode X locks rec but not gap waiting

*** (2) TRANSACTION:

TRANSACTION 46429733, ACTIVE 20 sec starting index read

mysql tables in use 2, locked 2

4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1

MySQL thread id 1043927, OS thread handle 0x7f3e95dfd700, query id 29356872 updating

update test_user_rewards set reward_value = 0 where id = 34578

*** (2) HOLDS THE LOCK(S):

RECORD LOCKS space id 4467 page no 68318 n bits 272 index `PRIMARY` of table `test_user_cart` trx id 46429733 lock_mode X locks rec but not gap

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 5136 page no 1024353 n bits 472 index `contact_index` of table `hk_cat`.`user` trx id 46429733 lock_mode X waiting


*** WE ROLL BACK TRANSACTION (2)

According to the above log, update test_user_rewards set reward_value = 0 where id = 34578 and update test_user_cart SET value_addition = 1 WHERE id = 34699 were part of transactions involved in the deadlock. But Neither the `test_user_cart` nor `test_user_rewards` tables are interdependent through any entity relationship. How are they ending up in a deadlock?

Analysis of the MySQL and Application Logs.

Note: In Spring Boot, while executing a @Transactional method, all the SQL statements executed in the method share the same database transaction.

Taking the above statement into consideration and observing both log files and entities in our database, it seems that the queries mentioned in the MySQL error log, are not the sole cause of the problem. Instead, they are part of a transaction block that is contributing to the issue.

We tried to dig this further through APM tools however we realized that all modern APM tools fail to link database errors and its underlying APIs, hence of no use.

Since we already knew which API ended with the “Deadlock found when trying to get lock; try restarting transaction” error. Hence we know one of the transaction blocks of the deadlock and the queries executed within the block.

The real challenge was identifying the successfully executed transaction block or API that contributed to the deadlock.

How We Overcame the Nightmare: Identifying the Root Cause of Deadlocks and LockWait Timeouts

In any relational database, each transaction block is assigned a unique transaction id. Our MySQL error log is logging the transaction ID of both successful as well as rolled-back transactions.
Our idea was to include this MySQL transaction ID in our application log so that we could pinpoint which successful API call executed the transaction and what queries are executed within that transaction which caused the other to roll back.

Note: Although we use Spring Boot for transaction management, it’s worth noting that the transaction ID obtained via the TransactionSynchronizationManager class in Spring Boot is not the same as the MySQL transaction ID found in our MySQL error log.

In MySQL, we can obtain the transaction information including transaction id, rows locked, and rows modified for the current transaction using the below query

SELECT tx.trx_id,trx_started, trx_rows_locked, trx_rows_modified FROM information_schema.innodb_trx tx WHERE tx.trx_mysql_thread_id = connection_id();

Implementation: Logging transaction id against each Transaction block in our Application Log

We incorporated the above query into the currently running transactions in our Spring Boot application by utilizing the Spring AOP (Aspect-Oriented Programming) concept.

Our implementation uses a POJO called TransactionMonitor to store transaction information in the thread-local of each incoming API request.

This TransactionMonitor POJO contains several fields to keep track of important transaction details such as the transaction ID, transaction name, parent method name, start time (in milliseconds), time taken for completion (in seconds), rows locked, and rows modified.

We have created a pointcut on the Spring @Transactional annotation with the following code:

@Pointcut("@annotation(transactional)")
public void transactionalMethods(Transactional transactional{
}

This pointcut will match any method annotated with @Transactional.

Next, we registered the @Before advice for the above jointpoint. Here, we used the TransactionSynchronizationManager to override the beforeCompletion() and afterCompletion() methods of the currently running transaction.

@Before("transactionalMethods(transactional)")
public void profile(JoinPoint joinPoint, Transactional transactional) throws Throwable {

        Propagation level = transactional.propagation();
        String methodName=joinPoint.getSignature().getName();

        if(TransactionSynchronizationManager.isSynchronizationActive()) {
            TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {

                @Override
                public void beforeCompletion() {
                    try {
                        executeBeforeTransactionMonitor(level, methodName);
                    }catch (Exception e){
                        log.error("Exception occurred while executing before commit {}",e);
                    }
                }

                @Override
                public void afterCompletion(int status){
                    try {
                        executeAfterTransactionMonitor(level, methodName);
                    }catch (Exception e){
                        log.error("Exception occurred while executing after completion {}",e);
                    }
                }
            });
        }
    }

It is worth noting how transaction propagation works in Spring. By default, the propagation level is set to REQUIRED. This means that Spring checks if there is an active transaction and if none exists, it creates a new one. Otherwise, the business logic appends to the currently active transaction.

It is important to handle cases where the parent method and some branching child methods have the @Transactional annotation. In such cases, the transactions will be the same. It is crucial to handle these cases properly to avoid triggering unnecessary MySQL query.

private void executeBeforeTransactionMonitorV1(Propagation level, String methodName) {

        String trxName = TransactionSynchronizationManager.getCurrentTransactionName();
        boolean readOnly = TransactionSynchronizationManager.isCurrentTransactionReadOnly();
        boolean executeQuery = false;

        if (!readOnly && (HkThreadLocal.getTrxDetails(trxName) == null || HkThreadLocal.getTrxDetails(trxName).getParentMethodName().equals(methodName))) {
            executeQuery = true;
        }

        if (executeQuery) {
            TransactionMonitor res = getTransactionDetailsFromDb();
            if(res != null) {
                res.setParentMethodName(methodName);
                res.setTransactionName(trxName);
                HkThreadLocal.setTrxDetails(trxName, res);
            }
        }
    }

The purpose of this method is to check whether the current transaction has the same name and method as a previous transaction.
If the transaction name is the same but the method name is different, it signifies that one parent transaction executes many transactional methods.
- In this case, the child transactional methods will not execute the query and will not reset the start time as the transaction is started when the parent transaction is started and completed only when that parent transaction is completed.

On the other hand, if the transaction name and method name are the same, the same transactional method is running in new transactions multiple times for a particular API.
- In this case, the method will reset the start time every time and execute the query.

private void executeAfterTransactionMonitorV1(Propagation level, String methodName) {

        String trxName = TransactionSynchronizationManager.getCurrentTransactionName();
        if (HkThreadLocal.getTrxDetails(trxName) == null) {
            log.info("Information of transaction for transaction name " + trxName + " and method name " + methodName + " doesnt exists in thread-local");
            return;
        }

        TransactionMonitor transactionMonitor = HkThreadLocal.getTrxDetails(trxName);

        transactionMonitor.setDiff((double) ((System.currentTimeMillis() - transactionMonitor.getStartTime()) / 1000));
        HkThreadLocal.setTrxDetails(trxName, transactionMonitor);
        TransactionMonitor monitor = HkThreadLocal.getTrxDetails(trxName);

        if(monitor != null) {
            if(monitor.getRowsLocked() > 0) {
                log.info("Transaction Monitoring Log : " + gson.toJson(monitor.getParameters(methodName)));
            }

            HkThreadLocal.removeTrxDetails(trxName);
        }
    }

The method calculates the total time the transaction takes and then checks if the number of rows locked during the transaction is greater than 0. If it is, it logs the transaction monitoring information.

Finally, the below method retrieves data from a MySQL database.

private TransactionMonitor getTransactionDetailsFromDb(){

        String sql = "SELECT tx.trx_id,trx_started, trx_rows_locked, trx_rows_modified " +
                "FROM information_schema.innodb_trx tx WHERE tx.trx_mysql_thread_id = connection_id()";

        List<Object[]> res = entityManager.createNativeQuery(sql).getResultList();

        if(res != null && res.size() > 0) {
            Object[] result = res.get(0);
            if(result != null && result.length > 0) {
                TransactionMonitor trxMonitor = new TransactionMonitor();
                trxMonitor.setTransactionId(String.valueOf(result[0]));
                trxMonitor.setStartTime(result[1] != null ? ((Timestamp) result[1]).getTime() : 0L);
                trxMonitor.setRowsLocked(result[2] != null ? ((BigInteger) result[2]).intValue() : 0);
                trxMonitor.setRowsModified(result[3] != null ? ((BigInteger) result[3]).intValue() : 0);
                return trxMonitor;
            }
        }
        return null;
}

Note: Executing the query in the above approach requires the user to have the PROCESS privilege of MySQL.

Result Analysis

We already had the below log to identify the API that failed due to the error :

In the case of Deadlock

2023-05-03 15:05:36.463 WARN [,fd583b14f8d89413,fd583b14f8d89413] 51565 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper: SQL Error: 1213, SQLState: 40001

2023-05-03 15:05:36.464 ERROR [,fd583b14f8d89413,fd583b14f8d89413] 51565 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper: Deadlock found when trying to get lock; try restarting transaction

In the case of LockWait timeout

2023-05-04 12:27:56.805  WARN [,8d2ce62e69e21438,8d2ce62e69e21438] 114297 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 1205, SQLState: 40001

2023-05-04 12:27:56.805 ERROR [,8d2ce62e69e21438,8d2ce62e69e21438] 114297 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper   : Lock wait timeout exceeded; try restarting transaction

Finally, Our application will start generating the following log upon completion of the successful transaction:

2023-05-03 15:05:36.671 INFO [,3a6df4df0f3b46bb,3a6df4df0f3b46bb] 51565 --- [nio-8080-exec-1] com.example.demo.aop.DemoAspect: Transaction Monitoring Log: {"rowsModified":3,"currentMethodName":"deadlockExample","totalTime":82.0,"rowsLocked":15,"trxId":"46429732","trxName":"com.example.demo.service.TransactionErrorServiceImpl.deadlockExample"}

We can easily extract Transaction Id ( to detect deadlocks) and Total Time taken ( to detect which transaction exceeded configured lock wait timeout).

Now we can identify the API that caused the deadlock or Lockwait Timeout and look for a solution to resolve the issue.

Appendix

Deadlock occurs when two or more transactions are waiting for each other to release locks on resources, resulting in a situation where none of the transactions can proceed. Here’s an example:

Suppose we have two transactions, T1 and T2, accessing the same database. T1 wants to update row 1 and then row 2, while T2 wants to update row 2 and then row 1. If T1 locks row 1 and then tries to lock row 2, and at the same time T2 locks row 2 and then tries to lock row 1, a deadlock will occur. Neither transaction can proceed because they are both waiting for the other to release the lock on the resource they need.

Here’s a simplified version of the SQL code for T1 and T2:

BEGIN TRANSACTION;
SELECT * FROM table WHERE id=1 FOR UPDATE;
SELECT * FROM table WHERE id=2 FOR UPDATE;
UPDATE table SET column1=value1 WHERE id=1;
UPDATE table SET column2=value2 WHERE id=2;
COMMIT;

BEGIN TRANSACTION;
SELECT * FROM table WHERE id=2 FOR UPDATE;
SELECT * FROM table WHERE id=1 FOR UPDATE;
UPDATE table SET column2=value2 WHERE id=2;
UPDATE table SET column1=value1 WHERE id=1;
COMMIT;

Lock wait timeouts refer to a situation where a transaction in a database is blocked from proceeding because it is waiting for a lock on a resource that is currently held by another transaction. When this happens, the blocked transaction will wait for a certain period of time for the lock to be released by the other transaction, before timing out and throwing a lock wait timeout error.

In some cases, the transaction holding the lock may be waiting for another resource, which in turn is held by a different transaction, creating a chain of dependencies that can lead to longer wait times and potential deadlock situations.

Lock wait timeouts can be a symptom of larger performance issues in a database system, and can lead to slow query response times, reduced throughput, and application errors.

Reference: Spring Transaction propagation

The above outcome is based on our work at HealthKart, your experience may vary and would love to hear it in the comment section.

Photo by Kevin Ku: https://www.pexels.com/photo/data-codes-through-eyeglasses-577585/

Fixing MySQL errors (Lockwait timeout/ Deadlocks) In High Concurrent Spring Boot Transactional Systems

Problem Statement

Error Logged in Application Log

Error Logged in MySql Error Log file

Analysis of the MySQL and Application Logs.

How We Overcame the Nightmare: Identifying the Root Cause of Deadlocks and LockWait Timeouts

Implementation: Logging transaction id against each Transaction block in our Application Log

Result Analysis

In the case of Deadlock

In the case of LockWait timeout

Appendix

Leave a comment Cancel reply

Problem Statement

Error Logged in Application Log

Error Logged in MySql Error Log file

Analysis of the MySQL and Application Logs.

How We Overcame the Nightmare: Identifying the Root Cause of Deadlocks and LockWait Timeouts

Implementation: Logging transaction id against each Transaction block in our Application Log

Result Analysis

In the case of Deadlock

In the case of LockWait timeout

Appendix

Share this:

Related

Leave a comment Cancel reply