Tracing Back Errors in Data

If you rigorously check the results of your queries and updates, you’ll catch many of the problems that could otherwise go undetected for weeks and cause a lot of grief when the problems finally grow too large to miss. But problems do creep up on you. Sometimes a SELECT suddenly starts returning wrong results, but your experiments with the query just confirm there is nothing wrong with it.

In this case, you need to imitate user actions, but in reverse order, until you find the source of the error. If you are lucky, you will catch the problem in a single step. Usually it will take multiple steps, and sometimes a very long time.

A lot of these issues happen because the data is different on the master and slave in a replication environment. One common problem is duplicate values where they are supposed to be unique (e.g., if a user relies on an INSERT ON DUPLICATE KEY UPDATE statement but a table has a different structure on the master and slave). For such setups, the user usually notices the problem later when SELECT statements query the slave, instead of noticing them when the INSERT takes place. Things become even worse when this happens during circular replication.

To illustrate this problem, we’ll work with a stored procedure that inserts into a table from a temporary table that was created to hold the results of other selects. This is another example of a common technique when a user wants to handle data from large tables without the risk of modifying data inadvertently or blocking other applications that are using the large tables.

Let’s create our table and populate it with temporary values. In a real application, the temporary table would hold a result set from some calculation that is waiting to be stored in the main table:

CREATE TABLE t1(f1 INT) ENGINE=InnoDB;
CREATE TEMPORARY TABLE t2(f1 INT) ENGINE=InnoDB;

Now create values in the temporary table:

INSERT INTO t2 VALUES(1),(2),(3);

The stored routine moves data from the temporary table into the main table. It checks first to make sure something is in the temporary table before doing the move. Our version looks like this:

CREATE PROCEDURE p1()
BEGIN
DECLARE m INT UNSIGNED DEFAULT NULL;
CREATE TEMPORARY TABLE IF NOT EXISTS t2(f1 INT) ENGINE=InnoDB;
SELECT MAX(f1) INTO m FROM t2;
IF m IS NOT NULL
THEN
  INSERT INTO t1(f1) SELECT f1 FROM t2;
END IF;
END
|

This routine creates the temporary table if it does not exist when the routine is called. This prevents errors that would be caused if the temporary table does not exist, but at the same time leads to new issues, as we will see.

Note

The example uses the MAX function just to check whether there is at least one row in the table. I prefer MAX to COUNT because InnoDB tables do not store the number of rows they contain, but calculate this value every time the COUNT function is called. Therefore, MAX(indexed_field) is faster than COUNT.

If a slave restarted after the first insert but before the stored procedure call, the temporary table on the slave would be empty and the main table on the slave would have no data. In that case, we will get the following on the master:

mysql> SELECT * FROM t1;
+------+
| f1   |
+------+
|    1 |
|    2 |
|    3 |
+------+
3 rows in set (0.03 sec)

Whereas on the slave we get:

mysql> SELECT * FROM t1;
Empty set (0.00 sec)

Even worse, if we insert into t1 after the stored procedure call runs, we will have a total mess in the slave’s data.

Suppose we notice the error in an application that reads data from the main table. We now need to find out how data has been inserted into the slave table: was it a direct update on the slave, or was data replicated from the master?

Warning

MySQL replication does not check data consistency for you, so updates of the same objects using both the SQL replication thread and the user thread on the slave leave the data different from the master, which in turn can lead to failure during later replicated events.

Because we imitated this situation in our example, we know at this point why data corruption happened: the slave was restarted after the first insert but before the stored procedure call. In a real-life situation, issues tend to be noticed later when a user issues a select:

mysql> SELECT * FROM t1;
Empty set (0.00 sec)

When you see unexpected results from SELECT, you need to find out whether this is caused by the query itself or by something that went wrong earlier. The query just shown is so simple that it couldn’t produce errors unless the table was corrupted, so we must try to go back to see how the table was modified.

Our generic example is in a replicated environment with a read-only slave, so we can be sure that the wrong data arose in one of two ways: either the master inserted the wrong data, or the data was corrupted during replication.

So check first whether the master has the wrong data:

master> SELECT * FROM t1;
+------+
| f1   |
+------+
|    1 |
|    2 |
|    3 |
+------+
3 rows in set (0.03 sec)

The master’s data is correct, so the source of the problem lies in the replication layer.

But why did it happen? Replication seems to be running fine,[5] so we suspect a logic error on the master. Having discovered a possible source of the problem, you need to analyze the stored procedure and the calls on the master to find a fix.

As I said before, the slave server was restarted after events that insert data into the temporary table were replicated and emptied the temporary table, but before the stored procedure call that selects and inserts data into the main table. So the slave just re-created an empty temporary table and inserted no data.

In this case, you can either switch to row-based replication or rewrite the procedure so it does not rely on the existence of the temporary table. Another approach is to truncate and then refill the table so that a sudden restart will not leave the slave without data.

One might think that this example is very artificial and that you can’t predict when a server will suddenly restart. This is correct, but restarts are sure to happen from time to time. Therefore, you need to worry about such errors.

Actually, a slave replicates binary log events one by one, and when data is created within an atomic event (e.g., a transaction or stored procedure call), the slave would not be affected by this issue. But again, this example was just a simple one to show the concept behind events that do happen in real life.

  • When you experience a problem on a statement that you know is correct, check what your application did before you saw the problem.

More details on replication errors are in Chapter 5.

We’re done for now with problems related to wrong results. Next we’ll turn to some other problems that occur frequently.



[5] We will discuss in great detail how to troubleshoot replication failures in Chapter 5, so I’m skipping the explanation here.

Get MySQL Troubleshooting now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.