Chapter 4. Query Performance Optimization

In the previous chapter, we explained how to optimize a schema, which is one of the necessary conditions for high performance. But working with the schema isn’t enough—you also need to design your queries well. If your queries are bad, even the best-designed schema will not perform well. Query optimization, index optimization, and schema optimization go hand in hand. As you gain experience writing queries in MySQL, you will come to understand how to design schemas to support efficient queries. Similarly, what you learn about optimal schema design will influence the kinds of queries you write. This process takes time, so we encourage you to refer back to this chapter and the previous one as you learn more. This chapter begins with general query design considerations—the things you should consider first when a query isn’t performing well. We then dig much deeper into query optimization and server internals. We show you how to find out how MySQL executes a particular query, and you’ll learn how to change the query execution plan. Finally, we look at some places MySQL doesn’t optimize queries well and explore query optimization patterns that help MySQL execute queries more efficiently. Our goal is to help you understand deeply how MySQL really executes queries, so you can reason about what is efficient or inefficient, exploit MySQL’s strengths, and avoid its weaknesses.

Slow Query Basics: Optimize Data Access

The most basic reason a query doesn’t perform well is that it’s working with too much data. Some queries just have to sift through a lot of data and can’t be helped. That’s unusual, though; most bad queries can be changed to access less data. We’ve found it useful to analyze a poorly performing query in two steps:
Are You Asking the Database for Data You Don’t Need?

Some queries ask for more data than they need and then throw some of it away. This demands extra work of the MySQL server, adds network overhead, [36] and consumes memory and CPU resources on the application server. Here are a few typical mistakes: Fetching more rows than needed One common mistake is assuming that MySQL provides results on demand, rather than calculating and returning the full result set. We
often see this in applications designed by people familiar with other database systems. These developers are used to techniques such as issuing a If you want to retrieve all actors who appear in Academy Dinosaur, don’t write the query this way: mysql> That returns all columns from all three tables. Instead, write the query as follows: mysql> Fetching all columnsYou should always be suspicious when you see Some DBAs ban Of course, asking for more data than you really need is not always bad. In many cases we’ve investigated, people tell us the wasteful approach simplifies development, as it lets the developer use the same bit of code in more than one place. That’s a reasonable consideration, as long as you know what it costs in terms of performance. It may also be useful to retrieve more data than you actually need if you use some type of caching in your application, or if you have another benefit in mind. Fetching and caching full objects may be preferable to running many separate queries that retrieve only parts of the object. Is MySQL Examining Too Much Data?Once you’re sure your queries retrieve only the data you need, you can look for queries that examine too much data while generating results. In MySQL, the simplest query cost metrics are:
None of these metrics is a perfect way to measure query cost, but they reflect roughly how much data MySQL must access internally to execute a query and translate approximately into how fast the query runs. All three metrics are logged in the slow query log, so looking at the slow query log is one of the best ways to find queries that examine too much data.

Execution time

As discussed in Chapter 2, the standard slow query logging feature in MySQL 5.0 and earlier has serious limitations, including lack of support for fine-grained logging. Fortunately, there are patches that let you log and measure slow queries with microsecond resolution. These are included in the MySQL 5.1 server, but you can also patch earlier versions if needed. Beware of placing too much emphasis on query execution time. It’s nice to look at because it’s an objective metric, but it’s not consistent under varying load conditions. Other factors—such as storage engine locks (table locks and row locks), high concurrency, and hardware—can also have a considerable impact on query execution times. This metric is useful for finding queries that impact the application’s response time the most or load the server the most, but it does not tell you whether the actual execution time is reasonable for a query of a given complexity. (Execution time can also be both a symptom and a cause of problems, and it’s not always obvious which is the case.)

Rows examined and rows returned

It’s useful to think about the number of rows examined when analyzing queries, because you can see how efficiently the queries are finding the data you need. However, like execution time, it’s not a perfect metric for finding bad queries. Not all row accesses are equal. Shorter rows are faster to access, and fetching rows from memory is much faster than reading them from disk. Ideally, the number of rows examined would be the same as the number returned, but in practice this is rarely possible. 
For example, when constructing rows with joins, multiple rows must be accessed to generate each row in the result set. The ratio of rows examined to rows returned is usually small—say, between 1:1 and 10:1—but sometimes it can be orders of magnitude larger. Rows examined and access typesWhen you’re thinking about the cost of a query, consider the cost of finding a single row in a table. MySQL can use several access methods to find and return a row. Some require examining many rows, but others may be able to generate the result without examining any. The access method(s) appear in the If you aren’t getting a good access type, the best way to solve the problem is usually by adding an appropriate index. We discussed indexing at length in the previous chapter; now you can see why indexes are so important to query optimization. Indexes let MySQL find rows with a more efficient access type that examines less data. For example, let’s look at a simple query on the Sakila sample database: mysql> This query will return 10 rows, and mysql>
mysql> Predictably, the access type has changed to a full table scan ( In general, MySQL can apply a
This example illustrates how important it is to have good indexes. Good indexes help your queries get a good access type and examine only the rows they need. However, adding an index doesn’t always mean that MySQL will access and return the same number of rows. For example, here’s a query that uses the mysql> This query returns only 200 rows, but it needs to read thousands of rows to build the result set. An index can’t reduce the number of rows examined for a query like this one. Unfortunately, MySQL does not tell you how many of the rows it accessed were used to
build the result set; it tells you only the total number of rows it accessed. Many of these rows could be eliminated by a If you find that a huge number of rows were examined to produce relatively few rows in the result, you can try some more sophisticated fixes:
Ways to Restructure Queries

As you optimize problematic queries, your goal should be to find alternative ways to get the result you want—but that doesn’t necessarily mean getting the same result set back from MySQL. You can sometimes transform queries into equivalent forms and get better performance. However, you should also think about rewriting the query to retrieve different results, if that provides an efficiency benefit. You may ultimately be able to do the same work by changing the application code as well as the query. In this section, we explain techniques that can help you restructure a wide range of queries and show you when to use each technique.

Complex Queries Versus Many Queries

One important query design question is whether it’s preferable to break up a complex query into several simpler queries. The traditional approach to database design emphasizes doing as much work as possible with as few queries as possible. This approach was historically better because of the cost of network communication and the overhead of the query parsing and optimization stages. However, this advice doesn’t apply as much to MySQL, because it was designed to handle connecting and disconnecting very efficiently and to respond to small and simple queries very quickly. Modern networks are also significantly faster than they used to be, reducing network latency. MySQL can run more than 50,000 simple queries per second on commodity server hardware and over 2,000 queries per second from a single correspondent on a Gigabit network, so running multiple queries isn’t necessarily such a bad thing. Connection response is still slow compared to the number of rows MySQL can traverse per second internally, though, which is counted in millions per second for in-memory data. All else being equal, it’s still a good idea to use as few queries as possible, but sometimes you can make a query more efficient by decomposing it and executing a few simple queries instead of one complex one. 
Don’t be afraid to do this; weigh the costs, and go with the strategy that causes less work. We show some examples of this technique a little later in the chapter. That said, using too many queries is a common mistake in application design. For example, some applications perform 10 single-row queries to retrieve data from a table when they could use a single 10-row query. We’ve even seen applications that retrieve each column individually, querying each row many times!

Chopping Up a Query

Another way to slice up a query is to divide and conquer, keeping it essentially the same but running it in smaller “chunks” that affect fewer rows each time. Purging old data is a great example. Periodic purge jobs may need to remove quite a bit of data, and doing this in one massive query could lock a lot of rows for
a long time, fill up transaction logs, hog resources, and block small queries that shouldn’t be interrupted. Chopping up the mysql> you could do something like the following pseudocode: rows_affected = 0 do { rows_affected = do_query( "DELETE FROM messages WHERE created < DATE_SUB(NOW(),INTERVAL 3 MONTH) LIMIT 10000") } while rows_affected > 0 Deleting 10,000 rows at a time is typically a large enough task to make each query efficient, and
a short enough task to minimize the impact on the server [38] (transactional storage engines may benefit from smaller transactions). It may also be a good idea to add some sleep time between the DELETE statements. This spreads the load over time and can reduce the length of time locks are held.
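The chunked-purge loop above can be sketched as runnable Python. Here `purge_in_chunks` is a hypothetical helper name, and `fake_delete` simulates what a real database call would do (for example, a DB-API `cursor.execute` of the `DELETE ... LIMIT` statement followed by reading `cursor.rowcount`), so the loop logic can run standalone:

```python
import time

def purge_in_chunks(delete_chunk, chunk_size=10000, sleep_seconds=0.0):
    """Repeatedly delete up to chunk_size rows until nothing is left.

    delete_chunk(chunk_size) must return the number of rows it removed,
    like cursor.rowcount after a DELETE ... LIMIT chunk_size.
    """
    total = 0
    while True:
        affected = delete_chunk(chunk_size)
        total += affected
        if affected == 0:
            break                     # nothing left to purge
        if sleep_seconds:
            time.sleep(sleep_seconds)  # let other queries run between chunks
    return total

# Simulate a table holding 25,000 expired rows; each call deletes one chunk.
remaining = [25000]

def fake_delete(limit):
    deleted = min(remaining[0], limit)
    remaining[0] -= deleted
    return deleted

print(purge_in_chunks(fake_delete))  # 25000, purged in chunks of 10,000
```

In a real job, `delete_chunk` would issue the bounded `DELETE` against the server; the loop shape is what matters.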
Join Decomposition

Many high-performance web sites use join decomposition. You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing the join in the application. For example, instead of this single query: mysql> You might run these queries: mysql> This looks wasteful at first glance, because you’ve increased the number of queries without getting anything in return. However, such restructuring can actually give significant performance advantages:
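The application-side half of this technique can be sketched in Python. The three lists below stand in for the result sets that three hypothetical single-table queries might return (a tag table, a tag-to-post link table, and a post table; the schema and data are made up for illustration), and the join itself is done as a hash lookup in client code:

```python
# Results the three single-table queries might return (hypothetical schema).
tags = [{"id": 1234, "tag": "mysql"}]
tag_posts = [{"tag_id": 1234, "post_id": 123}, {"tag_id": 1234, "post_id": 456}]
posts = [
    {"id": 123, "title": "Tuning queries"},
    {"id": 456, "title": "Indexing basics"},
]

# Join in the application: index posts by primary key, then look up each
# post_id from the link table -- in effect a hash join in client code.
posts_by_id = {p["id"]: p for p in posts}
tagged = [posts_by_id[tp["post_id"]] for tp in tag_posts]

print([p["title"] for p in tagged])  # ['Tuning queries', 'Indexing basics']
```

One reason this can pay off is that each single-table query is individually cacheable and shardable, and the dictionary lookup in the application is cheap.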
Query Execution Basics

If you need to get high performance from your MySQL server, one of the best ways to invest your time is in learning how MySQL optimizes and executes queries. Once you understand this, much of query optimization is simply a matter of reasoning from principles, and query optimization becomes a very logical process. Tip: This discussion assumes you’ve read Chapter 1, which provides a foundation for understanding the MySQL query execution engine. Figure 4-1 shows how MySQL generally executes queries. Follow along with the illustration to see what happens when you send MySQL a query:
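As a toy illustration of the general flow (the client sends the query; the server checks the query cache for an exact byte-for-byte match; on a miss it parses, preprocesses, and optimizes the SQL into a plan and executes it), here is a minimal Python sketch. The function names are invented stand-ins, not MySQL internals; the cache keyed on the raw query text shows why even a one-byte or case difference misses the query cache:

```python
query_cache = {}

def execute(sql):
    # Step 1: check the query cache, an exact, case-sensitive match
    # on the raw query text.
    if sql in query_cache:
        return query_cache[sql]       # cache hit: skip parse/optimize/execute
    # Steps 2-4: parse, preprocess, optimize, and execute (stubbed out here).
    result = run_on_storage_engine(sql)
    query_cache[sql] = result
    return result

def run_on_storage_engine(sql):
    run_on_storage_engine.calls += 1  # count how many queries truly execute
    return f"rows for: {sql}"
run_on_storage_engine.calls = 0

execute("SELECT * FROM film")
execute("SELECT * FROM film")   # identical text: served from the cache
execute("select * from film")   # differs only in case: a cache miss
print(run_on_storage_engine.calls)  # 2
```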
Each of these steps has some extra complexity, which we discuss in the following sections. We also explain which states the query will be in during each step. The query optimization process is particularly complex and important to understand. Figure 4-1. Execution path of a query The MySQL Client/Server ProtocolThough you don’t need to understand the inner details of MySQL’s client/server protocol, you do need to understand how it works at a high level. The protocol is half-duplex, which means that at any given time the MySQL server can be either sending or receiving messages, but not both. It also means there is no way to cut a message short. This protocol makes MySQL communication simple and fast, but it limits it in some ways too. For one thing, it means there’s no flow control; once one side sends a message, the other side must fetch the entire message before responding. It’s like a game of tossing a ball back and forth: only one side has the ball at any instant, and you can’t toss the ball (send a message) unless you have it. The client sends a query to the server as a single packet of data. This is why the In contrast, the response from the server usually consists of many packets of data. When the server responds, the client has to receive the entire result set. It cannot simply fetch
a few rows and then ask the server not to bother sending the rest. If the client needs only the first few rows that are returned, it either has to wait for all of the server’s packets to arrive and then discard the ones it doesn’t need, or disconnect ungracefully. Neither is a good idea, which is why appropriate Here’s another way to think about this: when a client fetches rows from the server, it thinks it’s pulling them. But the truth is, the MySQL server is pushing the rows as it generates them. The client is only receiving the pushed rows; there is no way for it to tell the server to stop sending rows. The client is “drinking from the fire hose,” so to speak. (Yes, that’s a technical term.) Most libraries that connect to MySQL let you either fetch the whole result set and buffer it in memory, or fetch each row as you need it. The default behavior is generally to fetch the whole result and buffer it in memory. This is important because until all the rows have been fetched, the MySQL server will not release the locks and other resources required by the query. The query will be in the “Sending data” state (explained in “Query states” on Query states). When the client library fetches the results all at once, it reduces the amount of work the server needs to do: the server can finish and clean up the query as quickly as possible. Most client libraries let you treat the result set as though you’re fetching it from the server, although in fact you’re just fetching it from the buffer in the library’s memory. This works fine most of the time, but it’s not a good idea for huge result sets that might take a long time to fetch and use a lot of memory. You can use less memory, and start working on the result sooner, if you instruct the library not to buffer the result. The downside is that the locks and other resources on the server will remain open while your application is interacting with the library. [40] Let’s look at an example using PHP. 
First, here’s how you’ll usually query MySQL from PHP: <?php $link = mysql_connect('localhost', 'user', 'p4ssword'); $result = mysql_query('SELECT * FROM HUGE_TABLE', $link); while ( $row = mysql_fetch_array($result) ) { // Do something with result } ?> The code seems to indicate that you fetch rows
only when you need them, in the <?php $link = mysql_connect('localhost', 'user', 'p4ssword'); $result = mysql_unbuffered_query('SELECT * FROM HUGE_TABLE', $link); while ( $row = mysql_fetch_array($result) ) { // Do something with result } ?> Programming languages have different ways to override buffering. For example, the Perl #!/usr/bin/perl use DBI; my $dbh = DBI->connect('DBI:mysql:;host=localhost', 'user', 'p4ssword'); my $sth = $dbh->prepare('SELECT * FROM HUGE_TABLE', { mysql_use_result => 1 }); $sth->execute(); while ( my $row = $sth->fetchrow_array() ) { # Do something with result } Notice that the call to my $dbh = DBI->connect('DBI:mysql:;mysql_use_result=1', 'user', 'p4ssword'); Query statesEach MySQL connection, or thread, has a state that shows what it is doing at any given time. There are several ways to view these
states, but the easiest is to use the
The thread is waiting for a new query from the client. Query The thread is either executing the query or sending the result back to the client. Locked The thread is waiting for a table lock to be granted at the server level. Locks that are implemented by the storage engine, such as InnoDB’s row locks, do not cause the thread to enter the Locked state. Analyzing and statistics The thread is checking storage engine statistics and optimizing the query. Copying to tmp table [on
disk] The thread is processing the query and copying results to a temporary table, probably for a Sorting result The thread is sorting a result set. Sending data This can mean several things: the thread might be sending data between stages of the query, generating the result set, or returning the result set to the client. It’s helpful to at least know the basic states, so you can get a sense of “who has the ball” for the query. On very busy servers, you might see an unusual or
normally brief state, such as The Query CacheBefore even parsing a query, MySQL checks for it in the query cache, if the cache is enabled. This operation is a case sensitive hash lookup. If the query differs from a similar query in the cache by even a single byte, it won’t match, and the query processing will go to the next stage. If MySQL does find a match in the query cache, it must check privileges before returning the cached query. This is possible without parsing the query, because MySQL stores table information with the cached query. If the privileges are OK, MySQL retrieves the stored result from the query cache and sends it to the client, bypassing every other stage in query execution. The query is never parsed, optimized, or executed. You can learn more about the query cache in Chapter 5. The Query Optimization ProcessThe next step in the query lifecycle turns a SQL query into an execution plan for the query execution engine. It has several sub-steps: parsing, preprocessing, and optimization. Errors (for example, syntax errors) can be raised at any point in the process. We’re not trying to document the MySQL internals here, so we’re going to take some liberties, such as describing steps separately even though they’re often combined wholly or partially for efficiency. Our goal is simply to help you understand how MySQL executes queries so that you can write better ones. The parser and the preprocessorTo begin, MySQL’s parser breaks the query into tokens and builds a “parse tree” from them. The parser uses MySQL’s SQL grammar to interpret and validate the query. For instance, it ensures that the tokens in the query are valid and in the proper order, and it checks for mistakes such as quoted strings that aren’t terminated. The preprocessor then checks the resulting parse tree for additional semantics that the parser can’t resolve. 
For example, it checks that tables and columns exist, and it resolves names and aliases to ensure that column references aren’t ambiguous. Next, the preprocessor checks privileges. This is normally very fast unless your server has large numbers of privileges. (See Chapter 12 for more on privileges and security.) The query optimizerThe parse tree is now valid and ready for the optimizer to turn it into a query execution plan. A query can often be executed many different ways and produce the same result. The optimizer’s job is to find the best option. MySQL uses a cost-based optimizer, which means it tries to predict the cost of various execution plans and choose the least expensive. The unit of cost is a single random four-kilobyte data page read. You can see how expensive the optimizer estimated a query to be by running the query, then inspecting the mysql> This result means that the optimizer estimated it would need to do about 1,040 random data page reads to execute the query. It bases the estimate on statistics: the number of pages per table or index, the cardinality (number of distinct values) of indexes, the length of rows and keys, and key distribution. The optimizer does not include the effects of any type of caching in its estimates—it assumes every read will result in a disk I/O operation. The optimizer may not always choose the best plan, for many reasons:
MySQL’s query optimizer is a highly complex piece of software, and it uses many optimizations to transform the query into an execution plan. There are two basic types of optimizations, which we call
static and dynamic. Static optimizations can be performed simply by inspecting the parse tree. For example, the optimizer can transform the In contrast, dynamic optimizations are based on context and can depend on many factors, such as which value is in a The difference is important in executing prepared statements or stored procedures. MySQL can do static optimizations once, but it must reevaluate dynamic optimizations every time it executes a query. MySQL sometimes even reoptimizes the query as it executes it. [41] Here are some types of optimizations MySQL knows how to do: Reordering joins Tables don’t always have to be joined in the order you specify in the query. Determining the best join order is an important optimization; we explain it in depth in “The join optimizer” on The join optimizer. ConvertingOUTER JOIN s to INNER
JOINs An OUTER JOIN isn’t always executed as an outer join; when the WHERE clause or the table schema makes the outer and inner forms equivalent, MySQL can recognize this and rewrite the join, which also makes it eligible for reordering. Applying algebraic equivalence rules MySQL applies algebraic
transformations to simplify and canonicalize expressions. It can also fold and reduce constants, eliminating impossible constraints and constant conditions. For example, the term COUNT(), MIN(), and MAX()
optimizationsIndexes and column nullability can often help MySQL optimize away these expressions. For example, to find the minimum value of a column that’s leftmost in a B-Tree index, MySQL can just request the first row in the index. It can even do this in the query optimization stage, and treat the value as a constant for the rest of the query. Similarly, to find the maximum value in a B-Tree index, the server reads the last row. If the server uses this
optimization, you’ll see “Select tables optimized away” in the Likewise, When MySQL detects that an expression can be reduced to a constant, it will do so during optimization. For example, a user-defined variable can be converted to a constant if it’s not changed in the query. Arithmetic expressions are another example. Perhaps surprisingly, even something you might consider to be a query can be reduced to a constant during the optimization phase. One example is a mysql> MySQL executes this query in two steps, which correspond to the two rows in the output. The first step is to find the desired row in the In the second step, MySQL treats the Another way you’ll see constant conditions applied is by propagating a value’s constant-ness from one place to another if there is a Covering indexes MySQL can sometimes use an index to avoid reading row data, when the index contains all the columns the query needs. We discussed covering indexes at length in Chapter 3. Subquery optimizationMySQL can convert some types of subqueries into more efficient alternative forms, reducing them to index lookups instead of separate queries. Early terminationMySQL can stop processing a query (or a step in a query) as soon as it fulfills the query or step. The obvious case is a mysql> This query stopped during the optimization step, but MySQL can also terminate execution sooner in some cases. The server can use this optimization when the query execution engine recognizes the need to retrieve distinct values, or to stop when a value doesn’t exist. For example, the following query finds all movies without any actors: [42] mysql> This query works by eliminating any films that have actors. Each film might have many actors, but as soon as it finds one actor, it stops processing the current film and moves to the next
one because it knows the MySQL recognizes when a query holds two columns as equal—for example, in a mysql> MySQL knows that the If you’re used to another database server that can’t do this, you may have been advised to “help the optimizer” by manually specifying the ... WHERE film.film_id > 500 AND film_actor.film_id > 500 This is unnecessary in MySQL. It just makes your queries harder to maintain. IN() list comparisonsIn many database servers, The preceding list is woefully incomplete, as MySQL performs more optimizations than we could fit into this entire chapter, but it should give you an idea of the optimizer’s complexity and intelligence. If there’s one thing you should take away from this discussion, it’s don’t try to outsmart the optimizer. You may end up just defeating it, or making your queries more complicated and harder to maintain for zero benefit. In general, you should let the optimizer do its work. Of course, as smart as the optimizer is, there are times when it doesn’t give the best result. Sometimes you may know something about the data that the optimizer doesn’t, such as a fact that’s guaranteed to be true because of application logic. Also, sometimes the optimizer doesn’t have the necessary functionality, such as hash indexes; at other times, as mentioned earlier, its cost estimates may prefer a query plan that turns out to be more expensive than an alternative. If you know the optimizer isn’t giving a good result, and you know why, you can help it. Some of the options are to add a hint to the query, rewrite the query, redesign your schema, or add indexes. Table and index statisticsRecall the various layers in the MySQL server architecture, which we illustrated in Figure 1-1. The server layer, which contains the query optimizer, doesn’t store statistics on data and indexes. That’s a job for the storage engines, because each storage engine might keep different kinds of statistics (or keep them in a different way). 
Some engines, such as Archive, don’t keep statistics at all! Because the server doesn’t store statistics, the MySQL query optimizer has to ask the engines for statistics on the tables in a query. The engines may provide the optimizer with statistics such as the number of pages per table or index, the cardinality of tables and indexes, the length of rows and keys, and key distribution information. The optimizer can use this information to help it decide on the best execution plan. We see how these statistics influence the optimizer’s choices in later sections. MySQL’s join execution strategyMySQL uses the term “join” more broadly than you might be
used to. In sum, it considers every query a join—not just every query that matches rows from two tables, but every query, period (including subqueries, and even a Consider the example of a At the moment, MySQL’s join execution strategy is simple: it treats every join as a nested-loop join. This means MySQL runs a loop to find a row from a table, then runs a nested loop to find a matching row in the next table. It continues until it has found a matching row in each table in the join. It then builds and returns a row from the columns named in the
This process of finding rows, probing into the next table, and then backtracking can be written as nested loops in the execution plan—hence the name “nested-loop join.” As an example, consider this simple query: mysql> Assuming MySQL decides to join the tables in the order shown in the query, the following pseudocode shows how MySQL might execute the query: outer_iter = iterator over tbl1 where col1 IN(5,6) outer_row = outer_iter.next while outer_row inner_iter = iterator over tbl2 where col3 = outer_row.col3 inner_row = inner_iter.next while inner_row output [ outer_row.col1, inner_row.col2 ] inner_row = inner_iter.next end outer_row = outer_iter.next end This query execution plan applies as easily to a single-table query as it does to a many-table query, which is why even a single-table query can be considered a join—the single-table join is the basic operation from which more complex joins are composed. It can support mysql> Here’s the corresponding pseudocode, with the changed parts in bold: outer_iter = iterator over tbl1 where col1 IN(5,6) outer_row = outer_iter.next while outer_row inner_iter = iterator over tbl2 where col3 = outer_row.col3 inner_row = inner_iter.next Another way to visualize a query execution plan is to use what the optimizer folks call a “swim-lane diagram.” Figure 4-2 contains a swim-lane diagram of our initial Figure 4-2. Swim-lane diagram illustrating retrieving rows using a join MySQL executes every kind of query in essentially the same way.
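The pseudocode maps directly onto real nested loops. Here is a runnable sketch using in-memory lists of dicts as stand-in tables; the column names col1, col2, and col3 follow the pseudocode, while the data values are made up for illustration:

```python
# Stand-in tables for tbl1 and tbl2.
tbl1 = [
    {"col1": 5, "col3": "a"},
    {"col1": 6, "col3": "b"},
    {"col1": 7, "col3": "a"},   # filtered out: col1 not in (5, 6)
]
tbl2 = [
    {"col2": "x", "col3": "a"},
    {"col2": "y", "col3": "b"},
    {"col2": "z", "col3": "a"},
]

results = []
# Outer loop: rows from tbl1 matching col1 IN (5, 6).
for outer_row in (r for r in tbl1 if r["col1"] in (5, 6)):
    # Inner (nested) loop: rows from tbl2 matching the join condition
    # col3 = outer_row.col3.
    for inner_row in (r for r in tbl2 if r["col3"] == outer_row["col3"]):
        results.append((outer_row["col1"], inner_row["col2"]))

print(results)  # [(5, 'x'), (5, 'z'), (6, 'y')]
```

Note the backtracking the text describes: after exhausting the inner loop for one outer row, execution returns to the outer loop and probes the inner table again for the next row.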
For example, it handles a subquery in the It’s not possible to execute every legal SQL query this way, however. For example, a The execution planMySQL doesn’t generate byte-code to execute a query, as many other database products do. Instead, the query execution plan is actually a tree of instructions that the query execution engine follows to produce the query results. The final plan contains
enough information to reconstruct the original query. If you execute Any multitable query can conceptually be represented as a tree. For example, it might be possible to execute a four-table join as shown in Figure 4-3. Figure 4-3. One way to join multiple tables This is what computer scientists call a balanced tree. This is not how MySQL executes the query, though. As we described in the previous section, MySQL always begins with one table and finds matching rows in the next table. Thus, MySQL’s query execution plans always take the form of a left-deep tree, as in Figure 4-4. Figure 4-4. How MySQL joins multiple tables The join optimizerThe most important part of the MySQL query optimizer is the join optimizer, which decides the best order of execution for multitable queries. It is often possible to join the tables in several different orders and get the same results. The join optimizer estimates the cost for various plans and tries to choose the least expensive one that gives the same result. Here’s a query whose tables can be joined in different orders without changing the results: mysql> You can probably think of a few different query plans. For example, MySQL could begin with the *************************** 1. row *************************** id: 1 select_type: SIMPLE table: actor type: ALL possible_keys: PRIMARY key: NULL key_len: NULL ref: NULL rows: 200 Extra: *************************** 2. row *************************** id: 1 select_type: SIMPLE table: film_actor type: ref possible_keys: PRIMARY,idx_fk_film_id key: PRIMARY key_len: 2 ref: sakila.actor.actor_id rows: 1 Extra: Using index *************************** 3. row *************************** id: 1 select_type: SIMPLE table: film type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 2 ref: sakila.film_actor.film_id rows: 1 Extra: This is quite a different plan from the one suggested in the previous paragraph. 
MySQL wants to start with the mysql> This shows why MySQL wants to reverse the join order: doing so will enable it to examine fewer rows in the first table. [46] In both cases, it will be able to perform fast indexed lookups in the second and third tables. The difference is how many of these indexed lookups it will have to do:
In other words, the reversed join order will require less backtracking and rereading. To double-check the optimizer’s choice, we executed the two query versions and
looked at the This is a simple example of how MySQL’s join optimizer can reorder queries to make them less expensive to execute. Reordering joins is usually a very effective optimization. There are times when it won’t result in an optimal plan, and for those times you can
use The join optimizer tries to produce a query execution plan tree with the lowest achievable cost. When possible, it examines all potential combinations of subtrees, beginning with all one-table plans. Unfortunately, a join over n tables will have n-factorial
combinations of join orders to examine. This is called the search space of all possible query plans, and it grows very quickly—a 10-table join can be executed up to 3,628,800 different ways! When the search space grows too large, it can take far too long to optimize the query, so the server stops doing a full analysis. Instead, it resorts to shortcuts such as “greedy” searches when the number of tables exceeds the MySQL has many heuristics, accumulated through years of research and experimentation, that it uses to speed up the optimization stage. This can be beneficial, but it can also mean that MySQL may (on rare occasions) miss an optimal plan and choose a less optimal one because it’s trying not to examine every possible query plan. Sometimes queries can’t be reordered, and the join optimizer can use this fact to reduce the search space by eliminating choices. A

Sort optimizations
Sorting results can be a costly operation, so you can often improve performance by avoiding sorts or by performing them on fewer rows. We showed you how to use indexes for sorting in Chapter 3. When MySQL can’t use an index to produce a sorted result, it must sort the rows itself. It can do this in memory or on disk, but it always calls this process a filesort, even if it doesn’t actually use a file. If the values to be sorted will fit into the sort buffer, MySQL can perform the sort entirely in memory with a quicksort. If MySQL can’t do the sort in memory, it performs it on disk by sorting the values in chunks. It uses a quicksort to sort each chunk and then merges the sorted chunks into the results. There are two filesort algorithms:

Two passes (old)
Reads row
pointers and The two-pass algorithm can be quite expensive, because it reads the rows from the table twice, and the second read causes a lot of random I/O. This is especially expensive for MyISAM, which uses a system call to fetch each row (because MyISAM relies on the operating system’s cache to hold the data). On the other hand, it stores a minimal amount of data during the sort, so if the rows to be sorted are completely in memory, it can be cheaper to store less data and reread the rows to generate the final result.

Single pass (new)
Reads all the columns needed for the query, sorts them by the This algorithm is available only in MySQL 4.1 and newer. It can be much more efficient, especially on large I/O-bound datasets, because it avoids reading the rows from the table twice and trades random I/O for more sequential I/O. However, it has the potential to use a lot more space, because it holds all desired columns from each row, not just the columns needed to sort the rows. This means fewer tuples will fit into the sort buffer, and the filesort will have to perform more sort merge passes. MySQL may use much more temporary storage space for a filesort
than you’d expect, because it allocates a fixed-size record for each tuple it will sort. These records are large enough to hold the largest possible tuple, including the full length of each When sorting a join, MySQL may perform the
filesort at two stages during the query execution. If the See “Optimizing for Filesorts” for more on how to tune the server for filesorts and how to influence which algorithm the server uses.

The Query Execution Engine
The parsing and optimizing stage outputs a query execution plan, which MySQL’s query execution engine uses to process the query. The plan is a data structure; it is not executable byte-code, which is how many other databases execute queries. In contrast to the optimization stage, the execution stage is usually not all that complex: MySQL simply follows the instructions given in the query execution plan. Many of the operations in the plan invoke methods implemented by the storage engine interface, also known as the handler API. Each table in the query is represented by an instance of a handler. If a table appears three times in the query, for example, the server creates three handler instances. Though we glossed over this before, MySQL actually creates the handler instances early in the optimization stage. The optimizer uses them to get information about the tables, such as their column names and index statistics. The storage engine interface has lots of functionality, but it needs only a dozen or so “building-block” operations to execute most queries. For example, there’s an operation to read the first row in an index, and one to read the next row in an index. This is enough for a query that does an index scan. This simplistic execution method makes MySQL’s storage engine architecture possible, but it also imposes some of the optimizer limitations we’ve discussed.

Tip
Not everything is a handler operation. For example, the server manages table locks. The handler may implement its own lower-level locking, as InnoDB does with row-level locks, but this does not replace the server’s own locking implementation.
As explained in Chapter 1, anything that all storage engines share is implemented in the server, such as date and time functions, views, and triggers. To execute the query, the server just repeats the instructions until there are no more rows to examine.

Returning Results to the Client
The final step in executing a query is to reply to the client. Even queries that don’t return a result set still reply to the client connection with information about the query, such as how many rows it affected. If the query is cacheable, MySQL will also place the results into the query cache at this stage. The server generates and sends results incrementally. Think back to the single-sweep multijoin method we mentioned earlier. As soon as MySQL processes the last table and generates one row successfully, it can and should send that row to the client. This has two benefits: it lets the server avoid holding the row in memory, and it means the client starts getting the results as soon as possible. [47]

Limitations of the MySQL Query Optimizer
MySQL’s “everything is a nested-loop join” approach to query execution isn’t ideal for optimizing every kind of query. Fortunately, there are only a limited number of cases where the MySQL query optimizer does a poor job, and it’s usually possible to rewrite such queries more efficiently.

Tip
The information in this section applies to the MySQL server versions to which we have access at the time of this writing—that is, up to MySQL 5.1. Some of these limitations will probably be eased or removed entirely in future versions, and some have already been fixed in versions not yet released as GA (generally available). In particular, there are a number of subquery optimizations in the MySQL 6 source code, and more are in progress.

Correlated Subqueries
MySQL sometimes optimizes subqueries very badly.
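The classic problem pattern discussed next can be sketched against the sakila sample database (the excerpt elides the original queries, so treat these as illustrative reconstructions; the actor_id = 1 value matches the GROUP_CONCAT result shown later):

```sql
-- An IN() subquery that MySQL (up to 5.1) rewrites into a correlated
-- EXISTS, then executes by scanning film and running the subquery
-- once per row:
SELECT * FROM sakila.film
WHERE film_id IN(
   SELECT film_id FROM sakila.film_actor WHERE actor_id = 1);

-- An equivalent join, which these server versions usually execute
-- far more efficiently:
SELECT film.*
FROM sakila.film
   INNER JOIN sakila.film_actor USING(film_id)
WHERE film_actor.actor_id = 1;
```

Comparing EXPLAIN output for the two forms shows the difference: the subquery version reports a DEPENDENT SUBQUERY, while the join version is a plain two-table plan.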
The worst offenders are mysql> It’s tempting to think that MySQL will execute this query from the inside out, by finding a list of -- SELECT GROUP_CONCAT(film_id) FROM sakila.film_actor WHERE actor_id = 1; -- Result: 1,23,25,106,140,166,277,361,438,499,506,509,605,635,749,832,939,970,980 SELECT * FROM sakila.film WHERE film_id IN(1,23,25,106,140,166,277,361,438,499,506,509,605,635,749,832,939,970,980); Unfortunately, exactly the opposite happens. MySQL tries to “help” the subquery by pushing a correlation into it from the outer table, which it thinks will let the subquery find rows more efficiently. It rewrites the query as follows: SELECT * FROM sakila.film WHERE Now the subquery requires the mysql> According
to the mysql> Another good optimization is to manually generate the MySQL has been criticized thoroughly for this particular type of subquery execution plan. Although it definitely needs to be fixed, the criticism often confuses two different issues: execution order and caching. Executing the query from the inside out is one way to optimize it; caching the inner query’s result is another. Rewriting the query yourself lets you take control over both aspects. Future versions of MySQL should be able to optimize this type of query much better, although this is no easy task. There are very bad worst cases for any execution plan, including the inside-out execution plan that some people think would be simple to optimize.

When a correlated subquery is good
MySQL doesn’t always optimize correlated subqueries badly. If you hear advice to always avoid them, don’t listen! Instead, benchmark and make your own decision. Sometimes a correlated subquery is a perfectly reasonable, or even optimal, way to get a result. Let’s look at an example: mysql> The standard advice for this query is to write it as a mysql> The plans are nearly identical, but there are some differences:
So, in theory, MySQL will execute the queries almost identically. In reality, benchmarking is the only way to tell which approach is really faster. We benchmarked both queries on our standard setup. The results are shown in Table 4-1. Table 4-1. NOT EXISTS versus LEFT OUTER JOIN
Our benchmark found that the subquery is quite a bit slower! However, this isn’t always the case. Sometimes a subquery can be faster. For example, it can work well when you just want to see rows from one table that match rows in another table. Although that sounds like it describes a join perfectly, it’s not always the same thing. The following join, which is designed to find every film that has an actor, will return duplicates because some films have multiple actors: mysql> We need to use mysql> But what are we really trying to express with this query, and is it obvious from the SQL? The mysql> Again, we benchmarked to see which strategy was faster. The results are shown in Table 4-2. Table 4-2. EXISTS versus INNER JOIN
In this example, the subquery performs much faster than the join. We showed this lengthy example to illustrate two points: you should not heed categorical advice about subqueries, and you should use benchmarks to prove your assumptions about query plans and execution speed.

UNION limitations
MySQL
sometimes can’t “push down” conditions from the outside of a If you think any of the individual queries inside a

Index merge optimizations
Index merge algorithms, introduced in MySQL 5.0, let MySQL use more than one index
per table in a query. Earlier versions of MySQL could use only a single index, so when no single index was good enough to help with all the restrictions in the mysql> In older MySQL versions, that query would produce a table scan unless you wrote it as the mysql> In MySQL 5.0 and newer, however, the query can use both indexes, scanning them simultaneously and merging the results. There are three variations on the algorithm: union for mysql> MySQL can use this technique on complex If your queries run more slowly because of this optimizer limitation, you can work around it by disabling some indexes with

Equality propagation
Equality propagation can have unexpected costs sometimes. For example, consider a huge
This is normally helpful, because it gives the query optimizer and execution engine more options for where to actually execute the

Parallel execution
MySQL can’t execute a single query in parallel on many CPUs. This is a feature offered by some other database servers, but not MySQL. We mention it so that you won’t spend a lot of time trying to figure out how to get parallel query execution on MySQL!

Hash joins
MySQL can’t do true hash joins at the time of this writing—everything is a nested-loop join. However, you can emulate hash joins using hash indexes. If you aren’t using the Memory storage engine, you’ll have to emulate the hash indexes, too. We showed you how to do this in “Building your own hash indexes.”

Loose index scans
MySQL has historically been unable to do loose index scans, which scan noncontiguous ranges of an index. MySQL’s index scans generally require a defined start point and a defined end point in the index, even if only a few noncontiguous rows in the middle are really desired for the query. MySQL will scan the entire range of rows within these end points. An example will help clarify this. Suppose we have a table with an index on columns mysql> Because the index begins with column

Figure 4-5. MySQL scans the entire table to find rows

It’s easy to see that there’s a faster way to execute this query. The index’s structure (but not MySQL’s storage engine API) lets you seek to the beginning of each range of values, scan until the end of the range, and then backtrack and jump ahead to the start of the next range. Figure 4-6 shows what that strategy would look like if MySQL were able to do it. Notice the absence of a

Figure 4-6. A loose index scan, which MySQL cannot currently do, would be more efficient

This is admittedly a simplistic example, and we could easily optimize the query we’ve shown by adding a different index. However, there are many cases where adding another index can’t solve the problem.
One example is a query that has a range condition on the index’s first column and an equality condition on the second column. Beginning in MySQL 5.0, loose index scans are possible in certain limited circumstances, such as queries that find maximum and minimum values in a grouped query: mysql> The “Using index for group-by” information in this Until MySQL supports general-purpose loose index scans, the workaround is to supply a constant or list of constants for the leading columns of the index. We showed several examples of how to get good performance with these types of queries in our indexing case study in the previous chapter.

MIN() and MAX()
MySQL doesn’t optimize certain mysql> Because there’s no index on mysql> This general strategy often works well when MySQL would otherwise choose to scan more rows than necessary. If you’re a purist, you might object that this query is missing the point of SQL. We’re supposed to be able to tell the server what we want and it’s supposed to figure out how to get that data, whereas, in this case, we’re telling MySQL how to execute the query and, as a result, it’s not clear from the query that what we’re looking for is a minimal value. True, but sometimes you have to compromise your principles to get high performance.

SELECT and UPDATE on the same table
MySQL doesn’t let you mysql> To work around this limitation, you can use a derived table, because MySQL materializes it as a temporary table. This effectively executes two queries: one mysql>
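The derived-table workaround described above can be sketched as follows. The table and column names (tbl, type, cnt) are hypothetical; the pattern is what matters: the inner SELECT is materialized into a temporary table, so the UPDATE no longer reads from the table it is modifying.

```sql
-- Fails: MySQL rejects a SELECT from the table being updated
-- in the same statement.
UPDATE tbl AS outer_tbl
   SET cnt = (
      SELECT COUNT(*) FROM tbl AS inner_tbl
      WHERE inner_tbl.type = outer_tbl.type
   );

-- Workaround: push the SELECT into a derived table, which MySQL
-- materializes as a temporary table before the UPDATE runs.
UPDATE tbl
   INNER JOIN (
      SELECT type, COUNT(*) AS cnt
      FROM tbl
      GROUP BY type
   ) AS der USING(type)
   SET tbl.cnt = der.cnt;
```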
Optimizing Specific Types of Queries
In this section, we give advice on how to optimize certain kinds of queries. We’ve covered most of these topics in detail elsewhere in the book, but we wanted to make a list of common optimization problems that you can refer to easily. Most of the advice in this section is version-dependent, and it may not hold for future versions of MySQL. There’s no reason why the server won’t be able to do some or all of these optimizations itself someday.

Optimizing COUNT() Queries
The Before we get into
optimization, it’s important that you understand what

What COUNT() does
The other form of One of the most common mistakes we see is specifying column names inside the parentheses when you want to count rows. When you want to know the number of rows in the result, you should always use

Myths about MyISAM
A common misconception is that MyISAM is extremely fast for MyISAM does not have any magical speed optimizations for counting rows when the query has a

Simple optimizations
You can sometimes use MyISAM’s mysql> If you profile this query with mysql> This version reads fewer rows because the subquery is turned into a constant during the query optimization phase, as you can see with

+----+-------------+-------+...+------+------------------------------+
| id | select_type | table |...| rows | Extra                        |
+----+-------------+-------+...+------+------------------------------+
|  1 | PRIMARY     | City  |...|    6 | Using where; Using index     |
|  2 | SUBQUERY    | NULL  |...| NULL | Select tables optimized away |
+----+-------------+-------+...+------+------------------------------+

A frequent question on mailing lists and IRC channels is how to retrieve counts for several different values in the same column with just
one query, to reduce the number of queries required. For example, say you want to create a single query that counts how many items have each of several colors. You can’t use an mysql> And here is another that’s equivalent, but instead of
using mysql>

More complex optimizations
In general,

Optimizing JOIN Queries
This topic is actually spread throughout most of the book, but we mention a few highlights:
Optimizing Subqueries
The most important advice we can give on subqueries is that you should usually prefer a join where possible, at least in current versions of MySQL. We covered this topic extensively earlier in this chapter. Subqueries are the subject of intense work by the optimizer team, and upcoming versions of MySQL may have more subquery optimizations. It remains to be seen which of the optimizations we’ve seen will end up in released code, and how much difference they’ll make. Our point here is that “prefer a join” is not future-proof advice. The server is getting smarter all the time, and the cases where you have to tell it how to do something instead of what results to return are becoming fewer.

Optimizing GROUP BY and DISTINCT
MySQL optimizes these two kinds of queries similarly in many cases, and in fact converts between them as needed internally during the optimization process. Both types of queries benefit from indexes, as usual, and that’s the single most important way to optimize them. MySQL has two kinds of If you need to group a join by a value that comes from a lookup table, it’s usually more efficient to group by the lookup table’s identifier than by the value. For example, the following query isn’t as efficient as it could be: mysql> The query is more efficiently written as follows: mysql> Grouping by This query takes advantage of the fact that the actor’s first and last name are dependent on the mysql> Purists will argue that you’re grouping by the wrong thing, and they’re right. A spurious mysql> But sometimes the cost of creating and filling the temporary table required for the subquery is high compared to the cost of fudging pure relational theory a little bit. Remember, the temporary table created by the subquery has no indexes.
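The pair of queries being contrasted here can be sketched against the sakila sample database (the excerpt elides the originals, so treat these as illustrative reconstructions):

```sql
-- Less efficient: grouping by values pulled from the lookup table
SELECT actor.first_name, actor.last_name, COUNT(*)
FROM sakila.film_actor
   INNER JOIN sakila.actor USING(actor_id)
GROUP BY actor.first_name, actor.last_name;

-- Usually more efficient: group by the lookup table's identifier;
-- first_name and last_name are functionally dependent on actor_id,
-- so the results are the same.
SELECT actor.first_name, actor.last_name, COUNT(*)
FROM sakila.film_actor
   INNER JOIN sakila.actor USING(actor_id)
GROUP BY film_actor.actor_id;
```

Grouping by the integer actor_id lets the server group on a compact indexed column instead of two character columns from the joined table.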
It’s generally a bad idea to select nongrouped columns in a grouped query, because the results will be nondeterministic and could easily change if you change an index or the optimizer decides to use a different strategy. Most such queries we see are accidents (because the server doesn’t complain), or are the result of laziness rather than being designed that way for optimization purposes. It’s better to be explicit. In fact, we suggest that you set the server’s MySQL automatically orders grouped queries by the columns in the

Optimizing GROUP BY WITH ROLLUP
A variation on grouped queries is to ask MySQL to do superaggregation within the results. You can do this with a Sometimes it’s more efficient to do superaggregation in your application, even if it means fetching many more rows from the server. You can also nest a subquery in the The best approach may be to move the
Optimizing LIMIT and OFFSET
Queries with A frequent problem is having a high value for the offset. If your query looks like One simple technique to improve efficiency is to do the offset on a covering index, rather than the full rows. You can then join the result to the full row and retrieve the additional columns you need. This can be much more efficient. Consider the following query: mysql> If the table is very large, this query is better written as follows: mysql> This works because it lets the server examine as little data as possible in an index without accessing
rows, and then, once the desired rows are found, join them against the full table to retrieve the other columns from the row. A similar technique applies to joins with Sometimes you can also convert the limit to a positional query, which the server can execute as an index range scan. For example, if you precalculate and index a position column, you can rewrite the query as follows: mysql> Ranked data poses a similar problem, but usually mixes If you really need to optimize pagination systems, you should probably use precomputed summaries. As an alternative, you can join against redundant tables that contain only the primary key and the columns you need for the

Optimizing SQL_CALC_FOUND_ROWS
Another common technique for paginated displays is to add the A better design is to convert the pager to a “next” link. Assuming there are 20 results per page, the query should then use a Another possibility is to fetch and cache many more rows than you need—say, 1,000—and then retrieve them from the cache for successive pages. This strategy lets your application know how large the full result set is. If it’s fewer than 1,000 rows, the application knows how many page links to render; if it’s more, the application can just display “more than 1,000 results found.” Both strategies are much more efficient than repeatedly generating an entire result and discarding most of it. Even when you can’t use these tactics, using a separate

Optimizing UNION
MySQL always executes It’s important to always use

Query Optimizer Hints
MySQL has a few optimizer hints you can use to control the query plan if you’re not happy with the one MySQL’s optimizer chooses. The following list identifies these hints and indicates when it’s a good idea to use them. You place the appropriate hint in the query whose plan you want to modify, and it is effective for only that query. Check the MySQL manual for the exact syntax of each hint. Some of them are version-dependent. The options are:
These hints tell MySQL how to prioritize the statement relative to other statements that are trying to access the same tables.
These hints are effective on storage engines with table-level locking, but you should never need them on InnoDB or other engines with fine-grained locking and concurrency control. Be careful when using them on MyISAM, because they can disable concurrent inserts and greatly reduce performance. The

DELAYED
This hint is for use with

STRAIGHT_JOIN
This hint can appear either just after the The You can use

SQL_SMALL_RESULT and SQL_BIG_RESULT
These hints are for

SQL_BUFFER_RESULT
This hint tells the optimizer to put the results into a temporary table and release table locks as soon as possible. This is different from the client-side buffering we described in “The MySQL Client/Server Protocol.” Server-side buffering can be useful when you don’t use buffering on the client, as it lets you avoid consuming a lot of memory on the client and still release locks quickly. The tradeoff is that the server’s memory is used instead of the client’s.

SQL_CACHE and SQL_NO_CACHE
These hints instruct the server that the query either is or is not a candidate for caching in the query cache. See the next chapter for details on how to use them.

SQL_CALC_FOUND_ROWS
This hint tells MySQL to calculate a full result set when there’s a

FOR UPDATE and LOCK IN SHARE MODE
These hints control locking for These hints are not needed for At the time of this writing, only InnoDB supports these hints, and it’s too early to say whether other storage engines with row-level locks will support them in the future. When using these hints with InnoDB, be aware that they may disable some optimizations, such as covering indexes. InnoDB can’t lock rows exclusively without accessing the primary key, which is where the row versioning information is stored.

USE INDEX, IGNORE INDEX, and FORCE INDEX
These hints tell the optimizer which indexes to use or ignore for finding rows in a table (for example, when deciding on a join order). In MySQL 5.0 and earlier, they don’t influence which indexes the server uses for sorting and grouping; in MySQL 5.1 the syntax can take an optional
In MySQL 5.0 and newer, there are also some system variables that influence the optimizer:
This variable tells the optimizer how exhaustively to examine partial plans. If your queries are taking a very long time in the “Statistics” state, you might try lowering this value.

optimizer_prune_level
This variable, which is enabled by default, lets the optimizer skip certain plans based on the number of rows examined.

Both options control optimizer shortcuts. These shortcuts are valuable for good performance on complex queries, but they can cause the server to miss optimal plans for the sake of efficiency. That’s why it sometimes makes sense to change them.

User-Defined Variables
It’s easy to forget about MySQL’s user-defined variables, but they can be a powerful technique for writing efficient queries. They work especially well for queries that benefit from a mixture of procedural and relational logic. Purely relational queries treat everything as unordered sets that the server somehow manipulates all at once. MySQL takes a more pragmatic approach. This can be a weakness, but it can be a strength if you know how to exploit it, and user-defined variables can help. User-defined variables are temporary containers for values, which persist as long as your connection to the server lives. You define them by simply assigning
to them with a mysql> You can then use the variables in most places an expression can go: mysql> Before we get into the strengths of user-defined variables, let’s take a look at some of their peculiarities and disadvantages and see what things you can’t use them for:
One of the most important features of variables is that you can assign a value to a variable and use the resulting value at the same time. In other words, an assignment is an L-value. Here’s an example that simultaneously calculates and outputs a “row number” for a query: mysql> This example isn’t terribly interesting, because it just shows that we can duplicate the table’s primary key. Still, it has its uses—one of which is ranking. Let’s write a query that returns the 10 actors who have played in the most movies, with a rank column that gives actors the same rank if they’re tied. We start with a query that finds the actors and the number of movies: mysql> Now let’s add the rank, which should be the same for all the actors who played in 35 movies. We use three variables to do this: one to keep track of the current rank, one to keep track of the previous actor’s movie count, and one to keep track of the current actor’s movie count. We change the rank when the movie count changes. Here’s a first try: mysql> Oops—the rank and count never got updated from zero. Why did this happen? It’s impossible to give a one-size-fits-all answer. The problem could be as simple as a misspelled variable name (in this example it’s
not), or something more involved. In this case, This is the type of inscrutable behavior you’ll often experience with MySQL’s user-defined variables. Debugging such problems can be tough, but it can really pay off. Ranking in SQL normally requires quadratic algorithms, such as counting the distinct number of actors who played in a greater number of movies. A user-defined variable solution can be a linear algorithm—quite an improvement. An easy solution in this case is to add another level of temporary tables to the query, using a subquery in the mysql> Most problems with user variables come from assigning to them and reading them at different stages in the query. For example, it doesn’t work predictably to assign them in the mysql> This happens because the mysql> This query returns every row in the table, because the mysql> Pop quiz: what will happen if you add the mysql> The answer to most unexpected user-defined variable behavior can be found by running The last example introduced another useful hack: we placed the assignment in the You can put variable assignments in all types of statements, not just It can be a little tricky to get the desired behavior, though. Sometimes the optimizer decides to consider the variables compile-time constants and refuses to perform assignments. Placing the assignments inside a function like With a little experimentation, you can do all sorts of interesting things with user-defined variables. Here are some ideas:
Be Careful with MySQL Upgrades
As we’ve said, trying to outsmart the MySQL optimizer usually is not a good idea. It generally creates more work and increases maintenance costs for very little benefit. This is especially relevant when you upgrade MySQL, because optimizer hints used in your queries might prevent new optimizer strategies from being used. The way the MySQL optimizer uses indexes is a moving target. New MySQL versions change how existing indexes can be used, and you should adjust your indexing practices as these new versions become available. For example, we’ve mentioned that MySQL 4.0 and older could use only one index per table per query, but MySQL 5.0 and newer can use index merge strategies. Besides the big changes MySQL occasionally makes to the query optimizer, each incremental release typically includes many tiny changes. These changes usually affect small things, such as the conditions under which an index is excluded from consideration, and let MySQL optimize more special cases. Although all this sounds good in theory, in practice some queries perform worse after an upgrade. If you’ve used a certain version for a long time, you have likely tuned certain queries just for that version, whether you know it or not. These optimizations may no longer apply in newer versions, or may degrade performance. If you care about high performance, you should have a benchmark suite that represents your particular workload, which you can run against the new version on a development server before you upgrade the production servers. Also, before upgrading, you should read the release notes and the list of known bugs in the new version. The MySQL manual includes a user-friendly list of known serious bugs. Most MySQL upgrades bring better performance overall; we don’t mean to imply otherwise. However, you should still be careful.