EverSQL is a platform that intelligently tunes your SQL queries by providing query optimization recommendations, and feedback on missing indexes. This is the second post of our EverSQL series, if you missed our introductory post take a look there first and then come back to this article.
We’ll use the Stackoverflow data set again as we did in our first post.
Diving into query optimization
We’ll grab the worst performing query in the list from PMM and optimize it. This query builds a list of the top 50 most recent posts which have a score greater than two, and involves joining two large tables – posts and comments. The original runtime of that query is above 20 minutes and causes high load on the server while running.
Assuming you have EverSQL’s chrome extension installed, you’ll see a new button in the PMM Query Analytics page, allowing you to send the query and schema structure directly to EverSQL, to retrieve indexing and query optimization recommendations.
After implementing EverSQL’s recommendations, the query’s execution duration significantly improved:
So what was the actual optimization in this specific case? And why did it work so well? Let’s look at the original query:
SELECT p.title FROM so.posts p INNER JOIN so.comments c ON p.id = c.postid WHERE c.score > 2 GROUP BY p.id ORDER BY p.creationdate DESC LIMIT 100;
The tables’ structure:
CREATE TABLE `posts` ( `Id` int(11) NOT NULL, `CreationDate` datetime NOT NULL, ... PRIMARY KEY (`Id`), KEY `posts_idx_creationdate` (`CreationDate`), ) ENGINE=InnoDB DEFAULT CHARSET=latin1 CREATE TABLE `comments` ( `Id` int(11) NOT NULL, `CreationDate` datetime NOT NULL, `PostId` int(11) NOT NULL, `Score` int(11) DEFAULT NULL, .... PRIMARY KEY (`Id`), KEY `comments_idx_postid` (`PostId`), KEY `comments_idx_postid_score` (`PostId`,`Score`), KEY `comments_idx_score` (`Score`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1
This query will return the post title of the latest 100 stackoverflow posts, which had at least one popular comment (with a score higher than two). The posts table contains 39,646,923 records, while the comments table contains 64,510,258 records.
This is the execution plan MySQL (v5.7.20) chose:
One of the challenges with this query is that the GROUP BY and ORDER BY clauses contain different fields, which prevent MySQL from using an index for the ORDER BY. As MySQL’s documentation states:
“In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it may still use indexes to find the rows that match the WHERE clause. Examples: … The query has different ORDER BY and GROUP BY expressions.”.
Now let’s look into the optimized query:
SELECT p.title FROM so.posts p WHERE EXISTS( SELECT 1 FROM so.comments c WHERE p.id = c.postid AND c.score > 2) ORDER BY p.creationdate DESC LIMIT 100;
Since the comments table is joined in this query only to check for existence of matching records in the posts table, we can use an EXISTS subquery instead. This will allow us to avoid inflating the results (by using JOIN) and then deflating them (by using GROUP BY), which are costly operations.
Now that the GROUP BY is redundant and removed, the database can optionally choose to use an index for the ORDER BY clause.
The new execution plan MySQL chooses is:
As mentioned above, this transformation reduced the query execution duration from ~20 minutes to 370ms.
Co-Author: Tomer Shay
Tomer Shay is the Founder of EverSQL. He loves being where the challenge is. In the last 12 years, he had the privilege to code a lot and lead teams of developers, while focusing on databases and performance. He enjoys using technology to bring ideas into reality, help people and see them smile.