| MySQL 5.1 partitions in practice |
|
This article explains how to test the performance of a large database with MySQL 5.1, showing the advantages of using partitions.
The test database uses data published by the US Bureau of Transportation Statistics. Currently, the data consists of ~ 113 million records (7.5 GB data + 5.2 GB index). Getting and loading the dataThe data for this exercise comes from the Bureau of Transportation Statistics. The data is provided as CSV files, and they are available from July 1987 to July 2007 (at the time of writing). Problem specificationThe test affects a database that is bigger than the amount of RAM in the server, and also the indexes are bigger than the RAM. The reasoning behind this specification is that data warehouses have data collections that are far beyond any reasonable amount of RAM that you can possibly install, occupying several terabytes of storage. In normal database storage, especially for OLTP, the indexes are cached in RAM, to allow for fast retrieval of records. When the data reaches sizes that can't be contained in the available RAM, we need to use a different approach. One of MySQL 5.1 main features is partitioning, a technique that divides a table into logical portions to speed-up retrievals. Using MySQL 5.1 partitions looks simple in principle, but there are some tricky points to be aware of while setting the data for maximum performance. Partitioning overviewThe current implementation of partitioning in MySQL 5.1 is quite simple. You can partition your data by
Depending on your needs, you may choose different partitioning types. In this article we concentrate on range partitioning, which is perhaps the most interesting for data warehousing. MySQL partitioning has some constraints that you must be aware of if you want to use this feature effectively. The first limitation is the one that has the biggest impact on your design decisions. If the column that you need to use for partitioning is not an integer, you need to use a function to transform it. Some additional constraints apply to partitions, as described in the manual, but we are not concerned about it here. Partitioning gotchasUsing date columnsWhat is relevant in this context is the usage of date columns for partitioning. Since the native data type is not supported, we must convert the date into an integer. In addition to the list of allowed functions, we must take into account the fact that only two date functions can trigger the partition pruning. Thus, if we have to deal with a date column, we need to use one of them (YEAR or TO_DAYS). When using the YEAR() function, partitioning is easy, readable, and straightforward.
Partitioning by month is trickier. You can't use the MONTH() for two reasons: Thus, you need to use the other function that is optimized for partition pruning, TO_DAYS.
That's already less clear to read than the one partitioned by year. What's worse is that the server won't retain your values, but it will only save the corresponding integers. This is what you get for the above table:
It is advisable to save a copy of the script used to create tables partitioned by month, if you want to have a readable reference of what each partition means. partitioning by function, searching by columnOne common mistake that is made when using tables partitioned using a date column is to query by the same function used for partitioning. For example if your table was created with the clause
and you are told that YEAR and TO_DAYS are optimized for partition pruning, it seems logical to use a query like
The partition pruning does not kick, as you can see from EXPLAIN
The query is doing a full table scan. The meaning of "optimized for partition pruning" is that the search will use partitions when that column is used in the WHERE clause. The right way of querying is
And now the partition pruning is being used:
This shows that the query does not cause a full table scan, but it will use only one partition. Using primary keys and indexesOne of the problems I had while testing partitioned tables with a large dataset is that I could not get the performance improvement that I was expecting. That was a mistake. A primary key on a table so large that its indexes can't fit in memory is not efficient. To get records from such a table means accessing the disk frequently. Your performance depends completely on the speed of your disk and your processor. Looking at what others do in data warehousing design, I found out that it's common practice to design large DW sets without using indexes. Testing methodIn this test, I wanted to compare performance of a large dataset using MyISAM, InnoDB, and Archive storage engines. Each topology is tested on a dedicated instance of MySQL server, containing only one database with one table. The server instances were created using MySQL Sandbox.
To compare the effects of partitions on large and small datasets, I created 9 more instances, each containing slightly less than 2 GB of data, to see if the results were different. The results were recorded in yet another database instance, for better comparison. The queries used for this test are of two types
For each query type, I generated queries for different date ranges. For each range, I generated a set of additional queries on adjacent dates. The first query for each range is cold, meaning that such range was hit for the first time. Subsequent queries on the same range are warm, meaning that the range was already at least partially cached. The list of queries used for the test is on the Forge Resultspartitioned tables with primary keyLet's start with the wrong approach first. My first batch of tests used partitioned tables with a composite primary key, the same used in the original table. The total size of the PK was 5.5 GB. Rather than improving performance, PK slowed down the operations. Queries with partitioned tables, burdened with the index search through a PK that can't fit in RAM (4 GB) performed poorly. This is a lesson to be remembered. Partitions are useful, but they must be used in the right way.
The solution was in front of my eyes, but I refused to see it at first. Look at the results for the same set of queries using the ARCHIVE storage engine.
The results for the table partitioned by month are better than the ones I got on the corresponding MyISAM table. More comments about this fact later. partitioned tables without primary keySince the performance on partitioned tables using primary keys was so bad, I decided to jump the gun, and get rid of the primary key. The new approach was successful. Using partitions only, without primary keys, I got what I wanted. A significant improvement in performance. Tables partitioned by month get a 70 to 90% performance gain.
To make the difference more visible, I tested with two massive queries that should take advantage of the partition pruning mechanism.
The results show a performance gain of 30 to 60% for the table partitioned by month, and a gain of 15 to 30% for the table partitioned by year.
The processor factorI had some trouble when I moved the test to a different server. The above tests were taken on my home desktop, which uses an Intel Dual Core 2.3 MHz CPU. The new server is much faster, with a dual Xeon 2.66 MHz.
The original table, with primary key, is faster than the partitioned ones. The times for partitioned tables are the same I got on the slower server, but the performance of the original table has improved, making partitions unnecessary. What to do? Since this server seems to take advantage of indexes so well, I added an index on the partitioning column to the partitioned tables.
The results were then much more satisfactory, with a 35% performance gain.
Lessons learnedTesting partitions has been a tiring experience. Gaining performance improvements is not a straightforward operation. What I assumed was a painless change turned out to be a long trial-and-error process. There is no silver bulletApplying a partition change to a table is no guarantee for performance improvement. The gain depends on several factors:
Nothing should be taken for granted. Run benchmarks before applying changes to a production systemDepending on the usage of your database, you may get a huge performance gain, or nothing at all. You can also get a performance decrease, if you are not careful. Archive tables can be an excellent compromiseArchive tables achieve a huge performance gain when partitioned. Again, it depends on your usage. Without partitioning, any query to an Archive table is a full table scan. If you have historical data that does not change and you need to perform statistical queries by range, the Archive engine is an excellent choice. It uses 10 to 20% of the original storage, and it can perform better than the original MyISAM or InnoDB table for aggregate queries. Summing upPartitioning is the key for performance gain with large databases. The definition of large depends on the hardware at your disposal. Applying partitions blindly is no guarantee of achieving performance improvements, but with the aid of some preliminary benchmarks it can become your perfect solution. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||