Wednesday, April 29, 2009

Linux I/O scheduler queue size and MyISAM performance

At MySQL Conference and Expo 2009, I explained how Linux I/O scheduler queue size affects MyISAM insert performance.

It is well known that Linux has provided four I/O schedulers (noop, deadline, anticipatory, and cfq) since kernel 2.6.10. The default in most distributions, including RHEL, is cfq, which is often not the best choice: noop normally outperforms it. I'll talk about this in other posts.
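To check which scheduler a device is using, you can read /sys/block/sdX/queue/scheduler, where the active scheduler is shown in square brackets. The following is a minimal sketch of parsing that line. The function names and the `sysfs_root` parameter are my own, for illustration only, and it assumes the usual "noop anticipatory deadline [cfq]" layout.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split a line such as 'noop anticipatory deadline [cfq]'.

    Returns (active, available). active is None if no name is bracketed.
    """
    names = line.split()
    active = None
    for name in names:
        if name.startswith("[") and name.endswith("]"):
            active = name[1:-1]
    available = [name.strip("[]") for name in names]
    return active, available


def current_scheduler(device, sysfs_root="/sys/block"):
    """Read the active scheduler for a device such as 'sda'.

    sysfs_root is a hypothetical parameter so the function can be
    pointed at a test directory instead of the real sysfs.
    """
    text = (Path(sysfs_root) / device / "queue" / "scheduler").read_text()
    return parse_scheduler_line(text)[0]
```

For example, `current_scheduler("sda")` would return the bracketed name for sda on a machine that exposes this file.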

The Linux I/O scheduler also sorts incoming I/O requests in its request queue for optimization. The queue size is configurable: it is exposed as "/sys/block/sdX/queue/nr_requests", and you can change the queue length as follows.

# cat /sys/block/sda/queue/nr_requests
128
# echo 100000 > /sys/block/sda/queue/nr_requests

Changing the queue size is effective even with the noop scheduler.
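If you prefer to script the change, the same read and write can be done from Python. This is only a sketch: the function names and the `sysfs_root` parameter are hypothetical, and writing to the real sysfs file requires root. The value is read back after writing because the kernel decides what actually takes effect, and the setting does not survive a reboot.

```python
from pathlib import Path


def nr_requests_path(device, sysfs_root="/sys/block"):
    """Build the path to nr_requests for a device such as 'sda'."""
    return Path(sysfs_root) / device / "queue" / "nr_requests"


def get_queue_size(device, sysfs_root="/sys/block"):
    """Return the current I/O scheduler queue size as an integer."""
    return int(nr_requests_path(device, sysfs_root).read_text().strip())


def set_queue_size(device, size, sysfs_root="/sys/block"):
    """Write a new queue size and return the value read back afterwards."""
    if size < 1:
        raise ValueError("queue size must be a positive integer")
    nr_requests_path(device, sysfs_root).write_text(f"{size}\n")
    return get_queue_size(device, sysfs_root)
```

Calling `set_queue_size("sda", 100000)` as root corresponds to the echo command above.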

Here are benchmark results for MyISAM insert-intensive loads with different I/O scheduler queue sizes. Details are in the slides from the UC.






Increasing the queue size was clearly very helpful for HDD, but not for SSD.
Let me explain the background.

On the Linux side, I/O requests are handled in the following order:
system calls (pwrite, etc.)
-> Filesystem
-> I/O scheduler
-> Device Driver/Disks

The I/O scheduler sorts incoming I/O requests by logical block address, then sends them to the device driver.
The I/O scheduler does not depend on the storage device, so it helps in some respects (e.g. minimizing disk seek overhead) but not in others (e.g. minimizing disk rotation overhead).

On the other hand, command queuing (TCQ/NCQ) also sorts I/O requests for optimization. SAS and recent SATA II disks support command queuing. Its goal partly overlaps with that of the I/O scheduler, but TCQ can minimize not only disk seeks but also disk rotation overhead (see the link to Wikipedia). The disadvantage of TCQ is that its queue size is very limited (normally 32-64).


Based on the above, it would be ideal to sort almost all random I/O requests in the I/O scheduler and then send them to TCQ.

Suppose 100,000 random read I/O requests arrive.
When the I/O scheduler queue size is 128 (the default in many cases), TCQ receives I/O requests in almost random order, so each request incurs a high disk seek overhead (the requests within a single queue are dispersed across the disk).

When the I/O scheduler queue size is 100,000, TCQ receives I/O requests in fully sorted order, so seek overhead can be much smaller.

Increasing the queue size has no effect on SSD because no disk seeks happen.
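The effect of queue size on seek distance can be illustrated with a toy model. This is a simplified sketch, not how the kernel's schedulers actually work: it assumes requests are taken in batches of `queue_size`, each batch is sorted by block address and served in one ascending sweep, and it ignores rotation, TCQ reordering, request merging, and deadlines. The disk size and the seed are arbitrary values I picked for illustration.

```python
import random


def total_seek_distance(lbas, queue_size):
    """Serve requests in batches of queue_size, each batch sorted by LBA.

    Returns the total distance (in blocks) the head travels, starting
    from block 0. A larger batch means fewer, longer sorted sweeps.
    """
    head = 0
    travelled = 0
    for start in range(0, len(lbas), queue_size):
        for lba in sorted(lbas[start:start + queue_size]):
            travelled += abs(lba - head)
            head = lba
    return travelled


def main():
    rng = random.Random(2009)      # fixed seed so runs are repeatable
    disk_blocks = 100_000_000      # hypothetical disk size in blocks
    requests = [rng.randrange(disk_blocks) for _ in range(100_000)]
    for queue_size in (1, 128, 100_000):
        distance = total_seek_distance(requests, queue_size)
        print(f"queue size {queue_size:>7}: head travel {distance} blocks")


if __name__ == "__main__":
    main()
```

In this model a queue of 1 means no sorting at all, 128 gives many short sorted sweeps that each cross most of the disk, and 100,000 gives a single sweep over all requests.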

This would explain my benchmarking results.

The I/O scheduler queue size setting is not helpful for InnoDB, because InnoDB internally sorts I/O requests to reduce disk seek overhead and sends only a limited number of I/O requests at a time, controlled by its internal I/O threads. So the roles of InnoDB itself and the I/O scheduler queue overlap. Note that TCQ does improve InnoDB throughput, because it significantly reduces disk rotation overhead, and that optimization cannot be done from the application or kernel side.

MyISAM does nothing special (it depends heavily on the OS), so a larger queue helps.



Updated on Apr 30: Added detailed benchmark conditions for those who are interested.

Here is the test script. I ran a single-threaded stored procedure on the same machine.
create table aa (
  id int auto_increment primary key,
  b1 int,
  b2 int,
  b3 int,
  c varchar(100),
  index(b1), index(b2), index(b3)
) engine=myisam;

drop procedure if exists sp_aa;
delimiter //
create procedure sp_aa(IN count INTEGER)
BEGIN
  DECLARE time_a, time_b BIGINT DEFAULT 0;
  DECLARE done INTEGER DEFAULT 0;
  DECLARE i INTEGER DEFAULT 1;
  WHILE done != 1 DO
    -- random values in the three indexed columns cause random index writes
    insert into aa values (i, rand()*count, rand()*count, rand()*count, repeat(1,40));
    SET i = i + 1;
    -- every 1,000,000 rows, print the row count and the seconds elapsed
    IF i % 1000000 = 1 THEN
      SELECT unix_timestamp() into time_a from dual;
      SELECT i, from_unixtime(time_a), time_a - time_b from dual;
      SET time_b = time_a;
    END IF;
    IF i > count THEN
      SET done = 1;
    END IF;
  END WHILE;
END;
//
delimiter ;

mysql test -vvv -e "call sp_aa(300000000)"


Then wait for a long long time...
# Default insert. no insert-delayed, no disable-keys, no delay-key-write, no mmap


H/W, OS, MySQL settings
Sun Fire X4150
CPU: Intel Xeon, 8 cores
RAM: 32GB (filesystem cache limited to 5GB; no swapping happened)
HDD: SAS 15,000RPM, 2 disks, RAID 1, write cache enabled
SSD: Intel X25-E, Single drive, write cache enabled
OS: Red Hat Enterprise Linux 5.3 (2.6.18-128.el5)
Filesystem: ext3
I/O Scheduler: deadline
MySQL 5.1.33
key_buffer_size: 2G


i/o stats:
queue size=128 (default)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.08   24.69    0.00   75.22

Device: rrqm/s  wrqm/s   r/s     w/s rMB/s wMB/s avgrq-sz avgqu-sz     await svctm  %util
sdb       0.00    0.87  0.00  575.60  0.00  2.25     8.01   142.44    247.03  1.74 100.00
(After running for 12 hours, 13 million rows had been inserted)


queue size=100000
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.06    0.00    5.32   29.66    0.00   62.96

Device: rrqm/s  wrqm/s   r/s     w/s rMB/s wMB/s avgrq-sz avgqu-sz     await svctm  %util
sdb       0.00 2487.33  0.00 2042.60  0.00 35.11    35.20 84402.36  43582.33  0.49 100.01
(After running for 1.5 hours, 41 million rows had been inserted)
...


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    0.19   24.82    0.00   74.91

Device: rrqm/s  wrqm/s   r/s     w/s rMB/s wMB/s avgrq-sz avgqu-sz     await svctm  %util
sdb       0.00    9.03  1.70  756.03  0.01  5.56    15.04 31981.72 127560.45  1.32 100.00
(After running for 12 hours, 77 million rows had been inserted;
index size was 4.5 GB)
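As a sanity check, the iostat figures above can be tied together with simple arithmetic. iostat reports avgrq-sz in 512-byte sectors, so requests per second times average request size should roughly reproduce the MB/s columns. The sketch below uses only numbers taken from the output above, and the helper names are my own. Note that the insert rates are averages over different elapsed times, so the comparison between them is only approximate.

```python
SECTOR_BYTES = 512  # iostat reports avgrq-sz in 512-byte sectors


def throughput_mb_per_s(requests_per_s, avg_request_sectors):
    """Estimate MB/s from request rate and average request size."""
    return requests_per_s * avg_request_sectors * SECTOR_BYTES / (1024 * 1024)


def rows_per_hour(rows, hours):
    """Average insert rate over the whole run."""
    return rows / hours


# (label, reads+writes per second, avgrq-sz, rows inserted, hours elapsed)
samples = [
    ("queue size=128, 12 hours", 0.00 + 575.60, 8.01, 13e6, 12),
    ("queue size=100000, 1.5 hours", 0.00 + 2042.60, 35.20, 41e6, 1.5),
    ("queue size=100000, 12 hours", 1.70 + 756.03, 15.04, 77e6, 12),
]

for label, iops, sectors, rows, hours in samples:
    mb = throughput_mb_per_s(iops, sectors)
    rate = rows_per_hour(rows, hours)
    print(f"{label}: {mb:.2f} MB/s, {rate / 1e6:.2f} million rows/hour")
```

Over the full 12 hours, the run with queue size 100,000 inserted roughly six times as many rows as the run with queue size 128 (77 million versus 13 million).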


If the test ran for many more hours (or days), so that the index size exceeded the filesystem cache, a lot of disk reads would happen. (I didn't have time to run that test before the UC.)

13 comments:

peter-zaitsev said...

Yoshinori,

I really think you're getting into some side effect here. If you would publish scripts I would be very curious to redo the run to see if I can repeat the same behavior.

It also would be interesting to see the IO data from the test. When you're inserting data you have potential of doing reads and writes. Reads are issued one by one anyway so they can't be affected by queue depth. It is possible scheduler algorithm causes side effect of having myisam files writes delayed.

Yoshinori Matsunobu said...

Hi Peter,

I published detailed test scripts. If you need more information please let me know.
I didn't have time to fully test the noop/anticipatory/cfq schedulers, but in runs of a couple of hours I saw similar behavior on the noop scheduler (increasing the queue size improved throughput for this test).

Mark Callaghan said...

This is very interesting. Do you have results for a queue size larger than 128 but not as large as 100,000?

Yoshinori Matsunobu said...

Hi Mark,

I haven't tested that yet. In theory, throughput should be higher than with 128 but lower than with 100,000. For example, with a setting of 10,000, avgqu-sz stays below 10,000, so fewer than 10,000 I/O requests are sorted at a time. The remaining I/O requests are not sorted, so it is less efficient than a setting of 100,000.

Log Buffer said...

"Yoshinori Matsunobu’s item illuminating Linux I/O scheduler queue size and MyISAM performance attracted some comment..."

Log Buffer #145

Anonymous said...

So a trade off between latency and throughput when you know your load isn't latency sensitive. Nice tweak.

Anonymous said...

How about arrays with more than 2 spindles? For instance, my data array has 34 in a RAID 10. I would think that the spindle count would have an impact on what is optimal.

Anonymous said...

Is it necessary to shut down the Oracle DB before changing the value? And is it necessary to reboot the server after changing the value?

Yoshinori Matsunobu said...

Not needed. The change is dynamically updated.

Anonymous said...

Yoshinori,
I wonder if changing the "queue size" parameter affects the performance of other types of queries (select or update, for example)?
Sorry for my bad English...
Thanks.

Sumant said...

How about arrays with more then 2 spindles? For instance my Data Array has 34 in a raid 10.



Nick said...

I just increased the queue size to 100,000 and my WinXP VM started 15 seconds faster- from 60 to 45 seconds. Worth investigating. Thanks a lot!

Sildenafil Citrate said...

wow this is great I also managed to increase the Queue size to 200,000 and the noop scheduler performance increased a lot, I did not know that I could configure the Queue size, thank you very much!
