Krish,
You are quite correct. I misread the 'sar' output. I went back and
looked again at the "original". My mail reader had rendered it in a
variable width font, misaligning the columns. Had I read more carefully, I
would have seen that the systems do, indeed, report 70% idle, not 70% wio.
Sorry all. But then, I stated my assumptions for a reason...
Anyway, since this was the primary basis for my prior comments, I am
going to have to withdraw them. While they are not necessarily *wrong*, there
is little reason to say that they are right, either. :-(
Don did post a follow-up reporting that something (compressed backups, I
think, although it did not seem clear) had increased his backup throughput.
With 10g, RMAN compressed backups are quite CPU intensive. That enabling
compression has *improved* throughput does suggest that either I/O bandwidth
(e.g., backups across a slow network or written to a slow device) or I/O
contention *could be* a major contributor to backup time. On the other
hand, your (Krish's) observation that there are probably unused CPU cycles
available can also explain the same observations.
Don, to better understand the issue of I/O contention, remember how a
disk works. It has a spinning platter, and a (one!) moving arm. (Yes,
multi-actuator disk drives exist. But I haven't actually seen one for
decades; they are too costly to manufacture and most IT managers are so
fixated on metrics like $/GB that foolish matters like "throughput" and
response time are irrelevant. At least until well after the purchasing
decision has been made.)
To read a given block of data, the disk drive needs to move the arm to the
right track, and then wait for the required data to rotate beneath the
head. Sometimes it is a large movement ("seek"), and sometimes it is
small. Sometimes the required data comes up right under the head; at other
times you need to wait for a full rotation.
Simple first-year college math (which I confess to having mostly
forgotten nearly 20 years ago) shows pretty easily that for *random* I/Os,
you will need -- on average -- to move the arm half way across the disk, and
spin the platter for one half of a rotation. Once the required datablock is
beneath the read head, the data is read; usually in much less than a
millisecond.
Reading the spec sheets for almost any disk drive, you will find
published numbers for "average seek" time. Rotation time is easily computed
from the RPMs. Because the actual time to *read* (or write) the data is
almost negligible, these two numbers effectively determine the number of
Random IOs Per Second (RIOPS) your disk drive can do.
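If it helps to see the arithmetic, here is a quick back-of-the-envelope
sketch in Python. The 8ms seek and 7200 RPM figures are just typical,
assumed spec-sheet values -- not Don's actual hardware:

# Rough random-IOPS estimate from spec-sheet numbers.
# All figures are assumed, typical values -- illustration only.
avg_seek_ms = 8.0                          # published "average seek"
rpm = 7200                                 # spindle speed

full_rotation_ms = 60000.0 / rpm           # ~8.33 ms at 7200 RPM
avg_rotational_ms = full_rotation_ms / 2   # wait half a turn, on average

# Transfer time for one small block is nearly negligible, so:
avg_random_io_ms = avg_seek_ms + avg_rotational_ms
riops = 1000.0 / avg_random_io_ms

print("avg random I/O: %.2f ms" % avg_random_io_ms)    # ~12.2 ms
print("RIOPS:          %.0f" % riops)                   # roughly 80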
Now, that is roughly *half* the story.
The other half is that depending on how you actually *use* your disk your
actual experience can be much better -- or much worse -- than the "average"
values used to compute RIOPS. Basically, when you do very large sequential
reads or writes (the kind you might do during a backup or a full table scan)
the disk head hardly needs to move, and the data can (usually) be read with
next to *no* cost for seek time or rotational latency.
This is how you do much *better* than the average. Large sequential disk
I/Os can reach -- and sustain -- throughputs of about 50MB/s on most modern
disks.
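To put the two numbers side by side (again, assumed illustrative figures,
continuing the little Python sketch above):

# Rough random vs. sequential throughput for a single disk.
# Assumed, illustrative figures -- not measurements from Don's system.
riops = 82                   # from the ~8 ms seek / 7200 RPM estimate
block_kb = 8                 # a typical Oracle block size
random_mb_s = riops * block_kb / 1024.0

sequential_mb_s = 50.0       # the ballpark figure quoted above

print("random 8K reads:  ~%.2f MB/s" % random_mb_s)     # under 1 MB/s
print("sequential reads: ~%.0f MB/s" % sequential_mb_s)
print("ratio:            ~%.0fx" % (sequential_mb_s / random_mb_s))

The same spindle that streams 50MB/s sequentially delivers well under
1MB/s when every read requires a fresh seek.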
But, as I said, you can also do much *worse* than average. Backups or
file copies are the classic example of how this happens. Imagine you have
two partitions on a single disk, one on the inner edge, and one on the outer
edge, and that you need to copy data from one area to the other. (Does this
sound familiar? It should; this is probably what *your* backups are doing,
because your database and your backup volume occupy separate portions of the
same set of disks.) In this scenario, a *simplistic* copy procedure will
read one block of data from the first area, and then write it to the other.
This forces the disk to make a very large "seek" (head movement) for *every* I/O.
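In Python, that simplistic copy looks something like this. (The path names
and the 8K block size are made up for illustration; in real life the OS
page cache will batch some of this, but the pattern the disk ultimately
sees during a large copy is the same alternation.)

# Naive block-at-a-time copy between two areas of the SAME physical disk.
# /u01 and /u02 are hypothetical partitions on one spindle.
BLOCK = 8 * 1024

with open("/u01/source.dbf", "rb") as src, \
     open("/u02/backup.dbf", "wb") as dst:
    while True:
        block = src.read(BLOCK)   # head seeks to the inner partition...
        if not block:
            break
        dst.write(block)          # ...and then back out to the outer one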
It is many years since I actually benchmarked something like this (why
would I do it twice?). At the time, rather than copying from one partition
to another on the same disk, I benchmarked two concurrent backups (to tape),
one reading from a partition at the inner edge of the disk, and one reading
from the outer edge. In this particular situation, it was easy to
demonstrate that running the two backups *concurrently* was about 12x
*slower* than running them one at a time.
That was done with 1990's vintage disks. I would expect current disks to
show an even *larger* discrepancy on this particular test, but I won't go
into the unnecessary details of why I expect that.
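(If anyone cares to repeat the experiment on modern hardware, a crude
sketch along these lines would do. The file names are placeholders; use
files big enough to defeat the OS and array caches, and put them on the
same physical disk, as far apart as you can manage.)

# Crude serial-vs-concurrent read test -- a sketch, not a real benchmark.
import threading, time

FILES = ["/u01/bigfile1.dat", "/u02/bigfile2.dat"]    # placeholders
CHUNK = 1024 * 1024

def read_all(path):
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass

t0 = time.time()                       # serial: one file after the other
for p in FILES:
    read_all(p)
serial = time.time() - t0

threads = [threading.Thread(target=read_all, args=(p,)) for p in FILES]
t0 = time.time()                       # concurrent: both at once,
for t in threads:                      # fighting over the same heads
    t.start()
for t in threads:
    t.join()
concurrent = time.time() - t0

print("serial:     %.1f s" % serial)
print("concurrent: %.1f s" % concurrent)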
This is what *I/O contention* is all about. Two operations (in your
case, backup-reads and backup-writes) fight over the placement of the disk
heads, forcing very frequent (and potentially *large*) head movements and
resulting in much-less-than-optimal throughput.
Note that the scenario I described (copying data from one edge of a disk
to the other) is *one* way to cause I/O contention. Another way is the one
that I actually *tested* in the 90's -- multiple concurrent threads trying to
do I/O against the same device. Since you are using RAID-10, the latter is
less likely to affect you, although it could. (This is one reason I
suggested you try *reducing* the number of backup channels you use, and see
whether that increases or decreases your throughput.)
Note, however, that my last tests were done in the 1990's, when "smart"
disk arrays were a novelty, and stuff like non-volatile RAM cache was next
to non-existent. Smart disk hardware with caching and clever scheduling
algorithms *can*, at least theoretically, make the effects of I/O contention
much less pronounced.
*Whew*. That was a lot of typing!
Okay, so, in summary: I cannot say that I/O contention *is* the cause
of Don's performance issue. But it *could* very well be. Certainly, Don's
situation -- data and backups on common devices, and (possibly) excessive
concurrency -- fits the profile of systems that are *likely* to suffer from
I/O contention, and hopefully I have helped Don (and others) understand why.
By the way... I did mention before that Don has another reason to move
his backups to separate physical spindles, but I may have failed to say what
it was. Even RAID-10 disk sets sometimes go "poof". It would be *very*
embarrassing indeed to have this happen, and simultaneously lose both your
database and your backups! That is the make-sure-your-resume-is-up-to-date
kind of "embarrassing"...
Anyway, I hope this has been helpful...
On Nov 27, 2007 10:56 AM, wrote:
*Don's "sar" statistics show that during the backup, the system is
completely "busy", spending about 30% of its time in CPU, and 70% waiting on
I/O.*
Perhaps I missed an email in this thread. The sar statistics do not show
70% wait. The time spent here is in user CPU and therefore the disk is not
the bottleneck in this case. This is a problem with a single channel/thread
going as fast as it can given the CPU capability. Making the rman disk
targets into multiple disk targets may (at best) get at most 5% back. All
other things being equal, in my view, you can't speed this up, but you can
scale this up to the extent you have excess cpu capacity (that said a
benchmark might be to create an uncompressed backup of say 4 data files and
then follow it up with compress/gzip and see what the resource utilization
and elapsed times are).
...
--
Cheers,
-- Mark Brinsmead
Senior DBA,
The Pythian Group
http://www.pythian.com/blogs
--
http://www.freelists.org/webpage/oracle-l