On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas wrote:
On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch wrote:
On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas wrote:
On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wrote:
The double write buffer is one of the few areas where InnoDB does more
IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
checkpoints (which help to keep dirty pages in memory longer),
buffering of writing out changes to secondary indexes, and recently
tunable page level compression.
Baron Schwartz was talking to me about this at Surge.  I don't really
understand how the fuzzy checkpoint stuff works, and I haven't been
able to find a good description of it anywhere.  How does it keep
dirty pages in memory longer?  Details on the other things you mention
would be interesting to hear, too.
For checkpoint behavior:

I would think that best case behavior "sharp" checkpoints with a large
checkpoint_completion_target would have behavior similar to a fuzzy
Well, under that definition of a fuzzy checkpoint, our checkpoints are
fuzzy even with checkpoint_completion_target=0.

What Baron seemed to be describing was a scheme whereby you could do
what I might call partial checkpoints.  IOW, you want to move the redo
pointer without writing out ALL the dirty buffers in memory, so you
write out the pages with the oldest LSNs and then move the redo
pointer to the oldest LSN you have left.  Except that doesn't quite
work, because the page might have been dirtied at LSN X and then later
updated again at LSN Y, and you still have to flush it to disk before
moving the redo pointer to any value >X.  So you work around that by
maintaining a "first dirtied" LSN for each page as well as the current

I'm not 100% sure that this is how it works or that it would work in
PG, but even assuming that it is and does, I'm not sure what the
benefit is over the checkpoint-spreading logic we have now.  There
might be some benefit in sorting the writes that we do, so that we can
spread out the fsyncs.  So, write all the blocks to a give file,
fsync, and then repeat for each underlying data file that has at least
one dirty block.  But that's completely orthogonal to (and would
actually be hindered by) the approach described in the preceding
I wish I could answer your questions better. I am a power user that
does not fully understand InnoDB internals. There are not all that
many folks that have a very good understanding of InnoDB internals
(given how well it works there is not all that much need).
Insert (for innodb 1.1+ evidently there is also does delete and purge)
We do something a bit like this for GIST indices.  It would be
interesting to see if it also has a benefit for btree indices.
For a recent ~800GB db I had to restore, the insert buffer saved 92%
of io needed for secondary indexes.


For many workloads 50% compression results in negligible impact to
performance. For certain workloads compression can help performance.
Please note that InnoDB also has non-tunable toast like feature.
Interesting.  I am surprised this works well.  It seems that this only
works for pages that can be compressed by >=50%, which seems like it
could result in a lot of CPU wasted on failed attempts to compress.
In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me.
Given that InnoDB is not shipping its logs across the wire, I don't
think many users would really care if it used the double writer or
full page writes approach to the redo log (other than the fact that
the log files would be bigger). PG on the other hand *is* pushing its
logs over the wire...
So how is InnoDB doing replication?  Is there a second log just for that?
The other log is the "binary log" and it is one of the biggest
problems with MySQL. Running MySQL in such a way that the binary log
stays in sync with the InnoDB redo has a very significant impact on
(check out the pretty graph)
Hmm.  That seems kinda painful.  Having to ship full page images over
the wire doesn't seems so bad by comparison, though I'm not very happy
about having to do that either.
The binary log is less than ideal, but with MySQL replication I can
replicate to *many* servers that are *very* geographically distributed
without all that many headaches. In addition it is simple enough that
I can have junior DBA manage it. I have doubts that I could make PG
do the same anywhere near as easily, particularly given how long and
narrow some pipes are...

Rob Wultsch

Search Discussions

Discussion Posts


Follow ups

Related Discussions



site design / logo © 2018 Grokbase