New Linux xfs/reiser file systems
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.

With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?

I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


  • Alfred Perlstein at May 2, 2001 at 9:28 pm

    * Bruce Momjian [010502 14:01] wrote:
    I was talking to a Linux user yesterday, and he said that performance
    using the xfs file system is pretty bad. He believes it has to do with
    the fact that fsync() on log-based file systems requires more writes.

    With a standard BSD/ext2 file system, WAL writes can stay on the same
    cylinder to perform fsync. Is that true of log-based file systems?

    I know xfs and reiser are both log based. Do we need to be concerned
    about PostgreSQL performance on these file systems? I use BSD FFS with
    soft updates here, so it doesn't affect me.
    The "problem" with log based filesystems is that they most likely
    do not know the consequences of a write so an fsync on a file may
    require double writing to both the log and the "real" portion of
    the disk. They can also exhibit the problem that an fsync may
    cause all pending writes to require scheduling unless the log is
    constructed on the fly rather than incrementally.

    There was also a problem brought up recently: certain versions (maybe
    all?) of Linux perform fsync() in a very non-optimal manner; if the user
    is able to open the file with the O_FSYNC flag rather than calling
    fsync(), he may see a performance increase.
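
    For illustration, a minimal sketch of the two code paths in question: an
    explicit fsync() after each WAL write versus opening the log with O_SYNC
    (spelled O_FSYNC on some BSDs). Function names are hypothetical and error
    handling is omitted.

        #include <fcntl.h>
        #include <unistd.h>

        /* Path 1: buffered write followed by an explicit flush.  fsync() may
         * also force out unrelated dirty metadata on some filesystems. */
        void wal_append_fsync(int fd, const void *buf, size_t len)
        {
            write(fd, buf, len);
            fsync(fd);
        }

        /* Path 2: open the log with O_SYNC (O_FSYNC on some BSDs); every
         * write() then returns only after the data reaches stable storage. */
        int wal_open_sync(const char *path)
        {
            return open(path, O_WRONLY | O_APPEND | O_SYNC, 0600);
        }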

    But his guess is probably nearly as good as mine. :)


    --
    -Alfred Perlstein - [alfred@freebsd.org]
    http://www.egr.unlv.edu/~slumos/on-netbsd.html
  • Bruce Momjian at May 2, 2001 at 9:38 pm

    The "problem" with log based filesystems is that they most likely
    do not know the consequences of a write so an fsync on a file may
    require double writing to both the log and the "real" portion of
    the disk. They can also exhibit the problem that an fsync may
    cause all pending writes to require scheduling unless the log is
    constructed on the fly rather than incrementally.
    Yes, this double-writing is a problem. Suppose you have your WAL on a
    separate drive. You can fsync() WAL with zero head movement. With a
    log based file system, you need two head movements, so you have gone
    from zero movements to two.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Alfred Perlstein at May 2, 2001 at 11:06 pm

    * Bruce Momjian [010502 15:20] wrote:
    The "problem" with log based filesystems is that they most likely
    do not know the consequences of a write so an fsync on a file may
    require double writing to both the log and the "real" portion of
    the disk. They can also exhibit the problem that an fsync may
    cause all pending writes to require scheduling unless the log is
    constructed on the fly rather than incrementally.
    Yes, this double-writing is a problem. Suppose you have your WAL on a
    separate drive. You can fsync() WAL with zero head movement. With a
    log based file system, you need two head movements, so you have gone
    from zero movements to two.
    It may be worse depending on how the filesystem actually does
    journalling. I wonder if an fsync() may cause ALL pending
    meta-data to be updated (even metadata not related to the
    postgresql files).

    Do you know if reiser or xfs have this problem?

    --
    -Alfred Perlstein - [alfred@freebsd.org]
    Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
  • Bruce Momjian at May 3, 2001 at 12:18 am

    Yes, this double-writing is a problem. Suppose you have your WAL on a
    separate drive. You can fsync() WAL with zero head movement. With a
    log based file system, you need two head movements, so you have gone
    from zero movements to two.
    It may be worse depending on how the filesystem actually does
    journalling. I wonder if an fsync() may cause ALL pending
    meta-data to be updated (even metadata not related to the
    postgresql files).

    Do you know if reiser or xfs have this problem?
    I don't know, but the Linux user reported xfs was really slow.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Thomas graichen at May 5, 2001 at 7:49 pm

    Bruce Momjian wrote:
    Yes, this double-writing is a problem. Suppose you have your WAL on a
    separate drive. You can fsync() WAL with zero head movement. With a
    log based file system, you need two head movements, so you have gone
    from zero movements to two.
    It may be worse depending on how the filesystem actually does
    journalling. I wonder if an fsync() may cause ALL pending
    meta-data to be updated (even metadata not related to the
    postgresql files).

    Do you know if reiser or xfs have this problem?
    I don't know, but the Linux user reported xfs was really slow.
    i think this should be tested in more detail: i once tried this
    lightly (running pgbench against postgresql 7.1beta4) with
    different filesystems: ext2, reiserfs and XFS, and reproducibly
    i got about 15% better results running on XFS ... ok - it's
    not a very big test, but i think it might be worth really
    doing an a/b test before taking it as a fact that postgresql is
    slow on XFS (and maybe reiserfs too ... but reiserfs has had
    performance problems in certain situations anyway)

    XFS is a journaling fs, but it does all its work in a very
    clever way (delayed allocation etc.) - so under normal
    conditions you should usually get decent performance out of it -
    otherwise it might be worth sending a mail to the XFS
    mailing list (reiserfs maybe ditto)

    t

    --
    thomas graichen <tgr@spoiled.org> ... perfection is reached, not
    when there is no longer anything to add, but when there is no
    longer anything to take away. --- antoine de saint-exupery
  • Mark L. Woodward at May 3, 2001 at 12:05 pm

    Bruce Momjian wrote:

    I was talking to a Linux user yesterday, and he said that performance
    using the xfs file system is pretty bad. He believes it has to do with
    the fact that fsync() on log-based file systems requires more writes.

    With a standard BSD/ext2 file system, WAL writes can stay on the same
    cylinder to perform fsync. Is that true of log-based file systems?

    I know xfs and reiser are both log based. Do we need to be concerned
    about PostgreSQL performance on these file systems? I use BSD FFS with
    soft updates here, so it doesn't affect me.
    I did see poor performance on reiserfs; I have not yet ventured into using
    xfs.

    It occurs to me that journaling file systems will almost always be slower
    for an application such as postgres. The journaling file system is trying
    to maintain data integrity for an application which is already trying to
    maintain data integrity. There will always be extra work involved.

    This behavior raises the question about file system usage in Postgres. Many
    databases, such as Oracle, create table space files and operate directly on the
    raw blocks, bypassing the file system altogether.

    On one hand, Postgres is easy to use and maintain because it cooperates with
    the native file system; on the other hand it incurs the overhead of whatever
    silliness the file system wants to do.

    I would bet it is a huge amount of work to use a "table space" system and no
    one wants that. lol. However, it should be noted that a bit more control over
    database layout would make some great performance improvements.

    The ability to put indexes on a separate volume from data.
    The ability to put different tables on different volumes.
    And so on.

    In the short term, I think poor performance on a journaling file system is
    to be expected, unless there is an IOCTL to tell the FS to leave the files
    alone (and postgres calls it). A Linux HOWTO that explains which file
    systems have performance issues, and why, should handle the problem.

    Perhaps we can convince the Linux community to create a "dbfs" which is a
    stripped down simple no nonsense file system designed for applications like
    databases?

    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com
  • Matthew Kirkwood at May 3, 2001 at 12:23 pm

    On Thu, 3 May 2001, mlw wrote:

    I would bet it is a huge amount of work to use a "table space" system
    and no one wants that.
    From some stracing of 7.1, the most common syscall issued by
    postgres is an lseek() to the end of the file, presumably to
    find its length, which seems to happen up to about a dozen
    times per (pgbench) transaction.

    Tablespaces would solve this (not that lseek is a particularly
    expensive operation, of course).
    Perhaps we can convince the Linux community to create a "dbfs" which
    is a stripped down simple no nonsense file system designed for
    applications like databases?
    Sync-metadata ext2 should be fine. Filesystems fsck pretty
    quick when they contain only a few large files.

    Otherwise, something like "smugfs" (now obsolete) might do.

    Matthew.
  • Tom Lane at May 3, 2001 at 1:33 pm

    Matthew Kirkwood writes:
    From some stracing of 7.1, the most common syscall issued by
    postgres is an lseek() to the end of the file, presumably to
    find its length, which seems to happen up to about a dozen
    times per (pgbench) transaction.
    Tablespaces would solve this (not that lseek is a particularly
    expensive operation, of course).
    No, they wouldn't; or at least they'd just create a different problem.
    The reason for the lseek is that the file length may have changed since
    the current backend last checked it. To avoid lseek we'd need some
    shared data structure that maintains the current length of every active
    table, which would be a nuisance to maintain and probably a source of
    contention delays.
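
    For reference, the call in question is just a zero-byte seek to the end of
    the relation file. A minimal sketch of the idiom (not the actual backend
    code; names are hypothetical):

        #include <sys/types.h>
        #include <unistd.h>

        /* Returns the relation's current length in blocks, even if another
         * backend has extended the file since we last looked. */
        off_t blocks_in_file(int fd, off_t block_size)
        {
            off_t end = lseek(fd, 0, SEEK_END);   /* current end of file */
            return (end < 0) ? -1 : end / block_size;
        }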

    (Of course, such a data structure would just be the tip of the iceberg
    of what we'd have to maintain for ourselves if we couldn't depend on the
    kernel to do it for us. Reimplementing a filesystem doesn't strike me
    as a profitable use of our time.)

    regards, tom lane
  • Bruce Momjian at May 3, 2001 at 3:42 pm

    Matthew Kirkwood writes:
    From some stracing of 7.1, the most common syscall issued by
    postgres is an lseek() to the end of the file, presumably to
    find its length, which seems to happen up to about a dozen
    times per (pgbench) transaction.
    Tablespaces would solve this (not that lseek is a particularly
    expensive operation, of course).
    No, they wouldn't; or at least they'd just create a different problem.
    The reason for the lseek is that the file length may have changed since
    the current backend last checked it. To avoid lseek we'd need some
    shared data structure that maintains the current length of every active
    table, which would be a nuisance to maintain and probably a source of
    contention delays.
    Seems we should cache the file lengths somehow. Not sure how to do it
    because our file system cache is local to each backend.

    (Of course, such a data structure would just be the tip of the iceberg
    of what we'd have to maintain for ourselves if we couldn't depend on the
    kernel to do it for us. Reimplementing a filesystem doesn't strike me
    as a profitable use of our time.)
    Ditto. The database is complicated enough.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Kaare Rasmussen at May 3, 2001 at 7:07 pm

    kernel to do it for us. Reimplementing a filesystem doesn't strike me
    as a profitable use of our time.)
    Ditto. The database is complicated enough.
    Maybe some kind of recommendation would be a good thing. That is, if the
    PostgreSQL community has enough knowledge.

    A section in the docs that discusses various file systems, so people can make
    an intelligent choice.

    --
    Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
    Kaki Data tshirts, merchandize Fax: 3816 2501
    Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk
    2000 Frederiksberg Lørdag 11.00-17.00 Email: kar@webline.dk
  • Bruce Momjian at May 3, 2001 at 3:41 pm

    I know xfs and reiser are both log based. Do we need to be concerned
    about PostgreSQL performance on these file systems? I use BSD FFS with
    soft updates here, so it doesn't affect me.
    I did see poor performance on reiserfs, I have not as yet ventured into using
    xfs.

    It occurs to me that journaling file systems will almost always be slower
    for an application such as postgres. The journaling file system is trying
    to maintain data integrity for an application which is already trying to
    maintain data integrity. There will always be extra work involved.
    Yes, the problem is that extra work is required on PostgreSQL's part.
    Log-based file systems make sure all the changes get onto the disk in an
    orderly way, but I believe they can delay what gets written to the drive.
    PostgreSQL wants to be sure all the data is on the disk, period.
    Unfortunately, the _orderly_ part makes the _fsync_ part do more work.
    By going from ext2 to a log-based file system, we are getting _farther_
    from a raw device than if we had just stayed with ext2.

    ext2 has serious problems with corrupt file systems after a crash, so I
    understand the need to move to another file system type. I have been
    waiting for Linux to get a more modern file system. Unfortunately, the
    new ones seem to be worse for PostgreSQL.
    This behavior raises the question about file system usage in Postgres. Many
    databases, such as Oracle, create table space files and operate directly on the
    raw blocks, bypassing the file system altogether.
    OK, we have considered this, but frankly, the new, modern file systems
    like FFS/softupdates have i/o rates near raw speed, with all the
    advantages a file system gives us. I believe most commercial dbs are
    moving away from raw devices and toward file systems. In the old days
    the SysV file system was pretty bad at i/o & fragmentation, so they used
    raw devices.
    The ability to put indexes on a separate volume from data.
    The ability to put different tables on different volumes.
    And so on.
    We certainly need that, but raw devices would not make this any easier,
    I think.
    In the short term, I think poor performance on a journalizing file system is to
    be expected, unless there is an IOCTL to tell the FS to leave the files alone
    (and postgres calls it). A Linux HOWTO which informs people that certain file
    systems will have performance issues and why should handle the problem.

    Perhaps we can convince the Linux community to create a "dbfs" which is a
    stripped down simple no nonsense file system designed for applications like
    databases?
    It could become a serious problem as people start using reiser/xfs for
    their file systems and don't understand the performance problems. Even
    more likely is that they will turn off fsync, thinking reiser doesn't
    need it, when in fact, I think it does.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Bpalmer at May 3, 2001 at 6:20 pm

    This behavior raises the question about file system usage in Postgres. Many
    databases, such as Oracle, create table space files and operate directly on the
    raw blocks, bypassing the file system altogether.
    OK, we have considered this, but frankly, the new, modern file systems
    like FFS/softupdates have i/o rates near raw speed, with all the
    advantages a file system gives us. I believe most commercial dbs are
    moving away from raw devices and toward file systems. In the old days
    the SysV file system was pretty bad at i/o & fragmentation, so they used
    raw devices.
    I'm starting to like the idea of raw FS for a few reasons:

    1) Considering that postgresql now does WAL, a logging FS for the
    database doesn't seem as necessary (is it needed at all?).

    2) Given the fact that postgresql is trying to support many OSs,
    depending on, for example, XFS on a linux system will cause many
    problems. What about solaris? How about BSD? Etc.. Using raw db MAY be
    easier than dealing with the problems that will arise from supporting
    multiple filesystems.

    That said, the ability to use the system's FS does have its advantages
    (backup, moving files, etc).

    Just some thoughts..

    - Brandon

    b. palmer, bpalmer@crimelabs.net
    pgp: www.crimelabs.net/bpalmer.pgp5
  • Michael Samuel at May 4, 2001 at 11:35 am

    On Thu, May 03, 2001 at 11:41:24AM -0400, Bruce Momjian wrote:
    ext2 has serious problems with corrupt file systems after a crash, so I
    understand the need to move to another file system type. I have been
    waitin for Linux to get a more modern file system. Unfortunately, the
    new ones seem to be worse for PostgreSQL.
    If you fsync() a directory in Linux, all the metadata within that directory
    will be written out to disk.
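
    As a small illustration of what that buys you (a hypothetical helper, not
    tied to any particular program): after creating and fsync()ing a file, its
    directory entry can be forced out by fsync()ing a descriptor opened on the
    directory itself.

        #include <fcntl.h>
        #include <unistd.h>

        /* Create a file and make both its data and its directory entry durable. */
        int create_durably(const char *dirpath, const char *filepath)
        {
            int fd = open(filepath, O_WRONLY | O_CREAT, 0600);
            if (fd < 0)
                return -1;
            if (fsync(fd) < 0) {                 /* flush the file itself */
                close(fd);
                return -1;
            }
            close(fd);

            int dfd = open(dirpath, O_RDONLY);   /* descriptor on the directory */
            if (dfd < 0)
                return -1;
            int rc = fsync(dfd);                 /* flush the new directory entry */
            close(dfd);
            return rc;
        }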

    As for filesystem corruption, I can say that e2fsck is among the best fsck
    programs out there, and I've only ever had one occasion where I've lost any
    data on an ext2 filesystem, and that was due to bad sectors causing me to
    lose the root directory. (Well, apart from human errors, but those don't
    count.)
    OK, we have considered this, but frankly, the new, modern file systems
    like FFS/softupdates have i/o rates near raw speed, with all the
    advantages a file system gives us. I believe most commercial dbs are
    moving away from raw devices and toward file systems. In the old days
    the SysV file system was pretty bad at i/o & fragmentation, so they used
    raw devices.
    And Solaris' 1/01 media has better support for O_DIRECT (?), which they claim
    gives you 93% of the speed of a raw device. (Or something like that; I read
    this in marketing material a couple of months ago)
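
    For what it's worth, here is a hedged sketch of what direct I/O looks like
    from user space; alignment rules differ per OS and filesystem, so this is
    purely illustrative and the helper name is made up.

        #define _GNU_SOURCE              /* for O_DIRECT on Linux */
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        /* Write one zero-filled block past the page cache.  O_DIRECT generally
         * requires the buffer, offset and length to be block-aligned. */
        int write_one_direct_block(const char *path, size_t blk)
        {
            void *buf;
            if (posix_memalign(&buf, blk, blk) != 0)
                return -1;
            memset(buf, 0, blk);
            int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
            if (fd < 0) {
                free(buf);
                return -1;
            }
            ssize_t n = write(fd, buf, blk);     /* bypasses the page cache */
            close(fd);
            free(buf);
            return (n == (ssize_t) blk) ? 0 : -1;
        }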

    Raw devices are designed to have filesystems on them. The only excuses for
    userland tools accessing them are fs-specific tools (eg. dump, fsck, etc),
    or non-unix filesystem tools, where the unix VFS doesn't handle things
    properly (hfstools).
    The ability to put indexes on a separate volume from data.
    The ability to put different tables on different volumes.
    And so on.
    We certainly need that, but raw devices would not make this any easier,
    I think.
    It would be cool if either at compile time or at database creation time, we
    could specify a printf-like format for placing tables, indexes, etc.
    It could become a serious problem as people start using reiser/xfs for
    their file systems and don't understand the performance problems. Even
    more likely is that they will turn off fsync, thinking reiser doesn't
    need it, when in fact, I think it does.
    ReiserFS only supports metadata logging. The performance slowdown must be
    due to logging things like mtime or atime, because otherwise ReiserFS is a
    very high performance FS. (Although, I admittedly haven't used it since it
    was early in its development.)

    --
    Michael Samuel <michael@miknet.net>
  • Mark L. Woodward at May 4, 2001 at 11:58 am

    Michael Samuel wrote:

    ReiserFS only supports metadata logging. The performance slowdown must be
    due to logging things like mtime or atime, because otherwise ReiserFS is a
    very high performance FS. (Although, I admittedly haven't used it since it
    was early in it's development)
    The way I understand it is that ReiserFS does not attempt to separate files
    at the block level. Multiple files can live in the same disk block. This is
    cool if you have many small files, but the extra overhead for large files,
    such as those used by a database, is a bit much.

    I read some stuff about a year ago, and my impressions forced me to conclude
    that ReiserFS was geared toward applications. Which is a pretty good thing for
    applications, but not for databases.

    I really think a simple low down dirty file system is just what the doctor
    ordered for postgres.

    Remember, general purpose file systems must do for files what Postgres is
    already doing for records. You will always have extra work. I am seriously
    thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
    or if there is just something fundamentally stupid about FAT32 that will make
    it worse?


    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com
  • Michael Samuel at May 4, 2001 at 1:50 pm

    On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote:
    The way I understand it is that ReiserFS does not attempt to separate files at
    the block level. Multiple files can live in the same disk block. This is cool
    if you have many small files, but the extra overhead for large files such as
    those used by a database, is a bit much.
    It should be at least as fast as other filesystems for large files. I suspect
    that it would be faster in fact. The only catch is that the performance of
    reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems)

    You can read about all this stuff at http://www.namesys.com/
    I really think a simple low down dirty file system is just what the doctor
    ordered for postgres.
    Traditional BSD FFS or Solaris UFS is probably the best bet for postgres.
    Remember, general purpose file systems must do for files what Postgres is
    already doing for records. You will always have extra work. I am seriously
    thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
    or if there is just something fundamentally stupid about FAT32 that will make
    it worse?
    Well, for starters, file permissions...

    Ext2 would kick arse over FAT32 for performance.

    --
    Michael Samuel <michael@miknet.net>
  • Bruce Momjian at May 4, 2001 at 4:49 pm

    On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote:
    The way I understand it is that ReiserFS does not attempt to separate files at
    the block level. Multiple files can live in the same disk block. This is cool
    if you have many small files, but the extra overhead for large files such as
    those used by a database, is a bit much.
    It should be at least as fast as other filesystems for large files. I suspect
    that it would be faster in fact. The only catch is that the performance of
    reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems)
    That is pretty standard for most modern file systems. They need that
    free space to optimize.

    You can read about all this stuff at http://www.namesys.com/
    I really think a simple low down dirty file system is just what the doctor
    ordered for postgres.
    Traditional BSD FFS or Solaris UFS is probably the best bet for postgres.
    That is my opinion. BSD FFS seems to be general enough to give good
    performance for a wide range of application needs. It is not as fast
    as XFS for streaming large files (media), it doesn't optimize small
    files below 1k in size (fragments), and it does require fsck on reboot.

    However, looking at all those for PostgreSQL, the costs of the new Linux
    file systems seems pretty high, especially considering our need for
    fsync().

    What I am really concerned about is when xfs/reiser become the default
    file systems for Linux, and people complain about PostgreSQL
    performance. And if we require special file systems, we lose some of
    our ability to easily grow. Because of ext2's problems with crash
    recovery, who is going to want to put other data on that file system
    when they have xfs/reiser available? And reboots are going to have to
    fsck that ext2 file system.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Mark L. Woodward at May 4, 2001 at 5:56 pm
    Michael Samuel wrote:
    Remember, general purpose file systems must do for files what Postgres is
    already doing for records. You will always have extra work. I am seriously
    thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
    or if there is just something fundamentally stupid about FAT32 that will make
    it worse?
    Well, for a starters, file permissions...

    Ext2 would kick arse over FAT32 for performance.
    OK, I'll bite.

    In a database environment where file creation is not such an issue, why would ext2
    be faster?

    The FAT file system has, AFAIK, very little overhead for file writes. It
    simply updates the two FAT tables when a file is extended, and writes the
    data. Depending on cluster size, there is probably even less happening
    there.

    I don't think that anyone is saying that FAT is the answer in a production
    environment, but maybe we can do a comparison of various file systems and see if any
    performance issues show up.

    I mentioned FAT only because I was thinking about how postgres would perform on a
    very simple file system, one which bypasses most of the normal stuff a "good"
    general purpose file system would do. While I was thinking this, it occurred to me
    that FAT was about the cheesiest simple file system one could find, short of a ram
    disk, and maybe we could use it to test the assumptions about performance impact of
    the file system on postgres.

    Just a thought. If you know of some reason why ext2 would perform better in
    the postgres environment, I would love to hear it; I'm very curious.
  • Gavin Sherry at May 3, 2001 at 11:33 pm

    On Thu, 3 May 2001, mlw wrote:

    This behavior raises the question about file system usage in Postgres. Many
    databases, such as Oracle, create table space files and operate directly on the
    raw blocks, bypassing the file system altogether.

    On one hand, Postgres is easy to use and maintain because it cooperates with
    the native file system, on the other hand it incurs the overhead of whatever
    silliness the file system wants to do.
    It is not *that* hard to write a 'postgresfs' but you have to look at
    the problems it creates. One of the biggest problems facing sys admins of
    large sites is that the Oracle/DB2/etc DBA, having created the
    purpose-built database filesystem, has not allowed enough room for
    growth. Like I said, a basic file system is not difficult, but volume
    management tools and the maintenance of the whole thing is. Currently,
    postgres administrators are not faced with such a problem.

    There is, of course, the argument that a pgfs need not be enforced. The
    problem is that many people would probably use it so as to have a
    'superior' installation. This then entails the problems above, creating
    more work for core developers.

    Gavin
  • Christopher Kings-Lynne at May 4, 2001 at 1:12 am
    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...

    Chris

    -----Original Message-----
    From: pgsql-hackers-owner@postgresql.org
    On Behalf Of mlw
    Sent: Thursday, 3 May 2001 8:09 PM
    To: Bruce Momjian; Hackers List
    Subject: [HACKERS] Re: New Linux xfs/reiser file systems


    Bruce Momjian wrote:
    I was talking to a Linux user yesterday, and he said that performance
    using the xfs file system is pretty bad. He believes it has to do with
    the fact that fsync() on log-based file systems requires more writes.

    With a standard BSD/ext2 file system, WAL writes can stay on the same
    cylinder to perform fsync. Is that true of log-based file systems?

    I know xfs and reiser are both log based. Do we need to be concerned
    about PostgreSQL performance on these file systems? I use BSD FFS with
    soft updates here, so it doesn't affect me.
    I did see poor performance on reiserfs; I have not yet ventured into using
    xfs.

    It occurs to me that journaling file systems will almost always be slower
    for an application such as postgres. The journaling file system is trying
    to maintain data integrity for an application which is already trying to
    maintain data integrity. There will always be extra work involved.

    This behavior raises the question about file system usage in Postgres.
    Many databases, such as Oracle, create table space files and operate
    directly on the raw blocks, bypassing the file system altogether.

    On one hand, Postgres is easy to use and maintain because it cooperates
    with the native file system; on the other hand it incurs the overhead of
    whatever silliness the file system wants to do.

    I would bet it is a huge amount of work to use a "table space" system and
    no one wants that. lol. However, it should be noted that a bit more
    control over database layout would make some great performance
    improvements.

    The ability to put indexes on a separate volume from data.
    The ability to put different tables on different volumes.
    And so on.

    In the short term, I think poor performance on a journaling file system is
    to be expected, unless there is an IOCTL to tell the FS to leave the files
    alone (and postgres calls it). A Linux HOWTO that explains which file
    systems have performance issues, and why, should handle the problem.

    Perhaps we can convince the Linux community to create a "dbfs" which is a
    stripped down simple no nonsense file system designed for applications like
    databases?

    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com

  • John at May 4, 2001 at 1:39 am
    There might be a problem, but if no one mentions it to the maintainers of
    those fs's, it will not get fixed...

    Regards
    John
  • Bruce Momjian at May 4, 2001 at 1:43 am

    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Christopher Kings-Lynne at May 4, 2001 at 1:54 am
    Well, arguably if you're setting up a database server then a reasonable DBA
    should think about such things...

    (My 2c)

    Chris

    -----Original Message-----
    From: Bruce Momjian
    Sent: Friday, 4 May 2001 9:42 AM
    To: Christopher Kings-Lynne
    Cc: mlw; Hackers List
    Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems

    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Bruce Momjian at May 4, 2001 at 1:56 am

    Well, arguably if you're setting up a database server then a reasonable DBA
    should think about such things...
    Yes, but people have trouble installing PostgreSQL. I can't imagine
    walking them through a newfs.

    (My 2c)

    Chris

    -----Original Message-----
    From: Bruce Momjian
    Sent: Friday, 4 May 2001 9:42 AM
    To: Christopher Kings-Lynne
    Cc: mlw; Hackers List
    Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems

    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Roland Roberts at May 4, 2001 at 2:24 pm

    "Bruce" == Bruce Momjian writes:
    Well, arguably if you're setting up a database server then a
    reasonable DBA should think about such things...
    Bruce> Yes, but people have trouble installing PostgreSQL. I
    Bruce> can't imagine walking them through a newfs.

    In most of linux-land, the DBA is probably also the sysadmin. In
    bigger shops, and those which currently run, say Oracle or Sybase, the
    two roles are separate. When they are separate, you don't have to
    walk the DBA through it; he just walks over to the sysadmin and says
    "I need X megabytes of space on a new Y filesystem."

    roland
    --
    PGP Key ID: 66 BC 3B CD
    Roland B. Roberts, PhD RL Enterprises
    roland@rlenter.com 76-15 113th Street, Apt 3B
    rbroberts@acm.org Forest Hills, NY 11375
  • Mark L. Woodward at May 4, 2001 at 3:16 am

    Bruce Momjian wrote:
    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.

    I have looked at Reiser, and I don't think it is a file system suited for very
    large files, or applications such as postgres. The Linux crowd should lobby
    against any such trend. It is ok for many moderately small files. ReiserFS
    would be great for a cddb server, but poor for a database box.

    XFS is a really big file system project; I'd bet that there are file
    properties or management tools to tell it to leave directories and files
    alone. They should have addressed that years ago.

    One last mention..

    Having better control over WHERE various files in a database are located can
    make it easier to deal with these things.

    Just a thought. ;-)

    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com
  • Trond Eivind Glomsrød at May 4, 2001 at 1:33 pm

    mlw writes:

    I have looked at Reiser, and I don't think it is a file system suited for very
    large files, or applications such as postgres.
    What's the problem with big files? ReiserFS v2 doesn't seem to support
    them, while v3 seems just fine (in terms of the on-disk format).

    That said, I'm certainly looking forward to xfs - I believe it will be
    the most widely used of the current batch of journaling file systems
    (reiserfs, jfs, XFS and ext3, the latter mainly focusing on an easy
    migration path for existing systems).

    --
    Trond Eivind Glomsrød
    Red Hat, Inc.
  • Thomas Swan at May 4, 2001 at 5:24 pm

    mlw wrote:
    Bruce Momjian wrote:
    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.

    I have looked at Reiser, and I don't think it is a file system suited for very
    large files, or applications such as postgres. The Linux crowd should lobby
    against any such trend. It is ok for many moderately small files. ReiserFS
    would be great for a cddb server, but poor for a database box.

    XFS is a real big file system project, I'd bet that there are file properties
    or management tools to tell it to leave directories and files alone. They
    should have addressed that years ago.

    One last mention..

    Having better control over WHERE various files in a database are located can
    make it easier to deal with these things.
    I think it's worth noting that Oracle has been petitioning the kernel
    developers for better raw device support: in other words, the ability to
    write directly to the hard disk, bypassing the filesystem altogether.

    If the db is going to assume the responsibility of disk write
    verification, it seems reasonable to assume you might want to investigate
    the raw disk i/o options.

    Telling your installers that a major performance gain is attainable by
    doing so might be a start in the opposite direction. I've monitored a
    lot of discussions and from what I can gather, postgresql does its own
    set of journaling operations. I don't think it's necessary for writes to
    be double-journaled anyway.

    Again, just my two cents worth...
  • Lincoln Yeoh at May 5, 2001 at 5:01 pm

    At 02:09 AM 5/4/01 -0500, Thomas Swan wrote:
    I think it's worth noting that Oracle has been petitioning the
    kernel developers for better raw device support: in other words,
    the ability to write directly to the hard disk and bypassing the
    filesystem all together.
    But there could be other reasons why Oracle would want to do raw stuff.

    1) They have more things to sell - management modules/software. More
    training courses. Certified blahblahblah. More features in brochure.
    2) It just helps make things more proprietary. Think lock in.

    All that for maybe 10% performance increase?

    I think it's more advantageous for Postgresql to keep the filesystem layer
    of abstraction, than to do away with it, and later reinvent certain parts
    of it along with new bugs.

    What would be useful is if one can specify where the tables, indexes, WAL
    and other files go. That feature would probably help improve performance
    far more.

    For example: you could then stick the WAL on a battery backed up RAM disk.
    How much total space does a WAL log need?

    A battery backed RAM disk might even be cheaper than Brand X RDBMS
    Proprietary Feature #5.

    Cheerio,
    Link.
  • Mark L. Woodward at May 5, 2001 at 5:12 pm

    Lincoln Yeoh wrote:
    At 02:09 AM 5/4/01 -0500, Thomas Swan wrote:
    I think it's worth noting that Oracle has been petitioning the
    kernel developers for better raw device support: in other words,
    the ability to write directly to the hard disk and bypassing the
    filesystem all together.
    But there could be other reasons why Oracle would want to do raw stuff.

    1) They have more things to sell - management modules/software. More
    training courses. Certified blahblahblah. More features in brochure.
    2) It just helps make things more proprietary. Think lock in.

    All that for maybe 10% performance increase?

    I think it's more advantageous for Postgresql to keep the filesystem layer
    of abstraction, than to do away with it, and later reinvent certain parts
    of it along with new bugs.
    I just did a test of putting pg_xlog on a FAT file system, and my first rough
    tests (pgbench) show an approximate 20% performance increase over ext2 with
    fsync enabled.


    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com
  • Lincoln Yeoh at May 6, 2001 at 11:18 am

    At 01:16 PM 5/5/01 -0400, mlw wrote:
    Lincoln Yeoh wrote:
    All that for maybe 10% performance increase?

    I think it's more advantageous for Postgresql to keep the filesystem layer
    of abstraction, than to do away with it, and later reinvent certain parts
    of it along with new bugs.
    I just did a test of putting pg_xlog on a FAT file system, and my first rough
    tests (pgbench) show an approximate 20% performance increase over ext2 with
    fsync enabled.
    OK. I slouch corrected :). It's more than 10%.

    However in the same message I did also say:
    What would be useful is if one can specify where the tables, indexes, WAL
    and other files go. That feature would probably help improve performance
    far more.

    For example: you could then stick the WAL on a battery backed up RAM disk.
    How much total space does a WAL log need?

    A battery backed RAM disk might even be cheaper than Brand X RDBMS
    Proprietary Feature #5.
    And your experiments do help show that it is useful to be able to specify
    where things go, that putting just the WAL somewhere else makes things 20%
    faster. So you don't have to put everything on a pgfs. Just the WAL on some
    other FS (even FAT32, ick ;) ).

    ---
    OK we can do that with symlinks, but is there a PGSQL Recommended or
    Standard way to do it, so as to reduce administrative errors, and at least
    help improve consistency with multiadmin pgsql installations?

    The WAL and DBs are in separate directories, so this makes things easy. But
    the object names are now all numbers so that makes things a bit harder -
    and what to do with temp tables?

    Would it be good to have tables in one directory and indexes in another?
    Or do most people optimize on a specific table/index basis? Where does
    PGSQL do the on-disk sorts?

    How about naming the DB objects <object ID>.<object name>?
    e.g

    121575.testtable
    125575.testtableindex

    (or the other way round - name.OID - harder for DB, easier for admin?)

    They'll still be unique, but now they're admin readable. Slower? e.g. at
    that code point, pgsql no longer knows the object's name, and wants to
    refer to everything by just numbers?

    I apologize if there was already a long discussion on this. I seem to
    recall Bruce saying that the developers agonized over this.

    Cheerio,
    Link.
  • Hannu Krosing at May 6, 2001 at 12:06 pm

    Lincoln Yeoh wrote:
    At 01:16 PM 5/5/01 -0400, mlw wrote:
    Lincoln Yeoh wrote:
    All that for maybe 10% performance increase?

    I think it's more advantageous for Postgresql to keep the filesystem layer
    of abstraction, than to do away with it, and later reinvent certain parts
    of it along with new bugs.
    I just did a test of putting pg_xlog on a FAT file system, and my first rough
    tests (pgbench) show an approximate 20% performance increase over ext2 with
    fsync enabled.
    OK. I slouch corrected :). It's more than 10%.

    However in the same message I did also say:
    What would be useful is if one can specify where the tables, indexes, WAL
    and other files go. That feature would probably help improve performance
    far more.

    For example: you could then stick the WAL on a battery backed up RAM disk.
    How much total space does a WAL log need?

    A battery backed RAM disk might even be cheaper than Brand X RDBMS
    Proprietary Feature #5.
    And your experiments do help show that it is useful to be able to specify
    where things go, that putting just the WAL somewhere else makes things 20%
    faster. So you don't have to put everything on a pgfs. Just the WAL on some
    other FS (even FAT32, ick ;) ).
    So you propose pgwalfs ? ;)

    It may be much easier to implement than a full fs.

    How hard would it be to let wal reside on a (raw) device ?

    If we already pre-allocate a required number of fixed-size files, would it
    be too hard to replace them with plain (raw) devices and test for possible
    performance gains?
    How about naming the DB objects <object ID>.<object name>?
    e.g

    121575.testtable
    125575.testtableindex
    This sure seems to be an elegant solution for the problem that seems to be
    impossible to solve with symlinks and such. Even the IMHO hardest-to-solve
    problem - RENAME - can probably be done in a transaction-safe manner by
    doing a link(oid.<newname>) in the beginning and a selective
    unlink(oid.<newname/oldname>) at commit time.
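
    A rough sketch of that scheme (helper names are hypothetical; as the
    follow-ups show, it is not sufficient for every rename pattern):

        #include <unistd.h>

        /* At RENAME time: hard-link the new <oid>.<newname> alongside the old
         * name, so both names refer to the same relation file until the
         * transaction's outcome is known. */
        int relfile_rename_begin(const char *oldname, const char *newname)
        {
            return link(oldname, newname);
        }

        /* At COMMIT: keep the new name, drop the old one. */
        int relfile_rename_commit(const char *oldname)
        {
            return unlink(oldname);
        }

        /* At ABORT: keep the old name, drop the new one. */
        int relfile_rename_abort(const char *newname)
        {
            return unlink(newname);
        }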

    --------------------
    Hannu
  • Mark L. Woodward at May 6, 2001 at 12:48 pm

    Hannu Krosing wrote:

    Lincoln Yeoh wrote:
    At 01:16 PM 5/5/01 -0400, mlw wrote:
    Lincoln Yeoh wrote:
    All that for maybe 10% performance increase?

    I think it's more advantageous for Postgresql to keep the filesystem layer
    of abstraction, than to do away with it, and later reinvent certain parts
    of it along with new bugs.
    I just did a test of putting pg_xlog on a FAT file system, and my first rough
    tests (pgbench) show an approximate 20% performance increase over ext2 with
    fsync enabled.
    OK. I slouch corrected :). It's more than 10%.

    However in the same message I did also say:
    What would be useful is if one can specify where the tables, indexes, WAL
    and other files go. That feature would probably help improve performance
    far more.

    For example: you could then stick the WAL on a battery backed up RAM disk.
    How much total space does a WAL log need?

    A battery backed RAM disk might even be cheaper than Brand X RDBMS
    Proprietary Feature #5.
    And your experiments do help show that it is useful to be able to specify
    where things go, that putting just the WAL somewhere else makes things 20%
    faster. So you don't have to put everything on a pgfs. Just the WAL on some
    other FS (even FAT32, ick ;) ).
    So you propose pgwalfs ? ;)
    I don't know about a "pgwalfs" - too much work. I have had some time to
    grapple with my feelings about FAT, and you know what? I don't hate the
    idea. I would, of course, like to look through the driver code and see if
    there are any technical reasons why it should be excluded.

    FAT is almost perfect for WAL, and if I can figure out how to get the "base"
    directory to get the same performance, I'd think about putting it there as
    well.

    The ReiserFS issues touched on some vague suspicions I had about fsync. Maybe
    I'm over reacting, but there are reasons why the oracles manage their own table
    spaces.

    Back to FAT. FAT is probably the simplest file system I can think of. As
    long as it writes to disk when it gets synced, and doesn't lose things, it's
    perfect. Postgres maintains much of the coherency itself, there is no real
    problem with permissions because it will be owned by the postgres super
    user, etc. I would never suggest FAT as a general purpose file system, but,
    geez, as a special purpose single-user (postgres) one it seems an ideal
    answer to what will be an increasingly hard problem with advanced file
    systems.

    Aside from a general, and well deserved, disdain for FAT, what are the
    technical "cons" of such a proposal? If we can get the Linux kernel (and
    other unices) to accept IOCTLs to direct space allocation, and/or write up
    a white paper on how to use this for postgres, why wouldn't it be a
    reasonable strategy?



    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com
  • Lincoln Yeoh at May 6, 2001 at 3:57 pm

    Lincoln Yeoh wrote:
    Lincoln Yeoh wrote:
    For example: you could then stick the WAL on a battery backed up RAM disk.
    How much total space does a WAL log need?

    A battery backed RAM disk might even be cheaper than Brand X RDBMS
    Proprietary Feature #5.
    And your experiments do help show that it is useful to be able to specify
    where things go, that putting just the WAL somewhere else makes things 20%
    faster. So you don't have to put everything on a pgfs. Just the WAL on some
    other FS (even FAT32, ick ;) ).
    At 02:04 PM 5/6/01 +0200, Hannu Krosing wrote:
    So you propose pgwalfs ? ;)
    Nah. I'm proposing the opposite in fact.

    I'm saying so far there appears to be no real need to come up with a
    special filesystem. Stick to using existing/future filesystems. Just make
    it easy and safe enough for DBA's to put the objects on whatever filesystem
    they choose. So long as the O/S kernel/driver people support the hardware
    or filesystem, postgresql will take advantage of it with little if any
    extra work.

    In fact as mlw's experiments show, you can put the WAL on FAT (FAT16?) for
    a 20% performance increase. How much better would a raw device be? Would it
    really be worth all that hassle? For instance if you need to resize the FAT
    partition, you could probably use fips, Partition Magic or some other cost
    effective solution - no need for pgsql developers or anybody to reinvent
    anything.

    My proposed but untested idea is that you could get a significant
    performance increase by putting the WAL on popular filesystems running on
    battery backed RAM drives (or other special hardware). 128MB RAM should be
    enough for small setups?

    Don't know how much these things cost, but I believe that when you need the
    speed, they'll be more worthwhile than a special proprietary filesystem.

    Ok, just found:
    http://www.expressdata.com.au/Products/ProductsList.asp?SUPPLIER_NAME=PLATYPUS+TECHNOLOGY&SUBCATEGORY_NAME=QikDrive2#PRODUCTTITLE

    AUD$1,624.70 = USD843.06. Not cheap but not way out of reach. Haven't found
    other competing products yet. Must be somewhere.

    Cheerio,
    Link.
  • Tom Lane at May 6, 2001 at 4:04 pm

    Hannu Krosing writes:
    Even the IMHO hardest to solve problem
    - RENAME - can
    probably be done in a transaction-safe manner by doing a
    link(oid.<newname>) in the
    beginning and selective unlink(oid.<newname/oldname>) at commit time.
    Nope. Consider

    begin;
    rename a to b;
    rename b to a;
    end;

    And don't tell me you'll solve this by ignoring failures from link().
    That's a recipe for losing your data...

    I would ask people who think they have a solution to please go back and
    reread the very long discussions we have had on this point in the past.
    Nobody particularly likes numeric filenames, but there really isn't any
    other workable answer.

    regards, tom lane
  • Lincoln Yeoh at May 6, 2001 at 5:50 pm

    At 12:03 PM 5/6/01 -0400, Tom Lane wrote:
    Hannu Krosing <hannu@tm.ee> writes:
    Even the IMHO hardest to solve problem
    - RENAME - can
    probably be done in a transaction-safe manner by doing a
    link(oid.<newname>) in the
    beginning and selective unlink(oid.<newname/oldname>) at commit time.
    Nope. Consider

    begin;
    rename a to b;
    rename b to a;
    end;

    And don't tell me you'll solve this by ignoring failures from link().
    That's a recipe for losing your data...

    I would ask people who think they have a solution to please go back and
    reread the very long discussions we have had on this point in the past.
    Nobody particularly likes numeric filenames, but there really isn't any
    other workable answer.
    OK. Found one of the discussions at:
    http://postgresql.readysetnet.com/mhonarc/pgsql-hackers/2000-03/threads.html#00088

    Conclusion: calling stuff oid.relname doesn't really work. Sorry to have
    brought it up again.

    Another idea that's probably more messy than it's worth:

    Main object still called <oid> with a symlink called <oid.originalrelname>.
    DB really just uses <oid>.

    A rename adds a symlink called <oid.newrelname> and doesn't remove old
    symlinks (the symlinks are more for show!).

    Committed drop table does what 7.1 does with the main oid entry.

    Vacuum cleans up the symlinks leaving just a single valid one or zaps all
    if the table has been dropped.

    For windows create empty files named oid.relname instead of symlinks.
    Windows will definitely like .verylongrelname extensions ;).

    Kinda messy and kludgy. Throw in the performance reduction and Ick!

    I probably have to think harder :), maybe there's just no good way :(.

    Ah well,
    Link.
  • Hannu Krosing at May 7, 2001 at 8:14 am

    Tom Lane wrote:

    Hannu Krosing <hannu@tm.ee> writes:
    Even the IMHO hardest to solve problem
    - RENAME - can
    probably be done in a transaction-safe manner by doing a
    link(oid.<newname>) in the
    beginning and selective unlink(oid.<newname/oldname>) at commit time.
    Nope. Consider

    begin;
    rename a to b;
    rename b to a;
    end;

    And don't tell me you'll solve this by ignoring failures from link().
    That's a recipe for losing your data...
    I guess link() failures can be safely ignored _as long as_ we check that
    we have the right link after doing it. I can't see how it will lose
    data.
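
    As an illustration of "ignore the failure but verify the result"
    (hypothetical helper name): after the link() attempt, confirm that both
    paths now reference the same inode before proceeding.

        #include <sys/stat.h>
        #include <unistd.h>

        /* Attempt the hard link, ignore its return value, then verify that
         * oldpath and newpath really reference the same file. */
        int link_and_verify(const char *oldpath, const char *newpath)
        {
            struct stat a, b;

            (void) link(oldpath, newpath);     /* EEXIST etc. deliberately ignored */
            if (stat(oldpath, &a) < 0 || stat(newpath, &b) < 0)
                return -1;
            return (a.st_dev == b.st_dev && a.st_ino == b.st_ino) ? 0 : -1;
        }
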
    I would ask people who think they have a solution to please go back and
    reread the very long discussions we have had on this point in the past.
    I think I have now (no way to guarantee I have read _everything_ about it,
    but I did hit about ~10 messages on the oid_relname naming scheme).

    the most serious objection seemed to be that we need to remember the
    postgres table name, while it would be much easier to use only oids.

    I guess we could hit some system limits here (running out of directory
    entries or reaching the maximum number of links to a file), but at least
    on linux I was able to make >10000 links to one file with no problems.

    now that I think of it I have one concern - it would require extra work
    to use table names like "/etc/passwd" or others that use characters that
    are reserved in filenames but are ok to use in 7.1.

    hannu=# create table "/etc/passwd"(
    hannu(# login text,
    hannu(# uid int,
    hannu(# gid int
    hannu(# );
    CREATE
    hannu=# \dt
    List of relations
    Name | Type | Owner
    -------------+-------+-------
    /etc/passwd | table | hannu

    So if people start using names like these it will not be easy to go back
    ;)
    Nobody particularly likes numeric filenames, but there really isn't any
    other workable answer.
    At least we could put links on system relations, so it would be
    easier to find them.

    I guess one is not supposed to rename/drop system tables ?

    ---------------------
    Hannu
  • Tom Lane at May 6, 2001 at 4:05 pm

    Lincoln Yeoh writes:
    OK we can do that with symlinks, but is there a PGSQL Recommended or
    Standard way to do it, so as to reduce administrative errors, and at least
    help improve consistency with multiadmin pgsql installations?
    Not yet. There should be support for this. See
    doc/TODO.detail/tablespaces.

    regards, tom lane
  • Kaare Rasmussen at May 4, 2001 at 4:00 pm

    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    What is a 'default file system'? I know that until now, everybody has been
    using ext2. But that's only because there hasn't been anything comparable.
    Now we see ReiserFS, and my SuSE installation offers the choice. In the
    future, I believe people will be able to choose from ext2, ReiserFS, XFS,
    ext3, and maybe more.
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.
    No, I don't think it's a big hurdle. If you just want to play with
    PostgreSQL, you won't care. If you're serious, you'll repartition.

    --
    Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
    Kaki Data tshirts, merchandize Fax: 3816 2501
    Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk
    2000 Frederiksberg Lørdag 11.00-17.00 Email: kar@webline.dk
  • Bruce Momjian at May 4, 2001 at 4:51 pm
    Sure, we can do that now. What do we do when these are the default file
    systems for Linux? We can tell them to create other types of file
    What is a 'default file system'? I know that until now, everybody has been
    using ext2. But that's only because there hasn't been anything comparable.
    Now we see ReiserFS, and my SuSE installation offers the choice. In the
    future, I believe people will be able to choose from ext2, ReiserFS, XFS,
    ext3, and maybe more.
    But some day the default will be a log-based file system, and people
    will have to hunt around to create a non-log based one.
    systems, but that is a pretty big hurdle. I wonder if it would be
    easier to get reiser/xfs to make some modifications.
    No, I don't think it's a big hurdle. If you just want to play with
    PostgreSQL, you won't care. If you're serious, you'll repartition.
    Yes, but we could get a reputation for slowness on these log-based file
    systems.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Carl garland at May 4, 2001 at 4:58 am

    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now.
    I still think this is not necessarily the right approach either. One
    major purpose of using a journaling fs is for fast boot up time after
    crash. If you have a 100 GB database you may wish to have the data
    on XFS. I do think that the WAL log should be on a separate disk and
    on a non-journaling fs for performance.

    Best Regards,
    Carl Garland

  • Mark L. Woodward at May 4, 2001 at 10:39 am
    Here is a radical idea...

    What is it that is causing Postgres trouble? It is the file system's attempts
    to maintain some integrity. So I proposed a simple "dbfs" sort of thing which
    was the most basic sort of file system possible.

    I'm not sure, but I think we can test this hypothesis on the FAT32 file system
    on Linux. As far as I know, FAT32 (FAT in general) is a very simple file system
    and does very little during operation, except read and write the files and
    manage what's been allocated. Plus, the allocation table is very simple in
    comparison to all the other file systems.

    Would running pgbench on a system using ext2, Reiser, then FAT32 be
    sufficient to get a feeling for the type of performance Postgres would get,
    or am I just off the wall?

    If this idea has some merit, what would be the best way to test it? Move the
    pg_xlog directory first, then try base? What's the best methodology to try?


    carl garland wrote:
    Just put a note in the installation docs that the place where the database
    is initialised to should be on a non-Reiser, non-XFS mount...
    Sure, we can do that now.
    I still think this is not necessarily the right approach either. One
    major purpose of using a journaling fs is for fast boot up time after
    crash. If you have a 100 GB database you may wish to have the data
    on XFS. I do think that the WAL log should be on a separate disk and
    on a non-journaling fs for performance.

    Best Regards,
    Carl Garland

    --
    I'm not offering myself as an example; every life evolves by its own laws.
    ------------------------
    http://www.mohawksoft.com
  • Ken Hirsch at May 4, 2001 at 12:23 pm
    Before we get too involved in speculating, shouldn't we actually measure the
    performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync,
    we can test whether that's the problem. I don't think that logging file
    systems must intrinsically give bad performance on fsync since they only log
    metadata changes.
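
    One way to take the speculation out of it is a tiny standalone test run on
    each filesystem, timing the write()+fsync() pattern a per-transaction
    commit produces. A throwaway sketch, not PostgreSQL code.

        /* Time repeated write()+fsync() pairs on the filesystem under test. */
        #include <stdio.h>
        #include <string.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/time.h>

        int
        main(void)
        {
            char block[8192];
            struct timeval t0, t1;
            int fd, i, n = 1000;

            memset(block, 0, sizeof(block));
            fd = open("fsync_test.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
            if (fd < 0)
                return 1;

            gettimeofday(&t0, NULL);
            for (i = 0; i < n; i++)
            {
                /* Overwrite the same block each time, like a preallocated WAL
                 * segment; drop the lseek() to measure the appending case,
                 * where every fsync() also pushes out a metadata (size) change. */
                if (lseek(fd, 0, SEEK_SET) < 0 ||
                    write(fd, block, sizeof(block)) != sizeof(block) ||
                    fsync(fd) < 0)
                    return 1;
            }
            gettimeofday(&t1, NULL);

            printf("%d write+fsync pairs in %.2f seconds\n", n,
                   (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
            close(fd);
            return 0;
        }
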

    I don't have a machine with XFS installed and it will be at least a week
    before I could get around to a build. Any volunteers?

    Ken Hirsch
  • Bruce Momjian at May 4, 2001 at 4:47 pm
    Before we get too involved in speculating, shouldn't we actually measure the
    performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync,
    we can test whether that's the problem. I don't think that logging file
    systems must intrinsically give bad performance on fsync since they only log
    metadata changes.

    I don't have a machine with XFS installed and it will be at least a week
    before I could get around to a build. Any volunteers?
    There have been multiple reports of poor PostgreSQL performance on
    Reiser and xfs. I don't have numbers, though. Frankly, I think we need
    xfs and reiser experts involved to figure out our options here.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Trond Eivind Glomsrød at May 4, 2001 at 6:22 pm

    "Ken Hirsch" <kenhirsch@myself.com> writes:

    I don't have a machine with XFS installed and it will be at least a week
    before I could get around to a build. Any volunteers?
    I think I could do that... any useful benchmarks to run?

    --
    Trond Eivind Glomsrød
    Red Hat, Inc.
  • Trond Eivind Glomsrød at May 7, 2001 at 10:07 pm

    Trond Eivind Glomsrød writes:

    "Ken Hirsch" <kenhirsch@myself.com> writes:
    I don't have a machine with XFS installed and it will be at least a week
    before I could get around to a build. Any volunteers?
    I think I could do that... any useful benchmarks to run?
    For lack of bigger benchmarks, I tried PostgreSQL 7.1 on a Red Hat
    Linux 7.1 system with the SGI XFS modifications. The differences were
    very small.
  • Bruce Momjian at May 7, 2001 at 11:07 pm

    teg@redhat.com (Trond Eivind Glomsrød) writes:
    "Ken Hirsch" <kenhirsch@myself.com> writes:
    I don't have a machine with XFS installed and it will be at least a week
    before I could get around to a build. Any volunteers?
    I think I could do that... any useful benchmarks to run?
    For lack of bigger benchmarks, I tried PostgreSQL 7.1 on a Red Hat
    Linux 7.1 system with the SGI XFS modifications. The differences were
    very small.
    Thanks. That is very helpful. Seems XFS is fine. According to Joe
    Conway, reiser has some problems.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Trond Eivind Glomsrød at May 9, 2001 at 3:16 pm

    Trond Eivind Glomsrød writes:

    teg@redhat.com (Trond Eivind Glomsrød) writes:
    "Ken Hirsch" <kenhirsch@myself.com> writes:
    I don't have a machine with XFS installed and it will be at least a week
    before I could get around to a build. Any volunteers?
    I think I could do that... any useful benchmarks to run?
    For lack of bigger benchmarks, I tried PostgreSQL 7.1 on a Red Hat
    Linux 7.1 system with the SGI XFS modifications. The differences were
    very small.
    And here is the one for ReiserFS - same kernel, but recompiled to turn
    off debugging
  • Bruce Momjian at May 9, 2001 at 4:14 pm

    When compared to the earlier ones (including XFS), you'll note that ReiserFS
    performance is rather poor in some of the tests - it takes 37 vs. 13
    seconds for 8192 inserts, when the inserts are different transactions.
    That is all the fsync delay, probably, and it should be using fdatasync()
    on that kernel.
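
    The point about fdatasync() is that it only has to force the file's data
    blocks to disk, while fsync() also forces the inode update that a
    journaling filesystem must log. A sketch of the kind of switch being
    suggested; the HAVE_FDATASYNC guard and the helper name are illustrative,
    not the actual WAL code.

        #include <unistd.h>

        static int
        flush_wal_file(int fd)
        {
        #ifdef HAVE_FDATASYNC
            return fdatasync(fd);   /* skip the extra metadata write per commit */
        #else
            return fsync(fd);
        #endif
        }
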

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Trond Eivind Glomsrød at May 9, 2001 at 4:15 pm

    Bruce Momjian writes:


    When compared to the earlier ones (including XFS), you'll note that ReiserFS
    performance is rather poor in some of the tests - it takes 37 vs. 13
    seconds for 8192 inserts, when the inserts are different transactions.
    That is all the fsync delay, probably, and it should be using fdatasync()
    on that kernel.
    And it does seem to work that way with XFS...

    --
    Trond Eivind Glomsrød
    Red Hat, Inc.
  • Martín Marqués at May 9, 2001 at 5:20 pm

    Quoting Trond Eivind Glomsrød <teg@redhat.com>:

    Bruce Momjian <pgman@candle.pha.pa.us> writes:
    When compared to the earlier ones (including XFS), you'll note that ReiserFS
    performance is rather poor in some of the tests - it takes 37 vs. 13
    seconds for 8192 inserts, when the inserts are different transactions.
    That is all the fsync delay, probably, and it should be using fdatasync()
    on that kernel.
    And it does seem to work that way with XFS...
    I'm concerned about this because we are going to switch our first server to a
    journaling FS (on Linux).
    Searching and asking, I found out that for our short-term work we need ReiserFS
    (it's for a proxy server).
    But the interesting thing was that for large (very large) files, everybody
    recommends XFS.
    The drawback of XFS is that it's very, very sloooow when deleting files.

    Saludos... :-)

    --
    The best operating system is the one that feeds you.
    Watch your diet.
    -----------------------------------------------------------------
    Martin Marques | mmarques@unl.edu.ar
    Programmer, Administrator | Centro de Telematica
    Universidad Nacional
    del Litoral
    -----------------------------------------------------------------
