Hi.

I'm using the stock Ext3 as the most tested option, but I wonder: has anyone
ever tried, or is anyone using in production these days, another file system,
like JFS, XFS or maybe even Ext4?

I'm exploring ways to boost the performance of DataNodes, and this seems like
one possible avenue.

Thanks for any info!


  • Jason Venner at Oct 8, 2009 at 1:59 pm
    I have used XFS pretty extensively; it seemed to be somewhat faster than
    ext3.

    The only trouble we had was on some machines running PAE 32-bit kernels,
    where the filesystems would lock up. That is an obscure use case, however.
    Running JBOD with your dfs.data.dir listing a directory on each device
    speeds things up, as does keeping other users off of the disks/machine.
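
    For illustration, a JBOD-style dfs.data.dir in hdfs-site.xml might look
    like this (the mount points are illustrative), with one directory per
    physical disk so the datanode spreads block I/O across all the spindles:

        <property>
          <name>dfs.data.dir</name>
          <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
        </property>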


    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals
  • Stas Oskin at Oct 8, 2009 at 5:13 pm
    Hi.

    Thanks for the info. The question is whether XFS's performance justifies
    switching from the more common Ext3.

    JBOD is a great approach indeed.

    Regards.

  • Tom Wheeler at Oct 8, 2009 at 5:26 pm
    As an aside, there's a short article comparing the two in the latest
    edition of Linux Journal. It was hardly scientific, but the main
    points were:

    - XFS is faster than ext3, especially for large files
    - XFS is currently unsupported on Red Hat Enterprise, but apparently
    will be soon.
    --
    Tom Wheeler
    http://www.tomwheeler.com/
  • Jason Venner at Oct 8, 2009 at 7:14 pm
    Busy datanodes become bound by the metadata lookup times for the directory
    and inode entries required to open a block.
    Anything that optimizes that will help substantially.

    We are thinking of playing with btrfs, and using a small SSD for our file
    system metadata, with the spinning disks for the block storage.
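
    In the meantime, one cheap knob that attacks the same problem is the VFS
    cache pressure sysctl, which makes the kernel hold on to cached dentries
    and inodes longer (the value below is only a starting point to measure
    against, not a recommendation):

        sysctl -w vm.vfs_cache_pressure=50                       # default is 100; lower values
                                                                 # keep dentry/inode caches longer
        echo 'vm.vfs_cache_pressure = 50' >> /etc/sysctl.conf    # persist across reboots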

  • Stas Oskin at Oct 9, 2009 at 12:41 am
    Hi Jason.

    Btrfs is cool; I read that it performs about 10% better than the next
    closest FS.

    Can you post the results of any of your findings here?

    Regards.

  • Stas Oskin at Oct 8, 2009 at 7:26 pm
    Hi.

    Thanks for the info.

    What about JFS, any idea how well it compares to XFS?
    From what I read, JFS is considered more stable than XFS but lower
    performing, so I wonder if that is true.

    Also, Ext4 is around the corner and was recently accepted into the kernel,
    so I wonder if anyone has experience with that one.

    Regards.

  • Paul at Oct 8, 2009 at 7:46 pm
    Check out the bottom of this page:

    http://wiki.apache.org/hadoop/DiskSetup


    noatime is all we've done in our environment. I haven't found it worth the
    time to optimize further since we're CPU bound in most of our jobs.
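
    For reference, a minimal sketch of that mount option in /etc/fstab (device
    name and mount point are illustrative):

        /dev/sdb1   /data/1   ext3   defaults,noatime   0 0

    It can also be applied to an already-mounted partition without a reboot:

        mount -o remount,noatime /data/1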


    -paul
  • Stephen mulcahy at Oct 9, 2009 at 8:53 am

    paul wrote:
    Check out the bottom of this page:

    http://wiki.apache.org/hadoop/DiskSetup
    Just re-reading that page, there are two suggestions that may not be
    appropriate:

    1. Reducing reserved space to 0. AFAIK, ext3 needs a certain amount of
    free space to function properly - the man page for mke2fs suggests that
    this reserved space is used for defragmentation, as well as being
    emergency space reserved for root. A quick Google doesn't turn up
    anything more definitive, but setting it to 0 is a bad idea afaics.

    2. Reducing the number of inodes. This is a good idea only if you are
    really, really sure that nothing will create small files on that
    partition. Unless you are absolutely certain of this, I would not change
    the default - I'm not clear on how much of an overall saving you'll make,
    and the downside to running out of inodes is that you start getting "out
    of space" errors when you try to write to that disk (despite df showing
    you loads of free space), so again, I'm not sure I'd recommend this one.
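
    A quick way to see how close a partition is to that limit (mount point
    illustrative; the IUse% column is the one to watch):

        df -i /data/1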

    -stephen

    --
    Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
    NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
    http://di2.deri.ie http://webstar.deri.ie http://sindice.com
  • Edward Capriolo at Oct 9, 2009 at 1:36 pm
    On a 1 TB disk, reducing the reserved space from 5% to 2% saves almost
    30 GB. Cutting the inode count down saves you some space too, but not
    nearly as much - say 10 GB.

    The difference is that once you format your disk you can't change the
    number of inodes, whereas tunefs can tune the reserved blocks while the
    disk is mounted.

    I set the reserved space with tunefs -m2, turned on noatime, and moved on.
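
    For reference, the equivalent ext3 tool on most Linux distributions is
    tune2fs (device name illustrative):

        tune2fs -l /dev/sdb1 | grep -i 'reserved block'   # show current reserved block count
        tune2fs -m 2 /dev/sdb1                            # reserve 2% instead of the default 5%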
  • Stas Oskin at Oct 9, 2009 at 2:56 pm
    Hi.
    AFAIK, this space is reserved for root logs, in case the filesystem is
    full, so the kernel won't crash.

    From what I've seen, it only needs to be enabled on the root partition;
    on data partitions it can safely be set to 0.

    I usually leave the default 5% on root, boot and swap (as the space
    savings there are insignificant), and set it to 0 on the data partitions,
    where it really gives back the 50-60 GB mentioned earlier.
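
    For illustration, on a data-only partition that can be set either at
    format time or afterwards (device name illustrative):

        mkfs.ext3 -m 0 /dev/sdc1      # no reserved blocks at format time
        tune2fs -m 0 /dev/sdc1        # or change an existing filesystem in place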

    Regards.


  • Stas Oskin at Oct 11, 2009 at 8:13 am
    Hi.

    By the way, about the noatime - is it safe just to set this for all
    partitions used, including / and boot?

    Thanks.

  • Mikio Uzawa at Oct 11, 2009 at 11:53 am
    Hi all,

    I've just launched a blog named "JClouds" that describes Japanese cloud
    market trends.
    In the near future there will be some Hadoop topics on it, I trust.
    http://jclouds.wordpress.com/

    /mikio uzawa
  • Jason Venner at Oct 12, 2009 at 7:38 am
    Unless you are serving mail via imap or pop, it is generally considered
    safe.
  • Stas Oskin at Oct 12, 2009 at 9:12 am
    Hi.

    Thanks for the advice.

    Regards.

  • Ricky Ho at Oct 10, 2009 at 6:03 am
    I'd like to get some Hadoop experts to verify my understanding ...

    To my understanding, within a Map/Reduce cycle the input data set is "frozen" (no change is allowed) while the output data set is "created from scratch" (it doesn't exist beforehand). Therefore, the map/reduce model is inherently "batch-oriented". Am I right?

    I am wondering whether Hadoop is usable for processing many data streams in parallel. For example, think about an e-commerce site which captures users' product searches in many log files and wants to run some analytics on those log files in real time.

    One naïve way is to chunkify the log and perform Map/Reduce in small batches. Since the input data file must be frozen, we need to switch subsequent writes to a new logfile. However, the chunking approach is not good because the cutoff point is quite arbitrary. Imagine I want to calculate the popularity of a product based on the frequency of searches within the last 2 hours (a sliding time window). I don't think Hadoop can do this computation.

    Of course, if we don't mind a distorted picture, we can use a jumping window (1-3 PM, 3-5 PM ...) instead of a sliding window, and then maybe it's OK. But this is still not good, because we have to wait for two hours before getting the new batch of results. (E.g. at 4:59 PM, we only have the results from the 1-3 PM batch.)

    It doesn't seem like Hadoop is good at handling this kind of processing: "parallel processing of multiple real-time data streams". Anyone disagree? The term "Hadoop streaming" is confusing because it means a completely different thing to me (i.e. using stdout and stdin as input and output data).

    I'm wondering if a "mapper-only" model would work better. In this case there is no reducer (i.e. no grouping). Each map task keeps a history (i.e. a sliding window) of the data it has seen and then writes the result to the output file.

    I heard about the "append" mode of HDFS, but don't quite get it. Does it simply mean a writer can write to the end of an existing HDFS file? Or does it mean a reader can read while a writer is appending to the same HDFS file? Is this "append" feature helpful in my situation?

    Rgds,
    Ricky
  • Ted Dunning at Oct 10, 2009 at 8:02 am

    On Fri, Oct 9, 2009 at 11:02 PM, Ricky Ho wrote:

    ... To my understanding, within a Map/Reduce cycle the input data set is
    "frozen" (no change is allowed) while the output data set is "created from
    scratch" (it doesn't exist beforehand). Therefore, the map/reduce model is
    inherently "batch-oriented". Am I right?
    Current implementations are definitely batch oriented. Keep reading,
    though.

    I am wondering whether Hadoop is usable for processing many data streams
    in parallel.

    Absolutely.

    For example, think about an e-commerce site which captures users' product
    searches in many log files and wants to run some analytics on those log
    files in real time.
    Or consider Yahoo running their ad inventories in real-time.

    One naïve way is to chunkify the log and perform Map/Reduce in small
    batches. Since the input data file must be frozen, we need to switch
    subsequent writes to a new logfile.

    Which is not a big deal. Moreover, these small chunks can be merged every
    so often.

    However, the chunking approach is not good because the cutoff point is
    quite arbitrary. Imagine I want to calculate the popularity of a product
    based on the frequency of searches within the last 2 hours (a sliding time
    window). I don't think Hadoop can do this computation.
    Subject to a moderate delay of 5-20 minutes, this is no problem at all for
    Hadoop. This is especially true if you are doing straightforward
    aggregations that are associative and commutative.

    Of course, if we don't mind a distorted picture, we can use a jumping
    window (1-3 PM, 3-5 PM ...) instead of a sliding window, and then maybe
    it's OK. But this is still not good, because we have to wait for two hours
    before getting the new batch of results. (E.g. at 4:59 PM, we only have
    the results from the 1-3 PM batch.)
    Or just process each 10 minute period into aggregate form. Then add up the
    latest 12 aggregates. Every day, merge all the small files for the day and
    every month merge all the daily files.

    There are very few businesses where a 10 minute delay is a big problem.
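
    A minimal sketch of that chunked approach, run from cron every 10 minutes
    (the jar name, driver classes and HDFS paths below are illustrative, not a
    real API):

        # name the 10-minute chunk that just closed; writers have already
        # rolled over to a new log file
        CHUNK=$(date -d '10 minutes ago' +%Y%m%d%H%M)
        hadoop fs -mv /logs/incoming/$CHUNK /logs/closed/$CHUNK

        # aggregate just that chunk (e.g. search counts per product)
        hadoop jar analytics.jar ChunkAggregate /logs/closed/$CHUNK /agg/$CHUNK

        # roll up the 12 most recent chunk aggregates (~2 hours of data)
        RECENT=$(hadoop fs -ls /agg | awk '{print $NF}' | grep '^/agg/' | sort | tail -12 | paste -sd, -)
        hadoop jar analytics.jar Rollup "$RECENT" /report/latest

    A nightly job can then merge the day's chunk aggregates into one file to
    keep the number of small files down, as described above.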

    It doesn't seem like Hadoop is good at handling this kind of processing:
    "parallel processing of multiple real-time data streams". Anyone
    disagree?

    It isn't entirely natural, but it isn't a problem.

    I'm wondering if a "mapper-only" model would work better. In this case
    there is no reducer (i.e. no grouping). Each map task keeps a history
    (i.e. a sliding window) of the data it has seen and then writes the result
    to the output file.
    This doesn't scale at all well.

    Take a look at the Chukwa project for a well worked example of how to
    process logs in near real-time with Hadoop.
  • Hong Tang at Oct 10, 2009 at 8:07 am
    MapReduce is indeed inherently a batch processing model, where each
    job's outcome is determined entirely by the input and the operators
    (map, reduce, combiner), as long as the input stays immutable and the
    operators are deterministic and side-effect free. Such a model allows
    the framework to recover from failures without having to understand
    the semantics of the operators (unlike SQL). This is important because
    failures are bound to happen (frequently) on a large cluster assembled
    from commodity hardware.

    A typical technique to bridge a batch system and a real-time system is
    to pair the batch system with an incremental processing component that
    computes deltas on top of some aggregated result. The incremental
    processing part would also serve the real-time queries, so the data are
    typically stored in memory. Sometimes you have to choose approximation
    algorithms for the incremental part and periodically reset its internal
    state with the more precise batch processing results (e.g. for top-k
    queries).

    Hope this helps, Hong
  • Jeff Zhang at Oct 10, 2009 at 8:51 am
    I suggest you use Pig to handle your problem. Pig is a sub-project of
    Hadoop.

    And you do not need to worry about the boundary problem; actually Hadoop
    handles that for you.

    InputFormat helps you split the data, and RecordReader guarantees the
    record boundaries.


    Jeff zhang

  • Ricky Ho at Oct 10, 2009 at 3:07 pm
    PIG provides a higher-level programming interface but doesn't change the fundamental batch-oriented semantics to stream-based semantics. As long as a PIG script is compiled into Map/Reduce jobs, it uses the same batch-oriented mechanism.

    I am not talking about "record boundaries". I am talking about the boundary between 2 consecutive map/reduce cycles within a continuous data stream.

    I am thinking Ted's suggestion of the incremental small-batch approach may be a good solution, although I am not sure how small the batch should be. I assume there is a certain overhead to running Hadoop, so the batch shouldn't be too small, and there is a tradeoff to make between the delay of results and the batch size. I guess in most cases this should be OK.

    Rgds,
    Ricky
  • Amandeep Khurana at Oct 11, 2009 at 3:41 am
    We needed to process a stream of data too, and the best we could get to
    was incremental data imports and incremental processing. So that's
    probably your best bet as of now.

    -Amandeep

  • Tom Wheeler at Oct 8, 2009 at 7:50 pm
    I've used XFS on Silicon Graphics machines and JFS on AIX systems --
    both were quite fast and extremely reliable, though this long predates
    my use of Hadoop.

    To your question, I recently came across a blog that compares
    performance of several Linux filesystems:

    http://log.amitshah.net/2009/04/re-comparing-file-systems.html

    I'd consider his results anecdotal unless the tests reflect the actual
    workload of a datanode, but since he's made the code available, you
    could probably adapt it yourself to get a better measure.
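
    For a crude first pass (not a substitute for a datanode-like workload), a
    large sequential write/read check with dd is easy to run on each candidate
    filesystem (paths are illustrative; dropping the cache needs root):

        dd if=/dev/zero of=/data/1/ddtest bs=1M count=4096 conv=fdatasync
        echo 3 > /proc/sys/vm/drop_caches
        dd if=/data/1/ddtest of=/dev/null bs=1M
        rm /data/1/ddtest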
  • Jason Venner at Oct 8, 2009 at 8:00 pm
    noatime is absolutely essential; I forgot to mention it because it is
    automatic for me now.

    I have a fun story about atime. I have some Solaris machines with ZFS file
    systems, and I was doing a find on a 6-level hashed directory tree with
    250000 leaf nodes.

    The find on a cold, idle file system was running slowly, and the machine
    was writing at 5-10 MB/sec. Solaris lets you toggle atime at runtime, and
    when I turned it off the writes went to 0 and the find sped up
    drastically.

    This is very representative of a datanode with many blocks.


  • Edward Capriolo at Oct 8, 2009 at 8:02 pm

    The good news is it's not like you are stuck with the file system you
    pick. Assuming you use the normal replication level of 3, you can pull
    a datanode out, format its disks with any FS you want and then stick
    it back into the cluster. Hadoop should not care, after all. I'm not
    suggesting this...but you could theoretically run each node with a
    different file system, look at the performance and say "THIS is the
    one for me".
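
    A rough sketch of that swap on a single node (device names, mount points,
    the hadoop user and the dfs.data.dir layout are all illustrative):

        hadoop-daemon.sh stop datanode
        umount /data/1
        mkfs.xfs -f /dev/sdb1                  # or mkfs.ext3 / mkfs.jfs / ...
        mount -o noatime /dev/sdb1 /data/1
        mkdir -p /data/1/dfs/dn && chown hadoop:hadoop /data/1/dfs/dn
        hadoop-daemon.sh start datanode

    With replication 3, HDFS re-replicates any blocks that were only on that
    disk while the node was down.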
  • Stas Oskin at Oct 9, 2009 at 1:15 am
    Hi.

    I heard about this option before, but never actually tried it.

    There is also another option, called "relatime", which is described as
    being more compatible than noatime.
    Can anyone comment on this?

    Regards.

  • Edward Capriolo at Oct 9, 2009 at 1:20 am

    Relatime is basically an atime that gets updated only periodically, not on
    every read. Hadoop does not use atime, so there is no benefit to relatime
    over noatime here. Go with 'noatime,nodiratime', although I think
    nodiratime is a subset of noatime.

    FYI, almost nothing really uses atime. I heard mutt does, and older
    versions of vim might have, but I turned it off and never had an issue.
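
    For reference, both options can be applied to a mounted data partition
    without downtime and made permanent in /etc/fstab (mount point
    illustrative):

        mount -o remount,noatime,nodiratime /data/1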

Discussion Overview
group: common-user
category: hadoop
posted: Oct 8, '09 at 11:38a
active: Oct 12, '09 at 9:12a
posts: 26
users: 12
website: hadoop.apache.org...
irc: #hadoop
