FAQ
Let me rephrase this problem... as stated below, when I start writing to a
SequenceFile from an HDFS client, nothing is visible in HDFS until I've
written 64M of data. This presents three problems: fsck reports the file
system as corrupt until the first block is finally written out; the presence
of the file (without any data) seems to blow up my mapred jobs that try to
use it under my input path; and finally, I want to flush roughly every 15
minutes so I can run mapred over the latest data.
I don't see any programmatic way to force the file to flush in 0.17.2.
Additionally, "dfs.checkpoint.period" does not seem to be obeyed. Does that
not do what I think it does? What controls the 64M limit, anyway? Is it
"dfs.checkpoint.size" or "dfs.block.size"? Is the buffering happening on the
client, on the data nodes, or in the namenode?

It seems really bad that a SequenceFile, upon creation, is in an unusable
state from the perspective of a mapred job, and also leaves fsck in a
corrupt state. Surely I must be doing something wrong... but what? How can I
ensure that a SequenceFile is immediately usable (but empty) on creation,
and how can I make things flush on some regular time interval?

Thanks,
Brian

On Thu, Jan 29, 2009 at 4:17 PM, Brian Long wrote:

I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
and write to using append(key, value). Because the write volume is low,
it's not uncommon for my appends to take over a day to finally be flushed
to HDFS (i.e. the new file will sit at 0 bytes for over a day). Because I
am running map/reduce tasks on this data multiple times a day, I want to
"flush" the sequence file so the mapred jobs can pick it up when they run.
What's the right way to do this? I'm assuming it's a fairly common use
case. Also -- are writes to the sequence file atomic? (i.e. if I am
actively appending to a sequence file, is it always safe to read from that
same file in a mapred job?)

To be clear, I want the flushing to be time based (controlled explicitly by
the app), not size based. Will this create waste in HDFS somehow?

Thanks,
Brian
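
For concreteness, a minimal sketch of the kind of writer described above; the
path and the key/value types are placeholder assumptions. The 64M limit
corresponds to the HDFS block size (controlled by "dfs.block.size");
"dfs.checkpoint.period" and "dfs.checkpoint.size" govern the secondary
namenode's checkpointing and are unrelated. Records handed to append() are
buffered by the client and are not guaranteed to show up in HDFS until a
block completes or the writer is closed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class EventLogWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/logs/events.seq"); // hypothetical path

            // Appended records are buffered on the client; nothing is
            // guaranteed to be visible in HDFS until a full block is
            // written out or the writer is closed.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, LongWritable.class, Text.class);
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("example record"));
            writer.close(); // only now is the data reliably visible to readers
        }
    }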

  • Jason hadoop at Feb 3, 2009 at 3:57 am
    If you have to do a time-based solution for now, simply close the file and
    stage it, then open a new file.
    Your reads will have to deal with the fact that the data is now in multiple
    files.
    Warning: datanodes get pokey if they carry large numbers of blocks, and the
    quickest way to accumulate blocks is to create a lot of small files.
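
    A rough sketch of the rotation described above, with hypothetical names
    throughout and assuming a single writer thread: close the current part on
    a timer so it becomes visible to mapred, then open a new one.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class RollingSequenceFileWriter {
            private static final long ROLL_INTERVAL_MS = 15 * 60 * 1000; // 15 minutes

            private final Configuration conf;
            private final FileSystem fs;
            private final Path dir;
            private SequenceFile.Writer writer;
            private long lastRoll;

            public RollingSequenceFileWriter(Configuration conf, Path dir) throws IOException {
                this.conf = conf;
                this.fs = FileSystem.get(conf);
                this.dir = dir;
                roll();
            }

            // Close the current part (making it visible to readers) and start a new one.
            private void roll() throws IOException {
                if (writer != null) {
                    writer.close();
                }
                Path part = new Path(dir, "part-" + System.currentTimeMillis());
                writer = SequenceFile.createWriter(fs, conf, part,
                                                   LongWritable.class, Text.class);
                lastRoll = System.currentTimeMillis();
            }

            public void append(LongWritable key, Text value) throws IOException {
                if (System.currentTimeMillis() - lastRoll >= ROLL_INTERVAL_MS) {
                    roll();
                }
                writer.append(key, value);
            }

            public void close() throws IOException {
                writer.close();
            }
        }

    As the warning above notes, every roll adds another file (and at least one
    more block), so the interval has to be weighed against the small-files
    problem.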
  • Tom White at Feb 3, 2009 at 9:40 am
    Hi Brian,

    Writes to HDFS are not guaranteed to be flushed until the file is
    closed. In practice, as each (64MB) block is finished it is flushed
    and will be visible to other readers, which is what you were seeing.

    The addition of appends in HDFS changes this and adds a sync() method
    to FSDataOutputStream. You can read about the semantics of the new
    operations here:
    https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc.
    Unfortunately, there are some problems with sync() that are still
    being worked through
    (https://issues.apache.org/jira/browse/HADOOP-4379). Also, even with
    sync() working, the append() on SequenceFile does not do an implicit
    sync() - it is not atomic. Furthermore, there is no way to get hold of
    the FSDataOutputStream to call sync() yourself - see
    https://issues.apache.org/jira/browse/HBASE-1155. (And don't get
    confused by the sync() method on SequenceFile.Writer - it is for
    another purpose entirely.)
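
    For illustration, a sketch of what the sync() on a raw FSDataOutputStream
    looks like; as noted above it cannot be reached through SequenceFile.Writer,
    and its semantics were still being worked through at the time
    (HADOOP-4379). The path is a placeholder.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class RawSyncExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                FSDataOutputStream out = fs.create(new Path("/logs/raw-events.log"));

                out.writeBytes("one record\n");
                // Ask HDFS to push the buffered bytes to the datanodes, which is
                // intended to make them readable without closing the file
                // (append-enabled builds only).
                out.sync();

                out.writeBytes("another record\n");
                out.close();
            }
        }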

    As Jason points out, the simplest way to achieve what you're trying to
    do is to close the file and start a new one. If you start to get too
    many small files, you can run another process that merges the smaller
    files in the background.

    Tom
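
    A background merge along those lines might look like the sketch below; the
    directory layout, the key/value types, and deleting the small parts as soon
    as they are copied are all assumptions.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        // Merge many small, closed SequenceFiles into one larger file and then
        // remove the originals.
        public class SmallFileMerger {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path inputDir = new Path("/logs/parts");
                Path merged = new Path("/logs/merged/" + System.currentTimeMillis() + ".seq");

                SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, merged, LongWritable.class, Text.class);

                LongWritable key = new LongWritable();
                Text value = new Text();
                for (FileStatus stat : fs.listStatus(inputDir)) {
                    SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
                    while (reader.next(key, value)) {
                        writer.append(key, value);
                    }
                    reader.close();
                    fs.delete(stat.getPath(), false); // drop the small part once copied
                }
                writer.close();
            }
        }

    In practice the merge would need to skip the part that is still being
    written and make sure no job is reading the small parts before deleting
    them.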
