Reader/Writer problem in HDFS
Hi,

We have a job where the map tasks are given the path to an output folder.
Each map task writes a single file to that folder. There is no reduce phase.
Another thread constantly looks for new files in the output folder; when it
finds one, it persists the contents to an index and deletes the file.
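
For concreteness, a minimal sketch of the reader thread (persistToIndex, the
poll interval, and the error handling are simplified placeholders, not our
actual code):

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Poll the output folder, index each new file, then delete it.
void pollAndIndex(FileSystem fileSystem, Path outputDir) throws Exception {
    while (!Thread.currentThread().isInterrupted()) {
        FileStatus[] statuses = fileSystem.listStatus(outputDir);
        if (statuses != null) {                  // null if the folder does not exist yet
            for (FileStatus status : statuses) {
                FSDataInputStream in = fileSystem.open(status.getPath());
                try {
                    persistToIndex(in);          // placeholder for the indexing step
                } finally {
                    in.close();
                }
                fileSystem.delete(status.getPath(), false);
            }
        }
        Thread.sleep(1000);                      // placeholder poll interval
    }
}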

We use this code in the map task:
// Declare the stream outside the try block so the finally clause can close it;
// as originally posted, oStream was out of scope in the finally block.
OutputStream oStream = null;
try {
    oStream = fileSystem.create(path);
    IOUtils.write("xyz", oStream);
} finally {
    IOUtils.closeQuietly(oStream);
}

The problem: sometimes the reader thread sees and tries to read a file that
is not yet fully written to HDFS (or whose checksum is not yet written, etc.),
and throws an error. Is it possible to write an HDFS file in such a way that
it won't be visible until it is fully written?

We use Hadoop 0.20.203.

Thanks,

Meghana

  • Laxman at Jul 28, 2011 at 10:45 am
    One approach can be to use a ".tmp" extension while writing. Once the
    write is completed, rename the file back to its original name. The reader
    also has to filter out ".tmp" files.

    This will ensure the reader does not pick up partial files.

    We had a similar scenario, and the aforementioned approach resolved the
    issue.
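
    A minimal sketch of this write-then-rename pattern, using only standard
    FileSystem calls (the variable names and the reader-side filter are
    illustrative):

    import java.io.OutputStream;
    import org.apache.commons.io.IOUtils;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Writer side: create under a temporary name, rename once fully written.
    Path tmpPath = finalPath.suffix(".tmp");
    OutputStream out = null;
    try {
        out = fs.create(tmpPath);
        IOUtils.write("xyz", out);
    } finally {
        IOUtils.closeQuietly(out);
    }
    // rename() is a single metadata operation on the NameNode, so the reader
    // sees either the complete file under its final name or no file at all.
    fs.rename(tmpPath, finalPath);

    // Reader side: skip anything still being written.
    PathFilter skipTmp = new PathFilter() {
        public boolean accept(Path p) {
            return !p.getName().endsWith(".tmp");
        }
    };
    FileStatus[] ready = fs.listStatus(outputDir, skipTmp);

    Since the rename happens only after close(), the data and checksums are
    fully on HDFS by the time the file appears under its final name.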

  • Meghana at Jul 28, 2011 at 11:03 am
    Thanks Laxman! That would definitely help things. :)

    Is there a better FileSystem (or other) method call to create a file in
    one go (i.e. atomically, I guess?), without having to call create() and
    then write to the stream?

    ..meghana

  • Laxman at Jul 28, 2011 at 11:21 am
    No such API as far as I know. copyFromLocal is the closest such API, but
    that may not fit your scenario, I guess.
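
    For reference, the programmatic equivalent is FileSystem.copyFromLocalFile
    (the paths below are only examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Write the file to local disk first, then copy it into HDFS in one call.
    // The copy itself still streams block by block, so on its own it does not
    // make the file appear atomically; combining it with the ".tmp" rename
    // above does.
    fs.copyFromLocalFile(new Path("/local/tmp/part-00000"),
            new Path("/output/part-00000"));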

    --Laxman

