I have a Hadoop reducer that needs to write then read (key, value) pairs
from a local temporary file. It seems like the way to do this is with
Sequence Files, letting the Hadoop API choose their names for me, but I
haven't found the right combination of API calls to make it work.

If I try:

storePath = FileOutputFormat.getPathForWorkFile(context, "my-file", ".seq");
writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
    configuration, storePath, IntWritable.class, itemClass);
...
reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
    storePath, configuration);

I get an exception about a mismatch in file systems when trying to read from
the file.

java.lang.IllegalArgumentException: Wrong FS: hdfs://chnode1.tuk2.com:40020/user/bmcneill/blocking/1.8/intelius-names/extremal-sets/_temporary/_attempt_201103311636_0393_r_000000_0/my-file-r-00000.seq, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:718)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)

Alternatively, if I try:

storePath = new Path(SequenceFileOutputFormat.getUniqueFile(context, "my-file", ".seq"));
writer = SequenceFile.createWriter(FileSystem.get(configuration),
    configuration, storePath, IntWritable.class, itemClass);
...
reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
    storePath, configuration);

The file is missing when I try to read from it.

java.io.FileNotFoundException: File my-file-r-00000.seq does not exist.
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:372)
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:718)
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)

Am I using the wrong APIs, or is my problem elsewhere?


  • W.P. McNeill at Apr 4, 2011 at 11:30 pm
    I recall that at one point I did some API call that created a writable file
    beneath the _logs directory of my job. I think this is exactly what I need,
    but I can't remember what I did now, and I'm having a hard time figuring it
    out from the online API documentation.
  • Harsh J at Apr 9, 2011 at 9:41 am
    Hello,
    FileOutputFormat.getPathForWorkFile will give back HDFS paths. And
    since you are looking to create local temporary files to be used only
    by the task within itself, you shouldn't really need to worry about
    unique filenames.

    You're looking for the tmp/ directory locally created in the FS where
    the Task is running (at ${mapred.child.tmp}, which defaults to ./tmp).
    You can create a regular file there using vanilla Java APIs for files,
    or using RawLocalFS + your own created Path (not derived via
    OutputFormat/etc.).

    storePath = new Path(new Path(context.getConfiguration().get("mapred.child.tmp")), "my-file.seq");
    writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
        configuration, storePath, IntWritable.class, itemClass);
    ...
    reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
        storePath, configuration);
    The above should work, I think (haven't tried, but the idea is to use
    the mapred.child.tmp).

    Also see: http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Directory+Structure

    --
    Harsh J
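
    A minimal sketch putting the pieces above together: spill the reducer's
    values to a local SequenceFile under ${mapred.child.tmp}, then read them
    back. This is untested and makes some assumptions for illustration: the
    new (org.apache.hadoop.mapreduce) API, IntWritable/Text key and value
    classes, and the "scratch-*.seq" file name.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LocalScratchReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            FileSystem localFs = FileSystem.getLocal(conf);

            // Task-local temp dir: ${mapred.child.tmp}, which defaults to ./tmp
            // inside the task attempt's working directory.
            Path tmpDir = new Path(conf.get("mapred.child.tmp", "./tmp"));
            Path storePath = new Path(tmpDir, "scratch-" + key.get() + ".seq");

            // Pass 1: spill the incoming values to a local SequenceFile.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    localFs, conf, storePath, IntWritable.class, Text.class);
            try {
                for (Text value : values) {
                    writer.append(key, value);
                }
            } finally {
                writer.close();
            }

            // Pass 2: read the pairs back from the same local path.
            SequenceFile.Reader reader = new SequenceFile.Reader(localFs, storePath, conf);
            try {
                IntWritable k = new IntWritable();
                Text v = new Text();
                while (reader.next(k, v)) {
                    context.write(k, v); // or feed into whatever second pass is needed
                }
            } finally {
                reader.close();
            }

            localFs.delete(storePath, false); // clean up the scratch file
        }
    }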
  • Bryan Keller at May 4, 2011 at 12:11 pm
    I too am looking for the best place to put local temp files I create during reduce processing. I am hoping there is a variable or property someplace that defines a per-reducer temp directory. The "mapred.child.tmp" property is by default simply the relative directory "./tmp", so it isn't useful on its own.

    I have 5 drives being used in "mapred.local.dir", and I was hoping to use them all for writing temp files, rather than specifying a single temp directory that all my reducers use.

  • Robert Evans at May 4, 2011 at 4:04 pm
    Bryan,

    I believe that map/reduce gives you a single drive to write to so that your reducer has less of an impact on other reducers/mappers running on the same box. If you want to write to more drives, I thought the idea would then be to increase the number of reducers you have and let mapred assign each one a drive to use, instead of having one reducer eating up I/O bandwidth from all of the drives.

    --Bobby Evans

  • Bryan Keller at May 4, 2011 at 7:19 pm
    Right. What I am struggling with is how to retrieve the path/drive that the reducer is using, so I can use the same path for local temp files.
  • Harsh J at May 4, 2011 at 7:30 pm
    Bryan,

    The relative ./tmp directory is under the work-directory of the task's
    attempt on a node. I believe it should be sufficient to use that as a
    base, to create directories or files under? Or are you simply looking
    for a way to expand this relative directory name to a fully absolute
    one (which should be doable using simple Java SE APIs)?
    --
    Harsh J
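
    As a tiny, untested sketch of the second option mentioned above (expanding
    the relative directory with plain Java SE), with the class and method names
    made up for illustration:

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;

    public class TaskTmpDir {
        /** Resolves the (possibly relative) mapred.child.tmp value against the
         *  task JVM's current working directory. */
        public static File resolve(Configuration conf) {
            return new File(conf.get("mapred.child.tmp", "./tmp")).getAbsoluteFile();
        }
    }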
  • Matt Pouttu-Clarke at May 4, 2011 at 7:32 pm
    Hi Bryan,

    These are called side effect files, and I use them extensively:

    O'Reilly Hadoop, 2nd Edition, p. 187
    Pro Hadoop, p. 279

    You get the path to save the file(s) using:
    http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29

    The output committer moves these files from the work directory to the output
    directory when the task completes. That way you don't have duplicate files
    due to speculative execution. You should also generate a unique name for
    each of your output files by using this function to prevent file name
    collisions:
    http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29

    Hope this helps,
    Matt
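
    A minimal, untested sketch of a side-effect file along the lines described
    above, using the old org.apache.hadoop.mapred API that the links document;
    the "side-data" name and the key/value classes are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class SideEffectFiles {
        /** Opens a writer for an extra output file in the task's work directory.
         *  The output committer promotes it to the job output directory when the
         *  task commits, so failed or speculative attempts leave no duplicates. */
        public static SequenceFile.Writer openSideEffectWriter(JobConf conf) throws IOException {
            Path workDir = FileOutputFormat.getWorkOutputPath(conf);
            // getUniqueName appends the task id, e.g. "side-data-r-00000".
            Path sideFile = new Path(workDir, FileOutputFormat.getUniqueName(conf, "side-data"));
            FileSystem fs = sideFile.getFileSystem(conf);
            return SequenceFile.createWriter(fs, conf, sideFile, IntWritable.class, Text.class);
        }
    }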

  • Gang Luo at May 4, 2011 at 7:53 pm
    Hi,

    I use MapReduce to process and output my own stuff, in a customized way. I don't
    use context.write to output anything, and thus I don't want the empty
    part-r-xxxxx files on my fs. Is there some way to eliminate the output?

    Thanks.

    -Gang
  • Harsh J at May 4, 2011 at 8:04 pm
    Hello Gang,
    You're looking for the NullOutputFormat:
    http://search-hadoop.com/?q=nulloutputformat

    --
    Harsh J
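
    A minimal, untested sketch of the NullOutputFormat approach, assuming the
    new mapreduce API (the old mapred package has an equivalent
    org.apache.hadoop.mapred.lib.NullOutputFormat):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class NoOutputJob {
        public static Job configure(Configuration conf) throws Exception {
            Job job = new Job(conf, "custom-output-job");
            // ... set input format, mapper and reducer classes as usual ...
            // No part-r-* files are created; anything passed to context.write()
            // is simply discarded by the NullOutputFormat record writer.
            job.setOutputFormatClass(NullOutputFormat.class);
            return job;
        }
    }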
  • Gang Luo at May 4, 2011 at 8:43 pm
    Exactly what I want. Thanks Harsh J.

    -Gang
  • Bryan Keller at May 4, 2011 at 8:07 pm
    Am I mistaken or are side-effect files on HDFS? I need my temp files to be on the local filesystem. Also, the Java working directory is not the reducer's local processing directory, thus "./tmp" doesn't get me what I'm after. As it stands now I'm using java.io.tmpdir, which is not a long-term solution for me. I am looking to use the reducer's task-specific local directory, which should be balanced across my local drives.
  • Matt Pouttu-Clarke at May 5, 2011 at 5:59 am
    Bryan,

    Not sure you should be concerned with whether the output is on local vs.
    HDFS. I wouldn't think there would be much of a performance difference if
    you are doing streaming output (append) in both cases. Hadoop already uses
    local storage wherever possible (including for the task working
    directories as far as I know). I've never had performance problems with
    side effect files, as long as the correct setup is used.

    If multiple mounts are available locally where the tasks are running, you
    can definitely add a comma-delimited list to mapreduce.cluster.local.dir in
    the mapred-site.xml of those machines:
    http://hadoop.apache.org/common/docs/current/cluster_setup.html#mapred-site.xml

    Theoretically you can use the methods I listed earlier to create unique
    files/paths under /tmp or any other mount point you wish. However, it is
    much better to let Hadoop manage where the files are stored (i.e., use the
    work directory given to you).

    If you add multiple paths to mapreduce.cluster.local.dir then Hadoop will
    spread the I/O from multiple mappers/reducers across these paths. Likewise
    you can mount a RAID 0 (stripe) of multiple drives to get the same effect.
    You can use a single RAID 0 to keep the mapred-site.xml uniform. RAID 0 is
    fine since speculative execution takes care of things if a disk fails.

    It would be helpful to know your use case, since the primary option is
    normally to create multiple outputs from a reducer:
    http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

    Most likely you should try that before going into the realm of side effect
    files (or messing with local temp on the task nodes). Try the multiple
    outputs if you are dealing with streaming data. If you absolutely cannot
    get it to work then you may have to cross check the other more complex
    options.

    Cheers,
    Matt
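
    A rough, untested sketch of the MultipleOutputs route against the 0.21
    mapreduce API linked above; the "matrix" output name and the
    IntWritable/Text classes are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class MultiOutputReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        private MultipleOutputs<IntWritable, Text> mos;

        /** Called once while setting up the job in the driver. */
        public static void declareOutputs(Job job) {
            // One extra named output, written as SequenceFiles alongside the
            // job's normal output.
            MultipleOutputs.addNamedOutput(job, "matrix", SequenceFileOutputFormat.class,
                    IntWritable.class, Text.class);
        }

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<IntWritable, Text>(context);
        }

        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                mos.write("matrix", key, value); // goes to matrix-r-NNNNN files
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close(); // required, or the named output is never flushed
        }
    }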


  • Bryan Keller at May 6, 2011 at 5:20 pm
    Thanks for the info Matt.

    My use case is this. I have a fairly large amount of data being passed to my reducer. I need to load all of the data into a matrix and run a linear program. The most efficient way to do this from the LP side is to write all of the data to a file and pass the file to the LP. The file must be a local file; the LP library I'm using does not support streaming from HDFS, unfortunately.

    I want to do the processing in map-reduce, as it allows me to easily partition the problem into parts and run the LP in parallel. What I am doing now is iterating through the results in the reducer, writing to a local file, and then running the LP when the reducer is done passing in data. I wanted to be able to use the same local directory that the reducer is using so, if there are multiple reducers running, I can take advantage of all of the drives I have configured in mapred.local.dir.

  • Harsh J at May 6, 2011 at 6:14 pm
    Bryan,
    On Fri, May 6, 2011 at 10:50 PM, Bryan Keller wrote:
    I wanted to be able to use the same local directory that the reducer is using so, if there are multiple reducers running, I can take advantage of all of the drives I have configured in mapred.local.dir.
    If I was unclear before, ./tmp is within the "local directory that the
    reducer is using". If you're looking to manually handle local dirs to
    spread work around, this thread had some helpful discussions:
    http://search-hadoop.com/m/k3V5T1HFDxs1

    --
    Harsh J
