FAQ
Hi Folks,

Just doing a sanity check here.

I have a map-only job, which produces a filename for a key and data as
a value. I want to write the value (data) into the key (filename) in
the path specified when I run the job.

The value (data) doesn't need any formatting, I can just write it to
HDFS without modification.

So, looking at this link (the Output Formats section):

http://developer.yahoo.com/hadoop/tutorial/module5.html

Looks like I want to:
- create a new output format
- override write() and tell it not to write the key, as I don't want that written
- add a getRecordWriter method that uses the key as the filename and calls my output format
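Sketched out, that plan might look roughly like the following. This is a hypothetical, untested sketch against the old (org.apache.hadoop.mapred) API; the class name and the Text/BytesWritable type choices are illustrative, not from the thread:

```java
// Hypothetical sketch: a custom OutputFormat that writes each value's raw
// bytes into a file named after its key, under the job's output path.
// The key itself is never written into the file contents.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class KeyAsFilenameOutputFormat extends FileOutputFormat<Text, BytesWritable> {
  @Override
  public RecordWriter<Text, BytesWritable> getRecordWriter(
      FileSystem ignored, JobConf job, String name, Progressable progress)
      throws IOException {
    final Path dir = FileOutputFormat.getOutputPath(job);
    final FileSystem fs = dir.getFileSystem(job);
    return new RecordWriter<Text, BytesWritable>() {
      public void write(Text key, BytesWritable value) throws IOException {
        // One file per key; assumes no two mappers ever emit the same key.
        FSDataOutputStream out = fs.create(new Path(dir, key.toString()));
        try {
          out.write(value.getBytes(), 0, value.getLength());
        } finally {
          out.close();
        }
      }
      public void close(Reporter reporter) { }
    };
  }
}
```

Note this opens and closes one HDFS file per record, which only makes sense for a small number of large values (as discussed later in the thread).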

Sound reasonable?

Thanks,

Tom

--
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


  • Robert Evans at Jul 25, 2011 at 8:31 pm
    Tom,

    That assumes that you will never write to the same file from two different mappers or processes. HDFS currently does not support writing to a single file from multiple processes.

    --Bobby

  • Robert Evans at Jul 25, 2011 at 8:34 pm
    Tom,

I also forgot to mention that writing lots of little files can cause issues of its own. HDFS is designed to handle relatively few BIG files. There is some work to improve this, but it is still a ways off. So this method is likely to be very slow and put a big load on the namenode if you are going to create a lot of small files.

    --Bobby


  • Tom Melendez at Jul 25, 2011 at 8:45 pm
    Hi Bobby,

Yeah, that won't be a big deal in this case. It will create about 40
files of roughly 60MB each. This job is kind of an odd one that
won't be run very often.

    Thanks,

    Tom
  • Tom Melendez at Jul 25, 2011 at 8:35 pm
    Hi Robert,

In this specific case, that's OK. I'll never write to the same file
from two different mappers. Otherwise, does the approach sound OK? I
haven't played with output formats before.

    Thanks,

    Tom
  • Harsh J at Jul 25, 2011 at 9:18 pm
You can use MultipleOutputs (or MultipleTextOutputFormat for direct
key-to-file mapping, but I'd still prefer the stable MultipleOutputs).
Your sink key can be of NullWritable type, and you can keep passing
the NullWritable.get() instance to it on every write. This would
write just the value, while the filenames are derived from the
key inside the mapper code.

That is, if you are not comfortable writing and maintaining your own
code, I s'pose. Your approach is correct as well, if the question was
specifically about that.
    --
    Harsh J
  • Tom Melendez at Jul 25, 2011 at 10:07 pm
    Hi Harsh,

Thanks for the response. Unfortunately, I'm not following it. :-)

    Could you elaborate a bit?

    Thanks,

    Tom
  • Harsh J at Jul 26, 2011 at 8:35 am
    Tom,

What I meant to say was that this is well supported by the
existing API/libraries:

- The class MultipleOutputs supports providing a filename for an
output. See MultipleOutputs.addNamedOutput usage [1].
- The type NullWritable is a special writable that carries no data.
So if it's configured as the key type for the named output above,
and you pass NullWritable.get() as the key in every write
operation, you will end up writing just the value part of (key,
value).
- This way you do not have to write a custom OutputFormat for your use case.

    [1] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
    (Also available for the new API, depending on which
    version/distribution of Hadoop you are on)
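Put together, the mapper side of this might look like the following. A hypothetical, untested sketch against the old (org.apache.hadoop.mapred) API; the named output "data" and the Text/BytesWritable types are illustrative choices, not from the thread:

```java
// Hypothetical sketch: a mapper that sends each value to a named output
// with NullWritable keys, so only the value bytes reach the output file.
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class ValueOnlyMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, NullWritable, BytesWritable> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  @SuppressWarnings("unchecked")
  public void map(Text key, BytesWritable value,
      OutputCollector<NullWritable, BytesWritable> output, Reporter reporter)
      throws IOException {
    // NullWritable.get() serializes to nothing, so only the value is written.
    mos.getCollector("data", reporter).collect(NullWritable.get(), value);
  }

  @Override
  public void close() throws IOException {
    mos.close();  // flushes the cached record writers
  }
}
```

Driver-side setup would register the named output before submitting the job, e.g. `MultipleOutputs.addNamedOutput(conf, "data", SequenceFileOutputFormat.class, NullWritable.class, BytesWritable.class);`.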
    --
    Harsh J
  • Tom Melendez at Jul 26, 2011 at 3:52 pm
    Hi Harsh,

    Cool, thanks for the details. For anyone interested, with your tip
    and description I was able to find an example inside the "Hadoop in
    Action" (Chapter 7, p168) book.

Another question, though: it doesn't look like MultipleOutputs will
let me control the filename on a per-key (per-map) basis. So,
basically, if my map receives a key of "mykey", I want my file to be
"mykey-someotherstuff.foo" (this is a binary file). Am I right about
this?

    Thanks,

    Tom
  • Harsh J at Jul 26, 2011 at 7:08 pm
    Tom,

You can theoretically add any number of named outputs from a single
task, even from within the map() calls (addNamedOutput and
addMultiNamedOutput check for duplicates themselves, so you don't have
to). So yes, you can keep adding outputs and using them per key, and
given your earlier details about how many that's gonna be, I think
MultipleOutputs would behave just fine with its cache of record writers.

Regarding your other question, there are certain restrictions on the
names provided to MultipleOutputs as a named output. Specifically,
they accept only [A-Za-z0-9], and an "_" is auto-included if you are
using multi-named outputs. These restrictions may go away in the
future (0.23+) to allow for more flexible naming, however.
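Given that [A-Za-z0-9] restriction, a key like "mykey-someotherstuff.foo" would need sanitizing before it could serve as a named-output name. A minimal helper (plain Java; the class and method names are illustrative, not part of the Hadoop API):

```java
// Strip every character outside [A-Za-z0-9] from a key so the result is
// legal as a MultipleOutputs named-output name.
public class NamedOutputNames {
    public static String sanitize(String key) {
        StringBuilder sb = new StringBuilder();
        for (char c : key.toCharArray()) {
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "mykey-someotherstuff.foo" loses its '-' and '.' characters.
        System.out.println(sanitize("mykey-someotherstuff.foo"));
    }
}
```

The mapping from sanitized name back to the desired final filename would then have to be handled separately, e.g. by renaming the output files after the job completes.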
    --
    Harsh J

Discussion Overview
group: common-user @ hadoop
posted: Jul 25, '11 at 8:26p
active: Jul 26, '11 at 7:08p
posts: 10
users: 3
website: hadoop.apache.org...
irc: #hadoop
