Hi,

I looked at different file types, input and output formats, but got
quite confused, and am not sure how to connect the pipe from one
format to another.

Here is what I would like to do:

1. Pass in a string to my hadoop program, and it will write this
single key-value pair to a file on the fly.

2. The first job will read from this file, do some processing, and
write more key-value pairs to other files (the same format as the file
in step 1). Subsequent jobs will read from those files generated by
the first job. This will continue in an iterative manner until some
terminal condition has been reached.

3. Both the key and value in the file should be text (i.e., human-readable
ASCII).

While this sounds simple, I have been having trouble figuring out the
correct formats to use, and here is why:

JobConf.setInputKeyClass and setInputValueClass are both deprecated,
so I am avoiding them.

SequenceFileOutputFormat doesn't work because the key has to be
IntWritable, and a Text key causes the code to blow up (which I still
don't quite understand, because when I use a SequenceFile.Writer, it
can take Text for both keys and values).

KeyValueTextInputFormat looks promising, but I am not sure how to
bootstrap the first file mentioned in step 1, i.e. what formats and
writer I should use to create the file to hold the initial argument...

I have a feeling that this is actually a very simple problem, and that
I am simply not looking in the right direction. Your help would be
greatly appreciated.

-- Jim

  • Ted Dunning at Dec 18, 2007 at 2:21 am
    Part of your problem is that you appear to be using a TextInputFormat (the
    default input format). The TIF produces keys that are LongWritable and
    values that are Text.

    Other input formats produce different types.

    With recent versions of Hadoop, classes that extend InputFormatBase can (and
    I think should) use generics to describe their output types. Similarly,
    classes extending MapReduceBase can specify their input/output classes, and
    classes extending OutputFormat can specify their output classes.

    I have added more specific comments in-line.

    On 12/17/07 5:40 PM, "Jim the Standing Bear" wrote:

    1. Pass in a string to my hadoop program, and it will write this
    single key-value pair to a file on the fly.
    How is your string a key-value pair?

    Assuming that you have something as simple as tab-delimited text, you may
    not need to do anything at all other than just copy this data into hadoop.
    2. The first job will read from this file, do some processing, and
    write more key-value pairs to other files (the same format as the file
    in step 1). Subsequent jobs will read from those files generated by
    the first job. This will continue in an iterative manner until some
    terminal condition has reached.
    Can you be more specific?

    Let's assume that you are reading tab-delimited data. You should set the
    input format:

    conf.setInputFormat(TextInputFormat.class);

    Then, since the output of your map will have a string key and value, you
    should tell the system this:

    step1.setOutputKeyClass(Text.class);
    step1.setOutputValueClass(Text.class);

    Note that the signature on your map function should be:

    public static class JoinMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      ...

      public void map(LongWritable k, Text input,
                      OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
        // Text has no split(); convert to a String first
        String[] parts = input.toString().split("\t");

        Text key, result;
        ...
        output.collect(key, result);
      }
    }

    And your reduce should look something like this:

    public static class JoinReduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Mumble> {

      public void reduce(Text k, Iterator<Text> values,
                         OutputCollector<Text, Mumble> output,
                         Reporter reporter) throws IOException {
        Text key;
        Mumble result;
        ....
        output.collect(key, result);
      }
    }

    KeyValueTextInputFormat looks promising
    This could work, depending on what data you have for input. Set the
    separator byte to be whatever separates your key from your value and off you
    go.
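
    A minimal sketch of the configuration described above, using the old mapred
    JobConf API (MyJob is a placeholder driver class; the separator property only
    needs to be set if the key/value separator is not a tab):

    // assumes org.apache.hadoop.mapred.* and org.apache.hadoop.io.Text imports
    JobConf conf = new JobConf(MyJob.class);
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // property read by the old KeyValueLineRecordReader; defaults to a tab
    conf.set("key.value.separator.in.input.line", "\t");
    conf.setOutputFormat(TextOutputFormat.class);   // writes key<TAB>value lines
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);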
  • Jim the Standing Bear at Dec 18, 2007 at 2:47 am
    Hi Ted,

    Yes, I got quite confused and picked TextInputFormat because I thought
    it would be easy to understand.

    To be more specific on what I am trying to do:

    I pass in the path to a directory (say "/usr/mydir/bigtree"). The
    code writes this to a file: DIR <TAB> /usr/mydir/bigtree

    The job will read data from the file, and if it gets a DIR, it will
    walk into it, list everything that directory contains, and write the
    contents to another file. Sub-directories will have "DIR" as their
    keys, and files will have "FILE". Then the same job configuration
    will read off the new data file and do the same thing again and
    again, until there are no more directories to be walked. So in the
    end, there should be a file containing all the files under a directory
    (not necessarily directly under it).
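
    A rough sketch of the kind of driver loop this describes (WalkJob, WalkMapper,
    the DIRS_EMITTED counter, and the paths are all made-up names, and it assumes
    job counters are available to signal the terminal condition):

    <code>
    Path current = new Path("walk/iter0");   // holds the seed DIR<TAB>... line
    int i = 0;
    boolean moreDirs = true;
    while (moreDirs) {
        JobConf conf = new JobConf(WalkJob.class);
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputPath(current);
        Path next = new Path("walk/iter" + (++i));
        conf.setOutputPath(next);
        conf.setMapperClass(WalkMapper.class);   // lists each DIR it is handed
        RunningJob job = JobClient.runJob(conf);
        // stop once an iteration emits no more DIR entries
        moreDirs = job.getCounters().getCounter(WalkMapper.Counter.DIRS_EMITTED) > 0;
        current = next;
    }
    </code>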

    Now that you told me about the generics, I am hoping that the reason
    the sequence file didn't work for me is that I didn't set the correct
    types. I shall try that again.

    With KeyValueTextInputFormat, the problem is not reading it - I know
    how to set the separator byte and all that... my problem is with
    creating the very first file - I simply don't know how. I can use
    SequenceFile.Writer to write the key and value, but the file contains
    a header, some funny-looking separator and sync bytes. If I simply
    want a file containing clean Key<Text>\tValue<Text>, I don't know what
    kind of writer to use to create it. Do you know of a way? Thanks.

    -- Jim
  • Ted Dunning at Dec 18, 2007 at 3:07 am
    I thought that is what your input file already was. The
    KeyValueTextInputFormat should read your input as-is.

    When you write out your intermediate values, just make sure that you use
    TextOutputFormat and put "DIR" as the key and the directory name as the
    value (same with files).
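
    In the mapper, that amounts to something like the following (isDir and
    pathString are placeholders for however the directory listing was obtained):

    // TextOutputFormat will render each pair as "DIR<TAB>/some/path" or
    // "FILE<TAB>/some/path", one per line
    output.collect(new Text(isDir ? "DIR" : "FILE"), new Text(pathString));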

    On 12/17/07 6:46 PM, "Jim the Standing Bear" wrote:

    With KeyValueTextInputFormat, the problem is not reading it - I know
    how to set the separator byte and all that... my problem is with
    creating the very first file - I simply don't know how.
  • Jim the Standing Bear at Dec 18, 2007 at 3:11 am
    Hi Ted,

    I guess I didn't make it clear enough. I don't have a file to start
    with. When I run the program, I pass in an argument. The program,
    before doing its map/red jobs, is supposed to create a file on the
    DFS and save whatever I just passed in. And my trouble is, I am not
    sure how to create such a file so that both the keys and values are
    plain Text and can subsequently be read by
    KeyValueTextInputFormat.
  • Ted Dunning at Dec 18, 2007 at 3:24 am
    Just do:

    $ echo "DIR\t/foo/bar/directory" > file
    $ hadoop -put file hfile

    And you got yourself a file.
  • Jim the Standing Bear at Dec 18, 2007 at 3:04 am
    Just an update... my problem seems to be beyond defining generic types.

    Ted, I don't know if you have the answer to this question, which is
    regarding SequenceFile.

    If I am to create a SequenceFile by hand, I can do the following:

    <code>
    JobConf jobConf = new JobConf(MyClass.class);
    JobClient jobClient = new JobClient(jobConf);

    FileSystem fileSystem = jobClient.getFs();
    SequenceFile.Writer writer = SequenceFile.createWriter(fileSystem,
        jobConf, path, Text.class, Text.class);
    </code>

    After that, I can write all Text-based keys and values by doing this:

    <code>
    Text keyText = new Text();
    keyText.set("mykey");

    Text valText = new Text();
    valText.set("myval");

    writer.append(keyText, valText);
    </code>

    As you can see, there is no LongWritable whatsoever.

    However, in a map/reduce job, if I am to specify
    <code>
    jobConf.setOutputFormat(SequenceFileOutputFormat.class);
    </code>

    And later in the mapper, if I am to say
    <code>
    Text newkey = new Text();
    newkey.set("AAA");

    Text newval = new Text();
    newval.set("bbb");

    output.collect(newkey, newval);
    </code>

    It would throw an exception, complaining that the key is not LongWritable.

    So that's a part of the reason that I am having trouble connecting the
    pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
    are talking about two different kinds of "sequence files"...
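
    For reference, a minimal sketch of the piece that seems to tie these together
    (per Ted's earlier note about setOutputKeyClass): SequenceFileOutputFormat
    opens its writer with the key/value classes declared on the JobConf rather
    than with whatever the mapper emits, and the default output key class is, if
    memory serves, LongWritable.

    <code>
    jobConf.setOutputFormat(SequenceFileOutputFormat.class);
    jobConf.setOutputKeyClass(Text.class);    // tell the framework the keys are Text
    jobConf.setOutputValueClass(Text.class);  // and the values as well
    </code>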
  • Ted Dunning at Dec 18, 2007 at 3:09 am
    You never set the input format in the second step.

    But I think you want to stay with your KeyValueTextInputFormat for input and
    TextOutputFormat for output.

    On 12/17/07 7:03 PM, "Jim the Standing Bear" wrote:


    So that's a part of the reason that I am having trouble connecting the
    pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
    are talking about two different kinds of "sequence files"...
  • Jim the Standing Bear at Dec 18, 2007 at 3:13 am
    When you said "You never set the input format in the second step",
    were you instructing me NOT to set input format in the second step, or
    were you asking me why I never set it in the second step?

  • Ted Dunning at Dec 18, 2007 at 3:25 am
    I was saying that you didn't do it and probably should have.

    On 12/17/07 7:12 PM, "Jim the Standing Bear" wrote:

    When you said "You never set the input format in the second step",
    were you instructing me NOT to set input format in the second step, or
    were you asking me why I never set it in the second step?

  • Jim the Standing Bear at Dec 18, 2007 at 3:34 am
    Hi Ted,

    I see... I was hoping that the program could create it instead of
    having the user do it, but I guess hadoop is not really meant to be
    interactive/user-friendly.

    About the second step and why I didn't say what input format it
    used... In the code, I did specify the format. However, it depended
    upon the file output formats I used in the first step. Because I
    got so confused, I thought it would be more important to nail down the
    correct output format in the first step.

    -- Jim
  • Ted Dunning at Dec 18, 2007 at 4:30 am
    The program can create the file just as easily as the shell commands that I
    gave you. You can open an output stream to a file in the hadoop file system
    and write the seed data.
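
    A minimal sketch of that (the path, the seed string, and the use of args[0]
    are placeholders; assumes the usual org.apache.hadoop.fs imports):

    // write the seed key/value line straight into HDFS from the driver
    FileSystem fs = FileSystem.get(jobConf);
    Path seed = new Path("walk/iter0/part-00000");
    FSDataOutputStream out = fs.create(seed);
    out.writeBytes("DIR\t" + args[0] + "\n");   // e.g. DIR<TAB>/usr/mydir/bigtree
    out.close();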

  • Arun C Murthy at Dec 18, 2007 at 5:30 am
    Jim,

    Hopefully you've fixed this and gone ahead; just in case...

    You were right in using SequenceFile with <Text, Text> as the
    key/value types for your first job.

    The problem is that you did not specify an *input-format* for your
    second job. The Hadoop Map-Reduce framework assumes TextInputFormat as
    the default, which is <LongWritable, Text> and hence the
    behaviour/exceptions you ran into...

    hth,
    Arun

    PS: Do take a look at
    http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html,
    specifically the section titled Job Input
    (http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html#Job+Input).

    Do let us know how and where we should improve it... Thanks!


  • Jim the Standing Bear at Dec 18, 2007 at 5:57 am
    Hi Arun,

    I did specify the input format. The first job's output format is
    SequenceFileOutputFormat, and the second job's input format is
    SequenceFileInputFormat. But it seems that the two formats don't
    connect.

    Is there a reason that setInputKeyClass and setInputValueClass are
    being deprecated? I saw these two being used, even in Nutch.

    Please see the code snippet below:

    <code>
    JobConf writeJob = new JobConf(SequenceFileIndexer.class);
    writeJob.setJobName("testing");
    writeJob.setInputFormat(SequenceFileInputFormat.class);
    writeJob.setInputPath(path);

    Path outPath = new Path("write-out");
    writeJob.setOutputPath(outPath);
    writeJob.setOutputFormat(SequenceFileOutputFormat.class);
    writeJob.setMapperClass(SequenceFileIndexer.class);

    JobClient.runJob(writeJob);   // this job finished correctly

    JobConf secondJob = new JobConf(SequenceFileIndexer.class);
    secondJob.setJobName("second");
    secondJob.setInputFormat(SequenceFileInputFormat.class);
    secondJob.setInputPath(outPath);
    secondJob.setOutputKeyClass(Text.class);
    secondJob.setOutputValueClass(Text.class);
    Path finalPath = new Path("final");
    secondJob.setOutputPath(finalPath);
    secondJob.setMapperClass(SequenceFileIndexer.class);
    JobClient.runJob(secondJob);  // but this job blew up, complaining that
                                  // the file format is not correct


    public void map(Text key, Text val,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        String x = val.toString();
        String k = key.toString();

        output.collect(key, val);
    }
    </code>
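
    Worth noting, as a guess only: the first job above never declares its output
    key/value classes, and (per the earlier discussion) those are what
    SequenceFileOutputFormat uses when it opens its writer. A hedged sketch of the
    extra lines that would make the first job's sequence file hold <Text, Text>
    pairs:

    <code>
    // declare Text output classes on the first job as well, so its
    // SequenceFileOutputFormat writes <Text, Text> records instead of the
    // default (LongWritable keys, if memory serves)
    writeJob.setOutputKeyClass(Text.class);
    writeJob.setOutputValueClass(Text.class);
    </code>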
