Dear All,

I have a requirement where I need to move my existing program to the
MapReduce framework:

---I am reading files within a directory and also subdirectories.
---Processing one file at a time
---Writing all the processed output to a single output file. [One output file
per folder]

Now, if I have to do this using MapReduce, how should I proceed?
I think I need to give one file to each mapper, and once all the mappers
are done, a single reducer should write to a single output file [as I think
we cannot write to a single output file in parallel].

Please suggest (or point me to resources) how I can:
a) Have my map function receive one whole file at a time (instead of one
line at a time)
b) Read files in subdirectories as well, one file at a time. Would
implementing a custom RecordReader and/or FileInputFormat allow me to do this?

Appreciate any help.
Thanks
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"


  • Harsh J at Nov 17, 2010 at 5:41 pm
    Hi,
    On Wed, Nov 17, 2010 at 7:52 PM, Bhaskar Ghosh wrote:

    > ---I am reading files within a directory and also subdirectories.

    Currently FileInputFormat lets you read files for MapReduce, but it does
    not recurse into directories. Although globs are accepted in Path
    strings, for proper recursion you need to implement the logic inside
    your custom extended FileInputFormat yourself.

    > ---Processing one file at a time

    Doable by turning off file-splitting, or by packing the inputs into
    SequenceFiles/HARs.

    > ---Writing all the processed output to a single output file. [One output
    > file per folder]

    Doable with a single reducer, but why do you require a single file?

    > I think I need to give one file to each mapper, and once all the mappers
    > are done, a single reducer should write to a single output file [as I think
    > we cannot write to a single output file in parallel].

    There's a "getmerge" feature in the Hadoop DFS utilities that retrieves
    a DFS directory of outputs as a single local file. You should use that
    feature instead of bottlenecking your reduce phase with a single reducer
    instance (unless a single reducer is a requirement of some sort).

    See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html for
    the exact command syntax.
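
    For example, once the reducers have written their part files into an
    output directory on HDFS, a command along the lines of
    "hadoop fs -getmerge /user/bhaskar/job-output /tmp/merged-output.txt"
    (the two paths here are only placeholders) pulls them down as one local file.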

    > Please suggest (or point me to resources) how I can:
    > a) Have my map function receive one whole file at a time (instead of one
    > line at a time)

    I suggest pre-creating a Hadoop SequenceFile for this purpose, with
    the <Key, Value> being <Filename, Contents>. Another solution would be
    to use HAR. See
    http://www.cloudera.com/blog/2009/02/the-small-files-problem/ for some
    further discussion on this.
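
    For illustration, a rough sketch of that pre-packing step (the class name
    and argument handling are only placeholders, and it assumes the 0.20 API;
    it reads each file fully into memory, so it is meant for small files only):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory holding the small files
        Path outFile = new Path(args[1]);    // SequenceFile to create

        // One <filename, contents> record per input file.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, outFile, Text.class, BytesWritable.class);
        try {
          for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.isDir()) {
              continue;  // recurse here if subdirectories must be included
            }
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try {
              in.readFully(0, contents);
            } finally {
              in.close();
            }
            writer.append(new Text(status.getPath().toString()),
                          new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }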

    > b) Read files in subdirectories as well, one file at a time. Would
    > implementing a custom RecordReader and/or FileInputFormat allow me to do this?

    FileInputFormat.isSplitable is the method that tells the framework whether
    an input file may be split into chunks for processing, and
    FileInputFormat.listStatus is the method that lists the files
    (FileStatus objects) for which mapper splits are computed.

    You should write a custom class that extends FileInputFormat and overrides
    these two methods: return false so that files are not split, and recurse
    yourself as required so that a proper list of FileStatus objects is handed
    back to the framework. A rough sketch follows below.

    (In trunk code, the recursion support has been added to
    FileInputFormat itself. See MAPREDUCE-1501 on Apache's JIRA for the
    specifics and a patch.)
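
    To make that concrete, here is a rough sketch of such an extended input
    format (the class names are placeholders of mine, it assumes the new-API
    FileInputFormat from 0.20, and the nested record reader simply hands each
    whole file to the mapper as a single <filename, contents> record):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;  // each mapper gets one whole file
      }

      // Expand subdirectories ourselves instead of letting them fail later.
      @Override
      protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> files = new ArrayList<FileStatus>();
        for (FileStatus status : super.listStatus(job)) {
          addRecursively(files, status, job.getConfiguration());
        }
        return files;
      }

      private void addRecursively(List<FileStatus> files, FileStatus status,
                                  Configuration conf) throws IOException {
        if (status.isDir()) {
          FileSystem fs = status.getPath().getFileSystem(conf);
          for (FileStatus child : fs.listStatus(status.getPath())) {
            addRecursively(files, child, conf);
          }
        } else {
          files.add(status);
        }
      }

      @Override
      public RecordReader<Text, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
      }

      // Reads an entire file as a single <filename, contents> record.
      public static class WholeFileRecordReader
          extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          this.split = (FileSplit) split;
          this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            return false;
          }
          Path file = split.getPath();
          byte[] contents = new byte[(int) split.getLength()];
          FSDataInputStream in = file.getFileSystem(conf).open(file);
          try {
            in.readFully(0, contents);
          } finally {
            in.close();
          }
          key.set(file.toString());
          value.set(contents, 0, contents.length);
          processed = true;
          return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
      }
    }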

    --
    Harsh J
    www.harshj.com
  • Bhaskar Ghosh at Nov 19, 2010 at 8:21 pm
    Hi Harsh/All,

    I am getting exactly the same error as the one reported by Kunal Gupta
    <kun...@techlead-india.com> here:

    http://web.archiveorange.com/archive/v/5nvvZS5AMnkkI30H1pgm#6krHHtdTKhXDtBp

    Kunal, if you are still there, please help me. Did you manage to solve the
    issue back then?

    Exception in thread "main" java.lang.RuntimeException:
    java.lang.NoSuchMethodException:
    org.iiit.cloud.PreTrainingProcessorMR$WholeFileTextInputFormat.<init>()
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:923)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:820)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.iiit.cloud.PreTrainingProcessorMR.main(PreTrainingProcessorMR.java:133)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    Caused by: java.lang.NoSuchMethodException:
    org.iiit.cloud.PreTrainingProcessorMR$WholeFileTextInputFormat.<init>()
    at java.lang.Class.getConstructor0(Class.java:2706)
    at java.lang.Class.getDeclaredConstructor(Class.java:1985)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)




    I have three classes:
    1) PreTrainingProcessorMR is my driver program, containing the mapper and
    reducer classes [see below for how I am running the job inside the main
    method of this file]
    2) WholeFileTextInputFormat is my custom InputFormat [attached]
    3) WholeFileLineRecordReader is my custom RecordReader [attached]



    I am setting up and submitting the MapReduce job like this:

    Job job = new Job(conf, "PreTrainingProcessorMR");
    job.setJarByClass(PreTrainingProcessorMR.class);
    job.setMapperClass(GrievanceFormatMapper.class);
    job.setReducerClass(GrievanceFormatReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setInputFormatClass(WholeFileTextInputFormat.class);
    WholeFileTextInputFormat.addInputPath(job, new Path(otherArgs[0]));
    //FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);



    I am really stuck on this. Am I missing something? Any help would be much
    appreciated.
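
    (One guess I still need to verify: as far as I understand, Hadoop creates
    the InputFormat through ReflectionUtils.newInstance(), which needs a public
    no-argument constructor, and a non-static inner class cannot provide one.
    So if the problem is that WholeFileTextInputFormat is declared as a
    non-static inner class of PreTrainingProcessorMR, declaring it "static",
    or moving it into its own top-level file, might be what fixes the
    NoSuchMethodException on <init>().)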


    Thanks
    Bhaskar Ghosh
    Hyderabad, India

    http://www.google.com/profiles/bjgindia

    "Ignorance is Bliss... Knowledge never brings Peace!!!"




    ________________________________
    From: Harsh J <qwertymaniac@gmail.com>
    To: mapreduce-user@hadoop.apache.org
    Sent: Wed, 17 November, 2010 9:40:44 AM
    Subject: Re: How to read whole files and output processed texts to another file
    through MapReduce

