execute mapreduce job on multiple hdfs files
Hi,
All the examples that I found execute a MapReduce job on a single file, but in
my situation I have more than one.

Suppose I have a folder on HDFS that contains some files:

/my_hadoop_hdfs/my_folder:
/my_hadoop_hdfs/my_folder/file1.txt
/my_hadoop_hdfs/my_folder/file2.txt
/my_hadoop_hdfs/my_folder/file3.txt


How can I execute Hadoop MapReduce on file1.txt, file2.txt, and file3.txt?

Is it possible to pass the folder to the Hadoop job as a parameter so that all
of its files are processed by the MapReduce job?

Thanks in advance,
Oleg


  • Gang Luo at Mar 23, 2010 at 12:56 pm
    Hi Oleg,
    You can use FileInputFormat.addInputPath(JobConf, Path) multiple times in your program to add arbitrary input paths. If you use FileInputFormat.setInputPath instead, there can be only one input path.

    If you are talking about output, the path you give is an output directory; all the output files (part-00000, part-00001, ...) will be generated in that directory.

    -Gang
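
    Not from the original thread: a minimal driver sketch of what this reply describes, using the old org.apache.hadoop.mapred API. The class name, input files, and output directory are illustrative placeholders; mapper/reducer configuration is omitted.

        // Illustrative sketch only (old mapred API); paths and class name are placeholders.
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class MultiFileJob {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MultiFileJob.class);
            conf.setJobName("multi-file-example");
            // Mapper/reducer and key/value type settings omitted for brevity.

            // addInputPath appends, so it can be called once per file.
            FileInputFormat.addInputPath(conf, new Path("/my_hadoop_hdfs/my_folder/file1.txt"));
            FileInputFormat.addInputPath(conf, new Path("/my_hadoop_hdfs/my_folder/file2.txt"));
            FileInputFormat.addInputPath(conf, new Path("/my_hadoop_hdfs/my_folder/file3.txt"));

            // A directory also works as an input path: every file directly
            // inside it becomes input to the job.
            // FileInputFormat.addInputPath(conf, new Path("/my_hadoop_hdfs/my_folder"));

            // The output path is a directory; the job writes part-00000,
            // part-00001, ... into it.
            FileOutputFormat.setOutputPath(conf, new Path("/my_hadoop_hdfs/my_output"));

            JobClient.runJob(conf);
          }
        }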




  • Amogh Vasekar at Mar 23, 2010 at 3:39 pm
    Hi,
    Piggybacking on Gang's reply: to add files/dirs recursively, you can use listStatus and FileStatus to determine whether each entry is a file or a directory, and add it as needed (check the FileStatus API for this). There is a patch which does this for FileInputFormat:

    http://issues.apache.org/jira/browse/MAPREDUCE-1501


    Amogh
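
    Not from the thread or from the MAPREDUCE-1501 patch itself: a sketch of the recursive walk described above, using FileSystem.listStatus and FileStatus with the old mapred API. The helper name addInputPathsRecursively is made up for illustration.

        // Illustrative helper: recursively add every plain file under a
        // directory as an input path for the job.
        import java.io.IOException;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.JobConf;

        public class RecursiveInput {
          public static void addInputPathsRecursively(JobConf conf, Path dir) throws IOException {
            FileSystem fs = dir.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(dir)) {
              if (status.isDir()) {
                // Descend into the sub-directory.
                addInputPathsRecursively(conf, status.getPath());
              } else {
                // Plain file: add it as an input path.
                FileInputFormat.addInputPath(conf, status.getPath());
              }
            }
          }
        }

    For example, calling addInputPathsRecursively(conf, new Path("/my_hadoop_hdfs/my_folder")) would pick up file1.txt, file2.txt, and file3.txt as well as anything nested deeper.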



Discussion Overview
group: common-user
categories: hadoop
posted: Mar 23, '10 at 10:19a
active: Mar 23, '10 at 3:39p
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop
