FAQ
I have an MR job that reads files from Amazon S3 and processes the data on local HDFS. The files are gzipped text files (.gz). I tried to set up the job as below, but it won't work. Does anyone know what might be wrong? Do I need an extra step to unzip the files first? Thanks.


String S3_LOCATION = "s3n://access_key:private_key@bucket_name";

protected void prepareHadoopJob() throws Exception {

    Job job = this.getHadoopJob();

    // Map-only job: read gzipped text from S3 and write Puts into HBase.
    job.setMapperClass(Mapper1.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(S3_LOCATION));

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, myTable.getTableName());
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
}



Dan Yi | Software Engineer, Analytics Engineering
Medio Systems Inc | 701 Pike St. #1500 Seattle, WA 98101
Predictive Analytics for a Connected World


  • Harsh J at Jul 20, 2012 at 1:34 am
    Dan,

    Can you share your error? Plain .gz files (not .tar.gz) are natively
    supported by Hadoop via its GzipCodec, so if you are facing an error, I
    believe it's caused by something other than compression.
    --
    Harsh J
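
    For reference, a minimal sketch of how Hadoop resolves the codec for a .gz path (the bucket and file names here are placeholders). TextInputFormat consults CompressionCodecFactory in the same way, so plain .gz inputs are decompressed transparently and no separate unzip step should be needed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // GzipCodec is registered by default, so a .gz suffix resolves to it.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(new Path("s3n://bucket_name/input/part-0001.gz"));
            // Prints org.apache.hadoop.io.compress.GzipCodec for a plain .gz file.
            System.out.println(codec == null ? "no codec found" : codec.getClass().getName());
        }
    }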
  • Ben Kim at Oct 2, 2012 at 10:06 am
    I'm having a similar issue.

    I'm running a WordCount MR as follows:

    hadoop jar WordCount.jar wordcount.WordCountDriver s3n://bucket/wordcount/input s3n://bucket/wordcount/output

    s3n://bucket/wordcount/input is an S3 object that contains the input files.

    However, I get the following NPE:

    12/10/02 18:56:23 INFO mapred.JobClient: map 0% reduce 0%
    12/10/02 18:56:54 INFO mapred.JobClient: map 50% reduce 0%
    12/10/02 18:56:56 INFO mapred.JobClient: Task Id : attempt_201210021853_0001_m_000001_0, Status : FAILED
    java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
    at java.io.BufferedInputStream.close(BufferedInputStream.java:451)
    at java.io.FilterInputStream.close(FilterInputStream.java:155)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.close(LineRecordReader.java:144)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:497)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

    The MR runs fine if I specify a more specific input path such as
    s3n://bucket/wordcount/input/file.txt.
    What I want is to be able to pass S3 folders as parameters.
    Does anyone know how to do this?

    Best regards,
    Ben Kim

    --

    Benjamin Kim
    benkimkimben at gmail
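
    One way to pass an S3 "folder" is to expand it in the driver and add each file as its own input path. The helper below is a hypothetical sketch (the class and method names are illustrative, not from the thread); it skips sub-directories and the zero-byte directory-marker objects that some S3 tools create, which would otherwise show up as empty splits:

    import java.net.URI;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class S3InputHelper {
        // Add every real file directly under an S3 "folder" as a job input path.
        public static void addS3Folder(Job job, String folder) throws Exception {
            Path dir = new Path(folder);
            FileSystem fs = FileSystem.get(URI.create(folder), job.getConfiguration());
            for (FileStatus status : fs.listStatus(dir)) {
                // Skip sub-directories and zero-byte directory-marker objects.
                if (!status.isDir() && status.getLen() > 0) {
                    FileInputFormat.addInputPath(job, status.getPath());
                }
            }
        }
    }

    In the WordCount driver this would replace the single FileInputFormat.addInputPath call, e.g. S3InputHelper.addS3Folder(job, "s3n://bucket/wordcount/input").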
  • Marcos Ortiz at Oct 2, 2012 at 1:08 pm
    Are you sure you prepared your MR code to work with multiple files?
    This example (WordCount) works with a single input.

    You should take a look at the MultipleInputs API for this.
    Best wishes
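
    A minimal sketch of the MultipleInputs approach, with hypothetical mapper classes standing in for whatever per-source mappers the job actually needs:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class MultiInputSetup {

        // Hypothetical mappers; real ones would parse each source's records.
        static class FirstMapper extends Mapper<LongWritable, Text, Text, LongWritable> { }
        static class SecondMapper extends Mapper<LongWritable, Text, Text, LongWritable> { }

        // Bind each input path to its own InputFormat and Mapper.
        static void configureInputs(Job job) {
            MultipleInputs.addInputPath(job, new Path("s3n://bucket/wordcount/input/part-a"),
                    TextInputFormat.class, FirstMapper.class);
            MultipleInputs.addInputPath(job, new Path("s3n://bucket/wordcount/input/part-b"),
                    TextInputFormat.class, SecondMapper.class);
        }
    }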

    --
    Marcos Ortiz Valmaseda,
    Data Engineer && Senior System Administrator at UCI
    Blog: http://marcosluis2186.posterous.com
    Linkedin: http://www.linkedin.com/in/marcosluis2186
    Twitter: @marcosluis2186




