FAQ
Hi All

Greetings
The WordCount example at
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html works fine
for the following directory structure.

inputdir -> file1
         -> file2
         -> file3

And it does not work for

inputdir -> dir1 -> innerfile1
         -> file1
         -> file2
         -> dir2

For this second scenario we get an error like:
----
branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount toplevel outlevel
08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
----


So, when it encounters an entry that is not a file, it exits after throwing an
IOException. In FileInputFormat.java, I would like to call a recursive procedure
in the following piece of code, so that all the files at the leaf level of the
entire directory structure are included in the paths to be searched. If anyone
has already done this, please help me achieve the same.
------------------------------------------------------------------------------------------
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  Path[] files = listPaths(job);
  long totalSize = 0;                        // compute total size
  for (int i = 0; i < files.length; i++) {   // check we have valid files
    Path file = files[i];
    FileSystem fs = file.getFileSystem(job);
    if (fs.isDirectory(file) || !fs.exists(file)) {
      throw new IOException("Not a file: " + files[i]);
    }
    totalSize += fs.getLength(files[i]);
  }
  ....
------------------------------------------------------
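
For illustration, one way to do this is a small recursive helper that descends
into directories instead of throwing. This is only a minimal sketch against the
0.17-era FileSystem API (the class and method names are hypothetical, and
listStatus()/isDir() may differ slightly between releases):

------------------------------------------------------------------------------------------
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLister {
  // Hypothetical helper: walk the tree rooted at 'path' and collect every
  // plain (leaf-level) file, descending into subdirectories rather than
  // failing on them.
  static void addInputFiles(FileSystem fs, Path path, List<Path> result)
      throws IOException {
    for (FileStatus stat : fs.listStatus(path)) {
      if (stat.isDir()) {
        addInputFiles(fs, stat.getPath(), result); // recurse into subdirectory
      } else {
        result.add(stat.getPath());                // keep the leaf-level file
      }
    }
  }
}
------------------------------------------------------

getSplits() could then call such a helper on each entry returned by
listPaths(job) in place of the "Not a file" check, so files at any depth end
up in the input set.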

Should we reset "mapred.input.dir" to the inner directory and call getInputPaths
recursively?
Please help me to get all the file paths, irrespective of their depth level.

Thank you
Srilatha

  • Latha at Oct 18, 2008 at 7:42 pm
    Apologies for pasting a wrong command. Please find the correct command I
    used:

    ----
    branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount inputdir outdir
    08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
    java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
    ...
    ...
    ----

    And inputdir has two subdirectories, "dir1" and "dir2", and a file "file1".

    My requirement is to run wordcount on all the files in all subdirectories.
    Please suggest an idea.

    Regards,
    Srilatha

    On Sat, Oct 18, 2008 at 6:38 PM, Latha wrote:

    ...
  • Owen O'Malley at Oct 19, 2008 at 3:29 am

    On Oct 18, 2008, at 6:08 AM, Latha wrote:

    And it does not work for
    inputdir -> dir1 -> innerfile1
             -> file1
             -> file2
             -> dir2

    Typically you don't mix files and directories in the same level. The
    easiest way to get the desired result would be to use a pattern to
    list the files and directories to read:

    inputdir/dir*,inputdir/file*

    would glob out to dir1, dir2, file1, and file2. The files would just
    include themselves and the directories would expand one level.

    -- Owen
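
    For reference, the same pattern can also be set programmatically by writing
    the "mapred.input.dir" property mentioned earlier in the thread. A minimal
    sketch, assuming a 0.17-era JobConf (newer releases expose
    FileInputFormat.setInputPaths() for this; the class name GlobInput is
    hypothetical):

    ------------------------------------------------------------------------------------------
    import org.apache.hadoop.mapred.JobConf;

    public class GlobInput {
      public static void main(String[] args) {
        JobConf conf = new JobConf(GlobInput.class);
        // Comma-separated patterns stored in "mapred.input.dir"; each entry is
        // globbed when the job lists its inputs, so dir1 and dir2 expand one
        // level to their contents while file1 and file2 match themselves.
        conf.set("mapred.input.dir", "inputdir/dir*,inputdir/file*");
      }
    }
    ------------------------------------------------------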

Discussion Overview
group: common-dev
categories: hadoop
posted: Oct 18, '08 at 1:09p
active: Oct 19, '08 at 3:29a
posts: 3
users: 2 (Latha: 2 posts, Owen O'Malley: 1 post)
website: hadoop.apache.org...
irc: #hadoop
