Hi,
I am trying to access the distributed cache in my custom output format, but it
does not work: opening the file in the custom output format fails with "file
does not exist", even though the file physically exists.

It looks like the distributed cache only works for Mappers and Reducers?

Is there a way I can read the distributed cache from my custom output format?

Thanks,
-JJ


  • Mapred Learn at Jul 29, 2011 at 5:57 pm
    I'm trying to access a file that I passed with the -files option of my
    hadoop jar command.

    In my output format, I am doing something like:

    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

    String file1 = "";
    String file2 = "";
    Path pt = null;

    for (Path p : cacheFiles) {
        if (p != null) {
            if (p.getName().endsWith(".ryp")) {
                file1 = p.getName();
            } else if (p.getName().endsWith(".cpt")) {
                file2 = p.getName();
                pt = p;
            }
        }
    }

    // Then I read the file, which throws a "file does not exist" exception:

    Path pat = new Path(file2);

    BufferedReader reader = null;
    try {
        FileSystem fs = FileSystem.get(conf);
        reader = new BufferedReader(new InputStreamReader(fs.open(pat)));

        String line = null;
        while ((line = reader.readLine()) != null) {
            System.out.println("Now parsing the line: " + line);
        }
    } catch (Exception e) {
        System.out.println("exception: " + e.getMessage());
    }
    On Fri, Jul 29, 2011 at 10:50 AM, Alejandro Abdelnur wrote:

    Where are you getting the error, in the client submitting the job or in the
    MR tasks?

    Are you trying to access a file or trying to set a JAR in the
    DistributedCache?
    How/when are you adding the file/JAR to the DC?
    How are you retrieving the file/JAR from your outputformat code?

    Thxs.

    Alejandro

    On Fri, Jul 29, 2011 at 10:43 AM, Mapred Learn wrote:

    I am trying to create a custom text output format in which I want to
    access a distributed cache file.


    On Fri, Jul 29, 2011 at 10:42 AM, Harsh J wrote:

    Mapred,

    By outputformat, do you mean the frontend, submit-time run of
    OutputFormat? If so, then no, it cannot access the distributed cache,
    because the cache is not really set up at that point, and the front end
    doesn't really need it when it can access those files directly.

    Could you describe in a bit more depth what you're attempting to do?



    --
    Harsh J
  • Alejandro Abdelnur at Jul 29, 2011 at 6:15 pm
    Mmmh, I've never used the -files option (I don't know whether it copies
    the files to HDFS for you, or whether you have to put them there first).

    My usage pattern with the DC is to copy the files to HDFS, then use the DC
    API to add those files to the job conf.

    Alejandro
  • Mapred Learn at Jul 29, 2011 at 6:18 pm
    When you use the -files option, the file is copied into a .staging
    directory and all mappers can access it; but from the output format, I see
    that it is not accessible.

    -files copies the cache file under:

    /user/<id>/.staging/<job name>/files/<filename>

  • Alejandro Abdelnur at Jul 29, 2011 at 6:23 pm
    So, -files uses the '#' symlink feature, correct?

    If so, from the MR task JVM, doing a new File(<FILENAME>) would work,
    where FILENAME does not include the path of the file, correct?

    Thxs.

    Alejandro
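    [Editor's note: the '#' feature asked about above can be sketched as
    follows — an illustrative guess at the mechanics, not code from the
    thread: the part after '#' in a cache URI names the symlink that the
    framework creates in the task's working directory. Plain-Java sketch, no
    Hadoop classes; the URI and "keys.cpt" are made up:]

    ```java
    import java.net.URI;

    public class FragmentLinkDemo {
        public static void main(String[] args) {
            // The URI fragment (after '#') is assumed to name the symlink
            // created in the task's working directory.
            URI cacheUri = URI.create("hdfs://nn/user/jj/cache/keys.cpt#keys.cpt");
            String linkName = cacheUri.getFragment();
            System.out.println("symlink in task CWD: " + linkName);
        }
    }
    ```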
    On Fri, Jul 29, 2011 at 11:20 AM, Brock Noland wrote:

    With -files the file will be placed in the CWD of the map/reduce tasks.

    You should be able to open the file with FileInputStream.

    Brock
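    [Editor's note: Brock's suggestion can be illustrated with a
    self-contained sketch. The file is created locally here only to stand in
    for a -files entry symlinked into the task's working directory, and
    "lookup.cpt" is a made-up name:]

    ```java
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class CwdReadDemo {
        public static void main(String[] args) throws IOException {
            // Stand-in for a -files entry placed in the task's working dir.
            try (FileWriter w = new FileWriter("lookup.cpt")) {
                w.write("k1=v1\n");
            }
            // Open by bare name with FileInputStream, since the file sits
            // in the current working directory of the task.
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(new FileInputStream("lookup.cpt")))) {
                System.out.println(r.readLine());
            }
            new java.io.File("lookup.cpt").delete(); // clean up the stand-in
        }
    }
    ```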

Discussion Overview
group: mapreduce-user @
categories: hadoop
posted: Jul 29, '11 at 5:28p
active: Jul 29, '11 at 6:23p
posts: 5
users: 2
website: hadoop.apache.org...
irc: #hadoop
