I'd like to use cascalog-lzo in my project, and writing lzo files is no
problem. But reading them back in a subsequent query is a problem,
apparently because I haven't generated index files (see below). Kevin Weil
says on the hadoop-lzo GitHub page that this shouldn't be an issue: "Note
that if you forget to index an .lzo file, the job will work but will
process the entire file in a single split, which will be less efficient".
But I get this exception:

IllegalArgumentException This file system object (hdfs://10.83.57.40:9000)
does not support access to the request path
's3n://mybucket/part-00000.lzo.index' You possibly called
FileSystem.get(conf) when you should have called FileSystem.get(uri, conf)
to obtain a file system supporting your path.
org.apache.hadoop.fs.FileSystem.checkPath (FileSystem.java:372)

All I was trying to do was convert an lzo file into a sequence file:

(let [src (hfs-lzo-textline "s3n://mybucket/lzo")
      out-loc (hfs-seqfile "s3n://mybucket/seq" :sinkmode :replace)]
  (?- out-loc src))

Any thoughts on why this is popping up, or whether there's a way to build
the index files without dealing with Java?

Thanks!
-Robin


  • Robin Kraft at Apr 25, 2012 at 1:29 pm
    I ended up building the indexes using the DistributedLzoIndexer per the
    hadoop-lzo readme.

    hadoop jar /path/to/your/hadoop-lzo.jar \
      com.hadoop.compression.lzo.DistributedLzoIndexer \
      s3n://mybucket/lzo_are_here
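
    In principle the same indexer could also be driven straight from the
    Clojure REPL via interop rather than shelling out to hadoop jar; a
    minimal, untested sketch, assuming the hadoop-lzo jar is on the
    classpath:

    ;; Untested sketch: invoke hadoop-lzo's indexer through its main
    ;; method, the same entry point `hadoop jar` uses.
    (import 'com.hadoop.compression.lzo.DistributedLzoIndexer)

    (DistributedLzoIndexer/main
     (into-array String ["s3n://mybucket/lzo_are_here"]))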

    Now my S3 directory includes part-00000.lzo and part-00000.lzo.index -
    progress! However, as is now obvious to me, this isn't actually the
    solution, since I still see the same error message.

    Do I have to pre-load all of the lzo files and indexes into HDFS instead of
    S3? Even if I have to create the indexes from the command line, it would be
    nice not to have to worry about HDFS vs. S3.

  • Robin Kraft at Apr 25, 2012 at 2:36 pm
    I've ended up pre-loading HDFS with the lzo files using distcp, then
    running my query against the files on HDFS instead of the ones on S3:

    hadoop distcp s3n://mybucket/lzo_files lzo_files

    (let [src (hfs-lzo-textline "lzo_files")
          out-loc (hfs-seqfile "s3n://mybucket/seq_files" :sinkmode :replace)]
      (?- out-loc src))

    This seems to work fine, and the copy to HDFS could be a simple bash script
    in my workflow, but I'm still hoping the error I've seen is just a
    configuration issue. Am I just missing something really simple and obvious?
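
    If it comes to that, the whole round trip could presumably live in one
    Clojure step; a rough, untested sketch that shells out to distcp before
    running the query (assuming cascalog.api and cascalog.lzo are referred
    into the namespace, as above):

    (require '[clojure.java.shell :refer [sh]])

    ;; Untested sketch: copy the lzo files from S3 to HDFS, then run the
    ;; same conversion query against the local copy.
    (defn distcp-then-convert []
      (let [{:keys [exit err]} (sh "hadoop" "distcp"
                                   "s3n://mybucket/lzo_files" "lzo_files")]
        (when-not (zero? exit)
          (throw (ex-info "distcp failed" {:err err})))
        (let [src     (hfs-lzo-textline "lzo_files")
              out-loc (hfs-seqfile "s3n://mybucket/seq_files"
                                   :sinkmode :replace)]
          (?- out-loc src))))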

  • Sam Ritchie at Apr 25, 2012 at 5:35 pm
    Robin, I think the lzo library might be a red herring. I've been running
    into this same issue with ElephantDB and Amazon's new Hadoop version. Does
    the same thing happen when you try to transfer between two sequence files,
    from S3 to HDFS?

    If so, then the issue lies with Cascading. If not, then it probably lies
    with the lzo formats in ElephantBird. It looks like that exception told you
    exactly what's going on:

    "You possibly called FileSystem.get(conf) when you should have called
    FileSystem.get(uri, conf) to obtain a file system supporting your path."
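
    For reference, here's the shape of the two lookups that hint contrasts,
    as a bare interop sketch (illustration only, not your code):

    (import '[org.apache.hadoop.conf Configuration]
            '[org.apache.hadoop.fs FileSystem Path])

    (let [conf (Configuration.)
          p    (Path. "s3n://mybucket/part-00000.lzo.index")]
      ;; Returns the *default* filesystem from the config (hdfs://...),
      ;; which later rejects the s3n:// path in checkPath:
      (FileSystem/get conf)
      ;; Resolves the filesystem that actually owns the path's scheme:
      (FileSystem/get (.toUri p) conf))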

    Can you send the full error, so we can see what class caused the exception?

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Robin Kraft at Apr 25, 2012 at 9:09 pm
    There's no problem with seqfiles (I just double-checked); it's only the lzo-to-seqfile conversion that fails. Here's all I've got from stdout. The only things in the namespace were cascalog.api and cascalog.lzo. None of the tasks fail, so I don't think there's any more info than that, unless there might be something on one of the nodes.


    12/04/25 21:05:51 INFO hadoop.HadoopFlowProcess: attempting to load codec: org.apache.hadoop.io.compress.GzipCodec
    12/04/25 21:05:51 INFO hadoop.HadoopFlowProcess: found codec: org.apache.hadoop.io.compress.GzipCodec
    12/04/25 21:05:51 INFO mapred.FileInputFormat: Total input paths to process : 8
    12/04/25 21:05:51 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
    12/04/25 21:05:51 WARN lzo.LzoCodec: Could not find build properties file with revision hash
    12/04/25 21:05:51 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
    IllegalArgumentException This file system object (hdfs://10.28.26.174:9000) does not support access to the request path 's3n://formaresults/analysis/xyzmonthlylzo/part-00000.lzo.index' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path. org.apache.hadoop.fs.FileSystem.checkPath (FileSystem.java:372)

  • Sam Ritchie at Apr 25, 2012 at 9:20 pm
    Looks like you've found a bug in elephant-bird, the library we use to open
    up LZO files :) Basically, every time elephant-bird makes a call like this:

    https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/mapreduce/input/LzoInputFormat.java#L59

    you're going to get a failure with the new versions of Hadoop used on EMR.
    I'd recommend opening up a ticket on elephant-bird. Until that gets fixed,
    you'll need to run that distcp command to get data over into HDFS. This is
    the pain of having so many Hadoop forks :-/
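
    The usual shape of the fix (sketched here in Clojure interop terms,
    though elephant-bird itself is Java) is to ask the Path for its own
    filesystem instead of grabbing the configured default:

    (import '[org.apache.hadoop.conf Configuration]
            '[org.apache.hadoop.fs FileSystem Path])

    ;; Sketch only: equivalent to FileSystem.get(path.toUri(), conf), so
    ;; the returned filesystem matches the path's scheme (hdfs://, s3n://,
    ;; file://, ...).
    (defn fs-for ^FileSystem [^Path path ^Configuration conf]
      (.getFileSystem path conf))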

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
