On Apr 25, 2012, at 1:35 PM, Sam Ritchie wrote:
Robin, I think the lzo library might be a red herring. I've been running into this same issue with ElephantDB and Amazon's new Hadoop version. Does the same thing happen when you try to transfer between two sequence files, from S3 to HDFS?
If so, then the issue lies with Cascading. If not, then it probably lies with the lzo formats in ElephantBird. It looks like that exception told you exactly what's going on:
"You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path."
Can you send the full error, so we can see what class caused the exception?
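To make the diagnosis concrete: Hadoop's FileSystem.checkPath rejects any path whose scheme doesn't match the file system instance you're holding, which is exactly what happens when the default (HDFS) file system from FileSystem.get(conf) is handed an s3n:// path. Here's a self-contained sketch of that scheme check, using plain java.net.URI rather than the real Hadoop classes (CheckPathSketch and its checkPath method are illustrative names, not Hadoop's actual implementation):

```java
import java.net.URI;

public class CheckPathSketch {
    // Simplified stand-in for Hadoop's FileSystem.checkPath: a file system
    // bound to one URI scheme refuses paths with a different scheme.
    static void checkPath(URI fsUri, URI pathUri) {
        String scheme = pathUri.getScheme();
        if (scheme != null && !scheme.equals(fsUri.getScheme())) {
            throw new IllegalArgumentException(
                "This file system object (" + fsUri
                + ") does not support access to the request path '" + pathUri + "'");
        }
    }

    public static void main(String[] args) {
        // FileSystem.get(conf) hands back the *default* file system -- here HDFS.
        URI defaultFs = URI.create("hdfs://10.83.57.40:9000");
        // ...but the index lookup then asks that HDFS object for an S3 path.
        URI indexPath = URI.create("s3n://mybucket/part-00000.lzo.index");
        try {
            checkPath(defaultFs, indexPath);
        } catch (IllegalArgumentException e) {
            System.out.println("mismatch: " + e.getMessage());
        }
        // FileSystem.get(uri, conf) (or path.getFileSystem(conf)) would instead
        // resolve a file system matching the path's own scheme, avoiding the check.
    }
}
```

So the question is which class in the stack is calling FileSystem.get(conf) while holding an S3 path; the full stack trace should show whether that's Cascading or the lzo input format.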
On Wed, Apr 25, 2012 at 7:36 AM, Robin Kraft wrote:
I've ended up pre-loading HDFS with the lzo files using distcp, then running my query using the files on HDFS instead of the ones on S3:
hadoop distcp s3n://mybucket/lzo_files lzo_files
(let [src (hfs-lzo-textline "lzo_files")
      out-loc (hfs-seqfile "s3n://mybucket/seq_files" :sinkmode :replace)]
  (?- out-loc src))
This seems to work fine, and the copy to HDFS could be a simple bash script in my workflow, but I'm still hoping the error I've seen is just a configuration issue. Am I just missing something really simple and obvious?
On Wednesday, April 25, 2012 9:29:30 AM UTC-4, Robin Kraft wrote:
I ended up building the indexes using the DistributedLzoIndexer per the hadoop-lzo readme.
hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer s3n://mybucket/lzo_are_here
Now my S3 directory includes part-00000.lzo and part-00000.lzo.index - progress! However, as is now obvious to me, this isn't actually the solution, since I still see the same error message.
Do I have to pre-load all of the lzo files and indexes into HDFS instead of S3? Even if I have to create the indexes from the command line, it would be nice not to have to worry about HDFS vs. S3.
On Tuesday, April 24, 2012 11:00:01 PM UTC-4, Robin Kraft wrote:
I'd like to use cascalog-lzo in my project, and writing lzo files is no problem. But reading them back in a subsequent query is a problem, apparently because I haven't generated index files (see below). Kevin Weil says on the hadoop-lzo GitHub page that this shouldn't be an issue: "Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient". But I get this exception:
IllegalArgumentException This file system object (hdfs://10.83.57.40:9000) does not support access to the request path 's3n://mybucket/part-00000.lzo.index' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path. org.apache.hadoop.fs.FileSystem.checkPath (FileSystem.java:372)
All I was trying to do was convert an lzo file into a sequence file:
(let [src (hfs-lzo-textline "s3n://mybucket/lzo")
      out-loc (hfs-seqfile "s3n://mybucket/seq" :sinkmode :replace)]
  (?- out-loc src))
Any thoughts on why this is popping up, or whether there's a way to build the index files without dealing with Java?
Sam Ritchie, Twitter Inc
(Too brief? Here's why! http://emailcharter.org)