Compression using Hadoop...
Hello All:

I think I must be missing something fundamental. Is it possible to load compressed data into HDFS, and then operate on it directly with map/reduce? I see a lot of stuff in the docs about writing compressed outputs, but nothing about reading compressed inputs.

Am I being ponderously stupid here?

Any help/comments appreciated...

Thanks,
C G



  • Jason Gessner at Aug 30, 2007 at 3:47 pm
    If you put .gz files up on your HDFS cluster, you don't need to do
    anything special to read them. I see lots of extra control available via
    the API, but I have simply put the files up and run my jobs on them.

    -jason
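
    A minimal driver sketch of what that looks like (the paths, job name, and the use of IdentityMapper/IdentityReducer are purely illustrative, and the method names follow the old org.apache.hadoop.mapred API, which shifted a bit between releases). TextInputFormat matches the .gz suffix against the registered codecs, so the lines arrive at the mapper already decompressed:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.TextInputFormat;
        import org.apache.hadoop.mapred.TextOutputFormat;
        import org.apache.hadoop.mapred.lib.IdentityMapper;
        import org.apache.hadoop.mapred.lib.IdentityReducer;

        public class ReadGzipInput {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ReadGzipInput.class);
            conf.setJobName("read-gzip-input");

            // TextInputFormat asks the codec factory about each file's suffix and
            // decompresses .gz inputs transparently before map() sees them.
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(conf, new Path("/user/data/input"));   // directory of .gz files
            FileOutputFormat.setOutputPath(conf, new Path("/user/data/output"));

            JobClient.runJob(conf);
          }
        }
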
  • Ted Dunning at Aug 30, 2007 at 8:36 pm
    With gzipped files, you do face the problem that your parallelism in the map
    phase is pretty much limited to the number of files you have (because
    gzipped files aren't splittable). This is often not a problem, since most
    people can arrange to have dozens to hundreds of input files more easily
    than they can arrange to have dozens to hundreds of CPU cores working on
    their data.
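
    A quick way to sanity-check that ceiling is to count the .gz inputs and compare the number against the cores you have. A minimal sketch (the input directory is taken as an argument; FileStatus.isDir() is the older name for the directory check):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class GzipSplitCeiling {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Each .gz file becomes exactly one (unsplittable) input split,
            // so the file count is the ceiling on map-phase parallelism.
            int gzFiles = 0;
            for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
              if (!stat.isDir() && stat.getPath().getName().endsWith(".gz")) {
                gzFiles++;
              }
            }
            System.out.println(gzFiles + " gzip inputs -> at most "
                + gzFiles + " concurrent map tasks");
          }
        }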

  • C G at Aug 31, 2007 at 2:41 pm
    Thanks Ted and Jason for your comments. Ted, your comment about gzip not being splittable was very timely...I'm watching my 8-node cluster saturate one node (with one gz file) and was wondering why. Thanks for the "answer in advance" :-).

  • Jason Gessner at Aug 31, 2007 at 4:39 pm
    Ted, will the gzip files be a non-issue as far as splitting goes if
    they are under the default block size?

    C G, glad I could help a little.

    -jason
  • Ted Dunning at Aug 31, 2007 at 5:24 pm
    They will only be a non-issue if you have enough of them to get the parallelism you want. If the number of gzip files is more than about 10x the number of task nodes, you should be fine.


  • Arun C Murthy at Aug 31, 2007 at 5:29 pm

    One way to reap the benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile

    Of course, this means you will have to convert from .gz to a .seq file and load it onto HDFS for your job, which should be a fairly simple piece of code.

    Arun
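
    A minimal sketch of that conversion (the argument handling and the choice of LongWritable line numbers as keys are mine; block compression uses whatever default codec the cluster is configured with):

        import java.io.BufferedReader;
        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.util.zip.GZIPInputStream;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class GzipToSequenceFile {
          // args[0] = local .gz text file, args[1] = HDFS output path for the .seq file
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]),
                LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);   // block-compressed and splittable
            BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0]))));
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
              writer.append(new LongWritable(lineNo++), new Text(line));
            }
            in.close();
            writer.close();
          }
        }
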
  • Doug Cutting at Aug 31, 2007 at 5:43 pm

    We really need someone to contribute an InputFormat for bzip files.
    This has come up before: bzip is a standard compression format that is
    splittable.

    Another InputFormat that would be handy is zip. Zip archives, unlike
    tar files, can be split by reading the table of contents. So one could
    package a bunch of tiny files as a zip file, then the input format could
    split the zip file into splits that each contain a number of files
    inside the zip. Each map task would then have to read the table of
    contents from the file, but could then seek directly to the files in its
    split without scanning the entire file.

    Should we file jira issues for these? Any volunteers who're interested
    in implementing these?

    Doug
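
    The "read the table of contents, then seek" part is already visible with the stock JDK classes; a hypothetical ZipInputFormat would do essentially this, handing groups of entries to splits (the class name and paths here are illustrative):

        import java.util.Enumeration;
        import java.util.zip.ZipEntry;
        import java.util.zip.ZipFile;

        public class ZipTableOfContents {
          public static void main(String[] args) throws Exception {
            // args[0] = a local zip archive; ZipFile reads the central directory
            // (the table of contents) without scanning the whole file.
            ZipFile zip = new ZipFile(args[0]);
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
              ZipEntry e = entries.nextElement();
              System.out.println(e.getName() + "\t" + e.getSize() + " bytes");
              // A map task that owned this entry would open just its own stream:
              // InputStream in = zip.getInputStream(e);
            }
            zip.close();
          }
        }
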
  • Milind Bhandarkar at Aug 31, 2007 at 6:54 pm

    On 8/31/07 10:43 AM, "Doug Cutting" wrote:
    We really need someone to contribute an InputFormat for bzip files.
    This has come up before: bzip is a standard compression format that is
    splittable.
    +1


    - milind
    --
    Milind Bhandarkar
    408-349-2136
    (milindb@yahoo-inc.com)
  • Ted Dunning at Aug 31, 2007 at 9:32 pm
    I was just looking at this. It looks relatively easy to simply create a new codec and register it in the config files.
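
    For the registration step, the list of codec classes that CompressionCodecFactory consults is the io.compression.codecs property, which can be set in the config files or programmatically. A minimal sketch; com.example.BzipCodec is a hypothetical codec class that would have to exist on the classpath:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionCodecFactory;

        public class RegisterCodec {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Keep the stock codecs and append the new one; the same value can
            // live in the site configuration file instead of being set in code.
            conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.DefaultCodec,"
                + "org.apache.hadoop.io.compress.GzipCodec,"
                + "com.example.BzipCodec");               // hypothetical codec class
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            // Assuming the hypothetical codec reports ".bz2" as its default extension:
            CompressionCodec codec = factory.getCodec(new Path("data.bz2"));
            System.out.println(codec == null ? "no codec matched"
                : "matched " + codec.getClass().getName());
          }
        }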

    I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel. I needed to add 10 source roots in IntelliJ to get a clean compile. In this process, I noticed some circular dependencies.

    Would the committers be open to some small set of changes to remove cyclic dependencies?


  • Doug Cutting at Sep 4, 2007 at 5:24 pm

    Ted Dunning wrote:
    Would the committers be open to some small set of changes to remove cyclic dependencies?

    It doesn't hurt to propose them. Even if they get shot down, we'll gain
    a log of the rationale. There are great benefits to clean layers, but
    we also can't gratuitously break source-code compatibility.

    Doug
  • Arun C Murthy at Sep 1, 2007 at 5:46 am

    On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote:
    Should we file jira issues for these? Any volunteers who're interested
    in implementing these?

    Please file the bzip and zip issues, Doug. I'll try to get to them in the short term unless someone is more interested and wants to scratch that itch right away.

    thanks,
    Arun
  • C G at Aug 31, 2007 at 6:22 pm
    My input is typical row-based stuff over which I run a large stack of aggregations/rollups. After reading earlier posts on this thread, I modified my loader to split the input up into 1M-row partitions (literally gunzip -cd input.gz | split...). I then ran an experiment using 50M rows (i.e. 50 gz files loaded into HDFS) on an 8-node cluster. Ted, from what you are saying I should be using at least 80 files given the cluster size, and I should modify the loader to be aware of the number of nodes and split accordingly. Do you concur?

    Load time to HDFS may be the next challenge. My HDFS configuration on 8 nodes uses a replication factor of 3. Sequentially copying my data to HDFS using -copyFromLocal took 23 minutes to move 266 MB in individual files of 5.7 MB each. Does anybody find this result surprising? Note that this is on EC2, where there is no such thing as rack-level or switch-level locality. Should I expect dramatically better performance on real iron? Once I get this prototyping/education under my belt, my plan is to deploy a 64-node grid of 4-way machines with a terabyte of local storage on each node.

    Thanks for the discussion...the Hadoop community is very helpful!

    C G


  • Ted Dunning at Aug 31, 2007 at 6:42 pm
    My 10x was very rough.

    I based it on:

    a) you want a few files per map task
    b) you want a map task per core

    I tend to use quad-core machines, so a couple of files per core per node works out to roughly 10 files per node.

    On EC2, you don't have multi-core machines (I think), so you might be fine with 2-4 files per CPU.


  • Joydeep Sen Sarma at Aug 31, 2007 at 8:08 pm
    One thing I did to speed up copy/put times was to write a simple
    map-reduce job that does parallel copies of files from an input directory
    (in our case the input directory is NFS-mounted on all the task nodes). It
    gives us a huge speed boost.

    It's trivial to roll your own, but I'd be happy to share as well.
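
    A minimal sketch of such a mapper, against the old org.apache.hadoop.mapred API as it looks in later releases (the class name, destination path, and the idea of feeding it a text file of source paths, split so each map task gets a handful of them, are all illustrative):

        import java.io.IOException;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class ParallelCopyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

          private JobConf conf;

          public void configure(JobConf job) {
            this.conf = job;
          }

          // Each input line is an NFS-mounted source path; copy it into HDFS.
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, Text> out, Reporter reporter)
              throws IOException {
            Path src = new Path(value.toString());        // e.g. /mnt/nfs/data/part-0001.gz
            Path dst = new Path("/user/data/incoming", src.getName());
            FileSystem localFs = FileSystem.getLocal(conf);
            FileSystem hdfs = FileSystem.get(conf);
            FileUtil.copy(localFs, src, hdfs, dst, false, conf);  // false = keep the source
            out.collect(value, new Text("copied to " + dst));
          }
        }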


