Dear Hadoop users,

Sorry for the off-topic post. We're slowly migrating our Hadoop cluster to EC2,
and one thing I'm trying to explore is whether we can use alternative
scheduling systems like SGE with a shared FS for non-data-intensive tasks,
since they are easier for lay users to work with.

One problem for now is how to create a shared cluster filesystem similar to
HDFS: distributed, high-performance, and somewhat POSIX-compliant (symlinks
and permissions), built on Amazon EC2's local non-persistent storage.

The idea is to keep the original data on S3, then as needed fire up a bunch
of nodes, start the shared filesystem, quickly copy data from S3 to that FS,
run the analysis with SGE, save the results, and shut the filesystem down.
I tried S3FS and similar native S3 implementations, but the speed is far too
poor. Currently I just have a filesystem on my master node that is shared via
NFS to all the rest, but I pretty much saturate the 1 Gbit/s link as soon as
I start more than 10 nodes.
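The lifecycle described above could be sketched as a small driver script. This is a hypothetical outline: the bucket name, mount point, and the `start-shared-fs`/`stop-shared-fs` helpers are placeholders standing in for whichever filesystem is chosen, not real tools; `s3cmd` and `qsub` are the usual S3 and SGE clients.

```shell
#!/bin/sh
# Hypothetical lifecycle driver: stage data in from S3, run SGE jobs on a
# scratch cluster filesystem, stage results out, tear everything down.
set -e

start-shared-fs /mnt/scratch             # placeholder: bring up the shared FS

# Stage inputs from S3 onto the shared FS (s3cmd is one common client)
s3cmd sync s3://my-bucket/input/ /mnt/scratch/input/

# Run the analysis under SGE and wait for it to finish
qsub -sync y run_analysis.sh

# Save results back to S3, then drop the scratch FS
s3cmd sync /mnt/scratch/output/ s3://my-bucket/output/
stop-shared-fs /mnt/scratch              # placeholder teardown
```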

Thank you. I'd appreciate any suggestions and links to relevant resources!


Dmitry


  • Robert Evans at Aug 31, 2011 at 3:23 pm
    Dmitry,

    It sounds like an interesting idea, but I have not really heard of anyone doing it before. It would make for a good feature to have tiered file systems all mapped into the same namespace, but that would be a lot of work and complexity.

The quick solution would be to know what data you want to process beforehand and then run distcp to copy it from S3 into HDFS before launching the other map/reduce jobs. I don't think there is anything automatic out there.

    --Bobby Evans
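The distcp approach sketched above looks roughly like this; the bucket name and HDFS paths are placeholders, and `s3n://` is the scheme for Hadoop's native S3 filesystem in this era.

```shell
# Stage inputs from S3 into HDFS before submitting the MapReduce jobs
hadoop distcp s3n://my-bucket/input hdfs:///user/dmitry/input

# ...and copy the results back out when the jobs finish
hadoop distcp hdfs:///user/dmitry/output s3n://my-bucket/output
```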

  • Tom White at Aug 31, 2011 at 3:51 pm
    You might consider Apache Whirr (http://whirr.apache.org/) for
    bringing up Hadoop clusters on EC2.

    Cheers,
    Tom
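For reference, a minimal Whirr recipe for an EC2 Hadoop cluster looks something like the following; the cluster name and instance counts are illustrative, and the Whirr quick start documents the full property list.

```shell
# hadoop.properties -- cluster spec consumed by `whirr launch-cluster`
cat > hadoop.properties <<'EOF'
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,9 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
EOF

whirr launch-cluster --config hadoop.properties    # bring the cluster up
# ...run jobs...
whirr destroy-cluster --config hadoop.properties   # tear it down again
```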
  • Dmitry Pushkarev at Aug 31, 2011 at 7:03 pm
Thank you for your suggestion. I have years of experience with HDFS, and if I
could I'd gladly use it as the filesystem; however, our developers require
fast random access and symlinks, so I was wondering if there are other
options. I'm not too sensitive to data locality, but we need to eliminate
network bottlenecks in the FS. Has anyone tried filesystems like GlusterFS,
GFS, and so on?

Thanks!
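For what it's worth, a GlusterFS scratch volume over the instances' ephemeral disks can be assembled along these lines; hostnames, brick paths, and the mount point are placeholders, and this says nothing about how it performs on EC2.

```shell
# On one server node: pool the peers, then build a distributed volume
# from each node's local ephemeral disk.
gluster peer probe node2
gluster peer probe node3
gluster volume create scratch transport tcp \
    node1:/mnt/ephemeral/brick \
    node2:/mnt/ephemeral/brick \
    node3:/mnt/ephemeral/brick
gluster volume start scratch

# On every client: mount the volume with the GlusterFS FUSE client
mount -t glusterfs node1:/scratch /mnt/scratch
```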

Discussion Overview
group: common-user
categories: hadoop
posted: Aug 29, '11 at 9:57p
active: Aug 31, '11 at 7:03p
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
