Hi,

I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great
but I want to automate it a bit more.

I want to be able to:
- start cluster
- copy data from S3 to the DFS
- run the job
- copy result data from DFS to S3
- verify it all copied OK
- shut down the cluster.


I guess the hardest part is reliably detecting when a job is complete. I've
seen solutions that provide a time-based shutdown, but they are not suitable
as our jobs vary in duration.

Has anyone made a script that does this already? I'm using the Cloudera
python scripts to start/terminate my cluster.

Thanks,
John


  • Edmund Kohlwey at Nov 10, 2009 at 2:37 pm
    You should be able to detect the status of the job in your Java main()
    method: either call job.waitForCompletion() and, when the job finishes
    running, check job.isSuccessful(), or write a custom "watcher" thread to
    poll the job status manually; that would let you, for instance, launch
    several jobs and wait for them all to return. Either way you will be
    polling the JobTracker, but I think the overhead is pretty minimal.
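
    For what it's worth, here's a minimal sketch of a driver along those lines
    against the 0.18-era "mapred" API (the class name, paths, and the elided
    mapper/reducer setup are placeholders, not anything from this thread). A
    wrapper script can then check the exit status before copying results off
    the cluster and shutting it down:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.RunningJob;

        public class JobRunner {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JobRunner.class);
            conf.setJobName("my-job");                      // placeholder name
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // ... set mapper, reducer and key/value classes here ...

            RunningJob job = new JobClient(conf).submitJob(conf);
            job.waitForCompletion();                        // blocks, polling the JobTracker
            if (!job.isSuccessful()) {
              System.err.println("Job " + job.getJobID() + " failed");
              System.exit(1);                               // non-zero exit for the calling script
            }
          }
        }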

    I'm not sure it's necessary to copy data from S3 to DFS, by the way (unless
    you have a performance reason to do so... even then, since you're not
    really guaranteed much locality on EC2, you probably won't see a huge
    difference). You could probably just set the default file system to S3.
    See http://wiki.apache.org/hadoop/AmazonS3 .
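
    For what it's worth, one option on 0.18.x is the native S3 filesystem
    (s3n:// URIs), which reads and writes regular S3 objects. A rough sketch of
    the relevant settings, building on the driver sketch above (the bucket name
    and credential values are placeholders; the same properties can also live
    in hadoop-site.xml, as the wiki page describes):

        // Sketch only: make S3 the default filesystem so job paths resolve
        // against the bucket instead of HDFS. All values are placeholders.
        conf.set("fs.default.name", "s3n://my-bucket");
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
        FileInputFormat.setInputPaths(conf, new Path("s3n://my-bucket/input"));
        FileOutputFormat.setOutputPath(conf, new Path("s3n://my-bucket/output"));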

  • John Clarke at Nov 11, 2009 at 4:34 pm
    Hi Edmund,

    I'll look into what you suggested. Yes, I'm aware it's possible to use S3
    directly, but I had problems getting it working - I must try again.

    cheers
    John


  • Hitchcock, Andrew at Nov 10, 2009 at 9:24 pm
    Hi John,

    Have you considered Amazon Elastic MapReduce? (Disclaimer: I work on Elastic MapReduce)

    http://aws.amazon.com/elasticmapreduce/

    It waits for your job to finish and then automatically shuts down the cluster. With a simple command like:

    elastic-mapreduce --create --num-instances 10 --jar s3://mybucket/my.jar --args s3://mybucket/input/,s3://mybucket/output/

    It will automatically create a cluster, run your jar, and then shut everything down. Elastic MapReduce costs a little bit more than just plain EC2, but if it prevents your cluster from running longer than necessary, you might save money.

    Andrew


  • John Clarke at Nov 11, 2009 at 4:35 pm
    I've never used Amazon Elastic MapReduce as we are trying to minimise
    costs, but if I can't find a good way to solve my problem then I might
    reconsider.

    cheers,
    John




Discussion Overview
group: common-user
categories: hadoop
posted: Nov 10, '09 at 2:13p
active: Nov 11, '09 at 4:35p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
