Hi John,
Have you considered Amazon Elastic MapReduce? (Disclaimer: I work on Elastic MapReduce)
http://aws.amazon.com/elasticmapreduce/

It waits for your job to finish and then automatically shuts down the cluster. With a simple command like:
elastic-mapreduce --create --num-instances 10 --jar s3://mybucket/my.jar --args s3://mybucket/input/,s3://mybucket/output/
it will automatically create a cluster, run your jar, and then shut everything down. Elastic MapReduce costs a little more than plain EC2, but if it keeps your cluster from running longer than necessary, you might actually save money.
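From memory, you can also check on a job flow while it is running; the exact flags may differ by CLI version, but something like:

elastic-mapreduce --list --active                        # job flows still running
elastic-mapreduce --describe --jobflow j-XXXXXXXXXXXXX   # detailed state for one job flow (j-... is a placeholder ID)

Once the job flow reaches COMPLETED (or FAILED), the instances are already shutting down.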
Andrew
On 11/10/09 6:13 AM, "John Clarke" wrote:
Hi,
I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great,
but I want to automate it a bit more.
I want to be able to do the following (rough sketch after the list):
- start cluster
- copy data from S3 to the DFS
- run the job
- copy result data from DFS to S3
- verify it all copied OK
- shut down the cluster.
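Something like this untested sketch is what I have in mind (cluster name,
bucket, and paths are placeholders, and I'm assuming the hadoop steps run on
the master, e.g. pushed over ssh):

hadoop-ec2 launch-cluster my-cluster 10                         # Cloudera python scripts

# on the master:
hadoop distcp s3n://mybucket/input hdfs:///user/john/input      # S3 -> DFS
hadoop jar my.jar /user/john/input /user/john/output || exit 1  # blocks until done; non-zero exit on failure
hadoop distcp hdfs:///user/john/output s3n://mybucket/output    # DFS -> S3
hadoop fs -dus /user/john/output                                # crude verification:
hadoop fs -dus s3n://mybucket/output                            # compare the byte counts

# back on my machine:
hadoop-ec2 terminate-cluster my-cluster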
I guess the hardest part is reliably detecting when a job is complete. I've
seen solutions that use a time-based shutdown, but they are not suitable as
our jobs vary in duration.
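The best fallback I can think of, if the job isn't launched synchronously, is
polling the JobTracker until nothing is running, roughly like this (untested;
I'm assuming "hadoop job -list" in 0.18.3 still prints an "N jobs currently
running" header line):

# wait until the JobTracker reports zero running jobs before copying out and shutting down
while [ "$(hadoop job -list | head -1 | awk '{print $1}')" != "0" ]; do
  sleep 60
done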
Has anyone made a script that does this already? I'm using the Cloudera
python scripts to start/terminate my cluster.
Thanks,
John