On Tue, Nov 2, 2010 at 7:59 AM, 蔡超 wrote:
hi,
I'm a newbie to Hadoop. I want to employ Hadoop to analyze 10 million pieces
of data (10 GB in total). The data may reside in a relational DB, or in a
series of XML files (thousands of pieces per file). I have some questions.
1. How to guarantee mappers' exclusive access to the DB?
I'd suggest moving files to HDFS with HIHO/Sqoop projects and such. Or
use DBInputFormat, as detailed at this Cloudera weblog post:
http://www.cloudera.com/blog/2009/03/database-access-with-hadoop/
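A rough, untested sketch with the 0.20 mapred API (the "records" table, its
"id"/"payload" columns, and the RecordRow/MyJob class names are placeholders
for your own schema):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// One row of the (hypothetical) "records" table.
public class RecordRow implements Writable, DBWritable {
  long id;
  String payload;

  // Filled by DBInputFormat from the columns named in setInput() below.
  public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong(1);
    payload = rs.getString(2);
  }
  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setString(2, payload);
  }
  // Plain Hadoop serialization of the same fields.
  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    payload = in.readUTF();
  }
  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeUTF(payload);
  }
}

// In the driver:
JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost/mydb", "user", "password");
// null conditions = read the whole table, ordered and split on "id".
DBInputFormat.setInput(conf, RecordRow.class, "records",
    null, "id", new String[] { "id", "payload" });

Each split then issues its own bounded SELECT, so the mappers read disjoint
slices of the table instead of contending for exclusive access.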
2. How to split XML files? By overriding MultiFileInputFormat?
Several people have already discussed how to process XML documents with
Hadoop. A simple web search would yield several results on the same, apart
from the archives of the mapreduce-user mailing list.
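For one concrete option: the streaming contrib jar ships a
StreamXmlRecordReader that hands everything between a begin tag and an end
tag to the map as a single record. An untested sketch (the <piece> element
name is a placeholder, and the hadoop-streaming jar must be on the job's
classpath):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

JobConf conf = new JobConf(MyXmlJob.class);
// Everything between these two tags becomes one map input record.
conf.set("stream.recordreader.class",
    "org.apache.hadoop.streaming.StreamXmlRecordReader");
conf.set("stream.recordreader.begin", "<piece>");
conf.set("stream.recordreader.end", "</piece>");
conf.setInputFormat(StreamInputFormat.class);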
3. How to transfer a bunch of resources (10M) to slaves?
Use DistributedCache: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html
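Roughly like this (the HDFS path is just an example; the file must already
be on HDFS):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// In the driver: ship an HDFS file to every task's local disk.
JobConf conf = new JobConf(MyJob.class);
DistributedCache.addCacheFile(new URI("/user/chao/lookup.dat"), conf);

// In the Mapper's configure(JobConf job), read the local copies back
// (getLocalCacheFiles throws IOException):
Path[] localFiles = DistributedCache.getLocalCacheFiles(job);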
4. Reduce is not necessary; is it still suitable for Hadoop?
Yes, you can use plain Mapper-only jobs; there is no problem. You just
need to set the number of reducers to 0 in each job.
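For example (MyJob/MyMapper are placeholder names):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(MyMapper.class);
// Zero reducers: map output goes straight to the OutputFormat on HDFS,
// skipping the sort/shuffle phase entirely.
conf.setNumReduceTasks(0);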
I can't find a similar case in the built-in examples of the Hadoop release.
Look around the web for projects that _use_ Hadoop, like Mahout for
example; they have many more examples.
Sorry to interrupt.
Chao Cai
HTH
--
Harsh J
www.harshj.com