I have an architectural question.
For my app I have 50 GB of persistent data stored in HDFS as a simple
CSV file.
The app runs over this 50 GB of data and also takes 10 GB of input data,
which is in CSV format as well.
The persistent data and the input data don't have common keys.
My cluster has 5 data nodes.
The app does a simple match of every line of the input data against every
line of the persistent data.
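
To make the shape of the work concrete, here is a minimal sketch of such an all-pairs match in plain Python, outside Hadoop. The match rule itself is not stated in this post, so `shares_a_field` is a purely hypothetical stand-in, and `cross_match` is an illustrative name rather than anything from the app. The sketch also holds the input rows in memory, which would not be realistic for 10 GB.

```python
import csv

def cross_match(persistent_lines, input_lines, is_match):
    """Yield every (persistent_row, input_row) pair accepted by is_match."""
    # Toy assumption: the input rows fit in memory. A real job would
    # re-read the input file from local disk for each persistent row.
    input_rows = list(csv.reader(input_lines))
    for persistent_row in csv.reader(persistent_lines):
        for input_row in input_rows:
            if is_match(persistent_row, input_row):
                yield persistent_row, input_row

def shares_a_field(persistent_row, input_row):
    # Hypothetical rule: the two rows have at least one identical field.
    return bool(set(persistent_row) & set(input_row))

persistent = ["a,1,x", "b,2,y"]   # stands in for the 50 GB file
inputs = ["q,x", "r,z", "2,s"]    # stands in for the 10 GB file
pairs = list(cross_match(persistent, inputs, shares_a_field))
```

Because there are no common keys, nothing lets the job skip pairs up front: the number of comparisons is the product of the two line counts.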
I see two different approaches to solving this task:
1. Distribute the input file to every node using the -files option and run
a single job. But in this case every map task will go through the full 10 GB
of input data.
2. Divide the input file (10 GB) into 5 parts (for instance) and run 5
independent jobs (one per data node, for instance), giving each job 2 GB of
the input data. In this case every map task goes through only 2 GB of data.
In other words, I give every map node its own input data. The drawback of this
approach is the extra work I have to do before the jobs start and after they
finish.
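
A rough back-of-the-envelope comparison of the two approaches can be sketched as follows. It rests on assumptions that are not part of the setup described here: a 64 MB input split, one map task per split of the persistent data, and, for the second approach, each of the 5 jobs running over the whole 50 GB. The function name `side_data_scanned` is made up for this illustration.

```python
GB = 1024  # all sizes below are in MB

def side_data_scanned(persistent_mb, side_mb, split_mb, jobs=1):
    """Return (map_tasks, side_mb_per_task, total_side_mb_scanned)."""
    # Assumption: one map task per input split of the persistent data.
    tasks_per_job = -(-persistent_mb // split_mb)  # ceiling division
    # Each job only carries its own share of the side (input) file.
    per_task = side_mb / jobs
    total = jobs * tasks_per_job * per_task
    return jobs * tasks_per_job, per_task, total

# Approach 1: one job, every map task scans the full 10 GB input file.
approach_1 = side_data_scanned(50 * GB, 10 * GB, split_mb=64)
# Approach 2: five jobs, every map task scans a 2 GB part.
approach_2 = side_data_scanned(50 * GB, 10 * GB, split_mb=64, jobs=5)

print(approach_1)
print(approach_2)
```

Under these assumptions the second approach makes each map task lighter but needs five times as many map tasks, so the total amount of input data scanned comes out the same in both cases.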
Or maybe there is a more subtle way to do this in Hadoop?
Sent from the Hadoop core-user mailing list archive at Nabble.com.