Before anyone can offer useful suggestions, you will have to clarify a few core things:

1. do you know whether patterns exist in this data?
2. how will the data be read, and by what?
3. is there a hot subset of the data, for both reads and writes?
4. what makes you think HDFS is a good option?
5. how much do you intend to pay per TB?
6. say you do build the system - how do you plan to keep the lights on (operations, power, maintenance)?
7. forgot to ask - is the data textual or binary?

Those are just the basic questions. Will you be building and running the system all by yourself?

On Nov 28, 2012, at 14:09, Mohammad Tariq wrote:

Hello list,

Although a lot of similar discussions have taken place here, I still seek some of your able guidance. Until now I have worked only on small and mid-sized clusters, but this time the situation is a bit different. I have to collect a large amount of legacy data, accumulated over the last few decades. This data sits on tape drives, and I have to pull it from there and store it in my cluster. The size could go somewhere near 24 petabytes (inclusive of replication).
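For a sense of scale, here is a rough back-of-envelope node-count estimate. The hardware figures (12 x 4 TB disks per DataNode, ~25% of disk reserved for non-HDFS use) are illustrative assumptions, not recommendations; only the 24 PB figure and HDFS's default replication factor of 3 come from the thread:

```python
# Back-of-envelope DataNode count for ~24 PB of raw (post-replication) data.
# All hardware figures below are illustrative assumptions, not recommendations.

RAW_PB = 24          # total footprint including replication (from the post)
REPLICATION = 3      # HDFS default replication factor
DISKS_PER_NODE = 12  # assumed JBOD layout
DISK_TB = 4          # assumed disk size
HDFS_FRACTION = 0.75 # assume ~25% reserved for OS, logs, shuffle space

logical_pb = RAW_PB / REPLICATION                        # unique data, ~8 PB
usable_tb_per_node = DISKS_PER_NODE * DISK_TB * HDFS_FRACTION
nodes = (RAW_PB * 1024) / usable_tb_per_node             # raw PB -> TB

print(f"unique data: {logical_pb:.0f} PB")               # -> 8 PB
print(f"usable per node: {usable_tb_per_node:.0f} TB")   # -> 36 TB
print(f"DataNodes needed: {nodes:.0f}")                  # -> 683
```

Even under these generous assumptions the cluster lands in the many-hundreds-of-nodes range, which is why the hot/cold-subset and cost-per-TB questions above matter so much.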

Now, I need some help to kick this off. What would be an optimal hardware configuration for my NN+JT, DN+TT+RS, HMaster, and ZK machines?
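On NameNode sizing in particular, a commonly cited rule of thumb is roughly 1 GB of NameNode heap per million namespace objects, so what matters is the logical (pre-replication) data size and the block size. A sketch, assuming 128 MB blocks and large files (both assumptions - lots of small files would push the estimate far higher):

```python
# Rough NameNode heap estimate using the common "~1 GB heap per million
# blocks/files" rule of thumb. Block size and file sizes are assumptions.

LOGICAL_BYTES = 8 * 10**15   # ~8 PB unique data (24 PB / replication 3)
BLOCK_SIZE = 128 * 1024**2   # assumed HDFS block size
GB_PER_MILLION = 1           # rule-of-thumb heap per million objects

blocks = LOGICAL_BYTES / BLOCK_SIZE
heap_gb = blocks / 1_000_000 * GB_PER_MILLION

print(f"estimated blocks: {blocks / 1e6:.0f} million")   # -> 60 million
print(f"suggested NN heap: >= {heap_gb:.0f} GB")         # -> >= 60 GB
```

This is only a floor; file-count-heavy workloads (e.g. many small legacy files off tape) can dominate the namespace, which is another reason to answer the "textual or binary" and access-pattern questions first.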

What should be the number of slave nodes and ZK peers, keeping this configuration in mind?

What is the optimal network configuration for a cluster of this size?

Which kind of disks would be most efficient?

Please do provide me some guidance, as I want to have some expert comments before moving ahead. Many thanks.

Mohammad Tariq

posted Nov 28, '12 at 10:10p · active Nov 29, '12 at 12:11a

2 users in discussion: Gaurav Sharma (1 post), Mohammad Tariq (1 post)
