Thanks again, Gaurav. At least one file will be read at a time. The file is the
atomic unit, and each of these binary files can grow to about 1 TB in size.

Latency is not a major concern, since almost 99.99% of the work will be
offline batch processing, so we can compromise there.

Thank you for the pointer. I'll definitely have a look over it.

Regards,
Mohammad Tariq



On Fri, Nov 30, 2012 at 4:37 AM, Gaurav Sharma wrote:
The 7th question should have been asked first, since it obviates the need for
some of the other six. If the data is binary, MR is of little use anyway.
I don't quite understand, and find it hard to believe, this statement:

"No, entire data is equally important and will be read together."

Other than that, an 8th question:
8. how much read latency can the system tolerate?

and a 9th:
9. what is the usable size of a unit of data being read? it being binary,
does the entire stream have to be read to make sense of it for the
application, or are parts of the binary usable on their own?


If you can get away with some read-latency, take a look at one of the
commercial erasure coding solutions out there (like Cleversafe) or just
code one yourself. Also, see:
https://issues.apache.org/jira/browse/HDFS-503
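To see why erasure coding is attractive at this scale, here is a rough sketch (plain Python; the 8 PB usable figure and the (10, 4) scheme are illustrative assumptions, not anything from the thread) comparing the raw-storage overhead of 3x replication with a (k, m) erasure-coded layout:

```python
# Rough storage-overhead comparison: 3x replication vs. (k, m) erasure coding.
# All figures below are illustrative assumptions, not measurements.

def raw_storage_pb(usable_pb, scheme):
    """Return raw PB needed to hold `usable_pb` of data under a scheme."""
    if scheme == "3x-replication":
        return usable_pb * 3.0
    k, m = scheme  # k data fragments + m parity fragments per stripe
    return usable_pb * (k + m) / k

usable = 8.0  # assumed PB of actual data
print(raw_storage_pb(usable, "3x-replication"))  # 24.0 PB raw, tolerates 2 lost copies
print(raw_storage_pb(usable, (10, 4)))           # 11.2 PB raw, tolerates 4 lost fragments
```

The trade is the read latency mentioned above: reconstructing from parity fragments costs CPU and extra I/O, which is why it suits cold batch data.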

hth


On Thu, Nov 29, 2012 at 2:19 AM, Mohammad Tariq wrote:

Hello Gaurav,

Thank you so much for your reply. Please find my comments embedded below:

1. do you know if there exist patterns in this data?
Yes, the entire file is divided into data blocks of fixed length (but
there is no separator between two blocks).

2. will the data be read and how?
Yes, the data has to be read. To be honest, we are still not sure how to
do that.

3. does there exist a hot subset of the data - both read/write?
No, entire data is equally important and will be read together.
4. what makes you think hdfs is a good option?
Distributed architecture, flexibility to read any kind of data,
parallelism, native MR integration, cost, fault tolerance, high
throughput, etc.

5. how much do you intend to pay per TB?
I have to discuss it with my superiors (I will let you know soon).
6. say you do build the system, how do you plan to keep lights on?
I am sorry, I did not quite get this. I will do whatever it takes to
keep everything moving. I have some experience with small clusters, and I
have a small team with me that is available 24x7.

7. forgot to ask - is the data textual or binary?
Data is binary.

No, I will require some help. I have a team with me, as I said, but
being new to Hadoop I will need help from whatever source is available.

Many thanks.

Regards,
Mohammad Tariq



On Thu, Nov 29, 2012 at 5:40 AM, Gaurav Sharma <
gaurav.gs.sharma@gmail.com> wrote:
So, before getting any suggestions, you will have to explain a few core
things:

1. do you know if there exist patterns in this data?
2. will the data be read and how?
3. does there exist a hot subset of the data - both read/write?
4. what makes you think hdfs is a good option?
5. how much do you intend to pay per TB?
6. say you do build the system, how do you plan to keep lights on?
7. forgot to ask - is the data textual or binary?

Those are just the basic questions. Are you going to be building and
running the system all by yourself?

On Nov 28, 2012, at 14:09, Mohammad Tariq wrote:

Hello list,

Although there have been many similar discussions here, I
still seek your guidance. Until now I have worked only on small
or mid-sized clusters, but this time the situation is a bit different. I have
to collect a lot of legacy data, stored over the last few decades. This data
is on tape drives, and I have to collect it from there and store it in my
cluster. The size could be somewhere near 24 petabytes (inclusive of
replication).
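For a sense of scale, here is a back-of-the-envelope node-count sketch (plain Python; the disk size, bays per node, and headroom figures are assumptions for illustration, not recommendations):

```python
# Back-of-the-envelope DataNode count for 24 PB of raw capacity.
# Disk size, disks per node, and headroom are assumed figures, not advice.

raw_pb = 24.0          # total raw storage incl. replication (from the thread)
disk_tb = 3.0          # assumed disk size (2012-era 3 TB drives)
disks_per_node = 12    # assumed 12-bay DataNode chassis
headroom = 0.75        # keep ~25% free for temp space and rebalancing

node_tb = disk_tb * disks_per_node * headroom  # usable TB per node
nodes = raw_pb * 1024 / node_tb                # PB -> TB, then divide
print(round(node_tb, 1), "TB usable per node")  # 27.0 TB usable per node
print(int(nodes) + 1, "DataNodes (rounded up)") # 911 DataNodes (rounded up)
```

Even under generous assumptions this lands at several hundred DataNodes, which is why the network and master-node questions below matter so much.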
Now, I need some help to kick this off: what would be the optimal
configuration for my NN+JT, DN+TT+RS, HMaster, and ZK machines?
What should be the number of slave nodes and ZK peers, keeping this
configuration in mind?
What is the optimal network configuration for a cluster of this size?

Which kind of disks would be more efficient?

Please do provide some guidance, as I want to have some expert
comments before moving ahead. Many thanks.

Regards,
Mohammad Tariq

Discussion Overview
group: hdfs-user @ hadoop
posted: Nov 28, '12 at 10:10p
active: Nov 30, '12 at 12:01a
posts: 5
users: 2 (Mohammad Tariq: 3 posts, Gaurav Sharma: 2 posts)
website: hadoop.apache.org...
irc: #hadoop
