Hi Folks,
I have a setup where in I am streaming data into HDFS from a
remote location and creating a new files every X min. The file generated
is of a very small size (512 KB - 6 MB) size. Since that is the size
range the streaming code sets the block size to 6MB whereas default that
we have set for the cluster is 128 MB. The idea behind such a thing is
to generate small temporal data chunks from multiple sources and merge
them periodically into a big chunk with our default (128 MB) block size.

The webUI for DFS reports the block size for these files to be 6 MB. My
questions are.
1. Can we have multiple files in DFS use different block sizes ?
2. If we use default block size for these small chunks, is the DFS space
wasted ?
If not then does it mean that a single DFS block can hold data from
more than one file ?


Search Discussions

Discussion Posts

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 1 of 2 | next ›
Discussion Overview
groupcommon-dev @
postedJun 27, '08 at 8:19a
activeJun 27, '08 at 3:42p

2 users in discussion

Goel, Ankur: 1 post Ted Dunning: 1 post



site design / logo © 2022 Grokbase