Answers inline. Let me know if you have any follow-up questions.
On Wed, Jul 8, 2009 at 10:49 PM, Sugandha Naolekar wrote:
I have a 7 node hadoop cluster!
As of now, I am able to transfer (dump) data into HDFS from a remote
node (not part of the hadoop cluster), and through the web UI I am able to
download the same.
-> But if I need to restrict that web UI to a few users only, what am I
supposed to do?
Hadoop doesn't have any mechanism for authentication, so you'll have to do
this with Linux tools. Be careful when restricting access to the web
ports, though, because those same ports are used by the Hadoop daemons
themselves. You could use iptables to create an IP whitelist that includes
your users' IPs as well as your nodes' IPs. There may be a way to configure
Jetty to restrict access, but I don't know enough about Jetty to say for sure.
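For example, a whitelist could look something like the following (the port
and addresses are placeholders; 50070 is the default NameNode web UI port,
and you'd want similar rules for the other daemons' web ports):

```shell
# Allow cluster nodes and trusted users to reach the NameNode web UI;
# 10.0.0.0/24 and 192.168.1.5 below are example addresses only.
iptables -A INPUT -p tcp --dport 50070 -s 10.0.0.0/24 -j ACCEPT   # cluster nodes
iptables -A INPUT -p tcp --dport 50070 -s 192.168.1.5 -j ACCEPT   # a trusted user
iptables -A INPUT -p tcp --dport 50070 -j DROP                    # everyone else
```

Order matters here: the ACCEPT rules must come before the final DROP, since
iptables applies the first matching rule.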
-> Also, if I need to do some kind of search, i.e., check whether a particular
file or folder is available or not in HDFS: will I be able to do it simply
by writing code using the Hadoop FileSystem APIs? Will it be fast and
efficient when the data grows to a huge amount?
The API should be sufficient here. Another possibility, if you'd rather not
use Java, is to get the FUSE contrib project working and mount HDFS onto a
Linux box; then you could use Python, bash, or whatever to do your file
traversals. Note that FUSE isn't widely used with Hadoop, so it may be hard
to get going (I've never done it).
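For the API route, a minimal sketch (the path below is just an example) would
be something like this, using FileSystem.exists(), which works for both files
and directories:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExists {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the cluster config on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // exists() answers the "is this file or folder there?" question
        // directly; the path here is hypothetical
        Path p = new Path("/user/sugandha/data");
        System.out.println(fs.exists(p) ? "found" : "not found");
    }
}
```

As for efficiency: an existence check is a single metadata call to the
NameNode, so it doesn't touch the data blocks at all and should stay fast
regardless of how large the data itself grows.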
-> Also, after the above tasks, I want to implement compression algorithms. The
data that is getting placed in HDFS should be stored in a compressed format.
Will I have to use the Hadoop APIs only, or some map-reduce techniques? In
this whole process, is map-reduce necessary? If yes, where?
There are a few different ways to do this; probably the easiest is the
following. First, put your data in HDFS in its original format. Then use
IdentityMapper and IdentityReducer to read your (presumably plain text) data
via TextInputFormat, and configure your job to use SequenceFileOutputFormat
(to learn about the different compression options, see <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html>).
After this map-reduce job is done, you will have both your original data and
your data in SequenceFiles. Make sense?
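A sketch of that job driver, using the old mapred API (the input/output paths
and the choice of BLOCK compression are just examples):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CompressToSequenceFiles {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(CompressToSequenceFiles.class);
        job.setJobName("compress-to-seqfiles");

        // Read the plain-text originals; TextInputFormat produces
        // (byte offset, line) pairs as (LongWritable, Text)
        job.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/data/plain"));       // example path
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Pass records through unchanged; only the storage format changes
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);

        // Write compressed SequenceFiles; BLOCK is one of the compression
        // options described in the SequenceFile javadoc linked above
        job.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/compressed")); // example path
        FileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(
            job, SequenceFile.CompressionType.BLOCK);

        JobClient.runJob(job);
    }
}
```

Once the job finishes, /data/compressed holds the SequenceFile copies and you
can delete the originals under /data/plain if you no longer need them.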