On Thu, Jul 22, 2010 at 9:23 PM, Allen Wittenauer wrote:
On Jul 22, 2010, at 4:40 AM, Vitaliy Semochkin wrote:
If it were context switching, would increasing the number of
mappers/reducers lead to a performance improvement?
Woops, I misspoke. I meant process switching (which I guess is a form of
context switching). More on that later.
I have one log file, ~140 GB. I use the default HDFS block size (64 MB).
So that's not 'huge' by Hadoop standards at all.
If I did my math correctly and it really is one big file, you're likely
seeing somewhere on the order of 2300 or so maps to process that file,
right? What is the average time per task? What scheduler is being used?
What version of Hadoop? How many machines?
I use Hadoop 0.20.2 with the default scheduler on 5 servers (Xeon(R) E5320,
2 CPUs x 4 cores, 4 GB RAM).
I also run a DataNode/TaskTracker on the NameNode/JobTracker machine.
Your math is nearly perfect: 2100 maps, and the average time per task is
about 8 sec (btw, does Hadoop have tools to compute the average time spent
per task?)
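For reference, I don't know of a single built-in average-task-time readout
in 0.20, but per-task start/finish times are available through the old
mapred API. A rough sketch, assuming 0.20's TaskReport exposes
getStartTime()/getFinishTime() and taking the job id as args[0]:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.TaskReport;

    public class AvgMapTime {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        TaskReport[] maps = client.getMapTaskReports(JobID.forName(args[0]));
        long totalMs = 0;
        int finished = 0;
        for (TaskReport r : maps) {
          if (r.getFinishTime() > 0) {       // skip tasks still running
            totalMs += r.getFinishTime() - r.getStartTime();
            finished++;
          }
        }
        if (finished > 0) {
          System.out.println("avg map time: " + (totalMs / finished) + " ms");
        }
      }
    }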
Also, I set dfs.replication=1.
Eek. I wonder what your locality hit rate is.
Is it possible to check it?
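One place to look is the per-job counters: the job page on the JobTracker
shows "Data-local map tasks" and "Rack-local map tasks". A hedged sketch of
reading them programmatically; the group string below is what I believe
0.20 uses internally, but that's an assumption, so double-check it against
your job page:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class LocalityRate {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();
        // counter group/name strings as 0.20 reports them (assumption)
        String group = "org.apache.hadoop.mapred.JobInProgress$Counter";
        long local = counters.findCounter(group, "DATA_LOCAL_MAPS").getCounter();
        long all = counters.findCounter(group, "TOTAL_LAUNCHED_MAPS").getCounter();
        System.out.println("data-local maps: " + local + " of " + all);
      }
    }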
Am I right that the higher dfs.replication is, the faster MapReduce will
run, because the probability that a split will be on a local node
approaches 1?
If you mean the block for the map input, yes, you have a much higher
probability of it being local and therefore faster.
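Back-of-the-envelope, ignoring scheduler details: on 5 datanodes, a block
with dfs.replication=1 lives on 1 of the 5 nodes, so a map launched on an
arbitrary node has roughly a 1/5 chance of reading it locally; at
dfs.replication=3 that rises to about 3/5, and at 5 it is effectively 1.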
I mean the block. (still getting used to terminology)
Also, is it correct that it will slow down put operations? (Technically,
put operations run in parallel, so I'm not sure whether it will hurt
performance or not.)
I don't know if anyone has studied the effect of the output replication
factor on job performance. I'm sure someone has, though. I'm not fully
awake yet (despite it being 10am), but I'm fairly certain that the job of
replicating the local block falls to the DN, not the client, so the client
may not be held up by replication at all.
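If you want to test the locality effect without re-loading the data,
replication is per-file metadata and can be raised after the fact; a
minimal sketch (the path is illustrative, and the DataNodes do the
re-replication in the background):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BumpReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up *-site.xml
        FileSystem fs = FileSystem.get(conf);
        // illustrative path; triggers asynchronous re-replication to 3 copies
        fs.setReplication(new Path("/logs/access.log"), (short) 3);
      }
    }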
More memory means that Hadoop doesn't have to spill to disk as often, due
to being able to use a larger buffer in RAM.
Does Hadoop check whether it has enough memory for such an operation?
That depends upon what you mean by 'check'. By default, Hadoop will spawn
whatever size heap you want. The OS, however, may have different ideas as
to what is allowable. :)
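For what it's worth, the 0.20-era knobs for the spill buffer and the task
heap are io.sort.mb and mapred.child.java.opts; a hedged sketch with
illustrative values (the sort buffer must fit inside the child heap, or
the task dies with an OutOfMemoryError):

    import org.apache.hadoop.mapred.JobConf;

    public class TuneSpill {
      public static JobConf configure() {
        JobConf job = new JobConf();
        job.setInt("io.sort.mb", 200);                  // sort buffer; default 100 (MB)
        job.set("mapred.child.java.opts", "-Xmx512m");  // per-task heap; must hold the buffer
        return job;
      }
    }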
...
What I suspect is really happening is that your tasks are not very CPU
intensive and don't take long to run through 64 MB of data. So your task
turnaround time is very, very fast. So fast, in fact, that the scheduler
can't keep up. Boosting the number of tasks per node actually helps because
the *initial* scheduling puts you that much farther ahead.
An interesting test to perform is to bump the block size up to 128 MB. You
should see fewer tasks that stick around a bit longer. But you could very
well see overall throughput go up because you are spending less time cycling
through JVMs. [Even with reuse turned on.]
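A hedged sketch of both knobs, with illustrative values; note that
dfs.block.size applies at file-creation time, so it only affects newly
written files:

    import org.apache.hadoop.mapred.JobConf;

    public class FewerBiggerTasks {
      public static JobConf configure() {
        JobConf job = new JobConf();
        // per-file, fixed at write time; re-upload the input with this set
        // for it to take effect on new copies
        job.setLong("dfs.block.size", 128L * 1024 * 1024);
        // reuse each task JVM for an unlimited number of this job's tasks
        job.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        return job;
      }
    }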
Now most tasks are in the "initializing" state, but the overall time spent
didn't change.
Should I reformat HDFS after changing the block size?
It looks like you have a lot of experience in this field; which
books/articles would you recommend?
Thanks in advance,
Vitaliy S