thanks for the quick response. My system runs on an i7 with about 8GB
of RAM. The problem with my setup is that I'm using Hadoop to pump
40GB of JSON-encoded data hashes into a MySQL database. The data is
in non-relational form and needs to be normalized before it can enter
the DB, hence the Hadoop approach. I ran the first batch of test data
last night, and it took ~12h to process ~10GB. ( a Python mapper
writing to MySQL via the oursql pkg )
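In case it helps, the mapper boils down to something like this (a minimal sketch; the record layout, the "id" field, and the normalize step are placeholders for my actual schema, and the real job writes the rows to MySQL via oursql instead of emitting them):

```python
import json
import sys


def normalize(line):
    """Flatten one JSON-encoded hash into (id, key, value) rows.

    The field names here are hypothetical placeholders, not my real schema.
    """
    record = json.loads(line)
    rid = record.get("id")
    # one row per key/value pair, keyed by the record id
    return [(rid, k, v) for k, v in sorted(record.items()) if k != "id"]


def main():
    # In the real job these rows go to MySQL via oursql; here we just
    # emit tab-separated lines, the way a plain streaming mapper would.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        for row in normalize(line):
            print("\t".join(str(c) for c in row))


if __name__ == "__main__":
    main()
```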
The reason for this is certainly a less-than-perfectly configured
mysqld, along with the fact that I gave Hadoop too much memory. I
gave Hadoop 4GB but forgot that MySQL was granted about 6GB as
well... so I was 2-3GB into swap most of the time. ( Though I don't
know how much of that was Hadoop and how much was MySQL )
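Before the next run I'll probably cap the Hadoop side in mapred-site.xml (the 512m heap below is just a guess at a saner value, not something I've tested) and shrink MySQL's buffer pool to match:

```
<!-- mapred-site.xml: cap each child JVM's heap -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```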
Anyway, I didn't bother to set up the 'real' Hadoop environment with
daemons and HDFS, but instead just ran the streaming jar directly
from $HADOOP_HOME. I don't know if this really matters in any way, I
just thought I'd mention it.
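Concretely, the invocation is roughly this (paths and file names are illustrative, not my exact ones):

```
# run the streaming jar directly, no daemons / HDFS
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input input.json \
    -output out \
    -mapper mapper.py \
    -file mapper.py
```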
all the best,
On Jul 16, 2010, at 11:59 AM, Michael Segel wrote:
I'm not sure what you're doing, but raising the number of mappers in your configuration isn't a 'hint'.
The number of mappers that you can run will depend on your configuration. You mention an i7, which is a quad-core CPU, but you don't mention the amount of memory you have available, or what else you run on the machine. You don't want Hadoop to swap.
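If you're running an actual tasktracker, the per-node slot counts come from mapred-site.xml; for a quad-core box you might start with something like this (the values are a starting point, not gospel):

```
<!-- mapred-site.xml: slots per tasktracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```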
If your initial m/r jobs are taking input from a file, the default behavior is to create one map/reduce task per block. So if your initial input file is < 64MB and you have kept your default block size of 64MB, then you will only have one map/reduce task.
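The arithmetic is simple enough to sketch (a back-of-the-envelope check, not Hadoop's actual split logic, which can differ at file boundaries):

```python
import math


def num_map_tasks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """One map task per input block, with the default 64MB block size."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


# a 40MB input file fits in a single 64MB block -> 1 map task
# a 10GB input file spans 160 blocks -> 160 map tasks
```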
I haven't played with Hadoop in a single-node / pseudo-distributed environment... just in a distributed environment, but I believe that the functionality is the same.
PS. Please take my advice with a grain of salt. It's 5:00am and I haven't had my first cup of coffee yet. ;-)
Date: Fri, 16 Jul 2010 11:03:19 +0200
Subject: Single Node with multiple mappers?
I was curious if there is any option to use Hadoop in single-node mode
in a way that enables the process to use more system resources.
Right now, Hadoop uses one mapper and one reducer, basically leaving
my i7 at about 20% CPU usage (1 core for Hadoop, .5 cores for my OS).
Raising the number of map tasks doesn't seem to do much, as this
parameter seems to be more of a hint anyway. Still, I have lots of
CPU time and RAM left. Any hints on how to use them?
thanks in advance,