Job Speed
So I have a table with roughly 145,000 records spread across 300
files. The total size is about 7MB. Right now I'm running one job
tracker and one task tracker on a high-CPU Amazon box (1.7 GB of
RAM, ~4 cores). I run the following query:

SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;

And it takes about 35 minutes to finish. One of my problems is that I
can't get my task tracker to process more than one map at a time, even
though its maximum number of map tasks is set higher. But even that is
relatively fast compared to the reduce, which takes about 30 minutes by
itself. The status of the task is:

reduce > copy (225 of 344 at 0.01 MB/s) >

I really don't understand what is going on during this copy step or
why it is taking so long. The files are small and they're all inside
Amazon's network. Can you guys help me out?

Josh F.


  • Jason hadoop at Jan 27, 2009 at 4:13 pm
    It is not clear to me from your email whether you have the number of map
    tasks per machine set to > 1, or whether you are attempting to use a
    multi-threaded mapper (the difference is sketched at the end of this
    message).

    How many tasks does the system split your job into, and how many execute
    at once? My first guess is that you are getting 300 map tasks, each
    running for only a few seconds, with most of that time spent on task
    setup.

    As a first try, you could pack your 300 small files into as many files
    as you have simultaneous task execution slots (see the sketch after this
    message) and adjust the input split size (probably not necessary) to
    ensure there is no further splitting.

    The reduces all essentially stall until all of the map tasks are done, so
    the reduce copy speed is a misleading value.
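
    A note on the two knobs distinguished above, since they live in
    different places. This is a minimal sketch against the 0.19-era
    "mapred" API; the class name and the value 4 are illustrative:

        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

        public class ConcurrencyKnobs {
            public static void configure(JobConf conf) {
                // Concurrent map *tasks* per TaskTracker are a daemon-side
                // setting, read from the TaskTracker's hadoop-site.xml,
                // not set per job:
                //   mapred.tasktracker.map.tasks.maximum = 4
                // A multi-threaded *mapper* instead runs several map()
                // threads inside a single task:
                conf.setMapRunnerClass(MultithreadedMapRunner.class);
                conf.setInt("mapred.map.multithreadedrunner.threads", 4);
            }
        }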
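
    A minimal sketch of the packing idea above, assuming the inputs are
    plain text files with newline-terminated records (the class name,
    output file naming, and argument layout are illustrative):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class PackFiles {
            public static void main(String[] args) throws IOException {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path in = new Path(args[0]);   // directory of small files
                Path out = new Path(args[1]);  // directory for packed files
                int slots = Integer.parseInt(args[2]); // e.g. one per map slot

                FSDataOutputStream[] packed = new FSDataOutputStream[slots];
                for (int i = 0; i < slots; i++) {
                    packed[i] = fs.create(new Path(out, "packed-" + i));
                }
                // Assumes the input directory contains only plain files.
                FileStatus[] files = fs.listStatus(in);
                for (int i = 0; i < files.length; i++) {
                    FSDataInputStream src = fs.open(files[i].getPath());
                    // Round-robin concatenation; safe only for
                    // newline-terminated text records.
                    IOUtils.copyBytes(src, packed[i % slots], conf, false);
                    src.close();
                }
                for (FSDataOutputStream o : packed) {
                    o.close();
                }
            }
        }

    With four slots this turns 300 tiny inputs into 4 files, so the job
    runs 4 map tasks instead of 300 and pays task setup once per slot.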
  • Jason hadoop at Jan 27, 2009 at 4:14 pm
    I just realized this was a Hive question. I have no experience with
    Hive, so my advice is probably incorrect.
  • Joydeep Sen Sarma at Jan 27, 2009 at 4:34 pm
    Hi Josh,

    Copying a large number of small map outputs can take a while. I can't
    say why the TaskTracker is not running more than one mapper.

    We are working on this. HADOOP-4565 tracks a JIRA to create splits that
    cross files while preserving locality, and HIVE-74 will use HADOOP-4565
    on the Hive side to control the number of maps better (see the note
    after this message).

    Joydeep

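
    For context on why HADOOP-4565 matters here: the split knobs that
    already exist only coalesce data within a single file, so 300 small
    files still yield roughly 300 splits. A sketch of those pre-4565
    knobs (0.19-era property names; the values are illustrative):

        import org.apache.hadoop.mapred.JobConf;

        public class SplitKnobs {
            public static void configure(JobConf conf) {
                // Only a hint; FileInputFormat still creates at least
                // one split per file.
                conf.setNumMapTasks(4);
                // Raises the split size *within* a file but never merges
                // data across files -- the gap HADOOP-4565 (which later
                // landed as CombineFileInputFormat) is meant to close.
                conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);
            }
        }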
  • Josh Ferguson at Jan 27, 2009 at 5:31 pm
    Yeah, so I am loading 344 files, each one taking just under 1 second
    according to the log, which adds up to approximately 5 minutes. The
    other 30 minutes are spent doing a "reduce > copy". I'm not sure why
    it's so slow, because it's only copying about 144,000 small records;
    the total size is about 16MB after it's mapped. I think with this
    particular query the slowness could be caused by the reduce task itself
    being slow. It's a distinct count, so perhaps the reducer code is
    running extremely slowly? I will try to write my own tonight and see if
    it goes any faster (a sketch of that approach follows below).
    Josh F.
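
    For what it's worth, a hand-rolled distinct count of the kind Josh
    mentions might look roughly like this (a sketch against the 0.19-era
    "mapred" API; the tab-separated record layout and the position of
    actor_id as the first field are assumptions):

        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;

        public class DistinctCount {
            public static class ExtractMapper extends MapReduceBase
                    implements Mapper<LongWritable, Text, Text, NullWritable> {
                public void map(LongWritable offset, Text line,
                                OutputCollector<Text, NullWritable> out,
                                Reporter reporter) throws IOException {
                    // Assumed layout: actor_id is the first tab-separated field.
                    String actorId = line.toString().split("\t", 2)[0];
                    out.collect(new Text(actorId), NullWritable.get());
                }
            }

            public static class CountReducer extends MapReduceBase
                    implements Reducer<Text, NullWritable, Text, LongWritable> {
                private long distinct = 0;
                private OutputCollector<Text, LongWritable> out;

                public void reduce(Text actorId, Iterator<NullWritable> values,
                                   OutputCollector<Text, LongWritable> out,
                                   Reporter reporter) throws IOException {
                    this.out = out;
                    distinct++;  // each reduce() call sees one distinct actor_id
                }

                public void close() throws IOException {
                    if (out != null) {
                        out.collect(new Text("distinct_actors"),
                                    new LongWritable(distinct));
                    }
                }
            }

            public static void main(String[] args) throws IOException {
                JobConf conf = new JobConf(DistinctCount.class);
                conf.setJobName("distinct-count");
                conf.setMapperClass(ExtractMapper.class);
                conf.setReducerClass(CountReducer.class);
                conf.setMapOutputKeyClass(Text.class);
                conf.setMapOutputValueClass(NullWritable.class);
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(LongWritable.class);
                conf.setNumReduceTasks(1);  // a single global count => one reducer
                FileInputFormat.setInputPaths(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));
                JobClient.runJob(conf);
            }
        }

    Because every duplicate actor_id still crosses the network, a combiner
    that deduplicates map output locally would shrink the shuffle and speed
    up the copy phase Josh is stuck in.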

Discussion Overview
group: user
categories: hive, hadoop
posted: Jan 27, '09 at 7:28a
active: Jan 27, '09 at 5:31p
posts: 5
users: 3
website: hive.apache.org
