Grokbase Groups Hive user August 2011
I ran a single query:

select retailer_key,count(*) from records group by retailer_key;

It uses a single map task, as shown below. Since the file is already on HDFS, I would expect that Hadoop/Hive doesn't need to copy anything.


Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      1           0         0         1          0        0 / 0
reduce   100.00%      1           0         0         1          0        0 / 0

But the final chart in the job report shows that "copy" takes about 33% of the total time, and the rest is "sort" and "reduce". So why does it need to copy here, or does "copy" mean something else?
oracle@oracle-MS-7623:~/test$ hadoop fs -lsr /

drwxr-xr-x - oracle supergroup 0 2011-08-10 19:46 /user
drwxr-xr-x - oracle supergroup 0 2011-08-10 19:46 /user/hive
drwxr-xr-x - oracle supergroup 0 2011-08-10 19:59 /user/hive/warehouse
drwxr-xr-x - oracle supergroup 0 2011-08-10 19:59 /user/hive/warehouse/records
-rw-r--r-- 1 oracle supergroup 41600256 2011-08-10 19:59 /user/hive/warehouse/records/test.txt


  • Bejoy_ks at Aug 10, 2011 at 7:02 pm
    Hi
    Hive queries are translated into Hadoop MapReduce jobs. In a MapReduce job there are two phases between the map and reduce tasks, the copy phase and the sort phase, together known as shuffle and sort. So the "copy" shown for this Hive job should be the copy phase of MapReduce: it copies the map output from the map task nodes to the corresponding reduce task nodes.

    Regards
    Bejoy K S
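
To make the phases described above concrete, here is a minimal sketch in plain Python that mimics map, copy, sort, and reduce for the group-by query in this thread. It is not Hadoop or Hive code; the function names and the sample rows are made up for illustration. The point it tries to show is that even with one map task, the reducer still has to fetch the map output before it can sort and count.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(rows):
    """Map: emit (retailer_key, 1) for every input row."""
    return [(row["retailer_key"], 1) for row in rows]


def copy_phase(map_outputs):
    """Copy: the reducer gathers the output of every map task.

    In a real job this is a fetch from the node that ran the map;
    here it is only a list concatenation.
    """
    fetched = []
    for output in map_outputs:
        fetched.extend(output)
    return fetched


def sort_phase(pairs):
    """Sort: order the fetched pairs by key so equal keys are adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Reduce: sum the counts of each run of equal keys, i.e. count(*)."""
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


# Hypothetical sample data standing in for the rows of the "records" table.
rows = [{"retailer_key": k} for k in (7, 3, 7, 7, 3, 9)]

map_outputs = [map_phase(rows)]  # a single map task, as in the question
result = reduce_phase(sort_phase(copy_phase(map_outputs)))
print(result)
```

The input file on HDFS is only read by the map side. What gets copied is the intermediate (key, 1) output, which is why the copy phase shows up regardless of where the input lives.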

    -----Original Message-----
    From: "Daniel,Wu" <hadoop_wu@163.com>
    Date: Wed, 10 Aug 2011 20:07:48
    To: hive<user@hive.apache.org>
    Reply-To: user@hive.apache.org
    Subject: why need to copy when run a sql with a single map

  • Kai Ju Liu at Aug 10, 2011 at 7:02 pm
    Hi Daniel. The Hive query uses a reduce step to group by retailer_key and
    calculate count(*). The "copy" step is a copy of data from the mapper to the
    reducer.

    Kai Ju
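
As a rough illustration of the mapper-to-reducer copy described above, the sketch below counts how many records each reducer would fetch from each map task. It is plain Python, not Hadoop code. The `partition` function uses CRC32 as a stand-in for a hash partitioner, and the names and sample keys are assumptions made for this example.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def copy_plan(map_outputs, num_reducers):
    """For each reducer, count the records it copies from each map task."""
    plan = {r: [0] * len(map_outputs) for r in range(num_reducers)}
    for map_id, output in enumerate(map_outputs):
        for key, _ in output:
            plan[partition(key, num_reducers)][map_id] += 1
    return plan


# Hypothetical output of two map tasks: (retailer_key, 1) pairs.
map_outputs = [
    [(7, 1), (3, 1), (7, 1)],
    [(9, 1), (3, 1)],
]

# With one reducer, as in the job above, it copies everything from every map.
single = copy_plan(map_outputs, num_reducers=1)
print(single)

# With two reducers, every record still goes to exactly one of them, and all
# records with the same retailer_key go to the same reducer.
double = copy_plan(map_outputs, num_reducers=2)
print(double)
```

With one map and one reduce, the plan collapses to a single fetch of the whole map output, which matches the job in the question.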



