Grokbase Groups Hive user August 2010
FAQ
Hi - in my HIVE environment, I ran the following query and expecting to see
the rows (the data is present)...however, after 2339 seconds, the o/p I get
is as shown below in the email (I've pasted last 5~10 lines of screen
output)...

Is there anything that I'm missing? Did the process finish correctly? should
there be something that could point me how to debug?



----------------------------- Query ------------------------------

select substr(CB.EXEC_DATE,1,10), count(CB.ID)
from callbacks CB JOIN
(select * from pages p where
p.page like '%google.com/search%<http://google.com/search%>'
or p.page like '%google.com/custom%<http://google.com/custom%>'
or p.page like '%google.com/#hl%<http://google.com/#hl%>'
or p.page like '%google.com/cse%<http://google.com/cse%>'
or p.page like '%search.yahoo.com/search%<
http://search.yahoo.com/search%>'
or p.page like '%bing.com/search%<http://bing.com/search%>'
or p.page like '%google.com/product%<http://google.com/product%>'
) s
ON s.id = cb.id
group by substr(CB.EXEC_DATE,1,10);


================= o/p on screen ==================

10/08/20 11:13:38 INFO mapred.TaskRunner: Task
'attempt_local_0001_r_000000_0' done.
2010-08-20 11:13:39,526 map = 100%, reduce =100%
10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map = 100%,
reduce =100%
Ended Job = job_local_0001
10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001 to: hdfs://
hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate
10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate to:
hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001

OK
Time taken: 2339.331 seconds
=====================================================
*


thanks, ronak*
*
*
*408 504 4847*
*My Blog : http://ronakbaps.posterous.com*
*
*
*
*

Search Discussions

  • Ashish Thusoo at Aug 20, 2010 at 7:21 pm
    I think the query did finish properly. Can you recheck the data to see if you would really get a few rows of output?

    Ashish

    ________________________________
    From: Ronak Bhatt
    Sent: Friday, August 20, 2010 12:12 PM
    To: hive-user@hadoop.apache.org
    Subject: question..

    Hi - in my HIVE environment, I ran the following query and expecting to see the rows (the data is present)...however, after 2339 seconds, the o/p I get is as shown below in the email (I've pasted last 5~10 lines of screen output)...

    Is there anything that I'm missing? Did the process finish correctly? should there be something that could point me how to debug?



    ----------------------------- Query ------------------------------

    select substr(CB.EXEC_DATE,1,10), count(CB.ID<http://CB.ID>)
    from callbacks CB JOIN
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%><http://google.com/search%>'
    or p.page like '%google.com/custom%<http://google.com/custom%><http://google.com/custom%>'
    or p.page like '%google.com/#hl%<http://google.com/#hl%><http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%><http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<http://search.yahoo.com/search%><http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%><http://bing.com/search%>'
    or p.page like '%google.com/product%<http://google.com/product%><http://google.com/product%>' ) s
    ON s.id<http://s.id> = cb.id<http://cb.id>
    group by substr(CB.EXEC_DATE,1,10);


    ================= o/p on screen ==================

    10/08/20 11:13:38 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
    2010-08-20 11:13:39,526 map = 100%, reduce =100%
    10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map = 100%, reduce =100%
    Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate>
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001>

    OK
    Time taken: 2339.331 seconds
    =====================================================


    thanks, ronak

    408 504 4847
    My Blog : http://ronakbaps.posterous.com
  • Ronak Bhatt at Aug 20, 2010 at 8:56 pm
    Hi - Ashish was right that in original query, I was using a wrong field to
    join e.g. used *cb.id* instead of *cb.page_id* (however, that one should
    have also brought the result because the content was integer values
    comprising of 1,2,3 etc..)

    But to not leave any doubt, I did change the query to join on right
    fields...i.e. *page.id = cb.page_id ..*but I was getting heap error, so put
    the inline table on left hand and callbacks on right side (as I learned from
    forum that smaller table on lhs should help address the error)

    The query (pasted below) ran without any error, but the result is the same
    (pasted below) i.e. no records are coming as a result of query...but I
    verified in files that the data does exist i.e. there are records in CB
    where page_id for example has value 1 and that 1 has page like '%google%'

    Any pointers as to what could be going wrong?

    ================== query ================

    select substr(CB.EXEC_DATE,1,10), count(CB.ID)
    from
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%>'
    or p.page like '%google.com/custom%<http://google.com/custom%>'
    or p.page like '%google.com/#hl%<http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<
    http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%>'
    or p.page like '%google.com/product%<http://google.com/product%>'
    ) s join callbacks CB ON s.id = cb.page_id
    group by substr(CB.EXEC_DATE,1,10);


    ==================query output =============

    10/08/20 13:43:32 INFO mapred.LocalJobRunner: reduce > reduce
    10/08/20 13:43:32 INFO mapred.TaskRunner: Task
    'attempt_local_0001_r_000000_0' done.
    2010-08-20 01:43:32,839 map = 100%, reduce =100%
    10/08/20 13:43:32 INFO exec.ExecDriver: 2010-08-20 01:43:32,839 map = 100%,
    reduce =100%
    Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001 to: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediateto:
    hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001
    OK
    Time taken: 2241.278 seconds


    *thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *


    On Fri, Aug 20, 2010 at 12:18 PM, Ashish Thusoo wrote:

    I think the query did finish properly. Can you recheck the data to see if
    you would really get a few rows of output?

    Ashish

    ------------------------------
    *From:* Ronak Bhatt
    *Sent:* Friday, August 20, 2010 12:12 PM
    *To:* hive-user@hadoop.apache.org
    *Subject:* question..

    Hi - in my HIVE environment, I ran the following query and expecting to
    see the rows (the data is present)...however, after 2339 seconds, the o/p I
    get is as shown below in the email (I've pasted last 5~10 lines of screen
    output)...

    Is there anything that I'm missing? Did the process finish correctly?
    should there be something that could point me how to debug?



    ----------------------------- Query ------------------------------

    select substr(CB.EXEC_DATE,1,10), count(CB.ID)
    from callbacks CB JOIN
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%
    '
    or p.page like '%google.com/custom%<http://google.com/custom%
    '
    or p.page like '%google.com/#hl%<http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<
    http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%>'
    or p.page like '%google.com/product%<
    http://google.com/product%>' ) s
    ON s.id = cb.id
    group by substr(CB.EXEC_DATE,1,10);


    ================= o/p on screen ==================

    10/08/20 11:13:38 INFO mapred.TaskRunner: Task
    'attempt_local_0001_r_000000_0' done.
    2010-08-20 11:13:39,526 map = 100%, reduce =100%
    10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map = 100%,
    reduce =100%
    Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001 to: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediateto: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001

    OK
    Time taken: 2339.331 seconds
    =====================================================
    *


    thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *
  • Ashish Thusoo at Aug 20, 2010 at 9:23 pm
    Can you decompose the problem a bit .e.g

    can you just count the number of rows returned by the subq to ensure that all the predicates are returning the desired results?

    Also what does count(CB.ID) mean?

    Ashish

    ________________________________
    From: Ronak Bhatt
    Sent: Friday, August 20, 2010 1:56 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: question..

    Hi - Ashish was right that in original query, I was using a wrong field to join e.g. used cb.id<http://cb.id> instead of cb.page_id (however, that one should have also brought the result because the content was integer values comprising of 1,2,3 etc..)

    But to not leave any doubt, I did change the query to join on right fields...i.e. page.id<http://page.id> = cb.page_id ..but I was getting heap error, so put the inline table on left hand and callbacks on right side (as I learned from forum that smaller table on lhs should help address the error)

    The query (pasted below) ran without any error, but the result is the same (pasted below) i.e. no records are coming as a result of query...but I verified in files that the data does exist i.e. there are records in CB where page_id for example has value 1 and that 1 has page like '%google%'

    Any pointers as to what could be going wrong?

    ================== query ================

    select substr(CB.EXEC_DATE,1,10), count(CB.ID<http://CB.ID>)
    from
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%><http://google.com/search%>'
    or p.page like '%google.com/custom%<http://google.com/custom%><http://google.com/custom%>'
    or p.page like '%google.com/#hl%<http://google.com/#hl%><http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%><http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<http://search.yahoo.com/search%><http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%><http://bing.com/search%>'
    or p.page like '%google.com/product%<http://google.com/product%><http://google.com/product%>' ) s join callbacks CB ON s.id<http://s.id> = cb.page_id
    group by substr(CB.EXEC_DATE,1,10);


    ==================query output =============

    10/08/20 13:43:32 INFO mapred.LocalJobRunner: reduce > reduce
    10/08/20 13:43:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
    2010-08-20 01:43:32,839 map = 100%, reduce =100%
    10/08/20 13:43:32 INFO exec.ExecDriver: 2010-08-20 01:43:32,839 map = 100%, reduce =100%
    Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate>
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001>
    OK
    Time taken: 2241.278 seconds


    thanks, ronak

    408 504 4847
    My Blog : http://ronakbaps.posterous.com





    On Fri, Aug 20, 2010 at 12:18 PM, Ashish Thusoo wrote:
    I think the query did finish properly. Can you recheck the data to see if you would really get a few rows of output?

    Ashish

    ________________________________
    From: Ronak Bhatt
    Sent: Friday, August 20, 2010 12:12 PM
    To: hive-user@hadoop.apache.org
    Subject: question..

    Hi - in my HIVE environment, I ran the following query and expecting to see the rows (the data is present)...however, after 2339 seconds, the o/p I get is as shown below in the email (I've pasted last 5~10 lines of screen output)...

    Is there anything that I'm missing? Did the process finish correctly? should there be something that could point me how to debug?



    ----------------------------- Query ------------------------------

    select substr(CB.EXEC_DATE,1,10), count(CB.ID<http://CB.ID>)
    from callbacks CB JOIN
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%><http://google.com/search%>'
    or p.page like '%google.com/custom%<http://google.com/custom%><http://google.com/custom%>'
    or p.page like '%google.com/#hl%<http://google.com/#hl%><http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%><http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<http://search.yahoo.com/search%><http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%><http://bing.com/search%>'
    or p.page like '%google.com/product%<http://google.com/product%><http://google.com/product%>' ) s
    ON s.id<http://s.id> = cb.id<http://cb.id>
    group by substr(CB.EXEC_DATE,1,10);


    ================= o/p on screen ==================

    10/08/20 11:13:38 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
    2010-08-20 11:13:39,526 map = 100%, reduce =100%
    10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map = 100%, reduce =100%
    Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate>
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001>

    OK
    Time taken: 2339.331 seconds
    =====================================================


    thanks, ronak

    408 504 4847
    My Blog : http://ronakbaps.posterous.com
  • Ronak Bhatt at Aug 20, 2010 at 10:15 pm
    yes, I'm trying that approach of one step at a time....count(cb.id) is to
    get the count of id from table callback which is aliased as cb.


    *thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *


    On Fri, Aug 20, 2010 at 2:18 PM, Ashish Thusoo wrote:

    Can you decompose the problem a bit .e.g

    can you just count the number of rows returned by the subq to ensure that
    all the predicates are returning the desired results?

    Also what does count(CB.ID) mean?

    Ashish

    ------------------------------
    *From:* Ronak Bhatt
    *Sent:* Friday, August 20, 2010 1:56 PM

    *To:* hive-user@hadoop.apache.org
    *Subject:* Re: question..

    Hi - Ashish was right that in original query, I was using a wrong field
    to join e.g. used *cb.id* instead of *cb.page_id* (however, that one
    should have also brought the result because the content was integer values
    comprising of 1,2,3 etc..)

    But to not leave any doubt, I did change the query to join on right
    fields...i.e. *page.id = cb.page_id ..*but I was getting heap error, so
    put the inline table on left hand and callbacks on right side (as I learned
    from forum that smaller table on lhs should help address the error)

    The query (pasted below) ran without any error, but the result is the same
    (pasted below) i.e. no records are coming as a result of query...but I
    verified in files that the data does exist i.e. there are records in CB
    where page_id for example has value 1 and that 1 has page like '%google%'

    Any pointers as to what could be going wrong?

    ================== query ================

    select substr(CB.EXEC_DATE,1,10), count(CB.ID)
    from
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%
    '
    or p.page like '%google.com/custom%<http://google.com/custom%
    '
    or p.page like '%google.com/#hl%<http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<
    http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%>'
    or p.page like '%google.com/product%<
    http://google.com/product%>' ) s join callbacks CB ON s.id = cb.page_id
    group by substr(CB.EXEC_DATE,1,10);


    ==================query output =============

    10/08/20 13:43:32 INFO mapred.LocalJobRunner: reduce > reduce
    10/08/20 13:43:32 INFO mapred.TaskRunner: Task
    'attempt_local_0001_r_000000_0' done.
    2010-08-20 01:43:32,839 map = 100%, reduce =100%
    10/08/20 13:43:32 INFO exec.ExecDriver: 2010-08-20 01:43:32,839 map = 100%,
    reduce =100%
    Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001 to: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediateto: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001
    OK
    Time taken: 2241.278 seconds


    *thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *


    On Fri, Aug 20, 2010 at 12:18 PM, Ashish Thusoo wrote:

    I think the query did finish properly. Can you recheck the data to see
    if you would really get a few rows of output?

    Ashish

    ------------------------------
    *From:* Ronak Bhatt
    *Sent:* Friday, August 20, 2010 12:12 PM
    *To:* hive-user@hadoop.apache.org
    *Subject:* question..

    Hi - in my HIVE environment, I ran the following query and expecting to
    see the rows (the data is present)...however, after 2339 seconds, the o/p I
    get is as shown below in the email (I've pasted last 5~10 lines of screen
    output)...

    Is there anything that I'm missing? Did the process finish correctly?
    should there be something that could point me how to debug?



    ----------------------------- Query ------------------------------

    select substr(CB.EXEC_DATE,1,10), count(CB.ID)
    from callbacks CB JOIN
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%
    '
    or p.page like '%google.com/custom%<http://google.com/custom%
    '
    or p.page like '%google.com/#hl%<http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<
    http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%>'
    or p.page like '%google.com/product%<
    http://google.com/product%>' ) s
    ON s.id = cb.id
    group by substr(CB.EXEC_DATE,1,10);


    ================= o/p on screen ==================

    10/08/20 11:13:38 INFO mapred.TaskRunner: Task
    'attempt_local_0001_r_000000_0' done.
    2010-08-20 11:13:39,526 map = 100%, reduce =100%
    10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map =
    100%, reduce =100%
    Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001 to: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediateto: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001

    OK
    Time taken: 2339.331 seconds
    =====================================================
    *


    thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *
  • Ronak Bhatt at Aug 21, 2010 at 1:40 am
    resolved...some hidden characters were causing the issue...

    all - thanks for your help.

    *thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *


    On Fri, Aug 20, 2010 at 2:18 PM, Ashish Thusoo wrote:

    Can you decompose the problem a bit .e.g

    can you just count the number of rows returned by the subq to ensure that
    all the predicates are returning the desired results?

    Also what does count(CB.ID) mean?

    Ashish

    ------------------------------
    *From:* Ronak Bhatt
    *Sent:* Friday, August 20, 2010 1:56 PM

    *To:* hive-user@hadoop.apache.org
    *Subject:* Re: question..

    Hi - Ashish was right that in original query, I was using a wrong field
    to join e.g. used *cb.id* instead of *cb.page_id* (however, that one
    should have also brought the result because the content was integer values
    comprising of 1,2,3 etc..)

    But to not leave any doubt, I did change the query to join on right
    fields...i.e. *page.id = cb.page_id ..*but I was getting heap error, so
    put the inline table on left hand and callbacks on right side (as I learned
    from forum that smaller table on lhs should help address the error)

    The query (pasted below) ran without any error, but the result is the same
    (pasted below) i.e. no records are coming as a result of query...but I
    verified in files that the data does exist i.e. there are records in CB
    where page_id for example has value 1 and that 1 has page like '%google%'

    Any pointers as to what could be going wrong?

    ================== query ================

    select substr(CB.EXEC_DATE,1,10), count(CB.ID)
    from
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%
    '
    or p.page like '%google.com/custom%<http://google.com/custom%
    '
    or p.page like '%google.com/#hl%<http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<
    http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%>'
    or p.page like '%google.com/product%<
    http://google.com/product%>' ) s join callbacks CB ON s.id = cb.page_id
    group by substr(CB.EXEC_DATE,1,10);


    ==================query output =============

    10/08/20 13:43:32 INFO mapred.LocalJobRunner: reduce > reduce
    10/08/20 13:43:32 INFO mapred.TaskRunner: Task
    'attempt_local_0001_r_000000_0' done.
    2010-08-20 01:43:32,839 map = 100%, reduce =100%
    10/08/20 13:43:32 INFO exec.ExecDriver: 2010-08-20 01:43:32,839 map = 100%,
    reduce =100%
    Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001 to: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate
    10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediateto: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001
    OK
    Time taken: 2241.278 seconds


    *thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *


    On Fri, Aug 20, 2010 at 12:18 PM, Ashish Thusoo wrote:

    I think the query did finish properly. Can you recheck the data to see
    if you would really get a few rows of output?

    Ashish

    ------------------------------
    *From:* Ronak Bhatt
    *Sent:* Friday, August 20, 2010 12:12 PM
    *To:* hive-user@hadoop.apache.org
    *Subject:* question..

    Hi - in my HIVE environment, I ran the following query and expecting to
    see the rows (the data is present)...however, after 2339 seconds, the o/p I
    get is as shown below in the email (I've pasted last 5~10 lines of screen
    output)...

    Is there anything that I'm missing? Did the process finish correctly?
    should there be something that could point me how to debug?



    ----------------------------- Query ------------------------------

    select substr(CB.EXEC_DATE,1,10), count(CB.ID)
    from callbacks CB JOIN
    (select * from pages p where
    p.page like '%google.com/search%<http://google.com/search%
    '
    or p.page like '%google.com/custom%<http://google.com/custom%
    '
    or p.page like '%google.com/#hl%<http://google.com/#hl%>'
    or p.page like '%google.com/cse%<http://google.com/cse%>'
    or p.page like '%search.yahoo.com/search%<
    http://search.yahoo.com/search%>'
    or p.page like '%bing.com/search%<http://bing.com/search%>'
    or p.page like '%google.com/product%<
    http://google.com/product%>' ) s
    ON s.id = cb.id
    group by substr(CB.EXEC_DATE,1,10);


    ================= o/p on screen ==================

    10/08/20 11:13:38 INFO mapred.TaskRunner: Task
    'attempt_local_0001_r_000000_0' done.
    2010-08-20 11:13:39,526 map = 100%, reduce =100%
    10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map =
    100%, reduce =100%
    Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001 to: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate
    10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediateto: hdfs://
    hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/10001

    OK
    Time taken: 2339.331 seconds
    =====================================================
    *


    thanks, ronak*
    *
    *
    *408 504 4847*
    *My Blog : http://ronakbaps.posterous.com*
    *
    *
    *
    *

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedAug 20, '10 at 7:12p
activeAug 21, '10 at 1:40a
posts6
users2
websitehive.apache.org

2 users in discussion

Ronak Bhatt: 4 posts Ashish Thusoo: 2 posts

People

Translate

site design / logo © 2021 Grokbase