Grokbase Groups Hive user May 2011
I've been using Hive in production for two months now. We're mainly using
it for processing server logs, about 1-2GB per day (2-2.5 million
requests). Typically we import a day's worth of logs at once. That said,
sometimes we decide to tweak a calculated column. When that happens, we
modify our transformation script and re-import the entire set of logs (~200
days) into ~600 partitions.

A few days ago I noticed that simple queries, such as a count of page views
over a given week, were giving results up to 10% higher than they yielded
just a week before. I suspected that we may have "found" unprocessed log
files, so I set up a script to re-import the entire inventory of logs and
re-run the queries. I got identical results for some weeks, but different
results for others. When I repeated the experiment, the results changed
again.

In the course of this I found that sometimes Hive will create all of the
partitions but write no data to them while not reporting any errors in the
job tracker. Other times it will fail and leave a stack trace blaming a
broken pipe.

Does anyone have any ideas what I may be doing wrong? I can change our
practices whichever way; all I want is confidence that all of my data has
been properly imported.
Thanks,
Tim


  • Ning Zhang at May 11, 2011 at 9:33 pm
    Hive queries are compiled into different types of tasks (MapReduce, MoveTask, etc.), so a successful MR task as shown in the JT doesn't mean the whole query succeeded. You need to examine the status of the Hive query itself to see whether it succeeded. You can also check Hive's log file under /tmp/<user>/hive.log to debug a failed query.

    Broken pipe errors are mostly caused by the script crashing during the MapReduce job. In that case the MR job should fail, and so should the whole Hive query.
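
One way to act on this is to check the outcome of the Hive command itself rather than the JT page. The sketch below is only an illustration: it assumes the `hive` CLI is on the PATH, that it accepts a query via `-e`, and that it exits with a non-zero status when any task of the query fails (worth verifying on your version). The function names are hypothetical.

```python
import subprocess
import sys


def run_query(cmd):
    """Run one command to completion; return (succeeded, last stderr lines)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    # A zero exit status is treated as success; anything else as failure.
    return proc.returncode == 0, proc.stderr.splitlines()[-20:]


def hive_command(query):
    """Build the argument list for a one-off Hive CLI query (assumed flags)."""
    return ["hive", "-e", query]


def report(succeeded, stderr_tail, out=sys.stderr):
    """Print the captured stderr tail only when the command failed."""
    if not succeeded:
        for line in stderr_tail:
            print(line, file=out)
    return succeeded
```

A caller would run something like `report(*run_query(hive_command("...")))` and stop the import as soon as it returns `False`, instead of trusting the MR job status alone.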
  • Tim Spence at May 12, 2011 at 12:06 am
    Thank you. Are there tools for parsing the Hive logs for errors? If not,
    can you talk about the strategy used at Facebook to deal with detection and
    resolution of MR errors?

    Perhaps I can write a script to identify errors. First I have to solve the
    mystery of why there are no logs on my Hadoop master.
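
A script of that kind could start as a simple line scanner. The sketch below is a guess at what it might look like, not a known tool: it assumes failures appear in the log as lines carrying an `ERROR` or `FATAL` level token or a Java class name ending in `Exception` or `Error`, which should be checked against a real log before relying on it. The names `ERROR_PATTERN`, `find_error_lines`, and `scan_log` are hypothetical.

```python
import re

# Assumption: error lines contain a level token (ERROR/FATAL) or a Java
# throwable name such as IOException or OutOfMemoryError.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b|\b\w+(Exception|Error)\b")


def find_error_lines(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return find_error_lines(handle)
```

Calling `scan_log` on the Hive log file after each import gives a list that is empty when nothing suspicious was logged, which is easy to turn into an alert.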

    I'm trying now to import each day's server logs one at a time (instead of
    importing all logs in one Hive command) to see if that solves my issue with
    inconsistent results after mass loading of server logs. I'll post an update
    if I find anything useful.
    Tim
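
The day-at-a-time approach can be sketched as a loop that records which days failed so they can be retried or inspected. This is a minimal illustration under assumptions: `import_one_day` is a hypothetical callable that runs the import for one date and returns `True` only if the whole Hive query succeeded.

```python
from datetime import date, timedelta


def days_between(first, last):
    """Yield each date from first to last, inclusive."""
    current = first
    while current <= last:
        yield current
        current += timedelta(days=1)


def import_each_day(days, import_one_day):
    """Import one day at a time; return the list of days that failed."""
    failed = []
    for day in days:
        try:
            ok = import_one_day(day)
        except Exception:
            # Treat a crash in the import step the same as a reported failure.
            ok = False
        if not ok:
            failed.append(day)
    return failed
```

Importing per day keeps each failure tied to a single partition's worth of input, so an empty result list is the signal that every day loaded without a reported error.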




Posted May 11, 2011 at 9:16 pm; last active May 12, 2011 at 12:06 am.