I found it easier to use scripts for special parsing needs. Say you want
to import Apache access logs for analysis: first load the raw log file
into a staging table, say tmp_log_file_staging_parts, then write a query
like this:
INSERT OVERWRITE TABLE access_logs PARTITION(date_str='2011-08-24', machine='foobox1')
SELECT TRANSFORM (json)
USING '/opts/pkgs/python/2.6.1-1/bin/python /opts/scripts/access_log_parser.py'
AS (ts, remote_host, xforwarded_host, time, request, status, size, referer, useragent)
FROM tmp_log_file_staging_parts
WHERE tmp_log_file_staging_parts.ts='1317074469:565074000'
Inside the Python script you would then do the complex parsing logic to
extract the fields before that data lands in the Hive table partition.
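For what it's worth, here is a minimal sketch of what such a TRANSFORM
script could look like for the Apache combined log format. Hive streams
the selected columns to the script as tab-separated lines on stdin and
reads tab-separated output rows from stdout. The regex and the
placeholder handling of ts and xforwarded_host below are my assumptions,
not the actual access_log_parser.py:

    #!/usr/bin/env python
    # Hypothetical sketch of a Hive TRANSFORM script (Python 2).
    # Reads one raw log line per input row, writes one tab-separated
    # output row matching the AS (...) column list in the query.
    import re
    import sys

    # Assumed combined-log-format pattern; adjust to your log layout.
    LINE_RE = re.compile(
        r'(?P<remote_host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+) '
        r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"')

    for line in sys.stdin:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip records that do not parse
        fields = [
            '-',  # ts: derive from the parsed timestamp as needed
            m.group('remote_host'),
            '-',  # xforwarded_host, if your logs carry it
            m.group('time'),
            m.group('request'),
            m.group('status'),
            m.group('size'),
            m.group('referer'),
            m.group('useragent'),
        ]
        print '\t'.join(fields)

A nice side effect is that you can test the script outside Hive, e.g.
with something like: cat access.log | python access_log_parser.py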
The best part of this approach is that you can use any script you want. ;)
Hope this helps.
2011/9/22 王志强 <wangzhiqiang@360.cn>
Hi, guys,

I use Hive to compute statistics over our logs. To parse the logs more
flexibly, I overrode TextInputFormat, but if I only need to process log
records that contain a particular keyword, how can I filter those logs?
王志强
Systems Department (系统部)
Qihoo 360 (奇虎360)
Mobile: 13488627521
Email: wangzhiqiang@360.cn
Address: Room 202, Tower C, Huitong Times Square, 71 Jianguo Road, Chaoyang District, Beijing 100025