FAQ
Hi,

I have an apache web log (sample below), and want to LOAD DATA INPATH.

My fields are separated by a space, and those that contains spaces are
enclosed in quotes.

I tried this,

ROW FORMAT DELIMITED
FIELDS TERMINATED BY " "
COLLECTION ITEMS TERMINATED BY '"'
MAP KEYS TERMINATED BY ","

but it did not work, and thought that GET is a separate field. What should I
change?

Thank you,
Mark


[01/May/2011:00:00:00 +0000] 68.115.109.118 TLSv1 RC4-MD5 "GET
/dynLink/?PCD=CHICHHH&EBC=3425154412&RCC=D2RVX&GAD=20110426&NMN=2&NOA=1&NOC=0&LNG=en&TBP=325.43&GEM=STEPHENCLAUDENELSON%
40GMAIL.COM&GEN=&GSL=&GLN=NELSON&GFN=STEPHEN&GCC=&GST=&GCT=&GPC=&GAR=&GPN=&PRT=0&PLC=&PCC=brandwebsite&PSC=&SRP=CIBMS0&PID=HIL&PET=WEB&GNR=1&CRP=0901452
HTTP/1.1" 200 95 0 99885 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1;
.NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2;
.NET4.0C; .NET4.0E; MS-RTC LM 8)" "
https://secure.hilton.com/en/hi/res/retrieved_reservation.jhtml;jsessionid=UIBJ2MH0JDJPOCSGBIYMVCQ?_requestid=153483"
"t=1304208000431979" "D=99766"

Search Discussions

  • Longmans163 at Sep 25, 2011 at 3:26 am
    hi, Mark, I saw ""GET /dynLink/?" contains a space, since hive should recognize this as a FIELDS TERMINATED which you have defined before. I think you should encode the spaces to other non-terminate char.



    At 2011-09-23 04:58:59,"Mark Kerzner" wrote:
    Hi,

    I have an apache web log (sample below), and want toLOAD DATA INPATH.

    My fields are separated by a space, and those that contains spaces are enclosed in quotes.

    I tried this,

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY " "
    COLLECTION ITEMS TERMINATED BY '"'
    MAP KEYS TERMINATED BY ","

    but it did not work, and thought that GET is a separate field. What should I change?

    Thank you,
    Mark


    [01/May/2011:00:00:00 +0000] 68.115.109.118 TLSv1 RC4-MD5 "GET /dynLink/?PCD=CHICHHH&EBC=3425154412&RCC=D2RVX&GAD=20110426&NMN=2&NOA=1&NOC=0&LNG=en&TBP=325.43&GEM=STEPHENCLAUDENELSON%40GMAIL.COM&GEN=&GSL=&GLN=NELSON&GFN=STEPHEN&GCC=&GST=&GCT=&GPC=&GAR=&GPN=&PRT=0&PLC=&PCC=brandwebsite&PSC=&SRP=CIBMS0&PID=HIL&PET=WEB&GNR=1&CRP=0901452 HTTP/1.1" 200 95 0 99885 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; .NET4.0C; .NET4.0E; MS-RTC LM 8)" "https://secure.hilton.com/en/hi/res/retrieved_reservation.jhtml;jsessionid=UIBJ2MH0JDJPOCSGBIYMVCQ?_requestid=153483" "t=1304208000431979" "D=99766"
  • Mark Kerzner at Sep 25, 2011 at 3:43 am
    That is what I ended up doing - since I could not change the format of the
    existing logs, I wrote a
    utility<https://github.com/markkerzner/WebLogAnalyzer/blob/master/src/main/java/com/shmsoft/webloganalyzer/ApacheWebLog.java>to
    convert them to something more standard that Hive can easily accept.

    Thank you,
    Mark

    2011/9/24 longmans163 <longmans163@163.com>
    hi, Mark, I saw ""GET /dynLink/?" contains a space, since hive should
    recognize this as a FIELDS TERMINATED which you have defined before. I
    think you should encode the spaces to other non-terminate char.


    At 2011-09-23 04:58:59,"Mark Kerzner" wrote:

    Hi,

    I have an apache web log (sample below), and want to LOAD DATA INPATH.

    My fields are separated by a space, and those that contains spaces are
    enclosed in quotes.

    I tried this,

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY " "
    COLLECTION ITEMS TERMINATED BY '"'
    MAP KEYS TERMINATED BY ","

    but it did not work, and thought that GET is a separate field. What should
    I change?

    Thank you,
    Mark


    [01/May/2011:00:00:00 +0000] 68.115.109.118 TLSv1 RC4-MD5 "GET
    /dynLink/?PCD=CHICHHH&EBC=3425154412&RCC=D2RVX&GAD=20110426&NMN=2&NOA=1&
    amp;NOC=0&LNG=en&TBP=325.43&GEM=STEPHENCLAUDENELSON%40GMAIL.COM&GEN=&GSL=&GLN=NELSON&GFN=STEPHEN&GCC=&GST=&GCT=&GPC=&GAR=&GPN=&PRT=0&PLC=&PCC=brandwebsite&PSC=&SRP=CIBMS0&PID=HIL&PET=WEB&GNR=1&CRP=0901452
    HTTP/1.1" 200 95 0 99885 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1;
    .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729;
    InfoPath.2; .NET4.0C; .NET4.0E; MS-RTC LM 8)" "
    https://secure.hilton.com/en/hi/res/retrieved_reservation.jhtml;jsessionid=UIBJ2MH0JDJPOCSGBIYMVCQ?_requestid=153483"
    "t=1304208000431979" "D=99766"


  • Vince Hoang at Sep 25, 2011 at 3:59 am
    Hi Mark,

    You could try RegexSerDe to deserialize using regular expression. Here is a good example:

    http://books.google.com/books?id=Nff49D7vnJcC&lpg=PA391&ots=IicwYn7zOq&dq=ROW%20FORMAT%20SERDE%20input.regex&pg=PA391#v=onepage&q=ROW%20FORMAT%20SERDE%20input.regex&f=false

    Good luck,
    Vince


    From: Mark Kerzner <mark.kerzner@shmsoft.com
    Reply-To: <user@hive.apache.org
    Date: Sat, 24 Sep 2011 22:43:23 -0500
    To: <user@hive.apache.org
    Subject: Re: How to load quote-separated fields?

    That is what I ended up doing - since I could not change the format of the existing logs, I wrote a utility<https://github.com/markkerzner/WebLogAnalyzer/blob/master/src/main/java/com/shmsoft/webloganalyzer/ApacheWebLog.java> to convert them to something more standard that Hive can easily accept.

    Thank you,
    Mark

    2011/9/24 longmans163 <longmans163@163.com
    hi, Mark, I saw ""GET /dynLink/?" contains a space, since hive should recognize this as a FIELDS TERMINATED which you have defined before. I think you should encode the spaces to other non-terminate char.


    At 2011-09-23 04:58:59,"Mark Kerzner" wrote:
    Hi,

    I have an apache web log (sample below), and want to LOAD DATA INPATH.

    My fields are separated by a space, and those that contains spaces are enclosed in quotes.

    I tried this,

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY " "
    COLLECTION ITEMS TERMINATED BY '"'
    MAP KEYS TERMINATED BY ","

    but it did not work, and thought that GET is a separate field. What should I change?

    Thank you,
    Mark


    [01/May/2011:00:00:00 +0000] 68.115.109.118 TLSv1 RC4-MD5 "GET /dynLink/?PCD=CHICHHH&EBC=3425154412&RCC=D2RVX&GAD=20110426&NMN=2&NOA=1& amp;NOC=0&LNG=en&TBP=325.43&GEM=STEPHENCLAUDENELSON%40GMAIL.COM<http://40GMAIL.COM>&GEN=&GSL=&GLN=NELSON&GFN=STEPHEN&GCC=&GST=&GCT=&GPC=&GAR=&GPN=&PRT=0&PLC=&PCC=brandwebsite&PSC=&SRP=CIBMS0&PID=HIL&PET=WEB&GNR=1&CRP=0901452 HTTP/1.1" 200 95 0 99885 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152<tel:3.0.4506.2152>; .NET CLR 3.5.30729; InfoPath.2; .NET4.0C; .NET4.0E; MS-RTC LM 8)" "https://secure.hilton.com/en/hi/res/retrieved_reservation.jhtml;jsessionid=UIBJ2MH0JDJPOCSGBIYMVCQ?_requestid=153483" "t=1304208000431979" "D=99766"





    The contents of this message, together with any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain information that is confidential and exempt from disclosure. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this message, or any attachment, is strictly prohibited. If you have received this message in error, please notify the original sender immediately by telephone or by return E-mail and delete this message, along with any attachments, from your computer. Thank you.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedSep 22, '11 at 8:59p
activeSep 25, '11 at 3:59a
posts4
users3
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase