You could try RegexSerDe to deserialize using regular expression. Here is a good example:http://books.google.com/books?id=Nff49D7vnJcC&lpg=PA391&ots=IicwYn7zOq&dq=ROW%20FORMAT%20SERDE%20input.regex&pg=PA391#v=onepage&q=ROW%20FORMAT%20SERDE%20input.regex&f=false
From: Mark Kerzner <email@example.com
Date: Sat, 24 Sep 2011 22:43:23 -0500
Subject: Re: How to load quote-separated fields?
That is what I ended up doing - since I could not change the format of the existing logs, I wrote a utility<https://github.com/markkerzner/WebLogAnalyzer/blob/master/src/main/java/com/shmsoft/webloganalyzer/ApacheWebLog.java
> to convert them to something more standard that Hive can easily accept.
2011/9/24 longmans163 <firstname.lastname@example.org
hi, Mark, I saw ""GET /dynLink/?" contains a space, since hive should recognize this as a FIELDS TERMINATED which you have defined before. I think you should encode the spaces to other non-terminate char.
At 2011-09-23 04:58:59,"Mark Kerzner" wrote:
I have an apache web log (sample below), and want to LOAD DATA INPATH.
My fields are separated by a space, and those that contains spaces are enclosed in quotes.
I tried this,
ROW FORMAT DELIMITED
FIELDS TERMINATED BY " "
COLLECTION ITEMS TERMINATED BY '"'
MAP KEYS TERMINATED BY ","
but it did not work, and thought that GET is a separate field. What should I change?
[01/May/2011:00:00:00 +0000] 188.8.131.52 TLSv1 RC4-MD5 "GET /dynLink/?PCD=CHICHHH&EBC=3425154412&RCC=D2RVX&GAD=20110426&NMN=2&NOA=1& amp;NOC=0&LNG=en&TBP=325.43&GEM=STEPHENCLAUDENELSON%40GMAIL.COM<http://40GMAIL.COM
>&GEN=&GSL=&GLN=NELSON&GFN=STEPHEN&GCC=&GST=&GCT=&GPC=&GAR=&GPN=&PRT=0&PLC=&PCC=brandwebsite&PSC=&SRP=CIBMS0&PID=HIL&PET=WEB&GNR=1&CRP=0901452 HTTP/1.1" 200 95 0 99885 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152<tel:3.0.4506.2152>; .NET CLR 3.5.30729; InfoPath.2; .NET4.0C; .NET4.0E; MS-RTC LM 8)" "https://secure.hilton.com/en/hi/res/retrieved_reservation.jhtml;jsessionid=UIBJ2MH0JDJPOCSGBIYMVCQ?_requestid=153483" "t=1304208000431979" "D=99766"
The contents of this message, together with any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain information that is confidential and exempt from disclosure. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this message, or any attachment, is strictly prohibited. If you have received this message in error, please notify the original sender immediately by telephone or by return E-mail and delete this message, along with any attachments, from your computer. Thank you.