Thanks to Thad and Nick.

Your suggestions are very useful! I'll keep trying.


On Wednesday, April 30, 2014 at 8:33:21 PM UTC+8, nick wrote:
Hello Ivan

One way to go about your problem would be to write your own Hive SerDe:
pros: simple to write, very close to your needs, and you can add extra
cleaning and parsing.
cons: you have to write code

Another way would be a first Hive table split on the most common delimiter,
say ",": create an external table ... with field delimiter ",",
and from that table create a view to split the leftover fields. Say
"complexfield" is a text "aa|bbb" and needs to be split on the "|": the view
would show all the good fields and explicitly split the complex fields
further.
pros: simple Hive table, and the view is easy to change
cons: the view is always computed from the initial table, so it is slower

Finally, you could use Hive's RegexSerDe to split what you want from
the data, how you want it.
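
A minimal sketch of the two-stage split from the second approach above, written in Python rather than HiveQL so it can be run on its own. The function name `split_row`, the sample row, and the position of the complex field are assumptions for illustration only. In Hive the first stage would be the base table and the second stage the view.

```python
def split_row(line, primary=",", secondary="|", complex_index=2):
    """Split on the primary delimiter, then split one 'complex' field further."""
    # Stage 1 (the base table): one column per primary-delimited field.
    fields = line.split(primary)
    # Stage 2 (the view): the chosen column is split again on the secondary delimiter.
    parts = fields[complex_index].split(secondary)
    return fields[:complex_index] + parts + fields[complex_index + 1:]


# Hypothetical row: the third field holds "aa|bbb" as in the example above.
print(split_row("2014-04-29,host1,aa|bbb,ok"))
# ['2014-04-29', 'host1', 'aa', 'bbb', 'ok']
```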

hope any of this helps


On Tue, Apr 29, 2014 at 9:27 PM, Thad Scalf <thad....@edointeractive.com> wrote:
We do something similar with Syslog data we capture via Flume. We use Pig
to parse and aggregate the little files Flume creates into a five-column Hive
table: log_date (partitioning column), date_time, message_id, priority, and
message_string. We then use Impala/Hive with something like "select
regexp_extract(message_string, '(User <|Username = )([^>,]*)', 2) as
Logon from captured_data_table".

I'm thinking something similar would work for you. You may have more
initial columns, but you can have Hive/Impala parse the message string for
the substring you need at request time, rather than at parse-and-aggregate
time.
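
To check what that pattern captures before running it over a table, the same regular expression can be tried in Python. This is only a sketch: the function name `extract_logon` and the two sample messages are made up, and returning an empty string when nothing matches is a choice made here, not a statement about Hive or Impala behavior.

```python
import re

# Same pattern as in the query above: group 1 is the prefix, group 2 the user name.
LOGON_RE = re.compile(r"(User <|Username = )([^>,]*)")


def extract_logon(message_string):
    """Return capture group 2 of the first match, or "" when there is no match."""
    match = LOGON_RE.search(message_string)
    return match.group(2) if match else ""


# Hypothetical message strings, one for each alternative in the pattern.
print(extract_logon("Login succeeded: User <alice> from 10.0.0.5"))  # alice
print(extract_logon("Group = vpn, Username = bob, IP = 10.0.0.6"))   # bob
```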

Thanks,

Thad


On Tue, Apr 29, 2014 at 12:35 PM, Darren Lo <d...@cloudera.com> wrote:
Moving to cdh-user


On Tue, Apr 29, 2014 at 10:07 AM, Ivan Hsueh <ivan....@gmail.com> wrote:
The CEF Log format is used by ArcSight.

Here is the sample log:

CEF:0|Check Point|FireWall-1|4.1|accept|CP FW In Action:accept
Service:telnet Rule:5 ( Sec Log)|Low| eventId=116
externalId=arcsightDemo:54 proto=TCP customerURI=/All Customers/ArcNet
Customers/west.arcnet categorySignificance=/Normal categoryBehavior=/Access
categoryDeviceGroup=/Firewall catdt=Firewall categoryOutcome=/Success
categoryObject=/Host/Application/Service art=1398755279514 act=accept
rt=1398755279514 deviceDirection=0 shost=node9774.dslzn23.pacbell.net
src=192.168.10.138 sourceZoneURI=/All Zones/System Zones/Private Address
Space spt=2814 dhost=w2ksj101.sj1.west.arcnet.com dst=209.128.98.149
destinationZoneURI=/All Zones/ArcNet Zones/west.arcnet.com - external
destinationTranslatedAddress=10.0.20.21 destinationTranslatedZoneURI=/All
Zones/ArcNet Zones/sj1.west.arcnet.com - internal dproc=telnet
fileType=security cs1=/Pass/Accept cs2=eth-s1p4c0 cs3=inbound cs4=5 cn2=0
cn3=0 cs1Label=v2.x ArcSight Category cs2Label=v2.x Custom String
cs3Label=v2.x Custom String cs4Label=v2.x Custom String cs5Label=v2.x
Custom String cs6Label=v2.x Custom String cn1Label=v2.x Custom Number
cn2Label=v2.x Custom Number cn3Label=v2.x Custom Number
deviceCustomDate1Label=v2.x Custom Date deviceCustomDate2Label=v2.x Custom
Date ahost=fe80:0:0:0:d12a:31e3:8dca:9d20%11 agt=192.168.217.129
agentZoneURI=/All Zones/ArcNet Zones/sj2.west.arcnet.com - internal
av=2.1.0.3401.0 atz=America/Chicago aid=3XPpfc0UBABCAAUZDy8Vfdw\=\=
at=checkpointfirewall_opsec dvchost=cpfwsj104.sj1.west.arcnet.com dvc=10.0.112.3
deviceZoneURI=/All Zones/ArcNet Zones/sj2.west.arcnet.com - internal dtz=America/Chicago
deviceInboundInterface=eth-s1p4c0 _cefVer=0.1

The delimiters are mixed: pipes, colons, and key=value tags (e.g. cs1, cs2,
src, dst, etc.).

But both Pig and Hive expect a single delimiter when parsing logs.

If I just need to extract specific tags (or values) for calculations
(e.g. src, dst), like counting the top 10 connection IP pairs, is there a
way to do this?
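
One way to prototype this outside Pig and Hive is sketched below in Python. It assumes each event sits on a single line, that the CEF header is the first seven pipe-delimited fields with the key=value extension after them, and that no header field contains an escaped pipe. The names `parse_extension` and `top_ip_pairs` and the shortened sample lines are made up for illustration.

```python
import re
from collections import Counter

# A key is a run of word characters followed by "=", at the start of the
# extension or after whitespace. An escaped "\=" inside a value is not
# treated as a key because the backslash breaks the word run.
KEY_RE = re.compile(r"(?:^|\s)(\w+)=")


def parse_extension(ext):
    """Turn 'k1=v1 k2=v with spaces k3=v3' into a dict (values may contain spaces)."""
    pairs = {}
    matches = list(KEY_RE.finditer(ext))
    for i, m in enumerate(matches):
        # A value runs until the next key starts, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ext)
        pairs[m.group(1)] = ext[m.end():end].strip()
    return pairs


def top_ip_pairs(lines, n=10):
    """Count (src, dst) pairs over CEF lines and return the n most common."""
    counts = Counter()
    for line in lines:
        parts = line.split("|", 7)  # 7 header fields + the extension
        if len(parts) < 8 or not parts[0].startswith("CEF:"):
            continue  # skip lines that do not look like CEF
        ext = parse_extension(parts[7])
        if "src" in ext and "dst" in ext:
            counts[(ext["src"], ext["dst"])] += 1
    return counts.most_common(n)


# Hypothetical, shortened events modeled on the sample log above.
HEADER = "CEF:0|Check Point|FireWall-1|4.1|accept|CP FW In Action:accept|Low| "
sample = [
    HEADER + "eventId=116 src=192.168.10.138 sourceZoneURI=/All Zones/Private Address Space dst=209.128.98.149",
    HEADER + "eventId=117 src=192.168.10.138 dst=209.128.98.149 dproc=telnet",
    HEADER + "eventId=118 src=10.0.0.7 dst=209.128.98.149",
]
for pair, count in top_ip_pairs(sample):
    print(pair, count)
```

The same key-then-value logic could later be moved into a Pig or Hive UDF or a custom SerDe. A value that itself contains whitespace followed by `word=` would be split incorrectly by this sketch.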

Thanks all!

To unsubscribe from this group and stop receiving emails from it, send
an email to scm-users+...@cloudera.org.
--

---
You received this message because you are subscribed to the Google
Groups "CDH Users" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to cdh-user+u...@cloudera.org.
For more options, visit
https://groups.google.com/a/cloudera.org/d/optout.


--

Thad Scalf | *edo* <http://www.edointeractive.com/>

Senior Linux Administrator


3841 Green Hills Village Dr. #425 Nashville, TN

*p: *615.297.6080 *-* x163

*e:* thad....@edointeractive.com




--
Nicolas Maillard project manager| 55 | fifty-five.com<http://www.fifty-five.com/pourquoi-55>
4, place de l'Opéra, 75002 Paris
