Grokbase Groups Pig user June 2011
Hello,

I'm trying to write a Pig script to parse a CSV file, and I'm having problems with the FLATTEN and EXTRACT functions. When I run the script below, I get:

ERROR 1017: Schema mismatch. A basic type on flattening cannot have more than one column. User defined schema: {startip: chararray,endip: chararray,country: chararray,region: chararray,city: chararray,postal: chararray,lat: chararray,lon: chararray,dma: chararray,areacode: chararray}

If I take FLATTEN out, I get this instead:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered "" at line 27, column 6.
Was expecting one of:



Here is an example of my data:
startIpNum,endIpNum,country,region,city,postalCode,latitude,longitude,dmaCode,areaCode
1.0.0.0,1.0.0.255,"AU","","","",-27.0000,133.0000,,
1.0.1.0,1.0.1.255,"FR","B8","Avignon","",43.9500,4.8167,,

Here is my program:

--declare udf
REGISTER file:/usr/lib/pig/contrib/piggybank/java/piggybank.jar

--define aliases for any classes you want to use
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();

--load in data
rawlogs = load 'geoshort.csv' using TextLoader as (line:chararray);

--print out a couple lines of data
illustrate rawlogs;

logbase = foreach rawlogs generate
FLATTEN(
EXTRACT(line, '^(\\S+) (\\S+) "(.+?)" "(.+?)" "(.+?)" "(.+?)" (\\S+) (\\S+) (\\S+) (\\S+)')
)
as (
startip: chararray,
endip: chararray,
country: chararray,
region: chararray,
city: chararray,
postal: chararray,
lat: chararray,
lon: chararray,
dma: chararray,
areacode: chararray
);

illustrate logbase;



Thanks in advance.
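
Note on the script above: two things look likely to cause trouble. The pattern separates fields with spaces while the sample rows are comma-separated, and, as far as I can tell, piggybank's RegexExtract returns a single chararray (one capture group, selected by an index argument), which would explain ERROR 1017 when FLATTEN is asked to produce ten columns. The following is a minimal sketch, not a tested fix. It assumes Pig 0.8 or later, where REGEX_EXTRACT_ALL is a builtin that returns a tuple of all capture groups, and it assumes every data row quotes exactly the country, region, city and postal fields, as the two sample rows do. The alias names datarows and logbase are only illustrative.

```pig
-- Sketch only: assumes Pig 0.8+ (REGEX_EXTRACT_ALL is a builtin there).
rawlogs = LOAD 'geoshort.csv' USING TextLoader AS (line:chararray);

-- Drop the header row; MATCHES must match the whole line.
datarows = FILTER rawlogs BY NOT (line MATCHES 'startIpNum.*');

-- REGEX_EXTRACT_ALL returns one tuple holding every capture group,
-- so FLATTEN can spread it across the ten declared columns.
-- The pattern is comma-delimited and assumes fields 3 to 6 are quoted.
logbase = FOREACH datarows GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '^([^,]*),([^,]*),"([^"]*)","([^"]*)","([^"]*)","([^"]*)",([^,]*),([^,]*),([^,]*),([^,]*)$')
    ) AS (
        startip: chararray,
        endip: chararray,
        country: chararray,
        region: chararray,
        city: chararray,
        postal: chararray,
        lat: chararray,
        lon: chararray,
        dma: chararray,
        areacode: chararray
    );

ILLUSTRATE logbase;
```

Because the pattern contains no backslashes, nothing needs to be escaped inside the Pig string literal. Rows that do not fit the pattern (for example a row with an unquoted country) will not match, so it is worth checking the output for nulls before relying on it.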

Posted by Ross Nordeen, Jun 21, '11 at 9:03p
Categories: pig, hadoop
Website: pig.apache.org