I have a Python script defined as:
import sys
for line in sys.stdin:
    if not line:
        break
    sys.stdout.write(line)
My test data looks like:
({(19199vzFj6+uRbJf,7388,9074,50|22598,1267739954,0.0020,365,1,1)},1L)
My Pig script is:
temp = STREAM test THROUGH GroupStreamer as
(test_bag:chararray, num_entries: long );
When I run that, my job fails with:
===== Task Information Header =====
Command: TestStream.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)
Start time: Wed Oct 06 17:57:52 PDT 2010
===== * * * =====
/Users/felixgao/Documents/data/logs/TestStream.py: line 1: import: command not found
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: syntax error near unexpected token `if'
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: ` if not line:'
2010-10-06 17:57:52,690 [Thread-21] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'TestStream.py ' failed with exit status: 2
2010-10-06 17:57:52,697 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: 'TestStream.py ' failed with exit status: 2
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:465)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:250)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
What did I do wrong here?
Another question: if I specify the schema by alias as
temp = STREAM Test THROUGH GroupStreamer
as (test_grp_cnt:bag {test_none_zero: tuple(f1:chararray, f2:int, f3:int,
f4:chararray, f5:int, f6:double, f7:int, f8:int, f9:int)}, num_entries:
long );
I get:
2010-10-06 17:38:57,092 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ";" "; "" at line 76, column 179.
Was expecting one of:
")" ...
"," ...
What is the correct way of specifying a bag of tuples based on my data
sample?
Thanks,
Felix