Grokbase Groups Pig user March 2012

86 discussions - 341 posts

  • Hi, I have a pig script which does a simple GROUPing followed by counting and I get this error. My data is certainly not that big for it to cause this out of memory error. Is there a chance that this ...
    Rohini U
    Mar 21, 2012 at 7:34 pm
    Mar 23, 2012 at 8:23 pm
  • We currently have 100s of GB of uncompressed data which we would like to zip using some compression that is block compression so that we can use multiple input splits. Does pig support any such ...
    Mohit Anchlia
    Mar 28, 2012 at 4:45 pm
    Apr 5, 2012 at 3:05 pm
  • Hi guys, I use Pig to process some clickstream data. I need to track a new field, so I added a new field to my avro schema, and changed my Pig script accordingly. It works fine with the new files ...
    IGZ Nick
    Mar 28, 2012 at 8:22 pm
    Apr 2, 2012 at 5:34 pm
  • I need to put a small shared file on the distributed cache so I can load it in my UDF in Pig 0.7. We are using Hadoop 0.20.2+228. I tried to run it using ...
    Felix gao
    Mar 17, 2012 at 12:32 am
    Mar 20, 2012 at 12:58 am
  • Hi I'm following a short tutorial from I have a running HBase cluster and Hadoop cluster. Steps I've performed: - prepared a sample input ...
    Marcin Cylke
    Mar 7, 2012 at 3:48 pm
    Mar 9, 2012 at 3:07 pm
  • I am running a script to load data in the database. When I use [0-4] I see 2 rows being created for every record that I process. But when I run them individually then it works. Could someone please ...
    Mohit Anchlia
    Mar 23, 2012 at 11:57 pm
    Mar 28, 2012 at 5:04 am
  • I am reading a bunch of columns from a flat file and inserting them into the database. Is there a way to also insert a date?
    Mohit Anchlia
    Mar 22, 2012 at 7:48 pm
    Mar 23, 2012 at 1:04 am
  • Hi all, I'm using pig with protobuf and I have some byte fields containing serialized protobuf data. Is it possible to handle this nested serialized data with pig? ex. message A { required bytes data ...
    Benjamin Juhn
    Mar 26, 2012 at 10:30 pm
    Apr 3, 2012 at 11:47 pm
  • Hello all, I'm trying to store a bag of tuples using AvroStorage but am not able to figure out what I'm doing wrong (or if it's supported). What I have is the following: grunt> illustrate c; .... ... ...
    Dan Young
    Mar 25, 2012 at 4:36 am
    Apr 3, 2012 at 6:21 pm
  • Hello, I'm new to these lists. I'm trying to get Pig working, for my first time. I have set up Hadoop and HBase (on HDFS) using the pseudo-distributed setup, all on one machine. I am able to run ...
    Ryan Cole
    Mar 23, 2012 at 1:17 am
    Mar 23, 2012 at 3:16 pm
  • Hi, I need to initialize the HBase connection, which I normally do in configure() in the Mapper, and then my mapper uses it. How do I do it in Pig? I am ready to define a UDF that will return a ...
    Mark Kerzner
    Mar 7, 2012 at 1:02 am
    Mar 9, 2012 at 2:31 am
  • Hi, I'm loading a bunch of data into Pig using CassandraStorage. When I do a dump and/or store, the amount of data that is outputted is actually only 2-3% of the amount of data in Cassandra ...
    Dan Feldman
    Mar 29, 2012 at 6:25 pm
    Apr 10, 2012 at 1:25 am
  • Hey all, I'm trying to write a script to pull the count of a dataset that I've filtered. Here's the script so far: /* scans by title */ scans = LOAD '/hive/scans/*' USING PigStorage(',') AS ...
    Jason Alexander
    Mar 22, 2012 at 8:29 pm
    Mar 22, 2012 at 9:46 pm
  • -- Russell Jurney
    Russell Jurney
    Mar 2, 2012 at 1:20 am
    Mar 2, 2012 at 10:24 pm
  • Hello all, I used to run pig on the same node where the hadoop job tracker is running and everything was fine. Now I am trying to run pig on my laptop to access the cluster where hadoop is running ...
    Iman E
    Mar 28, 2012 at 5:19 pm
    Mar 28, 2012 at 7:25 pm
  • One record in a 125MB avro file is killing my script. I could patch AvroStorage() to catch the exception and return null after logging an error - I think. Should I? -- Russell Jurney ...
    Russell Jurney
    Mar 24, 2012 at 2:03 am
    Mar 25, 2012 at 11:04 pm
  • Hi All, I have a situation where I need to create a relation by a combination of UDF and parameter values. For example, first field will be generated by UDF UUIDGenerator, second field by parameter ...
    Rakesh sharma
    Mar 13, 2012 at 9:38 pm
    Mar 14, 2012 at 2:35 am
  • - dev@pig + user@pig What command are you using to run this? Are you upping the max heap? 2012/3/28 Herbert Mühlburger <
    Jonathan Coveney
    Mar 28, 2012 at 4:28 pm
    Mar 30, 2012 at 6:38 am
  • In this - The following precedence order is supported: -D Pig property -P properties file set command. This means that if the ...
    Mar 27, 2012 at 6:44 pm
    Mar 28, 2012 at 8:13 am
  • I'm having a possible issue with a simple pig load that writes to an HBase table. The issue is that when I run the test pig script it does not invoke the region observer coprocessor on the table. I ...
    Mar 23, 2012 at 5:54 am
    Mar 28, 2012 at 5:05 am
  • Hey guys, Continuing on in my Pig education, I'm trying to pivot my previous script to give me a break down of count by title. The script I have so far is: /* scans grouped by title */ scans = LOAD ...
    Jason Alexander
    Mar 26, 2012 at 5:39 pm
    Mar 26, 2012 at 7:56 pm
  • Pig users and developers, The Apache Pig PMC is pleased to announce the new additions to the Pig project: * Jonathan Coveney is now an Apache Pig committer * Julien Le Dem is now an Apache Pig PMC member ...
    Daniel Dai
    Mar 20, 2012 at 12:04 am
    Mar 20, 2012 at 8:56 pm
  • Hi all, I just tested a very simple pig script as follows: records = LOAD '$input' AS (hash:chararray, domain:chararray, host:chararray, page:chararray, freq:int); grpd = GROUP records BY (domain, ...
    Yen SYU
    Mar 13, 2012 at 7:00 pm
    Mar 16, 2012 at 2:28 pm
  • Dear All: this is the description of wiki about distinct: grunt> A = load 'mydata' using PigStorage() as (a, b, c); grunt> B = group A by a; grunt> C = foreach B { D = distinct A.b; generate ...
    Mar 6, 2012 at 3:20 am
    Mar 16, 2012 at 2:03 am
  • Hi, I am running a pig query on around 500 GB input data. The current block size is 128 MB and split size is the default 128 MB. I have also specified 16 reducers and around 3800 mappers are running. ...
    Austin Chungath
    Mar 13, 2012 at 12:25 pm
    Mar 14, 2012 at 9:12 pm
  • I am trying to process the output which has key in it from the map-reduce job. Is there a way I can ignore the key when I load data from that file? When I load data in the variable I don't want the ...
    Mohit Anchlia
    Mar 8, 2012 at 10:56 pm
    Mar 13, 2012 at 1:12 pm
  • Hi, We want to do linear regression analysis to achieve interpolation for a set of values, using Pig scripts. Are there any built-in functions for this? If not, how can we achieve it? Thanks & ...
    Mar 12, 2012 at 7:22 am
    Mar 13, 2012 at 5:39 am
  • Hello, I think there is a bug in PIG when using COUNT on Bag of Tuple with empty element. Here is a minimal script to reproduce this bug : I've this CSV file : ,a 1,a 2,a ,a 3,b 4,b 5,b I use that ...
    Kevin Lion
    Mar 8, 2012 at 4:56 pm
    Mar 9, 2012 at 5:59 am
  • As I wanted to increment some counters in some UDFs I wrote, I came across as THE answer which basically says I ...
    Ahmed Sobhi
    Mar 30, 2012 at 9:00 am
    Mar 30, 2012 at 3:57 pm
  • I know there has been lots of discussion on git going on. I've been wanting a place to stick useful UDFs that are pretty generic as well as a nice place to share other people's work. I was thinking ...
    Corbin Hoenes
    Mar 26, 2012 at 11:16 pm
    Mar 27, 2012 at 2:56 am
  • In the relational database we have a large key, value type of data in 2 tables. Let’s call it Entity and EntityAttribute. Table: Entity Columns: Entity ID, Entity Type Table: EntityAttribute ...
    Shan s
    Mar 21, 2012 at 6:50 pm
    Mar 23, 2012 at 3:38 pm
  • Hi guys, Thanks again for your awesome hint about sqoop. I have another question: The data I'm working with is stored as AVRO Files in the Hadoop. When I try to glob them everything works just ...
    Markus Resch
    Mar 21, 2012 at 3:02 pm
    Mar 22, 2012 at 2:13 am
  • I wanted to share a deck that has some details regarding how we use Pig for one of the projects at Salesforce. Essentially, we merged platform with Hadoop/Pig to generate very critical ...
    Prashant Kommireddi
    Mar 20, 2012 at 9:07 pm
    Mar 21, 2012 at 6:35 am
  • I want to read a small reference data file from a UDF. How do I make use of the distributed cache for this purpose? Sam William
    Sam William
    Mar 9, 2012 at 11:18 pm
    Mar 13, 2012 at 9:06 pm
  • Hello, I am using: hadoop-0.20.2-cdh3u2, hbase-0.90.4-cdh3u3, pig-0.8.1-cdh3u3 I have successfully loaded data into HBase tables (implying my Hadoop & HBase setup is good). I can look at the data ...
    Something Something
    Mar 8, 2012 at 6:30 am
    Mar 8, 2012 at 9:55 pm
  • Hello All, I was wondering if there was a way for me to store the DESCRIBE on an alias in a file. Often we have many fields to store and we keep adding fields that we want to store, it would be great ...
    Gayatri Rao
    Mar 29, 2012 at 9:39 pm
    Mar 29, 2012 at 11:10 pm
  • In this page : Xingang 2012/3/28 This email (including any attachments) is confidential and may be legally privileged. If you received this ...
    Mar 28, 2012 at 4:01 pm
    Mar 28, 2012 at 4:29 pm
  • Hi, There is a trivial issue with PigStats (during HASHJOIN), it does not print correct record count. My job does a LEFT OUTER join operation and hence the row count with input B should match output ...
    Subir S
    Mar 27, 2012 at 10:45 am
    Mar 28, 2012 at 6:44 am
  • Hi All, I have a statement like this: -- A is omitted, loads data B = FOREACH A GENERATE FLATTEN(data1.b.v) as dataPoint1, FLATTEN(data2.b.v) as dataPoint2; C = FILTER B BY dataPoint1 == ...
    Michael Moore
    Mar 19, 2012 at 7:49 pm
    Mar 27, 2012 at 3:30 am
  • Folks -- how are folks handling the "productionalization" of their Pig submit nodes? For our PROD environment, I originally thought we'd just have a few VMs from which Pig jobs would be submitted ...
    Norbert Burger
    Mar 21, 2012 at 1:50 pm
    Mar 21, 2012 at 2:55 pm
  • Hi, Can I write a UDF that overrides LOAD SimpleTextLoader without MapReduce? I am a bit confused about the use of MapReduce, because I am not able to follow the flow of the LOAD SimpleTextLoader when the ...
    Mar 12, 2012 at 11:51 am
    Mar 16, 2012 at 7:23 am
  • Hi all. I'm trying to write a simple filter function (to be used with the FILTER operator) in python, but I don't seem to find the right way to specify its schema. I'm using pig 0.9.2. The filter's ...
    Marco Cova
    Mar 15, 2012 at 11:03 pm
    Mar 16, 2012 at 7:08 am
  • Hi Folks, I'm currently working on a framework that's going to do some awesome graphing stuff grabbing data out using Pig. What I'm wondering is, is there any way I can put embedded pig in a module ...
    Eli Finkelshteyn
    Mar 14, 2012 at 6:16 am
    Mar 14, 2012 at 3:12 pm
  • I tried to return a Set<String> from my UDF, but it seems to give some problems. What are the allowed return data types in a UDF? Is it constrained to those in the "Pig Types" section in ...
    Mar 12, 2012 at 6:16 pm
    Mar 12, 2012 at 8:27 pm
  • I tried to subscribe but a mail client box came up, not what I wanted, so we'll see if this works. I wrote this script: register s3n://uw-cse344-code/myudfs.jar -- load the test file into Pig --raw = ...
    Colleen Ross
    Mar 10, 2012 at 3:49 pm
    Mar 11, 2012 at 4:58 am
  • I have "set 5" in the pig job and still I am seeing around 214 map tasks and around 30 actively running jobs. I was expecting only 5 map tasks. My cluster has 5 nodes.
    Mohit Anchlia
    Mar 10, 2012 at 12:39 am
    Mar 10, 2012 at 1:16 am
  • Hi, I have a UDF that parses a line and then returns a bag, and sometimes the line is bad so I'm returning null in the UDF. In my pig script, I'd like to filter those nulls like this: raw = LOAD ...
    Dexin Wang
    Mar 2, 2012 at 12:46 am
    Mar 7, 2012 at 11:08 pm
  • Hi Can I see the user-payload for the MapReduce job that is created by Pig. How? i.e. the Map and Reduce function code that is generated by Pig script.. Thanks,
    Shan shan
    Mar 6, 2012 at 1:28 pm
    Mar 6, 2012 at 6:01 pm
  • Hello everyone, Is Pig capable of appending to a file? I know that an exception is thrown when a file exists using PigStorage, but is there a way to get around this? Thanks, Daan.
    Daan Gerits
    Mar 5, 2012 at 9:34 am
    Mar 5, 2012 at 6:14 pm
  • I've created a vim snipmate plugin for PigLatin which saves me a lot of time developing Pig jobs. For those unfamiliar: Snipmate is a Vim plugin for code completion. I've made a small writeup here ...
    Rob Verkuylen
    Mar 3, 2012 at 1:06 am
    Mar 3, 2012 at 9:23 am
Group Overview
group: user @
categories: pig, hadoop

77 users for March 2012

Prashant Kommireddi: 39 posts
Jonathan Coveney: 26 posts
Bill Graham: 24 posts
Dmitriy Ryaboy: 21 posts
Mohit Anchlia: 20 posts
Norbert Burger: 12 posts
Rakesh sharma: 11 posts
Russell Jurney: 10 posts
Thejas Nair: 9 posts
Aniket Mokashi: 7 posts
Rohini U: 7 posts
Stan Rosenberg: 7 posts
Jason Alexander: 6 posts
Marcin Cylke: 6 posts
Ryan Cole: 6 posts
Alan Gates: 5 posts
Eli Finkelshteyn: 5 posts
IGZ Nick: 5 posts
Dan Feldman: 4 posts
Daniel Dai: 4 posts