Pig user mailing list, April 2011
Hi Folks

I've loaded a dataset and I am attempting to filter out unwanted
records by checking that one of my tuple fields contains a particular
string. I've distilled the issue down to the sample excite.log that ships
with Pig so it's easy to reproduce. I've read through the INDEXOF code and I
think this should work (lots of the queries contain the word yahoo), but my
queries dump always contains zero records. Can anyone tell me what I am
doing wrong?

raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time, query);
queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
dump queries;

Regards
Steve Watt

  • Richard Ding at Apr 22, 2011 at 10:17 pm
    INDEXOF returns the 0-based position of the first match, or -1 when the
    substring is absent, so the "contains" test is >= 0 (with > 0 you also drop
    queries that start with "yahoo"). Declaring query as a chararray ensures
    INDEXOF receives a string rather than a bytearray:

    raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time, query:chararray);
    queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
    dump queries;
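
    Another way to express "contains" is Pig's MATCHES operator, which applies
    a Java regular expression to the whole value, hence the leading and
    trailing .*. A minimal alternative against the same raw relation:

    queries = FILTER raw BY query MATCHES '.*yahoo.*';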


  • Steve Watt at Apr 23, 2011 at 12:31 am
    Richard, if you're coming to OSCON or Hadoop Summit, please let me know so I
    can buy you a beer. Thanks for the help. This now works with the excite
    log using PigStorage().

    It is, however, still not working with my custom LoadFunc and data. For
    reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
    Nutch segments that reads in each crawled page and represents it as a
    tuple of (Url, ContentType, PageContent), as shown in the script below:

    webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data' using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray, content:chararray);
    companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0);
    dump companies;

    This keeps failing with ERROR 1071: Cannot convert a
    generic_writablecomparable to a String. However, if I change the script to
    the following (remove the schema types and dump straight after the load),
    it works:

    webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data' using com.hp.demo.SegmentLoader() AS (url, type, content);
    dump webcrawl;

    Clearly, as soon as I inject types into the load schema it starts bombing.
    Can anyone tell me what I am doing wrong? I have attached my Nutch LoadFunc
    below for reference:

    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;
    import org.apache.nutch.protocol.Content;
    import org.apache.pig.FileInputLoadFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class SegmentLoader extends FileInputLoadFunc {

        private SequenceFileRecordReader<WritableComparable, Content> reader;
        protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @SuppressWarnings("unchecked")
        @Override
        public InputFormat getInputFormat() throws IOException {
            return new SequenceFileInputFormat<WritableComparable, Content>();
        }

        @SuppressWarnings("unchecked")
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = (SequenceFileRecordReader) reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;  // no more records in this split
                }
                // Each value is a Nutch Content record for one crawled page.
                Content value = (Content) reader.getCurrentValue();
                String url = value.getUrl();
                String type = value.getContentType();
                String content = value.getContent().toString();
                Tuple tuple = TupleFactory.getInstance().newTuple(3);
                tuple.set(0, new DataByteArray(url));
                tuple.set(1, new DataByteArray(type));
                tuple.set(2, new DataByteArray(content));
                return tuple;
            } catch (InterruptedException e) {
                throw new ExecException(e);
            }
        }
    }

  • Dmitriy Ryaboy at Apr 23, 2011 at 1:06 am
    If the expected return type of your loader is (String, String, String), you
    should just put Strings into the tuple (no conversion to DataByteArray) and
    report your schema to Pig via an implementation of LoadMetadata.getSchema().

    D
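
    A concrete sketch of that suggestion (untested; it assumes the Pig 0.8
    LoadMetadata signatures and reuses the names from Steve's SegmentLoader,
    and it assumes the page bytes are UTF-8 text):

    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;
    import org.apache.nutch.protocol.Content;
    import org.apache.pig.Expression;
    import org.apache.pig.FileInputLoadFunc;
    import org.apache.pig.LoadMetadata;
    import org.apache.pig.ResourceSchema;
    import org.apache.pig.ResourceStatistics;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class SegmentLoader extends FileInputLoadFunc implements LoadMetadata {

        private SequenceFileRecordReader<WritableComparable, Content> reader;

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @SuppressWarnings("unchecked")
        @Override
        public InputFormat getInputFormat() throws IOException {
            return new SequenceFileInputFormat<WritableComparable, Content>();
        }

        @SuppressWarnings("unchecked")
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = (SequenceFileRecordReader) reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;
                }
                Content value = (Content) reader.getCurrentValue();
                Tuple tuple = TupleFactory.getInstance().newTuple(3);
                // Plain Strings, no DataByteArray wrapping: the schema
                // reported below already declares these fields as chararrays.
                tuple.set(0, value.getUrl());
                tuple.set(1, value.getContentType());
                // Nutch's Content.getContent() returns raw page bytes;
                // decoding as UTF-8 is an assumption.
                tuple.set(2, new String(value.getContent(), "UTF-8"));
                return tuple;
            } catch (InterruptedException e) {
                throw new ExecException(e);
            }
        }

        // Report the schema up front so Pig never plans a
        // bytearray-to-chararray cast on these fields.
        @Override
        public ResourceSchema getSchema(String location, Job job) throws IOException {
            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("url", DataType.CHARARRAY));
            schema.add(new Schema.FieldSchema("type", DataType.CHARARRAY));
            schema.add(new Schema.FieldSchema("content", DataType.CHARARRAY));
            return new ResourceSchema(schema);
        }

        // The remaining LoadMetadata methods can be no-ops for this loader.
        @Override
        public ResourceStatistics getStatistics(String location, Job job) throws IOException {
            return null;
        }

        @Override
        public String[] getPartitionKeys(String location, Job job) throws IOException {
            return null;
        }

        @Override
        public void setPartitionFilter(Expression filter) throws IOException {
        }
    }

    With the loader reporting its own schema, the AS types in the LOAD
    statement become redundant, and the FILTER on INDEXOF(url,'comp') should
    no longer trigger the cast that was failing with ERROR 1071.
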
  • Aniket Mokashi at Apr 23, 2011 at 1:07 am
    I think the fix is to change

    tuple.set(0, new DataByteArray(url));

    to

    tuple.set(0, url);

    (and likewise for the other two fields).

    Thanks,
    Aniket

Discussion Overview
group: user @ pig.apache.org
categories: pig, hadoop
posted: Apr 22, 2011 at 9:26 pm
active: Apr 23, 2011 at 1:07 am
posts: 5
users: 4
website: pig.apache.org
