Grokbase Groups Pig user January 2012
Hi all,

I'm new to Pig (and a bit rusty with Java!) and still just playing
around with it, nothing serious yet. I might be misunderstanding
something important here.

I'm trying to write a custom loader for a custom XML file format, i.e.
deserialize the XML into Pig data types. However, all the documentation
and other code is based on taking a RecordReader and spitting out things
from getNext().

Is there any way to make a custom loader that works on InputStreams or
more common java-io-y type stuff? I'd like to use more commonly
available XML parsers (which work on these). Since it's XML, line-by-line
parsing doesn't really work. I will just have one input file that
will be parsed. Is there some reason why there are no InputStreams?

I have also asked this question on StackOverflow:
http://stackoverflow.com/questions/8843790/custom-apache-pig-loadfunc-where-can-i-get-the-inputstream-on-the-file

--
Rory


  • William Dowling at Jan 13, 2012 at 3:11 pm
    I'm using org.apache.pig.piggybank.storage.XMLLoader from piggybank and that's working well for me. I do something like this:

    -- The analyze_src_recs.py script reads XML from stdin, and writes to
    -- stdout comma-separated lines rec_type,...
    --
    define analyze_src `analyze_src_recs.py`
        input (stdin)
        output (stdout USING PigStreaming(','))
        ship ('$scriptDir/analyze_src_recs.py');

    SrcLines = load '$src_xml/*.xml*'
        using org.apache.pig.piggybank.storage.XMLLoader('REC')
        as (doc:chararray);

    ParseOut = stream SrcLines through analyze_src
        as (rec_type : int,
            -- other fields my parser pulled out of the XML
        );



    William F Dowling
    Senior Technologist
    Thomson Reuters
    0 +1 215 823 3853


    -----Original Message-----
    From: Rory McCann
    Sent: Friday, January 13, 2012 7:12 AM
    To: user@pig.apache.org
    Subject: Custom Loaders that use Input Streams for reading data?

  • Dmitriy Ryaboy at Jan 13, 2012 at 6:28 pm
    You just have to drop down to the Hadoop level to do this. Implement a
    custom InputFormat / RecordReader; the record reader gets a normal
    Java stream.

    D
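
To illustrate the point about the record reader getting a normal Java stream: a minimal sketch of the parsing half only, assuming the RecordReader has already opened its file as a java.io.InputStream (in Hadoop that would typically be the FSDataInputStream returned by FileSystem.open(path), which is an InputStream subclass). The class name XmlStreamRecords, the method readRecords, and the sample document are hypothetical, made up for this illustration; the 'REC' tag is borrowed from the XMLLoader example earlier in the thread. It uses only the JDK's StAX parser and assumes each record element holds plain text with no child elements.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: not part of Pig or Hadoop.
final class XmlStreamRecords {

    /**
     * Reads the stream with a pull parser and returns the text of every
     * element named recordTag. Assumes those elements contain text only;
     * getElementText() throws if it meets a nested element.
     */
    static List<String> readRecords(InputStream in, String recordTag)
            throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> records = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && recordTag.equals(reader.getLocalName())) {
                    // Leaves the cursor on the matching end tag.
                    records.add(reader.getElementText());
                }
            }
        } finally {
            reader.close(); // closes the parser, not the underlying stream
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the stream a RecordReader would open on its input file.
        String xml = "<root><REC>first</REC><REC>second</REC></root>";
        InputStream in =
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String rec : readRecords(in, "REC")) {
            System.out.println(rec);
        }
    }
}
```

In a real loader, each string returned here would be handed out one at a time as the RecordReader's current value, and the LoadFunc's getNext() would wrap it in a Pig tuple. Since the file is parsed as a whole, the InputFormat would presumably also need to mark the file as non-splittable.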
