Grokbase Groups Pig user May 2009
I would like to find a way to escape the delimiter character in my
data so that it doesn't get interpreted as extra columns. For example,
if I'm using comma as a delimiter, and I have a column with value
"foo,bar" I want that string interpreted as a single column without
having the loader pick up the comma in the middle. I noticed that old
versions of PigLoader actually used a regex match as the delimiter
(which would have been perfect), but that was removed in favor of a
simple string match.

Before I go to the trouble of writing a custom loader, which I'd
rather not do if I can avoid it:

1. Is there a way to pre-process the data to escape that character
that I missed in the wiki docs? (E.g., could I escape it as
"foo\,bar" or "foo,,bar" or something similar with the existing loader?)
2. Is there a different approach altogether I could take? My data is
being generated in a controllable format, and can certainly be
massaged or filtered in some way to make life easier.

thanks,
Greg


  • Alan Gates at May 12, 2009 at 5:09 pm
    PigStorage currently does not provide a way to escape delimiters. One
    common solution is to use a control character that doesn't appear in
    your data, such as ^A, as the delimiter. If you are forced to write
    your own loader, you could have it inherit from PigStorage and just
    override the getNext method to parse tuples the way you want. Then
    you won't have to implement an entire load function. Or, you could
    modify PigStorage to handle escapes. If you can do that in a way that
    preserves performance, that might be the best option.

    Alan.
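
The control-character suggestion above can be applied as a pre-processing step. The following is only a minimal sketch, assuming the data can be generated as standard quoted CSV (for example `1,"foo,bar",x`) and that ^A (byte 0x01) never occurs in any field. The function name `recode_delimiter` is hypothetical and is not part of Pig.

```python
import csv
import io

CTRL_A = "\x01"  # ^A; assumed never to occur in the data


def recode_delimiter(src, dst, old_delim=",", new_delim=CTRL_A):
    """Copy quoted CSV rows from src to dst, joining fields with new_delim."""
    for row in csv.reader(src, delimiter=old_delim):
        for field in row:
            # Refuse fields that would be ambiguous in the output format.
            if new_delim in field or "\n" in field or "\r" in field:
                raise ValueError(f"field not representable: {field!r}")
        dst.write(new_delim.join(row) + "\n")


# Example: the comma inside "foo,bar" survives as ordinary field content.
src = io.StringIO('1,"foo,bar",x\n2,plain,y\n')
dst = io.StringIO()
recode_delimiter(src, dst)
```

The resulting file should then be loadable by passing the same control character to PigStorage as its delimiter argument, along the lines of `PigStorage('\u0001')`; check the exact delimiter syntax against your Pig version.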
    On May 12, 2009, at 7:06 AM, Gregory Harman wrote:
    [quoted copy of the original message trimmed]
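
For the custom-loader route, the core of the work is a split that honors an escape character, which is what the "foo\,bar" idea in the question would need. The sketch below shows only that parsing logic, in Python for brevity; it is not Pig's Java API, and `split_escaped` is a hypothetical helper. It assumes backslash is the escape character and that a backslash followed by any character stands for that character literally.

```python
def split_escaped(line, delim=",", esc="\\"):
    """Split line on delim, treating esc + <char> as a literal <char>."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            # Take the next character literally; a trailing escape is kept as-is.
            current.append(next(chars, esc))
        elif ch == delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Example: the escaped comma stays inside the second field.
columns = split_escaped("a,foo\\,bar,c")
```

A single pass like this avoids regular expressions, which matters given the concern about preserving performance; a real loader would apply the same loop to the raw bytes of each record.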

Discussion Overview
Group: user @
Categories: pig, hadoop
Posted: May 12, '09 at 2:24p
Active: May 12, '09 at 5:09p
Posts: 2 (Alan Gates: 1 post, Gregory Harman: 1 post)
Website: pig.apache.org