Grokbase Groups Pig dev November 2010
PiggyBank AllLoader - Load multiple file formats in one load statement
----------------------------------------------------------------------

Key: PIG-1722
URL: https://issues.apache.org/jira/browse/PIG-1722
Project: Pig
Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Priority: Minor


This gives the ability to point one loader at a directory and have files in multiple formats loaded and used in the same query.

----- Overview -----

Let's say we have a directory with these files:
/logs/myfile.lzo
/logs/myfile.rc
/logs/myfile.bz2
/logs/myfile.gz

Loading these currently requires a separate loader and LOAD statement in Pig for each format, and the query then has to perform a UNION on the results.

With this Loader the query becomes:
a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();

The AllLoader uses the mapping property file.extension.loaders in $PIG_HOME/conf/pig.properties, which can be set up as:

file.extension.loaders=gz:org.apache.pig.builtin.PigStorage(),bz2:org.apache.pig.builtin.PigStorage(),lzo:com.twitter.elephantbird.pig.load.LzoTextLoader(), rc:org.apache.pig.piggybank.storage.HiveColumnarLoader()

Each entry in this property has one of the following formats:

-> [file-extension]:[loader func spec]
-> [file-extension]:[optional path tag]:[loader func spec]
-> [file-extension]:[optional path tag]:[sequence file key value writer class name]:[loader func spec]
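
The three entry forms above can be told apart by the number of colon-separated fields. The following is a minimal sketch of such a parser, not the AllLoader's actual code: the class and method names are hypothetical, and it assumes that entries are separated by commas outside parentheses and that colons inside a loader's argument list do not count as field separators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: parses a file.extension.loaders value into entries.
public class LoaderMappingSketch {

    /** One parsed mapping entry; pathTag and writerClass are null when absent. */
    static final class Entry {
        final String extension, pathTag, writerClass, loaderSpec;
        Entry(String extension, String pathTag, String writerClass, String loaderSpec) {
            this.extension = extension;
            this.pathTag = pathTag;
            this.writerClass = writerClass;
            this.loaderSpec = loaderSpec;
        }
    }

    /** Splits the property value on commas that are not inside parentheses. */
    static List<String> splitEntries(String value) {
        List<String> out = new ArrayList<>();
        int depth = 0, start = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (c == ',' && depth == 0) {
                out.add(value.substring(start, i).trim());
                start = i + 1;
            }
        }
        out.add(value.substring(start).trim());
        return out;
    }

    /** Parses one entry; only the text before the first '(' is split on ':'. */
    static Entry parseEntry(String entry) {
        int paren = entry.indexOf('(');
        String head = paren < 0 ? entry : entry.substring(0, paren);
        String args = paren < 0 ? "" : entry.substring(paren);
        String[] f = head.split(":", -1);
        if (f.length < 2 || f.length > 4) {
            throw new IllegalArgumentException("Unrecognised mapping entry: " + entry);
        }
        String loader = f[f.length - 1] + args;
        String tag = f.length >= 3 && !f[1].isEmpty() ? f[1] : null;
        String writer = f.length == 4 && !f[2].isEmpty() ? f[2] : null;
        return new Entry(f[0].trim(), tag, writer, loader.trim());
    }

    public static void main(String[] args) {
        String value = "gz:org.apache.pig.builtin.PigStorage(),"
                + " rc:org.apache.pig.piggybank.storage.HiveColumnarLoader(),"
                + "gz:type1:Type1Loader(),"
                + "seq::org.apache.hadoop.hive.ql.io.RCFile:HiveColumnarLoader";
        for (String raw : splitEntries(value)) {
            Entry e = parseEntry(raw);
            System.out.println(e.extension + " | " + e.pathTag + " | "
                    + e.writerClass + " | " + e.loaderSpec);
        }
    }
}
```

An empty path tag (as in seq::...) is treated as "no tag" in this sketch, which is one possible reading of "optional path tag".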

----- File path tagging: -----

Loaders can also be chosen based on folder names in the file path:
e.g.
file.extension.loaders:gz:type1:Type1Loader(), gz:type2:Type2Loader()

So if you have /logs/type1/mylog and /logs/type2/mylog, then
a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();
will use Type1Loader for mylog in /logs/type1 and Type2Loader for mylog in /logs/type2.
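
As an illustration of the path-tag idea, here is a minimal sketch that picks a loader by comparing each directory name in a file path against the configured tags. The class and method names are hypothetical, and exact matching of a whole directory name is an assumption; the actual AllLoader may match tags differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: chooses a loader func spec from the folder names in a path.
public class PathTagSketch {

    /** Returns the loader for the first directory name that equals a tag, or null. */
    static String pickLoader(String filePath, Map<String, String> loadersByTag) {
        String[] parts = filePath.split("/");
        // The last component is the file name, so only the directories are checked.
        for (int i = 0; i < parts.length - 1; i++) {
            String loader = loadersByTag.get(parts[i]);
            if (loader != null) {
                return loader;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> loadersByTag = new LinkedHashMap<>();
        loadersByTag.put("type1", "Type1Loader()");
        loadersByTag.put("type2", "Type2Loader()");
        System.out.println(pickLoader("/logs/type1/mylog", loadersByTag));
        System.out.println(pickLoader("/logs/type2/mylog", loadersByTag));
    }
}
```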

----- File content guessing: -----

If the files do not have extensions, the AllLoader will try to guess the file type by looking at the first three bytes, mapping the following byte sequences to extensions:

[ -119, 76, 90 ] = lzo
[ 31, -117, 8 ] = gz
[ 66, 90, 104 ] = bz2
[ 83, 69, 81 ] = seq
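
The byte table above can be read as a simple lookup on a file's first three bytes. The sketch below shows one way to express it; the class and method names are hypothetical and this is not taken from the patch. The values are signed Java bytes, so -119 is 0x89 and -117 is 0x8B.

```java
import java.util.Arrays;

// Hypothetical sketch: maps the first three bytes of a file to an extension.
public class MagicByteSketch {

    private static final byte[][] MAGIC = {
        { -119, 76, 90 },  // 0x89 'L' 'Z'
        { 31, -117, 8 },   // 0x1F 0x8B 0x08
        { 66, 90, 104 },   // 'B' 'Z' 'h'
        { 83, 69, 81 },    // 'S' 'E' 'Q'
    };
    private static final String[] EXTENSIONS = { "lzo", "gz", "bz2", "seq" };

    /** Returns the guessed extension, or null if the header matches nothing. */
    static String guessExtension(byte[] header) {
        if (header == null || header.length < 3) {
            return null;
        }
        byte[] firstThree = Arrays.copyOf(header, 3);
        for (int i = 0; i < MAGIC.length; i++) {
            if (Arrays.equals(MAGIC[i], firstThree)) {
                return EXTENSIONS[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] gzipHeader = { 31, -117, 8, 0 };
        byte[] plainText = { 'h', 'e', 'l', 'l', 'o' };
        System.out.println(guessExtension(gzipHeader));
        System.out.println(guessExtension(plainText));
    }
}
```

The guessed extension would then be looked up in file.extension.loaders in the same way as a real file extension.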

----- Loader selection based on sequence file writer class -----

A loader can also be selected based on the value returned by getKeyClassName for a sequence file.
e.g.
file.extension.loaders:seq::org.apache.hadoop.hive.ql.io.RCFile:HiveColumnarLoader
will use HiveColumnarLoader for all sequence files that were written with org.apache.hadoop.hive.ql.io.RCFile as the key class name.

Any $ suffix is removed from the value returned by getKeyClassName before it is compared.
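
A minimal sketch of the $ handling, assuming it means "drop everything from the first $ onwards". The class and method names are hypothetical, and the nested class name in the example is only an illustrative input.

```java
// Hypothetical sketch: normalises a key class name before looking up a loader.
public class KeyClassNameSketch {

    /** Drops the first '$' and everything after it, if present. */
    static String stripDollarSuffix(String className) {
        int i = className.indexOf('$');
        return i < 0 ? className : className.substring(0, i);
    }

    public static void main(String[] args) {
        // Illustrative input: a nested class name as the JVM would report it.
        String reported = "org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer";
        System.out.println(stripDollarSuffix(reported));
    }
}
```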

----- Path Partition Handling -----

Hive-style partitioning is supported in the loader itself, so if you have /logs/type=1, /logs/type=2 and /logs/type=3, the partition column will be recognised as "type" and filtering can be done with expressions such as type<=2.

In the current implementation, filter expressions should be passed to the AllLoader's constructor, e.g.

a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader('type<=2');

will load only the files in /logs/type=1 and /logs/type=2.
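
To make the partition filtering concrete, here is a minimal sketch that extracts key=value pairs from a path and checks them against a single comparison such as type<=2. All names are hypothetical. It assumes integer partition values and supports only one comparison per filter; the expression language accepted by the real AllLoader may well be richer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: Hive-style partition keys from a path plus a simple filter.
public class PartitionFilterSketch {

    // Matches filters of the form <key><op><integer>, e.g. type<=2.
    private static final Pattern SIMPLE =
            Pattern.compile("\\s*(\\w+)\\s*(<=|>=|==|<|>)\\s*(-?\\d+)\\s*");

    /** Collects key=value path components, e.g. /logs/type=1/f gives {type=1}. */
    static Map<String, String> partitionsOf(String path) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String part : path.split("/")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                out.put(part.substring(0, eq), part.substring(eq + 1));
            }
        }
        return out;
    }

    /** Returns true if the path's partition value satisfies the filter. */
    static boolean matches(String path, String filter) {
        Matcher m = SIMPLE.matcher(filter);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported filter: " + filter);
        }
        String value = partitionsOf(path).get(m.group(1));
        if (value == null) {
            return false; // the path has no such partition key
        }
        long actual = Long.parseLong(value); // assumes numeric partition values
        long bound = Long.parseLong(m.group(3));
        switch (m.group(2)) {
            case "<=": return actual <= bound;
            case ">=": return actual >= bound;
            case "==": return actual == bound;
            case "<":  return actual < bound;
            default:   return actual > bound;
        }
    }

    public static void main(String[] args) {
        for (String p : new String[] { "/logs/type=1/a", "/logs/type=2/b", "/logs/type=3/c" }) {
            System.out.println(p + " selected: " + matches(p, "type<=2"));
        }
    }
}
```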





--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Gerrit Jansen van Vuuren (JIRA) at Nov 12, 2010 at 4:34 pm
    [ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931437#action_12931437 ]

    Gerrit Jansen van Vuuren commented on PIG-1722:
    -----------------------------------------------

    ----- Schema Selection -----

    This loader uses the JsonMetadata class in piggybank to try to load JSON schemas if they are available in the path.
    If no JSON schema is available, the AllLoader's getSchema method returns null.




  • Gerrit Jansen van Vuuren (JIRA) at Nov 12, 2010 at 5:10 pm
    [ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Gerrit Jansen van Vuuren updated PIG-1722:
    ------------------------------------------

    Attachment: PIG-1722.patch
  • Gerrit Jansen van Vuuren (JIRA) at Nov 12, 2010 at 5:10 pm
    [ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Gerrit Jansen van Vuuren updated PIG-1722:
    ------------------------------------------

    Tags: PIG-1722.patch
    Status: Patch Available (was: Open)
  • Alan Gates (JIRA) at Nov 12, 2010 at 7:27 pm
    [ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931503#action_12931503 ]

    Alan Gates commented on PIG-1722:
    ---------------------------------

    Patch looks good. Tests pass. It has good documentation. One issue is it uses tabs instead of spaces. I've fixed that. I'll commit this shortly.
  • Gerrit Jansen van Vuuren (JIRA) at Nov 12, 2010 at 8:07 pm
    [ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931515#action_12931515 ]

    Gerrit Jansen van Vuuren commented on PIG-1722:
    -----------------------------------------------

    Thanks, and sorry about the tabs. I used the auto-formatting in Eclipse, but I will check that it is set to use 4 spaces instead of tabs :)

  • Alan Gates (JIRA) at Nov 16, 2010 at 6:24 pm
    [ https://issues.apache.org/jira/browse/PIG-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1722:
    ----------------------------

    Resolution: Fixed
    Fix Version/s: 0.9.0
    Status: Resolved (was: Patch Available)

    Patch checked in. Thank you Gerrit for contributing.

Posted: Nov 12, 2010 at 4:30 pm
Last active: Nov 16, 2010 at 6:24 pm