FAQ
Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO ( <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com

Search Discussions

  • Avram Aelony at Sep 16, 2009 at 6:07 pm
    Clarification, the line below should have been "many files, roughly 500MB compressed" instead of "many files, roughly 500M compressed"..
    -A


    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 10:49 AM
    To: hive-user@hadoop.apache.org
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Prasad Chakka at Sep 16, 2009 at 6:11 pm
    I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME)
    stored as textfile location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

    Prasad
    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Avram Aelony at Sep 16, 2009 at 6:24 pm
    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
    For example, in a later query I'd like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A


    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME)
    stored as textfile location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

    Prasad
    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Prasad Chakka at Sep 16, 2009 at 6:33 pm
    FYI, all partition columns can be used as any regular columns select queries. So it should be fine.


    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
    For example, in a later query I'd like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME)
    stored as textfile location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Avram Aelony at Sep 16, 2009 at 6:39 pm
    Very cool. Looking forward to seeing this feature in action... :)

    Thanks,
    -A


    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    FYI, all partition columns can be used as any regular columns select queries. So it should be fine.

    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
    For example, in a later query I'd like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME)
    stored as textfile location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

    Prasad
    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Namit Jain at Sep 16, 2009 at 7:02 pm
    I don't think it is a good idea to make it a part of table metadata in any way.
    What happens if the filename changes ? It will be very difficult to maintain.
    But, we can definitely add some virtual columns (FILENAME can be one of them
    to start with - it should not show up in describe, select * etc.

    But, the user can query based on them - this is mostly for advanced users and
    can be used for pruning etc. also


    I will open a new jira, and we can continue the discussion there.


    -namit




    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive


    Very cool. Looking forward to seeing this feature in action... :)

    Thanks,
    -A


    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    FYI, all partition columns can be used as any regular columns select queries. So it should be fine.
    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
    For example, in a later query I'd like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME)
    stored as textfile location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

    Prasad
    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket. Although the files share the same schema, they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Edward Capriolo at Sep 16, 2009 at 7:05 pm
    I just put in a related thread about this. This would be really nice.
    It is just a virtual column, we dont need it in the metadata if we
    also have a command like 'show files in partition' so we can inspect
    what is there as well.

    On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
    I don’t think it is a good idea to make it a part of table metadata in any
    way.

    What happens if the filename changes ? It will be very difficult to
    maintain.

    But, we can definitely add some virtual columns (FILENAME can be one of them

    to start with – it should not show up in describe, select * etc.



    But, the user can query based on them – this is mostly for advanced users
    and

    can be used for pruning etc. also





    I will open a new jira, and we can continue the discussion there.





    -namit









    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive





    Very cool.  Looking forward to seeing this feature in action… J



    Thanks,

    -A





    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive



    FYI, all partition columns can be used as any regular columns select
    queries. So it should be fine.

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new) derived
    fields, I will be happy… J
    For example, in a later query I’d like to be able to do something like:

    select
    substr(filename, 4, 7) as  class_A,
    substr(filename,  8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename,  8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to be a
    partition column (one of such) instead of a separate type of column. Would
    that work?

    create external table FOO (  <list of fields and types> )
    row format delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/’;

    Or table partitioned by datestamp and filename

    create external table FOO (  <list of fields and types> )
    row format delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME)
    stored as textfile location 's3:/somebucket/’;


    So FILENAME becomes a new type. I like this because partition columns are
    virtual columns just like the filename column and do not exist along with
    data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M compressed
    ) with Hive that reside in an S3 bucket.  Although the files share the same
    schema,  they have individual filenames that provide useful information that
    does not get captured and does not exist separately as a column within each
    file’s data.  As a general problem, I’d like to be able to add a new column
    via Hive that contains the filename of the files read in that were present
    in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and
    I am thinking that at some point Hadoop or Hive must have a file handle with
    the filenames that perhaps could be of use.  My hope is that this
    information could be added in (upon request) via Hive.   Perhaps as this
    could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO (  <list of fields and types> )
    row format delimited fields terminated by ','
    add_filename as ‘filename’
    stored as textfile location 's3:/somebucket/’;


    Has anyone thought of this?  Is there a way to add a new column within Hive
    that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Ashish Thusoo at Sep 16, 2009 at 7:11 pm
    You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


    Select FILENAME, xyz from T

    We could just do

    Select Filename(), xyz from T

    Thoughts?

    Ashish

    -----Original Message-----
    From: Edward Capriolo
    Sent: Wednesday, September 16, 2009 12:05 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I just put in a related thread about this. This would be really nice.
    It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

    On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
    I don't think it is a good idea to make it a part of table metadata in
    any way.

    What happens if the filename changes ? It will be very difficult to
    maintain.

    But, we can definitely add some virtual columns (FILENAME can be one
    of them

    to start with - it should not show up in describe, select * etc.



    But, the user can query based on them - this is mostly for advanced
    users and

    can be used for pruning etc. also





    I will open a new jira, and we can continue the discussion there.





    -namit









    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive





    Very cool.  Looking forward to seeing this feature in action. J



    Thanks,

    -A





    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive



    FYI, all partition columns can be used as any regular columns select
    queries. So it should be fine.

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new)
    derived fields, I will be happy. J For example, in a later query I'd
    like to be able to do something like:

    select
    substr(filename, 4, 7) as  class_A,
    substr(filename,  8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename,  8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to
    be a partition column (one of such) instead of a separate type of
    column. Would that work?

    create external table FOO (  <list of fields and types> ) row format
    delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO (  <list of fields and types> ) row format
    delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME) stored as textfile
    location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns
    are virtual columns just like the filename column and do not exist
    along with data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M
    compressed
    ) with Hive that reside in an S3 bucket.  Although the files share the
    same schema,  they have individual filenames that provide useful
    information that does not get captured and does not exist separately
    as a column within each file's data.  As a general problem, I'd like
    to be able to add a new column via Hive that contains the filename of
    the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container
    bucket, and I am thinking that at some point Hadoop or Hive must have
    a file handle with the filenames that perhaps could be of use.  My
    hope is that this information could be added in (upon request) via
    Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO (  <list of fields and types> ) row format
    delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this?  Is there a way to add a new column within
    Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Prasad Chakka at Sep 16, 2009 at 7:14 pm
    Yeah, udfs are better semantically. Will this be a problem for partition pruning?


    ________________________________
    From: Ashish Thusoo <athusoo@facebook.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 12:11:00 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


    Select FILENAME, xyz from T

    We could just do

    Select Filename(), xyz from T

    Thoughts?

    Ashish

    -----Original Message-----
    From: Edward Capriolo
    Sent: Wednesday, September 16, 2009 12:05 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I just put in a related thread about this. This would be really nice.
    It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

    On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
    I don't think it is a good idea to make it a part of table metadata in
    any way.

    What happens if the filename changes ? It will be very difficult to
    maintain.

    But, we can definitely add some virtual columns (FILENAME can be one
    of them

    to start with - it should not show up in describe, select * etc.



    But, the user can query based on them - this is mostly for advanced
    users and

    can be used for pruning etc. also





    I will open a new jira, and we can continue the discussion there.





    -namit









    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive





    Very cool. Looking forward to seeing this feature in action. J



    Thanks,

    -A





    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive



    FYI, all partition columns can be used as any regular columns select
    queries. So it should be fine.

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new)
    derived fields, I will be happy. J For example, in a later query I'd
    like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to
    be a partition column (one of such) instead of a separate type of
    column. Would that work?

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME) stored as textfile
    location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns
    are virtual columns just like the filename column and do not exist
    along with data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M
    compressed
    ) with Hive that reside in an S3 bucket. Although the files share the
    same schema, they have individual filenames that provide useful
    information that does not get captured and does not exist separately
    as a column within each file's data. As a general problem, I'd like
    to be able to add a new column via Hive that contains the filename of
    the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container
    bucket, and I am thinking that at some point Hadoop or Hive must have
    a file handle with the filenames that perhaps could be of use. My
    hope is that this information could be added in (upon request) via
    Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within
    Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Avram Aelony at Sep 16, 2009 at 7:19 pm
    Actually, I think it would be great to see filename in the describe command (if the table specified it be included at table create time).

    Part of the issue is that necessary information is only available in the filename and not in the row data. Making the filename available (for further parsing) in the describe command would alert the analyst that the sought-after information is actually present and need only be further massaged...

    -A


    -----Original Message-----
    From: Ashish Thusoo
    Sent: Wednesday, September 16, 2009 12:11 PM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive

    You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


    Select FILENAME, xyz from T

    We could just do

    Select Filename(), xyz from T

    Thoughts?

    Ashish

    -----Original Message-----
    From: Edward Capriolo
    Sent: Wednesday, September 16, 2009 12:05 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I just put in a related thread about this. This would be really nice.
    It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

    On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
    I don't think it is a good idea to make it a part of table metadata in
    any way.

    What happens if the filename changes ? It will be very difficult to
    maintain.

    But, we can definitely add some virtual columns (FILENAME can be one
    of them

    to start with - it should not show up in describe, select * etc.



    But, the user can query based on them - this is mostly for advanced
    users and

    can be used for pruning etc. also





    I will open a new jira, and we can continue the discussion there.





    -namit









    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive





    Very cool.  Looking forward to seeing this feature in action. J



    Thanks,

    -A





    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive



    FYI, all partition columns can be used as any regular columns select
    queries. So it should be fine.

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new)
    derived fields, I will be happy. J For example, in a later query I'd
    like to be able to do something like:

    select
    substr(filename, 4, 7) as  class_A,
    substr(filename,  8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename,  8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to
    be a partition column (one of such) instead of a separate type of
    column. Would that work?

    create external table FOO (  <list of fields and types> ) row format
    delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO (  <list of fields and types> ) row format
    delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME) stored as textfile
    location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns
    are virtual columns just like the filename column and do not exist
    along with data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M
    compressed
    ) with Hive that reside in an S3 bucket.  Although the files share the
    same schema,  they have individual filenames that provide useful
    information that does not get captured and does not exist separately
    as a column within each file's data.  As a general problem, I'd like
    to be able to add a new column via Hive that contains the filename of
    the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container
    bucket, and I am thinking that at some point Hadoop or Hive must have
    a file handle with the filenames that perhaps could be of use.  My
    hope is that this information could be added in (upon request) via
    Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO (  <list of fields and types> ) row format
    delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this?  Is there a way to add a new column within
    Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Prasad Chakka at Sep 16, 2009 at 7:22 pm
    I think all hive tables will have this metadata (relpath and other hdfs file properties potentially) available by default. So there is no need for this to be shown in describe.


    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 12:19:21 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Actually, I think it would be great to see filename in the describe command (if the table specified it be included at table create time).

    Part of the issue is that necessary information is only available in the filename and not in the row data. Making the filename available (for further parsing) in the describe command would alert the analyst that the sought-after information is actually present and need only be further massaged...

    -A


    -----Original Message-----
    From: Ashish Thusoo
    Sent: Wednesday, September 16, 2009 12:11 PM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive

    You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


    Select FILENAME, xyz from T

    We could just do

    Select Filename(), xyz from T

    Thoughts?

    Ashish

    -----Original Message-----
    From: Edward Capriolo
    Sent: Wednesday, September 16, 2009 12:05 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I just put in a related thread about this. This would be really nice.
    It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

    On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
    I don't think it is a good idea to make it a part of table metadata in
    any way.

    What happens if the filename changes ? It will be very difficult to
    maintain.

    But, we can definitely add some virtual columns (FILENAME can be one
    of them

    to start with - it should not show up in describe, select * etc.



    But, the user can query based on them - this is mostly for advanced
    users and

    can be used for pruning etc. also





    I will open a new jira, and we can continue the discussion there.





    -namit









    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive





    Very cool. Looking forward to seeing this feature in action. J



    Thanks,

    -A





    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive



    FYI, all partition columns can be used as any regular columns select
    queries. So it should be fine.

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new)
    derived fields, I will be happy. J For example, in a later query I'd
    like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to
    be a partition column (one of such) instead of a separate type of
    column. Would that work?

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME) stored as textfile
    location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns
    are virtual columns just like the filename column and do not exist
    along with data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M
    compressed
    ) with Hive that reside in an S3 bucket. Although the files share the
    same schema, they have individual filenames that provide useful
    information that does not get captured and does not exist separately
    as a column within each file's data. As a general problem, I'd like
    to be able to add a new column via Hive that contains the filename of
    the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container
    bucket, and I am thinking that at some point Hadoop or Hive must have
    a file handle with the filenames that perhaps could be of use. My
    hope is that this information could be added in (upon request) via
    Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within
    Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com
  • Avram Aelony at Sep 16, 2009 at 7:37 pm
    I defer to your judgment, but this appears to me to be a case where distinctions between meta-data and row-data are blurry. In my case, the filename is useful to have only because it contains information that should have been included in the row-data. Thus the filename adds information that is not truly meta-data, especially if there are hundreds of filenames, each with various substrings that really should have been new grouping variables within each file.

    That said, the important point is that it is clear that the filename info is present and available for further parsing. I would think the intent of a describe command would be to describe what is available, but if there is a way to describe meta-data that wouldn't be overlooked by an analyst poking around for the data being looked for, that works for me...

    Cheers,
    -A




    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 12:22 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think all hive tables will have this metadata (relpath and other hdfs file properties potentially) available by default. So there is no need for this to be shown in describe.

    ________________________________
    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 12:19:21 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Actually, I think it would be great to see filename in the describe command (if the table specified it be included at table create time).

    Part of the issue is that necessary information is only available in the filename and not in the row data. Making the filename available (for further parsing) in the describe command would alert the analyst that the sought-after information is actually present and need only be further massaged...

    -A


    -----Original Message-----
    From: Ashish Thusoo
    Sent: Wednesday, September 16, 2009 12:11 PM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive

    You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


    Select FILENAME, xyz from T

    We could just do

    Select Filename(), xyz from T

    Thoughts?

    Ashish

    -----Original Message-----
    From: Edward Capriolo
    Sent: Wednesday, September 16, 2009 12:05 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I just put in a related thread about this. This would be really nice.
    It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

    On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
    I don't think it is a good idea to make it a part of table metadata in
    any way.

    What happens if the filename changes ? It will be very difficult to
    maintain.

    But, we can definitely add some virtual columns (FILENAME can be one
    of them

    to start with - it should not show up in describe, select * etc.



    But, the user can query based on them - this is mostly for advanced
    users and

    can be used for pruning etc. also





    I will open a new jira, and we can continue the discussion there.





    -namit









    From: Avram Aelony
    Sent: Wednesday, September 16, 2009 11:39 AM
    To: hive-user@hadoop.apache.org
    Subject: RE: adding filenames as new columns via Hive





    Very cool. Looking forward to seeing this feature in action. J



    Thanks,

    -A





    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:33 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive



    FYI, all partition columns can be used as any regular columns select
    queries. So it should be fine.

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 11:23:45 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: RE: adding filenames as new columns via Hive

    Sounds great, Prasad.

    As long as I can further parse the filename field to piece out (new)
    derived fields, I will be happy. J For example, in a later query I'd
    like to be able to do something like:

    select
    substr(filename, 4, 7) as class_A,
    substr(filename, 8, 10) as class_B
    count( x ) as cnt
    from FOO
    group by
    substr(filename, 4, 7),
    substr(filename, 8, 10) ;


    thanks,
    -A



    From: Prasad Chakka
    Sent: Wednesday, September 16, 2009 11:10 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: adding filenames as new columns via Hive

    I think this can be a good feature though I would like the filename to
    be a partition column (one of such) instead of a separate type of
    column. Would that work?

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    partitioned by (file_name FILENAME)
    stored as textfile location 's3:/somebucket/';

    Or table partitioned by datestamp and filename

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    Partitioned by (ds STRING, file_name FILENAME) stored as textfile
    location 's3:/somebucket/';


    So FILENAME becomes a new type. I like this because partition columns
    are virtual columns just like the filename column and do not exist
    along with data on the disk.

    Prasad

    ________________________________

    From: Avram Aelony <AvramAelony@eharmony.com>
    Reply-To: <hive-user@hadoop.apache.org>
    Date: Wed, 16 Sep 2009 10:48:33 -0700
    To: <hive-user@hadoop.apache.org>
    Subject: adding filenames as new columns via Hive

    Dear Hive list,

    I am processing a large volume of files (many files, roughly 500M
    compressed
    ) with Hive that reside in an S3 bucket. Although the files share the
    same schema, they have individual filenames that provide useful
    information that does not get captured and does not exist separately
    as a column within each file's data. As a general problem, I'd like
    to be able to add a new column via Hive that contains the filename of
    the files read in that were present in the bucket.

    My Hive CREATE EXTERNAL TABLE command points to the S3 container
    bucket, and I am thinking that at some point Hadoop or Hive must have
    a file handle with the filenames that perhaps could be of use. My
    hope is that this information could be added in (upon request) via
    Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

    Ideally, the syntax would look something like this:

    create external table FOO ( <list of fields and types> ) row format
    delimited fields terminated by ','
    add_filename as 'filename'
    stored as textfile location 's3:/somebucket/';


    Has anyone thought of this? Is there a way to add a new column within
    Hive that contains the filename?



    Many thanks in advance!!
    -Avram



    Avram Aelony
    Senior Analyst, Matching
    eHarmony.com

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedSep 16, '09 at 5:49p
activeSep 16, '09 at 7:37p
posts13
users5
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase