FAQ
Yeah, udfs are better semantically. Will this be a problem for partition pruning?


________________________________
From: Ashish Thusoo <athusoo@facebook.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Wed, 16 Sep 2009 12:11:00 -0700
To: <hive-user@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish

-----Original Message-----
From: Edward Capriolo
Sent: Wednesday, September 16, 2009 12:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain wrote:
I don't think it is a good idea to make it a part of table metadata in
any way.

What happens if the filename changes ? It will be very difficult to
maintain.

But, we can definitely add some virtual columns (FILENAME can be one
of them

to start with - it should not show up in describe, select * etc.



But, the user can query based on them - this is mostly for advanced
users and

can be used for pruning etc. also





I will open a new jira, and we can continue the discussion there.





-namit









From: Avram Aelony
Sent: Wednesday, September 16, 2009 11:39 AM
To: hive-user@hadoop.apache.org
Subject: RE: adding filenames as new columns via Hive





Very cool. Looking forward to seeing this feature in action. J



Thanks,

-A





From: Prasad Chakka
Sent: Wednesday, September 16, 2009 11:33 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive



FYI, all partition columns can be used as any regular columns select
queries. So it should be fine.

________________________________

From: Avram Aelony <AvramAelony@eharmony.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Wed, 16 Sep 2009 11:23:45 -0700
To: <hive-user@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

Sounds great, Prasad.

As long as I can further parse the filename field to piece out (new)
derived fields, I will be happy. J For example, in a later query I'd
like to be able to do something like:

select
substr(filename, 4, 7) as class_A,
substr(filename, 8, 10) as class_B
count( x ) as cnt
from FOO
group by
substr(filename, 4, 7),
substr(filename, 8, 10) ;


thanks,
-A



From: Prasad Chakka
Sent: Wednesday, September 16, 2009 11:10 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think this can be a good feature though I would like the filename to
be a partition column (one of such) instead of a separate type of
column. Would that work?

create external table FOO ( <list of fields and types> ) row format
delimited fields terminated by ','
partitioned by (file_name FILENAME)
stored as textfile location 's3:/somebucket/';

Or table partitioned by datestamp and filename

create external table FOO ( <list of fields and types> ) row format
delimited fields terminated by ','
Partitioned by (ds STRING, file_name FILENAME) stored as textfile
location 's3:/somebucket/';


So FILENAME becomes a new type. I like this because partition columns
are virtual columns just like the filename column and do not exist
along with data on the disk.

Prasad

________________________________

From: Avram Aelony <AvramAelony@eharmony.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hive-user@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M
compressed
) with Hive that reside in an S3 bucket. Although the files share the
same schema, they have individual filenames that provide useful
information that does not get captured and does not exist separately
as a column within each file's data. As a general problem, I'd like
to be able to add a new column via Hive that contains the filename of
the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container
bucket, and I am thinking that at some point Hadoop or Hive must have
a file handle with the filenames that perhaps could be of use. My
hope is that this information could be added in (upon request) via
Hive. Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO ( <list of fields and types> ) row format
delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this? Is there a way to add a new column within
Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 10 of 13 | next ›
Discussion Overview
groupuser @
categorieshive, hadoop
postedSep 16, '09 at 5:49p
activeSep 16, '09 at 7:37p
posts13
users5
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase