Clarification: the line below should have read "many files, roughly 500MB compressed" instead of "many files, roughly 500M compressed".

From: Avram Aelony
Sent: Wednesday, September 16, 2009 10:49 AM
To: hive-user@hadoop.apache.org
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am using Hive to process a large volume of files (many files, roughly 500M compressed) that reside in an S3 bucket. Although the files share the same schema, each has its own filename, and that filename carries useful information that is not captured anywhere as a column in the file's data. As a general problem, I'd like to be able to add a new column via Hive that contains the name of the file each row was read from.
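As an illustration of the desired result only, here is a minimal sketch in plain Python rather than Hive. The function name `rows_with_filename` is hypothetical, and a local directory of comma-delimited files stands in for the S3 bucket; it simply appends the source filename to every row as an extra column.

```python
import csv
import os
import tempfile


def rows_with_filename(directory):
    """Yield every row of every file in `directory`, with the file's name appended.

    Hypothetical helper: a local directory stands in for the S3 bucket,
    and all files are assumed to be comma-delimited text with one schema.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                # The filename becomes the last column of each row.
                yield row + [name]


if __name__ == "__main__":
    # Build two small sample files and print the augmented rows.
    with tempfile.TemporaryDirectory() as tmp:
        for name, line in [("a.csv", "1,x\n"), ("b.csv", "2,y\n")]:
            with open(os.path.join(tmp, name), "w") as out:
                out.write(line)
        for row in rows_with_filename(tmp):
            print(row)
```

The point of the sketch is only the shape of the output: the original fields followed by one more field holding the filename.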

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use. My hope is that this information could be added in (upon request) via Hive. Perhaps this could be a new Hive feature request, if it does not currently exist?

Ideally, the syntax would look something like this:

create external table FOO ( <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3://somebucket/';

Has anyone thought of this? Is there a way to add a new column within Hive that contains the filename?

Many thanks in advance!!

Avram Aelony
Senior Analyst, Matching
