Hmm, I've seen SymLink mentioned, but I don't yet grasp how it works or how it applies to selecting files to process. Also, I don't have much control over how the data gets to the bucket I end up reading from, hence the need for a powerful way to select files.

Could you point me to some SymLink documentation or an example so I can give it a try?

Many thanks,

On Jan 24, 2011, at 3:03 PM, Edward Capriolo wrote:
On Mon, Jan 24, 2011 at 5:58 PM, Avram Aelony wrote:

I really like the virtual column feature in 0.7 that allows me to request INPUT__FILE__NAME and see the names of files that are being acted on.

Because I can see which files are being read, I can tell that I am spending time scanning many very large files that I do not need to process, simply because they sit in the same S3 bucket location as the files I do need.

The files I do need to process represent only a subset of all the files in the bucket. Nevertheless, the files I am interested in are quite large, large enough to make copying them to HDFS unwieldy.

Since I know the names of the files I want before the scan starts, can I be more efficient and process only a selection of files from the bucket, skipping the ones I don't need?

I guess I am still looking for something like https://issues.apache.org/jira/browse/HIVE-951
I tried sending this message to the dev list initially, but since I haven't seen a response yet, perhaps this list is more appropriate.

Any suggestions or updates on HIVE-951?

We do have the SymLink input format. It is a little more work than
HIVE-951, but it accomplishes roughly the same thing.
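
For anyone wanting to try this, here is a rough sketch of the idea rather than a tested recipe. As far as I understand it, the SymLink input format (org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat) reads small text files from the table location and treats each line in them as the path of a real input file. The snippet below builds such a file for the names you already know you want, plus a matching table definition. The bucket, file names, pattern, table name, column, and location are all made up for illustration, and whether s3n:// paths are accepted by your Hive version is an assumption you would need to check.

```python
import fnmatch

# Hypothetical listing of everything in the bucket; none of these names
# come from the thread.
ALL_PATHS = [
    "s3n://example-bucket/logs/wanted-2011-01-23.log",
    "s3n://example-bucket/logs/wanted-2011-01-24.log",
    "s3n://example-bucket/logs/other-2011-01-24.log",
]


def build_manifest(paths, pattern):
    """Return symlink-file text: one selected path per line."""
    wanted = [p for p in paths if fnmatch.fnmatch(p.rsplit("/", 1)[-1], pattern)]
    return "\n".join(wanted) + "\n"


def build_ddl(table, location):
    """Return HiveQL for an external table that reads through the symlink file."""
    return (
        f"CREATE EXTERNAL TABLE {table} (line STRING)\n"
        "STORED AS\n"
        "  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{location}';"
    )


# Select only the files matching the (hypothetical) naming pattern.
manifest = build_manifest(ALL_PATHS, "wanted-*.log")
ddl = build_ddl("selected_logs", "/user/hive/symlinks/selected_logs")

print(manifest, end="")
print(ddl)
```

The manifest text would be saved as a file inside the table's LOCATION directory, and the DDL run in Hive. Queries against the table should then touch only the listed files, which you can confirm with INPUT__FILE__NAME.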


Discussion Overview
group: user @
categories: hive, hadoop
posted: Jan 24, '11 at 10:59p
active: Jan 24, '11 at 11:12p

2 users in discussion

Avram Aelony: 2 posts Edward Capriolo: 1 post


