Jagaran das
at Jun 15, 2011 at 6:25 pm
Hi,
Can't we append in hadoop-0.20.203.0?
Regards,
Jagaran
________________________________
From: Dmitriy Ryaboy <dvryaboy@gmail.com>
To: user@pig.apache.org
Sent: Wed, 15 June, 2011 9:57:20 AM
Subject: Re: Running multiple Pig jobs simultaneously on same data
Yong,
You can't. Hence, immutable. It's not a database. It's a write-once file system.
Approaches to solve updates include:
1) rewrite everything
2) write a separate set of "deltas" into other files and join them in at read time (see the sketch below)
3) do 2, and occasionally run a "compaction" which does a complete rewrite based on existing deltas
4) write to something like HBase that handles all of this under the covers
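For example (purely illustrative: the paths and schema below are made up), approach 2 could look roughly like this in Pig, keeping a base file plus a delta file and letting the delta win whenever both contain the same key:

-- Hypothetical layout: base records plus later "delta" records with the same schema.
base   = LOAD '/data/users/base'   USING PigStorage('\t') AS (id:long, name:chararray, status:chararray);
deltas = LOAD '/data/users/deltas' USING PigStorage('\t') AS (id:long, name:chararray, status:chararray);

-- Join the deltas in at read time; a null delta id means "no update for this record".
joined = JOIN base BY id LEFT OUTER, deltas BY id;
merged = FOREACH joined GENERATE
            base::id AS id,
            (deltas::id IS NULL ? base::name   : deltas::name)   AS name,
            (deltas::id IS NULL ? base::status : deltas::status) AS status;

-- Storing the merged result back out is essentially the "compaction" of approach 3.
STORE merged INTO '/data/users/compacted' USING PigStorage('\t');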
D
2011/6/15 勇胡 <yongyong313@gmail.com>:
Jon,
If I want to modify data (insert or delete) in HDFS, how can I do it? From the description, I cannot directly modify the data itself (update it), and I cannot append new data to a file. How does HDFS handle data modification? I'm a little confused.
Yong
On June 15, 2011 at 3:36 PM, Jonathan Coveney <jcoveney@gmail.com> wrote:
Yong,
Currently, HDFS does not support appending to a file. So once a file is
created, it literally cannot be changed (although it can be deleted, I
suppose). This lets you avoid issues where someone does a SELECT * on the entire database while the DBA can't update a row, or other things like that. There
are some append patches in the works but I am not sure how they handle the
concurrency implications.
Make sense?
Jon
2011/6/15 勇胡 <yongyong313@gmail.com>
I read the link, and my impression is that HDFS is designed for read-frequent workloads, not write-frequent ones ("A file once created, written, and closed need not be changed.").
From your description ("Immutable means that after creation it cannot be modified."), if I understand correctly, you mean that HDFS cannot implement "update" semantics the way a database does? A write operation cannot be applied directly to a specific tuple or record; the result of a write can only be appended at the end of the file?
Regards
Yong
2011/6/15 Nathan Bijnens <nathan@nathan.gs>
Immutable means that after creation it cannot be modified.
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Simple+Coherency+Model
Does HDFS use some lock mechanism to obtain immutable data access when concurrent tasks process the same set of data, or does it use some other strategy to implement immutability?
Thanks
Yong
2011/6/14 Bill Graham <billgraham@gmail.com>
Yes, this is possible. Data in HDFS is immutable and MR tasks are spawned in their own VM, so multiple concurrent jobs acting on the same input data are fine.
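As a concrete (hypothetical) illustration, two independent scripts like these can read the same input directory at the same time, since neither one modifies it; each simply writes to its own output path:

-- jobA.pig, e.g. run with: pig -f jobA.pig
clicks  = LOAD '/data/clicks' USING PigStorage('\t') AS (user:chararray, url:chararray, ts:long);
byUser  = GROUP clicks BY user;
perUser = FOREACH byUser GENERATE group AS user, COUNT(clicks) AS n;
STORE perUser INTO '/out/clicks_by_user';

-- jobB.pig, launched concurrently against the very same input files
clicks = LOAD '/data/clicks' USING PigStorage('\t') AS (user:chararray, url:chararray, ts:long);
byUrl  = GROUP clicks BY url;
perUrl = FOREACH byUrl GENERATE group AS url, COUNT(clicks) AS n;
STORE perUrl INTO '/out/clicks_by_url';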
On Tue, Jun 14, 2011 at 11:18 AM, Pradipta Kumar Dutta <pradipta.dutta@me.com> wrote:
Hi All,
We have a requirement where we have to process the same set of data (in a Hadoop cluster) by running multiple Pig jobs simultaneously.
Any idea whether this is possible in Pig?
Thanks,
Pradipta