I think the default TextInputFormat can meet my requirement. The JavaDoc
of TextInputFormat says that it divides the input file into text lines
that end with CRLF. However, I'd like to know: if the FileSplit size is
not a multiple of the line length, what will eventually happen?

BR/anderson
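
The question above, what happens when a split boundary falls in the middle of a line, can be illustrated with a small stand-alone model. This is only a sketch of the usual convention for line-oriented readers over byte ranges, which to my understanding is also what the record reader behind TextInputFormat does: every split except the first discards the partial line it starts in, and every split finishes the last line it starts even if that line runs past the split's end. The class `SplitLineDemo` and the method `readSplit` are hypothetical names made up for this illustration; this is not Hadoop code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of split-boundary handling; not Hadoop's implementation.
final class SplitLineDemo {

    /** Returns the lines owned by the byte range [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;

        if (start != 0) {
            // Back up one byte and discard through the next newline. If the split
            // begins exactly at a line start, only the previous newline is discarded.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it STARTS before 'end';
        // it is read to completion even when it extends beyond 'end'.
        while (pos < end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            int textEnd = lineEnd;
            if (textEnd > pos && data[textEnd - 1] == '\r') {
                textEnd--; // drop the CR of a CRLF terminator
            }
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nbb,22\nccc,333\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 7; // deliberately not a multiple of any line length
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println("split@" + start + " -> " + readSplit(data, start, splitSize));
        }
    }
}
```

With the 18-byte sample and a 7-byte split size, the first split yields two lines, the second yields one, and the third yields none, so every line is read exactly once and no line is cut in half. Note that this only guarantees whole lines; a CSV field containing an embedded newline would still be broken by any line-based approach.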

-----Original Message-----
From: jason hadoop
Sent: Wednesday, June 10, 2009 8:39 PM
To: core-user@hadoop.apache.org
Subject: Re: Large size Text file split

There is always NLineInputFormat. You specify the number of lines per
split.
The key is the position of the line start in the file, value is the line
itself.
The parameter mapred.line.input.format.linespermap controls the number
of lines per split.
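
To make the description above concrete, here is a minimal stand-alone model of what "N lines per split, key is the line's start position, value is the line" means. It does not use Hadoop at all: `NLineSplitDemo`, `LineRecord`, and `splitByLines` are hypothetical names invented for this sketch, and the `linesPerSplit` argument merely stands in for the mapred.line.input.format.linespermap setting mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of N-lines-per-split grouping; not the NLineInputFormat source.
final class NLineSplitDemo {

    /** One record: byte offset of the line start (key) and the line text (value). */
    static final class LineRecord {
        final long offset;
        final String line;

        LineRecord(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }

        @Override
        public String toString() {
            return offset + "=" + line;
        }
    }

    /** Groups the lines of 'text' into splits holding at most 'linesPerSplit' records each. */
    static List<List<LineRecord>> splitByLines(String text, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        List<List<LineRecord>> splits = new ArrayList<>();
        List<LineRecord> current = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            // Key is the byte offset where the line starts; value is the line text.
            current.add(new LineRecord(pos,
                    new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8)));
            if (current.size() == linesPerSplit) {
                splits.add(current);
                current = new ArrayList<>();
            }
            pos = lineEnd + 1;
        }
        if (!current.isEmpty()) {
            splits.add(current); // the final split may hold fewer lines
        }
        return splits;
    }

    public static void main(String[] args) {
        for (List<LineRecord> split : splitByLines("a,1\nbb,22\nccc,333\n", 2)) {
            System.out.println(split);
        }
    }
}
```

The trade-off in this model is that splits are sized by line count rather than by bytes, so every split is a run of whole lines by construction, at the cost of having to scan the input to find the line boundaries before any split can be handed out.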
On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi wrote:
On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wrote:

Hi, all

I have a large CSV file (larger than 10 GB). I'd like to use a
certain InputFormat to split it into smaller parts so that each Mapper
can deal with a piece of the CSV file. However, as far as I know,
FileInputFormat only cares about the byte size of the file; that is, the
class can divide the CSV file into many parts, and some parts may
not be well-formed CSV files.
For example, one line of the CSV file may not be terminated with CRLF,
or some text may be trimmed.

How can I ensure that each FileSplit is a smaller valid CSV file using a
proper InputFormat?

BR/anderson
If all you care about is the splits occurring at line boundaries, then
TextInputFormat will work.

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

If not, I guess you can write your own InputFormat class.

--
Harish Mallipeddi
http://blog.poundbang.in


--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Discussion Overview
group: common-user
categories: hadoop
posted: Jun 10, '09 at 12:07p
active: Jun 12, '09 at 7:21a
posts: 10
users: 6
website: hadoop.apache.org...
irc: #hadoop
