Manually splitting files in blocks
Dear Hadoopers,

I'm trying to find out how and where Hadoop splits a file into blocks and
decides which datanodes to send them to.

My specific problem:
I have two types of data files.
One large file is used as a database file, where the information is organized
like this:
[BEGIN DATAROW]
... lots of data 1
[END DATAROW]

[BEGIN DATAROW]
... lots of data 2
[END DATAROW]
and so on.

The other, smaller files contain raw data and are to be compared against a
datarow in the large file.

So my question is: is it possible to manually control how Hadoop splits the
large data file into blocks?
Obviously I want each begin-end section to be in one block to optimize
performance. That way I can replicate the smaller files to every node, and
the nodes can work independently of one another.

Thanks, yk

  • Patrick Angeles at Mar 24, 2010 at 3:39 pm
    Yuri,

    Probably the easiest thing is to create distinct files and configure the
    block size per file so that HDFS doesn't split them into smaller blocks
    for you.
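
    A sketch of what that could look like when writing the file into HDFS (the
    path and the 256 MB figure are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithCustomBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Per-file block size: pick it large enough that a whole
            // [BEGIN DATAROW] ... [END DATAROW] section fits in one block.
            long blockSize = 256L * 1024 * 1024;
            short replication = fs.getDefaultReplication();
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            // FileSystem.create lets the block size be chosen per file,
            // overriding the cluster-wide default for just this file.
            FSDataOutputStream out = fs.create(new Path("/data/large-db-file"),
                    true, bufferSize, replication, blockSize);
            // ... write the file contents here ...
            out.close();
        }
    }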

    - P
  • Sonal Goyal at Mar 24, 2010 at 4:17 pm
    Hi Yuri,

    You can also check the source code of FileInputFormat and create your own
    RecordReader implementation.
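
    A bare-bones sketch of such an InputFormat (the class names
    DataRowInputFormat and DataRowRecordReader are only placeholders; the
    reader itself is sketched further down this thread):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class DataRowInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // The custom reader that understands the
            // [BEGIN DATAROW] ... [END DATAROW] framing.
            return new DataRowRecordReader();
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Hand each file to a single mapper as one split. This avoids a
            // record straddling a split boundary, at the cost of less
            // parallelism per file.
            return false;
        }
    }
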
    Thanks and Regards,
    Sonal
    www.meghsoft.com

  • ANKITBHATNAGAR at Mar 24, 2010 at 8:59 pm

    You should create a custom InputSplit and a custom RecordReader (the
    reader should handle the start and end tags).
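
    For example, a reader along these lines (a sketch only: DataRowRecordReader
    is an illustrative name, and it assumes the input format is not splitable,
    so every split contains whole records):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    // Returns one whole [BEGIN DATAROW] ... [END DATAROW] section per call
    // to nextKeyValue(); the key is the byte offset of the record's BEGIN tag.
    public class DataRowRecordReader extends RecordReader<LongWritable, Text> {

        private static final String BEGIN_TAG = "[BEGIN DATAROW]";
        private static final String END_TAG = "[END DATAROW]";

        private LineReader in;
        private long start, end, pos;
        private LongWritable key = new LongWritable();
        private Text value = new Text();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            start = split.getStart();
            end = start + split.getLength();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream stream = fs.open(file);
            stream.seek(start);
            in = new LineReader(stream, conf);
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            Text line = new Text();
            StringBuilder record = new StringBuilder();
            boolean inRecord = false;
            while (pos < end) {
                int read = in.readLine(line);
                if (read == 0) {
                    return false;                 // end of file
                }
                pos += read;
                String text = line.toString();
                if (text.startsWith(BEGIN_TAG)) {
                    inRecord = true;              // start a new record
                    key.set(pos - read);
                    record.setLength(0);
                } else if (text.startsWith(END_TAG) && inRecord) {
                    value.set(record.toString()); // emit the collected section
                    return true;
                } else if (inRecord) {
                    record.append(text).append('\n');
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return end == start
                    ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();       // also closes the underlying stream
            }
        }
    }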
  • Yuri K. at Mar 26, 2010 at 2:49 pm
    OK, so far so good, and thanks for the replies. I'm trying to implement a
    custom file input format, but I can only set it in the job configuration:
    job.setInputFormatClass(CustomFileInputFormat.class);

    How do I make Hadoop apply the file format, or the custom file split, when
    I upload new files to HDFS? Do I need a custom upload interface for that,
    or is there a Hadoop config option for it?

    Thanks


  • Ankit Bhatnagar at Mar 26, 2010 at 4:25 pm
    So this is how it goes:

    1- CustomInputFormat (or whatever name you choose) extends TextInputFormat
    2- this class has a method that returns the RecordReader object
    3- you also have to create a CustomRecordReader that reads the records
       from each split
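
    For instance, the job driver then only has to point at the custom format
    (class names such as DataRowInputFormat are the placeholder names used in
    the sketches earlier in this thread):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DataRowDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(DataRowDriver.class);

            // Use the custom input format for this job only.
            job.setInputFormatClass(DataRowInputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }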

    Ankit



  • Antonio Barbuzzi at Mar 26, 2010 at 5:43 pm
    HDFS splits files into chunks regardless of their content.

    This is what I have understood so far:
    an InputFormat object reads the file and returns a list of InputSplit
    objects (an InputSplit usually contains the boundaries of the file
    section your map task will read). The InputFormat also has a method
    that returns a RecordReader, which knows how to read and interpret
    your InputSplit.

    Therefore, when you have a file for which none of the existing
    InputFormat implementations work, you can:

    - if you are able to start reading from an arbitrary offset in the file
    (there are sync points in the file, such as '\n' or spaces in a text
    file), just define your own RecordReader;
    - if your file is splittable into fixed-length chunks, subclass
    InputFormat and override its getSplits method so that every split handed
    to the job contains only whole records (see the sketch below).
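
    A sketch of the second option, assuming purely hypothetical fixed-size
    records (the class is left abstract because the matching RecordReader is
    omitted, and split locality hints are skipped for brevity):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public abstract class FixedLengthInputFormat
            extends FileInputFormat<LongWritable, BytesWritable> {

        private static final long RECORD_LEN = 4096;          // illustration only
        private static final long RECORDS_PER_SPLIT = 100000; // illustration only

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            long splitSize = RECORD_LEN * RECORDS_PER_SPLIT;
            for (FileStatus file : listStatus(job)) {
                Path path = file.getPath();
                long remaining = file.getLen();
                long offset = 0;
                while (remaining > 0) {
                    long length = Math.min(splitSize, remaining);
                    // Every split begins and ends on a record boundary, so the
                    // RecordReader never sees a partial record.
                    splits.add(new FileSplit(path, offset, length, new String[0]));
                    offset += length;
                    remaining -= length;
                }
            }
            return splits;
        }
    }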

    BR,
    Antonio Barbuzzi


  • Nick Dimiduk at Mar 26, 2010 at 6:07 pm
    Replies inline.
    On Fri, Mar 26, 2010 at 7:49 AM, Yuri K. wrote:

    OK, so far so good, and thanks for the replies. I'm trying to implement a
    custom file input format, but I can only set it in the job configuration:
    job.setInputFormatClass(CustomFileInputFormat.class);

    This is exactly right. The custom input code ends up bundled in your job
    jar and is available to the job at runtime just like any other dependency
    library. Alternately, you could package your new input format into its own
    jar and "install" it onto the cluster by pushing it out to
    $HADOOP_HOME/lib on every machine. Unless you're building common
    infrastructure for a disparate set of users, I'd recommend the former
    approach.

    How do I make Hadoop apply the file format, or the custom file split, when
    I upload new files to HDFS? Do I need a custom upload interface for that,
    or is there a Hadoop config option for it?

    My understanding (please correct me, list) is that Hadoop will always split
    your files based on the block size setting. The InputSplits and
    RecordReaders are used by jobs to retrieve chunks of files for processing -
    that is, there are two separate splits happening here: one "physical" split
    for storage and one "logical" split for processing.
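
    One way to see the physical side of this is to ask HDFS for a file's block
    layout (just an illustration; the path comes from the command line):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints the HDFS block layout of a file, independent of any InputFormat.
    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }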

    Cheers,
    -Nick

  • Patrick Angeles at Mar 26, 2010 at 6:27 pm
    My understanding (please correct me, list) is that Hadoop will always split
    your files based on the block size setting. The InputSplits and
    RecordReaders are used by jobs to retrieve chunks of files for processing -
    that is, there are two separate splits happening here: one "physical" split
    for storage and one "logical" split for processing.

    That's right. The physical splits are HDFS "blocks". An InputSplit is a
    logical split and represents a unit of work that is sent to a single
    Mapper. A RecordReader provides a record-oriented view of the data. In most
    cases, the last record in each InputSplit will span the input split's
    boundary, or even a block boundary. In the latter case, data is
    transferred from a DN that holds the next contiguous block so that the
    reader can construct a full record.
