FAQ
Hey,
I'm pretty new to Hadoop.

I need to sort a metafile (TBs in size) and thought of using the Hadoop Sort
example for it.
My input metafile is a raw binary stream made up of fixed-size records of
40 bytes each.
Every record looks like this:

long a;    (the <key>, 8 bytes; the remaining fields form the <value>, 32 bytes)
long b;
int c;
int d;
int e;
int unprocessed;
int compress_attempted;
int gatherer;
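
To make the layout above concrete, here is a minimal sketch of how one 40-byte record could be split into its 8-byte key and 32-byte value in plain Java. The class name MetaRecord is hypothetical, and the little-endian byte order is an assumption (it matches a C program writing on x86, but the post does not say which byte order the file uses).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: splits one 40-byte record into its 8-byte key and 32-byte value. */
final class MetaRecord {
    static final int KEY_LEN = 8;      // long a
    static final int VALUE_LEN = 32;   // long b + six ints
    static final int RECORD_LEN = KEY_LEN + VALUE_LEN;

    final long key;       // field "a"
    final byte[] value;   // the remaining 32 bytes, kept opaque

    MetaRecord(long key, byte[] value) {
        this.key = key;
        this.value = value;
    }

    /** Decodes the record that starts at offset. Little-endian order is an assumption. */
    static MetaRecord decode(byte[] buf, int offset) {
        if (buf.length - offset < RECORD_LEN) {
            throw new IllegalArgumentException("need " + RECORD_LEN + " bytes");
        }
        ByteBuffer bb = ByteBuffer.wrap(buf, offset, RECORD_LEN)
                                  .order(ByteOrder.LITTLE_ENDIAN);
        long key = bb.getLong();          // first 8 bytes
        byte[] value = new byte[VALUE_LEN];
        bb.get(value);                    // next 32 bytes
        return new MetaRecord(key, value);
    }
}
```

If the file was written on a big-endian machine, switching to ByteOrder.BIG_ENDIAN would be the only change needed.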


I have created *FpMetaId.java (extends BytesWritable)* for the <key> and
*FpMetadata.java (extends BytesWritable)* for the <value>.

My sole aim is to get these 40-byte records sorted with the fp (double)
as the key, and then to write the sorted records back into a metafile
(exactly like my old metafile, still raw binary, but with the records sorted).
I also implemented:

*MetafileInputFormat.java (extends SequenceFileAsBinaryInputFormat)*: an
input format compatible with my record.
*MetafileOutputFormat<K, V> extends SequenceFileOutputFormat*: an output
format compatible with my record.
*MetafileRecordReader.java (extends
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader)*: a
record reader compatible with my record.

The MetafileRecordWriter class is implemented within my
MetafileOutputFormat.java file.
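
For illustration only, the core of a reader for such fixed-size records can be sketched without any Hadoop classes. FixedRecordReader is a hypothetical name, not the MetafileRecordReader from the post, and a real Hadoop record reader would also have to deal with input splits, which this sketch ignores.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: reads consecutive 8-byte keys and 32-byte values from a stream. */
final class FixedRecordReader {
    static final int KEY_LEN = 8;
    static final int VALUE_LEN = 32;

    private final DataInputStream in;
    final byte[] key = new byte[KEY_LEN];
    final byte[] value = new byte[VALUE_LEN];

    FixedRecordReader(InputStream in) {
        this.in = new DataInputStream(in);
    }

    /**
     * Reads the next record into key and value.
     * Returns false at a clean end of input; a record that is cut short
     * makes readFully throw EOFException.
     */
    boolean next() throws IOException {
        int first = in.read();
        if (first < 0) {
            return false;                      // no more records
        }
        key[0] = (byte) first;
        in.readFully(key, 1, KEY_LEN - 1);     // rest of the key
        in.readFully(value);                   // whole value
        return true;
    }
}
```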

Let me walk you through the sequence of events that followed:

1) I resolved all the errors in the Writable classes (FpMetaId, FpMetadata),
the in/out formats (MetafileInputFormat, MetafileOutputFormat) and the
RecordReader I implemented.

2) I copied the Writables to the /io folder and the other new files to the
/mapred folder, then built it successfully.

3) I modified the Sort file (the example I want to run) to use FpMetaId as
key and FpMetadata as value, and imported these new classes in the file. I
changed the default conf settings to the required Writables and
RecordReaders. I then built Hadoop using the ant command, and the build
succeeded.

*Q) Does this ensure that all the new changes are reflected in the jar? (Am
I ready to run the sort function?)*

4) As mentioned above, my input is a sequential binary file made of a
repeating (key, value) data structure. So I wrote a C program that generates
random values for the data structure and writes the (key, value) records
sequentially, in binary, to a file. I gave this file as input to the sort,
which should sort the (key, value) records by key. I got the error:
fp_input not a SequenceFile (fp_input is my input file). I thought
SequenceFiles were just a stream of binary data. Do they have a specific
format?

*Command used : bin/hadoop jar hadoop-0.20.2-examples.jar sort fp_input
fp_output*

*Q) What does this imply? I have no clue how to proceed further. Again, is
it because the jar file I used does not have the latest libraries? I could
not find any good tutorials on this.*

It would be great if someone could offer a helping hand to this noob.

Thanks,
Matthew John


  • Ted Yu at Sep 8, 2010 at 4:00 am
    Please get hadoop source code and read the comment at the beginning of
    SequenceFile.java:
    * <p>Essentially there are 3 different formats for
    <code>SequenceFile</code>s
    ...
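
As a small illustration of why a raw binary file is rejected: a SequenceFile begins with a header whose first three bytes are 'S', 'E', 'Q', followed by a version byte, and a file without them produces the "not a SequenceFile" error. The sketch below checks only those three bytes using plain Java I/O; SeqMagic and looksLikeSequenceFile are hypothetical names, not part of Hadoop.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: checks whether a stream starts with the SequenceFile magic bytes. */
final class SeqMagic {
    /** Returns true if the first three bytes of the stream are 'S', 'E', 'Q'. */
    static boolean looksLikeSequenceFile(InputStream in) throws IOException {
        byte[] magic = new byte[3];
        try {
            new DataInputStream(in).readFully(magic);
        } catch (EOFException e) {
            return false;   // fewer than 3 bytes available
        }
        return magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q';
    }
}
```

A file produced by a C program that writes bare 40-byte records would fail this check, since it starts directly with key bytes.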
  • Matthew John at Sep 8, 2010 at 3:03 pm
    Thanks for the reply, Ted!

    What I understand is that a SequenceFile has a header followed by the
    records, each in the format record length, key length, key, value, with a
    sync marker appearing at regular intervals.
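
The per-record layout described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SequenceFile writer itself: RecordFraming is a hypothetical class, the file header and the sync markers are left out, and the real writer serializes keys and values through their Writable classes, which may add their own length prefixes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical sketch of the framing: record length, key length, key, value. */
final class RecordFraming {
    /** Frames one record; the record length is taken as key length plus value length. */
    static byte[] frame(byte[] key, byte[] value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(key.length + value.length); // record length (big-endian int)
        out.writeInt(key.length);                // key length
        out.write(key);                          // key bytes
        out.write(value);                        // value bytes
        out.flush();
        return bytes.toByteArray();
    }
}
```

With an 8-byte key and a 32-byte value, each framed record takes 48 bytes instead of 40, which is one reason a file of bare records cannot be read as a SequenceFile.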

    It would be great if someone could take a look at the following.

    Q 1) My file is in the format: header (a different one) followed by
    records (key, value). In this case the size of the record and of the key
    is fixed. I would like to know *if I can modify the core code to make the
    SequenceFile format look like this*. If yes, what code should I look at?

    Q 2) *What is a sync marker (can we define it)?* Obviously my file does
    not have one. Can someone suggest a way around this obstacle? My final
    aim is to take this file in, sort it by key and write out the sorted
    file.
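
To pin down what the desired output looks like, here is a small in-memory sketch that sorts bare 40-byte records by their 8-byte key. It is only meant for tiny test files, not for TBs of data. InMemoryRecordSort is a hypothetical name, and the sketch assumes that the file's own header has already been stripped, that keys are little-endian, and that they compare as signed longs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: sorts fixed 40-byte records by their leading 8-byte key. */
final class InMemoryRecordSort {
    static final int RECORD_LEN = 40;

    /** Reads the key from the first 8 bytes; little-endian order is an assumption. */
    static long keyOf(byte[] record) {
        return ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    /** Returns a new array holding the same records in ascending key order. */
    static byte[] sort(byte[] data) {
        if (data.length % RECORD_LEN != 0) {
            throw new IllegalArgumentException("length is not a multiple of " + RECORD_LEN);
        }
        int n = data.length / RECORD_LEN;
        byte[][] records = new byte[n][];
        for (int i = 0; i < n; i++) {
            records[i] = Arrays.copyOfRange(data, i * RECORD_LEN, (i + 1) * RECORD_LEN);
        }
        // Signed comparison of keys; each value travels with its key.
        Arrays.sort(records, Comparator.comparingLong(InMemoryRecordSort::keyOf));
        byte[] out = new byte[data.length];
        for (int i = 0; i < n; i++) {
            System.arraycopy(records[i], 0, out, i * RECORD_LEN, RECORD_LEN);
        }
        return out;
    }
}
```

Running this on a small sample gives a reference output to compare against whatever the Hadoop job eventually produces.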

    Thanks,
    Matthew

Discussion Overview
group: common-user @
categories: hadoop
posted: Sep 8, '10 at 3:13a
active: Sep 8, '10 at 3:03p