Assume we have a medium-sized cluster, say 20 nodes, that is used for one job
and cannot change in size.
Assume we are sorting a large data set. As we increase the size of the data
sorted, say from 100 GB to 1000 GB to 10000 GB, does the
time for the sort scale as N or as N log N? I have heard both answers, with
N log N coming largely from folks less familiar with Hadoop and
N from others with more experience, but I am skeptical. Has anyone done
tests and can contribute real data?
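
To make the two candidate models concrete, here is a rough sketch of how much each one predicts the sort time should grow across the three data sizes above. It is arithmetic only and says nothing about which model a real 20-node cluster follows. The 100-byte record size is an assumption chosen for illustration, as is the function name predicted_growth.

```python
import math

RECORD_BYTES = 100  # hypothetical record size, not taken from the post


def predicted_growth(sizes_gb, record_bytes=RECORD_BYTES):
    """Return (size_gb, linear_factor, nlogn_factor) relative to the first size."""
    records = [gb * 10**9 / record_bytes for gb in sizes_gb]
    base_n = records[0]
    base_nlogn = base_n * math.log2(base_n)
    rows = []
    for gb, n in zip(sizes_gb, records):
        # Growth predicted by an O(N) model and by an O(N log N) model.
        rows.append((gb, n / base_n, n * math.log2(n) / base_nlogn))
    return rows


for gb, linear, nlogn in predicted_growth([100, 1000, 10000]):
    print(f"{gb:>6} GB   N: {linear:6.1f}x   N log N: {nlogn:6.1f}x")
```

Under that assumed record size, a 100-fold increase in data gives a factor of 100 under the N model and roughly 122 under the N log N model, so measurements would need to be fairly precise to tell the two apart.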

Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA


  • Mohamed Riadh Trad at Jul 1, 2010 at 2:33 pm

    Has anyone addressed the compatibility of org.apache.hadoop.mapreduce.lib.input.TextInputFormat with Hadoop streaming?

    The new API generates the following exception when launching pipes jobs with org.apache.hadoop.mapreduce.lib.input.TextInputFormat as the input format instead of org.apache.hadoop.mapred.TextInputFormat.

    Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.mapreduce.lib.input.TextInputFormat not org.apache.hadoop.mapred.InputFormat

    My problem with the deprecated classes lies in mapred.min.split.size and the number of map tasks.

    I need to generate N maps on splits of approximately the same size. However, when I set mapred.min.split.size to 20 MB, I get splits of 6 to 64 MB.

    Any suggestions?
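
One possible explanation for the 6 to 64 MB range can be sketched as follows. The sketch assumes the old-API FileInputFormat picks the split size as max(minSize, min(goalSize, blockSize)), never lets a split span two files, and folds a tail into the previous split only when it is within 10% of the split size. The 64 MB block size, the file sizes, and the map-count hint of 2 are hypothetical values for illustration, not taken from the post.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed tolerance for the last split of a file


def compute_split_size(goal_size, min_size, block_size):
    """Assumed old-API rule: the minimum only matters when it exceeds the block size."""
    return max(min_size, min(goal_size, block_size))


def split_lengths(file_size, split_size):
    """Cut one file into splits; the tail becomes its own (possibly small) split."""
    lengths = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    if remaining > 0:
        lengths.append(remaining)
    return lengths


# Hypothetical input: three files, 64 MB blocks, min split 20 MB, 2 maps hinted.
files = [150 * MB, 6 * MB, 64 * MB]
goal = sum(files) // 2
size = compute_split_size(goal, 20 * MB, 64 * MB)
for f in files:
    print(f // MB, "MB file ->", [s // MB for s in split_lengths(f, size)], "MB splits")
```

If that rule holds, a 20 MB minimum has no effect with 64 MB blocks, and small files or file tails still produce small splits. Under the same assumption, a minimum larger than the block size would raise the split size, and a larger map-count hint would lower the goal size.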

    Trad Mohamed Riadh, M.Sc, Ing.
    PhD. student

    Office: 11-15
    Phone: (33)-1 39 63 59 33
    Fax: (33)-1 39 63 56 74
    Email: Riadh.Trad(a)inria.fr
    Home page: http://www-rocq.inria.fr/who/Mohamed.Trad/

Discussion Overview
group: mapreduce-user
posted: Jul 1, '10 at 5:16a
active: Jul 1, '10 at 2:33p



