Architectural question
Hi all,
I have an architectural question.
For my app I have 50 GB of persistent data, stored in HDFS as a simple CSV
file.
The app also takes 10 GB of input data, likewise in CSV format, which it runs
over the 50 GB of persistent data.
The persistent data and the input data don't have common keys.

In my cluster I have 5 data nodes.
The app simply matches every line of the input data against every line of the
persistent data.

To solve this task I see two different approaches:
1. Distribute the input file to every node using the -files option and run the
job (see the sketch after this message). But in this case every map task will
go through the full 10 GB of input data.
2. Divide the input file (10 GB) into 5 parts (for instance), run 5 independent
jobs (one per data node, for instance), and give each job 2 GB of data. In this
case every map task only goes through 2 GB of data. In other words, I'll give
every map node its own input data. But the drawback of this approach is the
work I have to do before the job starts and after it finishes.

Or maybe there is a more subtle way in Hadoop to do this work?

--
View this message in context: http://old.nabble.com/Architectural-question-tp31365863p31365863.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
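
A minimal sketch of approach 1, assuming the org.apache.hadoop.mapreduce API:
the job is launched through ToolRunner so that the generic -files option is
honoured, and the mapper reads the shipped copy of the input file from its
local working directory in setup(). The file name input.csv, the class names,
and the exact-line matching rule are illustrative assumptions, and the side
data is buffered in memory only to keep the sketch short; at 10 GB it would
have to be streamed or split, which is what the replies below discuss.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Approach 1: ship the 10 GB file to every task with -files and compare
// each line of the 50 GB persistent data against it in the mapper.
public class MatchJob extends Configured implements Tool {

  public static class MatchMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final List<String> inputLines = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException {
      // -files puts input.csv into the task's local working directory.
      // Buffered in memory here only to keep the sketch short; a real
      // 10 GB file would need to be streamed or split instead.
      BufferedReader reader = new BufferedReader(new FileReader("input.csv"));
      String line;
      while ((line = reader.readLine()) != null) {
        inputLines.add(line);
      }
      reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text persistent, Context context)
        throws IOException, InterruptedException {
      // Compare this line of the persistent data against every input line.
      for (String input : inputLines) {
        if (persistent.toString().equals(input)) {   // placeholder match rule
          context.write(new Text(persistent + "\t" + input), NullWritable.get());
        }
      }
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "match");
    job.setJarByClass(MatchJob.class);
    job.setMapperClass(MatchMapper.class);
    job.setNumReduceTasks(0);                 // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // 50 GB data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // e.g. hadoop jar match.jar MatchJob -files input.csv /data/persistent /out
    System.exit(ToolRunner.run(new MatchJob(), args));
  }
}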


  • Mehmet Tepedelenlioglu at Apr 10, 2011 at 11:29 pm
    My understanding is that you have two sets of strings, S1 and S2, and you want to mark all strings
    that belong to both sets. If this is correct, then:

    Mapper: for every string K in Si (i is 1 or 2), emit key K with value i.
    Reducer: for key K, if the list of values includes both 1 and 2, you have a match; emit K MATCH, else emit K NO_MATCH (or nothing).

    I assume that the load is not terribly unbalanced. The same logic works for the intersection of any number of sets: mark the members with their sets, then reduce over them to see whether they belong to every set.

    Good luck.
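
    A minimal sketch of this tag-and-intersect idea, assuming the
    org.apache.hadoop.mapreduce API and that the set a line belongs to can be
    recognised from its input path (both are assumptions about the job layout,
    not part of the suggestion above):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Tag every line with the set it came from; a line that arrives at the
    // reducer with both tags belongs to the intersection.
    public class IntersectionSketch {

      public static class TagMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Assumption: set membership is recognisable from the input path,
          // e.g. the persistent data lives under a directory named "persistent".
          String path = ((FileSplit) context.getInputSplit()).getPath().toString();
          int setId = path.contains("persistent") ? 1 : 2;
          context.write(line, new IntWritable(setId));
        }
      }

      public static class IntersectReducer
          extends Reducer<Text, IntWritable, Text, Text> {
        @Override
        protected void reduce(Text line, Iterable<IntWritable> tags, Context context)
            throws IOException, InterruptedException {
          boolean inSet1 = false;
          boolean inSet2 = false;
          for (IntWritable tag : tags) {
            if (tag.get() == 1) inSet1 = true;
            if (tag.get() == 2) inSet2 = true;
          }
          context.write(line, new Text(inSet1 && inSet2 ? "MATCH" : "NO_MATCH"));
        }
      }
    }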

  • Ted Dunning at Apr 11, 2011 at 2:08 am
    The original poster said that there was no common key. Your suggestion
    presupposes that such a key exists.
  • Sumit ghosh at Apr 11, 2011 at 8:42 am
    The original posting said, "The app simply matches every line of the input
    data against every line of the persistent data." So the "key" can simply be
    a line from the 10 GB input, or a hash of it, which is then matched against
    the line or hash from the persistent store.
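
    As a hedged illustration of keying on a hash, the tag mapper sketched
    earlier could emit an MD5 digest of each line as the key and carry the raw
    line as the value, so the shuffle moves fixed-size keys while the reducer
    can still confirm the match (the class name and tagging scheme here are
    hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MD5Hash;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical variant of the tag mapper: key on an MD5 digest of the
    // line instead of the raw text. The raw line travels as the value so the
    // reducer can verify the match despite the (tiny) chance of a collision.
    public class DigestTagMapper
        extends Mapper<LongWritable, Text, MD5Hash, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // A set tag (1 or 2) could be prefixed to the value, as in the earlier sketch.
        context.write(MD5Hash.digest(line.toString()), new Text(line));
      }
    }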



  • Mehmet Tepedelenlioglu at Apr 11, 2011 at 3:43 pm
    That is how I interpreted it, but if by "simple" some matching function other
    than the most obvious one is meant, it is still possible to extend the Text
    class and override its hashCode and equals methods to accommodate the new
    sort of equality.
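
    A rough sketch of that idea, using a case-insensitive comparison as a
    stand-in for the "new sort of equality" (the actual rule is an assumption);
    note that compareTo has to agree with the redefined equality as well, since
    the sorting and grouping of keys rely on it:

    import org.apache.hadoop.io.BinaryComparable;
    import org.apache.hadoop.io.Text;

    // A Text subclass with a redefined notion of equality, illustrated here
    // with a case-insensitive match (the real matching rule is up to you).
    // hashCode() must agree with equals() so HashPartitioner sends equal keys
    // to the same reducer, and compareTo() must agree so the sort/group phase
    // treats them as one key.
    public class FuzzyText extends Text {

      public FuzzyText() {
      }

      public FuzzyText(String s) {
        super(s);
      }

      private String canonical() {
        return toString().toLowerCase();        // placeholder normalisation
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof FuzzyText
            && canonical().equals(((FuzzyText) o).canonical());
      }

      @Override
      public int hashCode() {
        return canonical().hashCode();
      }

      @Override
      public int compareTo(BinaryComparable other) {
        // Assumes the other key is also Text-based.
        return canonical().compareTo(((Text) other).toString().toLowerCase());
      }
    }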

  • Ted Dunning at Apr 11, 2011 at 2:07 am
    There are no subtle ways to deal with quadratic problems like this. They
    just don't scale.

    Your suggestions are roughly on track. When matching 10 GB against 50 GB,
    the choice of which data set to feed to the mappers depends a lot on how
    much you can buffer in memory and how long such a buffer takes to build.

    If you can't hold the entire 10 GB of data in memory at once, then consider
    a program like this (a code sketch follows the steps below):

    a) split the 50 GB of data across as many mappers as you have, using
    standard methods

    b) in the mapper, emit each record several times with keys of the form (i,
    j), where i cycles through [0, n) and is incremented once for each record
    read, and j cycles through [0, m) and is incremented each time you emit a
    record. Choose m so that 1/m of your 10 GB data will fit in a reducer's
    memory. Choose n so that n x m is as large as your desired number of
    reducers.

    c) in the reducer, you will get some key (i, j) and an iterator over a number
    of records. Read the j-th segment of your 10 GB data and compare each of
    the records that the iterator gives you to that data. If you made n = 1 in
    step (b), then you will have at most m-way parallelism in this step. If n
    is large, however, your reducer may need to read the same segment of your
    10 GB data more than once. In such conditions you may want to sort the
    records and remember which segments you have already read.
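
    A rough sketch of steps (a)-(c), assuming the 10 GB side data has been
    pre-split into m files named segment-0 ... segment-(m-1) in an HDFS
    directory, and that n, m, and that directory are passed in the job
    configuration; all names and the exact-match rule are illustrative:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BlockJoinSketch {

      // Step (b): emit each 50 GB record once per side-data segment.
      public static class ReplicateMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private int n;
        private int m;
        private long recordCount = 0;

        @Override
        protected void setup(Context context) {
          n = context.getConfiguration().getInt("join.n", 1);
          m = context.getConfiguration().getInt("join.m", 5);
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
          // i spreads records over n groups; j names the side-data segment.
          int i = (int) (recordCount++ % n);
          for (int j = 0; j < m; j++) {
            context.write(new Text(i + "," + j), record);
          }
        }
      }

      // Step (c): load the j-th segment of the 10 GB data and compare.
      public static class SegmentJoinReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
          Configuration conf = context.getConfiguration();
          int j = Integer.parseInt(key.toString().split(",")[1]);

          // One segment is sized (via m) to fit in memory. A reducer that sees
          // several keys with the same j re-reads that segment, as noted above.
          List<String> segment = new ArrayList<String>();
          Path segmentPath = new Path(conf.get("join.segment.dir"), "segment-" + j);
          BufferedReader reader = new BufferedReader(
              new InputStreamReader(FileSystem.get(conf).open(segmentPath)));
          String line;
          while ((line = reader.readLine()) != null) {
            segment.add(line);
          }
          reader.close();

          for (Text record : records) {
            for (String candidate : segment) {
              if (record.toString().equals(candidate)) {   // placeholder match rule
                context.write(record, new Text(candidate));
              }
            }
          }
        }
      }
    }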

    In general, though, as I mentioned this is not a scalable process and as
    your data grows it is likely to become untenable.

    If you can split your data into pieces and estimate which piece each record
    should be matched to, then you might be able to make the process more
    scalable. Consider indexing techniques to do this rough targeting. For
    instance, if you are trying to find the closest few strings based on edit
    distance, you might be able to use n-grams to get approximate matches via a
    text retrieval index. This can substantially reduce the cost of your
    algorithm.
  • Daniel McEnnis at Apr 11, 2011 at 2:17 am
    Dear oleksiy,

    You can read and compare from the file without fully loading it into
    memory in each map process. This will save memory on your cluster.
    Otherwise, there is no quick solution, as every other approach still
    requires you to have the 10 GB in memory (in pieces or in whole) for
    every entry of the 50 GB file.

    Daniel.
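
    A minimal sketch of this streaming comparison, assuming the 10 GB input
    file has been shipped to the tasks (for example with -files) under the
    hypothetical name input.csv; it is memory-light, but the side file is
    re-read once per persistent record, so it trades RAM for a great deal of
    I/O:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Stream the shipped side file from local disk for each record instead of
    // holding it in memory.
    public class StreamingCompareMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      @Override
      protected void map(LongWritable offset, Text persistent, Context context)
          throws IOException, InterruptedException {
        // "input.csv" is assumed to sit in the task's local working directory.
        BufferedReader reader = new BufferedReader(new FileReader("input.csv"));
        try {
          String inputLine;
          while ((inputLine = reader.readLine()) != null) {
            if (persistent.toString().equals(inputLine)) {  // placeholder match rule
              context.write(new Text(persistent + "\t" + inputLine), NullWritable.get());
            }
          }
        } finally {
          reader.close();
        }
      }
    }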

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Apr 10, 2011 at 9:10 PM
active: Apr 11, 2011 at 3:43 PM
posts: 7
users: 5
website: hadoop.apache.org...
irc: #hadoop
