hadoop MapReduce and stop words
Hi,

I am trying to include stop-word filtering in Hadoop MapReduce, and later on
in Hive. What is the accepted solution for handling stop words in Hadoop?

All I can think of is to load all the stop words into an array in the mapper
and then check each token against the array (for n tokens and m stop words
this would be O(n*m)).

Regards


  • Tim robertson at May 16, 2009 at 12:55 pm
    Perhaps some kind of in-memory index would be better than iterating over
    an array? A binary tree or similar.
    I did something similar with polygon indexes and point data. It requires
    careful memory planning on the nodes if the indexes are large (mine
    were several GB).

    Just a thought,

    Tim
  • PORTO aLET at May 16, 2009 at 1:23 pm
    Can you please elaborate on the in-memory index?
    What kind of software did you use to implement this?
    Regards
  • Tim robertson at May 16, 2009 at 1:48 pm
    Try googling "binary tree java" and you will get loads of hits...

    Below is a simple implementation, but I am sure there are better ones
    that handle balancing properly.

    Cheers
    Tim


    public class BinaryTree {

        public static void main(String[] args) {
            BinaryTree bt = new BinaryTree();

            for (int i = 0; i < 10000; i++) {
                bt.insert("" + i);
            }

            System.out.println(bt.lookup("999"));
            System.out.println(bt.lookup("100"));
            System.out.println(bt.lookup("a")); // should be null
        }

        private Node root;

        private static class Node {
            Node left;
            Node right;
            String value;

            public Node(String value) {
                this.value = value;
            }
        }

        // Returns the stored value if the key is present, null otherwise.
        public String lookup(String key) {
            return lookup(root, key);
        }

        private String lookup(Node node, String value) {
            if (node == null) {
                return null;
            }
            if (value.equals(node.value)) {
                return node.value;
            } else if (value.compareTo(node.value) < 0) {
                return lookup(node.left, value);
            } else {
                return lookup(node.right, value);
            }
        }

        public void insert(String value) {
            root = insert(root, value);
        }

        private Node insert(Node node, String value) {
            if (node == null) {
                node = new Node(value);
            } else if (value.compareTo(node.value) <= 0) {
                node.left = insert(node.left, value);
            } else {
                node.right = insert(node.right, value);
            }
            return node;
        }
    }
  • Stefan Will at May 16, 2009 at 8:35 pm
    Just use a java.util.HashSet for this. There should only be a few dozen
    stopwords, so load them into a HashSet when the Mapper starts up, and then
    check your tokens against it while you're processing records.

    -- Stefan
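    The HashSet approach might be sketched roughly as below. The class name
    and word list here are illustrative, not from the thread; in a real
    Mapper the set would be populated once when the mapper is initialized
    (e.g. in its setup/configure hook), not per record.

    ```java
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: constant-time stop-word lookup via a HashSet.
    // In a Hadoop Mapper, build the set once at mapper startup.
    public class StopWordFilter {

        private static final Set<String> STOP_WORDS = new HashSet<String>(
                Arrays.asList("a", "an", "and", "of", "the", "to"));

        public static boolean isStopWord(String token) {
            // O(1) expected lookup, versus O(m) for a linear array scan
            return STOP_WORDS.contains(token.toLowerCase());
        }

        public static void main(String[] args) {
            System.out.println(isStopWord("The"));    // true
            System.out.println(isStopWord("hadoop")); // false
        }
    }
    ```

    Checking each of n tokens against the set is then O(n) overall, rather
    than O(n*m) for a scan over m stop words.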


Discussion Overview
group: common-user
categories: hadoop
posted: May 16, '09 at 11:57a
active: May 16, '09 at 8:35p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
