I have code that dynamically generates Cascalog queries. It seems to
work fine locally, so now I want to submit the queries to my Hadoop
cluster. At first I naively thought that I could set a configuration
option somewhere to point my local process at the JobTracker for my
cluster. But the examples I see talk about running Cascalog queries
directly on the cluster. Is it possible to point a local Cascalog
program at a remote Hadoop cluster?

Thanks for any pointers.
-David McNeil


  • David McNeil at Feb 9, 2012 at 7:42 pm
    To partially answer my own question: it looks like the solution is to
    place the various Hadoop config files (e.g. core-site.xml,
    mapred-site.xml, etc.) on my local classpath; the Hadoop library will
    find them there and talk to the remote cluster.
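
    For reference, here is a minimal sketch of what those files carry,
    assuming the MRv1 property names of this Hadoop generation; the
    hostnames and ports are placeholders for your own cluster:

    <!-- core-site.xml: where HDFS lives (placeholder hostname) -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: where the JobTracker listens (placeholder hostname) -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:8021</value>
      </property>
    </configuration>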

    I am still testing this to see if it works.

    -David
  • Paul Lam at Feb 10, 2012 at 1:10 pm
    That's what we do too. Then running "hadoop jar ..." locally sends the
    Cascalog jar to our remote cluster.
  • David McNeil at Feb 21, 2012 at 5:35 pm
    Based on my testing, these are the files needed on the local classpath. I
    copied these from my Hadoop cluster to the classpath of my Cascalog
    process:

    core-site.xml
    hdfs-site.xml
    mapred-site.xml
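
    As an aside, if you'd rather not manage the XML files, Cascalog can
    also take job-conf settings programmatically. A minimal sketch using
    with-job-conf from cascalog.api (the hostnames are placeholders and
    the property names assume MRv1):

    (use 'cascalog.api)

    ;; sketch: point queries executed inside this scope at a remote
    ;; cluster; some-query stands for any predefined query
    (with-job-conf {"fs.default.name"    "hdfs://namenode.example.com:8020"
                    "mapred.job.tracker" "jobtracker.example.com:8021"}
      (?- (stdout) some-query))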

    -David
  • Sam Ritchie at Mar 8, 2012 at 9:17 am
    Oh, wow, very cool. I didn't realize that was the case; I've been uploading the jar to the jobtracker manually and running commands up there. What do you do with the local process? Leave it running in a screen or something?

    --
    Sam Ritchie

  • David McNeil at Mar 9, 2012 at 1:20 pm

    On Mar 8, 3:17 am, Sam Ritchie wrote:
    What do you do with the local process? Leave it running in a screen or something?
    Our application runs as a server process that accepts user requests
    and dynamically turns them into Cascalog queries which are executed on
    our Hadoop cluster. The setup we currently have works as follows:

    * install our application jars on the Hadoop classpath via
    hadoop-env.sh
    * put the Hadoop config files pointing to our cluster on our
    application classpath (i.e. core-site.xml, hdfs-site.xml,
    mapred-site.xml)
    * currently we are running our application as a standalone process on
    the same machine that serves as the Hadoop master. I don't believe
    this is strictly necessary, but rather than keep fighting networking
    and security issues I opted to run our app on the master node for now.
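
    For context, a minimal sketch of the "dynamically turns them into
    Cascalog queries" part; the request shape, the HDFS path, and the
    contains-term? op are all hypothetical:

    (use 'cascalog.api)

    ;; hypothetical filter op: keep lines containing the user's term
    (deffilterop contains-term? [line term]
      (.contains ^String line term))

    ;; hypothetical request shape: {:path "/data/events" :term "error"}
    (defn request->query [{:keys [path term]}]
      (let [src (hfs-textline path)]
        (<- [?line]
            (src ?line)
            (contains-term? ?line term))))

    ;; runs against whichever cluster the classpath's *-site.xml files
    ;; point at
    (?- (stdout) (request->query {:path "/data/events" :term "error"}))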

    -David
  • Sthuebner at May 3, 2012 at 1:50 pm
    Another approach I've been using successfully for a while:

    We have some auxiliary boxes next to the cluster for users to SSH
    into and run Hadoop stuff (like Pig scripts, etc.).

    I use one of these boxes to:

    * copy the application JAR into my home folder there
    * run "hadoop jar <app.jar> clojure.main -i swank-server.clj" in a
    screen session
    * SSH port-forward a local port to port 4005 on the remote aux box
    (e.g. "ssh -L 4005:localhost:4005 <aux-box>")
    * M-x slime-connect to the local port

    The application jar holds loads of predefined functions, Cascalog
    operators, and queries that I can then use to query data on the
    cluster. As long as no new subqueries or operations are needed, the
    application doesn't have to be repackaged and redeployed. When they
    are needed, I write the additional operations, repackage and redeploy
    with a one-liner, and then take two more steps: restart the app on
    the remote box and M-x slime-connect to it again.

    swank-server.clj looks like this:

    (require 'swank.swank)
    ;; :host is a placeholder; "localhost" is enough when connecting
    ;; through the SSH tunnel described above
    (swank.swank/start-server :host "localhost" :port 4005)


    (swank-clojure needs to be packaged in the app JAR)


    With this setup in place I can keep a SLIME REPL open for days,
    interactively querying various datasets without restarting.


    Maybe this is useful to some.
    Stefan Hübner

