Hi guys,

If I want to share a file via the distributed cache from a Cascalog
query, do I have to fall back to Hadoop's Java APIs? Or is there a
Cascalog way of doing it?


  • Sam Ritchie at Jan 4, 2012 at 6:24 pm
    There isn't any way of doing this currently. What are you trying to share
    in the distributed cache? One other option might be to distribute the value
    to each operation as a parametrized argument, as with

    (defmapop [add-n [n]]
      [x]
      (+ x n))

    but this gets a little flaky with large data structures as arguments.

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Gerrard McNulty at Jan 4, 2012 at 9:48 pm
    I'm using MaxMind GeoIP to do different kinds of lookups on IP addresses.
    They export their databases as CSV or .dat files (with a library to access
    the .dat). I'd like to make Cascalog queries based on lookup information
    without performing an expensive join or adding the lookups in advance to
    my data.

    It seems pushing the .dat file out to the distributed cache is the
    quickest way to do this, but of course I'm open to suggestions :)
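
    For reference, falling back to Hadoop's Java API from Clojure is mostly
    plain interop. A minimal sketch, assuming Hadoop 1.x's
    org.apache.hadoop.filecache.DistributedCache is on the classpath (the
    HDFS path below is illustrative):

    ```clojure
    (import '(org.apache.hadoop.filecache DistributedCache)
            '(org.apache.hadoop.mapred JobConf)
            '(java.net URI))

    ;; At job-setup time: register a file already on HDFS for caching.
    (defn add-dat-to-cache [^JobConf conf]
      (DistributedCache/addCacheFile (URI. "/cache/GeoLiteCity.dat") conf))

    ;; Inside a task: find the local copy of the cached file.
    (defn cached-dat-path [^JobConf conf]
      (first (DistributedCache/getLocalCacheFiles conf)))
    ```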

  • Andrew Xue at Jan 5, 2012 at 4:39 am
    I do something similar, but I just jar the CSV files up into the uberjar
    instead of using the distributed cache.

    I wonder if there's a performance difference between that and the
    distributed cache, though; they both (app jar and distributed cache
    files) get copied from master to slaves all the same, so it seems
    equivalent?
  • Andrew Xue at Jan 5, 2012 at 4:42 am
    Also, take a look at this:

    https://gist.github.com/872918

    "join a small file that can fit in memory, map-side"

    My solution, which is pretty heavy-handed but fine since my data is small
    enough, is to load the CSV files up into a hashmap and do lookups from
    that.
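
    A concrete sketch of that approach (the two-column key,value layout and
    toy data below are made up for illustration; MaxMind's real CSV schema
    differs, and the reader would normally point at a file bundled in the
    uberjar):

    ```clojure
    (require '[clojure.java.io :as io]
             '[clojure.string :as str])

    ;; Load a two-column CSV (key,value) into a hashmap once, then do
    ;; constant-time lookups from it.
    (defn csv->map [rdr]
      (into {}
            (for [line (line-seq rdr)
                  :let [[k v] (str/split line #",")]]
              [k v])))

    (def lookup-table
      (with-open [r (io/reader (java.io.StringReader. "1.0.0.0,AU\n2.0.0.0,FR"))]
        (csv->map r)))

    (defn lookup [ip]
      (get lookup-table ip))

    (lookup "1.0.0.0")
    ;; => "AU"
    ```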
  • Rweald at Jan 5, 2012 at 8:50 pm
    We are currently bundling the .dat file as part of the uberjar and then
    using the MaxMind Java API through Clojure. It works well for us and is
    still quite concise.

    (def lookup
      (LookupService. (.getPath (clojure.java.io/resource "GeoLiteCity.dat"))
                      LookupService/GEOIP_MEMORY_CACHE))

    Then to perform a lookup we simply define a function that takes the IP
    address as an argument:

    (defn geocode_ip [ip]
      (.getLocation lookup ip))

    If you discover a more idiomatic way I would love to hear about it.

    Ryan Weald
    Engineer @ Sharethrough <http://sharethrough.com>
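
    One possible refinement of that snippet: a top-level def constructs the
    LookupService as soon as the namespace loads, which can be fragile inside
    task JVMs. Wrapping it in a delay defers construction to first use. A
    sketch, assuming the same MaxMind com.maxmind.geoip.LookupService API is
    on the classpath:

    ```clojure
    ;; Defer LookupService construction until the first lookup on the
    ;; worker, instead of running it at namespace load time.
    (def lookup
      (delay (LookupService. (.getPath (clojure.java.io/resource "GeoLiteCity.dat"))
                             LookupService/GEOIP_MEMORY_CACHE)))

    (defn geocode_ip [ip]
      (.getLocation @lookup ip))
    ```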
  • Andrew Xue at May 18, 2012 at 6:38 am
    Hey Ryan -- Do you run this code on Amazon EMR? I used this solution for
    the MaxMind GeoIP lookups, and for some reason using the GeoIP Java API
    (version 1.2.5) leads to a really odd, hard-to-diagnose classloading
    issue and lots of task failures -- have you ever encountered anything
    like that?


Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Jan 4, '12 at 1:49p
active: May 18, '12 at 6:38a
posts: 7
users: 4
website: clojure.org
irc: #clojure
