UDFContext in 0.8 LoadFunc?
Pig user mailing list (pig-user@hadoop.apache.org), January 2011
I have a custom LoadFunc (I'm actually just extending PigStorage) that
has some added logic to spider a given path and pick out the paths
that I want. I am currently doing the spidering in setLocation
because that seemed like the place to do it. It appears that
setLocation is getting called on both the client and the cluster
side, though, so my mappers are spidering a path that was already
spidered on the client (wasted effort). When the spidering covers a
lot of directories, this adds a significant amount of unneeded
overhead to my jobs.

I looked into using UDFContext to save the paths and then having the
cluster-side processes look the paths up in UDFContext and just use
them if they exist. But it looks like the actual job configuration
object is created before the calls to setLocation(), so the values I
set in UDFContext are not making it across the wire.

Is there a method that is called before setLocation that I can use to
set the value in UDFContext (I'd prefer something that is given a
Context/Configuration object)? Or is my only option to build a
Configuration object in the constructor, do the crawl, and set the
UDFContext there (and will that even work)?

--Eric
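
A minimal sketch of the pattern described above, for reference: the
directory walk lives in setLocation, which Pig calls on the client
while planning the job and again on the cluster side, so the work is
repeated. The class name and the spider() helper are hypothetical
stand-ins for the actual loader.

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.pig.builtin.PigStorage;

    public class SpideringStorage extends PigStorage {

        @Override
        public void setLocation(String location, Job job) throws IOException {
            // spider() stands in for the custom logic that walks `location`
            // and picks out the paths that should actually be loaded.
            String wanted = spider(location);  // runs on the client AND in each task
            super.setLocation(wanted, job);    // PigStorage hands this to FileInputFormat
        }

        // Placeholder for the real path-picking logic; it would return a
        // comma-separated list of paths for FileInputFormat to consume.
        private String spider(String location) throws IOException {
            return location;
        }
    }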

  • Daniel Dai at Jan 5, 2011 at 3:02 pm
    You are right: setLocation is called on the frontend, but it is called in
    the context of InputFormat.getSplits(), and by that point it is too late
    to save anything in UDFContext. Your best bet is relativeToAbsolutePath,
    which is also called on the frontend and is where you can save your stuff
    in UDFContext (see the sketch after this thread).

    Daniel

  • Eric Tschetter at Jan 6, 2011 at 12:14 am
    Daniel,

    Awesome, thank you. I will try that out.

    --Eric

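A minimal sketch of Daniel's suggestion, assuming the loader from the
original question: the spidering moves into relativeToAbsolutePath,
which runs on the frontend early enough for UDFContext to be shipped
with the job, and the result is stashed under the absolute location so
setLocation can reuse it on both sides instead of re-walking the
directories. The class name and the spider() helper are hypothetical;
the real path-picking logic would go inside spider().

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.pig.builtin.PigStorage;
    import org.apache.pig.impl.util.UDFContext;

    public class SpideringStorage extends PigStorage {

        // Called on the frontend while the plan is built, which is early
        // enough for values placed in UDFContext to travel with the job.
        @Override
        public String relativeToAbsolutePath(String location, Path curDir)
                throws IOException {
            String absolute = super.relativeToAbsolutePath(location, curDir);
            Properties props =
                UDFContext.getUDFContext().getUDFProperties(getClass());
            // Key the spidered list by the absolute location; setLocation is
            // later handed this same absolute string and can look it up.
            props.setProperty(absolute, spider(absolute));
            return absolute;
        }

        // Called on the frontend (getSplits) and again on the backend; reuse
        // the saved result if it is there instead of spidering again.
        @Override
        public void setLocation(String location, Job job) throws IOException {
            Properties props =
                UDFContext.getUDFContext().getUDFProperties(getClass());
            String spidered = props.getProperty(location);
            super.setLocation(spidered != null ? spidered : spider(location), job);
        }

        // Stand-in for the real path-picking logic: list the files directly
        // under the location and join them with commas for FileInputFormat.
        private String spider(String location) throws IOException {
            Path root = new Path(location);
            // relativeToAbsolutePath is not handed a Job, so a default
            // client-side Configuration is assumed here.
            FileSystem fs = root.getFileSystem(new Configuration());
            FileStatus[] children = fs.listStatus(root);
            if (children == null) {
                return location;
            }
            StringBuilder paths = new StringBuilder();
            for (FileStatus child : children) {
                if (child.isDir()) {
                    continue;                // the real logic would filter here
                }
                if (paths.length() > 0) {
                    paths.append(',');
                }
                paths.append(child.getPath().toString());
            }
            return paths.length() > 0 ? paths.toString() : location;
        }
    }

Keying by the absolute location string keeps the frontend and backend
lookups in sync without depending on the loader's UDFContext signature.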
