I have a custom LoadFunc (I'm actually just extending PigStorage) that
has some added logic to spider a given path and pick out the paths
that I want. I am currently doing the spidering in setLocation
because that seemed like the place to do it. It appears as if this is
getting called on both the client and the cluster side, though, so my
mappers are spidering a path that was already spidered on the client
(wasted effort). When the spidering covers a lot of directories,
this adds a significant amount of unneeded overhead to my jobs.

I looked into using UDFContext to save the paths on the client, so that
the cluster-side processes could look them up in UDFContext and just
use them if they exist. However, it looks like the actual job
configuration object is being created before the calls to
setLocation(), so the values that I set in UDFContext are not making it
across the wire.
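
A minimal sketch of the caching idea described above, written against a
plain java.util.Properties so that it runs on its own. In a loader, the
Properties object would presumably be the one returned by
UDFContext.getUDFContext().getUDFProperties(getClass(), new String[] { signature }),
with the signature saved from setUDFContextSignature(). The class name,
the key name, and the crawl() stub are hypothetical placeholders, and
the sketch does not address whether the properties actually reach the
cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class SpiderCache {
    // Hypothetical property key; any name unique to the loader would do.
    static final String KEY = "spidered.paths";

    // Counts crawls so the caching behaviour can be observed.
    static int crawlCount = 0;

    // Stand-in for the expensive directory walk (assumption: the real
    // loader would list the file system under root here).
    static List<String> crawl(String root) {
        crawlCount++;
        return Arrays.asList(root + "/a", root + "/b");
    }

    // Returns the cached paths if present; otherwise crawls once and
    // stores the result as a comma-separated string.
    // Assumes that no path contains a comma.
    static List<String> resolve(Properties props, String root) {
        String cached = props.getProperty(KEY);
        if (cached != null) {
            return cached.isEmpty()
                    ? new ArrayList<String>()
                    : Arrays.asList(cached.split(","));
        }
        List<String> paths = crawl(root);
        props.setProperty(KEY, String.join(",", paths));
        return paths;
    }
}
```

Calling resolve() twice with the same Properties object crawls only
once. The second call is what the cluster-side setLocation() would do
if the properties arrived intact.
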

Is there a method that is called before setLocation() that I can use to
set the value in UDFContext? (I'd prefer something that is given a
Context/Configuration object.) Or is my only option to build a
Configuration object in the constructor, do the crawl, and set the
UDFContext there? Will that even work?