CombineHiveInputFormat combining across tables
I'm running into errors where CombineHiveInputFormat is combining data from
two different tables, which causes problems because the tables have
different input formats.

It looks like the problem is in
org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim. It calls
CombineFileInputFormat.getInputPaths, which returns the list of input paths,
and then chops off the first 5 characters to strip "file:" from the
beginning, but the values I'm getting back from getInputPaths are actually of
the form hdfs://domain/path. So when it creates the pools using these chopped
paths, none of the input paths match the pools (since the pool paths are just
the file path without the protocol or domain).
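
To illustrate (a standalone snippet, not the actual shim code; the warehouse
path is made up):

import org.apache.hadoop.fs.Path;

public class PathChopDemo {
  public static void main(String[] args) {
    String fromGetInputPaths = "hdfs://domain/user/hive/warehouse/t1";

    // What the shim does today: assume a leading "file:" and drop 5 characters.
    System.out.println(fromGetInputPaths.substring(5));
    // prints "//domain/user/hive/warehouse/t1" -- not a usable pool path

    // Letting Path/URI strip the scheme and authority instead:
    System.out.println(new Path(fromGetInputPaths).toUri().getPath());
    // prints "/user/hive/warehouse/t1"
  }
}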

Any suggestions?

Thanks!


  • Zheng Shao at Dec 21, 2009 at 7:45 am
    Sorry about the delay.

    Are you using Hive trunk?

    Filed https://issues.apache.org/jira/browse/HIVE-1001
    We should use (new Path(str)).toUri().getPath() instead of chopping off
    the first 5 characters.
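
    Roughly, the direction is something like the sketch below -- just a sketch,
    assuming the shim only needs to hand back the Path[] as-is and strip the
    scheme where the plain path is actually wanted; the helper names are made
    up, not the actual patch:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;

    public class InputPathShimSketch {
      // Return the configured input paths untouched instead of assuming a
      // leading "file:" and dropping 5 characters.
      public static Path[] getInputPaths(JobConf job) {
        return CombineFileInputFormat.getInputPaths(job);
      }

      // Strip scheme and authority (file:, hdfs://host:port, ...) in one place
      // for callers that really want only the path component.
      public static Path stripSchemeAndAuthority(Path p) {
        return new Path(p.toUri().getPath());
      }
    }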

    Zheng
  • Xiaohexiaohe at Dec 21, 2009 at 7:50 am
    Sorry! We are not using Hive for now.
  • David Lerman at Dec 22, 2009 at 2:59 am
    Thanks Zheng. We're using trunk, r888452.

    We actually ended up making three changes to CombineHiveInputFormat.java to
    get it working in our environment. If these aren't known issues, let me
    know and I can file bugs and patches in Jira.

    1. The issue described above. Along the lines you mentioned, we fixed it
    by changing:

    combine.createPool(job, new CombineFilter(paths[i]));

    to:

    combine.createPool(job,
        new CombineFilter(new Path(paths[i].toUri().getPath())));

    and then getting rid of the code that strips the "file:" in
    Hadoop20Shims.getInputPathsShim and having it just call
    CombineFileInputFormat.getInputPaths(job);

    2. When HiveInputFormat.getPartitionDescFromPath was called from
    CombineHiveInputFormat, it sometimes failed to return a matching
    partitionDesc, which then caused an exception down the line since the split
    didn't have an inputFormatClassName. The issue was that the path format
    used as the key in pathToPartitionInfo varies between stages: in the first
    stage it was the complete path as returned from the table definitions (e.g.
    hdfs://server/path), and in subsequent stages it was the complete path with
    port (e.g. hdfs://server:8020/path) of the result of the previous stage.
    This isn't a problem in HiveInputFormat, since the directory you're looking
    up always uses the same format as the keys; but in CombineHiveInputFormat,
    you take that path, look up its children in the file system to get all the
    block information, and then use one of the returned paths to get the
    partition info -- and that returned path does not include the port. So, in
    any stage after the first, we were looking for a path without the port, but
    all the keys in the map contained a port, so we didn't find anything.

    Since I didn't fully understand the logic for when the port was included in
    the path and when it wasn't, my hack fix was just to give
    CombineHiveInputFormat its own implementation of getPartitionDescFromPath,
    which walks through pathToPartitionInfo and compares using just the path
    component:

    protected static partitionDesc getPartitionDescFromPath(
        Map<String, partitionDesc> pathToPartitionInfo, Path dir)
        throws IOException {
      for (Map.Entry<String, partitionDesc> entry :
          pathToPartitionInfo.entrySet()) {
        try {
          // Compare only the path component, ignoring scheme, host and port.
          if (new URI(entry.getKey()).getPath().equals(dir.toUri().getPath())) {
            return entry.getValue();
          }
        } catch (URISyntaxException e2) {
          // Skip keys that are not valid URIs.
        }
      }
      throw new IOException("cannot find dir = " + dir.toString()
          + " in pathToPartitionInfo!");
    }
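
    As a concrete illustration of the mismatch (a standalone snippet, not Hive
    code; the paths are made up):

    import java.net.URI;

    public class KeyMismatchDemo {
      public static void main(String[] args) throws Exception {
        // Key as stored in pathToPartitionInfo (carries a port) vs. the path
        // we look up later (no port).
        String mapKey = "hdfs://server:8020/user/hive/warehouse/t1";
        String lookup = "hdfs://server/user/hive/warehouse/t1";

        System.out.println(mapKey.equals(lookup));          // false

        // Comparing only the path component ignores scheme, host and port.
        System.out.println(new URI(mapKey).getPath()
            .equals(new URI(lookup).getPath()));            // true
      }
    }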

    3. In a multi-stage query, when one stage returned no data (resulting in a
    bunch of output files with size 0), the next stage would hang in Hadoop
    because it would have 0 mappers in the job definition. The issue was that
    CombineHiveInputFormat would look for blocks, find none, and return 0
    splits, which would hang Hadoop. There may be a good way to just skip that
    job altogether, but as a quick hack to get it working, when there were no
    splits I'd just create a single empty one so that the job wouldn't hang. At
    the end of getSplits, I added:

    if (result.size() == 0) {
      // No blocks found: add a single zero-length split so the job gets one
      // (empty) mapper instead of hanging with zero mappers.
      Path firstChild =
          paths[0].getFileSystem(job).listStatus(paths[0])[0].getPath();

      CombineFileSplit emptySplit = new CombineFileSplit(
          job, new Path[] { firstChild }, new long[] { 0L }, new long[] { 0L },
          new String[0]);
      FixedCombineHiveInputSplit emptySplitWrapper =
          new FixedCombineHiveInputSplit(job,
              new Hadoop20Shims.InputSplitShim(emptySplit));

      result.add(emptySplitWrapper);
    }

    With those three changes, it's working beautifully -- some of our queries
    which previously had thousands of mappers loading very small data files now
    have a hundred or so and are running about 10x faster. Many thanks for the
    new functionality!
  • Namit Jain at Dec 22, 2009 at 5:37 am
    Thanks David,
    It would be very useful if you could file JIRAs and patches for these.


    Thanks,
    -namit


  • David Lerman at Dec 22, 2009 at 3:24 pm
    Thanks Namit. Filed as HIVE-1006 and HIVE-1007.


Discussion Overview
Group: hive-user @ hadoop.apache.org
Categories: hive, hadoop
Posted: Dec 15, '09 at 12:44a
Active: Dec 22, '09 at 3:24p
Posts: 6
Users: 4
Website: hive.apache.org
