Sqoop should only use a single map task
---------------------------------------

Key: HADOOP-5967
URL: https://issues.apache.org/jira/browse/HADOOP-5967
Project: Hadoop Core
Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Priority: Minor
Attachments: single-mapper.patch

The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table, so several queries access the same table at the same time. Most database implementations use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. As a result, with a large number of mappers the total work grows as O(n^2) in the number of rows in the table, whereas a single mapper would read through the table in O(n) time.

This patch sets the number of map tasks to 1 in the MapReduce job that Sqoop launches.
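
The scaling argument above can be illustrated with a rough cost model. This is only a sketch: it assumes the database serves every LIMIT ... OFFSET query by scanning from row 1, as the description says most implementations do, and the function name `rows_scanned` is hypothetical and not part of Hadoop or Sqoop.

```python
def rows_scanned(total_rows: int, num_mappers: int) -> int:
    """Total rows the database touches across all splits.

    Assumption: each split's query scans from row 1, skipping OFFSET rows
    before emitting LIMIT rows to the client.
    """
    split_size = -(-total_rows // num_mappers)  # ceiling division
    scanned = 0
    for i in range(num_mappers):
        offset = i * split_size
        # The last split may be shorter than the others.
        limit = min(split_size, max(total_rows - offset, 0))
        scanned += offset + limit
    return scanned


if __name__ == "__main__":
    for mappers in (1, 10, 1000):
        print(mappers, rows_scanned(1000, mappers))
```

Under this model the total is roughly n * (k + 1) / 2 for n rows and k mappers, so the cost is linear with one mapper and approaches quadratic as the mapper count grows toward the row count.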

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Aaron Kimball (JIRA) at Jun 4, 2009 at 1:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aaron Kimball updated HADOOP-5967:
    ----------------------------------

    Attachment: single-mapper.patch

    This patch implements this as a one-liner. No new tests because it's trivial. I've verified that it passes existing unit tests, and also that it does indeed use a single mapper on a cluster.
  • Aaron Kimball (JIRA) at Jun 4, 2009 at 1:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aaron Kimball updated HADOOP-5967:
    ----------------------------------

    Status: Patch Available (was: Open)
  • Scott Carey (JIRA) at Jun 4, 2009 at 2:24 am
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716131#action_12716131 ]

    Scott Carey commented on HADOOP-5967:
    -------------------------------------

    Some databases optimize multiple queries doing sequential scans on the same table at the same time by having them 'tag along' with the same sequential scan (Postgres, at least) which avoids the O( N^2 ) issue. But LIMIT ... OFFSET is not guaranteed to return distinct, consistent partitions unless it has an ORDER BY clause and is in the same transaction anyway.
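
As a rough illustration of the ORDER BY point, the sketch below reads a table in LIMIT ... OFFSET splits with an explicit ordering. It uses an in-memory SQLite database and a single connection purely for demonstration. The table `employees`, the column `id`, and the helper `read_splits` are hypothetical, and this is not how DBInputFormat itself issues its queries. It does not reproduce the cross-transaction consistency concern, which arises only when splits are read over separate connections while the table changes.

```python
import sqlite3


def read_splits(conn, table, key, num_splits):
    """Read `key` values from `table` in ordered LIMIT/OFFSET splits."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    split_size = -(-total // num_splits)  # ceiling division
    splits = []
    for i in range(num_splits):
        # ORDER BY makes the row order, and so the split boundaries, well defined.
        rows = conn.execute(
            f"SELECT {key} FROM {table} ORDER BY {key} LIMIT ? OFFSET ?",
            (split_size, i * split_size),
        ).fetchall()
        splits.append([row[0] for row in rows])
    return splits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO employees (id) VALUES (?)",
                 [(i,) for i in range(1, 11)])
splits = read_splits(conn, "employees", "id", 3)
print(splits)
```

With the ordering in place and no concurrent writes, the splits are disjoint and together cover every row exactly once.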
  • Aaron Kimball (JIRA) at Jun 4, 2009 at 7:48 am
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716186#action_12716186 ]

    Aaron Kimball commented on HADOOP-5967:
    ---------------------------------------

    An ORDER BY clause is included in DBInputFormat's SQL statements that it sends over JDBC. But each mapper (necessarily) runs in a separate transaction, as it's on a separate node.
  • Hadoop QA (JIRA) at Jun 6, 2009 at 10:50 am
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716864#action_12716864 ]

    Hadoop QA commented on HADOOP-5967:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12409836/single-mapper.patch
    against trunk revision 782083.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/console

    This message is automatically generated.
  • Aaron Kimball (JIRA) at Jun 8, 2009 at 5:16 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717330#action_12717330 ]

    Aaron Kimball commented on HADOOP-5967:
    ---------------------------------------

    Hudson's test failures are unrelated...

  • Tom White (JIRA) at Jun 23, 2009 at 7:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-5967:
    ------------------------------

    Resolution: Fixed
    Fix Version/s: 0.21.0
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    +1

    I've just committed this. Thanks Aaron!
