Add endRow parameter to HClient#obtainScanner
---------------------------------------------

Key: HADOOP-1439
URL: https://issues.apache.org/jira/browse/HADOOP-1439
Project: Hadoop
Issue Type: Improvement
Components: contrib/hbase
Reporter: stack
Assignee: stack
Priority: Minor


Currently the HClient#obtainScanner looks like this:

{code}
public synchronized HScannerInterface obtainScanner(Text[] columns, Text startRow) throws IOException;
{code}

Add an overload that allows specification of endRow:

{code}
public synchronized HScannerInterface obtainScanner(Text[] columns, Text startRow, Text endRow) throws IOException;
{code}

Use case: the table contains the whole web and a client wants to scan only Google's pages. Currently the client can cut off the scanner itself as soon as the row key leaves the Google domain, but it would be cleaner if {{HScannerInterface#next()}} returned false at that point.
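
The difference between the two approaches can be sketched roughly as follows. This is not HBase code: a sorted map stands in for the table, the class and method names are hypothetical, the reversed-domain row keys are an assumption about the key layout, and treating endRow as exclusive is also an assumption, since the issue does not say.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for a table: row keys are kept in sorted order.
final class EndRowScanSketch {

    // Client-side cutoff: scan from startRow and break once the key leaves the prefix.
    static List<String> scanWithClientCutoff(SortedMap<String, String> table,
                                             String startRow, String prefix) {
        List<String> rows = new ArrayList<>();
        for (String key : table.tailMap(startRow).keySet()) {
            if (!key.startsWith(prefix)) {
                break; // the caller has to decide when the scan is over
            }
            rows.add(key);
        }
        return rows;
    }

    // Scanner-side bound: iteration simply ends before endRow (exclusive, by assumption).
    static List<String> scanWithEndRow(SortedMap<String, String> table,
                                       String startRow, String endRow) {
        return new ArrayList<>(table.subMap(startRow, endRow).keySet());
    }

    // Sample data with assumed reversed-domain row keys.
    static SortedMap<String, String> sampleTable() {
        SortedMap<String, String> t = new TreeMap<>();
        t.put("com.example/index", "page");
        t.put("com.google/", "page");
        t.put("com.google/maps", "page");
        t.put("com.google/search", "page");
        t.put("com.yahoo/", "page");
        return t;
    }
}
```

With these sample keys, a start row of "com.google/" and an end row of "com.google0" (the character after '/') should select the same rows as the prefix cutoff, without the caller inspecting each key.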





--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • James Kennedy (JIRA) at Jun 27, 2007 at 9:22 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508658 ]

    James Kennedy commented on HADOOP-1439:
    ---------------------------------------

    Michael suggested that, once it is finished, HADOOP-1531 (RowFilters) may be used to achieve the above functionality.

    As the RowFilter impl is right now, using a regexp on each key encountered may be an expensive way to do it.

    In the above example, even if the endRow functionality works, how do you know where the end row is? How do you know when you leave the google domain?

    It seems to me that there may be several restrictions a user may want to apply to row-keys:
    1) Specify a range. Use start/end keys assuming you know what they are.
    2) Specify a range, use a start key and a "page size". This is useful for retrieving data in pages, e.g. displaying to UI as user clicks next/last page.
    3) Specify a criterion, e.g. regular expressions or more basic string comparison.

    Fortunately my RowFilterInterface design can be used to generalize the above. In the Google example, I could create a custom RowFilter implementation that can do domain name comparison more efficiently than general regular expression matching. Pass that via the client as you would any other RowFilter impl. Only thing to make sure of is that the custom impl is in the classpath of the HRegionServer too.
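
A rough sketch of the custom-filter idea, under stated assumptions: `RowKeyFilter` is a hypothetical, simplified stand-in for RowFilterInterface (whose real signature is not shown in this thread), `filter()` returning true is assumed to mean "leave the row out", and the reversed-domain row keys are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DomainFilterSketch {

    // Hypothetical, simplified filter contract: true means "filter this row out".
    interface RowKeyFilter {
        boolean filter(String rowKey);
    }

    // General-purpose filter: runs a regular expression against every row key.
    static final class RegexRowFilter implements RowKeyFilter {
        private final Pattern pattern;

        RegexRowFilter(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        public boolean filter(String rowKey) {
            return !pattern.matcher(rowKey).matches();
        }
    }

    // Custom filter: a plain prefix comparison instead of regex matching.
    static final class DomainRowFilter implements RowKeyFilter {
        private final String domainPrefix;

        DomainRowFilter(String domainPrefix) {
            this.domainPrefix = domainPrefix;
        }

        public boolean filter(String rowKey) {
            return !rowKey.startsWith(domainPrefix);
        }
    }

    // Applies a filter to a list of row keys and returns the rows that pass.
    static List<String> keep(List<String> rowKeys, RowKeyFilter f) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            if (!f.filter(key)) {
                kept.add(key);
            }
        }
        return kept;
    }
}
```

Both filters should select the same rows for a pattern such as "com\\.google/.*" and the prefix "com.google/"; the prefix version just avoids running the regex engine on each key.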

    For start/end range, you could have a custom RowFilter that checks for an exact match on the end key. But this won't be as efficient as an explicit endRow parameter because:
    A) when RowFilter is not null, HRegion#HScanner is always going to have a little more overhead even if the filter() implementation itself always just returns false.
    B) The filter isn't currently designed to stop the scanner when a certain criterion is met. When it encounters the endRow, it will just loop through the rest of the rows, filtering them all out, until it reaches the end of the HRegion.
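
Point B can be illustrated with a small sketch. Everything here is hypothetical: the `RowKeyFilter` contract, the `shouldStop()` trigger, and the scan loop are simplified stand-ins, not the actual RowFilterInterface or HScanner code. The end-row check uses an ordered comparison rather than an exact match, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class StopConditionSketch {

    // Hypothetical, simplified filter contract.
    interface RowKeyFilter {
        boolean filter(String rowKey); // true = leave this row out of the results
        boolean shouldStop();          // hypothetical stop trigger: true = end the scan
    }

    // Filters out every row at or past endRow; optionally asks the scan to stop.
    static final class EndRowFilter implements RowKeyFilter {
        private final String endRow;
        private final boolean useStopTrigger;
        private boolean pastEnd = false;

        EndRowFilter(String endRow, boolean useStopTrigger) {
            this.endRow = endRow;
            this.useStopTrigger = useStopTrigger;
        }

        public boolean filter(String rowKey) {
            if (rowKey.compareTo(endRow) >= 0) {
                pastEnd = true;
            }
            return pastEnd;
        }

        public boolean shouldStop() {
            return useStopTrigger && pastEnd;
        }
    }

    // Outcome of a scan: the rows returned and how many rows were examined.
    static final class ScanResult {
        final List<String> rows = new ArrayList<>();
        int examined = 0;
    }

    // Scans the sorted rows of one region, consulting the filter for each row.
    static ScanResult scan(List<String> sortedRegionRows, RowKeyFilter f) {
        ScanResult r = new ScanResult();
        for (String key : sortedRegionRows) {
            if (f.shouldStop()) {
                break;
            }
            r.examined++;
            if (!f.filter(key)) {
                r.rows.add(key);
            }
        }
        return r;
    }
}
```

Both variants return the same rows. Without the trigger, the loop examines every remaining row in the region; with it, the loop ends on the iteration after the end row is seen.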

    I think start/page range has the same issues. Only difference is that it requires scan-lifetime state to count number of (unfiltered?) rows encountered. Still requires stop condition trigger.
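
A minimal sketch of the start/page case, assuming a hypothetical `PageFilter` that keeps a per-scan counter and a stop trigger. The names and the sorted-set stand-in for the table are illustrative and not part of any patch. The counter here counts rows actually returned, which is one possible answer to the "(unfiltered?)" question above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class PageScanSketch {

    // Hypothetical page filter: holds scan-lifetime state and signals stop once the page is full.
    static final class PageFilter {
        private final int pageSize;
        private int rowsAccepted = 0;

        PageFilter(int pageSize) {
            this.pageSize = pageSize;
        }

        void rowAccepted() {
            rowsAccepted++;
        }

        boolean shouldStop() {
            return rowsAccepted >= pageSize;
        }
    }

    // Returns up to pageSize row keys starting at startRow (inclusive or exclusive).
    static List<String> scanPage(NavigableSet<String> table, String startRow,
                                 boolean inclusive, int pageSize) {
        PageFilter filter = new PageFilter(pageSize); // fresh state for each scan
        List<String> page = new ArrayList<>();
        for (String key : table.tailSet(startRow, inclusive)) {
            if (filter.shouldStop()) {
                break;
            }
            page.add(key);
            filter.rowAccepted();
        }
        return page;
    }
}
```

To fetch the next page, a caller would pass the last key of the previous page as the start row with `inclusive` set to false.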

    If I add that stop condition trigger functionality to the RowFilterInterface and update HScanner to use it, we could have a number of built-in RowFilter implementations that deal with restrictions like those above.

    WRT simple restrictions like start/end/page parameters there will still be a, perhaps small, trade-off between performance and generality depending on if we implement them independently or via RowFilterInterface.










  • Jim Kellerman (JIRA) at Jun 28, 2007 at 12:41 am
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508698 ]

    Jim Kellerman commented on HADOOP-1439:
    ---------------------------------------

    It seems to me that if I specify a row key filter, then once the filter finds a match, next() keeps returning values so long as the row filter matches. Once it stops matching, the filter should close out the scanner since there will be no additional rows that match that filter.

    In this particular case, I am talking about row key filters based on >, =, < and not regexp filters, because a regexp can potentially match any row.
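
The distinction can be sketched as follows. This is an illustration with hypothetical names, not scanner code: a scan that ends at the first non-match after matching has begun is safe for a contiguous key range, but drops rows for a regexp whose matches are scattered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class WhileMatchSketch {

    // Skips to the first match, returns rows while they match, then ends the scan.
    static List<String> scanWhileMatching(List<String> sortedRows, Predicate<String> matches) {
        List<String> out = new ArrayList<>();
        boolean started = false;
        for (String key : sortedRows) {
            if (matches.test(key)) {
                started = true;
                out.add(key);
            } else if (started) {
                break; // first miss after a match: close out the scan
            }
        }
        return out;
    }

    // Examines every row, as a scan without any stop condition would.
    static List<String> scanAll(List<String> sortedRows, Predicate<String> matches) {
        List<String> out = new ArrayList<>();
        for (String key : sortedRows) {
            if (matches.test(key)) {
                out.add(key);
            }
        }
        return out;
    }

    // Comparison-style filter: startRow <= key < endRow, contiguous in sorted order.
    static Predicate<String> range(final String startRow, final String endRow) {
        return key -> key.compareTo(startRow) >= 0 && key.compareTo(endRow) < 0;
    }

    // Regexp filter: may match any row, so matches need not be contiguous.
    static Predicate<String> regex(String expression) {
        final Pattern p = Pattern.compile(expression);
        return key -> p.matcher(key).matches();
    }
}
```

For a range predicate the two scans agree, so stopping early loses nothing. For a regexp they can differ, which is why the early close only suits the comparison filters.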

  • James Kennedy (JIRA) at Jun 28, 2007 at 1:11 am
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508701 ]

    James Kennedy commented on HADOOP-1439:
    ---------------------------------------

    Right, so in the case of >, =, < type RowFilters you're quite right. More generally, a RowFilter, whether or not it implements those comparisons, may need to signal the scanner to stop altogether for whatever reason, even when the target rows are not located in a single consecutive chunk as with >, =, <; for example, when it has reached a maximum number of nonconsecutive matched rows.

    I'll implement this mechanism, clean up, and re-post the HADOOP-1531 patch when I get a chance.

    That will make RowFilter more conducive to the EndRow filtering needed for this task. But as I said there will still be a little overhead vs. implementing an explicit endRow param to the scanner.
  • stack (JIRA) at Jun 28, 2007 at 4:15 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508864 ]

    stack commented on HADOOP-1439:
    -------------------------------

    This comment applies to this issue and to HADOOP-1531.

    After the exposition above, I'm now of the opinion that the endRow parameter will be little used. Better for now to have a set of filters available for the client to choose from. If 'performance' becomes an issue, we can backfill the endRow parameter later.

    We can divide the work if you'd like. I need the endRow functionality *tout de suite*. If you add the 'stop condition trigger' to the interface I can work on a couple of filter implementations and their tests.

  • James Kennedy (JIRA) at Jun 28, 2007 at 8:38 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508928 ]

    James Kennedy commented on HADOOP-1439:
    ---------------------------------------

    Deal. I don't think I'll get to it today but certainly before Monday, tomorrow likely.
  • James Kennedy (JIRA) at Jun 28, 2007 at 8:41 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508929 ]

    James Kennedy commented on HADOOP-1439:
    ---------------------------------------

    Oh, one thing I forgot to add in the limitations above:

    Column criteria can only apply to columns included in the results. You cannot retrieve COL1, COL2 where COL3 = 'XYZ'.
    This is because the filtering happens at the HScanner level: the lower-level scanner for COL3, for example, is not employed, so all of COL3's values appear as null.
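
The limitation can be illustrated with a small sketch. The classes and maps below are hypothetical stand-ins, not HScanner internals: the filter only sees the columns the scan requested, so a criterion on an unrequested column compares against null.

```java
import java.util.HashMap;
import java.util.Map;

final class ColumnCriteriaSketch {

    // Keeps only the requested columns, as a scan limited to those columns would.
    static Map<String, String> project(Map<String, String> storedRow, String... columns) {
        Map<String, String> visible = new HashMap<>();
        for (String column : columns) {
            if (storedRow.containsKey(column)) {
                visible.put(column, storedRow.get(column));
            }
        }
        return visible;
    }

    // Evaluates a column criterion against what the filter can actually see.
    static boolean matches(Map<String, String> visibleRow, String column, String expected) {
        return expected.equals(visibleRow.get(column)); // a missing column reads as null
    }

    // Sample stored row with three columns.
    static Map<String, String> sampleRow() {
        Map<String, String> row = new HashMap<>();
        row.put("COL1", "a");
        row.put("COL2", "b");
        row.put("COL3", "XYZ");
        return row;
    }
}
```

In this sketch the criterion COL3 = 'XYZ' only succeeds when COL3 is among the requested columns, even though the stored row does contain that value.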

  • stack (JIRA) at Jul 5, 2007 at 8:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack resolved HADOOP-1439.
    ---------------------------

    Resolution: Won't Fix

    Resolving as "Won't Fix". Will add endRow using the new mechanism just added by 'HADOOP-1531 Add RowFilter to HRegion.HScanner'.

Discussion Overview
group: common-dev
category: hadoop
posted: May 29, '07 at 5:12p
active: Jul 5, '07 at 8:17p
posts: 8
users: 1
website: hadoop.apache.org...
irc: #hadoop
