FAQ
I am taking this to dev, as there's still no clarity in my mind on how to approach it.

So the goal here is to optimize case-insensitive (CI) LIKE, and the issue is that MySQL doesn't support indexes on transformed columns (e.g. "UPPER(column)"), while it does support CI column collations, and hence - CI indexes and CI LIKE. Behavior of other DBs is unknown. Oracle supports UPPER(column) indexes, but beyond that we haven't researched it. Suggested solutions come down to either mapping collation behavior per DbAttribute in Cayenne, or trying to determine it in runtime via DB metadata.

My concern with former is that Cayenne model will have too much information about the underlying DB, not directly relevant to ORM. So if we are to go this way, we need to first demonstrate how this info may be used with other databases. E.g if we find that LIKE can be optimized for at least 2-3 other major DBs by taking collation case-sensitivity into account, then this solution may be worth pursuing.

The later approach may incur some overhead. Each LIKE query will have to determine whether the columns involved are CI or CS. One way to do it in MySQL is this:

select collation(c1) from t1 limit 1;
+-------------------+
collation(c1) |
+-------------------+
utf8_general_ci |
+-------------------+
1 row in set (0.02 sec)

I guess with metadata caching, the overhead may be reduced to a bearable minimum. Not sure if that's the easiest approach?

Also maybe there are other ways to solve this problem (maybe even outside Cayenne, just like that "UPPER(column)" index idea)?

Thoughts?

Andrus

On Jan 10, 2011, at 11:45 AM, Ari Maniatis (JIRA) wrote:
[ https://issues.apache.org/jira/browse/CAY-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979519#action_12979519 ]

Ari Maniatis commented on CAY-1210:
-----------------------------------

I'd like to solve this one since it also affects other database types and is a bit of a performance issue for us right now. What about this as a solution:

* Add new column to Cayenne modeler in the dbAttribute called "case sensitive". It is a checkbox which is ticked by default (to preserve existing behaviour).
* Change the SQL generator to not spit out UPPER for those fields
mysql does not use index for case insensitive searches
------------------------------------------------------

Key: CAY-1210
URL: https://issues.apache.org/jira/browse/CAY-1210
Project: Cayenne
Issue Type: Improvement
Components: Database integration
Reporter: Ari Maniatis
Assignee: Ari Maniatis
Fix For: 3.1M1


When performing a case insensitive search Cayenne spits out SQL which looks like this
SELECT .... WHERE upper(name) LIKE upper("fred")
This prevents any index being used for the search. Since mysql already performed case insensitive searches on text fields we need to suppress the 'upper' functions being used in these situations. All searches on these fields are already case insensitive.
http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Search Discussions

  • Andrus Adamchik at Jan 11, 2011 at 10:07 am
    As an aside we should close the hole on the opposite end - Cayenne CS LIKE. MySQL adapter should use BINARY keyword to ensure that CS matching is performed. I am personally using CS LIKE as a hack around CAY-1210, but that of course makes Cayenne code non-portable.

    Andrus
    On Jan 11, 2011, at 11:58 AM, Andrus Adamchik wrote:

    I am taking this to dev, as there's still no clarity in my mind on how to approach it.

    So the goal here is to optimize case-insensitive (CI) LIKE, and the issue is that MySQL doesn't support indexes on transformed columns (e.g. "UPPER(column)"), while it does support CI column collations, and hence - CI indexes and CI LIKE. Behavior of other DBs is unknown. Oracle supports UPPER(column) indexes, but beyond that we haven't researched it. Suggested solutions come down to either mapping collation behavior per DbAttribute in Cayenne, or trying to determine it in runtime via DB metadata.

    My concern with former is that Cayenne model will have too much information about the underlying DB, not directly relevant to ORM. So if we are to go this way, we need to first demonstrate how this info may be used with other databases. E.g if we find that LIKE can be optimized for at least 2-3 other major DBs by taking collation case-sensitivity into account, then this solution may be worth pursuing.

    The later approach may incur some overhead. Each LIKE query will have to determine whether the columns involved are CI or CS. One way to do it in MySQL is this:

    select collation(c1) from t1 limit 1;
    +-------------------+
    collation(c1) |
    +-------------------+
    utf8_general_ci |
    +-------------------+
    1 row in set (0.02 sec)

    I guess with metadata caching, the overhead may be reduced to a bearable minimum. Not sure if that's the easiest approach?

    Also maybe there are other ways to solve this problem (maybe even outside Cayenne, just like that "UPPER(column)" index idea)?

    Thoughts?

    Andrus

    On Jan 10, 2011, at 11:45 AM, Ari Maniatis (JIRA) wrote:
    [ https://issues.apache.org/jira/browse/CAY-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979519#action_12979519 ]

    Ari Maniatis commented on CAY-1210:
    -----------------------------------

    I'd like to solve this one since it also affects other database types and is a bit of a performance issue for us right now. What about this as a solution:

    * Add new column to Cayenne modeler in the dbAttribute called "case sensitive". It is a checkbox which is ticked by default (to preserve existing behaviour).
    * Change the SQL generator to not spit out UPPER for those fields
    mysql does not use index for case insensitive searches
    ------------------------------------------------------

    Key: CAY-1210
    URL: https://issues.apache.org/jira/browse/CAY-1210
    Project: Cayenne
    Issue Type: Improvement
    Components: Database integration
    Reporter: Ari Maniatis
    Assignee: Ari Maniatis
    Fix For: 3.1M1


    When performing a case insensitive search Cayenne spits out SQL which looks like this
    SELECT .... WHERE upper(name) LIKE upper("fred")
    This prevents any index being used for the search. Since mysql already performed case insensitive searches on text fields we need to suppress the 'upper' functions being used in these situations. All searches on these fields are already case insensitive.
    http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Aristedes Maniatis at Jan 11, 2011 at 10:29 am

    On 11/01/11 8:58 PM, Andrus Adamchik wrote:
    I am taking this to dev, as there's still no clarity in my mind on how to approach it.

    So the goal here is to optimize case-insensitive (CI) LIKE, and the issue is that MySQL doesn't support indexes on transformed columns (e.g. "UPPER(column)"), while it does support CI column collations, and hence - CI indexes and CI LIKE. Behavior of other DBs is unknown. Oracle supports UPPER(column) indexes, but beyond that we haven't researched it.
    I've looked at MS-SQL and Derby as well. Derby does not support functional indexes, such as UPPER(column). MS-SQL does.

    But even if the database does support that index, the problem is the the dba has to know to add these additional indexes. Otherwise they might just blame Cayenne for being slow compared to 'real' SQL.

    Suggested solutions come down to either mapping collation behavior per DbAttribute in Cayenne, or trying to determine it in runtime via DB metadata.

    My concern with former is that Cayenne model will have too much information about the underlying DB, not directly relevant to ORM. So if we are to go this way, we need to first demonstrate how this info may be used with other databases. E.g if we find that LIKE can be optimized for at least 2-3 other major DBs by taking collation case-sensitivity into account, then this solution may be worth pursuing.
    I am pretty sure this is useful for every DB out there. I've never come across one that will optimise UPPER() searches without the user adding another index especially to support that. If the collation is case-insensitive, then UPPER is simply making things slower without adding functionality.

    Postgresql has a column type: citext. That is case insensitive text, just like the mysql collation. UNIQUE constraints on that field work just the way you expect them to, and the same as mysql.

    So another approach is that instead of thinking of this as 'collation', we can think of this as a different field type in the modeler. VARCHAR and ciVARCHAR, etc. After all, 'case insensitive' is just as important a concept as the difference between BLOB and TEXT.

    The later approach may incur some overhead. Each LIKE query will have to determine whether the columns involved are CI or CS. One way to do it in MySQL is this:

    select collation(c1) from t1 limit 1;
    +-------------------+
    collation(c1) |
    +-------------------+
    utf8_general_ci |
    +-------------------+
    1 row in set (0.02 sec)

    I guess with metadata caching, the overhead may be reduced to a bearable minimum. Not sure if that's the easiest approach?
    My gut feeling is that this is a decision that the architecture designer makes when creating their model, not a by-product of deployment. That is, it should be explicitly set in the model and the appropriate database collation or field types set when forward-engineering the db.

    Having said that, a "force all columns as case sensitive" switch in the DataNode setup could give the sysadmin the choice to force UPPER back into the SQL at the risk of slowing things down.

    Ari


    Also maybe there are other ways to solve this problem (maybe even outside Cayenne, just like that "UPPER(column)" index idea)?

    Thoughts?

    Andrus

    On Jan 10, 2011, at 11:45 AM, Ari Maniatis (JIRA) wrote:
    [ https://issues.apache.org/jira/browse/CAY-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979519#action_12979519 ]

    Ari Maniatis commented on CAY-1210:
    -----------------------------------

    I'd like to solve this one since it also affects other database types and is a bit of a performance issue for us right now. What about this as a solution:

    * Add new column to Cayenne modeler in the dbAttribute called "case sensitive". It is a checkbox which is ticked by default (to preserve existing behaviour).
    * Change the SQL generator to not spit out UPPER for those fields
    mysql does not use index for case insensitive searches
    ------------------------------------------------------

    Key: CAY-1210
    URL: https://issues.apache.org/jira/browse/CAY-1210
    Project: Cayenne
    Issue Type: Improvement
    Components: Database integration
    Reporter: Ari Maniatis
    Assignee: Ari Maniatis
    Fix For: 3.1M1


    When performing a case insensitive search Cayenne spits out SQL which looks like this
    SELECT .... WHERE upper(name) LIKE upper("fred")
    This prevents any index being used for the search. Since mysql already performed case insensitive searches on text fields we need to suppress the 'upper' functions being used in these situations. All searches on these fields are already case insensitive.
    http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
    --
    -------------------------->
    Aristedes Maniatis
    GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
  • Andrew Lindesay at Jan 11, 2011 at 11:19 am
    Hello;

    PG also appears to support functional indexes;

    http://www.postgresql.org/docs/9.0/interactive/indexes-expressional.html

    I use functional indexes with Oracle for this purpose.
    ...the problem is the the dba has to know to add these additional
    indexes. ...
    we can think of this as a different field type in the modeler. VARCHAR
    and ciVARCHAR, etc. After all, 'case insensitive' is just as important a
    concept as the difference between BLOB and TEXT.
    I can see a bit of a problem arising with this ^^^ solution where
    somebody wants to be CS sometimes and CI at other times -- case
    sensitivity is often an option in searches. By using an explicit column
    type, the choice appears to be one way or other at the database level.

    I think the functional indexes do make sense and it seems reasonable
    that a DBA would apply them as part of tuning. Maybe it would be best
    to have a page in the manual on this explaining that for MySQL/... CI
    search it is best to change the collation on the schema objects and for
    Oracle/PG/... CI search it is best to implement functional indexes?

    cheers.

    --
    Andrew Lindesay
    www.silvereye.co.nz
  • Andrus Adamchik at Jan 11, 2011 at 12:05 pm
    Yeah, a DB-side solution would have been ideal here, and I am really reluctant about adding a new modeling abstraction. However I still don't see a clean solution for writing efficient cross-DB CI LIKE code in Cayenne (as MySQL won't support expressions in indexes).

    Maybe do a total hack to get us out of this limbo - how about a DI extension point for CI LIKE SQL generation strategy (or rather a System property activating this strategy - 'default collation case sensitivity')? This won't complicate the mapping, won't force us to generalize, and will allow the same mapping to be used with both kinds of schemas.

    Andrus

    On Jan 11, 2011, at 1:18 PM, Andrew Lindesay wrote:
    Hello;

    PG also appears to support functional indexes;

    http://www.postgresql.org/docs/9.0/interactive/indexes-expressional.html

    I use functional indexes with Oracle for this purpose.
    ...the problem is the the dba has to know to add these additional
    indexes. ...
    we can think of this as a different field type in the modeler. VARCHAR
    and ciVARCHAR, etc. After all, 'case insensitive' is just as important a
    concept as the difference between BLOB and TEXT.
    I can see a bit of a problem arising with this ^^^ solution where somebody wants to be CS sometimes and CI at other times -- case sensitivity is often an option in searches. By using an explicit column type, the choice appears to be one way or other at the database level.

    I think the functional indexes do make sense and it seems reasonable that a DBA would apply them as part of tuning. Maybe it would be best to have a page in the manual on this explaining that for MySQL/... CI search it is best to change the collation on the schema objects and for Oracle/PG/... CI search it is best to implement functional indexes?

    cheers.

    --
    Andrew Lindesay
    www.silvereye.co.nz
  • Aristedes Maniatis at Jan 12, 2011 at 5:01 am

    On 11/01/11 10:18 PM, Andrew Lindesay wrote:
    I think the functional indexes do make sense and it seems reasonable that a DBA would apply them as part of tuning. Maybe it would be best to have a page in the manual on this explaining that for MySQL/... CI search it is best to change the collation on the schema objects and for Oracle/PG/... CI search it is best to implement functional indexes?
    It is all a bit ugly. Postgresql docs recommend using LOWER() to achieve ci search, whereas Cayenne spits out UPPER(). Hibernate also produces LOWER(). So a db used by different systems is going to need quite a few indexes.

    Right now, I'm looking for a solution for Derby and mysql in particular, which both lack functional indexes. And a MS-SQL installation which is already in ci 'mode' anyway and so adding functional indexes is just a workaround for Cayenne executing inappropriate SQL for this environment.

    Another small gotcha for developers is that Expression.filterObjects may be inconsistent with the queries against the database, when the expectation is that it should be the same.

    A ci database will always be ci, unless we add the BINARY keyword (I know that works for mysql, not sure about everything else). [1] However that also bypasses the use of the index, so we don't want to add it when it isn't needed. It is easy to be correct. But correct and fast is much harder.

    On 11/01/11 11:04 PM, Andrus Adamchik wrote:
    Maybe do a total hack to get us out of this limbo - how about a DI extension point for CI LIKE SQL generation strategy (or rather a System property activating this strategy - 'default collation case sensitivity')? This won't complicate the mapping, won't force us to generalize, and will allow the same mapping to be used with both kinds of schemas.
    Well, a database-wide property solves my particular use-case. Seems clunky (since it is database-wide and not specific to columns), but if you are convinced that case-sensitivity is not an attribute of the model, then this is the only way. I still can't help but think this is a property of an attribute in the model though. Other than some JDBC specification, why is the choice between BLOB, CLOB and TEXT a modeller property, but the choice between ci-TEXT and cs-TEXT isn't?

    "BLOB values are treated as binary strings (byte strings). They have no character set, and sorting and comparison are based on the numeric values of the bytes in column values. TEXT values are treated as nonbinary strings (character strings). They have a character set, and values are sorted and compared based on the collation of the character set." Effectively BLOBs are just TEXT with collation = none. I've not tried to see what happens if you try to perform a case insensitive search on BLOB from Cayenne. Will it try to output UPPER?



    Out of interest, my googling discovered that Atlassian developers are running into this problem with Hibernate. But they didn't have a nice solution. http://jira.atlassian.com/browse/CONF-10030

    Regards

    Ari



    [1] http://dev.mysql.com/doc/refman/5.0/en/charset-binary-op.html

    --
    -------------------------->
    Aristedes Maniatis
    GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
  • Andrus Adamchik at Jan 12, 2011 at 12:55 pm

    On Jan 12, 2011, at 7:01 AM, Aristedes Maniatis wrote:
    On 11/01/11 10:18 PM, Andrew Lindesay wrote:
    I think the functional indexes do make sense and it seems reasonable that a DBA would apply them as part of tuning. Maybe it would be best to have a page in the manual on this explaining that for MySQL/... CI search it is best to change the collation on the schema objects and for Oracle/PG/... CI search it is best to implement functional indexes?
    It is all a bit ugly. Postgresql docs recommend using LOWER() to achieve ci search, whereas Cayenne spits out UPPER(). Hibernate also produces LOWER(). So a db used by different systems is going to need quite a few indexes.
    Not ideal, but DB indexes always depend on the data access patterns by the clients, not the other way around.
    On 11/01/11 11:04 PM, Andrus Adamchik wrote:
    Maybe do a total hack to get us out of this limbo - how about a DI extension point for CI LIKE SQL generation strategy (or rather a System property activating this strategy - 'default collation case sensitivity')? This won't complicate the mapping, won't force us to generalize, and will allow the same mapping to be used with both kinds of schemas.
    Well, a database-wide property solves my particular use-case. Seems clunky (since it is database-wide and not specific to columns), but if you are convinced that case-sensitivity is not an attribute of the model, then this is the only way.
    From our discussion, it can be a property of the model (either DB-wide or per-column), or a property of a specific DB deployment. Also I am not completely opposed to mapping case sensitivity settings per DbAttribute, but like I said, I am very conservative about adding new mapping abstractions, so I'd rather we try it in a less invasive way first (that would allow us to test a new CI LIKE strategy) and then decide if it is worthwhile extending DbAttribute.

    Anyways, I'll add the description of this interim solution to Jira.
    Other than some JDBC specification, why is the choice between BLOB, CLOB and TEXT a modeller property, but the choice between ci-TEXT and cs-TEXT isn't?
    "BLOB values are treated as binary strings (byte strings). They have no character set, and sorting and comparison are based on the numeric values of the bytes in column values. TEXT values are treated as nonbinary strings (character strings). They have a character set, and values are sorted and compared based on the collation of the character set." Effectively BLOBs are just TEXT with collation = none.
    There's no TEXT type in JDBC (and Cayenne). The only similarity between BLOB and CLOB is that they are sequences of elements. Beyond that they are different data type (of course you can represent anything as a BLOB, but that would be highly impractical). So not sure where this analogy goes (and like I said above I am not completely against an extra CS property for *char* types).
    Out of interest, my googling discovered that Atlassian developers are running into this problem with Hibernate. But they didn't have a nice solution. http://jira.atlassian.com/browse/CONF-10030
    Yeah, that shows that we are not inventing the problem at least.

    Andrus

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupdev @
categoriescayenne
postedJan 11, '11 at 9:58a
activeJan 12, '11 at 12:55p
posts7
users3
websitecayenne.apache.org

People

Translate

site design / logo © 2022 Grokbase