FAQ
Hi

While reviewing the (now applied) XPATH escaping patches, Radoslaw found one
case where the previous failure of XPATH to escape its return value was offset
by XMLATTRIBUTES insistence to escape all input values, even if they're
already of type XML.

To wit, if you do

SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&'::XML AS "a"))

you get

xmlelement
--------------------
<t a="&amp;amp;"/>

which is somewhat surprising. Especially since

SELECT XMLELEMENT(NAME "t", '&amp;'::XML);

gives

xmlelement
--------------
<t>&amp;</t>

as one would except. Now, it seems rather unlikely that a real application
would actually contain a query similar to the former one - you usually don 't
store XML in the attributes of XML nodes. But since XPATH() returns XML, it's
not unlikely that an application would do

SELECT XMLELEMENT(NAME ..., XMLATTRIBUTES((XPATH(...))[1] AS ...))

As it stands, on 9.2 the values returned by the XPath expression will be
escaped twice - once by XPATH and a second time by XMLATTRIBUTES, while on 9.1
the value will be escaped only once.

Since this seems to be the most likely situation in which XMLATTRIBUTES would
receive an input argument of type XML, and since it's not entirely logical for
XMLATTRIBUTES to unconditionally escapes values already of type XML, I propose
to change the behaviour of XMLATTRIBUTES as follows.

Values not of type XML are be treated as they always have been, i.e. all
special characters (<,>,&,",') are replaced by an entity reference (&lt;,
&gt;, &amp;, &quot;, &apos;).

Values of type XML are assumed to be already escaped, so '&' isn't treated as
a special character to avoid double escaping. The other special characters
(<,>,",') are escaped as usual.

The safety of this relies on the fact that '&' may never appear in a
well-formed XML document, except as part of an entity reference. This seems to
be the case, except if CDATA sections are involved, which may contain plain
ampersands. So the actual escaping rule would have to be a bit more complex
than I made it sound above - we'd need to detect CDATA sections, and
re-enabled the escaping of '&' until the end of such a section.

To actually implement this, I'd remove the use of libxml from the
implementation of XMLELEMENT, and instead use our already existing
escape_xml() function, enhanced with the ability to handle the partial
escaping algorithm outlined above. We already only use libxml to escape
attribute values, so doing that isn't a radical departure from xml.c's ways.
As an added benefit, all the encoding-related problems of XMLELEMENT and
XMLATTRIBUTES would go away once libxml is removed from this code path. So
XPATH() would be the only remaining function which breaks in non-UTF-8
databases.

Comments? Thoughts? Suggestions?

best regards,
Florian Pflug

Search Discussions

  • Peter Eisentraut at Jul 27, 2011 at 2:18 pm

    On tis, 2011-07-26 at 22:44 +0200, Florian Pflug wrote:
    While reviewing the (now applied) XPATH escaping patches, Radoslaw
    found one
    case where the previous failure of XPATH to escape its return value
    was offset
    by XMLATTRIBUTES insistence to escape all input values, even if
    they're
    already of type XML.

    To wit, if you do

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&amp;'::XML AS "a"))

    you get

    xmlelement
    --------------------
    <t a="&amp;amp;"/>
    Per SQL standard, the attribute values may not be of type XML, so maybe
    we should just prohibit it.
  • Florian Pflug at Jul 27, 2011 at 5:37 pm

    On Jul27, 2011, at 16:18 , Peter Eisentraut wrote:
    On tis, 2011-07-26 at 22:44 +0200, Florian Pflug wrote:
    While reviewing the (now applied) XPATH escaping patches, Radoslaw
    found one
    case where the previous failure of XPATH to escape its return value
    was offset
    by XMLATTRIBUTES insistence to escape all input values, even if
    they're
    already of type XML.

    To wit, if you do

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&amp;'::XML AS "a"))

    you get

    xmlelement
    --------------------
    <t a="&amp;amp;"/>
    Per SQL standard, the attribute values may not be of type XML, so maybe
    we should just prohibit it.
    We probably should have, but I think it's too late for that. I don't
    believe I'm the only one who uses XPATH results as attribute values,
    and we'd severely break that use-case.

    You might say the same thing about my proposal, of course, but I believe
    the risk is much smaller there. Applications would only break if they
    (a) Pass XML from a source other than a XPath expression selecting
    a text or attribute and
    (b) actually want double-escaping to occur.

    As a data point, I've written an application with makes heavy use of
    our XML infrastructure over the last few months (as you might have guessed
    from the stream of patches ;-)). That application would be pretty much
    untroubled by the changes to XMLATTRIBUTES I proposed, but would be
    severely broken if we rejected values of type XML all together.

    best regards,
    Florian Pflug
  • Peter Eisentraut at Jul 27, 2011 at 9:09 pm

    On ons, 2011-07-27 at 19:37 +0200, Florian Pflug wrote:
    Per SQL standard, the attribute values may not be of type XML, so maybe
    we should just prohibit it.
    We probably should have, but I think it's too late for that. I don't
    believe I'm the only one who uses XPATH results as attribute values,
    and we'd severely break that use-case.

    You might say the same thing about my proposal, of course, but I
    believe
    the risk is much smaller there. Applications would only break if they
    (a) Pass XML from a source other than a XPath expression selecting
    a text or attribute and
    (b) actually want double-escaping to occur.
    Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
    would behave as in

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML AS text) AS "a"))

    which is what it indeed does in 9.1.

    So if we don't want to restrict this, for backward compatibility, then I
    would suggest that we fix it to work like it used to.

    I would be very hesitant about adding another escape mechanism that
    escapes some things but not others. We already have two or three of
    those for XML, and it doesn't seem worth adding another one just for
    this, which is outside the standard and for which a valid workaround
    exists.
  • Florian Pflug at Jul 27, 2011 at 9:21 pm

    On Jul27, 2011, at 23:08 , Peter Eisentraut wrote:
    Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
    would behave as in

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML AS text) AS "a"))
    With both 9.1 and 9.2 this query returns

    xmlelement
    --------------------
    <t a="&amp;amp;"/>

    i.e. makes the value of "a" represent the *literal* string '&amp;', *not*
    the literal string '&'. Just to be sure there's no miss-understanding here
    - is this what you expect?
    which is what it indeed does in 9.1.

    So if we don't want to restrict this, for backward compatibility, then I
    would suggest that we fix it to work like it used to.
    There's currently no difference in behaviour between 9.1 and 9.2 there.
    We've only modified XPATH to always correctly escape it's result in 9.2,
    so there's only a difference if you pass the result of XPATH() to
    XMLATTRIBUTES. Which I figured to be the most likely reason for to pass
    values of type XML to XMLATTRIBUTES, but maybe you disagree there.
    I would be very hesitant about adding another escape mechanism that
    escapes some things but not others. We already have two or three of
    those for XML, and it doesn't seem worth adding another one just for
    this, which is outside the standard and for which a valid workaround
    exists.
    What's the workaround if you have a value of type XML, e.g. '&amp;',
    and want to set an attribute to the value represented by that XML fragment
    (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy
    way to do that. Maybe I'm missing something, though.

    best regards,
    Florian Pflug
  • Peter Eisentraut at Jul 28, 2011 at 8:51 pm

    On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote:
    On Jul27, 2011, at 23:08 , Peter Eisentraut wrote:
    Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
    would behave as in

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML AS text) AS "a"))
    With both 9.1 and 9.2 this query returns

    xmlelement
    --------------------
    <t a="&amp;amp;"/>

    i.e. makes the value of "a" represent the *literal* string '&amp;', *not*
    the literal string '&'. Just to be sure there's no miss-understanding here
    - is this what you expect?
    Well, I expect it to fail.
    What's the workaround if you have a value of type XML, e.g. '&amp;',
    and want to set an attribute to the value represented by that XML fragment
    (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy
    way to do that. Maybe I'm missing something, though.
    It may be worth researching whether the XMLSERIALIZE function is
    actually doing what it should, or whether it could/should do some
    unescaping. Unfortunately, in the latest SQL/XML standard the final
    answer it nested deep in the three other standards, so I don't have an
    answer right now. But there are plenty of standards in this area, so
    I'd hope that one of them can give us the right behavior, instead of us
    making something up.
  • Florian Pflug at Jul 29, 2011 at 9:37 am

    On Jul28, 2011, at 22:51 , Peter Eisentraut wrote:
    On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote:
    On Jul27, 2011, at 23:08 , Peter Eisentraut wrote:
    Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
    would behave as in

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML AS text) AS "a"))
    With both 9.1 and 9.2 this query returns

    xmlelement
    --------------------
    <t a="&amp;amp;"/>

    i.e. makes the value of "a" represent the *literal* string '&amp;', *not*
    the literal string '&'. Just to be sure there's no miss-understanding here
    - is this what you expect?
    Well, I expect it to fail.
    Now you've lost me. What exactly should fail under what circumstances?
    What's the workaround if you have a value of type XML, e.g. '&amp;',
    and want to set an attribute to the value represented by that XML fragment
    (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy
    way to do that. Maybe I'm missing something, though.
    It may be worth researching whether the XMLSERIALIZE function is
    actually doing what it should, or whether it could/should do some
    unescaping.
    I don't see how that could work. It can't unescape anything in e.g.
    '<t>&amp;</t>', because the result would be a quite useless not-really-XML
    kind of thing. It could unescape '&amp;' but that kind of content-dependent
    behaviour seem even worse than my proposed escaping rules for XMLATTRIBUTES.
    Unfortunately, in the latest SQL/XML standard the final
    answer it nested deep in the three other standards, so I don't have an
    answer right now. But there are plenty of standards in this area, so
    I'd hope that one of them can give us the right behavior, instead of us
    making something up.
    Which standards to you have in mind there? If you can point me to a place
    where I can obtain them, I could check if there's something in them
    which helps.

    best regards,
    Florian Pflug
  • Peter Eisentraut at Aug 11, 2011 at 7:16 am

    On fre, 2011-07-29 at 11:37 +0200, Florian Pflug wrote:
    On Jul28, 2011, at 22:51 , Peter Eisentraut wrote:
    On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote:
    On Jul27, 2011, at 23:08 , Peter Eisentraut wrote:
    Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
    would behave as in

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML AS text) AS "a"))
    With both 9.1 and 9.2 this query returns

    xmlelement
    --------------------
    <t a="&amp;amp;"/>

    i.e. makes the value of "a" represent the *literal* string '&amp;', *not*
    the literal string '&'. Just to be sure there's no miss-understanding here
    - is this what you expect?
    Well, I expect it to fail.
    Now you've lost me. What exactly should fail under what circumstances?
    To me, the best solution still appears to be forbidding passing values
    of type xml to XMLATTRIBUTES, unless we find an obviously better
    solution that is not, "I came up with this custom escape function that I
    tweaked so that it appears to make sense".
    Unfortunately, in the latest SQL/XML standard the final
    answer it nested deep in the three other standards, so I don't have an
    answer right now. But there are plenty of standards in this area, so
    I'd hope that one of them can give us the right behavior, instead of us
    making something up.
    Which standards to you have in mind there? If you can point me to a place
    where I can obtain them, I could check if there's something in them
    which helps.
    In SQL/XML 2008, the actual behavior of XMLSERIALIZE is delegated to
    "XSLT 2.0 and XQuery 1.0 Serialization". I'm not familiar with this
    latter standard, but it appears to have lots of options and parameters,
    one of which might help us.
  • Florian Pflug at Aug 11, 2011 at 9:13 am

    On Aug11, 2011, at 09:16 , Peter Eisentraut wrote:
    On fre, 2011-07-29 at 11:37 +0200, Florian Pflug wrote:
    On Jul28, 2011, at 22:51 , Peter Eisentraut wrote:
    On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote:
    On Jul27, 2011, at 23:08 , Peter Eisentraut wrote:
    Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
    would behave as in

    SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML AS text) AS "a"))
    With both 9.1 and 9.2 this query returns

    xmlelement
    --------------------
    <t a="&amp;amp;"/>

    i.e. makes the value of "a" represent the *literal* string '&amp;', *not*
    the literal string '&'. Just to be sure there's no miss-understanding here
    - is this what you expect?
    Well, I expect it to fail.
    Now you've lost me. What exactly should fail under what circumstances?
    To me, the best solution still appears to be forbidding passing values
    of type xml to XMLATTRIBUTES, unless we find an obviously better
    solution that is not, "I came up with this custom escape function that I
    tweaked so that it appears to make sense".
    Hm, OK, I see your point. However, if we simply raise an error in 9.2,
    and do nothing else, that we make it impossible to use the result of
    an XPath expression as an XML attribute value. Not just inconvenient,
    but impossible, so I don't think we can do that. We'd thus need to add a
    function

    XMLUNESCAPE(XML) RETURNS TEXT

    to restore that functionality. Defining a sane behaviour for such a function,
    however, seems no easier than defining sane behaviour for an XML attribute
    of already of type XML. The core of the problems remains to define the result
    of XMLUNESCAPE('<tag>content</tag>'), just as the core of the XMLATTRIBUTES
    problems is to define XMLELEMENT(... XMLATTRIBUTES('<tag>content</tag>' as a)).

    Thinking about this further, it seems that we essentially have two distinct
    classes of XML values. Some are essentially plain text, but might contains
    entity references, while others are "real" XML fragments which contain at
    least one tag. That suggests that a sensible behaviour for XMLUNESCAPE might
    be to return a string with the entity references resolved in the former case,
    and simply return an error in the latter.

    To summarize, we'd have

    XMLUNESCAPE(''::XML) -> 'a'
    XMLUNESCAPE('a'::XML) -> 'a'
    XMLUNESCAPE('&lt;'::XML) -> '<'
    XMLUNESCAPE('<t/>'::XML) -> error

    To not break applications needlessly, I'd then be inclined to make

    XMLATTRIBUTES(xml_value as "a")

    mean

    XMLATTRIBUTES(XMLUNESCAPE(xml_value) as "a")

    i.e. throw an error if xml_value contains anything but plain text and
    entity references. But I could probably also live with not doing that.
    Unfortunately, in the latest SQL/XML standard the final
    answer it nested deep in the three other standards, so I don't have an
    answer right now. But there are plenty of standards in this area, so
    I'd hope that one of them can give us the right behavior, instead of us
    making something up.
    Which standards to you have in mind there? If you can point me to a place
    where I can obtain them, I could check if there's something in them
    which helps.
    In SQL/XML 2008, the actual behavior of XMLSERIALIZE is delegated to
    "XSLT 2.0 and XQuery 1.0 Serialization". I'm not familiar with this
    latter standard, but it appears to have lots of options and parameters,
    one of which might help us.
    I'll try to obtain a copy of that. Thanks.

    best regards,
    Florian Pflug

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedJul 26, '11 at 8:45p
activeAug 11, '11 at 9:13a
posts9
users2
websitepostgresql.org...
irc#postgresql

2 users in discussion

Florian Pflug: 5 posts Peter Eisentraut: 4 posts

People

Translate

site design / logo © 2022 Grokbase