I am wanting to write some new C functions which leverage postgresql's
existing regexp code in an extension module. I notice that the functions
RE_compile_and_cache and RE_compile_and_execute in
src/backend/util/regexp.c contain the code necessary to connect the regexp
code in src/backend/regex with the postgresql string conversion, error
reporting, and memory management infrastructure, as well as providing
caching of regexes which would probably be a win to any regex function in
postgres. It would seem that these functions would be useful to any
C function dealing with regexp matching in postgresql, but they are static
functions, so they cannot be used outside of
src/backend/utils/adt/regexp.c. Since all of the core regexp functions
are in this file, this has been ok, but it is my opinion that these
functions should be made visible and added to a header file so that
extensions can make use of them, because any add-on functions that want to
use the regex code in postgres in some new way would need to basically
duplicate that same code in order to do so.

Is there some specific reason that these functions are static, or would it
be ok to make them non-static and add them to a header (say,
src/include/utils/regexp.h) so that extensions could use them as well? I
could put together a patch for this if desired, or it seems simple enough
that someone could just do it...

--
I can't decide whether to commit suicide or go bowling.
-- Florence Henderson

Search Discussions

  • Tom Lane at Feb 2, 2007 at 12:41 am

    Jeremy Drake writes:
    Is there some specific reason that these functions are static,
    Yeah: not cluttering the global namespace. I'm not excited about
    exporting everything that anybody could possibly want access to;
    that just makes it harder to maintain the code. When you see a
    static function, you know that you don't have to look further than
    the current file to understand how it's used. When you see a global
    function, the difficulty of knowing how it's used is an order of
    magnitude higher, maybe more. What's more, if you want to change it
    then you have to worry about the possible side-effects on unknown
    non-core code that might be calling it.

    Is there a reason for not putting your new code itself into regexp.c?

    regards, tom lane
  • Jeremy Drake at Feb 2, 2007 at 1:11 am

    On Thu, 1 Feb 2007, Tom Lane wrote:

    Jeremy Drake <pgsql@jdrake.com> writes:
    Is there some specific reason that these functions are static,
    Yeah: not cluttering the global namespace.
    Is there a reason for not putting your new code itself into regexp.c?
    Not really, I just figured it would be cleaner/easier to write it as an
    extension. I also figure that it is unlikely that every regexp function
    that anyone could possibly want will be implemented in core in that one
    file. If anyone writes an extension like this, they would need to
    duplicate a good amount of code in order to do so, that would make more
    difficulty in maintaining the code if it should need to change. It also
    makes developing a new function a lot easier, no need to re-initdb to add
    the function, no need to relink the postmaster and restart it every time
    the function changes.

    Anyway, the particular thing I was writing was a function like
    substring(str FROM pattern) which instead of returning just the first
    match group, would return an array of text containing all of the match
    groups. I exported the functions in my sandbox, and wrote a module with a
    function that does this.
    regards, tom lane

    ---------------------------(end of broadcast)---------------------------
    TIP 4: Have you searched our list archives?

    http://archives.postgresql.org
    --
    Calling J-Man Kink. Calling J-Man Kink. Hash missile sighted, target
    Los Angeles. Disregard personal feelings about city and intercept.
  • David Fetter at Feb 2, 2007 at 5:27 am

    On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
    On Thu, 1 Feb 2007, Tom Lane wrote:

    Jeremy Drake <pgsql@jdrake.com> writes:
    Is there some specific reason that these functions are static,
    Yeah: not cluttering the global namespace.
    Is there a reason for not putting your new code itself into regexp.c?
    Not really, I just figured it would be cleaner/easier to write it as an
    extension. I also figure that it is unlikely that every regexp function
    that anyone could possibly want will be implemented in core in that one
    file. If anyone writes an extension like this, they would need to
    duplicate a good amount of code in order to do so, that would make more
    difficulty in maintaining the code if it should need to change. It also
    makes developing a new function a lot easier, no need to re-initdb to add
    the function, no need to relink the postmaster and restart it every time
    the function changes.

    Anyway, the particular thing I was writing was a function like
    substring(str FROM pattern) which instead of returning just the
    first match group, would return an array of text containing all of
    the match groups.
    That'd be great! People who use dynamic languages like Perl would
    feel much more at home having access to all the matches. While you're
    at it, could you could make pre-match and post-match (optionally--I
    know it's expensive) available?

    Cheers,
    D
    --
    David Fetter <david@fetter.org> http://fetter.org/
    phone: +1 415 235 3778 AIM: dfetter666
    Skype: davidfetter

    Remember to vote!
  • Jeremy Drake at Feb 2, 2007 at 6:17 am

    On Thu, 1 Feb 2007, David Fetter wrote:
    On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
    Anyway, the particular thing I was writing was a function like
    substring(str FROM pattern) which instead of returning just the
    first match group, would return an array of text containing all of
    the match groups.
    If you are subscribed to -patches, I sent my code to date there earlier
    this evening. I also said that I wanted to make a function that split on
    a pattern (like perl split) and returned setof text.
    That'd be great! People who use dynamic languages like Perl would
    feel much more at home having access to all the matches. While you're
    at it, could you could make pre-match and post-match (optionally--I
    know it's expensive) available?
    I could, but I'm not sure how someone would go about accessing such a
    thing. What I just wrote would be most like this perl:
    @foo = ($str=~/pattern/);

    Where would pre and post match fit into this? Are you talking about a
    different function? Or sticking prematch at the beginning of the array
    and postmatch at the end? I could also put the whole match somewhere
    also, but I did not in this version.

    The code I wrote returns a text[] which is one-dimensional, has a lower
    bound of 1 (as most postgres arrays do), where if there are n capture
    groups, ra[1] has the first capture group and ra[n] has the last one.
    Since postgres has an option to make different lower bounds, I suppose I
    could have an option to put the prematch in [-1], the entire match in [0],
    and the postmatch in [n+1]. This seems to be odd to me though.

    I guess I'm saying, I agree that the entire match, prematch, and postmatch
    would be helpful, but how would you propose to present these to the user?
    Cheers,
    D
    --
    To err is human, to forgive, beyond the scope of the Operating System.
  • David Fetter at Feb 2, 2007 at 6:55 am

    On Thu, Feb 01, 2007 at 10:16:54PM -0800, Jeremy Drake wrote:
    On Thu, 1 Feb 2007, David Fetter wrote:
    On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
    Anyway, the particular thing I was writing was a function like
    substring(str FROM pattern) which instead of returning just the
    first match group, would return an array of text containing all
    of the match groups.
    If you are subscribed to -patches, I sent my code to date there
    earlier this evening. I also said that I wanted to make a function
    that split on a pattern (like perl split) and returned setof text.
    That'd be great! People who use dynamic languages like Perl would
    feel much more at home having access to all the matches. While
    you're at it, could you could make pre-match and post-match
    (optionally--I know it's expensive) available?
    I could, but I'm not sure how someone would go about accessing such
    a thing. What I just wrote would be most like this perl: @foo =
    ($str=~/pattern/);
    Where would pre and post match fit into this? Are you talking about a
    different function?
    Yes, although it might have the same name, as in regex_match(pattern
    TEXT, string TEXT, return_pre_and_post BOOL).
    Or sticking prematch at the beginning of the array and postmatch at
    the end? I could also put the whole match somewhere also, but I did
    not in this version.
    The data structure could be something like

    TYPE matches (
    prematch TEXT,
    match TEXT[],
    postmatch TEXT
    )
    The code I wrote returns a text[] which is one-dimensional, has a lower
    bound of 1 (as most postgres arrays do), where if there are n capture
    groups, ra[1] has the first capture group and ra[n] has the last one.
    Since postgres has an option to make different lower bounds, I suppose I
    could have an option to put the prematch in [-1], the entire match in [0],
    and the postmatch in [n+1]. This seems to be odd to me though.
    Odd == bad. I think the pre- and post-matches should be different in
    essence, not just in index :)
    I guess I'm saying, I agree that the entire match, prematch, and postmatch
    would be helpful, but how would you propose to present these to the user?
    See above :)

    Cheers,
    D
    --
    David Fetter <david@fetter.org> http://fetter.org/
    phone: +1 415 235 3778 AIM: dfetter666
    Skype: davidfetter

    Remember to vote!
  • Jeremy Drake at Feb 2, 2007 at 8:15 am

    On Thu, 1 Feb 2007, David Fetter wrote:

    Yes, although it might have the same name, as in regex_match(pattern
    TEXT, string TEXT, return_pre_and_post BOOL).

    The data structure could be something like

    TYPE matches (
    prematch TEXT,
    match TEXT[],
    postmatch TEXT
    )
    I just coded up for this:

    CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS
    text[]
    AS 'MODULE_PATHNAME', 'regexp_matches'
    LANGUAGE C IMMUTABLE STRICT;

    CREATE FUNCTION regexp_matches(
    IN str text, IN pattern text, IN return_pre_and_post bool,
    OUT prematch text, OUT fullmatch text, OUT matches text[], OUT
    postmatch text) RETURNS record
    AS 'MODULE_PATHNAME', 'regexp_matches'
    LANGUAGE C IMMUTABLE STRICT;


    Which works like this:

    jeremyd=# \pset null '\\N'
    Null display is "\N".
    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(beque)$re$);
    regexp_matches
    ----------------
    {bar,beque}
    (1 row)

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(beque)$re$, false);
    prematch | fullmatch | matches | postmatch
    ----------+-----------+-------------+-----------
    \N | \N | {bar,beque} | \N
    (1 row)

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(beque)$re$, true);
    prematch | fullmatch | matches | postmatch
    ----------+-----------+-------------+-----------
    foo | barbeque | {bar,beque} | baz
    (1 row)


    And then you also have this behavior in the matches array:

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(.*)(beque)$re$);
    regexp_matches
    ----------------
    {bar,"",beque}
    (1 row)

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(.+)(beque)$re$);
    regexp_matches
    ----------------
    \N
    (1 row)

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(.+)?(beque)$re$);
    regexp_matches
    ------------------
    {bar,NULL,beque}
    (1 row)

    Reasonable?

    --
    A.A.A.A.A.:
    An organization for drunks who drive
  • Jeremy Drake at Feb 2, 2007 at 8:54 am

    On Fri, 2 Feb 2007, Jeremy Drake wrote:

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(beque)$re$, false);
    prematch | fullmatch | matches | postmatch
    ----------+-----------+-------------+-----------
    \N | \N | {bar,beque} | \N
    (1 row)
    I just changed this to fill in fullmatch when the bool is false, so this
    one would look like:
    prematch | fullmatch | matches | postmatch
    ----------+-----------+-------------+-----------
    \N | barbeque | {bar,beque} | \N
    (1 row)

    I also removed my check for capture groups, since in this setup you could
    get useful output without any. I am still trying to decide whether or not
    to add back an error if you called the no-bool version which just returns
    the array, and you do not have any capture groups. ISTM this is likely an
    oversight on the query author's part, and it would be helpful to alert him
    to this.

    If you have no capture groups, the matches array is empty (not null). If
    the match happened at the start of the string, the prematch is an empty
    string, and if the match happened at the end of the string, the postmatch
    is an empty string.

    Reasonable?
    --
    It's odd, and a little unsettling, to reflect upon the fact that
    English is the only major language in which "I" is capitalized; in many
    other languages "You" is capitalized and the "i" is lower case.
    -- Sydney J. Harris
  • David Fetter at Feb 2, 2007 at 5:22 pm

    On Fri, Feb 02, 2007 at 12:54:30AM -0800, Jeremy Drake wrote:
    On Fri, 2 Feb 2007, Jeremy Drake wrote:

    jeremyd=# select * from regexp_matches('foobarbequebaz',
    $re$(bar)(beque)$re$, false);
    prematch | fullmatch | matches | postmatch
    ----------+-----------+-------------+-----------
    \N | \N | {bar,beque} | \N
    (1 row)
    I just changed this to fill in fullmatch when the bool is false, so this
    one would look like:
    prematch | fullmatch | matches | postmatch
    ----------+-----------+-------------+-----------
    \N | barbeque | {bar,beque} | \N
    (1 row)

    I also removed my check for capture groups, since in this setup you could
    get useful output without any. I am still trying to decide whether or not
    to add back an error if you called the no-bool version which just returns
    the array, and you do not have any capture groups. ISTM this is likely an
    oversight on the query author's part, and it would be helpful to alert him
    to this.

    If you have no capture groups, the matches array is empty (not null). If
    the match happened at the start of the string, the prematch is an empty
    string, and if the match happened at the end of the string, the postmatch
    is an empty string.
    Reasonable?
    This is great :)

    Cheers,
    D
    --
    David Fetter <david@fetter.org> http://fetter.org/
    phone: +1 415 235 3778 AIM: dfetter666
    Skype: davidfetter

    Remember to vote!
  • Jeremy Drake at Feb 3, 2007 at 1:00 am

    On Fri, 2 Feb 2007, Jeremy Drake wrote:

    I just coded up for this:

    CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS
    text[]
    AS 'MODULE_PATHNAME', 'regexp_matches'
    LANGUAGE C IMMUTABLE STRICT;

    CREATE FUNCTION regexp_matches(
    IN str text, IN pattern text, IN return_pre_and_post bool,
    OUT prematch text, OUT fullmatch text, OUT matches text[], OUT
    postmatch text) RETURNS record
    AS 'MODULE_PATHNAME', 'regexp_matches'
    LANGUAGE C IMMUTABLE STRICT;
    I wanted to put out there the question of what order the parameters to
    these regex functions should go. ISTM most people expect them to go
    (pattern, string), but I made these functions consistant with
    substring(text,text) which takes (string, pattern). Now I have been
    working on a regexp_split function, which takes (pattern, string), which
    is what someone familiar with the function from perl would expect, but is
    not consistant with substring or now with my regexp_matches function.

    I want to ask, should I break with following substring's precedent, and
    put the pattern first (as most people probably would expect), or should I
    break with perl's precedent and put the pattern second (to behave like
    substring)?


    --
    We cannot put the face of a person on a stamp unless said person is
    deceased. My suggestion, therefore, is that you drop dead.
    -- James E. Day, Postmaster General
  • Tom Lane at Feb 3, 2007 at 1:56 am

    Jeremy Drake writes:
    I want to ask, should I break with following substring's precedent, and
    put the pattern first (as most people probably would expect), or should I
    break with perl's precedent and put the pattern second (to behave like
    substring)?
    All of SQL's pattern match operators have the pattern on the right, so
    my advice is to stick with that and try not to think about Perl ;-)

    regards, tom lane
  • David Fetter at Feb 3, 2007 at 2:03 am

    On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote:
    Jeremy Drake <pgsql@jdrake.com> writes:
    I want to ask, should I break with following substring's
    precedent, and put the pattern first (as most people probably
    would expect), or should I break with perl's precedent and put the
    pattern second (to behave like substring)?
    All of SQL's pattern match operators have the pattern on the right,
    so my advice is to stick with that and try not to think about Perl
    ;-)
    Perl provides inspiration, but here, consistency would help more than
    orderly imitation of how it does what it does. And besides, when
    people really need Perl, they can pull it in as a PL :)

    Cheers,
    D
    --
    David Fetter <david@fetter.org> http://fetter.org/
    phone: +1 415 235 3778 AIM: dfetter666
    Skype: davidfetter

    Remember to vote!
  • Jeremy Drake at Feb 7, 2007 at 8:26 am

    On Sun, 4 Feb 2007, David Fetter wrote:
    On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:

    Let me know if you see any bugs or issues with this code, and I am
    open to suggestions for further regression tests ;)
    I have not heard anything, so I guess at this point I should figure out
    where to go next with this. I see a couple options:

    * Set this up as a pgfoundry project or contrib. This would require
    merging the patch to expose some functions from regexp.c outside that
    file, which has raised some concerns about maintainability.

    * Put together a patch to add these functions to core. I could put them
    directly in regexp.c, so the support functions could stay static. My
    concern here is that I don't know if there are any functions currently
    in core with OUT parameters. I don't know the acceptable style for
    handling this: OUT parameters, a named composite type, ...?

    Does anyone have any opinions either way, as to how I should proceed
    from here?


    * maybe a join function that works as an aggregate
    SELECT join(',', col) FROM tbl
    currently can be written as
    SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
    The array_accum() aggregate in the docs works OK for this purpose.
    I have decided not to pursue this function, I think the array construct,
    or the array_accum option, is about the best possible currently. If it
    should become possible in the future to write aggregates with a non-sql
    state type (structs with pointers) it may be worthwhile to re-evaluate
    this.


    --
    The cost of living hasn't affected its popularity.
  • Tom Lane at Feb 7, 2007 at 2:24 pm

    Jeremy Drake writes:
    * Put together a patch to add these functions to core. I could put them
    directly in regexp.c, so the support functions could stay static. My
    concern here is that I don't know if there are any functions currently
    in core with OUT parameters.
    As of 8.2 there are.

    If we are going to include these I would vote for core not contrib
    status, exactly to avoid having to export those functions.

    regards, tom lane
  • David Fetter at Feb 7, 2007 at 7:07 pm

    On Wed, Feb 07, 2007 at 09:23:58AM -0500, Tom Lane wrote:
    Jeremy Drake <pgsql@jdrake.com> writes:
    * Put together a patch to add these functions to core. I could put them
    directly in regexp.c, so the support functions could stay static. My
    concern here is that I don't know if there are any functions currently
    in core with OUT parameters.
    As of 8.2 there are.

    If we are going to include these I would vote for core not contrib
    status, exactly to avoid having to export those functions.
    +1 for core. :)

    Cheers,
    D
    --
    David Fetter <david@fetter.org> http://fetter.org/
    phone: +1 415 235 3778 AIM: dfetter666
    Skype: davidfetter

    Remember to vote!
  • Jeremy Drake at Feb 7, 2007 at 8:34 pm

    On Wed, 7 Feb 2007, Tom Lane wrote:

    Jeremy Drake <pgsql@jdrake.com> writes:
    * Put together a patch to add these functions to core. I could put them
    directly in regexp.c, so the support functions could stay static. My
    concern here is that I don't know if there are any functions currently
    in core with OUT parameters.
    As of 8.2 there are.
    Could you give me the name of one in pg_proc.h so I can see how I should
    go about adding one there?
    If we are going to include these I would vote for core not contrib
    status, exactly to avoid having to export those functions.
    OK, this patch will be my next project.


    --
    History is curious stuff
    You'd think by now we had enough
    Yet the fact remains I fear
    They make more of it every year.
  • Tom Lane at Feb 7, 2007 at 8:37 pm

    Jeremy Drake writes:
    On Wed, 7 Feb 2007, Tom Lane wrote:
    As of 8.2 there are.
    Could you give me the name of one in pg_proc.h so I can see how I should
    go about adding one there?
    select * from pg_proc where proargmodes is not null;

    regards, tom lane

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedFeb 1, '07 at 9:20p
activeFeb 7, '07 at 8:37p
posts17
users3
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase