I broached this topic last year[1], but the project got tabled until
now, so I raise it again. We want to be able to search text
(extracted from character-based PDF files) which will contain legal
terms and statute cites, and we want to be able to do tsearch2
searches (under 8.3.recent). It's clear enough how to create a
dictionary to gracefully handle the legal terms, but I'm less sure
about the statute cites.

I got one response[2], which mentioned a prefix search in the 8.4
release, and provided a link to a perl regular expression based
dictionary. I'm wondering if anyone has feedback on either of these
techniques, and whether they might work for our needs. I'm not sure I
adequately described our needs, so I'll fill that out a little more.

People are likely to search for statute cites, which tend to have a
hierarchical form. I'm not sure the prefix approach will work for
this. For example, there is a section 939.64 in the state statutes
dealing with commission of a crime while wearing a bulletproof
garment. If someone searches for that, they should find subsections
like 939.64(1) or 939.64(2) but not different sections which start
with the same characters like 939.641 (the section on concealing
identity) or 939.645 (the section on hate crimes). A search for
chapter 939 should return any of the above.
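
To make the over-match concrete: with the 8.4 prefix syntax mentioned
in [2], a naive prefix query would presumably also match the unrelated
sections, since they begin with the same characters (a sketch using the
'simple' configuration):

select to_tsvector('simple', '939.641')
       @@ to_tsquery('simple', '939.64:*');
-- matches: '939.641' begins with '939.64' -- exactly the false
-- positive described above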

Of course, we want someone to be able to search on 939.64, 939.641,
and 939.645 and get documents which reference all of the above (i.e.,
to look for a document referring to a hate crime committed while
concealing identity and wearing a bulletproof garment).

Suggestions welcome on how to handle this user requirement.

-Kevin

[1] http://archives.postgresql.org/pgsql-admin/2008-06/msg00033.php
[2] http://archives.postgresql.org/pgsql-admin/2008-06/msg00034.php


  • Tom Lane at Mar 11, 2009 at 12:30 am

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    People are likely to search for statute cites, which tend to have a
    hierarchical form. I'm not sure the prefix approach will work for
    this. For example, there is a section 939.64 in the state statutes
    dealing with commission of a crime while wearing a bulletproof
    garment. If someone searches for that, they should find subsections
    like 939.64(1) or 939.64(2) but not different sections which start
    with the same characters like 939.641 (the section on concealing
    identity) or 939.645 (the section on hate crimes). A search for
    chapter 939 should return any of the above.
    I think what you need is a custom parser that treats these similarly to
    hyphenated words. If I pretend that the dot is a hyphen I get matching
    behavior that seems to meet all those requirements.

    Unfortunately we don't seem to have any really easy way to plug in a
    custom parser, other than copy-paste-modify the existing one which would
    be a PITA from a maintenance standpoint. Perhaps you could pass the
    texts and the queries through a regexp substitution that converts
    digit-dot-digit to digit-dash-digit?
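
    Such a substitution might be done with regexp_replace (a sketch; with
    standard_conforming_strings off, the E'' escape syntax avoids warnings):

    select regexp_replace('939.64(1)', E'(\\d)\\.(\\d)', E'\\1-\\2', 'g');
    -- → '939-64(1)'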

    regards, tom lane
  • Oleg Bartunov at Mar 11, 2009 at 6:58 am

    On Tue, 10 Mar 2009, Tom Lane wrote:

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    People are likely to search for statute cites, which tend to have a
    hierarchical form. I'm not sure the prefix approach will work for
    this. For example, there is a section 939.64 in the state statutes
    dealing with commission of a crime while wearing a bulletproof
    garment. If someone searches for that, they should find subsections
    like 939.64(1) or 939.64(2) but not different sections which start
    with the same characters like 939.641 (the section on concealing
    identity) or 939.645 (the section on hate crimes). A search for
    chapter 939 should return any of the above.
    I think what you need is a custom parser that treats these similarly to
    hyphenated words. If I pretend that the dot is a hyphen I get matching
    behavior that seems to meet all those requirements.

    Unfortunately we don't seem to have any really easy way to plug in a
    custom parser, other than copy-paste-modify the existing one which would
    be a PITA from a maintenance standpoint. Perhaps you could pass the
    texts and the queries through a regexp substitution that converts
    digit-dot-digit to digit-dash-digit?
    perhaps, for 8.4 it's better to utilize prefix search, like
    to_tsquery('939.645:*') will find what Kevin needs. The problem is with
    the parser, so I'd preprocess text before indexing to convert all
    digit.digit(digit) to digit.digit.digit, which is what the parser
    recognizes as a single lexeme 'version'. Here is just an illustration:

    qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
     tokid |  token
    -------+----------
         8 | 939.64.1
        12 |

    btw, having 'version' it's possible to use dict_regex for 8.3.
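
    Putting the pieces together, the index side and the 8.4-style prefix
    query would line up roughly like this (illustrative sketch):

    -- index side: translate '()' so the cite becomes one 'version' lexeme
    select to_tsvector('simple', translate('939.64(1)','()','. '));
    -- query side: a prefix query then covers the whole 939.64 subtree
    select to_tsvector('simple', translate('939.64(1)','()','. '))
           @@ to_tsquery('simple', '939.64:*');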

    Regards,
    Oleg
    _____________________________________________________________
    Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83
  • Kevin Grittner at Mar 11, 2009 at 2:01 pm

    Oleg Bartunov wrote:
    On Tue, 10 Mar 2009, Tom Lane wrote:
    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    People are likely to search for statute cites, which tend to have a
    hierarchical form. I'm not sure the prefix approach will work for
    this. For example, there is a section 939.64 in the state statutes
    dealing with commission of a crime while wearing a bulletproof
    garment. If someone searches for that, they should find subsections
    like 939.64(1) or 939.64(2) but not different sections which start
    with the same characters like 939.641 (the section on concealing
    identity) or 939.645 (the section on hate crimes). A search for
    chapter 939 should return any of the above.
    Perhaps you could pass the texts and the queries through a regexp
    substitution that converts digit-dot-digit to digit-dash-digit?
    perhaps, for 8.4 it's better to utilize prefix search, like
    to_tsquery('939.645:*') will find what Kevin needs. The problem is with
    the parser, so I'd preprocess text before indexing to convert all
    digit.digit(digit) to digit.digit.digit, which is what the parser
    recognizes as a single lexeme 'version'. Here is just an illustration:

    qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
     tokid |  token
    -------+----------
         8 | 939.64.1
        12 |

    btw, having 'version' it's possible to use dict_regex for 8.3.
    Tom, Oleg: Thanks for the suggestions. Looks promising.

    -Kevin
  • Kevin Grittner at Apr 6, 2009 at 8:52 pm

    Tom Lane wrote:
    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    People are likely to search for statute cites, which tend to have a
    hierarchical form.
    I think what you need is a custom parser
    I've just returned to this and after review have become convinced that
    this is absolutely necessary; once the default parser has done its
    work, figuring out the bounds of a statute cite would be next to
    impossible. Examples of the kind of fun you can have labeling
    statutes, ordinances, and rules should you ever get elected to public
    office:

    10-3-350.10(1)(k)
    10.1(40)(d)1
    10.40.040(c)(2)
    100.525(2)(a)3
    105-10.G(3)(a)
    11.04C.3.R.(1)
    8.961.41(cm)
    9.125.07(4A)(3)
    947.013(1m)(a)

    In any of these, a search string which exactly matches something up to
    (but not including) a dash, dot, or left paren should find that thing.
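
    A rough regexp for cites of those shapes might look like the
    following (illustrative only, not a tested grammar):

    select regexp_matches('see 947.013(1m)(a) and 10-3-350.10(1)(k)',
           E'\\d+(?:[-.][0-9A-Za-z]+)*(?:\\([0-9A-Za-z]+\\))*', 'g');
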
    Unfortunately we don't seem to have any really easy way to plug in a
    custom parser, other than copy-paste-modify the existing one which
    would be a PITA from a maintenance standpoint.
    I'm afraid I'm going to have to bite the bullet and do this anyway.
    Any guidance on how to go about it may save me some time. Also, if
    there is any way to do this which may be useful to others or integrate
    into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.

    -Kevin
  • Oleg Bartunov at Apr 7, 2009 at 9:08 am
    Kevin,

    contrib/test_parser - an example parser code.
    On Mon, 6 Apr 2009, Kevin Grittner wrote:

    Tom Lane wrote:
    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    People are likely to search for statute cites, which tend to have a
    hierarchical form.
    I think what you need is a custom parser
    I've just returned to this and after review have become convinced that
    this is absolutely necessary; once the default parser has done its
    work, figuring out the bounds of a statute cite would be next to
    impossible. Examples of the kind of fun you can have labeling
    statutes, ordinances, and rules should you ever get elected to public
    office:

    10-3-350.10(1)(k)
    10.1(40)(d)1
    10.40.040(c)(2)
    100.525(2)(a)3
    105-10.G(3)(a)
    11.04C.3.R.(1)
    8.961.41(cm)
    9.125.07(4A)(3)
    947.013(1m)(a)

    In any of these, a search string which exactly matches something up to
    (but not including) a dash, dot, or left paren should find that thing.
    Unfortunately we don't seem to have any really easy way to plug in a
    custom parser, other than copy-paste-modify the existing one which
    would be a PITA from a maintenance standpoint.
    I'm afraid I'm going to have to bite the bullet and do this anyway.
    Any guidance on how to go about it may save me some time. Also, if
    there is any way to do this which may be useful to others or integrate
    into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.

    -Kevin
    Regards,
    Oleg
  • Kevin Grittner at Apr 7, 2009 at 2:11 pm

    Oleg Bartunov wrote:

    contrib/test_parser - an example parser code.
    Thanks! Sorry I missed that.

    -Kevin
  • Kevin Grittner at Apr 7, 2009 at 6:28 pm

    Oleg Bartunov wrote:
    contrib/test_parser - an example parser code.
    Using that as a template, I seem to be on track to use the regexp.c
    code to pick out statute cites from the text in my start function, and
    recognize when I'm positioned on one in my getlexeme (GETTOKEN)
    function, delegating everything before, between, and after statute
    cites to the default parser. (I really didn't want to copy/paste and
    modify the whole default parser.)

    That leaves one question I'm still pretty fuzzy on -- how do I go
    about having a statute cite in a tsquery match the entire statute cite
    from a tsvector, or delimited leading portions of it, without having
    it match shorter portions?

    For example:

    If the document text contains '341.15(3)' I want to find it with a
    search string of '341', '341.15', '341.15(3)' but not '341.15(3)(b)',
    '341.1', or '15'. How do I handle that? Do I have to build my
    tsquery values myself as text and cast to tsquery, or is there
    something more graceful that I'm missing?
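
    By hand it would presumably look something like this, quoting the
    cite so the tsquery parser keeps it as a single lexeme:

    select $$'341.15(3)'$$::tsquery;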

    -Kevin
  • Oleg Bartunov at Apr 7, 2009 at 6:48 pm

    On Tue, 7 Apr 2009, Kevin Grittner wrote:

    If the document text contains '341.15(3)' I want to find it with a
    search string of '341', '341.15', '341.15(3)' but not '341.15(3)(b)',
    '341.1', or '15'. How do I handle that? Do I have to build my
    tsquery values myself as text and cast to tsquery, or is there
    something more graceful that I'm missing?
    of course, you can build tsquery yourself, but once your parser can
    recognize your very own token 'xxx', it'd be much better to have a
    mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.
    For example, we have our dict_regex
    http://vo.astronet.ru/arxiv/dict_regex.html

    Regards,
    Oleg
  • Kevin Grittner at Apr 7, 2009 at 7:20 pm

    Oleg Bartunov wrote:
    of course, you can build tsquery yourself, but once your parser can
    recognize your very own token 'xxx', it'd be much better to have a
    mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.
    I probably just need to have that "Aha!" moment, slap my forehead, and
    move on; but I'm not quite understanding something. The answer to
    this question could be it: Can I use a different set of dictionaries
    for creating the tsquery than I did for the tsvector?

    If so, I can have the dictionaries which generate the tsvector include
    the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
    dictionaries for the tsquery can only generate the token based on
    exactly what the user typed. That would give me exactly what I want,
    but somehow I have gotten the impression that the tsvector and tsquery
    need to be generated using the same dictionary set.

    I hope that's a mistaken impression?

    -Kevin
  • Tom Lane at Apr 7, 2009 at 7:29 pm

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    Can I use a different set of dictionaries
    for creating the tsquery than I did for the tsvector?
    Sure, as long as the tokens (normalized words) that they produce match
    up for words that you want to have match. Once the tokens come out,
    they're just strings as far as the rest of the text search machinery
    is concerned.
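
    Concretely, that allows something like the following, where
    'legal_doc' and 'legal_query' are hypothetical configurations built
    for each side:

    -- the document config emits the cite plus its delimited prefixes;
    -- the query config emits exactly what the user typed
    select to_tsvector('legal_doc', 'see s. 341.15(3)(b)')
           @@ to_tsquery('legal_query', '341.15');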

    regards, tom lane
  • Kevin Grittner at Apr 7, 2009 at 7:33 pm

    Tom Lane wrote:
    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    Can I use a different set of dictionaries
    for creating the tsquery than I did for the tsvector?
    Sure, as long as the tokens (normalized words) that they produce
    match up for words that you want to have match. Once the tokens
    come out, they're just strings as far as the rest of the text search
    machinery is concerned.
    Fantastic! Don't know how I got confused about that, but the way now
    looks clear.

    Thanks!

    -Kevin
  • Oleg Bartunov at Apr 7, 2009 at 7:32 pm

    On Tue, 7 Apr 2009, Kevin Grittner wrote:

    Oleg Bartunov wrote:
    of course, you can build tsquery yourself, but once your parser can
    recognize your very own token 'xxx', it'd be much better to have a
    mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.
    I probably just need to have that "Aha!" moment, slap my forehead, and
    move on; but I'm not quite understanding something. The answer to
    this question could be it: Can I use a different set of dictionaries
    for creating the tsquery than I did for the tsvector?
    Sure! For example, you may want to index all words, so your dictionaries
    don't have stop word lists, but forbid people from searching common words.
    Or, if you want to search for 'to be or not to be' you have to use
    dictionaries without stop words.

    If so, I can have the dictionaries which generate the tsvector include
    the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
    dictionaries for the tsquery can only generate the token based on
    exactly what the user typed. That would give me exactly what I want,
    but somehow I have gotten the impression that the tsvector and tsquery
    need to be generated using the same dictionary set.

    I hope that's a mistaken impression?
    Yes.
    -Kevin
    Regards,
    Oleg
    _____________________________________________________________
    Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83
  • Kevin Grittner at Apr 8, 2009 at 4:05 pm

    Oleg Bartunov wrote:
    I probably just need to have that "Aha!" moment, slap my forehead, and
    move on; but I'm not quite understanding something. The answer to
    this question could be it: Can I use a different set of dictionaries
    for creating the tsquery than I did for the tsvector?
    Sure! For example, you may want to index all words, so your dictionaries
    don't have stop word lists, but forbid people from searching common words.
    Or, if you want to search for 'to be or not to be' you have to use
    dictionaries without stop words.
    I found a creative solution which I think meets my needs. I'm posting
    both to help out anyone with similar issues who finds the thread, and
    in case someone sees an obvious defect. By creating one function to
    generate the "legal" tsvector (which recognizes statute cites) and
    another function to generate the search values, with casts from text
    to the ts objects, I can get more targeted results than the parser and
    dictionary changes alone could give me.

    I'm still working on the dictionaries and the query function, but the
    vector function currently looks like the attached.

    Thanks to Oleg and Tom for assistance; while neither suggested quite
    this solution, their comments moved me along to where I found it.
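
    For the archives, the shape of the approach is roughly this (a sketch
    only -- not the attached function; 'legal_doc' is a made-up name for a
    configuration whose parser and dictionaries expand statute cites into
    their delimited leading forms):

    create function legal_tsvector(text) returns tsvector as $$
      -- delegate to the cite-aware configuration; pair it with a
      -- query-side function that emits only the user's exact input
      select to_tsvector('legal_doc', $1);
    $$ language sql immutable;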

    -Kevin
  • Kevin Grittner at Apr 6, 2009 at 10:05 pm

    Tom Lane wrote:
    Perhaps you could pass the texts and the queries through a regexp
    substitution that converts digit-dot-digit to digit-dash-digit?
    This doesn't seem to get me anywhere. For cite '9.125.07(4A)(3)'
    I got this:

    select ts_debug('9-125-07-4A-3');
    ts_debug
    ----------------------------------------------------------------
    (uint,"Unsigned integer",9,{simple},simple,{9})
    (int,"Signed integer",-125,{simple},simple,{-125})
    (int,"Signed integer",-07,{simple},simple,{-07})
    (int,"Signed integer",-4,{simple},simple,{-4})
    (asciiword,"Word, all ASCII",A,{english_stem},english_stem,{})
    (int,"Signed integer",-3,{simple},simple,{-3})
    (6 rows)

    Would there be a reasonable generalized way to pick something like
    this out of a body of text using dictionaries and treat it as a
    statute cite?

    -Kevin
  • Kevin Grittner at Apr 6, 2009 at 11:04 pm

    Tom Lane wrote:
    regexp substitution
    I found a way to at least keep the cite in one piece. Perhaps I can
    do the rest in custom dictionaries, which are more pluggable.

    select ts_debug
    ('State Statute <cite value="SS9.125.07(4A)(3)"> pertaining to');
    ts_debug
    --------------------------------------------------------------------------------
    (asciiword,"Word, all ASCII",State,{english_stem},english_stem,{state})
    (blank,"Space symbols"," ",{},,)
    (asciiword,"Word, all ASCII",Statute,{english_stem},english_stem,{statut})
    (blank,"Space symbols"," ",{},,)
    (tag,"XML tag","<cite value=""SS9.125.07(4A)(3)"">",{},,)
    (blank,"Space symbols"," ",{},,)
    (asciiword,"Word, all ASCII",pertaining,{english_stem},english_stem,{pertain})
    (blank,"Space symbols"," ",{},,)
    (asciiword,"Word, all ASCII",to,{english_stem},english_stem,{})
    (9 rows)

    -Kevin

Discussion Overview
group: pgsql-general
categories: postgresql
posted: Mar 10, '09 at 9:47p
active: Apr 8, '09 at 4:05p
posts: 16
users: 3