FAQ
Hello,

is there a way through regexp or regexp/syntax to know the last character
of a string that matched a regex?

I have a set of regular expressions like
"www.somesite.com/(?P<itemid>[0-9]+[A-Z][A-Z])/...". The regular
expressions are matched against urls entered by users through a web
interface.

When a regular expression matches, the url is valid and the code can work
on it. When it doesn't, the url is invalid.
I'd like to improve the interface to tell the user when the url is
incomplete, or where the error is in the url.

Thinking about it, I believe it would be relatively straightforward to
implement if, given a regexp, there was a way to know the last character
that matched in case of mismatch.

Example:
If the last character that matched is the last character of the url, then
the url is incomplete. I can just add some visual clues for the user to
complete it.
If the last character that matched is somewhere in the middle, I can tell
the user "there's an error somewhere around here".
If multiple regexps match, I can tell the user "which url did you intend to
enter? those are the formats that are supported".

I can roll out my own parser most likely, but just using regexps seemed
nice. Any other suggestion on how to solve this problem?

Thanks for the suggestions,
Carlo


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

  • Nsch at Nov 3, 2014 at 2:42 pm
    This is trivial to implement only in regex matchers that go through the
    input exactly once, from left to right.

    This one will work for you (that is: you can trivially hack in your
    change), but it's in Java: https://github.com/nes1983/tree-regex.

    In a backtracking matcher, such as java.util.Regex, your feature is no
    longer easy to implement.

    More importantly: Regex gets you only so far in life. At some point, you
    have to start parsing yourself. Maybe that point has now come? :)
    See http://commandcenter.blogspot.ch/2011/08/regular-expressions-in-lexing-and.html

    Cheers,

    Niko


    On Monday, November 3, 2014 7:26:55 AM UTC+1, Carlo Contavalli wrote:

    Hello,

    is there a way through regexp or regexp/syntax to know the last character
    of a string that matched a regex?

    I have a set of regular expressions like "www.somesite.com/(?P<itemid>[0-9]+[A-Z][A-Z])/...".
    The regular expressions are matched against urls entered by users through a
    web interface.

    When a regular expression matches, the url is valid and the code can work
    on it. When it doesn't, the url is invalid.
    I'd like to improve the interface to tell the user when the url is
    incomplete, or where the error is in the url.

    Thinking about it, I believe it would be relatively straightforward to
    implement if, given a regexp, there was a way to know the last character
    that matched in case of mismatch.

    Example:
    If the last character that matched is the last character of the url, then
    the url is incomplete. I can just add some visual clues for the user to
    complete it.
    If the last character that matched is somewhere in the middle, I can tell
    the user "there's an error somewhere around here".
    If multiple regexps match, I can tell the user "which url did you intend
    to enter? those are the formats that are supported".

    I can roll out my own parser most likely, but just using regexps seemed
    nice. Any other suggestion on how to solve this problem?

    Thanks for the suggestions,
    Carlo

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Carlo Contavalli at Nov 3, 2014 at 11:00 pm

    On Mon, Nov 3, 2014 at 12:30 AM, wrote:
    This is trivial to implement only in regex matchers that go through the
    input exactly once, from left to right.
    Does the default regexp library in go go through the input exactly once?

    Even with backtracking, or if there are multiple possibilities, the
    parser should know up to what byte the state machine could move
    forward?
    More importantly: Regex gets you only so far in life. At some point, you
    have to start parsing yourself. Maybe that point has now come? :) See
    http://commandcenter.blogspot.ch/2011/08/regular-expressions-in-lexing-and.html
    Agreed in principle, although I'm not sure it is practical in this
    case. Those regexes depend on random web site URLs, and change fairly
    often with time. Having a set of ad-hoc parsers would be hard to
    maintain over time imho.

    Carlo

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Dan Kortschak at Nov 3, 2014 at 11:05 pm
    Have you considered using ragel? You can get pretty good error reporting
    using that.
    On Mon, 2014-11-03 at 15:00 -0800, Carlo Contavalli wrote:
    On Mon, Nov 3, 2014 at 12:30 AM, wrote:
    This is trivial to implement only in regex matchers that go through the
    input exactly once, from left to right.
    Does the default regexp library in go go through the input exactly
    once?

    Even with backtracking, or if there are multiple possibilities, the
    parser should know up to what byte the state machine could move
    forward?
    More importantly: Regex gets you only so far in life. At some point, you
    have to start parsing yourself. Maybe that point has now come? :) See
    http://commandcenter.blogspot.ch/2011/08/regular-expressions-in-lexing-and.html

    Agreed in principle, although I'm not sure it is practical in this
    case. Those regexes depend on random web site URLs, and change fairly
    often with time. Having a set of ad-hoc parsers would be hard to
    maintain over time imho.

    Carlo

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Rob Pike at Nov 3, 2014 at 11:38 pm
    I'm not sure the question is well-founded from a regular grammar point of
    view. There is no "last character that matched".

    Since you're scanning URLs, don't bother with a regexp. Use a proper URL
    parser, which is trivial to write, and give the actual error point when
    something goes wrong.

    -rob

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Nsch at Nov 4, 2014 at 7:29 am
    I believe it could be well-defined in terms of "longest path through an
    NFA".

    Carlo: Another reason nobody's going to put your functionality into a regex
    library is that regex libraries are tuned to death, to the level of
    counting instructions per input byte. They're also quite competitive:
    people care if you're a bit faster than another implementation. Since
    there's very few instructions per input byte going around, a single extra
    instruction can be deadly.

    Think of it this way: people expect performance from regex similar to
    java.lang.String#indexOf. Now look how that's implemented[1]. It spends
    most of its time in this loop:

    while (++i <= max && source[i] != first);


    There's very little room for improvement in there. A regex, in contrast, is
    an unwieldily apparatus. Here's how aforementioned tree-regex does it:
    1. Convert regexp to AST.
    2. Convert AST to NFA.
    3. Match through the NFA, while building up DFA states.
    4. If we haven't had to build a DFA state for some time, switch to DFA
    execution.
    5. If DFA execution needs an additional state, switch back to NFA execution
    and build it. Go back to step 3.

    Remember the loop up there we're trying to compete with? The crazy thing
    is: it works. tree-regex is within punching range of indexOf.

    Your feature may be quite compatible with how tree-regex works -- but by
    implementing it I'd tie my hand to a particular implementation.

    In a back-tracking implementation, however, your suggestion would kill
    performance.

    Cheers,

    Niko

    [1] This is just the fall-back implementation that you see in String.java.
    The JVM has an intrinsic implementation of the function, written in
    assembly, which is even faster.
    On Tuesday, November 4, 2014 12:38:47 AM UTC+1, Rob 'Commander' Pike wrote:

    I'm not sure the question is well-founded from a regular grammar point of
    view. There is no "last character that matched".

    Since you're scanning URLs, don't bother with a regexp. Use a proper URL
    parser, which is trivial to write, and give the actual error point when
    something goes wrong.

    -rob
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupgolang-nuts @
categoriesgo
postedNov 3, '14 at 6:26a
activeNov 4, '14 at 7:29a
posts6
users4
websitegolang.org

People

Translate

site design / logo © 2021 Grokbase