Hi,

I'm looking at fixing issue 2762:

regexp: add Split
strings.Split() is limited to a fixed string separator, but you may want to
split wherever a regular expression matches. (To be added to package regexp.)


So far I have a working version of the following:

// Split slices s into all substrings separated by the expression and
// returns a slice of the substrings between the expression matches. The
// integer argument n indicates the maximum number of splits to perform
// when n >= 0.
func (re *Regexp) Split(s string, n int) []string

Does anyone have any objections or better ideas for this API addition? I am
modeling the code after the strings.Split() function, so the following
results are expected for input string "foo:and:bar":

Regexp   Result
":"      ["foo", "and", "bar"]
"a"      ["foo:", "nd:b", "r"]
"foo"    ["", ":and:bar"]
"bar"    ["foo:and:", ""]
"baz"    ["foo:and:bar"]


Thanks,
Rick Arnold


--

  • Rob Pike at Nov 13, 2012 at 4:33 am
    That's not enough of a spec. With regexps you need to think carefully
    about empty matches. Consider "baabaab" against `a*` for instance.

    I know what it should be, but people are used to PCRE and it probably
    does exactly the opposite 50% of the time, so I'm just going to sit
    here, watch, and wait for the surprise.

    -rob

    --
  • Rick Arnold at Nov 13, 2012 at 1:35 pm
    Here is the option I was leaning towards when implementing this:
    consecutive matches of the regular expression produce empty strings
    in the result.

    So splitting "baabaab" on "a" would result in ["b", "", "b", "", "b"].

    Pros:
    1. Consistency with strings.Split() - which returns empty strings for
    consecutive matches.
    2. Simpler implementation - easier to explain/reason about.
    3. Easy for the caller to determine the total number of matches encountered
    using the length of the returned slice.
    4. Looping through the returned slice and appending to a buffer won't be
    affected by the empty strings.

    Cons:
    1. Slight memory/performance cost for the "extra" empty strings in the
    returned slices.
    2. Not consistent with PCRE.
    3. Calling code will have to handle/ignore empty strings if desired.

    To me, this option feels more Go-like than collapsing consecutive matches
    since that would end up hiding information and be more complex to
    implement/explain/understand.

    Thanks,
    Rick Arnold

    --
  • Rob Pike at Nov 13, 2012 at 3:21 pm
    You missed a key character in my mail: an asterisk. What happens when
    the regular expression matches the empty string, sometimes? "baabaab"
    against "a*". Note the *.

    That said, your example without the *, "baabaab" against "a", surely
    should return ["b", "", "b", "", "b"]; s/a/,/ to see why.

    The world walked away from what I consider the ur-definition, which is
    defined by what ed does for s/pattern/X/g: leftmost-longest for matches,
    and empty matches count unless they abut a non-empty match to the
    left. That's what I want, except that our regexp library doesn't do
    leftmost-longest (to my profound regret). Maybe that doesn't matter.
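One way to try the "s/a/,/" argument is with Go's own regexp package, replacing every match with a comma. This uses Go's matcher rather than ed, so it only illustrates the idea. A minimal sketch, where substitute is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// substitute is a hypothetical helper that replaces every match of
// pattern in s with a comma, similar in spirit to s/pattern/,/g.
func substitute(pattern, s string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(s, ",")
}

func main() {
	fmt.Println(substitute(`a`, "baabaab"))  // b,,b,,b
	fmt.Println(substitute(`a*`, "baabaab")) // ,b,b,b,
}
```

With `a`, the two adjacent commas mark the empty field between consecutive matches. With `a*`, the empty matches at the start and end of the string are replaced too, while the empty matches that directly follow "aa" are not.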

    -rob

    --
  • Rick Arnold at Nov 13, 2012 at 4:07 pm
    Wow, sorry. I don't know how I missed the '*' in your original email.

    I was thinking we could define the results in terms of
    regexp.FindAllStringIndex(). The slice resulting from Split() would be
    built from all the substrings not contained in the slice returned by
    FindAllStringIndex().

    So for your example,

    `a*`.FindAllStringIndex("baabaab") results in
    [ [0, 0], [1, 3], [4, 6], [7, 7] ]

    so Split() would be built from [ [0, 1], [3, 4], [6, 7] ], resulting in
    ["b", "b", "b"]
    (we could also consider [0, 0] and [7, 7] here if determining the
    number of matches from the returned slice is desired)

    and `a`.FindAllStringIndex("baabaab") results in
    [ [1, 2], [2, 3], [4, 5], [5, 6] ]

    so Split() would be built from
    [ [0, 1], [2, 2], [3, 4], [5, 5], [6, 7] ], resulting in
    ["b", "", "b", "", "b"]


    This would keep regexp internally consistent if the logic for
    FindAllStringIndex() itself is changed.

    Rick

    --
  • Rob Pike at Nov 13, 2012 at 6:56 pm
    That sounds like a fair plan. Consistency with FindAll seems sound.

    -rob

    --

Discussion Overview
group: golang-nuts
category: go
posted: Nov 13, '12 at 4:27a
active: Nov 13, '12 at 6:56p
posts: 6
users: 2 (Rob Pike: 3 posts, Rick Arnold: 3 posts)
website: golang.org
