I'm trying to parse a file through a regexp, iterating over it using
regexp.FindReaderIndex, but I get unexpected behaviour.

file.txt:

"abc
abc
abc
"

Go code:

source := bufio.NewReader(sourceFile)
re := regexp.MustCompile(`abc\n`)
fmt.Println(re.FindReaderIndex(source))
fmt.Println(re.FindReaderIndex(source))
fmt.Println(re.FindReaderIndex(source))

Now, I would expect that to output [0 4][0 4][0 4], but instead I get
[0 4][1 5] and that's it. It seems that FindReaderIndex doesn't consume the
last \n character from the source stream. Why is that?


  • Jesse McNelis at Sep 12, 2012 at 11:58 pm

    On Thu, Sep 13, 2012 at 9:30 AM, Toni Cárdenas wrote:
    re := regexp.MustCompile(`abc\n`)

    Inside a backquoted (raw) string, \n is two characters: a '\' and an 'n'.
    You probably want:
    re := regexp.MustCompile("abc\n")



    --
    =====================
    http://jessta.id.au

    --
  • Peter S at Sep 13, 2012 at 12:54 am
    I can't solve your problem, but it seems to me that the issue is that it
    consumes more runes than expected, rather than fewer (otherwise it should
    give three matches). (The backquotes don't seem to be the problem: `\n` is
    legal RE2 syntax, and changing it to double quotes doesn't help either.)

    I put it on the Playground for easier testing:
    http://play.golang.org/p/TFcpAVfy-1

    Peter
    --
  • Toni Cárdenas at Sep 13, 2012 at 12:05 pm
    Here is a more illustrative test: http://play.golang.org/p/toGyzf5toG

    Examining the output, FindReaderIndex seems to take a certain number of
    runes from the buffer and consume them until the pattern is found, but
    then it doesn't put the remaining runes it took back into the buffer. I
    don't know whether this behaviour is expected, but if it is, what would
    be a more appropriate way of parsing a file?

    I can remake the buffer on each iteration, like this:
    http://play.golang.org/p/fYkIYBlPdx , but it certainly doesn't look nice.
    --
  • Roger peppe at Sep 13, 2012 at 1:27 pm

    This is an interesting problem that I made a little toy solution for
    some time ago:

    http://play.golang.org/p/1pFFrhYKXW

    I think it's about as good as you can get - there is the inherent
    problem that you have to buffer all input until a match has
    occurred, because the match might actually happen
    right at the beginning.

    So searching for a string that's never found in a large stream
    is inefficient. If you were doing it for real, you might use
    a temporary file.

    --
  • Toni Cárdenas at Sep 13, 2012 at 10:58 pm

    That's nice, but a little overkill for me because my pattern starts with ^.
    I've ended up seeking the file back to the previous offset after each call
    to FindReaderIndex, consuming only the matched string from there, and
    making a new buffer each time, much as in my previous post.

    $ godoc regexp
    ...
    There is also a subset of the methods that can be applied to text read
    from a RuneReader:
    MatchReader, FindReaderIndex, FindReaderSubmatchIndex
    This set may grow. Note that regular expression matches may need to
    examine text beyond the text returned by a match, so the methods that
    match text from a RuneReader may read arbitrarily far into the input
    before returning.

    Yeah, that would've helped if I had read it in time. Thanks!

    --
  • Roger peppe at Sep 14, 2012 at 8:49 am

    If you've got a seekable file, that's definitely the way forward.

    --
  • Roger peppe at Sep 14, 2012 at 8:51 am
    BTW, if your regexp can't span line boundaries, you could just
    read line by line...
    --
  • Kevin Gillette at Sep 16, 2012 at 4:20 am
    That method takes an io.RuneReader and therefore has no way to "push back"
    unread runes. By the time a match is reported, the reader has already
    moved past the returned index, so unless you have something seekable or
    fully buffered, the index can't be usefully reused with that same reader.

    Also keep in mind that Go doesn't treat regexps as the "magic bullet" that
    many other languages do. For example, in interpreted languages like Perl,
    Python, or Ruby, it's generally going to be much faster to use regexps
    for anything more complex than exact substring searching, whereas in Go,
    it's usually faster (and sometimes even simpler) to use competently written
    custom algorithms, even for tasks that regexps are "good at." Therefore,
    aside from cases where a user supplies search patterns at runtime,
    regexps in Go are just a "convenience," not a "necessity."
    --
  • Russ Cox at Sep 13, 2012 at 7:42 pm

    On Wed, Sep 12, 2012 at 8:54 PM, Peter S wrote:
    I can't solve your problem, but it seems to me that the issue is that it
    consumes more runes, rather than less (otherwise it should give three
    matches).
    $ godoc regexp
    ...
    There is also a subset of the methods that can be applied to text read
    from a RuneReader:

    MatchReader, FindReaderIndex, FindReaderSubmatchIndex

    This set may grow. Note that regular expression matches may need to
    examine text beyond the text returned by a match, so the methods that
    match text from a RuneReader may read arbitrarily far into the input
    before returning.

    --

Discussion Overview
group: golang-nuts
categories: go
posted: Sep 12, '12 at 11:38p
active: Sep 16, '12 at 4:20a
posts: 10
users: 6
website: golang.org
