The existing bufio support for line-at-a-time I/O is cumbersome. Here
is a proposal for an all-new design that should be easier to use,
designed with help from Brad Fitzpatrick. This API would be an add-on,
not a replacement for the existing ReadSlice etc.

It's part of the bufio package, not being worth another package. In
any case it can use some existing bufio internals to provide an
efficient implementation.

We add a new type, called Scanner, to capture the new functionality.
Its constructor takes an io.ReadCloser, for reasons that will become
clear. (The caller can promote a Reader to a ReadCloser using
ioutil.NopCloser.) If the argument is not already a bufio.Reader, one
is created to wrap the argument.

This gives us:

package bufio

type Scanner struct { /* hidden */ }

func NewScanner(r io.ReadCloser) *Scanner

The model for the scanner is to "tokenize" the input into text to be
processed, separated by delimiters that are discarded. In the default
case, this means lines of text separated by `\r?\n`. It is not
possible in this design to discover whether, for instance, the last
line of the input ends with a newline. This is OK; the point of this
API is to make I/O easier and discovering such details about the input
complicates existing designs.

To scan the input, use the Next method as the loop condition, the
Bytes or Line methods as the "getters", and Close at the end. Here are
the method signatures:

func (s *Scanner) Next() bool

func (s *Scanner) Close() error

func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

func (s *Scanner) Text() string

The last name is Text not String so we don't accidentally create a
fmt.Stringer out of a Scanner.

I/O works by calling Next to load the next "token". It returns false
at EOF or error. The Close method returns:
nil if there was no error; or
nil if the only error was EOF; or
whatever non-EOF error triggered the scan to stop, including line-too-long; or
the return value of the reader's Close method.

Note that a line-too-long terminates the scan. If you need to deal
with crazy-long lines, you'll need to use ReadSlice or just Read etc.

Here is code to print a file line-by-line, with line numbers:

s := bufio.NewScanner(os.Stdin)
for i := 1; s.Next(); i++ {
    line := s.Bytes()
    fmt.Printf("%3d\t%s\n", i, line)
}
if err := s.Close(); err != nil {
    log.Fatal(err)
}

This is the basic outline; it seems clean and easy to use. The
deferral of error until Close (thanks, Brad) is the key insight to
having the code be simple.

We could generalize a little on top of this. One easy step is to allow
options, such as to control the maximum token length. These are done
with a chaining API so they don't need to be in the constructor. There
should be very few of them. If it can be done efficiently, we could
provide:

func (s *Scanner) MaxLength(length int) *Scanner // default 4k

To allow the user to specify the token-splitting algorithm, we add a
function option. It's not easy to use, but it won't be used much.
Still, a word-breaking splitter, for instance, would be nice, as would
a byte-at-a-time and rune-at-a-time scanner, and we could provide
those in bufio itself. The function is called for each byte and has
this signature:

type SplitFunc func(char byte, atEof bool) []byte

The atEof argument is true only at EOF, giving the function a chance
to terminate the last token. A nil return means 'nothing' and can be
discriminated from an empty return. (Another design would be to
provide a separate boolean to indicate whether there is a value
returned; either should work well.) We set up a custom splitter with
an option method:

func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line breaks; cr ignored

For example, if we provided a rune splitter in the package, you'd scan
runes like this:

s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
for s.Next() {
    fmt.Printf("rune: %s\n", s.Bytes())
}
if err := s.Close(); err != nil {
    log.Fatal(err)
}

Comments welcome. I have no implementation.

-rob

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Jimmy frasche at Feb 14, 2013 at 3:01 am
    This would simplify most of the line handling code I've written so
    far. The API looks solid and easy to use.

    My only concern is the limitation of the splitter to a func. I think
    having something like net/http's Handler/HandlerFunc would be better.
    Sometimes you have to store state like "last char was an escape
    character". Perhaps that's exotic enough to leave to the more
    cumbersome techniques, however.

  • David Symonds at Feb 14, 2013 at 4:11 am

    On Thu, Feb 14, 2013 at 2:00 PM, jimmy frasche wrote:

    > My only concern is the limitation of the splitter to a func. I think
    > having something like net/http's Handler/HandlerFunc would be better.
    > Sometimes you have to store state like "last char was an escape
    > character". Perhaps that's exotic enough to leave to the more
    > cumbersome techniques, however.

    You could always store state by using a closure. The dominant case is
    a stateless function, though, so passing a func seems like a better
    fit.

  • Gustavo Niemeyer at Feb 14, 2013 at 10:38 am

    On Thu, Feb 14, 2013 at 2:04 AM, David Symonds wrote:
    > You could always store state by using a closure. The dominant case is
    > a stateless function, though, so passing a func seems like a better
    > fit.

    Is it? I don't think the current behavior of ReadLine can be made stateless.


    gustavo @ http://niemeyer.net

  • Andrew Gerrand at Feb 14, 2013 at 6:36 am

    On 14 February 2013 13:36, Rob Pike wrote:

    > The last name is Text not String so we don't accidentally create a
    > fmt.Stringer out of a Scanner.

    Why not let it be a Stringer?

    s := bufio.NewScanner(r)
    for s.Next() {
        fmt.Println(s)
    }
    // etc

    I would be more concerned if the String/Text method had some kind of side
    effect, apart from an allocation. Since you must always advance with Next,
    I don't see the problem with Scanner being a Stringer.

    What happens if you call the Bytes or String/Text methods before calling
    Next (or after calling Close)? Panic? If that's the case, I could see the
    fmt.Stringer thing being an issue.

    Andrew

  • Dan Kortschak at Feb 14, 2013 at 7:57 am
    I was thinking the same thing. It gives access to one of the nicer perl/python idioms.

    Also, as someone who regularly deals with crazy long lines (there is nothing in the FASTA specification that insists on any line breaks except after seqs and ids) I'd like a possibility of continuing on from an overlong line in some way.

    Very nice though.

    Dan

  • Donovan Hide at Feb 14, 2013 at 7:59 am
    Would there be any future possibility of hooking this into range? If
    some thought is going into a future implementation of iterators that
    work with range, then this is a prime candidate!

    s := bufio.NewScanner(r)
    for i, line := range s {
        fmt.Println(i, line)
    }

  • Yy at Feb 14, 2013 at 8:45 am
    (sorry Dan, I sent this reply just to you, fwding to the list)
    On 14 February 2013 08:50, Dan Kortschak wrote:

    > I'd like a possibility of continuing on from an overlong line in some way.

    If there was a (s *Scanner) LineTooLong() bool method (or, more generally, a
    s.Error() method), wouldn't something like this work?

    func longerNext(s *bufio.Scanner) bool {
        if s.LineTooLong() {
            return s.MaxLength(2*len(s.Bytes())).Next() || longerNext(s)
        }
        return false
    }

    and, in the "scanning loop":

    for s.Next() || longerNext(s) {

    This may be dangerous, I don't think it should be part of the library, but
    you could easily implement it as long as there's some way to know the error
    without closing the reader. In any case, I think it would be interesting
    having such functionality, because you may also want to discard a long line
    but keep reading.


    --
    - yiyus || JGL .

  • Dan Kortschak at Feb 14, 2013 at 8:53 am
    I was thinking more of a way to continue rather than grow, much as you would if a line were too long when reading with bufio.ReadLine. You would need to have a way of checking errors without the call to Close(), which may not be wanted. If the underlying Reader does read past where the buffer would overflow, I guess you could create a new Scanner on that Reader, but it seems wasteful.

  • Anthony Martin at Feb 14, 2013 at 6:48 am
    What's the rationale for the names "Next" and "Close"?

    The only precedent I can see is database/sql.(*Rows).Next.
    All other Next methods in the standard library actually
    return the "next" of something.

    Anthony

  • Aston Motes at Feb 14, 2013 at 10:24 am
    It seems like "Scan" works just as well for the name of the "function to do
    the next thing" loop condition and avoids the confusion of "Next" not
    returning the next line.

  • Kamil Kisiel at Feb 14, 2013 at 6:27 pm
    I disagree there. Compared with the API of existing packages like
    database/sql, that's backwards.
  • Kamil Kisiel at Feb 14, 2013 at 7:47 am

    On Wednesday, February 13, 2013 6:36:37 PM UTC-8, Rob Pike wrote:
    > To scan the input, use the Next method as the loop condition, the
    > Bytes or Line methods as the "getters", and Close at the end.

    By "Line" did you mean "Text"? or is there some other method?

    The API looks good to me, probably about as close as possible to Python's
    "for line in f:" as is possible with Go constructs.

    Is the choice of line ending fixed? I remember there were some people on
    the mailing list asking for convenient handling of \r some time back, for
    dealing with old mac files.

    In the Python world there's something called "Universal Newlines" mode that
    treats either \n, \r\n or \r as a newline if you open a file with a special
    flag. The splitlines() function does the same thing.

  • Kevin Gillette at Feb 14, 2013 at 8:36 am
    The semantics of SplitFunc are unclear to me... Based on that func type
    returning []byte while accepting a byte at a time suggests (to me) that it
    must have internal state, such as with a closure, accumulating bytes until
    it can return a token (returning nil until the token is constructed). If
    that interpretation is correct, then contrary to what others have said,
    almost every use case would need state, and thus soapboxcicero's suggestion
    of employing an interface would be appropriate. On the other hand, if
    the return value is only ever expected to be nil, or length <= 1 (which is
    usually all a stateless func can do with a single byte as input) in order
    to achieve an in-band signalling effect, I believe a multi-valued return
    signature would be clearer.

  • Michal at Feb 14, 2013 at 8:41 am
    Will it be possible to read all lines this way?

    for hasNext := s.Next(); hasNext; hasNext = s.Next() {
        ...
    }

  • Gustavo Niemeyer at Feb 14, 2013 at 9:49 am

    On Thu, Feb 14, 2013 at 12:36 AM, Rob Pike wrote:
    > This is the basic outline; it seems clean and easy to use. The
    > deferral of error until Close (thanks, Brad) is the key insight to
    > having the code be simple.

    That's the design of mgo iterators, except there's an explicit method
    iter.Err() instead. It works quite well. I was just tempted to suggest
    not bundling this with Close, but I guess the common case is files
    opened just for parsing, so the NopCloser should be fine for the rest.

    > type SplitFunc func(char byte, atEof bool) []byte

    I suggest this instead:

    type Splitter interface {
        Split(buf []byte) [][]byte
    }

    The last call to Split is done with nil. This avoids both atEof and
    the bool return, and is also more friendly to splitters that need to
    take state.

    > to terminate the last token. A nil return means 'nothing' and can be
    > discriminated from an empty return. (Another design would be to

    Given we encourage people to handle nil slices as if they were empty,
    might be best to avoid purposefully differentiating a nil return from
    an empty one.
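A short sketch of the distinction under discussion. The describe helper is hypothetical; it shows what a Scanner would have to do to tell 'nothing' from an empty token under the proposed contract.

```go
package main

import "fmt"

// describe interprets a splitter's return value the way the proposal
// suggests: nil means no token yet, non-nil (even empty) means a token.
func describe(tok []byte) string {
	if tok == nil {
		return "nothing"
	}
	if len(tok) == 0 {
		return "empty token"
	}
	return "token " + string(tok)
}

func main() {
	var none []byte   // nil slice
	empty := []byte{} // empty but non-nil slice

	fmt.Println(describe(none))  // prints nothing
	fmt.Println(describe(empty)) // prints empty token

	// The usual idiom checks only the length, which treats both alike.
	fmt.Println(len(none) == 0, len(empty) == 0) // prints true true
}
```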


    gustavo @ http://niemeyer.net

  • Roger peppe at Feb 14, 2013 at 12:42 pm

    On 14 February 2013 09:49, Gustavo Niemeyer wrote:
    >> type SplitFunc func(char byte, atEof bool) []byte
    >
    > I suggest this instead:
    >
    > type Splitter interface {
    >     Split(buf []byte) [][]byte
    > }
    >
    > The last call to Split is done with nil. This avoids both atEof and
    > the bool return, and is also more friendly to splitters that need to
    > take state.

    This interface would make it difficult to enforce buffer size limits,
    I think, and even if Split was changed to return an error,
    I think it's nicer to have buffer size limits enforced at the Scanner
    level rather than requiring a custom splitter.

    >> to terminate the last token. A nil return means 'nothing' and can be
    >> discriminated from an empty return. (Another design would be to
    >
    > Given we encourage people to handle nil slices as if they were empty,
    > might be best to avoid purposefully differentiating a nil return from
    > an empty one.

    I don't mind, personally. The distinction's got to be good for something.

  • Nigel Tao at Feb 14, 2013 at 11:16 pm

    On Thu, Feb 14, 2013 at 8:49 PM, Gustavo Niemeyer wrote:
    > On Thu, Feb 14, 2013 at 12:36 AM, Rob Pike wrote:
    >> This is the basic outline; it seems clean and easy to use. The
    >> deferral of error until Close (thanks, Brad) is the key insight to
    >> having the code be simple.
    >
    > That's the design of mgo iterators, except there's an explicit method
    > iter.Err() instead.

    FWIW, leveldb-go's db.Iterator works like Rob's proposal (Close
    returns the accumulated error).
    https://code.google.com/p/leveldb-go/source/browse/leveldb/db/db.go#53

  • Roger peppe at Feb 14, 2013 at 11:57 am

    On 14 February 2013 02:36, Rob Pike wrote:
    > The existing bufio support for line-at-a-time I/O is cumbersome. Here
    > is a proposal for an all-new design that should be easier to use,
    > designed with help from Brad Fitzpatrick. This API would be an add-on,
    > not a replacement for the existing ReadSlice etc.

    I like this proposal in general. I think it's neat and should be easy to use,
    and making it general is great too.
    I have one or two comments though.

    I'm not sure about the Close thing. I very often read lines from
    a Reader, and using io.NopCloser adds to the weight of this
    common case. Also, it means that the common idiom of
    putting a defer f.Close() after opening f is not so applicable
    (you usually want to see line-too-long errors, but reader-close
    errors are usually boring).

    Having the scanner close the underlying file on error
    means we can't carry on after an error, even if we
    want to.

    How about using Err rather than Close?

    s := bufio.NewScanner(os.Stdin)
    for i := 1; s.Next(); i++ {
        line := s.Bytes()
        fmt.Printf("%3d\t%s\n", i, line)
    }
    if err := s.Err(); err != nil {
        log.Fatal(err)
    }

    > We could generalize a little on top of this. One easy step is to allow
    > options, such as to control the maximum token length. These are done
    > with a chaining API so they don't need to be in the constructor. There
    > should be very few of them. If it can be done efficiently, we could
    > provide:
    >
    > func (s *Scanner) MaxLength(length int) *Scanner // default 4k

    I'm not keen on the chaining API. It means that no-one
    else can implement types that satisfy the same interface
    as Scanner.

    Even though it might add a couple of lines to the source code,
    I'd prefer to see:

    func (s *Scanner) SetMaxLength(length int)

    (or MaxLength as above, but without the *Scanner return)
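A sketch of the interface concern, using hypothetical local types (this Scanner is a stand-in, not the proposed one). A setter with no return value can be captured by an interface that any type may implement; a chaining method bakes the concrete *Scanner type into its signature.

```go
package main

import "fmt"

// Scanner is a local stand-in with both styles of option method.
type Scanner struct{ max int }

// MaxLength is the chaining style: it returns the concrete *Scanner.
func (s *Scanner) MaxLength(n int) *Scanner { s.max = n; return s }

// SetMaxLength is the plain setter style.
func (s *Scanner) SetMaxLength(n int) { s.max = n }

// Limiter can be satisfied by any type with a plain setter.
type Limiter interface {
	SetMaxLength(n int)
}

// otherScanner is an unrelated implementation of the same interface.
type otherScanner struct{ max int }

func (o *otherScanner) SetMaxLength(n int) { o.max = n }

// Chainer shows the problem: to satisfy it, another type would have to
// return a *Scanner rather than itself.
type Chainer interface {
	MaxLength(n int) *Scanner
}

// Compile-time checks of which types satisfy which interfaces.
var (
	_ Limiter = (*Scanner)(nil)
	_ Limiter = (*otherScanner)(nil)
	_ Chainer = (*Scanner)(nil)
)

func configure(l Limiter, n int) { l.SetMaxLength(n) }

func main() {
	s, o := &Scanner{}, &otherScanner{}
	configure(s, 8192)
	configure(o, 8192)
	fmt.Println(s.max, o.max)          // prints 8192 8192
	fmt.Println(s.MaxLength(4096).max) // prints 4096
}
```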

    > To allow the user to specify the token-splitting algorithm, we add a
    > function option. It's not easy to use, but it won't be used much.
    > Still, a word-breaking splitter, for instance, would be nice, as would
    > a byte-at-a-time and rune-at-a-time scanner, and we could provide
    > those in bufio itself. The function is called for each byte and has
    > this signature:
    >
    > type SplitFunc func(char byte, atEof bool) []byte

    I'm not sure about the use of a function type here.
    Usually when we provide a function type, we expect
    it to be side-effect free (I'm thinking in particular of the *Func
    functions in strings and bytes here). Here, on the other hand,
    we *require* the function to have state, otherwise it
    cannot function correctly.

    The other thing that concerns me here is efficiency.
    Currently ReadSlice can use IndexByte to scan very quickly
    through a buffer - we're making that into many function
    calls here.

    Given that this interface is about *splitting*, not tokenization,
    how about an interface something like this?

    type Splitter interface {
        // Split scans the given byte slice for a token, where the first
        // seen bytes have been inspected by a previous call to Split.
        // It returns the number of bytes it has inspected and, if a
        // token was found, a non-nil slice containing the token.
        // The length of b may only be zero when atEOF is true.
        Split(b []byte, atEOF bool, seen int) (n int, token []byte)
    }

    I think this interface can make it relatively straightforward and efficient
    to implement common splitting idioms without needing to
    maintain state, while being flexible enough to implement more
    interesting splitters.

    One significant limitation is that it wouldn't allow for delimiters
    that are longer than the maximum token length, but I'm
    not sure that's a big issue (if it is, Split could be changed to
    return whether the input should be discarded).

    For instance, here's the usual \r\n line splitter (utterly untested, of course):

    func (lineSplitter) Split(b []byte, atEOF bool, seen int) (int, []byte) {
        n := len(b)
        if atEOF {
            if seen > 0 {
                // Whatever remains is the final, unterminated line.
                return n, b
            }
            return n, nil
        }
        i := bytes.IndexByte(b[seen:], '\n')
        if i < 0 {
            if b[n-1] == '\r' {
                // Save unresolved \r for the next call.
                n--
            }
            return n, nil
        }
        i += seen // IndexByte searched b[seen:]; convert to an index into b.
        t := b[0:i]
        if len(t) > 0 && t[len(t)-1] == '\r' {
            t = t[0 : len(t)-1]
        }
        return i + 1, t
    }

  • Kyle Lemons at Feb 14, 2013 at 5:03 pm
    +1. Not sure about precisely what the interface should be, but I think
    this looks close. This would make it useful for more than just line and
    word splitting, for instance: it could be used as a general tokenizer. Of
    course, that may be a non-goal.

    I also agree with allowing it to be a Stringer, and with not requiring a
    Closer (Err is fine with me).

    On Thu, Feb 14, 2013 at 3:57 AM, roger peppe wrote:
    On 14 February 2013 02:36, Rob Pike wrote:
    The existing bufio support for line-at-a-time I/O is cumbersome. Here
    is a proposal for an all-new design that should be easier to use,
    designed with help from Brad Fitzpatrick. This API would be an add-on,
    not a replacement for the existing ReadSlice etc.
    I like this proposal in general. I think it's neat and should be easy to
    use,
    and making it general is great too.
    I have one or two comments though.

    I'm not sure about the Close thing. I very often read lines from
    a Reader, and using io.NopCloser adds to the weight of this
    common case. Also, it means that the common idiom of
    putting a defer f.Close() after opening f is not so applicable
    (you usually want to see line-too-long errors, but reader-close
    errors are usually boring).

    Having the scanner close the underlying file on error
    means we can't carry on after an error, even if we
    want to.

    How about using Err rather than close?

    s := bufio.NewScanner(io.Stdin)
    for i := 1; s.Next() i++ {
    line := s.Bytes()
    fmt.Printf("%3d\t%s\n", i, line)
    }
    if err := s.Err(); err != nil {
    log.Fatal(err)
    }
    We could generalize a little on top of this. One easy step is to allow
    options, such as to control the maximum token length. These are done
    with a chaining API so they don't need to be in the constructor. There
    should be very few of them. If it can be done efficiently, we could
    provide:

    func (s *Scanner) MaxLength(length int) *Scanner // default 4k
    I'm not keen on the chaining API. It means that no-one
    else can implement types that satisfy the same interface
    as Scanner.

    Even though it might add a couple of lines to the source code,
    I'd prefer to see:

    func (s *Scanner) SetMaxLength(length int)

    (or MaxLength as above, but without the *Scanner return)
    To allow the user to specify the token-splitting algorithm, we add a
    function option. It's not easy to use, but it won't be used much.
    Still, a word-breaking splitter, for instance, would be nice, as would
    a byte-at-a-time and rune-at-a-time scanner, and we could provide
    those in bufio itself. The function is called for each byte and has
    this signature:

    type SplitFunc func(char byte, atEof bool) []byte
    I'm not sure about the use of a function type here.
    Usually when we provide a function type, we expect
    it to be side-effect free (I'm thinking in particular of the *Func
    functions in strings and bytes here). Here, on the other hand,
    we *require* the function to have state, otherwise it
    cannot function correctly.

    The other thing that concerns me here is efficiency.
    Currently ReadSlice can use IndexByte to scan very quickly
    through a buffer - we're making that into many function
    calls here.

    Given that this interface is about *splitting*, not tokenization,
    how about an interface something like this?

    type Splitter interface {
    // Split scans the given byte slice for a token, where the first
    // seen bytes have been inspected by a previous call to Split.
    // It returns the number of bytes it has inspected and, if a
    // token was found, a non-nil slice containing the token.
    // The length of b may only be zero when atEOF is true.
    Split(b []byte, atEOF bool, seen int) (n int, token []byte)
    }

    It think this interface can make it relatively straightforward and
    efficient
    to implement common splitting idioms without needing to
    maintain state, while being flexible enough to implement more
    interesting splitters.

    One significant limitation is that it wouldn't allow for delimiters
    that are longer than the maximum token length, but I'm
    not sure that's a big issue (if it is, Split could be changed to
    return whether the input should be discarded).

    For instance, here's the usual \r\n line splitter (utterly untested, of
    course):

    func (lineSplitter) Split(b []byte, atEOF bool, seen int) (int, []byte) {
        n := len(b)
        if atEOF {
            if seen > 0 {
                return n, b
            }
            return n, nil
        }
        i := bytes.IndexByte(b[seen:], '\n')
        if i < 0 {
            if b[n-1] == '\r' {
                // Save unresolved \r for the next call.
                n--
            }
            return n, nil
        }
        // IndexByte searched b[seen:], so convert i to an index into b.
        i += seen
        t := b[0:i]
        // Guard against an empty line before looking at the last byte.
        if len(t) > 0 && t[len(t)-1] == '\r' {
            t = t[0 : len(t)-1]
        }
        return i + 1, t
    }

    --
    You received this message because you are subscribed to the Google Groups
    "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

  • Roger peppe at Feb 14, 2013 at 6:47 pm

    On 14 February 2013 17:02, Kyle Lemons wrote:
    +1. Not sure about precisely what the interface should be, but I think this
    looks close. This would make it useful for more than just line and word
    splitting, for instance: it could be used as a general tokenizer. Of
    course, that may be a non-goal.
    yeah. my first thought was to make it return indexes into the passed
    slice, but returning a slice was neater (and, as you point out, allows
    more generality).
    type Splitter interface {
    // Split scans the given byte slice for a token, where the first
    // seen bytes have been inspected by a previous call to Split.
    // It returns the number of bytes it has inspected and, if a
    // token was found, a non-nil slice containing the token.
    // The length of b may only be zero when atEOF is true.
    Split(b []byte, atEOF bool, seen int) (n int, token []byte)
    }
  • Ingo Oeser at Feb 14, 2013 at 1:25 pm
    First: Great work! That really makes it simple!
    On Thursday, February 14, 2013 3:36:37 AM UTC+1, Rob Pike wrote:

    We add a new type, called Scanner, to capture the new functionality.
    Its constructor takes an io.ReadCloser, for reasons that will become
    clear. (The caller can promote a Reader to a ReadCloser using
    ioutil.NopCloser.) If the argument is not already a bufio.Reader, one
    is created to wrap the argument.
    Could this be done by the constructor itself, like we do for automatic
    buffering?
    e.g.

    rc, ok := r.(io.ReadCloser)
    if !ok {
    rc = ioutil.NopCloser(r)
    }

    Or is this considered dangerous, as the caller might not expect that the
    Scanner closes its stream?

    func (s *Scanner) Text() string
    The last name is Text not String so we don't accidentally create a
    fmt.Stringer out of a Scanner.
    Conforming to Stringer is actually a very useful feature. I wish every text
    stream analysis API would return their hold space this way. Makes usage
    very simple and idiomatic. Andrew has already presented a good example.


    Best Regards

    Ingo

  • Gerard at Feb 14, 2013 at 2:20 pm
    Nice initiative! FWIW here are my € 0.02

    The thing I missed in the pseudo code was the delimiter string. I suppose
    it will be defined in the constructor like this:

    func bufio.NewScanner(r io.ReadCloser, delim string) *Scanner

    or in the Next function, like :

    Next(delim string) bool // This makes the delimiter string modifiable
    // while reading, generating lots of possibilities (at what cost?)


    Gerard

  • Steve McCoy at Feb 14, 2013 at 2:52 pm
    I think this type is pretty much perfect as-is. I didn't like the Next
    method at first, but I realized that pushing the error-handling to Close is
    really nice.

    Here's another alternative for SplitFunc, which would allow for them to be
    very simple and stateless (by virtue of putting the Scanner in charge of
    the state):

    type SplitFunc func(chunk []byte, atEof bool) (ok, prefix bool)

    My thought is that, when scanning for splitting tokens, the Scanner will
    feed a chunk to the SplitFunc, and the SplitFunc will return whether the
    chunk matches a splitting token ("ok") and/or if it is a prefix of a
    splitting token. For example, say the Scanner has an internal buffer and
    calls SplitFunc(buffer[n:n+1], false). As long as the chunk is a prefix,
    the Scanner calls SplitFunc with buffer[n:n+m] ("growing" the chunk).
    Eventually, the SplitFunc will indicate that the chunk is a token or a
    non-token-non-prefix, so the Scanner moves on to the next chunk with
    SplitFunc(buffer[n+m:n+m+1], false) (if it isn't at EOF).

    There are a few downsides: The Scanner is more complex internally, I left
    out several edge cases in the above example, and this is potentially *very*
    inefficient for long or complex splitters. I think the first two are made
    up for by simplifying client code. As for efficiency, I'm inclined to think
    that maybe this is too extreme of a simplification because it'd be very
    easy to make something pathological, but on the other hand, anything this
    would be insufficient for would probably be just as much effort to
    implement via a stateful SplitFunc as with the existing tools in bufio.


    On Wednesday, February 13, 2013 9:36:37 PM UTC-5, Rob Pike wrote:


    func (s *Scanner) Next() bool

    func (s *Scanner) Close() error

    func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

    func (s *Scanner) Text() string
  • Kevin Gillette at Feb 14, 2013 at 6:18 pm
    @rog: who says you can't call Close twice? I believe os.File takes steps to ensure that it's idempotent (so that it doesn't close a reopened fd), while being safe with other ReadCloser use cases I've seen.

    Some contradictory opinions: Steve McCoy's proposed func seems close, but would be much faster if it returned a slice (otherwise the func would continually have to rescan the same input until it was given a whole token). The stdlib side could use address checking to determine where the token resides (and panic if the input and output slices have no overlap). I suggest:

    func (data []byte, prevState int) (token []byte, state int)

    Here, the func should scan data for at most one token. state is user defined, except for zero, which means the token return value, if not nil, represents a complete token. This allows the function to continue processing a token without nesting the previous data. If the returned state is negative, it indicates an invalid token (the contents of which should be represented by the returned token value); the abs of the negative state will be stuffed with the full token (over however many calls had consecutively returned a state > 0) into an error; the caller of Close can type assert to use the error state to lookup a useful error message.

    Though for simple cases, what's wrong with the func type that bytes.FieldsFunc takes, or something similar, like `func (byte) bool`? This is less flexible and less efficient, yet sufficient for many cases, and an adaptor that wraps this into the de facto func type would be useful.

  • Roger peppe at Feb 14, 2013 at 6:44 pm

    On 14 February 2013 18:18, Kevin Gillette wrote:
    @rog: who says you can't call Close twice? I believe os.File takes steps to ensure that it's idempotent (so that it doesn't close a reopened fd), while being safe with other ReadCloser use cases I've seen.
    yeah, you can, but i'm not sure it's defined to be safe always.

    I just don't think that conflating Close and Err is that helpful
    (and it might be actively unhelpful if you want to pass a
    bufio.Reader to the scanner - currently bufio.NewReader checks
    if its argument is already a bufio.Reader and doesn't add
    another layer if so, but forcing it to be wrapped in
    ioutil.NopCloser would prevent this).
    Some contradictory opinions: Steve McCoy's proposed func seems close, but would be much faster if it returned a slice (otherwise the func would continually have to rescan the same input until it was given a whole token). The stdlib side could use address checking to determine where the token resides (and panic if the input and output slices have no overlap). I suggest:

    func (data []byte, prevState int) (token []byte, state int)

    Here, the func should scan data for at most one token. state is user defined, except for zero, which means the token return value, if not nil, represents a complete token. This allows the function to continue processing a token without nesting the previous data. If the returned state is negative, it indicates an invalid token (the contents of which should be represented by the returned token value); the abs of the negative state will be stuffed with the full token (over however many calls had consecutively returned a state > 0) into an error; the caller of Close can type assert to use the error state to lookup a useful error message.
    Assume the data argument contains several lines. How does the above function
    indicate where the token finished?

  • Kevin Gillette at Feb 14, 2013 at 7:06 pm
    @rog: the token return val would only contain the first token. The next call to that func would pass the remaining buffer (starting on the first byte after the token's last byte), similar to the logic used with Read or Write in a loop, except instead of n, addresses would be used by bufio to determine the offset.

  • Kyle Lemons at Feb 14, 2013 at 7:13 pm

    On Thu, Feb 14, 2013 at 11:06 AM, Kevin Gillette wrote:

    @rog: the token return val would only contain the first token. The next
    call to that func would pass the remaining buffer (starting on the first
    byte after the token's last byte), similar to the logic used with Read or
    Write in a loop, except instead of n, addresses would be used by bufio to
    determine the offset.

    That is very strange. Keep in mind, there is no pointer arithmetic in Go.
    That would be difficult (if not impossible) to implement without unsafe.

  • Kevin Gillette at Feb 14, 2013 at 9:16 pm
    It shouldn't be problematic in that respect -- there wouldn't be any
    pointer arithmetic, since it's similar to the (safe) approach used to
    determine if two slices overlap:

    token, newState := thefunc(data, prevState)
    offset := uintptr(&token[0]) - uintptr(&data[0])
    // now do slicing or other (safe) determinations with offset -- it _won't_
    // be converted into an unsafe.Pointer

    A signature like this would support just about any kind of typical lexer,
    such as a (comparatively inefficient) tokenizer for the Go language. The
    func used by bytes.FieldsFunc isn't as powerful, since that couldn't
    distinguish between `:=` and `=:` without unclean use of closure-wrapped
    state (which also makes assumptions about the algorithm that calls the
    func).

  • Rob Pike at Feb 14, 2013 at 9:29 pm
    Updated proposal. Changes, comments, and counterarguments to proposed
    changes precede it here. Thanks for all your input; it was very
    helpful.

    The argument to the constructor is now an io.Reader, and Close closes
    the scanner but not the underlying Reader.

    NewScanner is otherwise unchanged. The design should be simple; no
    extra folderol in the constructor please.

    The method to recover the string-valued token is still Text, not
    String. Making it String, and hence a fmt.Stringer, feels like a
    misuse of the purpose of both the Scanner and the fmt interface. If
    Scanner is to have a String method, it should be about the Scanner,
    not a magic trick to access temporary state just to avoid one method
    call. I also like that making the string accessor explicit always
    means the user must think about whether to pay the price for the
    allocation the operation requires.

    There is still no way to grab the space between the tokens. The
    purpose of this design is to be simple, to do the common case by
    default and easily. If you want more control, use the old methods. A
    similar argument applies to things like defining line endings.

    The maximum token length has been enlarged to 64kB. I might even make
    it 1MB. If it's long enough, the need to continue the discussion about
    continuing after long lines becomes vanishingly small. See the
    previous paragraph.

    I left the chaining design in place. It's cheap and helpful for
    initialization but is not necessary: one may always use a separate
    line for the option setup if desired. This is a trivial decision to
    reverse, of course, and it's not set in stone yet.

    The Split function interface is all new. It was the least
    thought-through part of the previous iteration, and may still require
    refinement as implementation proceeds, but the design in this round
    seems pretty good to me.

    ---

    The existing bufio support for line-at-a-time I/O is cumbersome. Here
    is a proposal for an all-new design that should be easier to use,
    designed with help from Brad Fitzpatrick. This API would be an add-on,
    not a replacement for the existing ReadSlice etc.

    It's part of the bufio package, not being worth another package. In
    any case it can use some existing bufio internals to provide an
    efficient implementation.

    We add a new type, called Scanner, that is used to capture the new
    functionality. Its constructor takes an io.Reader. If the argument is
    not already a bufio.Reader, one is created to wrap the argument.

    This gives us:

    package bufio

    type Scanner struct { /* hidden */ }

    func NewScanner(r io.Reader) *Scanner

    The model for the scanner is to "tokenize" the input into text to be
    processed, separated by delimiters that are discarded. In the default
    case, this means lines of text separated by `\r?\n`. It is not
    possible in this design to discover whether, for instance, the last
    line of the input ends with a newline. This is OK; the point of this
    API is to make I/O easier and discovering such details about the input
    complicates existing designs.

    To scan the input, use the Next method as the loop condition, the
    Bytes or Text methods as the "getters", and Close at the end. Here are
    the method signatures:

    func (s *Scanner) Next() bool

    func (s *Scanner) Close() error

    func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

    func (s *Scanner) Text() string

    The last name is not String so we don't accidentally create a
    fmt.Stringer out of a Scanner. Some have suggested doing that anyway,
    but making String be an accessor rather than a formatter is an abuse
    of the model.

    Close does not close the Reader (it can't; the argument is not a
    ReadCloser); it just terminates the scan and reports any accumulated
    error. Because of the internal use of bufio, in general there can be
    no guarantee that, for early calls to Close, all data after the last
    returned token is available to be read afterwards.

    I/O works by calling Next to load the next "token". It returns false
    at EOF or error. Close() returns:

    nil if there was no error; or
    nil if the only I/O error was EOF; or
    whatever error from the Reader caused the scan to stop; or
    whatever scan error caused the scan to stop, such as line-too-long

    Here is code to print a file line-by-line, with line numbers:

    s := bufio.NewScanner(os.Stdin)
    for i := 1; s.Next(); i++ {
        line := s.Text()
        fmt.Printf("%3d\t%s\n", i, line)
    }
    if err := s.Close(); err != nil {
        log.Fatal(err)
    }

    This is the basic outline; it seems clean and easy to use. The use of
    Close to report error (thanks, Brad) is the key insight to having the
    code be simple.

    We could generalize a little on top of this. One easy step is to allow
    options, such as to control the maximum token length. These are done
    with a chaining API so they don't need to be in the constructor and
    work well in initialization expressions. There should be very few of
    them.

    // Default 64k or maybe larger; anyway, pretty big.
    func (s *Scanner) MaxLength(length int) *Scanner

    To allow the user to specify the token-splitting algorithm, we add a
    function option. It's not easy to use, but it won't be used much.
    Still, a word-breaking splitter, for instance, would be nice, as would
    a byte-at-a-time and rune-at-a-time scanner, and we could provide those
    in bufio itself. The function is called by Next and has this
    signature:

    type SplitFunc func(data []byte, EOF bool) (advance int, token []byte)

    The EOF argument is true only at EOF, giving the function a chance to
    terminate the last token.

    The incoming data is a slice of unconsumed data. Each call to
    SplitFunc occurs at the previous location, plus the returned 'advance'
    value from the previous call. Thus by returning advance==0, SplitFunc
    can ask the Scanner to accumulate data until there is a full token to
    return. If the required storage becomes too large while accumulating,
    the Scanner will terminate with a line-too-long error. Once a token is
    delivered, SplitFunc would typically return advance=len(token) plus
    perhaps len(separator).

    The token returned by SplitFunc is the next token to deliver to the
    client; there is no requirement that it correspond to any actual input
    data. For instance, it might be upper-cased or lower-cased or
    something completely arbitrary. A nil token signals to return nothing
    to the client yet.

    We set up a custom splitter with an option method:

    // Default: split on line breaks of the form `\r?\n`.
    func (s *Scanner) Split(SplitFunc) *Scanner

    For example, if we provided a rune splitter in the package, you'd scan
    runes like this:

    s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
    for s.Next() {
        fmt.Printf("rune: %s\n", s.Bytes())
    }
    if err := s.Close(); err != nil {
        log.Fatal(err)
    }

    Comments welcome.

  • Jimmy frasche at Feb 14, 2013 at 9:43 pm
    Doesn't the argument that Scanner not be a fmt.Stringer also apply to
    Scanner not being an io.Closer, if the Scanner doesn't actually Close
    the reader? That seems like it would be a source of common error.
    Maybe rename Close Stop?

  • Rob Pike at Feb 14, 2013 at 11:50 pm

    On Thu, Feb 14, 2013 at 1:43 PM, jimmy frasche wrote:
    Doesn't the argument that Scanner not be a fmt.Stringer also apply to
    Scanner not being an io.Closer, if the Scanner doesn't actually Close
    the reader? That seems like it would be a source of common error.
    Maybe rename Close Stop?
    The io.Closer interface comes up very rarely and such an error is
    unlikely. I'm not worried about it and, as Nigel says, there is
    precedent for Close.

    The fmt.Stringer argument was being made because people wanted to
    encourage the use of Scanner as a Stringer, whereas I want to prevent
    it.

    -rob

  • Dan Cross at Feb 15, 2013 at 12:23 am
    On Thu, Feb 14, 2013 at 4:43 PM, jimmy frasche [...]
    Maybe rename Close Stop?
    I agree, despite the counterarguments and precedent (and in a database
    context, 'Close' has double precedent from cursors).

    This whole design seems clean, simple and elegant, but I don't like the
    names: I prefer the suggestions of 'Scan' instead of 'Next' and 'Stop'
    instead of 'Close'.

    My justification is that these better match the semantics of what's
    actually happening. Next isn't returning a value, but rather an indicator
    of whether it did anything. Instead, just refer to the thing that was
    actually done (the scanning). Similarly, 'Close' doesn't close anything
    (what does it mean to close a scanner?), whereas 'Stop' is stating
    explicitly that one is stopping the Scanner.

    - Dan C.

  • Patrick W at Feb 15, 2013 at 2:48 am
    I'm in the target audience for this proposal (golang novice), and I'd find
    it more intuitive for a Scanner to provide Scan/Stop operations rather than
    Next/Close.


  • Kyle Lemons at Feb 15, 2013 at 12:17 am

    On Thu, Feb 14, 2013 at 1:28 PM, Rob Pike wrote:

    Updated proposal. Changes, comments, and counterarguments to proposed
    changes precede it here. Thanks for all your input; it was very
    helpful.

    The argument to the constructor is now an io.Reader, and Close closes
    the scanner but not the underlying Reader.

    NewScanner is otherwise unchanged. The design should be simple; no
    extra folderol in the constructor please.

    The method to recover the string-valued token is still Text, not
    String. Making it String, and hence a fmt.Stringer, feels like a
    misuse of the purpose of both the Scanner and the fmt interface. If
    Scanner is to have a String method, it should be about the Scanner,
    not a magic trick to access temporary state just to avoid one method
    call. I also like that making the string accessor explicit always
    means the user must think about whether to pay the price for the
    allocation the operation requires.


    There is still no way to grab the space between the tokens. The
    purpose of this design is to be simple, to do the common case by
    default and easily. If you want more control, use the old methods. A
    similar argument applies to things like defining line endings.

    The maximum token length has been enlarged to 64kB. I might even make
    it 1MB. If it's long enough, the need to continue the discussion about
    continuing after long lines becomes vanishingly small. See the
    previous paragraph.

    I left the chaining design in place. It's cheap and helpful for
    initialization but is not necessary: one may always use a separate
    line for the option setup if desired. This is a trivial decision to
    reverse, of course, and it's not set in stone yet.

    The Split function interface is all new. It was the least
    thought-through part of the previous iteration, and may still require
    refinement as implementation proceeds, but the design in this round
    seems pretty good to me.

    ---

    The existing bufio support for line-at-a-time I/O is cumbersome. Here
    is a proposal for an all-new design that should be easier to use,
    designed with help from Brad Fitzpatrick. This API would be an add-on,
    not a replacement for the existing ReadSlice etc.

    It's part of the bufio package, not being worth another package. In
    any case it can use some existing bufio internals to provide an
    efficient implementation.

    We add a new type, called Scanner, that is used to capture the new
    functionality. Its constructor takes an io.Reader. If the argument is
    not already a bufio.Reader, one is created to wrap the argument.

    This gives us:

    package bufio

    type Scanner struct { /* hidden */ }

    func NewScanner(r io.Reader) *Scanner

    The model for the scanner is to "tokenize" the input into text to be
    processed, separated by delimiters that are discarded. In the default
    case, this means lines of text separated by `\r?\n`. It is not
    possible in this design to discover whether, for instance, the last
    line of the input ends with a newline. This is OK; the point of this
    API is to make I/O easier and discovering such details about the input
    complicates existing designs.

    To scan the input, use the Next method as the loop condition, the
    Bytes or Text methods as the "getters", and Close at the end. Here are
    the method signatures:

    func (s *Scanner) Next() bool

    func (s *Scanner) Close() error

    func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

    func (s *Scanner) Text() string

    The last name is not String so we don't accidentally create a
    fmt.Stringer out of a Scanner. Some have suggested doing that anyway,
    but making String be an accessor rather than a formatter is an abuse
    of the model.

    Close does not close the Reader (it can't; the argument is not a
    ReadCloser); it just terminates the scan and reports any accumulated
    error. Because of the internal use of bufio, in general there can be
    no guarantee that, for early calls to Close, all data after the last
    returned token is available to be read afterwards.

    I/O works by calling Next to load the next "token". It returns false
    at EOF or error. Close() returns:

    nil if there was no error; or
    nil if the only I/O error was EOF; or
    whatever error from the Reader caused the scan to stop; or
    whatever scan error caused the scan to stop, such as line-too-long

    Here is code to print a file line-by-line, with line numbers:

    s := bufio.NewScanner(os.Stdin)
    for i := 1; s.Next(); i++ {
        line := s.Text()
        fmt.Printf("%3d\t%s\n", i, line)
    }
    if err := s.Close(); err != nil {
        log.Fatal(err)
    }

    This is the basic outline; it seems clean and easy to use. The use of
    Close to report error (thanks, Brad) is the key insight to having the
    code be simple.

    We could generalize a little on top of this. One easy step is to allow
    options, such as to control the maximum token length. These are done
    with a chaining API so they don't need to be in the constructor and
    work well in initialization expressions. There should be very few of
    them.

    func (s *Scanner) MaxLength(length int) *Scanner // default 64k or maybe larger. anyway pretty big

    To allow the user to specify the token-splitting algorithm, we add a
    function option. It's not easy to use, but it won't be used much.
    Still, a word-breaking splitter, for instance, would be nice, as would
    a byte-at-a-time and rune-at-a-time scanner, and we could provide those
    in bufio itself. The function is called by Next and has this
    signature:

    type SplitFunc func(data []byte, EOF bool) (advance int, token []byte)
    Cool

    The EOF argument is true only at EOF, giving the function a chance to
    terminate the last token.

    The incoming data is a slice of unconsumed data. Each call to
    SplitFunc occurs at the previous location, plus the returned 'advance'
    value from the previous call. Thus by returning advance==0, SplitFunc
    can ask the Scanner to accumulate data until there is a full token to
    return. If the required storage becomes too large while accumulating,
    the Scanner will terminate with a line-too-long error. Once a token is
    delivered, SplitFunc would typically return advance=len(token) plus
    perhaps len(separator).

    The token returned by SplitFunc is the next token to deliver to the
    client; there is no requirement that it correspond to any actual input
    data.

    Ooo, shiny.

    For instance, it might be upper-cased or lower-cased or
    something completely arbitrary. A nil token signals to return nothing
    to the client yet.

    We set up a custom splitter with an option method:

    func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line breaks of the form `\r?\n`

    For example, if we provided a rune splitter in the package, you'd scan
    runes like this:

    s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
    for s.Next() {
        fmt.Printf("rune: %s\n", s.Bytes())
    }
    if err := s.Close(); err != nil {
        log.Fatal(err)
    }

    Comments welcome.
    LGTM. While I would've chosen the opposite for String/Text and Close/Err,
    I like this API and I think it is a dramatic improvement.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Nate Finch at Feb 15, 2013 at 1:59 am
    I hate the chaining API. It's a cheap trick just to avoid writing one line
    of code, which to me is a very un-Go-like goal. MaxLength does not return a
    new Scanner, so it shouldn't return *Scanner; it's an unnecessary
    complication of the interface. I haven't found anything else in the
    standard library that does that, and I don't think starting now is a good
    idea.

    Also, this may have been assumed, but it wasn't explicitly stated - you
    should export the default SplitFunc for splitting on '\r?\n'.
    On Thursday, February 14, 2013 4:28:56 PM UTC-5, Rob Pike wrote:

    I left the chaining design in place. It's cheap and helpful for
    initialization but is not necessary: one may always use a separate
    line for the option setup if desired. This is a trivial decision to
    reverse, of course, and it's not set in stone yet.
  • Kyle Lemons at Feb 15, 2013 at 2:16 am

    On Thu, Feb 14, 2013 at 5:59 PM, Nate Finch wrote:

    I hate the chaining API. It's a cheap trick just to avoid writing one line
    of code, which to me is a very un-Go-like goal. MaxLength does not return a
    new Scanner, so it shouldn't return *Scanner; it's an unnecessary
    complication of the interface. I haven't found anything else in the
    standard library that does that, and I don't think starting now is a good
    idea.

    I don't really like chaining either, but it is used in e.g. the template
    libraries. template.New("blah").Funcs(funcmap).Parse("text") etc, so it
    does have precedent.

  • Gary b at Feb 15, 2013 at 2:18 am

    On Thursday, February 14, 2013 5:59:45 PM UTC-8, Nate Finch wrote:

    I haven't found anything else in the standard library that does that,

    The text/template and html/template packages use the chaining design (see
    Funcs and Delims methods on Template type).

  • Nate Finch at Feb 15, 2013 at 2:24 am
    I figured someone else would prove me wrong. I scanned through likely
    packages to see if I'd forgotten anything, and missed those ones. I've
    even used them, though not recently, pleh. I'll blame it on lack of sleep
    :) I still don't like it, even there :)
  • Roger peppe at Feb 15, 2013 at 8:38 am

    On 15 February 2013 02:18, gary b wrote:
    On Thursday, February 14, 2013 5:59:45 PM UTC-8, Nate Finch wrote:

    I haven't found anything else in the standard library that does that,

    The text/template and html/template packages use the chaining design (see
    Funcs and Delims methods on Template type).
    That's true, and it's a right pain (and the reason I suggested it
    might not be a good idea for the splitter interface) - it means you can't have
    a single interface that satisfies both the Template in text/template
    and in html/template.

  • Steve McCoy at Feb 15, 2013 at 2:34 am

    On Thursday, February 14, 2013 4:28:56 PM UTC-5, Rob Pike wrote:

    Updated proposal.
    Looks great to me and — despite your warning — this version of Split seems
    very easy to use.

  • Roger peppe at Feb 15, 2013 at 9:52 am

    On 14 February 2013 21:28, Rob Pike wrote:
    Close does not close the Reader (it can't; the argument is not a
    ReadCloser); it just shuts down the scanning operation, terminates the
    scan, and reports any accumulated error. Because of the internal use
    of bufio, in general there can be no guarantee that, for early calls to
    Close, all data after the last returned token is available to be read
    afterwards.
    Presumably if the reader was already a bufio.Reader, we *could*
    make that guarantee, and it might be useful to do so.
    To allow the user to specify the token-splitting algorithm, we add a
    function option. It's not easy to use, but it won't be used much.
    Still, a word-breaking splitter, for instance, would be nice, as would
    a byte-at-a-time and rune-at-a-time scanner, and we could provide those
    in bufio itself. The function is called by Next and has this
    signature:

    type SplitFunc func(data []byte, EOF bool) (advance int, token []byte)

    The EOF argument is true only at EOF, giving the function a chance to
    terminate the last token.

    The incoming data is a slice of unconsumed data. Each call to
    SplitFunc occurs at the previous location, plus the returned 'advance'
    value from the previous call. Thus by returning advance==0, SplitFunc
    can ask the Scanner to accumulate data until there is a full token to
    return. If the required storage becomes too large while accumulating,
    the Scanner will terminate with a line-too-long error. Once a token is
    delivered, SplitFunc would typically return advance=len(token) plus
    perhaps len(separator).
    This seems fine. One small thing for consideration though: by
    making the existence of a token predicated on advance>0,
    we preclude the possibility of zero-length tokens with no delimiter.
    For example, we couldn't have a splitter similar to strings.Split,
    because the last (or the first) token has no delimiter and may be empty.
    On the other hand, it has the nice property that the splitter is bound
    to make progress regardless of what the split func returns.
    The token returned by SplitFunc is the next token to deliver to the
    client; there is no requirement that it correspond to any actual input
    data. For instance, it might be upper-cased or lower-cased or
    something completely arbitrary. A nil token signals to return nothing
    to the client yet.

    We set up a custom splitter with an option method:

    func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line breaks of the form `\r?\n`

    For example, if we provided a rune splitter in the package, you'd scan
    runes like this:

    s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
    I'm still a little uncomfortable with using a function type for the
    splitter. It's fine for splitting runes (there's no efficiency to be
    gained from scanning where the last call left off, so we
    can use a static function), but in general one would need to create
    a new function for each new scanner, because it's necessary to
    have closure state.

    For example, if we provide a splitter to split on \n only,
    it would probably be something like:

    func NewlineSplitter() SplitFunc {
        seen := 0
        return func(b []byte, atEOF bool) (int, []byte) {
            i := bytes.IndexByte(b[seen:], '\n')
            if i < 0 {
                seen = len(b)
                return 0, nil
            }
            // etc.
        }
    }

    That looks quite like a New function, and in general
    I think it's nicer to be carrying state around in values
    rather than closures.

    That said, it'll work fine either way.

  • Luis Alfonso Vega Garcia at Feb 15, 2013 at 1:33 pm
    Cool, a (simple) Scanner has been a missing important feature for Go.
    I'm looking forward to use it.
    We set up a custom splitter with an option method:
    func (s *Scanner) Split(SplitFunc) *Scanner // default: split
    on line breaks of the form `\r?\n`

    I wonder what would be the implementation of the SplitFunc for the mac
    case ('\r').


    -- Alfonso

    Alfonso Vega-Garcia | Software Engineer | vegacom at gmail.com


  • Michael Jones at Feb 15, 2013 at 4:09 pm
    Luis (and Rob),

    Why not change the default from Rob's:

    In the default case, this means lines of text separated by `\r?\n`.

    to be "[end of line stuff]" as in:

    (\r\n?) | (\n)


    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765

  • Tarmigan at Feb 17, 2013 at 5:44 am

    On Fri, Feb 15, 2013 at 1:52 AM, roger peppe wrote:
    I'm still a little uncomfortable with using a function type for the
    splitter. It's fine for splitting runes (there's no efficiency to be
    gained from scanning where the last call left off, so we
    can use a static function), but in general one would need to create
    a new function for each new scanner, because it's necessary to
    have closure state.

    For example, if we provide a splitter to split on \n only,
    it would probably be something like:

    func NewlineSplitter() SplitFunc {
    seen := 0
    return func(b []byte, atEOF bool) (int, []byte) {
    i := bytes.IndexByte(b[seen:], '\n')
    if i < 0 {
    seen = len(b)
    return 0, nil
    }
    etc
    }
    }

    That looks quite like a New function, and in general
    I think it's nicer to be carrying state around in values
    rather than closures.

    That said, it'll work fine either way.
    My initial reading was that b would advance one byte per call, so I
    think you could just check the last byte?

    func NewlineSplitter(b []byte, atEOF bool) (int, []byte) {
        if atEOF {
            return len(b), b
        }
        // Guard against an empty slice before inspecting the last byte.
        if len(b) > 0 && b[len(b)-1] == '\n' {
            if len(b) == 1 {
                return 1, nil
            }
            // Drop only the trailing '\n' from the token.
            return len(b), b[:len(b)-1]
        }
        return 0, nil
    }

    But rereading the proposal again, it looks unspecified whether the
    slice could advance by more than 1 byte.

    -Tarmigan

  • Brad Fitzpatrick at Feb 15, 2013 at 4:53 pm
    (*Scanner).Close in the updated proposal is a weird name, considering there
    are no resources to free. If there isn't a ReadCloser to close, naming it
    Close makes it sound like FreeSomeBuffersAndBreakUpSomeGarbageMaybe().

    There's precedent for naming it Err() error instead:

    http://golang.org/pkg/database/sql/#Rows.Err

    Which does the same thing:

    // Err returns the error, if any, that was encountered during iteration.
    func (rs *Rows) Err() error {
        if rs.lasterr == io.EOF {
            return nil
        }
        return rs.lasterr
    }


    On Thu, Feb 14, 2013 at 1:28 PM, Rob Pike wrote:

    Updated proposal. Changes, comments, and counterarguments to proposed
    changes precede it here. Thanks for all your input; it was very
    helpful.

    The argument to the constructor is now an io.Reader, and Close closes
    the scanner but not the underlying Reader.

    NewScanner is otherwise unchanged. The design should be simple; no
    extra folderol in the constructor please.

    The method to recover the string-valued token is still Text, not
    String. Making it String, and hence a fmt.Stringer, feels like a
    misuse of the purpose of both the Scanner and the fmt interface. If
    Scanner is to have a String method, it should be about the Scanner,
    not a magic trick to access temporary state just to avoid one method
    call. I also like that making the string accessor explicit always
    means the user must think about whether to pay the price for the
    allocation the operation requires.

    There is still no way to grab the space between the tokens. The
    purpose of this design is to be simple, to do the common case by
    default and easily. If you want more control, use the old methods. A
    similar argument applies to things like defining line endings.

    The maximum token length has been enlarged to 64kB. I might even make
    it 1MB. If it's long enough, the need to continue the discussion about
    continuing after long lines becomes vanishingly small. See the
    previous paragraph.

    I left the chaining design in place. It's cheap and helpful for
    initialization but is not necessary: one may always use a separate
    line for the option setup if desired. This is a trivial decision to
    reverse, of course, and it's not set in stone yet.

    The Split function interface is all new. It was the least
    thought-through part of the previous iteration, and may still require
    refinement as implementation proceeds, but the design in this round
    seems pretty good to me.

    ---

    The existing bufio support for line-at-a-time I/O is cumbersome. Here
    is a proposal for an all-new design that should be easier to use,
    designed with help from Brad Fitzpatrick. This API would be an add-on,
    not a replacement for the existing ReadSlice etc.

    It's part of the bufio package, not being worth another package. In
    any case it can use some existing bufio internals to provide an
    efficient implementation.

    We add a new type, called Scanner, that is used to capture the new
    functionality. Its constructor takes an io.Reader. If the argument is
    not already a bufio.Reader, one is created to wrap the argument.

    This gives us:

    package bufio

    type Scanner struct { /* hidden */ }

    func NewScanner(r io.Reader) *Scanner

    The model for the scanner is to "tokenize" the input into text to be
    processed, separated by delimiters that are discarded. In the default
    case, this means lines of text separated by `\r?\n`. It is not
    possible in this design to discover whether, for instance, the last
    line of the input ends with a newline. This is OK; the point of this
    API is to make I/O easier and discovering such details about the input
    complicates existing designs.

    To scan the input, use the Next method as the loop condition, the
    Bytes or Text methods as the "getters", and Close at the end. Here are
    the method signatures:

    func (s *Scanner) Next() bool

    func (s *Scanner) Close() error

    func (s *Scanner) Bytes() []byte // Does not copy; data is volatile.

    func (s *Scanner) Text() string

    The last name is not String so we don't accidentally create a
    fmt.Stringer out of a Scanner. Some have suggested doing that anyway,
    but making String be an accessor rather than a formatter is an abuse
    of the model.

    Close does not close the Reader (it can't; the argument is not a
    ReadCloser); it just shuts down the scanning operation, terminates the
    scan, and reports any accumulated error. Because of the internal use
    of bufio, in general there can be no guarantee that, for early calls to
    Close, all data after the last returned token is available to be read
    afterwards.

    I/O works by calling Next to load the next "token". It returns false
    at EOF or error. Close() returns:

    nil if there was no error; or
    nil if the only I/O error was EOF; or
    whatever error from the Reader caused the scan to stop; or
    whatever scan error caused the scan to stop, such as line-too-long

    Here is code to print a file line-by-line, with line numbers:

    s := bufio.NewScanner(os.Stdin)
    for i := 1; s.Next(); i++ {
        line := s.Text()
        fmt.Printf("%3d\t%s\n", i, line)
    }
    if err := s.Close(); err != nil {
        log.Fatal(err)
    }

    This is the basic outline; it seems clean and easy to use. The use of
    Close to report error (thanks, Brad) is the key insight to having the
    code be simple.

    We could generalize a little on top of this. One easy step is to allow
    options, such as to control the maximum token length. These are done
    with a chaining API so they don't need to be in the constructor and
    work well in initialization expressions. There should be very few of
    them.

    func (s *Scanner) MaxLength(length int) *Scanner // default 64k or maybe larger. anyway pretty big

    To allow the user to specify the token-splitting algorithm, we add a
    function option. It's not easy to use, but it won't be used much.
    Still, a word-breaking splitter, for instance, would be nice, as would
    a byte-at-a-time and rune-at-a-time scanner, and we could provide those
    in bufio itself. The function is called by Next and has this
    signature:

    type SplitFunc func(data []byte, EOF bool) (advance int, token []byte)

    The EOF argument is true only at EOF, giving the function a chance to
    terminate the last token.

    The incoming data is a slice of unconsumed data. Each call to
    SplitFunc occurs at the previous location, plus the returned 'advance'
    value from the previous call. Thus by returning advance==0, SplitFunc
    can ask the Scanner to accumulate data until there is a full token to
    return. If the required storage becomes too large while accumulating,
    the Scanner will terminate with a line-too-long error. Once a token is
    delivered, SplitFunc would typically return advance=len(token) plus
    perhaps len(separator).

    The token returned by SplitFunc is the next token to deliver to the
    client; there is no requirement that it correspond to any actual input
    data. For instance, it might be upper-cased or lower-cased or
    something completely arbitrary. A nil token signals to return nothing
    to the client yet.
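To make the advance/token contract concrete, here is a minimal sketch of a word splitter written against the two-result signature proposed above. The names `splitFunc`, `splitWords` and `scanAll` are hypothetical, and `scanAll` is only a toy driver standing in for what Next would do: it reveals the input a few bytes at a time so the "return advance==0 to accumulate" path is exercised.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitFunc follows the signature proposed above.
type splitFunc func(data []byte, eof bool) (advance int, token []byte)

// splitWords is a hypothetical splitter for space-separated words.
func splitWords(data []byte, eof bool) (int, []byte) {
	// Skip leading spaces.
	start := 0
	for start < len(data) && data[start] == ' ' {
		start++
	}
	if i := bytes.IndexByte(data[start:], ' '); i >= 0 {
		// A full word is available: consume it plus one separator.
		return start + i + 1, data[start : start+i]
	}
	if eof && start < len(data) {
		// No more input is coming, so the remainder is the last word.
		return len(data), data[start:]
	}
	// Not enough data yet: advance 0 asks the caller to accumulate more.
	return 0, nil
}

// scanAll is a toy driver: the unconsumed data is src[start:end], and end
// grows by chunk bytes whenever split returns a nil token.
func scanAll(input string, chunk int, split splitFunc) []string {
	var out []string
	src := []byte(input)
	start, end := 0, 0
	for {
		eof := end == len(src)
		advance, token := split(src[start:end], eof)
		start += advance
		if token != nil {
			out = append(out, string(token)) // copy: token aliases src
			continue
		}
		if eof {
			return out
		}
		end += chunk
		if end > len(src) {
			end = len(src)
		}
	}
}

func main() {
	fmt.Printf("%q\n", scanAll("go  is fun", 3, splitWords)) // ["go" "is" "fun"]
}
```

A real Scanner would also enforce the maximum token length while accumulating; the sketch leaves that out.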

    We set up a custom splitter with an option method:

    func (s *Scanner) Split(SplitFunc) *Scanner // default: split on line breaks of the form `\r?\n`

    For example, if we provided a rune splitter in the package, you'd scan
    runes like this:

    s := bufio.NewScanner(os.Stdin).Split(bufio.SplitRune)
    for s.Next() {
        fmt.Printf("rune: %s\n", s.Bytes())
    }
    if err := s.Close(); err != nil {
        log.Fatal(err)
    }
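For comparison, the same loop can be written against the bufio.Scanner that shipped in Go 1.1, where the names settled differently from this proposal: NewScanner takes a plain io.Reader, the loop method is Scan, Split is not chainable, the rune splitter is bufio.ScanRunes, and the error is fetched with Err. The helper name `runes` below is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// runes collects each rune of input as a string using the shipped
// bufio.Scanner API.
func runes(input string) ([]string, error) {
	s := bufio.NewScanner(strings.NewReader(input))
	s.Split(bufio.ScanRunes) // Split returns nothing, so it cannot be chained
	var out []string
	for s.Scan() {
		out = append(out, s.Text())
	}
	// Err is nil if the scan simply reached EOF.
	return out, s.Err()
}

func main() {
	rs, err := runes("héllo")
	fmt.Printf("%q %v\n", rs, err) // ["h" "é" "l" "l" "o"] <nil>
}
```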

    Comments welcome.

    --
    You received this message because you are subscribed to the Google Groups
    "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

  • Rob Pike at Feb 15, 2013 at 5:29 pm
    I'm leaning towards Scan and Stop as the methods at this point. I'm
    close to blowing the whistle on the bikeshedding session.

    I don't like Err or Error because they suggest something about the
    error interface. The point is, in fact, to shut down the scanner and
    terminate the scan; the error is a side effect, not the driver, and
    calling it Err for instance will encourage lazy users to skip that
    stage. That's partly why I liked Close, but now think that Close
    indicates closure of the underlying resource. Stop indeed makes sense.
    When you're done, you Stop. "When you're done, you Err" doesn't sound
    right.

    I'm going to write some code.

    -rob

  • Brad Fitzpatrick at Feb 15, 2013 at 5:45 pm

    On Fri, Feb 15, 2013 at 9:29 AM, Rob Pike wrote:

    I'm leaning towards Scan and Stop as the methods at this point. I'm
    close to blowing the whistle on the bikeshedding session.

    I don't like Err or Error because they suggest something about the
    error interface.

    What do they suggest? Error(), sure, makes it look like an error interface.

    But Err is used in both:

    $ grep Err\(\) api/go1.txt
    pkg database/sql, method (*Rows) Err() error
    pkg go/scanner, method (ErrorList) Err() error

    And in compress/bzip2's private bitReader.

    And 7 places inside Google's codebase.

    If that's not the convention for a sticky error accessor, what should it
    be? "GetLastError"?

    The point is, in fact, to shut down the scanner and
    terminate the scan; the error is a side effect, not the driver, and
    calling it Err for instance will encourage lazy users to skip that
    stage. That's partly why I liked Close, but now think that Close
    indicates closure of the underlying resource. Stop indeed makes sense.
    When you're done, you Stop. "When you're done, you Err" doesn't sound
    right.

    What is there to Stop? What needs to be freed and shut down? The state
    and buffers are garbage-collected.

    I don't Stop reading from an io.Reader if I'm done early before io.EOF.

    What else do I Stop?

    The only thing I see to Stop is a *time.Ticker. But that makes sense: if
    I don't stop it, it keeps going. A bufio.Scanner has no inertia.
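The "sticky error accessor" convention referred to here can be sketched as below. Everything in the block is hypothetical (`byteIter`, `failAfter`, `errBoom` and the helper functions); it only illustrates the pattern used by types such as database/sql's Rows: the iteration method returns a bool, and the first real error is remembered until the caller asks for it with Err.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// byteIter is a hypothetical one-byte-at-a-time iterator with a sticky error.
type byteIter struct {
	r   io.Reader
	buf [1]byte
	err error // first non-EOF error, kept until Err is called
}

// Next reports whether another byte was read. EOF ends the loop silently;
// any other error ends the loop and is remembered.
func (it *byteIter) Next() bool {
	if it.err != nil {
		return false
	}
	if _, err := io.ReadFull(it.r, it.buf[:]); err != nil {
		if err != io.EOF {
			it.err = err
		}
		return false
	}
	return true
}

// Err returns the remembered error, or nil if iteration ended at EOF.
func (it *byteIter) Err() error { return it.err }

var errBoom = errors.New("boom")

// failAfter is a test reader that fails with errBoom after n bytes.
type failAfter struct {
	r io.Reader
	n int
}

func (f *failAfter) Read(p []byte) (int, error) {
	if f.n <= 0 {
		return 0, errBoom
	}
	if len(p) > f.n {
		p = p[:f.n]
	}
	n, err := f.r.Read(p)
	f.n -= n
	return n, err
}

func count(r io.Reader) (int, error) {
	it := &byteIter{r: r}
	n := 0
	for it.Next() {
		n++
	}
	return n, it.Err()
}

func countString(s string) (int, error) { return count(strings.NewReader(s)) }

func countFailing(s string, n int) (int, error) {
	return count(&failAfter{r: strings.NewReader(s), n: n})
}

func main() {
	fmt.Println(countString("abc"))        // 3 <nil>
	fmt.Println(countFailing("abcdef", 2)) // 2 boom
}
```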

    I'm going to write some code.

    Okay. I can bring this up again later during code review.

  • Kevin Gillette at Feb 15, 2013 at 10:26 pm
    I agree on blowing that whistle -- I'd have been happy if your initial
    proposal had used any of the suggested names, and wouldn't have put much
    thought towards that. The nomenclature may be a learning/readability
    concern, but it certainly has no effect on practical ease-of-use or
    flexibility; those are the issues I'm concerned with.

    I'm still interested in more details on SplitFunc: I believe it'd still be
    useful to have some kind of error signalling. Just as filepath.Walk can
    break out early with an error, there'd be use cases here where the
    interface is close enough to working for general tokenizing tasks (which
    would need to be able to signal syntax errors and such) that it is an
    important consideration. I'm not making a suggestion, and indeed if the
    expected use-cases are such that closures or panics could satisfy this
    need, then so be it.
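One way such error signalling can look is the form bufio.SplitFunc took when it shipped in Go 1.1, where the function gained a third result, err, which stops the scan and is reported by Err. The sketch below uses that shipped API; the splitter `splitDigits`, the error `errSyntax` and the helper `fields` are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"strings"
)

// errSyntax is a hypothetical error used only in this sketch.
var errSyntax = errors.New("syntax error: non-digit in field")

// splitDigits yields comma-separated fields and aborts the scan on any
// byte that is not an ASCII digit.
func splitDigits(data []byte, atEOF bool) (advance int, token []byte, err error) {
	end := bytes.IndexByte(data, ',')
	switch {
	case end >= 0:
		advance = end + 1 // consume the field and its comma
	case atEOF && len(data) > 0:
		end, advance = len(data), len(data) // final field, no trailing comma
	default:
		return 0, nil, nil // need more data, or nothing left
	}
	for _, c := range data[:end] {
		if c < '0' || c > '9' {
			return 0, nil, errSyntax
		}
	}
	return advance, data[:end], nil
}

// fields returns the tokens seen before any error, plus that error.
func fields(input string) ([]string, error) {
	s := bufio.NewScanner(strings.NewReader(input))
	s.Split(splitDigits)
	var out []string
	for s.Scan() {
		out = append(out, s.Text())
	}
	return out, s.Err()
}

func main() {
	fmt.Println(fields("12,34,56"))   // [12 34 56] <nil>
	fmt.Println(fields("12,34,x5,6")) // [12 34] syntax error: non-digit in field
}
```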

    Regarding chaining vs text/template: I see the usefulness of chaining in
    the template case, since it's often desirable to fully initialize global
    variables with Template values -- I don't see the usefulness of fully
    initializing a global with a bufio.Scanner. In this case, I think a cleaner
    option would be the exp/cookiejar approach, along the lines of:

    NewScanner(io.Reader, *ScannerCfg) *Scanner

    Where the cfg arg, a struct pointer, can be nil to indicate defaults.
    Alternatively, that signature could be a separate function perhaps called
    NewScannerConfig, alongside Rob's proposed NewScanner. Importantly, if
    options continually get added, there won't be a large chaining method set
    to maintain -- it'd instead consist of just adding extra fields to the
    config struct type.
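A minimal sketch of the config-struct approach, under stated assumptions: `ScannerCfg` is the name used above, but its `MaxLength` field, the `cfgScanner` type, `newCfgScanner` and the 64k default are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// ScannerCfg holds options; the MaxLength field is an assumption.
type ScannerCfg struct {
	MaxLength int // 0 means "use the default"
}

// cfgScanner is a hypothetical stand-in for the proposed Scanner.
type cfgScanner struct {
	r         io.Reader
	maxLength int
}

// defaultMax is an assumed default for this sketch.
const defaultMax = 64 * 1024

// newCfgScanner accepts a nil cfg to mean "all defaults". New options would
// become new struct fields rather than new methods.
func newCfgScanner(r io.Reader, cfg *ScannerCfg) *cfgScanner {
	s := &cfgScanner{r: r, maxLength: defaultMax}
	if cfg != nil && cfg.MaxLength > 0 {
		s.maxLength = cfg.MaxLength
	}
	return s
}

// maxFor is a small helper so the behaviour can be checked without I/O.
func maxFor(cfg *ScannerCfg) int {
	return newCfgScanner(strings.NewReader(""), cfg).maxLength
}

func main() {
	fmt.Println(maxFor(nil))                              // 65536
	fmt.Println(maxFor(&ScannerCfg{MaxLength: 32 * 1024})) // 32768
}
```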

    I think 32kb is a reasonable default buffer size since, one way or another,
    that size will be configurable on a per-scanner basis. I, for one, would
    want to use this for parsing files which wouldn't even be expected to total
    32kb in size (much less having lines that long). That leaves cases where the
    lines can be arbitrarily, extremely long (such that the "human readable"
    concept of "line" has long since ceased to apply), especially in files whose
    typical production contents are not intended to be managed or viewed by
    humans except in dire circumstances.

    With due deference and respect to kortschak, I doubt that 1mb, or any fixed
    buffer size would be sufficient to handle some of those line-as-record
    cases which can contain arbitrary internal data, but in those cases as
    well, if I chose bufio.Scanner as the solution, I'd write a custom
    SplitFunc to handle low level tokenization and also include the newline
    itself as a separate token, rather than scan line-at-a-time, and then
    subprocess each line (bufio.Reader's ReadLine as is would excel
    particularly well in doing that anyway).

    My suggestion is that the default be whatever buffer size is reasonable and
    efficient for throwaway tasks with an emphasis on expected home-brew
    formats, since complex formats (or formats designed for machine efficiency)
    are more likely to use a custom scanner, or may even be forced to use a
    custom scanner (if processing is truly line oriented yet lines may be of
    any length). By not setting it too high, it can also serve as an early
    canary for some tasks that bufio.Scanner may not be appropriate -- I myself
    have worked with formats where the common case has lines that fit on an 80
    character display, but with corner cases that can be hundreds of kilobytes
    to tens of megs long -- exceeding a 32 or 64kb buffer may occur after a
    couple days of processing production data, while it may take 6 months to
    come across a case that will exceed a 1mb buffer.
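The "early canary" behaviour can be seen with the bufio.Scanner that eventually shipped, whose default token limit is 64 KB (bufio.MaxScanTokenSize) and whose Buffer method, added in Go 1.6, raises it per scanner. The helper names `countLines` and `longInput` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// longInput builds one short line followed by one 100 KB line.
func longInput() string {
	return "short\n" + strings.Repeat("x", 100*1024) + "\n"
}

// countLines scans input line by line. If limit > 0, the scanner's maximum
// token size is raised to limit bytes via Scanner.Buffer.
func countLines(input string, limit int) (int, error) {
	s := bufio.NewScanner(strings.NewReader(input))
	if limit > 0 {
		s.Buffer(make([]byte, 0, 4096), limit)
	}
	n := 0
	for s.Scan() {
		n++
	}
	return n, s.Err()
}

func main() {
	// With the default limit the long line stops the scan with an error.
	fmt.Println(countLines(longInput(), 0)) // 1 bufio.Scanner: token too long
	// With a 1 MB limit both lines are delivered.
	fmt.Println(countLines(longInput(), 1<<20)) // 2 <nil>
}
```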
  • Dan Kortschak at Feb 15, 2013 at 11:07 pm
    Yes, that's what I currently do for FASTA files: that is, use ReadLine. I suspect that the advent of the proposed Scanner will not change my existing codebase.

    It occurred to me that I have already written a Scanner type, though for the specialised purpose of conditioning ragel state variables. This again would not change, as it needs control of its buffers.
    On 16/02/2013, at 8:56 AM, "Kevin Gillette" wrote:

    With due deference and respect to kortschak, I doubt that 1mb, or any fixed buffer size would be sufficient to handle some of those line-as-record cases which can contain arbitrary internal data, but in those cases as well, if I chose bufio.Scanner as the solution, I'd write a custom SplitFunc to handle low level tokenization and also include the newline itself as a separate token, rather than scan line-at-a-time, and then subprocess each line (bufio.Reader's ReadLine as is would excel particularly well in doing that anyway).
  • Michal at Feb 16, 2013 at 2:50 pm
    Another proposition:

    type Scanner struct {
        // error returned by last Next()
        lastErr error
        // error returned by Next() before the last Next()
        prevErr error
        ...
    }

    // HasNext returns false <-> prevErr != nil and lastErr != nil
    func (s *Scanner) HasNext() bool { ... }

    // Next returns error (lastErr)
    func (s *Scanner) Next() error { ... }

    // If prevErr != nil and lastErr != nil -> length of slice returned by
    // Bytes() equals 0.
    func (s *Scanner) Bytes() []byte { ... }

    Usage:

    for err := s.Next(); s.HasNext(); err = s.Next() {
        if err != nil && err != io.EOF {
            return
        }
        ...
    }

    michal

