Greetings,

At the Library of Congress we've recently been exploring rewriting a [Java web archiving tool][1] in Go. So far this has involved working with an existing body (~500TB) of data encoded using [ISO/DIS 28500][2] aka the WARC file format. One of the features of WARC is its use of [Gzip][3] as a packaging format, which allows individual WARC records to be represented as separate members in the larger Gzip file. Or as the spec says:
> Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed. Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files. External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.
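
As a rough illustration of the property the spec describes, here is a sketch using only the standard library. The helper names `writeMembers` and `readMember` are made up for this example and are not part of any WARC tool. Each record is compressed as its own member while its starting offset is noted, and a single record is later decompressed from just its byte range:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// writeMembers compresses each record as its own gzip member and returns the
// concatenated stream plus the starting offset of every member.
// (Hypothetical helper for illustration.)
func writeMembers(records []string) ([]byte, []int, error) {
	var buf bytes.Buffer
	var offsets []int
	for _, rec := range records {
		offsets = append(offsets, buf.Len())
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(rec)); err != nil {
			return nil, nil, err
		}
		// Close finishes the current member (trailer included).
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readMember decompresses only the member stored in data[start:end].
// The subrange is itself a valid gzip file, so a plain reader is enough.
func readMember(data []byte, start, end int) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[start:end]))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	data, offsets, err := writeMembers([]string{"record one", "record two", "record three"})
	if err != nil {
		panic(err)
	}
	// The offsets slice plays the role of the external index.
	got, err := readMember(data, offsets[1], offsets[2])
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints "record two"
}
```

Note that this only works when both the start and the end of the member are known; the rest of this message is about the case where the reader has to discover the boundaries itself.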
We ran into difficulty using gzip.Reader since it does not provide any insight into when a member has been read; it simply reads through all the members in the file. While looking for people with a similar problem we ran across a [go-nuts thread][4] initiated by Dan Kortschak, who needed to access members in a gzip file in his [Biogo][5] project for processing genomic and metagenomic data sets.

We would like to propose a small addition to gzip that would introduce a MemberReader, which would expose when the end of a member has been reached as well as the byte offset position in the underlying compressed data.

For example, if you want to print out the header and the end position of each member:

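```go
// Sketch of the proposed API: gzip.NewMemberReader, gzip.EndOfMember and
// EndPosition are part of this proposal and do not exist in compress/gzip.
func printMembers(name string) error {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewMemberReader(f)
	if err != nil {
		return err
	}
	for {
		_, err := io.Copy(ioutil.Discard, gz)
		switch err {
		case nil:
			return nil // reached the end of the file
		case gzip.EndOfMember:
			fmt.Printf("Header: %#v\n", gz.Header)
			fmt.Println("End Position:", gz.EndPosition())
		default:
			return err
		}
	}
}
```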

Then to read one member at a known position:

```go
f, _ := os.Open("test.gz")
f.Seek(position, 0) // whence 0 (io.SeekStart): position is relative to the start of the file
gz, _ := gzip.NewMemberReader(f)
```

Thoughts? We are ready to work on an implementation once the design looks good.

[1]: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
[2]: http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
[3]: https://tools.ietf.org/html/rfc1952
[4]: https://groups.google.com/forum/#!searchin/golang-nuts/gzip/golang-nuts/VFfzYiI2rDc/EZkt6gguirwJ
[5]: https://code.google.com/p/biogo/

--

---
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Dan Kortschak at Sep 25, 2013 at 9:04 pm
    I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.

    When I suggested that this kind of thing be included in the std lib the proposal was not accepted. The diff between the std and my fork is not onerous to maintain.

  • Ed Summers at Sep 26, 2013 at 2:32 pm
    Hi Dan,
    On Wednesday, September 25, 2013 5:04:43 PM UTC-4, Dan Kortschak wrote:

    > I have a gzip reader that does this at
    > code.google.com/p/biogo.bam/bgzf/egzip.
    >
    > When I suggested that this kind of thing be included in the std lib the
    > proposal was not accepted. The diff between the std and my fork is not
    > onerous to maintain.

    Yes, if you read our email a bit closer you will see we cited your
    previous work on this :-)

    One thing your implementation did not do for us was provide the byte
    offsets for where members began in the compressed file. Were those known to
    you out of band? We needed to make a small addition to compress/flate to
    get access to this.

    Your implementation basically cloned all of bufio and gzip, and although
    the diffs were relatively modest, it seems like other golang users might
    potentially find this functionality useful. If two golang users from vastly
    different domains need it, and it is a feature of the gzip specification,
    it seems worthy of consideration.

    Can you point to the previous design discussion? One thing Dan Krech didn't
    mention in his previous email is we have a working implementation if others
    are interested in seeing it.

    //Ed

  • Dan Kortschak at Sep 26, 2013 at 8:17 pm

    On 27/09/2013, at 12:02 AM, "Ed Summers" wrote:

    > Yes, if you read our email a bit closer you will see we cited your previous work on this :-)

    Yes, sorry, I missed that until after I sent.

    > One thing your implementation did not do for us was provide the byte offsets for where members began in the compressed file. Were those known to you out of band? We needed to make a small addition to compress/flate to get access to this.

    Yes, it's interesting because at the moment I'm working on something that does exactly that, although it's not ready (this is necessary for concurrent bgzf access).

    > Your implementation basically cloned all of bufio and gzip, and although the diffs were relatively modest, it seems like other golang users might potentially find this functionality useful. If two golang users from vastly different domains need it, and it is a feature of the gzip specification, it seems worthy of consideration.

    Yes, I agree, and seeing that Russ is positive about this is a good thing.

    > Can you point to the previous design discussion? One thing Dan Krech didn't mention in his previous email is we have a working implementation if others are interested in seeing it.

    There was really very little discussion and it was at the thread that Daniel linked. I'd be interested to see your implementation.

    Dan

  • Dan Kortschak at Sep 26, 2013 at 8:49 pm
    Just to extend on this: given the way the gzip Reader wraps its input in a bufio.Reader when it is given something that is not a flate.Reader, this is not really something that the gzip Reader can know, unless you ensure that you wrap the initial reader in something that keeps its position, which can already be done.

    Dan
    On 27/09/2013, at 12:02 AM, "Ed Summers" wrote:

    > One thing your implementation did not do for us was provide the byte offsets for where members began in the compressed file. Were those known to you out of band? We needed to make a small addition to compress/flate to get access to this.
  • Russ Cox at Sep 26, 2013 at 4:01 pm

    On Wed, Sep 25, 2013 at 4:57 PM, Daniel Krech wrote:

    > We would like to propose a small addition to gzip that would introduce a MemberReader, which would expose when the end of a member has been reached as well as the byte offset position in the underlying compressed data.

    Hi. We are in a feature freeze for the upcoming Go 1.2 release, but now
    that this has come up in two different contexts I think it is worth
    considering for Go 1.3.

    As one possible way to do this, I have included a patch to compress/gzip
    below that adds a StopAtBoundary method to instruct a reader to stop with
    EOF at every boundary. Given that patch, you can write a program that
    retrieves the information you want by tracking the offset in a custom
    reader. And Dan, the patch also saves the extra information on each header
    read in that mode.

    I'm not saying it will be in this exact form, but it's something to use for
    now.

    I've created golang.org/issue/6486. If you star it you will get email
    updates about the issue.

    Russ

    g% hg diff .
    diff -r 351b6fe0ae36 src/pkg/compress/gzip/gunzip.go
    --- a/src/pkg/compress/gzip/gunzip.go	Tue Sep 24 15:54:48 2013 -0400
    +++ b/src/pkg/compress/gzip/gunzip.go	Thu Sep 26 11:57:58 2013 -0400
    @@ -74,6 +74,8 @@
     	flg byte
     	buf [512]byte
     	err error
    +	stop bool
    +	hdr bool
     }

     // NewReader creates a new Reader reading the given reader.
    @@ -89,6 +91,10 @@
     	return z, nil
     }

    +func (z *Reader) StopAtBoundary(stop bool) {
    +	z.stop = stop
    +}
    +
     // GZIP (RFC 1952) is little-endian, unlike ZLIB (RFC 1950).
     func get4(p []byte) uint32 {
     	return uint32(p[0]) | uint32(p[1])<<8 | uint32(p[2])<<16 | uint32(p[3])<<24
    @@ -200,10 +206,15 @@
     	if z.err != nil {
     		return 0, z.err
     	}
    +	if z.hdr {
    +		if err := z.resetHeader(); err != nil {
    +			return 0, err
    +		}
    +	}
     	if len(p) == 0 {
     		return 0, nil
     	}
    -
    +
     	n, err = z.decompressor.Read(p)
     	z.digest.Write(p[0:n])
     	z.size += uint32(n)
    @@ -224,16 +235,25 @@
     		return 0, z.err
     	}

    -	// File is ok; is there another?
    -	if err = z.readHeader(false); err != nil {
    +	z.hdr = true
    +	if z.stop {
    +		return 0, io.EOF
    +	}
    +	return z.Read(p)
    +}
    +
    +func (z *Reader) resetHeader() error {
    +	// Is there another header?
    +	if err := z.readHeader(z.stop); err != nil {
     		z.err = err
    -		return
    +		return err
     	}

     	// Yes. Reset and read from it.
     	z.digest.Reset()
     	z.size = 0
    -	return z.Read(p)
    +	z.hdr = false
    +	return nil
     }

     // Close closes the Reader. It does not close the underlying io.Reader.
    g% cat x.go
    package main

    import (
    	"bufio"
    	"compress/gzip"
    	"fmt"
    	"io"
    	"io/ioutil"
    	"log"
    	"os"
    )

    type byteCounter struct {
    	r      *bufio.Reader
    	offset int64
    }

    func (b *byteCounter) Read(p []byte) (int, error) {
    	n, err := b.r.Read(p)
    	b.offset += int64(n)
    	return n, err
    }

    func (b *byteCounter) ReadByte() (byte, error) {
    	c, err := b.r.ReadByte()
    	if err == nil {
    		b.offset++
    	}
    	return c, err
    }

    func main() {
    	f, err := os.Open("test.gz")
    	if err != nil {
    		log.Fatal(err)
    	}
    	bc := &byteCounter{r: bufio.NewReader(f)}
    	gz, err := gzip.NewReader(bc)
    	if err != nil {
    		log.Fatal(err)
    	}
    	gz.StopAtBoundary(true)
    	var off int64
    	for {
    		n, err := io.Copy(ioutil.Discard, gz)
    		if err != nil {
    			log.Fatalf("@%d: %d bytes + error: %v", off, n, err)
    		}
    		if off == bc.offset {
    			fmt.Printf("@%d: EOF\n", off)
    			break
    		}
    		fmt.Printf("@%d: %d bytes uncompressed\n", off, n)
    		off = bc.offset
    	}
    }
    g% go run x.go
    @0: 4892 bytes uncompressed
    @1556: 1989 bytes uncompressed
    @2548: 1238 bytes uncompressed
    @3174: EOF
    g%

  • Daniel Krech at Sep 26, 2013 at 7:54 pm

    On Sep 26, 2013, at 12:01 PM, Russ Cox wrote:

    > As one possible way to do this, I have included a patch to compress/gzip below that adds a StopAtBoundary method to instruct a reader to stop with EOF at every boundary. Given that patch, you can write a program that retrieves the information you want by tracking the offset in a custom reader. And Dan, the patch also saves the extra information on each header read in that mode.
    >
    > I'm not saying it will be in this exact form, but it's something to use for now.
    >
    > I've created golang.org/issue/6486. If you star it you will get email updates about the issue.

    We had been going down the path of exposing the compressed input offset in compress/flate that you mention in the ticket. We thought we had to, so that buffering would not obscure the offsets of the boundaries. We have switched to the custom reader approach and see that one can navigate around the buffering. But I do not think this approach is readily apparent without digging into the implementation of gzip, and I think it depends on the implementation of gzip's makeReader and deflate. It would be great if the offset functionality could also be pushed down into gzip (and flate, as you already mentioned), so one could call gz.Offset() when stopping at boundaries.

    Thank you for opening the issue and we look forward to helping see the issue through.


  • Dan Kortschak at Oct 8, 2013 at 11:22 am
    How can a caller know that an EOF is an EOF rather than an end of member with this approach? You expect the return from an end of file EOF to return a byte count of 0, but you will also see this with an empty member, so a (0, io.EOF) return is not definitive.

    Dan

  • Russ Cox at Oct 14, 2013 at 1:42 pm

    On Tue, Oct 8, 2013 at 7:22 AM, Dan Kortschak wrote:

    > How can a caller know that an EOF is an EOF rather than an end of member
    > with this approach? You expect the return from an end of file EOF to return
    > a byte count of 0, but you will also see this with an empty member, so a
    > (0, io.EOF) return is not definitive.

    In the code I posted, the caller knows it reached EOF because the read
    offset did not advance.

  • Dan Kortschak at Oct 14, 2013 at 10:47 pm

    On Mon, 2013-10-14 at 09:42 -0400, Russ Cox wrote:

    > In the code I posted, the caller knows it reached EOF because the read
    > offset did not advance.

    Missed that, sorry. Could also just keep the last n from the internal
    Read(). Thanks.

  • Kamil Dziedzic at Aug 25, 2014 at 6:26 pm
    Hi all,

    How did this end up? Is there still no way to read gzip/bzip2 chunks in
    parallel?

    > I have a gzip reader that does this at
    > code.google.com/p/biogo.bam/bgzf/egzip.

    I can't see this package. Is it no longer supported?


    Kind Regards, Kamil Dziedzic

  • Dan Kortschak at Aug 25, 2014 at 8:38 pm
    Not yet. I'm waiting on changes to the compress/gzip API that are being thought about by rsc.

    > How did this end up? Is there still no way to read gzip/bzip2 chunks in parallel?
    >
    > > I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.
    >
    > I can't see this package. Is it no longer supported?

    No, the changes that were in egzip to handle seeking are no longer necessary. Have a look in .../biogo.bam/bgzf to see how the standard gzip package can be used to seek (part of the requirement for parallel reading).


Discussion Overview
group: golang-dev
categories: go
posted: Sep 25, '13 at 8:57p
active: Aug 25, '14 at 8:38p
posts: 12
users: 5
website: golang.org