hi there,

I am trying to come up with a binary file format to store GiB (well,
TiB) worth of data.
One particularly important issue is to be able to inspect a file's
content and know:
  - for a given "folder" (or rather, "block") what is the type of data
being stored
  - for a given "folder" what is the type of the content being stored

so if the user gives the file-reader layer a pointer to a value to be
read into, the unmarshaling can check that the types match.
and, if the file-reader layer is given a different type but with a
matching layout, the deserialization can still take place.
and, if the file-reader layer is given no type, it can still display
values thanks to the metadata about the type's layout stored with
the "folder".

How would you go about it?

- I know the "gob" (de)serialization package is self-describing but
each gob-stream needs to be primed with the type descriptors, so I
would need to store this information for each file "block" even when
"block-1" and "block-2" contain data of the same type (and I need to
be able to seek to block-1,..-n or just read block-m so I can't lump
everything together)
also, I need to be able to tell the type's layout just from looking at
some block header (and I don't think "gob" exposes that type of
information)

- "protobuf" is also a self-describing format (if you serialize the
TypeDescriptors) but it's a pain to use in a pure-go setup.

- "hdf5" is too slow (especially from Go, via the cgo gateway.)
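
To make the "gob" point above concrete, here is a minimal sketch that measures the cost of priming. It assumes a hypothetical payload type `Hit` (not from any real package) and compares the size of a value written into an already-primed stream with the size of the same value written into a fresh stream, which is what one independent stream per block would require.

```go
package main

import (
	"bytes"
	"encoding/gob"
)

// Hit is a hypothetical payload type, used only for illustration.
type Hit struct {
	ID     int64
	Energy float64
}

// encodedSizes encodes v twice into one gob stream and once into a fresh
// stream. It returns the byte count of the first message in the shared
// stream (type descriptor + value), of the second message in that stream
// (value only), and of the single message in the fresh stream.
func encodedSizes(v Hit) (first, second, fresh int, err error) {
	var shared bytes.Buffer
	enc := gob.NewEncoder(&shared)
	if err = enc.Encode(v); err != nil {
		return 0, 0, 0, err
	}
	first = shared.Len()
	if err = enc.Encode(v); err != nil {
		return 0, 0, 0, err
	}
	second = shared.Len() - first

	// A block that must be readable on its own needs its own Encoder,
	// so the type descriptor is written again.
	var own bytes.Buffer
	if err = gob.NewEncoder(&own).Encode(v); err != nil {
		return 0, 0, 0, err
	}
	fresh = own.Len()
	return first, second, fresh, nil
}
```

The second message in the shared stream is smaller than the message in the fresh stream, because an Encoder sends each type descriptor only once per stream. That difference is the per-block overhead described above.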

did I forget a very useful (de)serialization package out there?

-s

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

  • Konstantin Khomoutov at Mar 3, 2015 at 10:40 am

    On Tue, 3 Mar 2015 11:07:00 +0100 Sebastien Binet wrote:

    I am trying to come up with a binary file format to store GiB (well,
    TiB) worth of data.
    One particularly important issue is to be able to inspect a file's
    content and know:
    - for a given "folder" (or rather, "block") what is the type of data
    being stored
    - for a given "folder" what is the type content being stored [...]
    did I forget a very useful (de)serialization package out there?
    Yes, a hand-crafted simple TLV format [1].

    1. http://en.wikipedia.org/wiki/Type-length-value
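
A hand-crafted TLV record could look like the sketch below. The layout is an assumption made for illustration (2-byte tag, 4-byte little-endian length, then the value), and the names `appendTLV` and `parseTLV` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// headerSize is the size of the assumed header: 2-byte tag + 4-byte length.
const headerSize = 6

// appendTLV appends one record (tag, length, value) to dst.
// Values longer than 4 GiB are not handled in this sketch.
func appendTLV(dst []byte, tag uint16, value []byte) []byte {
	var hdr [headerSize]byte
	binary.LittleEndian.PutUint16(hdr[0:2], tag)
	binary.LittleEndian.PutUint32(hdr[2:6], uint32(len(value)))
	dst = append(dst, hdr[:]...)
	return append(dst, value...)
}

// parseTLV decodes the first record in src and returns the remaining bytes.
// A reader that does not know a tag can still skip the record, because the
// length is stored in the header.
func parseTLV(src []byte) (tag uint16, value, rest []byte, err error) {
	if len(src) < headerSize {
		return 0, nil, nil, errors.New("tlv: short header")
	}
	tag = binary.LittleEndian.Uint16(src[0:2])
	n := binary.LittleEndian.Uint32(src[2:6])
	if uint64(n) > uint64(len(src)-headerSize) {
		return 0, nil, nil, errors.New("tlv: truncated value")
	}
	end := headerSize + int(n)
	return tag, src[headerSize:end], src[end:], nil
}
```

The tag only identifies the type; what the bytes of each tagged value mean (including tricky cases such as structs with interface fields) still has to be defined by the format.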

  • Sebastien Binet at Mar 3, 2015 at 12:50 pm

    On Tue, Mar 3, 2015 at 11:40 AM, Konstantin Khomoutov wrote:
    Yes, a hand-crafted simple TLV format [1].

    1. http://en.wikipedia.org/wiki/Type-length-value
    ok, so I am on the right track (I was considering CBOR.)
    there are a few interesting challenges though (such as handling
    structs with interface fields), hence my question about a battle-tested
    package that already handles all these tricky issues.

    -s

  • Manlio Perillo at Mar 3, 2015 at 10:52 am
    On Tuesday, 3 March 2015 at 11:08:05 UTC+1, Sebastien Binet wrote:
    [...]
    - I know the "gob" (de)serialization package is self-describing but
    each gob-stream needs to be primed with the type descriptors, so I
    would need to store this information for each file "block" even when
    "block-1" and "block-2" contain data of the same type (and I need to
    be able to seek to block-1,..-n or just read block-m so I can't lump
    everything together)
    also, I need to be able to tell the type's layout just from looking at
    some block header (and I don't think "gob" exposes that type of
    information)
    The FIT file format:
    http://www.thisisant.com/assets/resources/FIT/FitSDKRelease_14.00.zip

    does something that you may reuse.
    It is a record-based format, for data read by fitness sensors (like heart
    rate monitors).

    Each record has a Definition Message, which defines the number of fields
    and the type of each field in the record.
    The Definition Message is specified only once, each time the type of record
    being serialized changes.

    The Data Message just holds the serialized data, which can be parsed using
    the information from a previous Definition Message.
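
The definition/data split described above can be sketched as follows. This is not the FIT wire format: the `FieldKind` values, the `Definition` type and the little-endian encoding are assumptions chosen to keep the example short.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// FieldKind is a hypothetical type code stored in a definition message.
type FieldKind uint8

const (
	KindUint32 FieldKind = iota + 1
	KindFloat64
)

// size returns the encoded size in bytes, or 0 for an unknown kind.
func (k FieldKind) size() int {
	switch k {
	case KindUint32:
		return 4
	case KindFloat64:
		return 8
	}
	return 0
}

// Field describes one field of a record.
type Field struct {
	Name string
	Kind FieldKind
}

// Definition plays the role of a definition message: it is written once
// and describes every data message that follows it.
type Definition []Field

// Decode parses one data message, which contains only raw field values,
// using the layout given by the definition.
func (d Definition) Decode(data []byte) (map[string]interface{}, error) {
	out := make(map[string]interface{}, len(d))
	for _, f := range d {
		n := f.Kind.size()
		if n == 0 {
			return nil, fmt.Errorf("unknown kind %d for field %q", f.Kind, f.Name)
		}
		if len(data) < n {
			return nil, fmt.Errorf("data message too short for field %q", f.Name)
		}
		switch f.Kind {
		case KindUint32:
			out[f.Name] = binary.LittleEndian.Uint32(data)
		case KindFloat64:
			out[f.Name] = math.Float64frombits(binary.LittleEndian.Uint64(data))
		}
		data = data[n:]
	}
	return out, nil
}
```

Because the reader needs nothing but the definition to decode a data message, it can display values without having the original Go type at hand.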


    What type of data do you need to handle?

  • Sebastien Binet at Mar 3, 2015 at 12:43 pm

    On Tue, Mar 3, 2015 at 11:52 AM, Manlio Perillo wrote:
    The FIT file format:
    http://www.thisisant.com/assets/resources/FIT/FitSDKRelease_14.00.zip

    does something that you may reuse.
    FITS has a bunch of limitations:
      http://www.adass2014.org/presentations/B1.pdf
      http://tinyurl.com/acfits-draft-pdf

    but, yeah... FITS.
    I actually wrote a cgo-based FITS package:
      http://fits.gsfc.nasa.gov/fits_libraries.html#gocfitsio
      github.com/astrogo/cfitsio

    as well as a pure-go one:
      http://fits.gsfc.nasa.gov/fits_libraries.html#gofitsio
      github.com/astrogo/fitsio
    What type of data do you need to handle?
    high energy physics data (the kind of stuff spit out of the Large
    Hadron Collider, CERN)
    well, if I manage to convince (a sizeable cluster of) people that Go is
    (mostly) a suitable replacement for C++/Python :)

    -s

  • Sebastien Binet at Mar 3, 2015 at 12:51 pm

    On Tue, Mar 3, 2015 at 1:43 PM, Sebastien Binet wrote:
    FITS has a bunch of limitations:
    http://www.adass2014.org/presentations/B1.pdf
    http://tinyurl.com/acfits-draft-pdf

    but, yeah... FITS.
    I actually wrote a cgo-based FITS package:
    http://fits.gsfc.nasa.gov/fits_libraries.html#gocfitsio
    github.com/astrogo/cfitsio

    as well as a pure-go one:
    http://fits.gsfc.nasa.gov/fits_libraries.html#gofitsio
    github.com/astrogo/fitsio
    ah. you (really) meant FIT. not FITS.

    -s

  • Manlio Perillo at Mar 3, 2015 at 5:01 pm

    On Tuesday, 3 March 2015 at 13:43:57 UTC+1, Sebastien Binet wrote:

    [...]
    The FIT file format:
    http://www.thisisant.com/assets/resources/FIT/FitSDKRelease_14.00.zip

    does something that you may reuse.
    It is a record based format, for data read by fitness sensors (like Heart
    Rate monitors).
    FITS has a bunch of limitations:
    Note that I was referring to FIT (defined by the ANT+ group), not FITS.
    [...]
    What type of data do you need to handle?
    high energy physics data (the kind of stuff spit out of the Large
    Hadron Collider, CERN)
    well, if I manage to convince (a sizeable cluster of) people that Go is
    (mostly) a suitable replacement for C++/Python :)
    I don't know how this kind of data is organized.
    As you wrote, it seems you want the data to be serialized in blocks, and
    AFAIK, you want each block to be easily identified in the file.

    In this case one possible solution is to define the binary format as a
    sequence of blocks.
    Each block has a header that defines the length, type and kind of data
    stored (as in the FIT file format).
    Using padding and a magic string at the beginning of each block, you may be
    able to align blocks on a fixed size, so that
    seeking is possible without reading the entire file. Seeking should not
    need to be exact, but it must be possible to find block boundaries.

    Following the block header, there is a sequence of records (if your data
    can be stored in records).
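
The block layout suggested above could be sketched like this. The magic string, the 64-byte alignment and the 16-byte header (magic, kind, payload length) are all assumptions made for the example; a real format would choose its own values.

```go
package main

import (
	"bytes"
	"encoding/binary"
)

const (
	blockAlign = 64 // hypothetical alignment of block starts, in bytes
	headerSize = 16 // magic (4) + kind (4) + payload length (8)
)

// magic is a hypothetical marker written at the start of every block.
var magic = []byte("BLK\x00")

// appendBlock pads dst up to the next aligned offset, then appends a
// block header followed by the payload.
func appendBlock(dst []byte, kind uint32, payload []byte) []byte {
	if rem := len(dst) % blockAlign; rem != 0 {
		dst = append(dst, make([]byte, blockAlign-rem)...)
	}
	var hdr [headerSize]byte
	copy(hdr[0:4], magic)
	binary.LittleEndian.PutUint32(hdr[4:8], kind)
	binary.LittleEndian.PutUint64(hdr[8:16], uint64(len(payload)))
	dst = append(dst, hdr[:]...)
	return append(dst, payload...)
}

// nextBlock returns the offset of the first block starting at or after
// off (off must be non-negative), or -1 if there is none. Only aligned
// offsets are probed, so an inexact seek can still recover a boundary.
func nextBlock(file []byte, off int) int {
	off = (off + blockAlign - 1) / blockAlign * blockAlign
	for ; off+headerSize <= len(file); off += blockAlign {
		if bytes.Equal(file[off:off+4], magic) {
			return off
		}
	}
	return -1
}
```

One caveat of this sketch: a payload that happens to contain the magic bytes at an aligned offset would be mistaken for a block start, so a real format would likely add a header checksum or a block index.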



    Regards Manlio

  • Sebastien Binet at Mar 4, 2015 at 1:38 pm

    On Tue, Mar 3, 2015 at 6:01 PM, Manlio Perillo wrote:
    On Tuesday, 3 March 2015 at 13:43:57 UTC+1, Sebastien Binet wrote:
    [...]
    The FIT file format:
    http://www.thisisant.com/assets/resources/FIT/FitSDKRelease_14.00.zip

    does something that you may reuse.
    It is a record-based format, for data read by fitness sensors (like
    heart rate monitors).

    FITS has a bunch of limitations:

    Note that I was referring to FIT (defined by the ANT+ group), not FITS.
    yep.
    I read your FIT format specification and it is somewhat reassuring
    "my" format is based on the same concepts:
    https://github.com/go-hep/rio
    http://www-sldnt.slac.stanford.edu/nld/new/Docs/FileFormats/sio.pdf

    thanks for the input(s.)
    I guess the bottom line is: there's no pure-Go pre-packaged facility
    (yet?) to store type descriptors (and only type-descriptors.)
    that is, short of forking the interesting bits off "gob."
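
For what it's worth, a bare-bones version of such a type-descriptor facility can be sketched with the standard "reflect" package. Everything here is hypothetical (`FieldDesc`, `Describe`, `SameLayout` and the example types); it only handles flat structs and compares layouts by field kind, ignoring names.

```go
package main

import (
	"fmt"
	"reflect"
)

// FieldDesc is a hypothetical, storable description of one struct field.
type FieldDesc struct {
	Name string
	Kind string // e.g. "int32", "float64"
}

// Describe flattens a struct (or a pointer to one) into field descriptors.
// Nested structs, slices and interface fields are reported by kind only.
func Describe(v interface{}) ([]FieldDesc, error) {
	t := reflect.TypeOf(v)
	for t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("describe: %T is not a struct", v)
	}
	descs := make([]FieldDesc, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		descs = append(descs, FieldDesc{Name: f.Name, Kind: f.Type.Kind().String()})
	}
	return descs, nil
}

// SameLayout reports whether two descriptors list the same kinds in the
// same order, regardless of field names.
func SameLayout(a, b []FieldDesc) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Kind != b[i].Kind {
			return false
		}
	}
	return true
}

// Example types (hypothetical): Track and Point share a layout, Wide does not.
type Track struct {
	ID int32
	Pt float64
}

type Point struct {
	N int32
	X float64
}

type Wide struct {
	ID int64
	Pt float64
}
```

A descriptor like this could be stored once per block header, then used either to check a user-supplied type with `SameLayout` or to print values when no type is given.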

    -s

    PS: go-hep/rio is labelled as "FAILING" by drone.io because it's using
    pieces from go-1.4 and the drone.io buildlets haven't yet (?) been migrated
    to that version (still lagging at go-1.2; see:
    https://github.com/drone/drone/issues/878)

  • Manlio Perillo at Mar 4, 2015 at 3:40 pm

    On Wed, Mar 4, 2015 at 2:38 PM, Sebastien Binet wrote:

    On Tue, Mar 3, 2015 at 6:01 PM, Manlio Perillo wrote:
    [...]
    yep.
    I read your FIT format specification and it is somewhat reassuring
    "my" format is based on the same concepts:
    https://github.com/go-hep/rio
    http://www-sldnt.slac.stanford.edu/nld/new/Docs/FileFormats/sio.pdf

    thanks for the input(s.)
    I guess the bottom line is: there's no pure-Go pre-packaged facility
    (yet?) to store type descriptors (and only type-descriptors.)
    that is, short of forking the interesting bits off "gob."
    A recent protocol is
    https://capnproto.org/
    and, more in detail:
    https://capnproto.org/encoding.html#lists

    The documentation says that you can do both incremental reads and random
    access.

    However if you need to store numeric arrays it may not be the best solution.
    The main feature of capnproto (that you can try to reuse) is to avoid extra
    memory allocations, if possible.

    Regards Manlio


Discussion Overview
group: golang-nuts
categories: go
posted: Mar 3, '15 at 10:07a
active: Mar 4, '15 at 3:40p
posts: 9
users: 3
website: golang.org
