Hello,

I have completed a draft revision of a set of types intended to be used
to represent protein and nucleic acid sequence types to replace the
initial implementation currently used in biogo.

I would be grateful if some people could have a look at what I have and
give me some feedback. I'm not completely happy with the design at this
stage, though it is significantly improved from the initial write of it.

The motivation of the set of packages is to provide support for a
variety of sequence types that can be used in a type safe manner
(preferably at compile time but aggressively at runtime if that is not
possible - i.e. panicking if types mismatch) or can be used generically
via interfaces.

Types are: single sequences and column-based multiple sequences with or
without associated letter quality data (4 types for each of nucleic acid
(NA) and protein), row-based aligned and unaligned multiple sequences (2
types for each). Column-based multiple sequence types are for analysis
of alignments, while row-based are for alignment construction.

There is a fair amount of code [1] to look at, so I'll ask specific
questions/raise specific issues:

1. alphabet.Letter is just a byte. Making it a type adds typing and
conversion costs but was intended to make letters and bytes
distinct and allow letters to have methods (it seemed like a
good idea at the time).
2. NA sequences can be reverse complemented so there needs to be a
distinction between these and protein sequences. At the moment
this is done by types at runtime. This results in massive code
duplication, so I'd like to see if there is another way to
achieve this that does not have this cost, but preferably still
allows compile time type safety (the alternative seems to be to
panic if RevComp is called on a protein). NB RevComp is a method
unlike other manipulations for performance reasons - avoids
calling At/Set on each position.
3. Because sequences can be single or multiple, indexing into
sequences is via the seq.Position type. This is wasteful for the
majority of use cases which are expected to be single sequences
but required for a uniform interface across all sequences. Is
this uniformity achievable without this?
4. At/Set methods take and return a quality-qualified letter
(alphabet.QLetter) for all sequence types, even if per-position
qualities are not available (a default quality is returned in
these cases). Again this is for interface uniformity.
5. I'm aware that some of my interface definitions are fairly big.
What is the view on this?
6. The types I'm least happy with are the two row-based
multiple aligned sequence types (.../*multi.Multi). Ideally
these should be able to be used in the place of other sequence
types, but if that is the case multiple sequences can be stored
in multiple sequences, which while not necessarily incorrect
leads to unwanted complexity.
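
One possible way to keep the compile-time check asked about in question 2 without duplicating every type is to embed the shared code and put RevComp in its own small interface. This is only a sketch under assumptions: the names Sequence, Complementer, letters, NucleicSeq and ProteinSeq are hypothetical and are not taken from biogo.

```go
package main

import "fmt"

// Letter stands in for alphabet.Letter: a distinct type over byte.
type Letter byte

// Sequence is a hypothetical minimal interface shared by every sequence type.
type Sequence interface {
	Len() int
	At(i int) Letter
}

// Complementer is a hypothetical interface that only nucleic acid types satisfy.
type Complementer interface {
	Sequence
	RevComp()
}

// letters holds the code shared by both kinds of sequence.
type letters []Letter

func (s letters) Len() int        { return len(s) }
func (s letters) At(i int) Letter { return s[i] }
func (s letters) String() string {
	b := make([]byte, len(s))
	for i, l := range s {
		b[i] = byte(l)
	}
	return string(b)
}

// Both types embed the shared code; only NucleicSeq adds RevComp.
type NucleicSeq struct{ letters }
type ProteinSeq struct{ letters }

// complement returns the complementary base, or l unchanged if it is not A, C, G or T.
func complement(l Letter) Letter {
	switch l {
	case 'A':
		return 'T'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'T':
		return 'A'
	}
	return l
}

// RevComp reverse complements in place, working on the slice directly
// rather than calling At/Set for each position.
func (s NucleicSeq) RevComp() {
	l := s.letters
	for i, j := 0, len(l)-1; i <= j; i, j = i+1, j-1 {
		l[i], l[j] = complement(l[j]), complement(l[i])
	}
}

// newLetters copies a string into a letters value.
func newLetters(s string) letters {
	l := make(letters, len(s))
	for i := 0; i < len(s); i++ {
		l[i] = Letter(s[i])
	}
	return l
}

func main() {
	var c Complementer = NucleicSeq{letters: newLetters("ACGTT")}
	c.RevComp()
	fmt.Println(c)

	// Assigning a ProteinSeq to a Complementer would be rejected by the
	// compiler because ProteinSeq has no RevComp method.
	var s Sequence = ProteinSeq{letters: newLetters("MKV")}
	fmt.Println(s.Len())
}
```

The trade-off is that generic code holding only a Sequence needs a type assertion to Complementer before it can reverse complement.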

Generally: Documentation needs improvement and tests are barely/almost
adequate. Any other input welcomed.

Sorry for the length of this post.

Dan

[1]http://code.google.com/p/biogo/source/browse/exp?name=development

--


  • Egon at Nov 15, 2012 at 11:13 am
    I took a quick look over it. (So given more time I may come to different
    ideas/conclusions/suggestions)

    1. casting to alphabet.Letter is an insignificant overhead. But don't use
    unsafe for casting - it prevents it from being used on AppEngine.
    2. Maybe just make RevComp return an error.
    3. Alternatively you can use variadic index or just two different methods.
    4. It may be reasonable to unify the whole lettering by only using QLetter
    - it should amount to a lot of code reduction (at the expense of memory
    use).
    5. Large interfaces may indicate too many responsibilities. For example in
    Seq/QSeq you can remove the encoding responsibility - you can just use the
    encoder to encode the sequence instead.
    6. Allow any sequence in Multi, so the approach is fine. It probably won't
    cause any unwanted complexity - and the alternative would be less flexible.


    Other thoughts while going over the code:

    * Letters, QLetters into different files
    * it's probably better to use Go's testing package.
    ** testing/quick has some nice ways for randomized testing
    * Encoding can be also written as:

    type Encoding struct {
        DecodeToQPhred  func(q byte) Qphred
        DecodeToQsolexa func(q byte) Qsolexa
    }

    // Sanger decodes Phred+33 quality bytes.
    var Sanger = &Encoding{
        DecodeToQPhred:  func(q byte) Qphred { return Qphred(q) - 33 },
        DecodeToQsolexa: func(q byte) Qsolexa { return (Qphred(q) - 33).Qsolexa() },
    }
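
To make the struct-of-functions idea concrete, here is a self-contained sketch that compiles on its own. The Qphred and Qsolexa types below are local stand-ins, not the biogo definitions, and the Phred to Solexa conversion (including the clamp at a Phred score of 0) is an assumption based on the usual formula Qsolexa = 10*log10(10^(Qphred/10) - 1).

```go
package main

import (
	"fmt"
	"math"
)

// Qphred and Qsolexa are local stand-ins for the quality score types.
type Qphred uint8
type Qsolexa int8

// Qsolexa converts a Phred score to a Solexa score.
// Assumption: a Phred score of 0 is clamped to -5, since the formula
// is undefined there.
func (q Qphred) Qsolexa() Qsolexa {
	if q == 0 {
		return -5
	}
	return Qsolexa(math.Round(10 * math.Log10(math.Pow(10, float64(q)/10)-1)))
}

// Encoding bundles the decoding functions for one quality encoding.
type Encoding struct {
	DecodeToQPhred  func(q byte) Qphred
	DecodeToQsolexa func(q byte) Qsolexa
}

// Sanger decodes Phred+33 quality bytes.
var Sanger = &Encoding{
	DecodeToQPhred:  func(q byte) Qphred { return Qphred(q) - 33 },
	DecodeToQsolexa: func(q byte) Qsolexa { return (Qphred(q) - 33).Qsolexa() },
}

func main() {
	// 'I' is ASCII 73, so it decodes to a Phred score of 40.
	fmt.Println(Sanger.DecodeToQPhred('I'), Sanger.DecodeToQsolexa('I'))
}
```

Each call goes through a function value stored in the struct, so whether this is cheaper than a method call or a type switch would need benchmarking.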

    + egon
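
As an illustration of the testing/quick suggestion, the sketch below checks one property with randomly generated input: reverse complementing twice gives back the original sequence. The complement and revComp functions are hypothetical helpers over plain bytes, not biogo code, and in a real package the check would normally live in a _test.go file rather than in main.

```go
package main

import (
	"bytes"
	"fmt"
	"testing/quick"
)

// complement returns the complementary base, or b unchanged if it is not A, C, G or T.
func complement(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'T':
		return 'A'
	}
	return b
}

// revComp returns a new slice holding the reverse complement of s.
func revComp(s []byte) []byte {
	out := make([]byte, len(s))
	for i, b := range s {
		out[len(s)-1-i] = complement(b)
	}
	return out
}

// involution is the property under test: revComp applied twice is the identity.
func involution(s []byte) bool {
	return bytes.Equal(revComp(revComp(s)), s)
}

func main() {
	// quick.Check calls involution with randomly generated byte slices.
	if err := quick.Check(involution, nil); err != nil {
		fmt.Println("property failed:", err)
		return
	}
	fmt.Println("property held for all generated inputs")
}
```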

  • Dan Kortschak at Nov 16, 2012 at 1:37 am
    Thanks Egon.
    On Thu, 2012-11-15 at 03:12 -0800, egon wrote:
    > I took a quick look over it. (So given more time I may come to different
    > ideas/conclusions/suggestions)
    >
    > 1. casting to alphabet.Letter is an insignificant overhead. But don't use
    > unsafe for casting - it prevents it from being used on AppEngine.

    Sorry, should have been clearer - keyboard typing, not variable typing.
    Note that conversion between []byte and []alphabet.Letter requires
    unsafe unless a loop copy operation is done - this is too expensive.
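
The two conversion options mentioned here can be sketched as follows. Letter is a local stand-in for alphabet.Letter, and copyLetters and castLetters are hypothetical helper names. The unsafe version relies on []byte and []Letter having the same memory layout.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Letter is a local stand-in for alphabet.Letter.
type Letter byte

// copyLetters converts by allocating a new slice and copying each element.
// It is safe but costs O(n) time and memory.
func copyLetters(b []byte) []Letter {
	l := make([]Letter, len(b))
	for i, c := range b {
		l[i] = Letter(c)
	}
	return l
}

// castLetters reinterprets the slice header without copying.
// The result shares its backing array with b.
func castLetters(b []byte) []Letter {
	return *(*[]Letter)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("ACGT")
	c := copyLetters(b)
	u := castLetters(b)

	// Changing b is visible through the cast slice but not through the copy.
	b[0] = 'N'
	fmt.Printf("%c %c\n", c[0], u[0])
}
```

The shared backing array is what makes the cast cheap, and also what makes it easy to misuse.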

    > 2. Maybe just make RevComp return an error.

    Yeah, though I think that calling RevComp on a protein is a programmer
    error and people avoid checking returned errors too often and in this
    case that would silently be bad. I think a panic would be the way to go.
    This change massively simplifies the architecture of the packages.
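
A minimal sketch of the panic approach, assuming a single sequence type that records what kind of molecule it holds. The Moltype and Seq names and fields are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Moltype is a hypothetical tag recording what kind of molecule a sequence holds.
type Moltype int

const (
	DNA Moltype = iota
	Protein
)

// Seq is a single type used for both nucleic acid and protein sequences.
type Seq struct {
	Moltype Moltype
	Letters []byte
}

// complement returns the complementary base, or b unchanged if it is not A, C, G or T.
func complement(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'T':
		return 'A'
	}
	return b
}

// RevComp reverse complements s in place.
// It panics if s is not a nucleic acid, treating that as a programmer error.
func (s *Seq) RevComp() {
	if s.Moltype != DNA {
		panic("seq: RevComp called on a non-nucleic sequence")
	}
	l := s.Letters
	for i, j := 0, len(l)-1; i <= j; i, j = i+1, j-1 {
		l[i], l[j] = complement(l[j]), complement(l[i])
	}
}

func main() {
	dna := &Seq{Moltype: DNA, Letters: []byte("ACGTT")}
	dna.RevComp()
	fmt.Println(string(dna.Letters))

	// The protein call panics; recover here only to show the message.
	defer func() { fmt.Println("recovered:", recover()) }()
	(&Seq{Moltype: Protein, Letters: []byte("MKV")}).RevComp()
}
```

The type check moves from the compiler to run time, which is the cost of collapsing the duplicated types into one.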

    > 3. Alternatively you can use variadic index or just two different methods.

    I think two methods is the way to go here otherwise I have overheads
    from getting the indices out of the slice. It also makes a nice
    distinction in the interfaces between single and multiple sequences:
    single have an extra method pair that is the direct access/set of
    letters while multiple sequences have only the 2D access/set methods.
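
The split described here might look something like the sketch below. The interface names, the method names AtPos/SetPos, and the Position field names are all assumptions for illustration and are not the biogo API.

```go
package main

import "fmt"

// Letter is a local stand-in for alphabet.Letter.
type Letter byte

// Position stands in for seq.Position; the field names are assumptions.
type Position struct{ Col, Row int }

// Sequence is the 2D access shared by all sequence types.
type Sequence interface {
	Rows() int
	Len() int
	AtPos(p Position) Letter
	SetPos(p Position, l Letter)
}

// Single adds direct access for the common single-sequence case.
type Single interface {
	Sequence
	At(i int) Letter
	Set(i int, l Letter)
}

// Linear is a single sequence.
type Linear []Letter

func (s Linear) Rows() int           { return 1 }
func (s Linear) Len() int            { return len(s) }
func (s Linear) At(i int) Letter     { return s[i] }
func (s Linear) Set(i int, l Letter) { s[i] = l }

// AtPos and SetPos accept only row 0 for a single sequence.
func (s Linear) AtPos(p Position) Letter {
	if p.Row != 0 {
		panic("seq: row out of range for single sequence")
	}
	return s[p.Col]
}

func (s Linear) SetPos(p Position, l Letter) {
	if p.Row != 0 {
		panic("seq: row out of range for single sequence")
	}
	s[p.Col] = l
}

// Multi is a row-based multiple sequence with only 2D access.
type Multi []Linear

func (m Multi) Rows() int { return len(m) }
func (m Multi) Len() int {
	if len(m) == 0 {
		return 0
	}
	return len(m[0])
}
func (m Multi) AtPos(p Position) Letter     { return m[p.Row][p.Col] }
func (m Multi) SetPos(p Position, l Letter) { m[p.Row][p.Col] = l }

// Compile-time checks that the types satisfy the intended interfaces.
var (
	_ Single   = Linear(nil)
	_ Sequence = Multi(nil)
)

func main() {
	s := Linear{'A', 'C', 'G', 'T'}
	s.Set(0, 'N') // direct access, no Position needed
	m := Multi{s, Linear{'A', 'C', 'G', 'T'}}
	fmt.Printf("%c %c\n", s.At(0), m.AtPos(Position{Col: 0, Row: 1}))
}
```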

    > 4. It may be reasonable to unify the whole lettering by only using QLetter
    > - it should amount to a lot of code reduction (at the expense of memory
    > use).

    The footprint and performance hits are too big to justify unfortunately.

    > 5. Large interfaces may indicate too many responsibilities. For example in
    > Seq/QSeq you can remove the encoding responsibility - you can just use the
    > encoder to encode the sequence instead.

    I considered something like that, where a QSeq returns an
    alphabet.Encoding that can then be used, but this required giving the
    Encode method to alphabet.Encoding. This is attractive, but collides
    with the fact that Qsolexa and Qphred have their own Encode methods, so
    merging them onto Encoding would require a type switch, which will have
    a significant performance impact compared to the current situation (I
    wish solexa scores would just go away).

    > 6. Allow any sequence in Multi, so the approach is fine. It probably won't
    > cause any unwanted complexity - and the alternative would be less flexible.

    This is how I had it in the previous version, but the use of
    alphabet.Slice types in the current version make that much more
    complicated. The symmetry is attractive though, so I'll have to think
    about it more.

    > Other thoughts while going over the code:
    >
    > * Letters, QLetters into different files
    > * it's probably better to use Go's testing package.

    Indeed. I use gocheck normally, but was lazy here.

    > ** testing/quick has some nice ways for randomized testing

    I'll look into that.

    > * Encoding can be also written as:
    >
    > type Encoding struct {
    >     DecodeToQPhred  func(q byte) Qphred
    >     DecodeToQsolexa func(q byte) Qsolexa
    > }
    >
    > var Sanger = &Encoding{
    >     DecodeToQPhred:  func(q byte) Qphred { return Qphred(q) - 33 },
    >     DecodeToQsolexa: func(q byte) Qsolexa { return (Qphred(q) - 33).Qsolexa() },
    > }

    I'll think about that also.

    > + egon

    Thanks.
    Dan

    --
  • Egon at Nov 16, 2012 at 11:22 am

    On Friday, November 16, 2012 3:37:54 AM UTC+2, kortschak wrote:
    > Thanks Egon.
    > On Thu, 2012-11-15 at 03:12 -0800, egon wrote:
    >> I took a quick look over it. (So given more time I may come to different
    >> ideas/conclusions/suggestions)
    >>
    >> 1. casting to alphabet.Letter is an insignificant overhead. But don't use
    >> unsafe for casting - it prevents it from being used on AppEngine.
    >
    > Sorry, should have been clearer - keyboard typing, not variable typing.
    > Note that conversion between []byte and []alphabet.Letter requires
    > unsafe unless a loop copy operation is done - this is too expensive.
    >
    >> 2. Maybe just make RevComp return an error.
    >
    > Yeah, though I think that calling RevComp on a protein is a programmer
    > error and people avoid checking returned errors too often and in this
    > case that would silently be bad. I think a panic would be the way to go.
    > This change massively simplifies the architecture of the packages.

    Indeed. Also maybe remove RevComp from the Alphabet interface; that way you
    don't have to have interfaces alphabet.Nucleic/alphabet.Protein. Or keep it
    in both.

    >> 3. Alternatively you can use variadic index or just two different
    >> methods.
    >
    > I think two methods is the way to go here otherwise I have overheads
    > from getting the indices out of the slice. It also makes a nice
    > distinction in the interfaces between single and multiple sequences:
    > single have an extra method pair that is the direct access/set of
    > letters while multiple sequences have only the 2D access/set methods.
    >
    >> 4. It may be reasonable to unify the whole lettering by only using QLetter
    >> - it should amount to a lot of code reduction (at the expense of memory
    >> use).
    >
    > The footprint and performance hits are too big to justify unfortunately.

    When you keep two things you have a hit on ease of use and development
    time. The memory hit I assume is about 2x. In bioinformatics you probably
    should be able to deal with this, since either you are not using large
    datasets or you are using a large dataset that exceeds the memory anyways
    (and you need a better program structure). But every little helps, I guess.
    Not sure how large the performance hit is.

    Or maybe there's a possibility to hide whether you are using the Letter or
    QLetter and present a single interface and switch using a flag? Are there
    any programs that should need both simultaneously?
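
The "about 2x" memory estimate can be checked with a small sketch. The QLetter layout below (one letter byte plus one quality byte) is an assumption about the type, and letterBytes and qletterBytes are hypothetical helpers that count only the backing array, not the slice header.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Letter and Qphred are local stand-ins for the alphabet types.
type Letter byte
type Qphred uint8

// QLetter pairs a letter with its quality score (assumed layout).
type QLetter struct {
	L Letter
	Q Qphred
}

// letterBytes returns the backing array size of a []Letter of length n.
func letterBytes(n int) int { return n * int(unsafe.Sizeof(Letter(0))) }

// qletterBytes returns the backing array size of a []QLetter of length n.
func qletterBytes(n int) int { return n * int(unsafe.Sizeof(QLetter{})) }

func main() {
	const n = 1000000
	fmt.Printf("[]Letter: %d bytes, []QLetter: %d bytes\n",
		letterBytes(n), qletterBytes(n))
}
```

With this layout each QLetter occupies two bytes, so the ratio is exactly two; a wider quality type would raise it.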

    >> 5. Large interfaces may indicate too many responsibilities. For example in
    >> Seq/QSeq you can remove the encoding responsibility - you can just use the
    >> encoder to encode the sequence instead.
    >
    > I considered something like that, where a QSeq returns an
    > alphabet.Encoding that can then be used, but this required giving the
    > Encode method to alphabet.Encoding. This is attractive, but collides
    > with the fact that Qsolexa and Qphred have their own Encode methods, so
    > merging them onto Encoding would require a type switch, which will have
    > a significant performance impact compared to the current situation (I
    > wish solexa scores would just go away).

    I mean maybe the encoding can be a totally separate entity.
    "Qsolexa.Encode(seq)"

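One reading of "encoding as a totally separate entity" is an encoder value that takes quality scores and produces bytes, so the sequence types carry no encoding methods at all. The sketch below assumes that reading. The Encoder type and its Offset field are hypothetical, and only Phred scores are handled.

```go
package main

import "fmt"

// Qphred is a local stand-in for the Phred quality score type.
type Qphred uint8

// Encoder is a hypothetical standalone encoder defined by an ASCII offset.
type Encoder struct{ Offset byte }

// Sanger encodes Phred scores with an offset of 33.
var Sanger = Encoder{Offset: 33}

// Encode returns the encoded quality bytes for qs.
func (e Encoder) Encode(qs []Qphred) []byte {
	out := make([]byte, len(qs))
	for i, q := range qs {
		out[i] = byte(q) + e.Offset
	}
	return out
}

func main() {
	// Phred 40, 30 and 10 map to ASCII 73, 63 and 43.
	fmt.Printf("%s\n", Sanger.Encode([]Qphred{40, 30, 10}))
}
```

Supporting Solexa scores would need a second element type or a second encoder, which is the complication raised in the reply being quoted.
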
    >> 6. Allow any sequence in Multi, so the approach is fine. It probably won't
    >> cause any unwanted complexity - and the alternative would be less
    >> flexible.
    >
    > This is how I had it in the previous version, but the use of
    > alphabet.Slice types in the current version make that much more
    > complicated. The symmetry is attractive though, so I'll have to think
    > about it more.
    >
    >> Other thoughts while going over the code:
    >>
    >> * Letters, QLetters into different files
    >> * it's probably better to use Go's testing package.
    >
    > Indeed. I use gocheck normally, but was lazy here.
    >
    >> ** testing/quick has some nice ways for randomized testing
    >
    > I'll look into that.
    >
    >> * Encoding can be also written as:
    >>
    >> type Encoding struct {
    >>     DecodeToQPhred  func(q byte) Qphred
    >>     DecodeToQsolexa func(q byte) Qsolexa
    >> }
    >>
    >> var Sanger = &Encoding{
    >>     DecodeToQPhred:  func(q byte) Qphred { return Qphred(q) - 33 },
    >>     DecodeToQsolexa: func(q byte) Qsolexa { return (Qphred(q) - 33).Qsolexa() },
    >> }
    >
    > I'll think about that also.

    * Also what are the main differences between seq.nucleic and seq.protein -
    they look similar. It seems that if you put Annotation.Strand into some
    Metadata field and RevComp panics for protein they can be merged.

    >> + egon
    >
    > Thanks.
    > Dan
    --

Discussion Overview
group: golang-nuts
categories: go
posted: Nov 14, '12 at 11:29p
active: Nov 16, '12 at 11:22a
posts: 4
users: 2
website: golang.org

2 users in discussion

Egon: 2 posts Dan Kortschak: 2 posts
