Grokbase Groups R r-help March 2002
I frequently want to test for differences between animal size frequency
distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov
two sample test (provided in R as the function ks.test in package ctest).
The KS test is for continuous variables and this obviously includes length,
weight etc. However, limitations in measuring (e.g. length to the nearest
cm/mm, weight to the nearest g/mg etc.) have the obvious effect of
"discretising" real data.

The ks.test function checks for the presence of ties noting in the help page
that "continuous distributions do not generate them". Given the problem of
"measuring to the nearest..." noted above I frequently find that my data has
ties and ks.test generates a warning.
I was interested to note that the example of a two-sample KS test given in
Sokal & Rohlf's "Biometry" (I have the 2nd edition where the example is on
p.441) has exactly the same problem:

R> A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
R> B <- c(100,105,107,107,108,111,116,120,121,123)
R> ks.test(A,B)

Two-sample Kolmogorov-Smirnov test

data: A and B
D = 0.475, p-value = 0.1244
alternative hypothesis: two.sided

Warning message:
cannot compute correct p-values with ties in: ks.test(A, B)

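As a hedged aside, the statistic and the ties behind the warning can be checked by hand in base R. The sketch below assumes only the stats functions ecdf and table; the names pooled, tied_values and D are illustrative and do not appear in the original post. It reproduces the statistic D and lists the values that occur more than once in the pooled sample, but makes no attempt to reproduce the p-value.

```r
# Data from the post (Sokal & Rohlf example)
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Values occurring more than once in the pooled sample are the ties
# that trigger the ks.test warning
pooled <- c(A, B)
counts <- table(pooled)
tied_values <- counts[counts > 1]
print(tied_values)

# Two-sample KS statistic: largest absolute gap between the two
# empirical distribution functions, evaluated at every observed value
z <- sort(unique(pooled))
D <- max(abs(ecdf(A)(z) - ecdf(B)(z)))
print(D)  # 0.475, matching the ks.test output above
```

The largest gap occurs at z = 111, where 2 of the 16 A values and 6 of the 10 B values are at or below z.
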
In their chapter 2, "Data in Biology", Sokal & Rohlf note "any given reading
of a continuous variable ... is therefore an approximation to the exact
reading, which is in practice unknowable. However, for the purposes of
computation these approximations are usually sufficient..."
I am interested to know whether this can be made more exact. Are there
methods to test that data are measured at an appropriate scale so as to be
regarded as sufficiently continuous for a KS test, or is common sense choice
of measurement precision widely regarded as sufficient?
Any comments/references would be appreciated!
David Middleton

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._


  • Torsten Hothorn at Mar 27, 2002 at 3:15 pm

    > I frequently want to test for differences between animal size frequency
    > distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov
    > two sample test (provided in R as the function ks.test in package ctest).

    "obvious" depends on the problem you want to test: KS tests the hypothesis

    H_0: F(z) = G(z) for all z vs. H_1: F(z) != G(z) for at least one z

    ks.test assumes that both F and G are continuous variables. However, if
    you want to test

    H_0: F(z) = G(z) vs. H_1: F(z) = G(z - delta); delta != 0

    as "test for differences" indicates, the Wilcoxon rank sum test is
    "obvious". Or, more generally, if your hypothesis is "exchangeability", a
    permutation test can be used.

    > The KS test is for continuous variables and this obviously includes length,
    > weight etc. However, limitations in measuring (e.g. length to the nearest
    > cm/mm, weight to the nearest g/mg etc.) have the obvious effect of
    > "discretising" real data.

    or maybe the underlying distribution is discrete?

    Anyway: ks.test and wilcox.test in ctest assume data from continuous
    distributions and the normal approximation is used if ties occur.

    For the Wilcoxon and permutation test, the conditional distribution (that
    is: conditional on the ties) can be computed using the exactRankTests
    package.

    > The ks.test function checks for the presence of ties noting in the help page
    > that "continuous distributions do not generate them". Given the problem of
    > "measuring to the nearest..." noted above I frequently find that my data has
    > ties and ks.test generates a warning.
    > I was interested to note that the example of a two-sample KS test given in
    > Sokal & Rohlf's "Biometry" (I have the 2nd edition where the example is on
    > p.441) has exactly the same problem:
    > A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
    > B <- c(100,105,107,107,108,111,116,120,121,123)

    For your example:

    R> library(exactRankTests)
    R> wilcox.exact(B, A)

    Exact Wilcoxon rank sum test

    data: B and A
    W = 36.5, p-value = 0.02039
    alternative hypothesis: true mu is not equal to 0


    R> perm.test(B, A)

    2-sample Permutation Test

    data: B and A
    T = 1118, p-value = 0.01864
    alternative hypothesis: true mu is not equal to 0
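
The two statistics reported above can be cross-checked in base R without the add-on package. This is a minimal sketch under stated assumptions: it uses wilcox.test from stats only to recover W, and it approximates the permutation distribution by Monte Carlo resampling rather than computing the exact conditional distribution, so its p-value will only be close to, not identical with, the exact one. The names w_obs, t_obs, t_perm and p_perm are illustrative. For these data, T = 1118 is simply the sum of the B values.

```r
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Wilcoxon rank sum statistic (midranks are used for tied values);
# exact = FALSE avoids the ties warning, only the statistic is used here
w_obs <- unname(wilcox.test(B, A, exact = FALSE)$statistic)
print(w_obs)  # 36.5

# Permutation statistic: the sum of the observations labelled B
t_obs <- sum(B)
print(t_obs)  # 1118

# Monte Carlo permutation test, conditional on the observed (tied) values:
# relabel the pooled data at random and recompute the sum each time
set.seed(1)
pooled <- c(A, B)
n_b <- length(B)
n_perm <- 9999
t_perm <- replicate(n_perm, sum(sample(pooled, n_b)))

# Two-sided p-value: how often a relabelling is at least as far from the
# expected sum under the null as the observed one (small tolerance for
# floating point comparisons)
t_exp <- n_b * mean(pooled)
p_perm <- (1 + sum(abs(t_perm - t_exp) >= abs(t_obs - t_exp) - 1e-9)) /
  (n_perm + 1)
print(p_perm)
```

Because the resampling only ever rearranges the observed values, ties need no special treatment; the price is Monte Carlo error, which shrinks as n_perm grows.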

    Torsten
  • David Middleton at Mar 28, 2002 at 10:48 am
    Thanks for the input, and sorry for the delay in returning to the thread.

    >> I frequently want to test for differences between animal size frequency
    >> distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov
    >> two sample test (provided in R as the function ks.test in package ctest).
    >
    > "obvious" depends on the problem you want to test: KS tests the hypothesis
    >
    > H_0: F(z) = G(z) for all z vs. H_1: F(z) != G(z) for at least one z
    >
    > ks.test assumes that both F and G are continuous variables. However, if
    > you want to test
    >
    > H_0: F(z) = G(z) vs. H_1: F(z) = G(z - delta); delta != 0
    >
    > as "test for differences" indicates, the Wilcoxon rank sum test is
    > "obvious". Or, more generally, if your hypothesis is "exchangeability", a
    > permutation test can be used.

    Apologies for my vague description. The Wilcoxon rank sum test is a test of
    difference in location, as is the permutation test I believe. I am
    interested in more than just location - the animal size distributions I have
    in mind are often multimodal, encompassing different cohorts for example -
    so I am interested in a more general test of differences in the
    distributions, both for exploratory purposes and to see if it is
    appropriate to lump samples. Thus the KS test seems the "obvious" choice.
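
One way to keep the KS statistic as a general measure of difference while dropping the continuity assumption is to combine it with the permutation idea raised earlier in the thread. The sketch below is an assumption about how that could look in base R, not something proposed in this exact form by any poster: ks_stat and perm_ks are hypothetical helper names, and the null distribution is approximated by Monte Carlo relabelling of the pooled data rather than enumerated exactly.

```r
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Two-sample KS statistic computed from the empirical distribution functions
ks_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))
  max(abs(ecdf(x)(z) - ecdf(y)(z)))
}

# Monte Carlo permutation test of exchangeability using the KS statistic.
# Relabelling keeps the observed values, ties included, so no continuity
# assumption is needed for the reference distribution.
perm_ks <- function(x, y, n_perm = 1999) {
  pooled <- c(x, y)
  n_x <- length(x)
  d_obs <- ks_stat(x, y)
  d_perm <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    ks_stat(pooled[idx], pooled[-idx])
  })
  # Tolerance guards against floating point noise when statistics coincide
  p_value <- (1 + sum(d_perm >= d_obs - 1e-12)) / (n_perm + 1)
  list(statistic = d_obs, p.value = p_value)
}

set.seed(1)
res <- perm_ks(A, B)
print(res$statistic)  # 0.475, the same D as reported by ks.test
print(res$p.value)
```

The statistic is sensitive to any difference between the two distributions, not only a shift in location, which is the property asked for here; with samples this small its power against multimodal alternatives is likely to be limited.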

    >> The KS test is for continuous variables and this obviously includes length,
    >> weight etc. However, limitations in measuring (e.g. length to the nearest
    >> cm/mm, weight to the nearest g/mg etc.) have the obvious effect of
    >> "discretising" real data.
    >
    > or maybe the underlying distribution is discrete?

    In the case I described (animal size) it is pretty clear that the variable
    is continuous, and likewise the underlying distribution. The ties really
    are the result of rounding error.

    Off list both Don MacQueen and Ross Darnell came up with the idea of
    "jittering" the values (adding a random number from a uniform distribution
    half the width of the measurement unit) to remove the ties, and re-testing
    to see if the rounding was influencing the results. This seems to be what I
    need.
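
The jittering check described above might be sketched as follows. This assumes the measurements were rounded to the nearest whole unit, so each value is perturbed by a uniform draw within half a unit either side; that reading of "half the width of the measurement unit", the helper name jitter_ks and the choice of 1000 repetitions are all assumptions made for illustration.

```r
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Repeatedly "un-round" the data and rerun the KS test. Jittered values
# are tie-free (with probability one), so ks.test raises no ties warning.
jitter_ks <- function(x, y, unit = 1, n_rep = 1000) {
  replicate(n_rep, {
    xj <- x + runif(length(x), -unit / 2, unit / 2)
    yj <- y + runif(length(y), -unit / 2, unit / 2)
    ks.test(xj, yj)$p.value
  })
}

set.seed(42)
p_jit <- jitter_ks(A, B)

# If rounding matters little, the jittered p-values should cluster
# around a single value and rarely cross the chosen significance level
print(summary(p_jit))
print(mean(p_jit < 0.05))
```

A wide spread of p-values across repetitions would suggest that the measurement precision is too coarse for the conclusion to be trusted.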

    David Middleton


  • Jason W. Martinez at Mar 28, 2002 at 3:43 pm
    Hello,

    You may want to check out Handcock and Morris's book and R/splus code on
    ``relative distribution methods.''

    See their website for more info. Last time I checked, the documentation for
    their code was somewhat lacking, though.

    http://www.stat.washington.edu/~handcock/RelDist/

    jason

  • Peter Flom at Mar 28, 2002 at 4:33 pm
    David Middleton <dmiddleton@fisheries.gov.fk> wrote (03/28/02 05:48AM):

    > I frequently want to test for differences between animal size
    > frequency distributions. The obvious test (I think) to use is the
    > Kolmogorov-Smirnov two sample test (provided in R as the function
    > ks.test in package ctest).

    and later added:

    > Apologies for my vague description. The Wilcoxon rank sum test is a test
    > of difference in location, as is the permutation test I believe. I am
    > interested in more than just location - the animal size distributions I have
    > in mind are often multimodal, encompassing different cohorts for example
    > - so I am interested in a more general test of differences in the
    > distributions, both for exploratory purposes and to see if it is
    > appropriate to lump samples. Thus the KS test seems the "obvious" choice.

    In which case, I recommend the methods developed and advocated by Handcock & Morris; see

    www.stat.washington.edu/handcock/RelDist

    for which code in R is available.


    These provide more complete methods for comparing two distributions; I think they're really good. The only caveat is that the sample size should be large (at least hundreds, preferably thousands).
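
To give a rough idea of what a relative distribution comparison looks at, here is a base R sketch on simulated data. It does not use the Handcock & Morris code or mirror its interface; the data, the variable names and the decile summary are all assumptions made for illustration. The idea shown is that each comparison value is converted to its rank position in the reference sample; if the two distributions were the same, these relative values would be roughly uniform on [0, 1].

```r
set.seed(1)

# Simulated sizes (hypothetical): a unimodal reference sample and a
# bimodal comparison sample, e.g. two cohorts
reference  <- rnorm(2000, mean = 100, sd = 10)
comparison <- c(rnorm(1000, mean = 95, sd = 5),
                rnorm(1000, mean = 115, sd = 5))

# Relative data: position of each comparison value within the
# reference distribution, estimated with the empirical CDF
F0 <- ecdf(reference)
r <- F0(comparison)

# Crude relative density by decile of the reference distribution:
# values near 1 mean "as common as in the reference", values above 1
# mean over-represented, values below 1 under-represented
bins <- cut(r, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
rel_dens <- as.numeric(table(bins)) / length(r) * 10
print(round(rel_dens, 2))
```

The samples are simulated in the thousands because of the sample size caveat above; with a few dozen animals per sample the decile proportions would be too noisy to interpret.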

    Peter

    Peter L. Flom, PhD
    Assistant Director, Statistics and Data Analysis Core
    Center for Drug Use and HIV Research
    National Development and Research Institutes
    71 W. 23rd St
    New York, NY 10010
    (212) 845-4485 (voice)
    (917) 438-0894 (fax)



Thread overview: r-help, posted Mar 26, '02 at 5:23p, last active Mar 28, '02 at 4:33p, 5 posts, 4 users.