[R] predict "interval" for lmRob? (r-help, April 2009)
lm's "predict" function offers an "interval" parameter to choose between 'confidence' and 'prediction' bands. The package "robust" also provides a "predict" method for "lmRob", but it lacks such a parameter, and the documented "type" parameter offers only "response". Is there some way of obtaining prediction bands from lmRob? Is there an alternative robust (linear) regression package that offers such a capability?

Thanks for any and all help.

- Jan Galkowski, Akamai Technologies, Cambridge, MA.

  • Greg Snow at Apr 8, 2009 at 5:20 pm
    Your problem is related to the theory underlying linear models (and is an example of why it is important to understand the theory, not just know how to plug numbers into a computer).

    The lm function is based on theory that assumes the y variable is normally distributed, with the mean of that normal determined by the model and the x values. This allows the predict function for lm to create prediction intervals based on the normal distribution, the predicted mean of that distribution, the estimated standard deviation, and the uncertainty in the predicted mean. Note that if your y variable is not normally distributed, but the sample size is large enough for the Central Limit Theorem to apply, then the confidence intervals will be approximately correct, but the prediction intervals will probably not be.
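
To make the distinction concrete, here are the textbook formulas for the simplest case of one predictor. This is a sketch under the usual normal-error assumptions; the symbols (x_0 for the new x value, s for the residual standard deviation, S_xx for the sum of squares of x) are introduced here for illustration and do not come from the thread.

```latex
% Model: y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)
\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0, \qquad
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2

% Confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{n-2,\,1-\alpha/2} \; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% Prediction interval for a single new observation at x_0
\hat{y}_0 \pm t_{n-2,\,1-\alpha/2} \; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root of the prediction interval is the contribution of the new observation's own error. The other two terms come from averaging over n observations, which is where the Central Limit Theorem can help; the single new error gets no such help, so the prediction interval depends directly on the assumed normal shape.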

    When you switch to a robust regression approach, the assumption is that the y variable is not normal, so a prediction interval based on the normal distribution does not make sense. To get an appropriate prediction interval you need some information on what the distribution of the y values is (conditional on the model), but most robust techniques are not based on a specific distribution, just some properties of the distribution. Without some information (or at least an assumption) on that distribution, the predict method cannot create prediction intervals.

    I know that this does not answer your question, but hopefully it helps you to understand what is happening. Think about what your actual scientific question is; it may be that you can answer the question without prediction intervals.

    If you feel that you really need the prediction intervals, then you will need to do some additional background research into what distribution you think the data come from; you can then proceed from there. Some options include fitting a model based on that distribution, simulating data from the distribution given the model estimates and the uncertainty in those estimates, quantile regression, mixture of regressions, and others.
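
One possible reading of the simulation option is a residual bootstrap, sketched below in plain Python rather than R. This is an assumption about how one might proceed, not a method proposed in the thread: the error distribution is taken to be the empirical distribution of the residuals, ordinary least squares stands in for whatever fit is actually used, and the names fit_line and bootstrap_prediction_interval, the skewed test data and the number of bootstrap draws are all made up for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b


def bootstrap_prediction_interval(xs, ys, x0, level=0.95, n_boot=1000, seed=0):
    """Residual-bootstrap prediction interval for a new observation at x0."""
    rng = random.Random(seed)
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    draws = []
    for _ in range(n_boot):
        # Simulate a new data set: fitted line plus resampled residuals.
        y_star = [f + rng.choice(resid) for f in fitted]
        # Refitting captures the uncertainty in the estimated line.
        a_star, b_star = fit_line(xs, y_star)
        # A new observation is the refitted line plus one more resampled error.
        draws.append(a_star + b_star * x0 + rng.choice(resid))
    draws.sort()
    k = round((1 - level) / 2 * n_boot)  # number of draws cut from each tail
    return draws[k], draws[n_boot - 1 - k]


# Hypothetical data with skewed (shifted exponential) errors.
data_rng = random.Random(1)
xs = [i / 10 for i in range(200)]
ys = [1.0 + 2.0 * x + data_rng.expovariate(1.0) - 1.0 for x in xs]

a_hat, b_hat = fit_line(xs, ys)
lo, hi = bootstrap_prediction_interval(xs, ys, x0=10.0)
print("prediction at x0 = 10:", a_hat + b_hat * 10.0)
print("95% bootstrap prediction interval:", (lo, hi))
```

With skewed errors like these the interval comes out asymmetric around the fitted value, which a normal-theory interval cannot do.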

    Hope this helps,

    --
    Gregory (Greg) L. Snow Ph.D.
    Statistical Data Center
    Intermountain Healthcare
    greg.snow at imail.org
    801.408.8111

    ______________________________________________
    R-help at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.
  • Galkowski, Jan at Apr 8, 2009 at 6:28 pm
    Hi Greg,

    Thanks for your guidance.

    In this case, the evidence is that the primary subpopulation of the data, accounting for all but the roughly 20% described below, observes the standard statistical model in the sense that Rice uses the term. It may by all accounts be normally distributed, and a Q-Q plot shows that a large portion of the primary subpopulation behaves that way, out to 2 theoretical quantiles. But, for the measurement ranges of interest, the complement of the "normal subpopulation", accounting for some 20% of the total two million data points, behaves in other ways, which are, as a matter of fact, poorly understood. That's not likely to change soon.

    The choice of a robust regression framework and of "robust" (and possibly "quantreg" as Prof Koenker suggested) was simply to automatically fit a line to the primary subpopulation, without having to make arbitrary choices as to what to keep or what to discard. Also, use of any preexisting package was simply pursued as a timesaver, worksaver, and to have some conceptual framework within which to proceed other than just throwing least squares at arbitrarily chosen subsets.

    It sounds to me like I might use the robust regression to decide what to discard and then apply standard linear "lm" to the remainder, minding the diagnostics. Should they prove favorable, I'll proceed with the result of "lm".
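
The two-stage idea above could look roughly like the following, sketched in plain Python rather than R. Everything specific here is an assumption for illustration: a Theil-Sen line (median of pairwise slopes) stands in for the robust fit and is not what lmRob computes, the cutoff of 2.5 robust standard deviations is arbitrary, the function names are hypothetical, and the pairwise loop is quadratic in the number of points, so it would not scale to two million observations as written.

```python
import random
import statistics
from itertools import combinations


def ols(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b


def theil_sen(xs, ys):
    """Robust line: median of pairwise slopes, then median intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    b = statistics.median(slopes)
    a = statistics.median(y - b * x for x, y in zip(xs, ys))
    return a, b


def robust_then_ols(xs, ys, cutoff=2.5):
    """Flag points far from the robust line, then refit OLS on the rest."""
    a_rob, b_rob = theil_sen(xs, ys)
    resid = [y - (a_rob + b_rob * x) for x, y in zip(xs, ys)]
    # The median residual is zero by construction of the intercept, so the
    # median absolute residual is the MAD; 1.4826 rescales it to a standard
    # deviation under normality.
    scale = 1.4826 * statistics.median(abs(r) for r in resid)
    kept = [(x, y) for x, y, r in zip(xs, ys, resid) if abs(r) <= cutoff * scale]
    kx, ky = zip(*kept)
    return ols(kx, ky), kept


# Hypothetical data: 80% on the line y = 1 + 2x, every fifth point shifted by 8.
rng = random.Random(42)
xs = [i / 10 for i in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) + (8.0 if i % 5 == 0 else 0.0)
      for i, x in enumerate(xs)]

(a_ols, b_ols), kept = robust_then_ols(xs, ys)
print("kept", len(kept), "of", len(xs), "points")
print("OLS on kept points: intercept", a_ols, "slope", b_ols)
```

Note that intervals computed from the second-stage fit ignore the selection step, so they are likely to be somewhat too narrow.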

    Thanks for pointing out the limitations of "robust" and its kin for me.

    BTW, if "robust" does not adopt a normal model for the y variable, what's the proper interpretation of the standard errors for slope and intercept it yields? A reference?

    - Jan

  • Greg Snow at Apr 8, 2009 at 7:10 pm

    -----Original Message-----
    From: Galkowski, Jan [mailto:jgalkows at akamai.com]
    Sent: Wednesday, April 08, 2009 12:28 PM
    To: Greg Snow; r-help at r-project.org
    Subject: RE: predict "interval" for lmRob?

    [snip]
    > It sounds to me like I might use the robust regression to decide what
    > to discard and then apply standard linear "lm" to the remainder,
    > minding the diagnostics. Should they prove favorable, I'll proceed with
    > the result of "lm".

    Discarding actual data points always makes me nervous. Sometimes the points we want to discard are actually the most interesting.

    > Thanks for pointing out the limitations of "robust" and its kin for me.

    I consider anything that encourages me to ask questions, contemplate the answers, and really think about my data and scientific question to be a benefit rather than a limitation (one of the reasons I like R so much).

    > BTW, if "robust" does not adopt a normal model for the y variable,
    > what's the proper interpretation of the standard errors for slope and
    > intercept it yields? A reference?

    Well, there are several references on the help page for lmRob; there is also a section in MASS (the book). But I think that while some of the techniques may have been developed for one particular distribution, they have been found to work for a larger set of distributions, and the theory does not depend on a particular distribution (you have to decide which makes the most sense for your data/application area). For simulations to show that they work I have seen: a mixture of 2 normals with the same mean but one with a much larger variance (giving the outliers); a mixture of a normal and a t/Cauchy; a mixture of a normal and a gamma (some skewness/outliers); a mixture of 2 normals with different means (outliers come from a different population mingled in with the population of interest and not easily distinguished); etc.
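
A small simulation in the spirit of the first design listed above (two normals with a common mean, one with much larger variance) might look like this. It is a sketch in plain Python rather than R, and the details are assumptions made for the example: a Huber M-estimate fitted by iteratively reweighted least squares stands in for a robust fit and is not the estimator lmRob uses, and the function names, the tuning constant 1.345, the 10% contamination rate and the sample sizes are all chosen for illustration.

```python
import random
import statistics


def weighted_line(xs, ys, ws):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b


def huber_line(xs, ys, k=1.345, n_iter=30):
    """Huber M-estimate of a line by iteratively reweighted least squares."""
    a, b = weighted_line(xs, ys, [1.0] * len(xs))  # start from plain OLS
    for _ in range(n_iter):
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
        med = statistics.median(resid)
        # Robust scale: MAD rescaled to a standard deviation under normality.
        scale = 1.4826 * statistics.median(abs(r - med) for r in resid)
        if scale == 0:
            break
        # Huber weights: 1 inside the threshold, shrinking like 1/|r| outside.
        ws = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        a, b = weighted_line(xs, ys, ws)
    return a, b


def simulate(n=50, reps=300, seed=0):
    """Root-mean-square error of the slope for OLS and Huber fits."""
    rng = random.Random(seed)
    xs = [i / n for i in range(n)]
    err_ols, err_hub = [], []
    for _ in range(reps):
        # Errors: 90% from N(0, 1), 10% from N(0, 10^2), both with mean 0.
        ys = [1.0 + 2.0 * x + rng.gauss(0.0, 10.0 if rng.random() < 0.1 else 1.0)
              for x in xs]
        err_ols.append(weighted_line(xs, ys, [1.0] * n)[1] - 2.0)
        err_hub.append(huber_line(xs, ys)[1] - 2.0)

    def rmse(errors):
        return (sum(e * e for e in errors) / len(errors)) ** 0.5

    return rmse(err_ols), rmse(err_hub)


rmse_ols, rmse_huber = simulate()
print("slope RMSE, OLS:  ", rmse_ols)
print("slope RMSE, Huber:", rmse_huber)
```

Swapping the error generator for one of the other mixtures listed above only requires changing the single line that draws the errors.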



    --
    Gregory (Greg) L. Snow Ph.D.
    Statistical Data Center
    Intermountain Healthcare
    greg.snow at imail.org
    801.408.8111
  • Galkowski, Jan at Apr 8, 2009 at 7:23 pm
    [snip]

    > Discarding actual data points always makes me nervous. Sometimes the points we want to discard are actually the most interesting.

    No doubt this is true, and there's a lot of information in those outliers, a lot of structure. For instance, in this case, one part of the outlier population is actually identifiable as a valid part of the primary dataset having the abscissa shifted by a known constant. The mechanism for that is known, so it could be defended that this portion of the outliers could be added back into the main population by removing the shift. Still, not much else is known about its surround, so I/we wonder what else we'd be picking up if that were done.

    But for the primary application, which is a calibration, I think going after the main population is what's wanted right now.

    Thanks again.

    - Jan

    [snip]
