Grokbase Groups R r-help January 2011
Hi,

I have a fairly simple linear regression using the lm function. There
are about 100 variables and 30,000 rows of data. It runs fine and
produces a decent looking R2 value. I'm interested in performing a
stepwise variable selection to see if things can be cleaned up a bit.

Calling the step function returns ONE iteration (all the variables) and
then stops. No errors are reported.

Can someone suggest why this might not be working as expected?
(Normally this function steps through all the variables to find the
"best" combination.)

Thanks!

-N

  • Joshua Wiley at Jan 10, 2011 at 9:13 am
    Hi Noah,

    Are you able to reproduce the example on a smaller dataset? Do you have any strange variable names? I created a 30000 x 100 matrix, fit a linear model, and step has been running fine (other than bringing my poor netbook to its knees). It also might be helpful if you could post your session info per the posting guide.

    You could also try: debug(step). Then run step on your model so you can see what the function does before it exits.

    Cheers,

    Josh
    ______________________________________________
    R-help at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.
  • Uwe Ligges at Jan 10, 2011 at 9:19 am

    Can you show us both your code and the output as well as the summary of
    the whole model, please?

    Uwe Ligges

  • Noah Silverman at Jan 10, 2011 at 9:51 am
    Hi,

    It's a lot of data, but here are some summary stats:

    l <- lm(trainy ~ x)
    str(x)
    num [1:31205, 1:48] 0.0975 -0.1987 0.3254 -0.7912 0.0975 ...
    - attr(*, "dimnames")=List of 2
    ..$ : chr [1:31205] "5" "6" "7" "8" ...
    ..$ : NULL
    - attr(*, "names")= chr [1:1497840] "a" NA NA NA ...

    summary(x)
          V1                  V2                  V3                  V4                  V5                  V6
    Min.   : -1.679848  Min.   : -1.606698  Min.   : -1.617491  Min.   :-1.6534404  Min.   :  -0.93052  Min.   :  -1.66594
    1st Qu.: -0.865216  1st Qu.: -0.867430  1st Qu.: -0.875567  1st Qu.:-0.9042894  1st Qu.:  -0.67904  1st Qu.:  -0.90768
    Median :  0.074739  Median : -0.004886  Median : -0.009924  Median : 0.0946436  Median :  -0.40504  Median :  -0.14942
    Mean   :  0.000492  Mean   : -0.001140  Mean   : -0.001563  Mean   :-0.0006543  Mean   :  -0.01372  Mean   :   0.01700
    3rd Qu.:  0.826709  3rd Qu.:  0.857625  3rd Qu.:  0.855687  3rd Qu.: 0.8438270  3rd Qu.:   0.23305  3rd Qu.:   0.79841
    Max.   :  1.578680  Max.   :  1.596925  Max.   :  1.597644  Max.   : 1.5930105  Max.   :   2.74787  Max.   :   2.88363
          V7                  V8                  V9                  V10                 V11                 V12
    Min.   :  -2.84607  Min.   :-17.340329  Min.   :  -5.72374  Min.   : -9.088574  Min.   : -0.753625  Min.   : -9.694224
    1st Qu.:  -0.69230  1st Qu.: -0.680686  1st Qu.:  -0.77093  1st Qu.: -0.484832  1st Qu.: -0.753625  1st Qu.: -0.535022
    Median :   0.07690  Median : -0.050236  Median :   0.08103  Median :  0.127993  Median : -0.187126  Median :  0.094031
    Mean   :  -0.01912  Mean   :  0.007672  Mean   :  -0.01086  Mean   :  0.004137  Mean   :  0.001845  Mean   :  0.005425
    3rd Qu.:   0.69226  3rd Qu.:  0.643260  3rd Qu.:   0.70906  3rd Qu.:  0.646475  3rd Qu.:  0.232864  3rd Qu.:  0.640222
    Max.   :   1.76915  Max.   :  4.299870  Max.   :   3.87579  Max.   :  4.307299  Max.   :  8.125662  Max.   : 13.955377
          V13                 V14                 V15                 V16                 V17                 V18
    Min.   : -2.325326  Min.   : -1.122704  Min.   : -15.78010  Min.   :  -1.41451  Min.   : -2.890895  Min.   :  -6.48201
    1st Qu.: -0.707599  1st Qu.: -0.677653  1st Qu.:   0.10818  1st Qu.:  -0.67008  1st Qu.: -0.562810  1st Qu.:  -0.65572
    Median :  0.022490  Median : -0.249277  Median :   0.29841  Median :  -0.24738  Median : -0.068975  Median :  -0.01222
    Mean   :  0.000984  Mean   :  0.005968  Mean   :  -0.01914  Mean   :  -0.01929  Mean   : -0.004446  Mean   :  -0.04004
    3rd Qu.:  0.735969  3rd Qu.:  0.387072  3rd Qu.:   0.38232  3rd Qu.:   0.32839  3rd Qu.:  0.502638  3rd Qu.:   0.59069
    Max.   :  2.328877  Max.   : 10.034416  Max.   :   1.17948  Max.   :   3.66491  Max.   :  3.405497  Max.   :   3.95314
          V19                 V20                 V21                 V22                 V23                 V24
    Min.   :-3.4866219  Min.   : -53.84720  Min.   : -3.872473  Min.   :-82.470612  Min.   : -0.877362  Min.   :   -0.9064
    1st Qu.:-0.6866883  1st Qu.:  -0.57941  1st Qu.: -0.459875  1st Qu.: -0.546812  1st Qu.: -0.556758  1st Qu.:   -0.6743
    Median : 0.0181297  Median :  -0.01640  Median : -0.026090  Median : -0.023271  Median : -0.283361  Median :   -0.2101
    Mean   : 0.0005746  Mean   :   0.02152  Mean   :  0.001832  Mean   : -0.002836  Mean   :  0.006677  Mean   :    0.0330
    3rd Qu.: 0.7036093  3rd Qu.:   0.58834  3rd Qu.:  0.400639  3rd Qu.:  0.501094  3rd Qu.:  0.196238  3rd Qu.:    0.4863
    Max.   : 3.5553623  Max.   :  53.96102  Max.   :  5.111946  Max.   :  7.022679  Max.   : 21.385854  Max.   :   12.3242
          V25                 V26                 V27                 V28                 V29                 V30
    Min.   :  -0.88375  Min.   :  -1.11709  Min.   :  -1.00780  Min.   :  -10.7395  Min.   :  -1.66934  Min.   :-1.0292617
    1st Qu.:  -0.65752  1st Qu.:  -0.71563  1st Qu.:  -0.70467  1st Qu.:   -0.1804  1st Qu.:  -0.46190  1st Qu.:-0.6029130
    Median :  -0.20505  Median :  -0.07946  Median :  -0.14171  Median :    0.2798  Median :  -0.12636  Median :-0.3733405
    Mean   :   0.03226  Mean   :   0.02066  Mean   :   0.01787  Mean   :   -0.0344  Mean   :   0.01104  Mean   : 0.0004641
    3rd Qu.:   0.47365  3rd Qu.:   0.48877  3rd Qu.:   0.42125  3rd Qu.:    0.5117  3rd Qu.:   0.32533  3rd Qu.: 0.0530082
    Max.   :  10.88045  Max.   :  11.39008  Max.   :  11.55056  Max.   :    1.2400  Max.   :  76.74103  Max.   : 5.4643580
          V31                 V32                 V33                 V34                 V35                 V36
    Min.   :  -1.72330  Min.   :  -2.81647  Min.   :  -1.22587  Min.   :  -1.33872  Min.   :  -0.85680  Min.   :  -1.84229
    1st Qu.:  -0.95858  1st Qu.:  -0.68389  1st Qu.:  -0.79860  1st Qu.:  -0.85541  1st Qu.:  -0.66622  1st Qu.:  -0.81453
    Median :  -0.19386  Median :   0.07774  Median :  -0.18821  Median :  -0.18663  Median :  -0.37654  Median :  -0.25103
    Mean   :   0.01799  Mean   :  -0.01678  Mean   :   0.01022  Mean   :  -0.07883  Mean   :  -0.05283  Mean   :  -0.01440
    3rd Qu.:   0.76204  3rd Qu.:   0.68705  3rd Qu.:   0.54426  3rd Qu.:   0.53015  3rd Qu.:   0.25618  3rd Qu.:   0.62855
    Max.   :   2.86501  Max.   :   1.75334  Max.   :   4.57282  Max.   :   2.78523  Max.   :   3.86957  Max.   :   5.99709
          V37                 V38                 V39                 V40                 V41                 V42
    Min.   : -0.457517  Min.   :   -2.2722  Min.   :   -1.6455  Min.   : -3.477135  Min.   :  -1.17361  Min.   : -5.151515
    1st Qu.: -0.457517  1st Qu.:   -0.8465  1st Qu.:   -0.8011  1st Qu.: -0.687784  1st Qu.:  -1.17361  1st Qu.: -0.057516
    Median : -0.457517  Median :   -0.2618  Median :   -0.3438  Median : -0.229916  Median :   0.03988  Median : -0.057516
    Mean   : -0.001647  Mean   :   -0.2080  Mean   :   -0.1453  Mean   :  0.007545  Mean   :   0.02236  Mean   :  0.001137
    3rd Qu.: -0.457517  3rd Qu.:    0.3710  3rd Qu.:    0.3013  3rd Qu.:  0.515931  3rd Qu.:   0.49494  3rd Qu.:  0.706584
    Max.   : 15.512632  Max.   :    2.1959  Max.   :    2.7406  Max.   :  3.717934  Max.   :   6.15788  Max.   :  5.036483
          V43                 V44                 V45                 V46                 V47                 V48
    Min.   :    0.0000  Min.   : -0.708214  Min.   :-0.5407803  Min.   : -0.980665  Min.   :-17.332960  Min.   : -0.291151
    1st Qu.:    0.0000  1st Qu.: -0.708214  1st Qu.:-0.5407803  1st Qu.: -0.980665  1st Qu.: -0.684639  1st Qu.: -0.291151
    Median :    0.0000  Median : -0.286641  Median :-0.2274321  Median : -0.416754  Median : -0.054618  Median : -0.291151
    Mean   :    0.1500  Mean   : -0.001202  Mean   :-0.0006913  Mean   :  0.004792  Mean   :  0.007181  Mean   :  0.008824
    3rd Qu.:    0.0000  3rd Qu.:  0.313288  3rd Qu.: 0.1232400  3rd Qu.:  0.711067  3rd Qu.:  0.654157  3rd Qu.: -0.291151
    Max.   :    1.0000  Max.   : 30.801619  Max.   :45.7742768  Max.   :  5.786264  Max.   :  4.292532  Max.   : 10.602908
  • Noah Silverman at Jan 10, 2011 at 9:56 am
    I think I just figured it out.

    x is a matrix.
    l <- lm(y ~ x) works for generating a model, but step fails on it. (It
    treats x as a single term to add/remove.)

    step does work if I use a data.frame:

    foo <- data.frame(y, x)

    l <- lm(y ~ ., data=foo)

    Now step(l) works.

    I guess R doesn't look inside the "x" in the first version to iterate
    through the different variables. It does, however, iterate when the "."
    is used in a formula, since "." expands to one term per column of the
    data frame.
Discussion Overview
group: r-help
category: r
posted: Jan 10, '11 at 7:57a
active: Jan 10, '11 at 9:56a
posts: 5
users: 3
website: r-project.org
irc: #r
