FAQ
Hi,

I need to filter my data:
I think its easy but i'm stuck so i'll appreciate some help:

I have a data frame with 14 variables and 6 million rows. About half of this
rows have a value of "0" in 12 variables (the other two variables always
have values). How can I delete the rows in which all 12 variables have the
value of "0".

example (from my data, variable 14 is missing):

1783 81 85 78 89 71 97 76 66 88
95 95 98 -57.48258
1784 81 86 79 90 71 97 77 66 88
95 95 98 -57.43768
1785 81 86 79 90 71 98 77 66 89
95 94 98 -57.39278
1786 0 0 0 0 0 0 0 0 0
0 0 0 -57.34788
1787 0 0 0 0 0 0 0 0 0
0 0 0 -57.30298
1788 80 86 80 90 72 98 78 66 88
93 93 96 -57.25808
1789 77 83 78 88 70 95 76 63 86
91 90 93 -57.21318
1790 77 84 79 89 70 96 76 64 87
91 90 93 -57.16828

I would need to delete rows 1786 & 1787.

I tried subset with variable1>"0"&variable2>"0", but it wasn't useful 'cause
it only took the rows that didn't have a 0 in any of the variables; I only
need in ALL of the variables simultaneously.

Thanks,

Ignacio

--
View this message in context: http://r.789695.n4.nabble.com/How-to-delete-rows-with-specific-values-on-all-columns-variables-tp3318115p3318115.html
Sent from the R help mailing list archive at Nabble.com.

Search Discussions

  • David Winsemius at Feb 21, 2011 at 11:05 pm

    On Feb 21, 2011, at 4:03 PM, IgnacioQM wrote:
    Hi,

    I need to filter my data:
    I think its easy but i'm stuck so i'll appreciate some help:

    I have a data frame with 14 variables and 6 million rows. About half
    of this
    rows have a value of "0" in 12 variables (the other two variables
    always
    have values). How can I delete the rows in which all 12 variables
    have the
    value of "0".

    example (from my data, variable 14 is missing):

    1783 81 85 78 89 71 97 76
    66 88
    95 95 98 -57.48258
    1784 81 86 79 90 71 97 77
    66 88
    95 95 98 -57.43768
    1785 81 86 79 90 71 98 77
    66 89
    95 94 98 -57.39278
    1786 0 0 0 0 0 0 0
    0 0
    0 0 0 -57.34788
    1787 0 0 0 0 0 0 0
    0 0
    0 0 0 -57.30298
    1788 80 86 80 90 72 98 78
    66 88
    93 93 96 -57.25808
    1789 77 83 78 88 70 95 76
    63 86
    91 90 93 -57.21318
    1790 77 84 79 89 70 96 76
    64 87
    91 90 93 -57.16828

    I would need to delete rows 1786 & 1787.
    something along the lines of:

    dfrm[ -apply(dfrm, 1, function(x) all(x==0) ), ]


    I tried subset with variable1>"0"&variable2>"0", but it wasn't
    useful 'cause
    it only took the rows that didn't have a 0 in any of the variables;
    I only
    need in ALL of the variables simultaneously.

    Thanks,

    Ignacio

    --
    View this message in context: http://r.789695.n4.nabble.com/How-to-delete-rows-with-specific-values-on-all-columns-variables-tp3318115p3318115.html
    Sent from the R help mailing list archive at Nabble.com.

    ______________________________________________
    R-help at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.
    David Winsemius, MD
    West Hartford, CT
  • David Winsemius at Feb 21, 2011 at 11:11 pm
    On Feb 21, 2011, at 6:05 PM, David Winsemius wrote:
    On Feb 21, 2011, at 4:03 PM, IgnacioQM wrote:

    I need to filter my data:
    I think its easy but i'm stuck so i'll appreciate some help:

    I have a data frame with 14 variables and 6 million rows. About
    half of this
    rows have a value of "0" in 12 variables (the other two variables
    always
    have values). How can I delete the rows in which all 12 variables
    have the
    value of "0".

    example (from my data, variable 14 is missing):

    1783 81 85 78 89 71 97 76
    66 88
    95 95 98 -57.48258
    1784 81 86 79 90 71 97 77
    66 88
    95 95 98 -57.43768
    1785 81 86 79 90 71 98 77
    66 89
    95 94 98 -57.39278
    1786 0 0 0 0 0 0 0
    0 0
    0 0 0 -57.34788
    1787 0 0 0 0 0 0 0
    0 0
    0 0 0 -57.30298
    1788 80 86 80 90 72 98 78
    66 88
    93 93 96 -57.25808
    1789 77 83 78 88 70 95 76
    63 86
    91 90 93 -57.21318
    1790 77 84 79 89 70 96 76
    64 87
    91 90 93 -57.16828

    I would need to delete rows 1786 & 1787.
    something along the lines of:

    dfrm[ -apply(dfrm, 1, function(x) all(x==0) ), ]
    Looking at a second time, I see the qualification of only the first 12
    rows, so

    dfrm[ -apply(dfrm[, 1:12], 1, function(x) all(x==0) ), ]

    Email obscured that compounded by the fact that you didn't post a
    reproducible data object.
    I tried subset with variable1>"0"&variable2>"0", but it wasn't
    useful 'cause
    it only took the rows that didn't have a 0 in any of the variables;
    I only
    need in ALL of the variables simultaneously.

    Thanks,

    Ignacio

    David Winsemius, MD
    West Hartford, CT
  • William Dunlap at Feb 22, 2011 at 12:11 am

    -----Original Message-----
    From: r-help-bounces at r-project.org
    [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
    Sent: Monday, February 21, 2011 3:11 PM
    To: David Winsemius
    Cc: r-help at r-project.org; IgnacioQM
    Subject: Re: [R] How to delete rows with specific values on
    all columns(variables)?

    On Feb 21, 2011, at 6:05 PM, David Winsemius wrote:

    On Feb 21, 2011, at 4:03 PM, IgnacioQM wrote:

    I need to filter my data:
    I think its easy but i'm stuck so i'll appreciate some help:

    I have a data frame with 14 variables and 6 million rows. About
    half of this
    rows have a value of "0" in 12 variables (the other two variables
    always
    have values). How can I delete the rows in which all 12 variables
    have the
    value of "0".

    example (from my data, variable 14 is missing):

    1783 81 85 78 89 71 97 76
    66 88
    95 95 98 -57.48258
    1784 81 86 79 90 71 97 77
    66 88
    95 95 98 -57.43768
    1785 81 86 79 90 71 98 77
    66 89
    95 94 98 -57.39278
    1786 0 0 0 0 0 0 0
    0 0
    0 0 0 -57.34788
    1787 0 0 0 0 0 0 0
    0 0
    0 0 0 -57.30298
    1788 80 86 80 90 72 98 78
    66 88
    93 93 96 -57.25808
    1789 77 83 78 88 70 95 76
    63 86
    91 90 93 -57.21318
    1790 77 84 79 89 70 96 76
    64 87
    91 90 93 -57.16828

    I would need to delete rows 1786 & 1787.
    something along the lines of:

    dfrm[ -apply(dfrm, 1, function(x) all(x==0) ), ]
    Looking at a second time, I see the qualification of only the
    first 12
    rows, so

    dfrm[ -apply(dfrm[, 1:12], 1, function(x) all(x==0) ), ]
    I think you want !apply, not -apply, as in
    f0 <- function (dfrm) {
    dfrm[!apply(dfrm[, 1:12], 1, function(x) all(x == 0)), ]
    }
    Email obscured that compounded by the fact that you didn't post a
    reproducible data object.
    A faster and safer way would be to operate a column
    at a time (faster when there are many more rows than
    colummns) and to avoid apply (safer, as it turns
    the data.frame into a matrix whose storage.mode
    might surprise you and lead to errors in the x==0 test).
    E.g.,

    f1 <- function (dfrm) {
    isZero <- function(x) !is.na(x) & x == 0
    areAllColsZero <- isZero(dfrm[, 1])
    for (col in dfrm[, 2:12]) areAllColsZero <- areAllColsZero &
    isZero(col)
    dfrm[!areAllColsZero, , drop = FALSE]
    }
    You can use Reduce() instead of the loop, but the loop code
    is easy to understand.

    I made some fake data with the following function, which
    makes all but 12 rows be all-zero:
    makeData <- function (nrow) {
    rowNum <- seq_len(nrow)
    data.frame(lapply(structure(1:12, names = paste("X", 1:12,
    sep = "")), function(i) as.integer(rowNum == (2 * i))),
    Data1 = 1:nrow, Data2 = sqrt(1:nrow))
    }

    To test this out:
    dfrm <- makeData(1e6) # million rows, 12 to keep
    system.time(r0 <- f0(dfrm))
    user system elapsed
    21.45 0.60 22.55
    system.time(r1 <- f1(dfrm)) # faster
    user system elapsed
    0.87 0.07 0.88
    identical(r0, r1) # gives same results [1] TRUE
    dim(r0) [1] 12 14
    r0
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 Data1 Data2
    2 1 0 0 0 0 0 0 0 0 0 0 0 2 1.414214
    4 0 1 0 0 0 0 0 0 0 0 0 0 4 2.000000
    ...

    Bill Dunlap
    Spotfire, TIBCO Software
    wdunlap tibco.com
    I tried subset with variable1>"0"&variable2>"0", but it wasn't
    useful 'cause
    it only took the rows that didn't have a 0 in any of the
    variables;
    I only
    need in ALL of the variables simultaneously.

    Thanks,

    Ignacio

    David Winsemius, MD
    West Hartford, CT

    ______________________________________________
    R-help at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide
    http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupr-help @
categoriesr
postedFeb 21, '11 at 9:03p
activeFeb 22, '11 at 12:11a
posts4
users3
websiter-project.org
irc#r

People

Translate

site design / logo © 2023 Grokbase