Grokbase Groups R r-help June 2016
Dear all,

I have two data sets, map and ref; a small part of each is given below. Data map
has more than 27 million rows and data ref has about 560 rows. I need to run two
different tasks. My R code for these tasks is given below, but it does not work
properly.

I sincerely appreciate your help.




Regards,


Greg






Task 1)


For example, the first and second columns of row 1 in data ref are 29220
and 63933. I need R code that first looks at the first row of ref (29220
and 63933), then sums the column map$rate within that region and gives the
number of rows needed to reach > 0.85. Then do the same for the second,
third, ... rows of ref. At the end I would like the table given below (the
results I need). Please note that every value specified in the ref data
file exists in the map$reg column.






Task 2)


Again, for example, the first and second columns of row 1 in data ref are
29220 and 63933. I need R code that gives the minimum map$p over the
29220-63933 interval in the map file, then does the same for the second,
third, ... rows of ref.








# my attempt for the first question
# sort map by reg (and by p within reg)
temp <- map[order(map$reg, map$p), ]
n <- numeric(nrow(ref))
for (i in 1:nrow(ref)) {
  # rows of the sorted map lying between reg1 and reg2 of ref row i
  span <- which(temp$reg == ref$reg1[i])[1]:max(which(temp$reg == ref$reg2[i]))
  # how many rows are needed before the running sum of rate exceeds 0.85
  n[i] <- which(cumsum(temp$rate[span]) > 0.85)[1]
}


# my attempt for the second question
# sort map by reg (and by p within reg)
temp <- map[order(map$reg, map$p), ]
min_p <- numeric(nrow(ref))
for (i in 1:nrow(ref)) {
  # rows of map whose reg falls in the reg1-reg2 interval of ref row i
  temp2 <- temp[temp$reg >= ref$reg1[i] & temp$reg <= ref$reg2[i], ]
  # minimum p within that interval
  min_p[i] <- min(temp2$p)
}






Data sets

Data = map

  reg     p      rate
10276 0.700 3.867e-18
71608 0.830 4.542e-16
29220 0.430 1.948e-15
99542 0.220 1.084e-15
26441 0.880 9.675e-14
95082 0.090 7.349e-13
36169 0.480 9.715e-13
55572 0.500 9.071e-12
65255 0.300 1.688e-11
51960 0.970 1.163e-10
55652 0.388 3.750e-10
63933 0.250 9.128e-10
35170 0.720 7.355e-09
06491 0.370 1.634e-08
85508 0.470 1.057e-07
86666 0.580 7.862e-07
04758 0.810 9.501e-07
06169 0.440 1.104e-06
63933 0.750 2.624e-06
41838 0.960 8.119e-06

Data = ref

 reg1  reg2
29220 63933
26441 41838
06169 10276
74806 92643
73732 82451
86042 93502
85508 95082

The results I need

 reg1  reg2   n
29220 63933  12
26441 41838  78
06169 10276 125
74806 92643  11
73732 82451  47
86042 93502  98
85508 95082 219



  • Bert Gunter at Jun 12, 2016 at 10:36 pm
    Greg:


    I was not able to understand your task 1. Perhaps others can.


    My understanding of your task 2 is that for each row of ref, you wish
    to find all rows of map such that the reg values in those rows fall
    between the reg1 and reg2 values in ref (inclusive; change <= to < if
    you don't want the endpoints), and then you want the minimum map$p
    value over all those rows. If that is correct, I believe this will do
    it (but caution, untested, as you failed to provide data in a
    convenient form, e.g. using dput()):


    task2 <- with(map,vapply(seq_len(nrow(ref)),function(i)
    min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))
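
    (Editorial illustration, not part of Bert's reply: one way to exercise
    this one-liner on the sample data from the question, read in with
    read.table(text = ...) much as Jim Lemon does later in the thread.)

    map <- read.table(text = "reg p rate
    29220 0.430 1.948e-15
    63933 0.250 9.128e-10
    41838 0.960 8.119e-06", header = TRUE)
    ref <- read.table(text = "reg1 reg2
    29220 63933
    26441 41838", header = TRUE)
    task2 <- with(map, vapply(seq_len(nrow(ref)), function(i)
      min(p[ref[i, 1] <= reg & reg <= ref[i, 2]]), 0))
    task2  # minimum map$p for each reg1-reg2 interval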




    If my understanding is incorrect, please ignore both the above and the
    following:




    The "solution" I have given above seems inefficient, so others may be
    able to significantly improve it if you find that it takes too long.
    OTOH, my understanding of your specification is that you need to
    search for all rows in the map data frame that meet the criterion for each
    row of ref, and without further information, I don't know how to do
    this without just repeating the search 560 times.
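
    (Editorial aside, not from the thread: if the 560 repeated scans over 27
    million rows turn out to be too slow, one common alternative is a non-equi
    join with the data.table package. This is only a sketch, assuming map and
    ref are data frames as described above.)

    library(data.table)
    mapDT <- as.data.table(map)
    refDT <- as.data.table(ref)
    # for each ref row, collect the map rows with reg1 <= reg <= reg2
    # and take the minimum p within that group
    res <- mapDT[refDT, on = .(reg >= reg1, reg <= reg2),
                 .(min_p = min(p)), by = .EACHI]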




    Cheers,
    Bert




    Bert Gunter


    "The trouble with having an open mind is that people keep coming along
    and sticking things into it."
    -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



  • Greg holly at Jun 13, 2016 at 2:35 am
    Hi Bert,


    I do appreciate this. I need to check your code for task 2 on the real
    data tomorrow at my office, as I am having trouble with the remote
    connection because of a technical issue. I am sure it will work well.


    I am sorry that I was not able to explain my first question clearly.


    Values in the ref data represent regions of a chromosome. I need to select
    these regions in map (all region values in the ref data exist in the first
    column of the map data, map$reg), then sum the column map$rate and count
    the number of rows needed to exceed 0.85. For example, consider the first
    row in data ref: 29220 and 63933. After sorting map on its first column,
    sum the column map$rate only between 29220 and 63933 in the sorted map,
    cutting off at > 0.85, and count how many rows of the sorted map it took
    to get there. For example, suppose there are 38 rows between 29220 and
    63933 in the sorted map$reg and summing only the first 12 of them gives
    > 0.85; then my answer is 12 for 29220 - 63933 in ref.


    Thanks a lot for your patience.


    Cheers,
    Greg
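
    (Editorial note: a minimal sketch of this reading of Task 1, not part of
    the thread. It assumes map2 is map sorted by reg, and that the span runs
    from the first match of reg1 to the last match of reg2 in map2$reg; the
    names rows_to_085 and n are illustrative only.)

    rows_to_085 <- function(reg1, reg2, map2) {
      span <- which(map2$reg == reg1)[1]:max(which(map2$reg == reg2))
      # rows needed before the running sum of rate first exceeds 0.85
      # (NA if the threshold is never reached within the span)
      which(cumsum(map2$rate[span]) > 0.85)[1]
    }
    n <- mapply(rows_to_085, ref$reg1, ref$reg2, MoreArgs = list(map2 = map2))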


  • Jim Lemon at Jun 13, 2016 at 7:19 am
    Hi Greg,
    Okay, I have a better idea now of what you want. The problem of
    multiple matches is still there, but here is a start:


    # this data frame actually contains all the values in ref in the "reg" field
    map<-read.table(text="reg p rate
      10276 0.700 3.867e-18
      71608 0.830 4.542e-16
      29220 0.430 1.948e-15
      99542 0.220 1.084e-15
      26441 0.880 9.675e-14
      95082 0.090 7.349e-13
      36169 0.480 9.715e-13
      55572 0.500 9.071e-12
      65255 0.300 1.688e-11
      51960 0.970 1.163e-10
      55652 0.388 3.750e-10
      63933 0.250 9.128e-10
      35170 0.720 7.355e-09
      06491 0.370 1.634e-08
      85508 0.470 1.057e-07
      86666 0.580 7.862e-07
      04758 0.810 9.501e-07
      06169 0.440 1.104e-06
      63933 0.750 2.624e-06
      41838 0.960 8.119e-06
      74806 0.810 9.501e-07
      92643 0.470 1.057e-07
      73732 0.090 7.349e-13
      82451 0.960 8.119e-06
      86042 0.480 9.715e-13
      93502 0.500 9.071e-12
      85508 0.370 1.634e-08
      95082 0.830 4.542e-16",
      header=TRUE)
    # same as in your example
    ref<-read.table(text="reg1 reg2
      29220 63933
      26441 41838
      06169 10276
      74806 92643
      73732 82451
      86042 93502
      85508 95082",
      header=TRUE)
    # sort the "map" data frame
    map2<-map[order(map$reg),]
    # get a field for the counts
    ref$n<-NA
    # and a field for the minimum p values
    ref$min_p<-NA
    # get the number of rows in "ref"
    nref<-dim(ref)[1]
    for(i in 1:nref) {
      start<-which(map2$reg==ref$reg1[i])
      end<-which(map2$reg==ref$reg2[i])
      cat("start",start,"end",end,"\n")
      # get the range of matches
      regrange<-range(c(start,end))
      # convert this to a sequence spanning all matches
      allreg<-regrange[1]:regrange[2]
      ref$n[i]<-sum(map2$p[allreg] > 0.85)
      ref$min_p[i]<-min(map2$p[allreg])
    }


    This example uses the span from the first match of "reg1" to the last
    match of "reg2". This may not be what you want, so let me know if
    there are further constraints.


    Jim

  • Greg holly at Jun 13, 2016 at 2:29 pm
    Hi Jim,


    I apologize if I am bothering you. I have run this on the real data and
    here is the error message I got:


    Thanks,


    Greg


    start end
    Error in regrange[1]:regrange[2] : result would be too long a vector
    In addition: Warning messages:
    1: In min(x) : no non-missing arguments to min; returning Inf
    2: In max(x) : no non-missing arguments to max; returning -Inf


  • Jim Lemon at Jun 14, 2016 at 7:39 am
    Hi Greg,
    This is obviously a problem with the data. The first error indicates
    that the sequence of integers regrange[1]:regrange[2] is too long to
    be allocated. Most likely this is because one or both of the endpoints
    are infinite. Maybe if you can find where the NAs are you can fix it.


    Jim
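
    (Editorial note: a small check, not part of Jim's reply, assuming the map2
    and ref objects from his earlier example. If a reg1 or reg2 value never
    appears in map2$reg, which() returns an empty vector, range() then yields
    -Inf/Inf, and building the sequence fails as above.)

    bad <- !(ref$reg1 %in% map2$reg) | !(ref$reg2 %in% map2$reg)
    ref[bad, ]  # ref rows whose endpoints are missing from map2$reg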

  • Greg holly at Jun 14, 2016 at 12:04 pm
    Thanks a lot, Jim. I am struggling to solve the problem. I do appreciate
    your help and advice.


    Greg


  • Jim Lemon at Jun 12, 2016 at 11:06 pm
    Hi Greg,
    You've got a problem that you don't seem to have identified. Your
    "reg" field in the "map" data frame can define at most 100000 unique
    values. This means that each value will be repeated about 270 times.
    Unless there are constraints you haven't mentioned, we would expect
    that in 135 cases for each value, the values in each "ref" row will be
    in the reverse order and the spans may overlap. I notice that you may
    have tried to get around this by sorting the "map" data frame, but
    then the order of the rows is different, and the number of rows
    "between" any two values changes. Apart from this, it is almost
    certain that the number of values of "p > 0.85" in the multiple runs
    between each set of "ref" values will be different. It is possible to
    perform both tasks that you mention, but only the second will yield a
    unique or tied value for all of the cases. So your result data frame
    will have an unspecified number of values for each row in "ref" for
    the first task.
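
    A small self-contained illustration of that point, using the reg values from
    the sample map data earlier in the thread: 63933 occurs twice, so an interval
    "from 29220 to 63933" has two candidate endpoints, and re-ordering the data
    changes which rows lie between them.

    ## reg values copied from the posted sample map data
    reg <- c("10276", "71608", "29220", "99542", "26441", "95082", "36169",
             "55572", "65255", "51960", "55652", "63933", "35170", "06491",
             "85508", "86666", "04758", "06169", "63933", "41838")
    which(reg == "63933")        # 12 19: two possible endpoint positions
    which(sort(reg) == "63933")  # 13 14: the positions move once the data are sorted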


    Jim



  • Greg holly at Jun 13, 2016 at 2:41 am
    Hi Jim;


    Thanks so much for this info. I did not know this, as I am very new to R.
    So do you think that, rather than using unique(), it would be better to use
    !duplicated()?
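
    (For reference, a minimal sketch of the difference between the two, without
    claiming which one is right for this problem: unique() returns just the
    distinct values of a vector, while !duplicated() gives a logical index of
    first occurrences, which also works for keeping whole rows of a data frame.)

    x <- c("29220", "63933", "63933", "41838")
    unique(x)          # "29220" "63933" "41838"
    x[!duplicated(x)]  # same values, via a logical index
    ## on a data frame, !duplicated() keeps the first full row for each reg value
    df <- data.frame(reg = x, p = c(0.43, 0.25, 0.75, 0.96))
    df[!duplicated(df$reg), ]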


    Thanks in advance,


    Greg


