Hi,

I want to grab some information about university names, and I came
across the term "web scraping".
I searched for it on Google, and there are tools for it in Ruby.
One of them is Nokogiri, but I'm a bit confused because it seems that
it only gets information that is already in an HTML or XML document.

I found a webpage that has a list of university names in a

<select> </select> (HTML tag)

and I want to grab that information

The question is... can I do that with nokogiri or another tool?
The list is like a country list, but with the names of the
universities of my country.

It seems that the page gets that information from a DB using Ajax, and
what I'm trying to do may not be legal or possible.

I'd really appreciate it if someone could help me understand what this
tool is used for, and whether what I'm trying to do is possible.

Thanks

Javier Q

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.


  • Everaldo Gomes at Dec 5, 2011 at 6:32 pm

    On Mon, Dec 5, 2011 at 4:05 PM, JavierQQ wrote:

    HI,
    Hi

    I want to grab some information about university names, and I found
    this term called "web scraping"
    I search about it in google, and there are tools in ruby.
    One of them is nokogiri but I'm a bit confused because it seems that
    it only gets information that its already in an html or xml

    I found a webpage that have a list of university names as a

    <select> </select> (html label)

    and I want to grab that information

    The question is... can I do that with nokogiri or another tool?
    The list is like a country list, but with the names of the
    universities of my country.

    It seems that it get that information from an DB using ajax, and what
    I'm trying to do may not be legal or possible

    I'll really appreciate if someone can help me to understand what this
    tool is used for, and if what I'm trying to do is possible

    Thanks

    Javier Q
    Take a look at some screencasts:

    http://railscasts.com/episodes?utf8=%E2%9C%93&search=mechanize

    http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

    http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/

    With Nokogiri, you can use CSS3 selectors to grab the information you want.


    Best Regards,
    Everaldo


  • Walter Lee Davis at Dec 5, 2011 at 6:33 pm

    On Dec 5, 2011, at 1:05 PM, JavierQQ wrote:

    HI,

    I want to grab some information about university names, and I found
    this term called "web scraping"
    I search about it in google, and there are tools in ruby.
    One of them is nokogiri but I'm a bit confused because it seems that
    it only gets information that its already in an html or xml
    Yes, Nokogiri is a toolkit for (among lots of other things) running XPath or CSS queries against a text file. That text file can be anything -- an IO stream of one sort or another with textual data in it will do.
    I found a webpage that have a list of university names as a

    <select> </select> (html label)

    and I want to grab that information

    The question is... can I do that with nokogiri or another tool?
    The list is like a country list, but with the names of the
    universities of my country.
    A select can be traversed like any other DOM object; this should be fairly close:

    # given doc is a Nokogiri::XML::Document or Nokogiri::HTML::Document
    doc.css('#yourPickerId option').each do |opt|
      foo = opt['value']
      # whatever else you want to do with foo here
    end
    It seems that it get that information from an DB using ajax, and what
    I'm trying to do may not be legal or possible
    If it's Ajax, you'll need to run a JavaScript interpreter against it. Rails 3.1 shows the way to do that server-side. Once you have munged the page into a text stream that includes this desired data (flattened it down to the result of the Ajax plus the base code) then Nokogiri or Hpricot or any other XML/HTML parser could rip through that DOM and give you individual nodes to play with.
    I'll really appreciate if someone can help me to understand what this
    tool is used for, and if what I'm trying to do is possible
    Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep the scrapers out (like you), or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?) then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the whole JavaScript abstraction entirely. You'd have one HTTP request, and unless that endpoint was kinked to only accept requests from within its own domain, you would likely have JSON or some other structured data in return, and that could be even easier to interpret in your application.

    Walter
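If the Ajax endpoint does return JSON, the parsing side could look roughly like the sketch below. The payload shape (an array of records with a `name` key), the `university_names` helper and the example.com URL in the comment are assumptions; the real endpoint may return something quite different.

```ruby
require 'json'

# Hypothetical response body; a real endpoint may use a different shape.
SAMPLE_JSON = '[{"id": 1, "name": "Universidad Uno"}, {"id": 2, "name": "Universidad Dos"}]'

# Pulls the "name" field out of each record in a JSON array.
def university_names(json_text)
  JSON.parse(json_text).map { |record| record.fetch('name') }
end

puts university_names(SAMPLE_JSON)

# Against a live endpoint (placeholder URL), something like this should work:
#   require 'net/http'
#   body = Net::HTTP.get(URI('http://example.com/universities.json'))
#   puts university_names(body)
```

No HTML parser is involved at all in this case, which is why going straight to the endpoint can be simpler than scraping the surrounding page.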
    Thanks

    Javier Q

  • JavierQQ at Dec 5, 2011 at 6:56 pm

    On 5 dic, 13:32, Walter Lee Davis wrote:
    A select can be traversed like any other DOM object, this should be fairly close:

    #given doc is a Nokogiri::XML or Nokogiri::HTML nodeset
    doc.css('#yourPickerId option').each do |opt|
    foo = opt['value']
    #whatever else you want to do with foo here
    end
    Thanks. In the Nokogiri example the result is read with something like
    "link.content", and that's why I was wondering how I could grab that
    information from the select group.

    Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep the scrapers out (like you), or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?) then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the whole JavaScript abstraction entirely. You'd have one HTTP request, and unless that endpoint was kinked to only accept requests from within its own domain, you would likely have JSON or some other structured data in return, and that could be even easier to interpret in your application.

    Walter
    You mean that in order to make a better application I have to deliver
    the information as JSON?
    I'm kind of new to Rails (not a complete newbie, but... sort of :D )

    Thanks for your help

    Javier Q

  • Walter Lee Davis at Dec 5, 2011 at 7:10 pm

    On Dec 5, 2011, at 1:55 PM, JavierQQ wrote:


    On 5 dic, 13:32, Walter Lee Davis wrote:


    A select can be traversed like any other DOM object, this should be fairly close:

    #given doc is a Nokogiri::XML or Nokogiri::HTML nodeset
    doc.css('#yourPickerId option').each do |opt|
    foo = opt['value']
    #whatever else you want to do with foo here
    end
    Thanks, in nokogiri example the result is like "link.content" and
    that's why I wondering how I can grab that information from the select
    group
    There are some basic things one can do with nodes once you find them. content() spills out the textual content of any node (in the case of an option, that might give you the same thing as the Option.text attribute in JavaScript, but I wouldn't count on it specifically). In the case of a div, for example, content would give you the textual content of that div, minus any HTML tags, while inner_html would give you the actual HTML code defining all of the content tags as well as their text content.

    For everything else, any other named attribute on the given node you access simply by putting the name of the attribute in as a key:

    my_select['label'] or my_select['value'] or my_select['selected'] for example.

    Behind the scenes, [] is an ordinary method on the node: it gives you the attribute's value if the attribute is present, and nil if it isn't.

    Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep the scrapers out (like you), or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?) then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the whole JavaScript abstraction entirely. You'd have one HTTP request, and unless that endpoint was kinked to only accept requests from within its own domain, you would likely have JSON or some other structured data in return, and that could be even easier to interpret in your application.

    Walter
    You mean that in order to make a better application I have to deliver
    the information as JSON ?
    I have seen this technique used for this reason, by splitting the application load over time on the same server or across servers. But then I would just throw a caching layer at the problem. Much less heartache.

    I've also seen this technique used to obfuscate the data source, or simply to integrate third-party data sources into an existing site.
    I'm kind of new with rails (not a completly newbie but... sort of :D )
    Me too, but I've done quite a lot of Nokogiri recently, so it's all fairly fresh.

    Walter
    Thanks for your help

    Javier Q

  • JavierQQ at Dec 6, 2011 at 3:21 pm
    Hi,
    It's me again. I was trying an easy example and it worked... but now
    I've run into some trouble.
    Is there a way to give Nokogiri data such as a username and password?
    On one website I have to log in first.
    Scrapy gives a way to simulate a user login, and I was wondering if
    Nokogiri can do the same.

    Javier

  • Walter Lee Davis at Dec 6, 2011 at 3:25 pm
    You wouldn't do it at the Nokogiri level. You need to read up on the open-uri library, there are all sorts of goodies in there to manage authentication, sessions, everything needed to create a Web client. That layer of your application will get the text stream that you will send on to Nokogiri. There's nothing in Noko that is specific to solving that problem, it starts from the assumption that you have a text file locally or a stream from another client like open-uri.

    Walter
    On Dec 6, 2011, at 10:21 AM, JavierQQ wrote:

    Hi,
    It's me again, I was doing some easy example and it worked... but now
    I've got some trouble
    Is there a way to provide nokogiri data such as username and password?
    because in a web I have to login first
    Scrapy gives a way to simulate user login, and I was wonderin if
    nokogiri can do the same

    Javier

  • Javier Quarite at Dec 6, 2011 at 4:28 pm
    It seems that :http_basic_authentication => [user, pass]
    no longer works. I've tested it with two websites and got nothing.
    Is there any other way?

    Thanks

    Javier

  • Walter Lee Davis at Dec 6, 2011 at 4:59 pm
    Can you post some code surrounding this, show the open-uri method call you're using?

    Walter
    On Dec 6, 2011, at 11:28 AM, Javier Quarite wrote:

    It seems that :http_basic_authentication [user, pass]
    no longer works, I've tested with 2 webs and nothing,
    Is there any other way?

    Thanks

    Javier

  • Javier Quarite at Dec 6, 2011 at 5:17 pm

    On Tue, Dec 6, 2011 at 11:58 AM, Walter Lee Davis wrote:

    Can you post some code surrounding this, show the open-uri method call
    you're using?

    Walter
    require 'nokogiri'
    require 'open-uri'

    # url, user and pass are set elsewhere
    doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))
    doc.xpath('//select/option').each do |opt|
      puts opt.content
    end

    I can grab some info from the main page of the URL (so that works), but
    when I go to its login page with user/pass and try to get some, it seems
    to get information from some other place (I'm not even sure from where).

    Javier

  • Javier Quarite at Dec 6, 2011 at 5:21 pm

    doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass])
    I made a mistake; that was from another file.
    What I'm actually using is:

    open(url, :http_basic_authentication => [user, pass] )
    doc = Nokogiri::HTML(open(url))

    Javier

  • Walter Lee Davis at Dec 6, 2011 at 5:24 pm

    On Dec 6, 2011, at 12:17 PM, Javier Quarite wrote:

    I grab some info from tha main page of the url (so it works) but when I enter to its login page with user/pass and try to get some, it seems to get information from other place (I'm not even sure from where)

    Try all this out in a terminal with telnet or cURL -- see where you're actually going when you log in. You may be redirected in some subtle way. Also, a browser may throw a "basic authentication" dialog box when you're actually being challenged for digest authentication. The :http_basic_authentication option is not the same thing.

    I think your real solution here will be to abstract out the open() bit inside the Nokogiri::HTML() call. Look for a gem that accepts a URL and returns a text stream and offers a whole bunch of configuration options for authentication. I am certain there are at least a handful of them out there. By separating your concerns in this way, you'll end up with a more modular solution so that you can swap in different credentials for each site you're scraping.

    Walter
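One thing worth noting about the two-line snippet posted earlier in the thread: it calls open twice, and only the first call carries the credentials, so the document ends up being built from an unauthenticated response. A rough sketch of the "separate fetch layer" idea follows, using Net::HTTP from the standard library rather than a gem. The `build_request` and `fetch_html` names, the URL and the credentials are all placeholders, and this only covers HTTP basic authentication.

```ruby
require 'net/http'
require 'uri'

# Builds a GET request that carries HTTP basic credentials.
def build_request(url, user, pass)
  request = Net::HTTP::Get.new(URI(url))
  request.basic_auth(user, pass)
  request
end

# Fetches the page once and returns the body as a string for the parser.
def fetch_html(url, user, pass)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url, user, pass)).body
  end
end

# The parser then only ever sees text (placeholder URL and credentials):
#   doc = Nokogiri::HTML(fetch_html('http://example.com/private', 'user', 'secret'))
```

If the site uses a login form and cookies instead of basic authentication, this will not be enough; the Mechanize screencasts linked earlier in the thread are probably the better starting point for that case.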

  • Sathia S at Dec 7, 2011 at 9:02 am
    Hi,

    The question is... can I do that with nokogiri or another tool?
    The list is like a country list, but with the names of the
    universities of my country.
    Besides Nokogiri, there is another tool called Hpricot.

    It seems that it get that information from an DB using ajax, and what
    I'm trying to do may not be legal or possible


    Yes, it is possible.
    Here are some examples I tried with Nokogiri and Ruby:

    *Nokogiri*
    http://sathia27.wordpress.com/2011/09/06/tbus-version-1-search-bus-routes-from-terminal/
    http://sathia27.wordpress.com/2011/12/05/english-to-tamil-translator-script/


    *Hpricot*
    http://sathia27.wordpress.com/2010/10/29/learned-ruby-and-hpricot/

    --
    Regards
    sathia

    Here I share my experiments with open source.
    http://www.sathia27.wordpress.com
    http://www.lquery.com

