www.SmarteGuru.com

Get list of Unread emails from Gmail with Mechanize and Hpricot

This article shows how to use Mechanize and Hpricot to log in to Gmail and return a list of unread emails.

Installation of required tools

gem install mechanize --include-dependencies

This will install both mechanize and hpricot.

Usage

Using mechanize to login to gmail

Before we can scrape our Gmail account, we need to log in. Mechanize is a library for “automating interaction with websites”. It stores and sends cookies, so once we log in our script has a session to work in, just as a web browser would.


require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get 'http://www.gmail.com'

form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'
page = agent.submit form # submit the login form; the agent keeps the session cookies

Above we have created a Mechanize instance. This object can be thought of as the user agent: it can get web pages, click links, and fill out and submit forms. We can use Hpricot methods on the page object to parse the HTML it contains.

Forcing Gmail into basic mode

Gmail uses a lot of fancy JavaScript and Ajax functionality, and as such is one of the premier Web 2.0 sites on the net. Our little script doesn't have a built-in JavaScript engine, so it won't understand any of the JavaScript that is thrown at it. Instead we need to force Gmail into Basic Mode, which is HTML only.

After logging in, Gmail will try to redirect us to http://mail.google.com/mail?ui&auth=DC8F…. We need to follow this link. Using Hpricot we can search for the meta redirect, grab its href attribute, and have Mechanize follow the link.

page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')

Note that we need to strip the single quotes from around the URL; I used gsub for this.
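To make the quote-stripping step concrete, here is a minimal sketch in plain Ruby that pulls the target URL out of a meta refresh value and removes the surrounding quotes. The helper name refresh_url and the sample value are hypothetical, and the exact format Gmail sends is an assumption; the real URL carries a long auth token.

```ruby
# Pull the target URL out of a meta refresh value such as
#   0; url='http://example.com/mail?ui=1'
# and strip one single or double quote from each end.
# refresh_url is a hypothetical helper, not part of Mechanize or Hpricot.
def refresh_url(content)
  match = content.match(/\A\s*\d+\s*;\s*url\s*=\s*(.+?)\s*\z/i)
  return nil unless match

  match[1].gsub(/\A['"]|['"]\z/, '')
end

# Hypothetical sample value, used only for illustration.
sample = "0; url='http://example.com/mail?ui=1'"
puts refresh_url(sample)
```

Unlike a blanket gsub(/'/,''), this only removes quotes at the two ends, so a quote that happened to appear inside the URL would be left alone.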

The returned page will try to use JavaScript to load the interface, but that will not work for us. Thankfully a noscript tag is included in the source and contains a helpful clue.


<noscript><font face="arial">JavaScript must be enabled in order for you to use Gmail in standard view.
However, it seems JavaScript is either disabled or not supported by your browser.
To use standard view, enable JavaScript by changing your browser options, then <a href="">try again</a>.

<p>To use Gmail's basic HTML view, which does not require JavaScript,
<a href="?ui=html&zy=n">click here</a>.</p></font>

<p><font face="arial">If you want to view Gmail on a mobile phone or similar device
<a href="?ui=mobile&zyp=n">click here</a>.</font></p></noscript>

Notice the sentence ‘To use Gmail’s basic HTML view, which does not require JavaScript’: it supplies a link with the GET vars ?ui=html&zy=n.

The next step is to apply these GET vars to the current URL, which puts us in basic mode where we can scrape to our heart's content.

page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")
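The same query-string swap can also be done with Ruby's standard URI library instead of a regular expression. This is only a sketch: with_query is a hypothetical helper name and the example.com URLs are placeholders, not real Gmail addresses.

```ruby
require 'uri'

# Replace whatever query string a URL has (or add one if it has none).
# with_query is a hypothetical helper used for illustration.
def with_query(url, query)
  uri = URI.parse(url)
  uri.query = query
  uri.to_s
end

# Placeholder URL standing in for the page we were redirected to.
puts with_query("http://example.com/mail/?ui=1&auth=abc", "ui=html&zy=n")
```

One difference from the sub call above: the regular expression leaves the URL unchanged when it contains no ?, while this version adds the query string in that case too.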

A simple puts page.root should show us the HTML output of our Gmail account.

Scrape!

Want to get a list of all your unread emails? This quick snippet will do the job.

# rows with this background colour are the ones treated as unread
page.search("//tr[@bgcolor='#ffffff']").each do |row|
  # the bold text in the row gives the sender and the subject
  from, subject = *row.search("//b/text()")

  url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])

  puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"

  email = agent.get url # have the agent follow the email link for further parsing
end
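To try the row-matching logic without logging in to Gmail, the sketch below runs the same idea against a small made-up table. It uses REXML from Ruby's standard distribution rather than Hpricot, purely so it runs offline. The markup, the href values and the unread_rows helper are all assumptions for illustration, and REXML needs well-formed markup, which real Gmail HTML may not be.

```ruby
require 'rexml/document'

# Made-up, well-formed fragment imitating the basic HTML inbox table.
SAMPLE_HTML = <<~'HTML'
  <table>
    <tr bgcolor="#ffffff">
      <td><b>Alice</b></td>
      <td><a href="?th=abc123"><b>Lunch on Friday?</b></a></td>
    </tr>
    <tr bgcolor="#e8eef7">
      <td>Bob</td>
      <td><a href="?th=def456">Old thread</a></td>
    </tr>
  </table>
HTML

# Return one hash per white-background row: the first two bold texts are
# taken as sender and subject, and the first link is the message link.
# unread_rows is a hypothetical helper, not part of any library.
def unread_rows(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//tr[@bgcolor='#ffffff']").map do |row|
    from, subject = REXML::XPath.match(row, ".//b").map(&:text)
    link = REXML::XPath.first(row, ".//a")
    { from: from, subject: subject, href: link && link.attributes["href"] }
  end
end

unread_rows(SAMPLE_HTML).each do |mail|
  puts "From: #{mail[:from]}\nSubject: #{mail[:subject]}\nLink: #{mail[:href]}\n\n"
end
```

Returning plain hashes keeps the parsing separate from the printing, so the same rows could be filtered or counted before anything is fetched.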

Full source

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

page = agent.get 'http://www.gmail.com'

form = page.forms.first
form.Email = '***your gmail account***'

form.Passwd = '***your password***'
page = agent.submit form

page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')

page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")

page.search("//tr[@bgcolor='#ffffff']").each do |row|
  from, subject = *row.search("//b/text()")

  url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])

  puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"

  email = agent.get url

  # ..
end

Enjoy.


4 Responses to “Get list of Unread emails from Gmail with Mechanize and Hpricot”

  1. Sean Says:

    Is there any reason that just fetching IMAP headers isn’t adequate? Scraping web apps is highly prone to failure, as there is no guarantee the interface won’t be drastically changed.

  2. Luke Francl Says:

    Wouldn’t it be easier to use IMAP? I helped write a plug in that makes that super-simple: http://slantwisedesign.com/rdoc/fetcher/

  3. Vamsee Says:

    Thanks for the article. I was searching for something like this that can maintain cookie based web sessions.

  4. Logo Says:

    This is not bad advice, unlike a lot I have come across.

All rights reserved © 2007 SmarteGuru.com.