Get list of Unread emails from Gmail with Mechanize and Hpricot
This section shows how to use Mechanize and Hpricot to log in to Gmail and return a list of unread emails.
Installation of required tools
gem install mechanize --include-dependencies
This will install both mechanize and hpricot.
Usage
Using mechanize to login to gmail
Before we can scrape our Gmail account, we need to log in. Mechanize is a library for “automating interaction with websites”. It stores and sends cookies as well, so once we log in, our script has a session to putter around in as if it were a web browser.
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get 'http://www.gmail.com'
form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'
page = agent.submit form # submit the login form; the agent keeps the session cookies
Above you can see that we have instantiated the Mechanize class. The resulting object can be thought of as the user agent: it can get web pages, click links, and fill out and submit forms. We can use Hpricot methods on our page object to parse the HTML it contains.
Forcing Gmail into basic mode
Gmail relies heavily on JavaScript and Ajax, which makes it one of the premier Web 2.0 sites on the net. Our little script doesn't have a built-in JavaScript engine, so it won't understand any of the JS that's thrown at it. Instead we need to force Gmail into Basic Mode, which is HTML only.
After logging in, Gmail will try to redirect us to http://mail.google.com/mail?ui&auth=DC8F…, and we need to follow this link. Using Hpricot we can search for the meta redirect, grab its href attribute, and have Mechanize follow the link.
page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')
Note that we need to strip the single quotes from around the URL; I used gsub for this.
The returned page will try to use JavaScript to load the interface, but that will not work for us. Thankfully a noscript tag is included in the source, and it contains a helpful clue.
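To make the quote stripping concrete, here is a minimal standalone sketch that needs no network access. It assumes the redirect target arrives in the usual meta refresh form, "0; url='...'"; the helper name refresh_target and the example.com URL are made up for illustration and are not part of Mechanize or Hpricot.

```ruby
# Hypothetical helper (not part of Mechanize or Hpricot): pulls the target URL
# out of a meta refresh "content" value and strips any surrounding quotes.
def refresh_target(content)
  # Assumed format: "<seconds>; url=<target>", target optionally quoted.
  match = content.match(/\A\s*\d+\s*;\s*url\s*=\s*(\S+)/i)
  return nil unless match
  match[1].gsub(/['"]/, '')
end

# Assumed sample value; a real page will carry a different URL.
sample = "0; url='http://example.com/mail?ui=1'"
puts refresh_target(sample)
```

Returning nil when the value does not look like a refresh lets the caller decide what to do instead of following a garbage URL.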
<noscript><font face="arial">JavaScript must be enabled in order for you to use Gmail in standard view.
However, it seems JavaScript is either disabled or not supported by your browser.
To use standard view, enable JavaScript by changing your browser options, then <a href="">try again</a>.
<p>To use Gmail's basic HTML view, which does not require JavaScript,
<a href="?ui=html&zy=n">click here</a>.</p></font>
<p><font face="arial">If you want to view Gmail on a mobile phone or similar device
<a href="?ui=mobile&zyp=n">click here</a>.</font></p></noscript>
Notice the line ‘To use Gmail’s basic HTML view, which does not require JavaScript’: it supplies a link with these GET vars: ?ui=html&zy=n
The next step is to put the above GET vars on the current URL, and we are in basic mode, where we can scrape to our heart's content.
page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")
A simple puts page.root should show us the html output of our gmail account.
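The query-string swap can also be sketched with Ruby's standard URI library instead of a regular expression. This is only an illustration under assumptions: the helper name with_query is hypothetical, and the query parameters in the sample URL are invented rather than taken from a real Gmail session.

```ruby
require 'uri'

# Hypothetical helper: returns the given URL with its query string replaced.
def with_query(url, query)
  uri = URI.parse(url)
  uri.query = query    # overwrite whatever query string was there
  uri.fragment = nil   # drop any fragment so only the new query remains
  uri.to_s
end

# Assumed sample URL; the real one contains session-specific values.
puts with_query("http://mail.google.com/mail/?ui=2&auth=abc", "ui=html&zy=n")
```

One difference from the sub call: if the current URL has no query string at all, sub leaves it unchanged, while this version still appends the new one.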
Scrape!
Want to get a list of all your unread emails? This quick snippet will do the job.
page.search("//tr[@bgcolor='#ffffff']") do |row|
from, subject = *row.search("//b/text()")
url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])
puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"
email = agent.get url # have the agent follow the email link for further parsing
end
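The snippet above builds each message link by substituting the row's href into the current page URL. As an alternative sketch, the standard URI library can resolve a relative href against the page URL. The helper name absolute_link and the sample href are assumptions for illustration; the hrefs Gmail actually emits may look different.

```ruby
require 'uri'

# Hypothetical helper: resolves a link's href against the URL of the page
# it was found on, following the usual relative-reference rules.
def absolute_link(page_url, href)
  URI.join(page_url, href).to_s
end

base = "http://mail.google.com/mail/?ui=html&zy=n"
# The href below is made up; it stands in for a row's link attribute.
puts absolute_link(base, "?v=c&th=abc123")
```

This avoids depending on the literal text "ui" appearing in the page URL, and an href that is already absolute is returned as-is.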
Full source
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get 'http://www.gmail.com'
form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'
page = agent.submit form
page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')
page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")
page.search("//tr[@bgcolor='#ffffff']") do |row|
from, subject = *row.search("//b/text()")
url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])
puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"
email = agent.get url
# ..
end
Enjoy.

September 17th, 2008 at 12:27 pm
Is there any reason that just fetching IMAP headers isn’t adequate? Scraping web apps is highly prone to failure, as there is no guarantee the interface won’t be drastically changed.
September 17th, 2008 at 6:14 pm
Wouldn’t it be easier to use IMAP? I helped write a plugin that makes that super-simple: http://slantwisedesign.com/rdoc/fetcher/
September 18th, 2008 at 3:04 am
Thanks for the article. I was searching for something like this that can maintain cookie based web sessions.
March 13th, 2009 at 1:08 am
This is not bad advice, unlike a lot I have come across.