The Power of Mechanize

Georges Akouri-Shan
4 min read · Jul 11, 2016

Much like everyone else who has lived in NYC and experienced the time-honored landlord-versus-tenant relationship, I was eager to use my newfound knowledge of MVC to build something that levels the playing field. So I teamed up with some like-minded folks and got to drawing out the models, with the result looking like this:

I am sure in days to come we will look back and laugh. Anyway, as you can see, our to-do list stops at “Database all properties”. So we spent an evening realizing many things:

  1. Government websites are horrendous (well known, but worth reiterating)
  2. Websites will rate-limit your requests so you can’t consume all of their data and/or take them down
  3. Everyone and their grandmother has already built a similar service

That being said, I was still interested in the best way to retrieve data when I needed it, and in comes a gem called Mechanize.

Mechanize, by its own definition, is a Ruby library that makes automated web interaction easy. It depends on another gem called Nokogiri, an HTML, XML, SAX, and Reader parser with XPath and CSS selector support. In simple terms, it can read a webpage and filter its contents using CSS selectors.

I decided to test it out on Craigslist’s apartment ads and was surprised at how quickly I could scrape and store their data. Here’s a quick walkthrough:

Our code will start by initializing Mechanize and assigning it to the variable ‘scraper’. We then use its ‘history_added’ callback to add a delay between requests (reason: see #2 on the list of many things we learned). We then set BASE_URL to our host site and ADDRESS to the page where our form lives. Any queries you make against the form should begin with the URL found in ADDRESS.

Before we go any further, we will need to visit our form page and determine the value of the form id as well as any search fields we plan to use. We can do this with the browser’s Inspect tool.

Here are a couple of examples; keep an eye out for the name fields, which we will use later:

We can retrieve several things here:

  1. the form’s id="searchform", which our scraper will need to reference
  2. name="query", so we can use the search box while scraping
  3. name="min_price" and name="max_price", so we can filter our results as desired

Then we can tell our scraper to retrieve results based on our conditions:

We can add as many search fields as we want, and Mechanize will fill out the Craigslist form. We then submit it and store the resulting page in the instance variable @results_page.

Now we need to take our @results_page and use Mechanize’s Nokogiri dependency to parse this heap of data. If you recall, Nokogiri can parse using CSS selectors. Let’s see what that looks like:

If we inspect @results_page in the browser, we will find that each result is wrapped in its own paragraph tag. That means we can grab the content within each paragraph tag using the ‘search’ method and store the collection in ‘raw_results’. We then iterate through these results, again using ‘search’ to break out the data we want from each listing and store it in our database of listings.

That’s really all there is to it. Much like when dealing with APIs, I suggest using pry to dig in and see exactly what each method is doing and what data each object holds. The documentation is also worth a look to acclimate yourself to the methods Mechanize makes available.

http://www.rubydoc.info/gems/mechanize/Mechanize
