June 2014

Better Google Group scraper

In the previous article about Google Groups scraping, I demonstrated a very minimal script for getting basic information about group threads. It turns out the class-based approach doesn’t work so well, because the JavaScript frontend of Groups is not consistent in its class naming. I fixed my scripts, and now I’m going to show you what I changed.

Third try

Since we can’t use classes in XPath, we need to look for other leads. Luckily for us, every thread is wrapped in a div with a role="listitem" attribute. After this little change, our thread extractor looks like this.

    >>> import time
    >>> from lxml.html import fromstring
    >>> from selenium import webdriver
    >>> browser = webdriver.PhantomJS()
    >>> browser.set_window_size(1024, 768)
    >>> browser.get(group_url)
    >>> time.sleep(5)
    >>> frontpage = fromstring(browser.page_source)
    >>> browser.quit()
    >>> frontpage.make_links_absolute(GOOGLE_GROUP_BASE)
    >>> html_threads = frontpage.xpath('//div[@role="listitem"]')
    >>> threads = (thread_to_dict(thread) for thread in html_threads)
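
To see why selecting on the role attribute survives the class-name chaos, here is a minimal sketch on made-up markup. The sample HTML and its class names are invented for illustration, and the standard library's ElementTree stands in for lxml so the snippet runs without extra packages; its limited XPath support is enough for this attribute test.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; real Groups pages are far messier.
SAMPLE = """
<div>
  <div role="listitem" class="a1x"><a href="/d/topic/demo/1">First thread</a></div>
  <div role="banner" class="a1x">Not a thread</div>
  <div role="listitem" class="zz9"><a href="/d/topic/demo/2">Second thread</a></div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Select on the role attribute and ignore the inconsistent class names.
items = root.findall(".//div[@role='listitem']")

# The first link inside each item carries the thread name.
names = [item.find(".//a").text for item in items]
print(names)
```

Both threads are picked up even though their classes differ, and the banner div is skipped even though it shares a class with the first thread.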

Serialization (almost classless)

Even the serialization script relied heavily on classes. There are no stable hooks to hold on to inside the listitem, so we are a little bit blind this time.

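    # Assumption: date_parse is dateutil's parser, as the name suggests.
    from dateutil.parser import parse as date_parse

    def thread_to_dict(thread):
        # The first link in the thread carries both its name and its URL.
        link = thread.xpath('.//a')[0]
        parsed = {'name': link.text, 'url': link.attrib['href']}
        # Assumption: the timestamp is the value of the span's title attribute.
        raw_last_change = thread.xpath('.//span[@title]')[0].attrib['title']
        last_change = date_parse(raw_last_change)
        parsed['month'] = last_change.month
        # The counts live in the div whose style attribute contains "right".
        info = thread.xpath('.//div[contains(@style,"right")]')[0]
        spans = info.xpath('.//span[@class]')
        parsed['seen'] = int(spans[3].text.split()[0])
        parsed['posts'] = int(spans[4].text.split()[0])
        return parsed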

The first link in the thread contains the name and, of course, the address. Then we search for a span with a title attribute; that’s the time of the last change. The thread information is in a div whose style attribute contains right with a pixel value X, where X changes (it’s inconsistent even within a single page). The first and second span elements in the information line are the author and something else; I’m going after the post and seen counts.
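
Pulling the counts out with split() and int() breaks as soon as a span is empty or worded differently. A slightly more forgiving variant might look like the sketch below. The helper name leading_int and the sample strings are my own assumptions, not something taken from the Groups markup.

```python
import re


def leading_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    # An element without text yields None, so fall back to an empty string.
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


# Hypothetical span texts, just to exercise the helper.
for sample in ("12 posts", "340 views", "", None):
    print(repr(sample), "->", leading_int(sample))
```

A None result can then be stored as is or skipped, instead of one odd thread killing the whole scrape.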

Beyond scraping

Google kind of got me there, so now an unofficial Groups API sounds like a really good idea. Reverse engineering the JavaScript API could be a pain in the ass, but for read/write purposes it would be an awesome thing to have.

