DjangoCon 2011 – Y’all Wanna Scrape with Us? Content Ain’t a Thing : Web Scraping With Our Favorite Python Libraries

The great talks keep on rolling in with day 3 representing. The next talk is entitled “Y’all Wanna Scrape with Us? Content Ain’t a Thing : Web Scraping With Our Favorite Python Libraries” by Katharine Jarmul.

Love or hate them, the top Python scraping libraries have some hidden gems and tricks that you can use to enhance, update and diversify your Django models. This talk will teach you more advanced techniques to aggregate content from RSS feeds, Twitter, Tumblr and normal old web sites for your Django projects.

Updates Below:

19.09

Talk has ended. I’ll try to find the slides and post them.

18.59

Interesting links to read.

18.55

Talking about designer friends…

18.53

  • re: html == strings == parseable
  • re: Good for “I only care how many comments are on this page”
  • feedparser: follows standard XML rules
  • feedparser: Good for “I only care about RSS feeds”
  • HTMLParser: good base class for building your own parser
  • HTMLParser: Good for “I have an idea about how I want to handle embed tags” (a quick sketch below)
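
A minimal sketch (mine, not from the talk) of that HTMLParser idea: subclass it and override only the handlers you care about. In Python 3 the class lives in html.parser; the EmbedFinder name is hypothetical.

from html.parser import HTMLParser

class EmbedFinder(HTMLParser):
    """Collect the src of every <embed> tag on a page."""
    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag == 'embed':
            self.embeds.append(dict(attrs).get('src'))

parser = EmbedFinder()
parser.feed('<p>hi</p><embed src="movie.swf">')
print(parser.embeds)  # ['movie.swf']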

18.52

feedparser, HTMLParser, re (a quick feedparser sketch follows the list)

  • Sometimes lxml is too much or you have to parse in real-time (think page-load)
  • Sometimes you don’t care about broken pages or trying to parse everything
  • Sometimes you can’t install lxml.
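
A minimal feedparser sketch (mine, not from the slides); the feed URL is a placeholder:

import feedparser

feed = feedparser.parse('http://example.com/rss')  # placeholder URL
for entry in feed.entries:
    print(entry.title, entry.link)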

18.50

Tweepy Innards:

  • If your API lib just returns the raw API response with no frills, that’s not helpful
  • If API data is fairly standardized, do nice things like create models that reflect the data architecture (an illustrative sketch below)
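
An illustrative sketch of that second point (not Tweepy’s actual internals): wrap standardized API data in objects that mirror its structure. The Tweet class and field names here are hypothetical.

class Tweet:
    """Hypothetical model reflecting the shape of the API data."""
    def __init__(self, data):
        self.id = data['id']
        self.text = data['text']
        self.user = data['user']['screen_name']

raw = {'id': 1, 'text': 'hello djangocon', 'user': {'screen_name': 'kjam'}}
tweet = Tweet(raw)
print(tweet.user, tweet.text)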

18.45

Talking about XPath now.

“XPath is a great way to explore an XML tree.”
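
A minimal sketch (mine, not from the talk) of exploring a tree with XPath via lxml:

from lxml import etree

doctree = etree.fromstring(
    '<rss><channel><item><title>Post</title></item></channel></rss>')
# grab every <title> anywhere in the tree
print(doctree.xpath('//title/text()'))  # ['Post']
# relative paths work from any element
channel = doctree.xpath('/rss/channel')[0]
print(channel.xpath('item/title')[0].text)  # Post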

18.44

# gets the element's own text (up to its first child tag)
text = element.text

# gets the full text, including text inside formatting tags like <b>
text_w_formatting = element.text_content()

# loops through all parts of the text, piece by piece
text_bit_by_bit = list(element.itertext())

18.43

You can identify all forms on the page

forms = html.forms
  • Think: login pages
  • Awesome form submission example on lxml.html doc page (a sketch below)
  • Sometimes forms are smart (captchas, etc…)
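
A minimal sketch along those lines (not the lxml docs example itself); the URL and field names are placeholders:

from lxml import html

doc = html.parse('http://example.com/login').getroot()  # placeholder URL
form = doc.forms[0]
form.fields['username'] = 'kjam'      # hypothetical field names
form.fields['password'] = 'secret'
# submit_form posts the form and returns a file-like response
result = html.submit_form(form)
print(result.read())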

18.41

itersiblings() and iterchildren() can loop through all sibling and child tags, as in the sketch below
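
A quick sketch (mine, not from the slides):

from lxml import html

doc = html.fromstring('<div><p>a</p><p>b</p><span>c</span></div>')
first_p = doc.find('p')
for sibling in first_p.itersiblings():
    print(sibling.tag)  # p, span
for child in doc.iterchildren():
    print(child.tag)    # p, p, span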

18.39

HTML & ETREE

find, findall – can locate HTML elements within another node or page

spans = element.findall('span')

18.38

HTML & ETREE: Hidden Gems

sourceline – can identify the source line number of your element in the page

element.find('bar').sourceline

18.37

HTML: Hidden Gems

iterlinks – creates a generator that yields (element, attribute, link, pos) for every link on the page

page_links = list(doc.iterlinks())

  • Good for high-link pages or finding related links
  • Remember: Ads have lots of links

18.36

HTML Hidden Gems

cssselect – uses CSS selector syntax to find and select HTML elements (returns a list of matches)

article_title = html.cssselect('div#content h1.title')

18.35

# etree
from lxml import etree

def parse_feed_titles(rss_feed):
    data = []
    doctree = etree.fromstring(rss_feed)
    for x in doctree.iterdescendants():
        if x.tag == 'title':
            data.append(x.text)
    return data

18.33

LXML: Diving in

lxml.etree vs lxml.html

  • ETree: Best for properly formatted xml/xhtml
  • ETree: Powerful and fast for SOAP or other xml-formatted content
  • HTML: Best for websites, irregular web content
  • HTML: Slower but smarter (a quick comparison below)
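
A quick comparison sketch (mine, not from the slides); the broken markup is illustrative:

from lxml import etree, html

broken = '<div><p>unclosed paragraph<div>another</div>'

# etree demands well-formed XML and raises on this input
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as err:
    print('etree refused:', err)

# lxml.html quietly repairs the markup instead
doc = html.fromstring(broken)
print(html.tostring(doc))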

18.30

And we’re getting started. Katharine has taken the stage.

18.14

Katharine (@kjam) setting things up. 15 minutes and counting…