The great talks keep on rolling in with day 3 representing. The next talk is entitled “Y’all Wanna Scrape with Us? Content Ain’t a Thing : Web Scraping With Our Favorite Python Libraries” by Katharine Jarmul.
Love or hate them, the top Python scraping libraries have some hidden gems and tricks that you can use to enhance, update and diversify your Django models. This talk will teach you more advanced techniques to aggregate content from RSS feeds, Twitter, Tumblr and normal old web sites for your Django projects.
Talk has ended. I’ll try to find the slides and post them.
Interesting links to read.
Talking about designer friends…
- re: html == strings == parseable
- re: Good for “I only care how many comments are on this page”
- feedparser: follows standard XML rules
- feedparser: Good for “I only care about RSS feeds”
- HTMLParser: good base class for your own HTMLParser
- HTMLParser: Good for “I have an idea about how I want to handle embed tags”
feedparser, HTMLParser, re
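A minimal sketch of the two stdlib cases above, the "I only care how many comments are on this page" regex shortcut and an HTMLParser subclass that only handles embed tags (the HTML snippet and class names here are my own illustration, not from the talk):

```python
import re
from html.parser import HTMLParser

page = ('<div class="comment">Hi</div>'
        '<div class="comment">Bye</div>'
        '<embed src="a.swf">')

# re: count occurrences of the comment wrapper without a full parse
comment_count = len(re.findall(r'class="comment"', page))

# HTMLParser: a small subclass that only reacts to <embed> tags
class EmbedCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        if tag == 'embed':
            self.embeds.append(dict(attrs))

parser = EmbedCollector()
parser.feed(page)
```

No parse tree is ever built in either case, which is exactly why these are faster than lxml for throwaway questions.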
- Sometimes lxml is too much or you have to parse in real-time (think page-load)
- Sometimes you don’t care about broken pages or trying to parse everything
- Sometimes you can’t install lxml.
- If your API lib just returns the raw API response with no frills, that’s not helpful
- If API data is fairly standardized, do nice things like create models that reflect the data architecture
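One way to picture "models that reflect the data architecture": map each API response onto a record type whose fields mirror what the API actually returns. A stdlib dataclass stands in here for a Django model, and the Tweet fields and helper name are illustrative:

```python
from dataclasses import dataclass

# Hypothetical: a record mirroring the fields a Twitter API
# status response carries, rather than a raw JSON blob.
@dataclass
class Tweet:
    id: int
    text: str
    screen_name: str

def tweet_from_api(payload):
    """Map one API dict onto the model (field names are illustrative)."""
    return Tweet(id=payload['id'],
                 text=payload['text'],
                 screen_name=payload['user']['screen_name'])

t = tweet_from_api({'id': 1, 'text': 'hi',
                    'user': {'screen_name': 'kjam'}})
```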
Talking about XPath now.
“XPath is a great way to explore xml tree.”
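A quick taste of exploring a tree with XPath via lxml (the RSS-style markup below is my own toy example):

```python
from lxml import etree

# A tiny RSS-style document (markup is illustrative)
doc = etree.fromstring(
    '<rss><channel><title>My Feed</title>'
    '<item><title>Post one</title></item>'
    '<item><title>Post two</title></item></channel></rss>')

# //item/title selects every <title> inside an <item>, anywhere in the tree
item_titles = doc.xpath('//item/title/text()')
```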
# the element's own text
text = element.text
# text content, including text inside child tags
text_w_formatting = element.text_content()
# loops through all parts of the text
text_by_bit = list(element.itertext())
You can identify all forms on the page
forms = html.forms
- Think: login pages
- Awesome form submission example on lxml.html doc page
- Sometimes forms are smart (captchas, etc…)
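A small sketch of the login-page case: find the form via `.forms`, fill its fields like a dict, stop short of submitting (the page markup and credentials are my own illustration; `lxml.html.submit_form` is what would actually POST it):

```python
from lxml import html

# Hypothetical login page markup
page = html.fromstring(
    '<html><body>'
    '<form action="/login" method="post">'
    '<input name="username"/>'
    '<input name="password" type="password"/>'
    '</form></body></html>')

form = page.forms[0]               # .forms lists every <form> in the doc
form.fields['username'] = 'kjam'   # fields behaves like a dict
# lxml.html.submit_form(form) would POST it -- needs a real URL
```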
itersiblings() and iterchildren() can loop through all siblings and children tags
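For example (toy markup of my own), starting from the first list item:

```python
from lxml import etree

doc = etree.fromstring('<ul><li>a</li><li>b</li><li>c</li></ul>')
first = doc[0]

# itersiblings() walks the following sibling tags by default
siblings = [el.text for el in first.itersiblings()]

# iterchildren() walks an element's direct children
children = [el.text for el in doc.iterchildren()]
```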
HTML & ETREE
find, findall – can locate HTML elements within another node or page
spans = element.findall('span')
HTML & ETREE: Hidden Gems
sourceline – can identify the line number of your element in the page source
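For instance (toy document of my own), `sourceline` reports the 1-based line where the element's tag appears in the parsed source:

```python
from lxml import etree

doc = etree.fromstring('<div>\n<p>one</p>\n<span>two</span>\n</div>')
span = doc.find('span')

# the 1-based line number where <span> appears in the input
line_no = span.sourceline
```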
HTML: Hidden Gems
iterlinks – creates a generator of all link elements on the page
page_links = list(doc.iterlinks())
Good for high-link pages or finding related links
Remember: Ads have lots of links
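A small illustration (markup is my own): `iterlinks` yields a `(element, attribute, link, pos)` tuple for every link-like attribute, not just `<a href>` — `src`, `action` and friends count too:

```python
from lxml import html

doc = html.fromstring(
    '<p><a href="/about">About</a>'
    '<img src="logo.png"/></p>')

# collect just the link targets from each (element, attribute, link, pos)
links = [link for element, attribute, link, pos in doc.iterlinks()]
```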
HTML: Hidden Gems
cssselect – Uses CSS selector syntax to find and select HTML elements.
article_title = html.cssselect('div#content h1.title')
# etree
from lxml import etree

def parse_feed_titles(rss_feed):
    data = []
    doctree = etree.fromstring(rss_feed)
    for x in doctree.iterdescendants():
        if x.tag == 'title':
            data.append(x.text)
    return data
LXML: Diving in
lxml.etree vs lxml.html
- ETree: Best for properly formatted xml/xhtml
- ETree: Powerful and fast for SOAP or other xml-formatted content
- HTML: Best for websites, irregular web content
- HTML: Slower but smarter
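The "best for properly formatted xml" vs. "smarter" distinction in one sketch (the broken markup is my own example): `etree` refuses malformed input outright, while `lxml.html` repairs it the way a browser would:

```python
from lxml import etree, html

broken = '<div><p>unclosed paragraph<p>another</div>'

# etree insists on well-formed XML and raises on this input
try:
    etree.fromstring(broken)
    well_formed = True
except etree.XMLSyntaxError:
    well_formed = False

# lxml.html repairs the markup, closing the dangling <p> tags
fixed = html.fromstring(broken)
paragraphs = len(fixed.findall('p'))
```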
And we’re getting started. Katharine has taken the stage.
Katharine (@kjam) setting things up. 15 minutes and counting…