library-webscraping-deprecated's People

Contributors

abbycabs, bkatiemills, cmacdonell, erinbecker, evanwill, fmichonneau, gdevenyi, gvwilson, jason-ellis, jduckles, jnothman, jpallen, kimpham54, ldko, maxim-belkin, mkuzak, neon-ninja, ostephens, pbanaszkiewicz, pipitone, rgaiacs, synesthesiam, timtomch, twitwi, weaverbel, wking

library-webscraping-deprecated's Issues

Does this lesson have pre-requisites?

During review of the course, assess whether there are any prerequisites, and either make sure these are listed explicitly somewhere, or modify the lesson to avoid the need for prerequisites.

Review web scraping lesson structure and content

We need to do an overall review of the structure and content of this lesson and decide whether it is the right material for a Library Carpentry lesson on web scraping. I suggest we use an Etherpad to agree on a syllabus and then review this existing lesson against it.

Handout

This lesson might benefit from making a handout of reference materials.

To do this, add details of commands/terminology under the keypoints headers for each episode: for example, https://github.com/data-lessons/library-data-intro/blob/gh-pages/_episodes/04-regular-expressions.md. This effectively builds a handout at, for example, http://data-lessons.github.io/library-data-intro/reference/, which can be printed out in advance of the session (librarians love handouts!).

Make sure you make a note of this in your Instructor Notes #31

Adapting this for Resbaz Sydney

I thought I should shout out that Sydney University has asked me to present a web scraping introduction to researchers at https://2017.resbaz.com/sydney in early July. This is a great place to start, but I was hoping to make the following changes (time permitting):

  • use toscrape.com as a less context-biased test site, and to highlight some of the problems that scrapers need to get around.
  • incorporate XPath exercises using the Xpath Diner
  • mention CSS selectors as an alternative to XPath. (I had originally wanted to do it all with CSS selectors because of their familiarity, but given the existing XPath content, the fact that XPath more likely teaches everyone something new, and that it is more expressive, I'm now likely to go with XPath.) The main problem with XPath is the pain of selecting by class name (see the sketch after this list).
  • perhaps start by building a point-and-click scraper with grepsr.com, which I found to be the most user-friendly of the available tools, requiring zero knowledge of XPath, etc.
  • The Scraper Chrome Extension seems very limited in that it only does single-page scrapes, and I've found it buggy (I get error popups whose message is [object Object]). I would rather introduce users to Portia, which can then be converted to Scrapy; the main problem with that is that it's not so easy for most researchers to install themselves, although the docker run is trivial for a techie.
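
To illustrate the class-name point, here is a minimal sketch comparing CSS and XPath selection by class. It assumes the lxml and cssselect packages are installed and uses quotes.toscrape.com (where each quote is a div with class "quote") as the test page:

import requests
import lxml.html

response = requests.get('http://quotes.toscrape.com/')
tree = lxml.html.fromstring(response.text)

# CSS: selecting by class is a single token
quotes_css = tree.cssselect('.quote')

# XPath: selecting by class needs a string comparison on the class attribute
quotes_xpath = tree.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " quote ")]')

assert len(quotes_css) == len(quotes_xpath)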

Now that this has been rehomed to data-lessons and is being actively maintained, I will try to make my revisions more structured.

Wording in introduction - suggested re-write

In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be unpractical if there is a large amount of data to be extracted, or if it spread over a large number of pages.

Replace "unpractical" with "impractical". Unpractical is rarely used, though correct.
Change "or if it spread" to "or if it is spread".

Automating web scraping also allows to define whether the process should be run at regular intervals and capture changes in the data.

Change "also allows" to "also allows you".

Update the Table of Contents

The structure of the revised lesson should follow along these lines:

  1. Introduction: What is Web Scraping?
  2. Document Structure and Selectors
  3. Introduction to Requests and Beautiful Soup
  4. Advanced Web Scraping Using Python and Beautiful Soup
  5. Conclusion

We need to update the Table of Contents to reflect these changes.
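
As a rough, non-authoritative illustration of what episode 3 might cover, here is a minimal sketch assuming the requests and beautifulsoup4 packages are installed. It uses the UN Security Council resolutions page already referenced in the lesson; the assumption that the resolution links sit inside a table may need checking against the live page:

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it into a navigable tree
response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
soup = BeautifulSoup(response.text, 'html.parser')

# Beautiful Soup supports CSS selectors via select()
# (assumes the resolution links are inside a table on this page)
for link in soup.select('table a'):
    print(link.get_text(strip=True), link.get('href'))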

Should we use selenium as the basis of the web scraping tutorial?

Scrapy has been criticised primarily for the unintuitive overhead of defining class structures. Beautiful Soup has been criticised for not supporting XPath and not having a DOM-compatible API. lxml has more-or-less been criticised for its unintuitive API.

Selenium is another option. It drives a WebKit/Blink-based (Chrome/Safari), Gecko-based (Firefox) or Edge web browser and executes instructions in it.

Advantages:

  • supports XPath and CSS selectors (and indeed arbitrary javascript)
  • works over computed DOM and so mimics what one finds in the DOM inspector of a web browser's dev tools
  • supports interaction and AJAX-derived scraping (without having to reverse engineer the AJAX calls)
  • more-or-less cross-language API equivalence (i.e. the lessons are more portable)
  • can focus on the element inspector rather than view source in XPath/CSS selector tutorial

Disadvantages:

  • a little harder to install (not just pip/conda; presumably hardest on unprivileged *nix)
  • scraping overhead in terms of memory, runtime (including inter-process communication and translation to javascript), network access
  • error messages are harder to read
  • harder to interactively debug a scraper, since you need to go through get_property (the easiest way I've found to get all of elem's properties is driver.execute_script('return Object.keys(arguments[0])', elem))
  • need to use another library for constructing URL query strings

See also someone else's pros and cons, although one can easily disable downloading of images, hence avoiding some of the network overhead issues.

To try it out:
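
A minimal sketch, assuming Selenium 4+ with Firefox and geckodriver installed, and using the JavaScript-rendered variant of quotes.toscrape.com as a test page:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a real browser session (Firefox via geckodriver)
driver = webdriver.Firefox()
driver.get('http://quotes.toscrape.com/js/')  # page rendered by JavaScript

# Both CSS selectors and XPath work against the computed DOM
quotes = driver.find_elements(By.CSS_SELECTOR, '.quote .text')
authors = driver.find_elements(By.XPATH, '//small[@class="author"]')

for quote, author in zip(quotes, authors):
    print(author.text, ':', quote.text)

driver.quit()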

Might want to present or reference DataLad

DataLad is a data distribution and management system based on git/git-annex (and thus on a version control system). http://datasets.datalad.org provides a collection of datasets, the majority of which are constructed automatically by crawling data resources such as websites, S3 buckets, etc. You can find a quick demo at http://datalad.org/for/data-consumers showing how to crawl a web page and download all referenced content (files, including content extracted from archives) into a git/git-annex repository.

Might be of interest to at least some librarians while talking about crawling.

Disclaimer: I am one of the datalad developers

Lesson maintainers

This lesson currently has no maintainers. Maintainers perform a number of important tasks:

  • make sure the lesson is consistent with the other Library Carpentry lessons, for example that the README and License pages are correct and consistent (indeed the README does need a little work, see #28)
  • address any issues that are raised against the lesson
  • deal with any pull requests that are made for the lesson.
  • after a lesson is taught, make sure that suggestions for improvement from learners and instructors are integrated
  • as this is a new lesson, help it get through the (new) incubator process (data-lessons/librarycarpentry#22)
  • and, ideally, keep up with general Library Carpentry chatter at https://gitter.im/weaverbel/LibraryCarpentry

The lesson needs two maintainers, but the more the merrier, especially if we can ensure a good mix of time zones. Anyone up for it?

Introspection after ResBaz Sydney 2017 lesson

This afternoon, I had three hours (including a 10-minute break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, possibly academics, etc. from Sydney universities. I did not get anything in the way of a survey, but I hope to ask the ResBaz organisers to email students for their comments.

There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.

What we managed to cover

We covered, perhaps, half the material:

  • basically all of the introduction (about 20 mins)
  • almost all of CSS selectors (about 60 mins?)
  • (coffee break)
  • visual scraping with Web Scraper extension (75 mins)
  • did not do Python scraping at all
  • conclusion / ethical discussion (5 mins)

Good points

  • The UNSC resolutions web site highlights well the need to adjust your scraper to quirks and variation in the web site, and why not to always rely on the extraction patterns chosen by the visual scraper (but I'm not sure how clearly the latter came across to some students).
  • Successfully scraping data, after a while spent talking and learning about selectors etc., yielded quite a sense of accomplishment.
  • I think most students got the idea of the nested scraper design used in Web Scraper, and generally that episode worked quite well.
  • I think most students got the idea of CSS selectors matching elements and how this fits into a scraper.
  • The selecting challenge, performed in pairs/triples, was enjoyed.
  • The conclusion was a bit rushed, but quite clear (if perhaps a little repetitive).

Things deserving attention

Overall

  • There is far too much narrative before getting our hands dirty. Even so, students seemed to appreciate the "what web scraping is not" material, at least to some extent. It could probably be moved to the conclusion.

  • Students who were not well grounded in the structure of web pages struggled.

  • I had two projector screens. Even so, it is challenging to set up a visual projection that covers: the lesson, the page being scraped, source code or element inspector for a page being scraped, the scraping tool or code...

  • I think it would be good to focus on a visual scraper, but then have a number of scripts in several scraping frameworks and languages available as supplementary material to the lesson. A discussion of the nuances of coding these things by hand can be left brief, or available with more description for an extended lesson.

  • I feel that visual scrapers are a good way to demonstrate what we're up to with little coding competence required, and are in practice a useful technology to grok.

  • The key thing we need to consider is to what extent we make this available with a "choose your own adventure: CSS vs XPath; visual vs requests/lxml vs scrapy" approach, or as a single well-honed curriculum that works for most people.

CSS selectors

  • The episode should somehow start with an exercise, and be more bottom-up. Perhaps it should start with the HTML element inspector over the target web site and assertions like "similar structures in the page have the same tag name, class names, etc.", and use this to introduce topics like markup structure and tree structure / terminology (which is very useful). It would be better, for instance, to have seen class attributes long before we describe how to select them.
  • We should possibly use the UNSC example instead of analysing the lesson page itself (which also makes the lesson easier to maintain)
  • We could even consider evaluating CSS selectors in the console before we go into the details of a CSS selector and how it works.
  • "View source" can probably be replaced by "Element Inspector". There are advantages to both, but I don't feel like it's worth complicating the matter.
  • The <catfood> example is poorer for only having one of each tag name.
  • Some of the introductory material here is awkward because it tries to mirror the XPath content. For example, in CSS selection, we don't really think of attributes as part of the tree, and only sometimes think of text as such.
  • I borrowed from the XPath lesson the idea that the evaluation follows the path from the root to the target. But it seems easier to me to describe the way the selector prescribes the target, on its parent, etc. Remnants of the path paradigm linger in the lesson.
  • ("Extensions to CSS selectors" was relevant to a draft of the visual scraping lesson, but is no longer and can be removed.)

Visual scraping

  • There are some annoying interface issues in the Web Scraper extension, particularly in navigation (e.g. clicking rows vs clicking the buttons to their right; where to run a data preview, what to expect in it, and how it relates to the scrape results; the tool for selecting a parent selector is difficult for a beginner; the data preview shows with too-large margins so the box is very small; etc.)
  • We had other issues with Web Scraper, including that its Selection tool can behave a bit quirkily.
  • Web Scraper introduces some confusing terminology: its "selectors" need to be distinguished from CSS "selectors", and its "Type" needs to be distinguished from CSS selectors' :nth-of-type, which refers to tag name.
  • The Web Scraper extension opens up a browser window to do the scraping in. I don't think I made it clear to students that this is not a necessary part of scraping, i.e. that scraping in the background, often automatically and periodically, is the norm.
  • I ad-libbed some content on why we might want an Element Attribute apart from href, and spoke of machine-readable publication dates (with microdata) on news sites. I could also have mentioned the a element's title attribute. What else? Perhaps worth writing a paragraph on in the lesson.

I'll bring my lessons across to this repo shortly.

Anything to add, @nikzadb, @Anushi, @RichardPBerry?

Remove references to Chrome extensions

Remove the episode on scraping with Chrome extensions and references to Chrome extensions throughout. Per the discussion in #7, the starting point is scraping with Python and friends as a way to learn and apply programming concepts, rather than just using an available tool.

Talk more about URL hacking

The structure of a URL is something experienced web-scraper builders keep an eye on. It is a basic technique that may deserve discussion in this lesson, as in the sketch below.
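
A minimal sketch of this kind of URL hacking, using Python's standard urllib.parse; the search URL and its parameters here are hypothetical:

from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# A hypothetical search URL whose query string controls the results page
url = 'http://example.org/search?q=resolution&year=2016&page=1'

parts = urlsplit(url)
query = parse_qs(parts.query)  # {'q': ['resolution'], 'year': ['2016'], 'page': ['1']}

# Change one parameter and rebuild the URL to walk through result pages
query['page'] = ['2']
next_url = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
print(next_url)  # http://example.org/search?q=resolution&year=2016&page=2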

Add authors to AUTHORS file

We now have a workflow for releasing citable versions of our lessons (with DOIs) every 6 months via Zenodo. This makes our lessons more discoverable and sustainable and ensures that everyone involved gets the credit they deserve. For more on this work see data-lessons/librarycarpentry#5

In order to make this happen we need to make one crucial change: all AUTHORS files need to change so that they list names of contributors in the following format:

James Allen
James Baker
Piotr Banaszkiewicz
Erin Becker

@jt14den will run a script that strips names from lesson logs and edits AUTHORS across all Library Carpentry repos.

When this is actioned (hopefully, soon!), lesson maintainers are asked to eyeball the AUTHORS file to see if anyone obvious is missing (for example, people who contributed to discussions but didn't edit any lessons). Note: template developers are credited in this process; this is in line with Software Carpentry best practice.

In the future, lesson maintainers are encouraged to ensure that those who contribute to lessons are added manually to AUTHORS files (encourage contributors to do it so they see where and how we give credit!)

lxml.html.HTML causes an error

I am bringing this over from https://gitter.im/LibraryCarpentry/Lobby where it was reported by Laurens Vreekamp @campodipace_twitter:

The overview of the lesson 'Web scraping using Python: requests and lxml' has an error in the following code, under "Traversing elements in a page with lxml":
import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
tree = lxml.html.HTML(response.text)

This last line ("tree = lxml.html.HTML(response.text)") doesn't work. I searched for quite a long time to troubleshoot, but it seems the explanation below has the answer, fortunately:
"This code begins by building a tree of Elements from the HTML using lxml.html.fromstring(some_html)."
Thought I'd point this out to you people, because it was really bugging me, and since I love what you are doing, I'd better help out a bit.
Cheers, Laurens
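
For reference, a corrected version of the snippet, assuming the fix is simply to build the tree with lxml.html.fromstring as the lesson's own explanation suggests:

import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')

# lxml.html.HTML raised an error above; fromstring() builds the element tree
tree = lxml.html.fromstring(response.text)

# for example, list every link target on the page
print(tree.xpath('//a/@href'))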
