library-webscraping-deprecated's People

Contributors

abbycabs, bkatiemills, cmacdonell, erinbecker, evanwill, fmichonneau, gdevenyi, gvwilson, jason-ellis, jduckles, jnothman, jpallen, kimpham54, ldko, maxim-belkin, mkuzak, neon-ninja, ostephens, pbanaszkiewicz, pipitone, rgaiacs, synesthesiam, timtomch, twitwi, weaverbel, wking

library-webscraping-deprecated's Issues

Does this lesson have pre-requisites?

During review of the course, assess whether there are any prerequisites, and either make sure these are listed explicitly somewhere, or modify the lesson to avoid the need for prerequisites.

Review web scraping lesson structure and content

We need to do an overall review of the structure and content of this lesson and decide whether it is the right material for a Library Carpentry lesson on web scraping. I suggest we use an Etherpad to agree on a syllabus and then review this existing lesson against it.

Handout

This lesson might benefit from making a handout of reference materials.

To do this, add details of commands/terminology under the keypoints headers for each episode: for example, https://github.com/data-lessons/library-data-intro/blob/gh-pages/_episodes/04-regular-expressions.md. This effectively builds a handout at, for example, http://data-lessons.github.io/library-data-intro/reference/, which can be printed out in advance of the session (librarians love handouts!).

Make sure you make a note of this in your Instructor Notes #31

Adapting this for Resbaz Sydney

I thought I should shout out that Sydney University has asked me to present a web scraping introduction to researchers at https://2017.resbaz.com/sydney in early July. This is a great place to start, but I was hoping to make the following changes (time permitting):

  • use toscrape.com as a less context-biased test site, and to highlight some of the problems that scrapers need to get around.
  • incorporate XPath exercises using the Xpath Diner
  • mention CSS selectors as an alternative to XPath. (I had originally wanted to do it all with CSS selectors because of their familiarity, but given the existing XPath content, the fact that XPath more likely teaches everyone something new, and that it is more expressive, I'm now likely to go with XPath.) The main problem with XPath is the pain of selecting by class name (see the sketch after this list).
  • perhaps start by building a point-and-click scraper with grepsr.com, which I found to be the most user-friendly of the available tools, requiring zero knowledge of XPath, etc.
  • The Scraper Chrome Extension seems very limited in that it only does single-page scrapes, and I've found it buggy (I get error popups whose message is [object Object]). I would rather introduce users to Portia, which can then be converted to Scrapy; the main problem with that is that it's not so easy for most researchers to install themselves, although the docker run is trivial for a techie.
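
To illustrate the class-name point, here is a minimal sketch comparing CSS and XPath selection by class. It assumes the lxml and cssselect packages are installed and uses quotes.toscrape.com (where each quote is a div with class "quote") as the test page:

import requests
import lxml.html

response = requests.get('http://quotes.toscrape.com/')
tree = lxml.html.fromstring(response.text)

# CSS: selecting by class is a single token
quotes_css = tree.cssselect('.quote')

# XPath: selecting by class needs a string comparison on the class attribute
quotes_xpath = tree.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " quote ")]')

assert len(quotes_css) == len(quotes_xpath)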

Now that this has been rehomed to data-lessons and is being actively maintained, I will try to make my revisions more structured.

Wording in introduction - suggested re-write

In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be unpractical if there is a large amount of data to be extracted, or if it spread over a large number of pages.

Replace "unpractical" with "impractical". Unpractical is rarely used, though correct.
Change "or if it spread" to "or if it is spread".

Automating web scraping also allows to define whether the process should be run at regular intervals and capture changes in the data.

Change "also allows" to "also allows you".

Update the Table of Contents

The structure of the revised lesson should follow along these lines:

  1. Introduction: What is Web Scraping?
  2. Document Structure and Selectors
  3. Introduction to Requests and Beautiful Soup
  4. Advanced Web Scraping Using Python and Beautiful Soup
  5. Conclusion

We need to update the Table of Contents to reflect these changes.
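
As a rough, non-authoritative illustration of what episode 3 might cover, here is a minimal sketch assuming the requests and beautifulsoup4 packages are installed. It uses the UN Security Council resolutions page already referenced in the lesson; the assumption that the resolution links sit inside a table may need checking against the live page:

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it into a navigable tree
response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
soup = BeautifulSoup(response.text, 'html.parser')

# Beautiful Soup supports CSS selectors via select()
# (assumes the resolution links are inside a table on this page)
for link in soup.select('table a'):
    print(link.get_text(strip=True), link.get('href'))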

Should we use selenium as the basis of the web scraping tutorial?

Scrapy has been criticised primarily for the unintuitive overhead of defining class structures. Beautiful Soup has been criticised for not supporting XPath and not having a DOM-compatible API. lxml has more-or-less been criticised for its unintuitive API.

Selenium is another option. It drives a WebKit/Blink-based (Chrome/Safari), Gecko-based (Firefox) or Edge web browser and executes instructions in it.

Advantages:

  • supports XPath and CSS selectors (and indeed arbitrary javascript)
  • works over computed DOM and so mimics what one finds in the DOM inspector of a web browser's dev tools
  • supports interaction and AJAX-derived scraping (without having to reverse engineer the AJAX calls)
  • more-or-less cross-language API equivalence (i.e. the lessons are more portable)
  • can focus on the element inspector rather than view source in XPath/CSS selector tutorial

Disadvantages:

  • a little harder to install (not just pip/conda; presumably hardest on unprivileged *nix)
  • scraping overhead in terms of memory, runtime (including inter-process communication and translation to javascript), network access
  • error messages are harder to read
  • harder to interactively debug a scraper, since you need to go through get_property (the easiest way I've found to get all of elem's properties is driver.execute_script('return Object.keys(arguments[0])', elem))
  • need to use another library for constructing URL query strings

See also someone else's pros and cons, although one can easily disable downloading of images, hence avoiding some of the network overhead issues.

To try it out:
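
A minimal sketch, assuming Selenium 4+ with Firefox and geckodriver installed, and using the JavaScript-rendered variant of quotes.toscrape.com as a test page:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a real browser session (Firefox via geckodriver)
driver = webdriver.Firefox()
driver.get('http://quotes.toscrape.com/js/')  # page rendered by JavaScript

# Both CSS selectors and XPath work against the computed DOM
quotes = driver.find_elements(By.CSS_SELECTOR, '.quote .text')
authors = driver.find_elements(By.XPATH, '//small[@class="author"]')

for quote, author in zip(quotes, authors):
    print(author.text, ':', quote.text)

driver.quit()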

Might want to present or reference DataLad

DataLad is a data distribution and management system based on git/git-annex (and thus on a version control system). http://datasets.datalad.org provides a collection of datasets, the majority of which are constructed automatically by crawling data resources such as websites, S3 buckets, etc. You can find a quick demo at http://datalad.org/for/data-consumers showing how to crawl a web page and download all referenced content (files, including content extracted from archives) into a git/git-annex repository.

Might be of interest to at least some librarians while talking about crawling.

Disclaimer: I am one of the datalad developers

Lesson maintainers

This lesson currently has no maintainers. Maintainers perform a number of important tasks:

  • make sure the lesson is consistent with the other Library Carpentry lessons, for example that the README and License pages are correct and consistent (indeed the README does need a little work, see #28)
  • address any issues that are raised against the lesson
  • deal with any pull requests that are made for the lesson.
  • after a lesson is taught, make sure that suggestions for improvement from learners and instructors are integrated
  • as this is a new lesson, help it get through the (new) incubator process (data-lessons/librarycarpentry#22)
  • and, ideally, keep up with general Library Carpentry chatter at https://gitter.im/weaverbel/LibraryCarpentry

The lesson needs two maintainers, but the more the merrier, especially if we can ensure a good mix of time zones. Anyone up for it?

Introspection after ResBaz Sydney 2017 lesson

This afternoon, I had three hours (including a 10-minute break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, possibly academics, etc. from Sydney universities. I did not get anything in the way of a survey, but I hope to ask the ResBaz organisers to email students for their comments.

There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.

What we managed to cover

We covered, perhaps, half the material:

  • basically all of the introduction (about 20 mins)
  • almost all of CSS selectors (about 60 mins?)
  • (coffee break)
  • visual scraping with Web Scraper extension (75 mins)
  • did not do Python scraping at all
  • conclusion / ethical discussion (5 mins)

Good points

  • The UNSC resolutions web site highlights well the need to adjust your scraper to quirks and variation in the web site, and why not to always rely on the extraction patterns chosen by the visual scraper (but I'm not sure how clearly the latter came across to some students).
  • Successfully scraping data, after a while spent talking and learning about selectors etc., yielded quite a sense of accomplishment.
  • I think most students got the idea of the nested scraper design used in Web Scraper, and generally that episode worked quite well.
  • I think most students got the idea of CSS selectors matching elements and how this fits into a scraper.
  • The selecting challenge, performed in pairs/triples, was enjoyed.
  • The conclusion was a bit rushed, but quite clear (if perhaps a little repetitive).

Things deserving attention

Overall

  • There is far too much narrative before getting our hands dirty. Even so, students seemed to appreciate the "what web scraping is not" material, at least to some extent. It could probably be moved to the conclusion.

  • Students who were not well grounded in the structure of web pages struggled.

  • I had two projector screens. Even so, it is challenging to set up a visual projection that covers: the lesson, the page being scraped, source code or element inspector for a page being scraped, the scraping tool or code...

  • I think it would be good to focus on a visual scraper, but then have a number of scripts in several scraping frameworks and languages available as supplementary material to the lesson. A discussion of the nuances of coding these things by hand can be left brief, or available with more description for an extended lesson.

  • I feel that visual scrapers are a good way to demonstrate what we're up to with little coding competence required, and are in practice a useful technology to grok.

  • The key thing we need to consider is to what extent we make this available with a "choose your own adventure: CSS vs XPath; visual vs requests/lxml vs scrapy" approach, or as a single well-honed curriculum that works for most people.

CSS selectors

  • The episode should somehow start with an exercise, and be more bottom-up. Perhaps it should start with the HTML element inspector over the target web site and assertions like "similar structures in the page have the same tag name, class names, etc.", and use this to introduce topics like markup structure and tree structure / terminology (which is very useful). It would be better, for instance, to have seen class attributes long before we describe how to select them.
  • We should possibly use the UNSC example instead of analysing the lesson page itself (which also makes the lesson easier to maintain)
  • We could even consider evaluating CSS selectors in the console before we go into the details of a CSS selector and how it works.
  • "View source" can probably be replaced by "Element Inspector". There are advantages to both, but I don't feel like it's worth complicating the matter.
  • The <catfood> example is poorer for only having one of each tag name.
  • Some of the introductory material here is awkward because it tries to mirror the XPath content. For example, in CSS selection, we don't really think of attributes as part of the tree, and only sometimes think of text as such.
  • I borrowed from the XPath lesson the idea that the evaluation follows the path from the root to the target. But it seems easier to me to describe the way the selector prescribes the target, on its parent, etc. Remnants of the path paradigm linger in the lesson.
  • ("Extensions to CSS selectors" was relevant to a draft of the visual scraping lesson, but is no longer and can be removed.)

Visual scraping

  • There are some annoying interface issues in the Web Scraper extension, particularly in navigation (e.g. clicking rows vs clicking the buttons to their right; where to run a data preview, what to expect in it, and how it relates to the scrape results; the tool for selecting a parent selector is difficult for a beginner; the data preview shows with too-large margins so the box is very small; etc.)
  • We had other issues with Web Scraper, including that its Selection tool can behave a bit quirkily.
  • Web Scraper introduces some confusing terminology: its "selectors" need to be distinguished from CSS "selectors", and its "Type" needs to be distinguished from CSS selectors' :nth-of-type, which refers to tag name.
  • The Web Scraper extension opens up a browser window to do the scraping in. I don't think I made it clear to students that this is not a necessary part of scraping, i.e. that scraping in the background, often automatically and periodically, is the norm.
  • I ad-libbed some content on why we might want an Element Attribute apart from href, and spoke of machine-readable publication dates (with microdata) on news sites. I could also have mentioned the a element's title attribute. What else? Perhaps worth writing a paragraph on in the lesson.

I'll bring my lessons across to this repo shortly.

Anything to add, @nikzadb, @Anushi, @RichardPBerry?

Remove references to Chrome extensions

Remove the episode on scraping with Chrome extensions and references to Chrome extensions throughout. Per the discussion in #7, the starting point is scraping with Python and friends as a way to learn and apply programming concepts, rather than just using an available tool.

Talk more about URL hacking

The structure of a URL is something experienced web-scraper builders keep an eye on. It is a basic technique that may deserve discussion in this lesson, as in the sketch below.
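
A minimal sketch of this kind of URL hacking, using Python's standard urllib.parse; the search URL and its parameters here are hypothetical:

from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# A hypothetical search URL whose query string controls the results page
url = 'http://example.org/search?q=resolution&year=2016&page=1'

parts = urlsplit(url)
query = parse_qs(parts.query)  # {'q': ['resolution'], 'year': ['2016'], 'page': ['1']}

# Change one parameter and rebuild the URL to walk through result pages
query['page'] = ['2']
next_url = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
print(next_url)  # http://example.org/search?q=resolution&year=2016&page=2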

Add authors to AUTHORS file

We now have a workflow for releasing citable versions of our lessons (with DOIs) every 6 months via Zenodo. This makes our lessons more discoverable and sustainable and ensures that everyone involved gets the credit they deserve. For more on this work see data-lessons/librarycarpentry#5

In order to make this happen we need to make one crucial change: all AUTHORS files need to change so that they list names of contributors in the following format:

James Allen
James Baker
Piotr Banaszkiewicz
Erin Becker

@jt14den will run a script that strips names from lesson logs and edits AUTHORS across all Library Carpentry repos.

When this is actioned (hopefully, soon!), lesson maintainers are asked to eyeball the AUTHORS file to see if anyone obvious is missing (for example, people who contributed to discussions but didn't edit any lessons). Note: template developers are credited in this process; this is in line with Software Carpentry best practice.

In the future, lesson maintainers are encouraged to ensure that those who contribute to lessons are added manually to AUTHORS files (encourage contributors to do it so they see where and how we give credit!)

lxml.html.HTML causes an error

I am bringing this over from https://gitter.im/LibraryCarpentry/Lobby where it was reported by Laurens Vreekamp @campodipace_twitter:

The overview of the lesson 'Web scraping using Python: requests and lxml' has an error in the following code, under "Traversing elements in a page with lxml":
import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
tree = lxml.html.HTML(response.text)

This last line ("tree = lxml.html.HTML(response.text)") doesn't work. I searched for quite a long time to troubleshoot, but it seems the explanation below has the answer, fortunately:
"This code begins by building a tree of Elements from the HTML using lxml.html.fromstring(some_html)."
Thought I'd point this out to you people, because it was really bugging me, and since I love what you are doing, I'd better help out a bit.
Cheers, Laurens
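
For reference, a corrected version of the snippet, assuming the fix is simply to build the tree with lxml.html.fromstring as the lesson's own explanation suggests:

import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')

# lxml.html.HTML raised an error above; fromstring() builds the element tree
tree = lxml.html.fromstring(response.text)

# for example, list every link target on the page
print(tree.xpath('//a/@href'))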
