
mlscraper's Introduction

mlscraper: Scrape data from HTML pages automatically


mlscraper allows you to extract structured data from HTML automatically instead of manually specifying nodes or CSS selectors. You train it by providing a few examples of your desired output; it then figures out the extraction rules automatically, and afterwards you can extract data from any new page you provide.

[Image: how mlscraper turns HTML into data objects]

Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I've been wondering for a long time why there's no open-source solution that does something like this, so here's my attempt at creating a Python library to enable automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

How it works

After you've defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM
  • determine which rules/methods to apply for extraction
  • extract the data for you and return it in a dictionary

Getting started

mlscraper is currently approaching version 1.0. If you want to check out the new release, use pip install --pre mlscraper to test the release candidate. You can also install the latest (unstable) development version via pip install git+https://github.com/lorey/mlscraper#egg=mlscraper, e.g. to check out new features or to see if a bug has been fixed already. Please note that until the 1.0 release, pip install mlscraper will return an outdated 0.* version.

To get started with a simple scraper, check out the basic sample below.

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}

Check the examples directory for usage examples until further documentation arrives.

Development

See CONTRIBUTING.rst

Related work

I originally called this project autoscraper, but while I was working on it, someone else released a library with exactly that name. Check it out here: autoscraper. Also, while the approach was initially driven by machine learning, using statistics to search for heuristics turned out to be faster and to require less training data. But since the name is memorable, I'll keep it.

mlscraper's People

Contributors

leo8198, lorey


mlscraper's Issues

Adding question mark to the sample fails

The following code:

from bs4 import BeautifulSoup
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark?'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

Throws the following error:

mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?'

But the same code works when the question mark is removed from the HTML and the sample:

training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

Training throws an error

The page does contain the value 5190. Could the error be caused by extra whitespace or blank lines in the page's data? How can this be handled? Can blank lines be ignored so that only the data is extracted?

My code:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

einstein_url = 'http://www.i001.com/main1.shtml'
resp = requests.get(einstein_url)
assert resp.status_code == 200
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': '5190'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

The reported error:
ValueError:

                                    5190
                                </td> is not in list

Find and fix issue with GitHub profile pages

  • follower counts have no unique selector (need nth or something else)
  • image width and height get matched when searching for 20 followers (as icons have manually set dimensions)

Installation issues

ModuleNotFoundError: No module named 'mlscraper.html'; 'mlscraper' is not a package. I installed the package in a conda env using pip install --pre 'mlscraper==1.0.0rc3'.

Module not found error?

Does mlscraper still work? I cannot get it to run (not even the sample code). I always get a ModuleNotFoundError:

ModuleNotFoundError: No module named 'mlscraper.html'

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}

The package is definitely installed, though.

Or what am I missing?

Fuzzy text matching

Specifically for text matching, something fuzzy would be great to reduce errors, e.g. checking the similarity of long texts to avoid whitespace-based mismatches.

Options

  • generic fuzzy matching for text
  • passing samples that contain StartOfText('In a country far far away') instead of the full string, so we can match nodes whose text starts with the given prefix

Fuzziness also needs to be considered when checking for correctness later, as scraper.get(page) == expected_result could then turn out to be false.
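
A minimal sketch of what generic fuzzy matching could look like, using difflib from the standard library; the helper name and threshold are illustrative, not existing mlscraper API:

import difflib

def texts_match(a: str, b: str, threshold: float = 0.9) -> bool:
    # normalize whitespace first, then compare by similarity ratio
    a_norm = " ".join(a.split())
    b_norm = " ".join(b.split())
    return difflib.SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold

assert texts_match("with  a question\nmark?", "with a question mark?")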

Example from docs does not work

Unfortunately, this example from the README does not work. Perhaps I'm doing something wrong.

Example:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)

Error:

File ~/miniconda3/envs/colbert/lib/python3.8/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:133, in _make_cell_set_template_code()
    116     return types.CodeType(
    117         co.co_argcount,
    118         co.co_nlocals,
   (...)
    130         (),
    131     )
    132 else:
--> 133     return types.CodeType(
    134         co.co_argcount,
    135         co.co_kwonlyargcount,
    136         co.co_nlocals,
    137         co.co_stacksize,
    138         co.co_flags,
    139         co.co_code,
    140         co.co_consts,
    141         co.co_names,
    142         co.co_varnames,
    143         co.co_filename,
    144         co.co_name,
    145         co.co_firstlineno,
    146         co.co_lnotab,
    147         co.co_cellvars,  # this is the trickery
    148         (),
    149     )

TypeError: an integer is required (got type bytes)

Find better selectors

Currently, we just use the next best selector we find, going from generic to specific. But overly generic selectors are bad, e.g. div most likely has no meaning, while overly specific selectors like the full path are likely to break.

Maybe there's a heuristic for good selectors. An idea: what if we computed the selectivity of each selector, i.e. how unique the selector is on the whole page? This would prefer ids and unique classes and discourage generic selectors. We would then take the most selective yet simplest selector, as sketched below.
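
A minimal sketch of this heuristic, assuming BeautifulSoup as the DOM layer; selectivity and best_selector are illustrative helpers, not existing mlscraper API:

from bs4 import BeautifulSoup

def selectivity(soup: BeautifulSoup, selector: str) -> float:
    """Return 1.0 if the selector matches exactly one node on the page."""
    matches = soup.select(selector)
    return 1.0 / len(matches) if matches else 0.0

def best_selector(soup: BeautifulSoup, candidates: list) -> str:
    # most selective wins; ties go to the shorter, simpler selector
    return max(candidates, key=lambda s: (selectivity(soup, s), -len(s)))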

Integer Matching

People want to extract proper integers; a straightforward way would be to implement items and extractors that return integers, as sketched below.
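
A minimal sketch of such an extractor; extract_int is a hypothetical helper, not part of mlscraper:

import re

def extract_int(text: str) -> int:
    # keep digits (and a possible minus sign), e.g. '1,234 followers' -> 1234
    digits = re.sub(r"[^\d-]", "", text)
    return int(digits)

assert extract_int("1,234 followers") == 1234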

Does not work with some sites

Does not return results for sites like bbc.com, dnevnik.bg, etc., even when I try to scrape only the article titles.

How to save the model?

I want to save the model so I can reuse it later. I can't find how to do this in the examples.
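
There is no documented save/load API yet. A possible workaround, assuming the trained scraper object is picklable, is the standard pickle module:

import pickle

# save the trained scraper to disk
with open("scraper.pkl", "wb") as f:
    pickle.dump(scraper, f)

# load it again in a later session
with open("scraper.pkl", "rb") as f:
    scraper = pickle.load(f)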

Improve version pinning

Rather too narrow than too broad: e.g. the text= argument of bs4 could cause trouble with older versions.

Extreme RAM usage

Training uses unbounded RAM; the process gets killed by the system during high RAM usage. It even crashed my Google Colab instance with 12 GB of RAM.

mlscraper==1.0.0rc3

Enable increasing complexity in rule-based scraper

Currently, the rule-based scraper tries all potential CSS selectors at once. It would be more performant to increase the CSS selector complexity step by step, so that we first try single-node rules like div.item and only move on to rules like .menu > div.item.company if the simpler rules don't work, as sketched below.
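
A rough sketch of the proposed loop; train_scraper does take a complexity argument (see the traceback in the "Scraper not found error" issue below), but the retry wrapper itself is hypothetical:

from mlscraper.training import NoScraperFoundException, train_scraper

def train_stepwise(training_set, max_complexity=3):
    # try simple selector rules first, allow more complexity only on failure
    for complexity in range(1, max_complexity + 1):
        try:
            return train_scraper(training_set, complexity=complexity)
        except NoScraperFoundException:
            continue
    raise NoScraperFoundException("did not find scraper")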

Match substrings

Often, users do not want to match the full attribute or text of a node, but a specific substring.

Solutions:

  • generate extractors that use appropriate rules to transform node.text into the desired outcome, as sketched below
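
A minimal sketch of deriving such a rule from a single sample; derive_substring_rule is a hypothetical helper, and a real implementation would need to generalize across samples:

def derive_substring_rule(node_text: str, wanted: str):
    # learn the prefix and suffix surrounding the wanted substring
    start = node_text.find(wanted)
    if start == -1:
        return None
    prefix = node_text[:start]
    suffix = node_text[start + len(wanted):]

    def extract(text: str) -> str:
        # strip the learned prefix/suffix from the text of a new node
        return text[len(prefix):len(text) - len(suffix)]

    return extract

rule = derive_substring_rule("Followers: 329", "329")
assert rule("Followers: 1024") == "1024"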

Stackoverflow example not working

This is the code:

import logging

import requests

from mlscraper import SingleItemPageSample, RuleBasedSingleItemScraper


items = {
    "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array": {
        "title": "Why is processing a sorted array faster than processing an unsorted array?"
    },
    "https://stackoverflow.com/questions/927358/how-do-i-undo-the-most-recent-local-commits-in-git": {
        "title": "How do I undo the most recent local commits in Git?"
    },
    "https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do": {
        "title": "What does the “yield” keyword do?"
    },
}

results = {url: requests.get(url) for url in items.keys()}

# train scraper
samples = [
    SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
]
scraper = RuleBasedSingleItemScraper.build(samples)

print("Scraping new question")
html = requests.get(
    "https://stackoverflow.com/questions/2003505/how-do-i-delete-a-git-branch-locally-and-remotely"
).content
result = scraper.scrape(html)

print("Result: %s" % result)

Output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-9f646dab1fca> in <module>()
     24     SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
     25 ]
---> 26 scraper = RuleBasedSingleItemScraper.build(samples)
     27 
     28 print("Scraping new question")

4 frames
/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in build(samples)
     89                     matches_per_page_right = [
     90                         len(m) == 1 and m[0].get_text() == s.item[attr]
---> 91                         for m, s in zip(matches_per_page, samples)
     92                     ]
     93                     score = sum(matches_per_page_right) / len(samples)

/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <listcomp>(.0)
     88                     matches_per_page = (s.page.select(selector) for s in samples)
     89                     matches_per_page_right = [
---> 90                         len(m) == 1 and m[0].get_text() == s.item[attr]
     91                         for m, s in zip(matches_per_page, samples)
     92                     ]

/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <genexpr>(.0)
     86                 if selector not in selector_scoring:
     87                     logging.info("testing %s (%d/%d)", selector, i, len(selectors))
---> 88                     matches_per_page = (s.page.select(selector) for s in samples)
     89                     matches_per_page_right = [
     90                         len(m) == 1 and m[0].get_text() == s.item[attr]

/usr/local/lib/python3.7/dist-packages/mlscraper/parser.py in select(self, css_selector)
     28     def select(self, css_selector):
     29         try:
---> 30             return [SoupNode(res) for res in self._soup.select(css_selector)]
     31         except NotImplementedError:
     32             logging.warning(

/usr/local/lib/python3.7/dist-packages/bs4/element.py in select(self, selector, _candidate_generator, limit)
   1495                 if tag_name == '':
   1496                     raise ValueError(
-> 1497                         "A pseudo-class must be prefixed with a tag name.")
   1498                 pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
   1499                 found = []

ValueError: A pseudo-class must be prefixed with a tag name.

missing mlscraper.html

Followed the README and was testing the code after pip install --pre mlscraper, but got a ModuleNotFoundError:

from mlscraper.html import Page
ModuleNotFoundError: No module named 'mlscraper.html'

Checking the installed library, only the following modules were present: ml.py, parser.py, training.py, util.py.

For people checking out the library, it would be convenient if the README listed all required imports:
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

High memory usage with github page as sample


import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'https://github.com/lorey/mlscraper/issues/38'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for the issue title
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'title': 'Scraper not found error'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('https://github.com/lorey/mlscraper/issues/27')
result = scraper.get(Page(resp.content))
print(result)

Feedback

Gave this a try :-)

Feedback:

  • If this library works as advertised it'd be huge!
  • mlscraper.html is missing from the PyPI package.
  • When no scraper can be found, the error message could be more helpful:
    mlscraper.training.NoScraperFoundException: did not find scraper
    Would be nice if the error message gave some guidance as to what fields
    couldn't be found in the HTML.
    Even with DEBUG log level it's not really helpful.
  • See more notes in my script below.
  • Training the script was really slow (I gave up after 15 min).

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()

page = Page(resp.content)
sample = Sample(
    page,
    {
        "name": "Jonas Haag",
        "followers": "329",  # Note that this doesn't work if 329 passed as an int.
        #'company': '@QuantCo',  # Does not work.
        "twitter": "@_jonashaag",  # Does not work without the "@".
        "username": "jonashaag",
        "nrepos": "282",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

resp = requests.get("https://github.com/lorey")
result = scraper.get(Page(resp.content))
print(result)

Scraper not found error

I am trying to scrape the Case Number from the HTML file attached below.

Versions:

mlscraper: pip install --pre mlscraper
python: 3.9

Code:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
HTMLFile = open("/content/PUS06037-BC2116122017-06-12 14_07_29.976088.txt", "r")

# read the file
index = HTMLFile.read()

# create a training sample
training_set = TrainingSet()
index = index.replace(u'\xa0', u' ')
page = Page(index)
sample = Sample(page, {'Filing Date:': '06/08/1999', 'Case Number:': 'BC211612'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

I am getting the following message:

NoScraperFoundException Traceback (most recent call last)
in <cell line: 20>()
18
19 # train the scraper with the created training set
---> 20 scraper = train_scraper(training_set)
21
22 # scrape another page

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in train_scraper(training_set, complexity)
72 f"({complexity=}, {match_combination=})"
73 )
---> 74 raise NoScraperFoundException("did not find scraper")
75
76

NoScraperFoundException: did not find scraper

PUS06037-BC2116122017-06-12 14_07_29.txt

Separate rule extraction from scrapers

Rule extraction can be separated from scrapers.

Rule extraction:

  • get html/DOM
  • generate and test rules

Scraper:

  • fetch the page (fetch if URL, parse if HTML, pass through if DOM); see the sketch below
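
A hedged sketch of the scraper side of this split; all names are hypothetical, and BeautifulSoup stands in for the DOM layer:

from bs4 import BeautifulSoup

class RuleScraper:
    """Applies previously extracted rules; knows nothing about training."""

    def __init__(self, rules):
        self.rules = rules  # mapping of field name -> css selector

    def scrape(self, html):
        soup = BeautifulSoup(html, "lxml")
        return {field: soup.select_one(selector).get_text(strip=True)
                for field, selector in self.rules.items()}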

Include fetching in scrapers

Scrapers should be able to deal with URLs, HTML, and parsed DOMs (even requests response objects?) to enable flexible library usage; see the sketch below the list.

  • scrapers input HTML, DOM, and responses, e.g. via scrape_soup, scrape_html, scrape_url
  • examples can be created via static methods, e.g. via SingleItemPageSample.from_soup, etc.
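
A minimal sketch of such a wrapper; the scrape_url/scrape_html method names come from this issue, while the class itself is hypothetical:

import requests
from mlscraper.html import Page

class FetchingScraper:
    def __init__(self, scraper):
        self.scraper = scraper  # a trained mlscraper scraper

    def scrape_url(self, url: str):
        resp = requests.get(url)
        resp.raise_for_status()
        return self.scrape_html(resp.content)

    def scrape_html(self, html):
        return self.scraper.get(Page(html))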
