
mlscraper's Introduction

mlscraper: Scrape data from HTML pages automatically


mlscraper allows you to extract structured data from HTML automatically instead of manually specifying nodes or CSS selectors. You train it by providing a few examples of your desired output; it then figures out the extraction rules automatically, and afterwards you can extract data from any new page you provide.

[Image: how mlscraper turns HTML into data objects]

Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I've been wondering for a long time why there's no open-source solution that does something like this, so here's my attempt at creating a Python library to enable automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

How it works

After you've defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM
  • determine which rules/methods to apply for extraction
  • extract the data for you and return it in a dictionary

Getting started

mlscraper is currently approaching version 1.0. If you want to check out the new release, use pip install --pre mlscraper to test the release candidate. You can also install the latest (unstable) development version via pip install git+https://github.com/lorey/mlscraper#egg=mlscraper, e.g. to check out new features or to see if a bug has been fixed already. Please note that until the 1.0 release, pip install mlscraper will return an outdated 0.* version.

To get started with a simple scraper, check out the basic sample below.

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}

Check the examples directory for usage examples until further documentation arrives.

Development

See CONTRIBUTING.rst

Related work

I originally called this project autoscraper, but while I was working on it, someone else released a library with exactly that name. Check it out here: autoscraper. Also, while the approach was initially driven by machine learning, using statistics to search for heuristics turned out to be faster and to require less training data. But since the name is memorable, I'll keep it.

mlscraper's People

Contributors

leo8198, lorey


mlscraper's Issues

Adding question mark to the sample fails

The following code:

from bs4 import BeautifulSoup
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark?'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

Throws the following error:

mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?'

But the same code works when the question mark is removed from the HTML and the sample:

training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

Training throws an error

The page does contain the value 5190. Could the error be caused by extra whitespace or blank lines in the page's data? How can this be handled? Can blank lines be ignored so that only the data is extracted?

My code:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

einstein_url = 'http://www.i001.com/main1.shtml'
resp = requests.get(einstein_url)
assert resp.status_code == 200
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': '5190'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

The reported error:
ValueError:

                                    5190
                                </td> is not in list

Find and fix issue with GitHub profile pages

  • follower counts have no unique selector (need nth or something else)
  • image width and height get matched when searching for 20 followers (as icons have manually set dimensions)

Installation issues

ModuleNotFoundError: No module named 'mlscraper.html'; 'mlscraper' is not a package. I installed the package in a conda env using pip install --pre 'mlscraper==1.0.0rc3'.

Module not found error?

Does mlscraper still work? I cannot get it to run (not even the sample code). I always get a ModuleNotFoundError:

ModuleNotFoundError: No module named 'mlscraper.html'

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}

The package is definitely installed, though.

Or what am I missing?

Fuzzy text matching

Specifically for text matching, something fuzzy would be great to reduce errors, e.g. checking the similarity of long texts to avoid whitespace-based mismatches.

Options

  • generic fuzzy matching for text
  • passing samples that contain StartOfText('In a country far far away') instead of the full string, so we can match nodes whose text starts with the given prefix

Fuzziness also needs to be considered when checking for correctness later, as scraper.get(page) == expected_result could then turn out to be false.
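
A minimal sketch of what generic fuzzy matching could look like, using difflib from the standard library; the helper name and threshold are illustrative, not existing mlscraper API:

import difflib

def texts_match(a: str, b: str, threshold: float = 0.9) -> bool:
    # normalize whitespace first, then compare by similarity ratio
    a_norm = " ".join(a.split())
    b_norm = " ".join(b.split())
    return difflib.SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold

assert texts_match("with  a question\nmark?", "with a question mark?")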

Example from docs does not work

Unfortunately, this example from the README does not work. Perhaps I'm doing something wrong.

Example:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)

Error:

File ~/miniconda3/envs/colbert/lib/python3.8/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:133, in _make_cell_set_template_code()
    116     return types.CodeType(
    117         co.co_argcount,
    118         co.co_nlocals,
   (...)
    130         (),
    131     )
    132 else:
--> 133     return types.CodeType(
    134         co.co_argcount,
    135         co.co_kwonlyargcount,
    136         co.co_nlocals,
    137         co.co_stacksize,
    138         co.co_flags,
    139         co.co_code,
    140         co.co_consts,
    141         co.co_names,
    142         co.co_varnames,
    143         co.co_filename,
    144         co.co_name,
    145         co.co_firstlineno,
    146         co.co_lnotab,
    147         co.co_cellvars,  # this is the trickery
    148         (),
    149     )

TypeError: an integer is required (got type bytes)

Find better selectors

Currently, we just use the next best selector we find, going from generic to specific. But overly generic selectors are bad, e.g. div most likely has no meaning, while overly specific selectors like the full path are likely to break.

Maybe there's a heuristic for good selectors. An idea: what if we computed the selectivity of each selector, i.e. how unique the selector is on the whole page? This would prefer ids and unique classes and discourage generic selectors. We would then take the most selective yet simplest selector, as sketched below.
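
A minimal sketch of this heuristic, assuming BeautifulSoup as the DOM layer; selectivity and best_selector are illustrative helpers, not existing mlscraper API:

from bs4 import BeautifulSoup

def selectivity(soup: BeautifulSoup, selector: str) -> float:
    """Return 1.0 if the selector matches exactly one node on the page."""
    matches = soup.select(selector)
    return 1.0 / len(matches) if matches else 0.0

def best_selector(soup: BeautifulSoup, candidates: list) -> str:
    # most selective wins; ties go to the shorter, simpler selector
    return max(candidates, key=lambda s: (selectivity(soup, s), -len(s)))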

Integer Matching

People want to extract proper integers; a straightforward way would be to implement items and extractors that return integers, as sketched below.
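
A minimal sketch of such an extractor; extract_int is a hypothetical helper, not part of mlscraper:

import re

def extract_int(text: str) -> int:
    # keep digits (and a possible minus sign), e.g. '1,234 followers' -> 1234
    digits = re.sub(r"[^\d-]", "", text)
    return int(digits)

assert extract_int("1,234 followers") == 1234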

Does not work with some sites

Does not return results for sites like bbc.com, dnevnik.bg, etc., even when I try to scrape only the article titles.

How to save the model?

I want to save the model so I can reuse it later. I can't find how to do this in the examples.
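
There is no documented save/load API yet. A possible workaround, assuming the trained scraper object is picklable, is the standard pickle module:

import pickle

# save the trained scraper to disk
with open("scraper.pkl", "wb") as f:
    pickle.dump(scraper, f)

# load it again in a later session
with open("scraper.pkl", "rb") as f:
    scraper = pickle.load(f)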

Improve version pinning

Rather too narrow than too broad: e.g. the text= argument of bs4 could cause trouble with older versions.

Extreme RAM usage

Training uses unbounded RAM; the process gets killed by the system during high RAM usage. It even crashed my Google Colab instance with 12 GB of RAM.

mlscraper==1.0.0rc3

Enable increasing complexity in rule-based scraper

Currently, the rule-based scraper tries all potential CSS selectors at once. It would be more performant to increase the CSS selector complexity step by step, so that we first try single-node rules like div.item and only move on to rules like .menu > div.item.company if the simpler rules don't work, as sketched below.
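
A rough sketch of the proposed loop; train_scraper does take a complexity argument (see the traceback in the "Scraper not found error" issue below), but the retry wrapper itself is hypothetical:

from mlscraper.training import NoScraperFoundException, train_scraper

def train_stepwise(training_set, max_complexity=3):
    # try simple selector rules first, allow more complexity only on failure
    for complexity in range(1, max_complexity + 1):
        try:
            return train_scraper(training_set, complexity=complexity)
        except NoScraperFoundException:
            continue
    raise NoScraperFoundException("did not find scraper")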

Match substrings

Often, users do not want to match the full attribute or text of a node, but a specific substring.

Solutions:

  • generate extractors that use appropriate rules to transform node.text into the desired outcome, as sketched below
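
A minimal sketch of deriving such a rule from a single sample; derive_substring_rule is a hypothetical helper, and a real implementation would need to generalize across samples:

def derive_substring_rule(node_text: str, wanted: str):
    # learn the prefix and suffix surrounding the wanted substring
    start = node_text.find(wanted)
    if start == -1:
        return None
    prefix = node_text[:start]
    suffix = node_text[start + len(wanted):]

    def extract(text: str) -> str:
        # strip the learned prefix/suffix from the text of a new node
        return text[len(prefix):len(text) - len(suffix)]

    return extract

rule = derive_substring_rule("Followers: 329", "329")
assert rule("Followers: 1024") == "1024"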

Stackoverflow example not working

This is the code:

import logging

import requests

from mlscraper import SingleItemPageSample, RuleBasedSingleItemScraper


items = {
    "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array": {
        "title": "Why is processing a sorted array faster than processing an unsorted array?"
    },
    "https://stackoverflow.com/questions/927358/how-do-i-undo-the-most-recent-local-commits-in-git": {
        "title": "How do I undo the most recent local commits in Git?"
    },
    "https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do": {
        "title": "What does the “yield” keyword do?"
    },
}

results = {url: requests.get(url) for url in items.keys()}

# train scraper
samples = [
    SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
]
scraper = RuleBasedSingleItemScraper.build(samples)

print("Scraping new question")
html = requests.get(
    "https://stackoverflow.com/questions/2003505/how-do-i-delete-a-git-branch-locally-and-remotely"
).content
result = scraper.scrape(html)

print("Result: %s" % result)

Output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-9f646dab1fca> in <module>()
     24     SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
     25 ]
---> 26 scraper = RuleBasedSingleItemScraper.build(samples)
     27 
     28 print("Scraping new question")

4 frames
/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in build(samples)
     89                     matches_per_page_right = [
     90                         len(m) == 1 and m[0].get_text() == s.item[attr]
---> 91                         for m, s in zip(matches_per_page, samples)
     92                     ]
     93                     score = sum(matches_per_page_right) / len(samples)

/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <listcomp>(.0)
     88                     matches_per_page = (s.page.select(selector) for s in samples)
     89                     matches_per_page_right = [
---> 90                         len(m) == 1 and m[0].get_text() == s.item[attr]
     91                         for m, s in zip(matches_per_page, samples)
     92                     ]

/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <genexpr>(.0)
     86                 if selector not in selector_scoring:
     87                     logging.info("testing %s (%d/%d)", selector, i, len(selectors))
---> 88                     matches_per_page = (s.page.select(selector) for s in samples)
     89                     matches_per_page_right = [
     90                         len(m) == 1 and m[0].get_text() == s.item[attr]

/usr/local/lib/python3.7/dist-packages/mlscraper/parser.py in select(self, css_selector)
     28     def select(self, css_selector):
     29         try:
---> 30             return [SoupNode(res) for res in self._soup.select(css_selector)]
     31         except NotImplementedError:
     32             logging.warning(

/usr/local/lib/python3.7/dist-packages/bs4/element.py in select(self, selector, _candidate_generator, limit)
   1495                 if tag_name == '':
   1496                     raise ValueError(
-> 1497                         "A pseudo-class must be prefixed with a tag name.")
   1498                 pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
   1499                 found = []

ValueError: A pseudo-class must be prefixed with a tag name.

missing mlscraper.html

Followed the README and was testing the code after pip install --pre mlscraper, but got a ModuleNotFoundError:

from mlscraper.html import Page
ModuleNotFoundError: No module named 'mlscraper.html'

Checking the installed library, only the following modules were present: ml.py, parser.py, training.py, util.py.

For people checking out the library, it would be convenient if the README listed all required imports:
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

High memory usage with github page as sample


import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'https://github.com/lorey/mlscraper/issues/38'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for the issue title
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'title': 'Scraper not found error'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('https://github.com/lorey/mlscraper/issues/27')
result = scraper.get(Page(resp.content))
print(result)

Feedback

Gave this a try :-)

Feedback:

  • If this library works as advertised it'd be huge!
  • mlscraper.html is missing from the PyPI package.
  • When no scraper can be found, the error message could be more helpful:
    mlscraper.training.NoScraperFoundException: did not find scraper
    Would be nice if the error message gave some guidance as to what fields
    couldn't be found in the HTML.
    Even with DEBUG log level it's not really helpful.
  • See more notes in my script below.
  • Training the script was really slow (I gave up after 15 min).

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()

page = Page(resp.content)
sample = Sample(
    page,
    {
        "name": "Jonas Haag",
        "followers": "329",  # Note that this doesn't work if 329 passed as an int.
        #'company': '@QuantCo',  # Does not work.
        "twitter": "@_jonashaag",  # Does not work without the "@".
        "username": "jonashaag",
        "nrepos": "282",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

resp = requests.get("https://github.com/lorey")
result = scraper.get(Page(resp.content))
print(result)

Scraper not found error

I am trying to scrape the Case Number from the HTML file attached below.

Versions:

mlscraper: pip install --pre mlscraper
python: 3.9

Code:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
HTMLFile = open("/content/PUS06037-BC2116122017-06-12 14_07_29.976088.txt", "r")

# read the file
index = HTMLFile.read()

# create a training sample
training_set = TrainingSet()
index = index.replace(u'\xa0', u' ')
page = Page(index)
sample = Sample(page, {'Filing Date:': '06/08/1999', 'Case Number:': 'BC211612'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

I am getting the following message:

NoScraperFoundException Traceback (most recent call last)
in <cell line: 20>()
18
19 # train the scraper with the created training set
---> 20 scraper = train_scraper(training_set)
21
22 # scrape another page

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in train_scraper(training_set, complexity)
72 f"({complexity=}, {match_combination=})"
73 )
---> 74 raise NoScraperFoundException("did not find scraper")
75
76

NoScraperFoundException: did not find scraper

PUS06037-BC2116122017-06-12 14_07_29.txt

Separate rule extraction from scrapers

Rule extraction can be separated from scrapers.

Rule extraction:

  • get html/DOM
  • generate and test rules

Scraper:

  • fetch the page (fetch if URL, parse if HTML, pass through if DOM); see the sketch below
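
A hedged sketch of the scraper side of this split; all names are hypothetical, and BeautifulSoup stands in for the DOM layer:

from bs4 import BeautifulSoup

class RuleScraper:
    """Applies previously extracted rules; knows nothing about training."""

    def __init__(self, rules):
        self.rules = rules  # mapping of field name -> css selector

    def scrape(self, html):
        soup = BeautifulSoup(html, "lxml")
        return {field: soup.select_one(selector).get_text(strip=True)
                for field, selector in self.rules.items()}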

Include fetching in scrapers

Scrapers should be able to deal with URLs, HTML, and parsed DOMs (even requests response objects?) to enable flexible library usage; see the sketch below the list.

  • scrapers input HTML, DOM, and responses, e.g. via scrape_soup, scrape_html, scrape_url
  • examples can be created via static methods, e.g. via SingleItemPageSample.from_soup, etc.
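
A minimal sketch of such a wrapper; the scrape_url/scrape_html method names come from this issue, while the class itself is hypothetical:

import requests
from mlscraper.html import Page

class FetchingScraper:
    def __init__(self, scraper):
        self.scraper = scraper  # a trained mlscraper scraper

    def scrape_url(self, url: str):
        resp = requests.get(url)
        resp.raise_for_status()
        return self.scrape_html(resp.content)

    def scrape_html(self, html):
        return self.scraper.get(Page(html))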
