web-scraper's Introduction

web-scraper

A simple script that scrapes a website, extracts its text into a CSV file with the format below, and saves its images.

| Page | Tag | Text | Link | Image |
| --- | --- | --- | --- | --- |
| page path | element tag (h{1,6}, a, p, etc.) | text content | link URL (if any) | image address (if any) |
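
To make the column layout concrete, here is a minimal sketch of writing rows in this format with Python's standard `csv` module. The rows and the in-memory buffer are hypothetical illustrations; the script's actual output file name and quoting options are not stated in this README.

```python
import csv
import io

# Column names taken from the format described above.
FIELDS = ["Page", "Tag", "Text", "Link", "Image"]

# Hypothetical rows, for illustration only.
rows = [
    {"Page": "/", "Tag": "h1", "Text": "Welcome", "Link": "", "Image": ""},
    {"Page": "/about", "Tag": "a", "Text": "Contact us",
     "Link": "https://example.com/contact", "Image": ""},
]

buffer = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Empty strings stand in for the "if any" columns when an element has no link or image.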

Usage

First, install the dependencies (Python 3):

pip install -r requirements

Then create a file containing the URLs of the websites you want to scrape, one per line. For example (I'll call this file test_websites):

https://theread.me
https://theguardian.com

Now you are ready to execute the script:

python index.py test_websites
                # ^ path to your file

After the script has finished, you can find the results in the results/<website_hostname> folder.
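
As a rough sketch of where to look for output, the snippet below derives a results folder from each URL in the input file. It assumes that `<website_hostname>` is the hostname as parsed by `urllib.parse.urlparse`; the helper names `read_urls` and `results_dir` are hypothetical and are not part of index.py.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_urls(text):
    """Return the non-empty lines of a URL list, one URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def results_dir(url, root="results"):
    """Return the folder where results for `url` are assumed to be written."""
    return Path(root) / urlparse(url).hostname


# Same content as the example test_websites file above.
urls = read_urls("https://theread.me\nhttps://theguardian.com\n")
for url in urls:
    print(url, "->", results_dir(url))
```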

To see available options, try python index.py -h.


web-scraper's Issues

Fails on non-working URLs

Dear Mahdi Dibaiee,

Please help me with your program. I really like all of it and would be happy to use it, but I get an error when it tries to process a non-working URL. I guess there is some version problem. Could you please take a look at my error output?

python3 index.py test_websites --depth 0 --no-image
https://notarealsite61681.com
Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 844, in _validate_conn
conn.connect()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 284, in connect
conn = self._new_conn()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "index.py", line 94, in <module>
scrape(main)
File "index.py", line 48, in scrape
html = requests.get(t).text
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 487, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Thank you for your help!
Gergo
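
The traceback above ends in `requests.exceptions.ConnectionError`, raised by the bare `requests.get(t).text` call at line 48 of index.py when the hostname cannot be resolved. One possible workaround, sketched here under assumptions, is to wrap the request so that an unreachable URL is skipped instead of stopping the run. The helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are hypothetical; how the rest of `scrape` should react to a skipped URL is not shown.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page HTML, or None if the URL cannot be fetched.

    `get` defaults to requests.get and is a parameter only so the
    function can be exercised without network access.
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx as failures too
    except requests.exceptions.RequestException as error:
        # RequestException is the base class of ConnectionError,
        # Timeout and HTTPError in the requests library.
        print(f"Skipping {url}: {error}")
        return None
    return response.text
```

A caller could then replace `html = requests.get(t).text` with `html = fetch_html(t)` and move on to the next URL when the result is `None`.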

br tag

Dear Mdibaiee!

I have some issues that I can't fix.

My tags now look like this:

tags = ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'li', 'span', 'a', 'img', 'br']

I have added the br tag to it.
When the scraper runs this way, it finds every br tag that is not nested inside another tag, such as a p tag.
But when a br tag is inside a p tag, it does not find the text.

gitpic

In the case shown in the picture, I can't get any of the text inside br.
Is there any chance that you have an easy workaround for this?

Thank You!
Csemid
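
A note on why this may happen: `br` is a void element, so it never contains text. The text before and after it belongs to the enclosing element, here the `p`. The sketch below shows one way to collect a paragraph's text as separate lines, split at each `br`. It uses the standard library's `html.parser` because the issue does not say which parser index.py uses; the class name `ParagraphLines` is hypothetical.

```python
from html.parser import HTMLParser


class ParagraphLines(HTMLParser):
    """Collect the text of each <p>, split into lines at every <br>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one list of lines per <p>
        self._lines = None    # lines of the <p> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._lines = [""]
        elif tag == "br" and self._lines is not None:
            # <br> carries no text of its own; it only ends the current line.
            self._lines.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._lines is not None:
            lines = [line.strip() for line in self._lines]
            self.paragraphs.append([line for line in lines if line])
            self._lines = None

    def handle_data(self, data):
        # Text nodes inside the open <p> are appended to its current line.
        if self._lines is not None:
            self._lines[-1] += data


parser = ParagraphLines()
parser.feed("<p>first line<br>second line<br/>third line</p>")
parser.close()
print(parser.paragraphs)
```

With this approach the `br` entry would not need to be in the tag list at all: the text is read from the `p` element, and `br` only marks where one line ends and the next begins.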
