Code Monkey home page Code Monkey logo

web-scraper's Issues

br tag

Dear Mdibaiee!

I have some issues that I can't fix.

Now my tags are looking like this:

tags = ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'li', 'span', 'a', 'img', 'br']

I have added br tag to it.
When the scraper runs this way, it find's all the br tag which is not inside for example in a p tag.
But when br tag is inside a p tag it won't find the text.

gitpic

I the case of what is shown on the pic I can't get any of the text inside br.
Are there any chance that You have an easy workaround for this?

Thank You!
Csemid

fails on non working urls

Dear Mahdi Dibaiee,

Please help me with your program. I really like all of it and would be happy to use it, but I get an error if it tries to process a non working url. I guess there is some version problem. Could you please take a look on my error code?

python3 index.py test_websites --depth 0 --no-image
https://notarealsite61681.com
Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 844, in _validate_conn
conn.connect()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 284, in connect
conn = self._new_conn()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "index.py", line 94, in
scrape(main)
File "index.py", line 48, in scrape
html = requests.get(t).text
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 487, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Thank you for your help!
Gergo

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.