jairovadillo / pychromeless

Python Lambda Chrome Automation (naming pending)

Home Page: https://medium.com/21buttons-tech/crawling-thousands-of-products-using-aws-lambda-80332e259de1

License: Apache License 2.0

Makefile 21.18% Python 75.43% Dockerfile 3.39%

python selenium aws-lambda chromium chrome crawler automation

pychromeless's Introduction

PyChromeless

Python (selenium) Lambda Chromium Automation

PyChromeless lets you automate actions on any web page from AWS Lambda. The aim of this project is to provide the scaffolding for future robot implementations.

But... how?

The whole process is explained here. The technologies used are:

Python 3.6
Selenium
Chrome driver
Small chromium binary

Requirements

Install Docker and the dependencies:

make fetch-dependencies
Installing Docker
Installing Docker Compose

Working locally

To make local development easy, use the included docker-compose setup. Have a look at the example in lambda_function.py: it looks up “21 buttons” on Google and prints the first result.

Run it with: make docker-run

Downloading files

If your goal is to use Selenium to download files rather than just scrape content from web pages, you need to specify a download_location when initializing the WebDriverWrapper. The download location should be a writable Lambda directory such as /tmp. For example, the first line of lambda_handler would become

driver = WebDriverWrapper(download_location='/tmp')

Files will then download automatically into download_location without a confirmation dialog. Downloads happen asynchronously, so you may need to make the handler sleep until the file has finished downloading.
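
Instead of a fixed sleep, the handler could poll the download directory. The following is a minimal sketch, not part of PyChromeless: the helper name wait_for_download is hypothetical, and it assumes that Chrome marks in-progress downloads with a .crdownload suffix and that the directory is dedicated to downloads (for example a subdirectory of /tmp rather than /tmp itself).

```python
import os
import time


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Return the finished files in `directory`, or raise TimeoutError.

    Assumption: in-progress Chrome downloads are named `*.crdownload`,
    and nothing else writes into `directory`.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        in_progress = [n for n in names if n.endswith(".crdownload")]
        finished = [n for n in names if not n.endswith(".crdownload")]
        # Done once at least one file exists and nothing is still downloading.
        if finished and not in_progress:
            return sorted(os.path.join(directory, n) for n in finished)
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no completed download in %s after %ss" % (directory, timeout)
            )
        time.sleep(poll_interval)
```

Keep the timeout below the Lambda function timeout so the handler can still report the failure itself.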

To download a file from a link that opens in a new tab (i.e. target='_blank'), call enable_download_in_headless_chrome in your scraping script after navigating to the desired page but before clicking to download. This replaces every target='_blank' with target='_self'. For example:

# Navigate to download page
driver._driver.find_element_by_xpath('//a[@href="/downloads/"]').click()
# Enable headless chrome file download
driver.enable_download_in_headless_chrome()
# Click the download link
driver._driver.find_element_by_class_name("btn").click()

Building and uploading the distributable package

Everything is wrapped in a simple Makefile, so:

make build-lambda-package
Upload the resulting build.zip file to your AWS Lambda function
Set the Lambda environment variables (same values as in docker-compose.yml)
- PYTHONPATH=/var/task/src:/var/task/lib
- PATH=/var/task/bin
Adjust the Lambda function parameters to your needs; for the given example:
- Timeout: 10 seconds or more
- Memory: 250 MB or more
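
The same settings can also be applied from a script rather than the console. This is a sketch under assumptions, not something the repository ships: the helper names are hypothetical, the timeout of 10 and memory size of 256 are example values that meet the minimums above, and it relies on boto3's update_function_configuration call, which as far as I know replaces the function's existing environment variables rather than merging them.

```python
def build_lambda_config(function_name, timeout=10, memory_mb=256):
    """Build keyword arguments for boto3's update_function_configuration."""
    return {
        "FunctionName": function_name,
        "Timeout": timeout,        # seconds (assumed example value)
        "MemorySize": memory_mb,   # MB (assumed example value)
        "Environment": {
            "Variables": {
                # Same values as listed in the steps above.
                "PYTHONPATH": "/var/task/src:/var/task/lib",
                "PATH": "/var/task/bin",
            }
        },
    }


def apply_lambda_config(function_name):
    """Send the configuration to AWS. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_lambda_config works without boto3

    client = boto3.client("lambda")
    return client.update_function_configuration(
        **build_lambda_config(function_name)
    )
```

Calling apply_lambda_config("my-crawler") would update a function with that (hypothetical) name.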

Shouts to

Contributors

Jairo Vadillo (@jairovadillo)
Pere Giro ()
Ricard Falcó (@ricardfp)

pychromeless's Issues

Question: geolocation

Have you, in your masterful experience, used html5 geolocation?

When I navigate to https://html5demos.com/geo and take a screenshot, the page displays "failed".

I did some digging and tried to allow geolocation access for that specific site, which did not work:

        prefs = {
            'profile.content_settings.geolocation': {
                "https://html5demos.com:443,https://html5demos.com:443": {
                    "last_modified": "13176769719334225",
                    "setting": 1
                }
            },
        }
        chrome_options.add_experimental_option("prefs", prefs)

I've also seen various Stack Overflow answers mentioning these settings:

'profile.managed_default_content_settings.geolocation': 1,
'profile.default_content_setting_values.geolocation': 1,
'profile.default_content_settings.geolocation': 1,

Any insight is appreciated. I'll close the issue shortly regardless -- since it isn't a real issue.

Thanks for your time.

Issues with Python 3.8

Hi - if I set up pychromeless on a Lambda function with the Python 3.6 or 3.7 runtime, it works fine. On 3.8, however, it errors out with a message saying that Chrome exited with status 127.

I do not need a fix for it, but I wanted to document the issue in case other people run into it, since Python 3.8 is the default for new Lambda functions.

chromedriver unexpectedly exited

Hi there.

Thanks for creating this. It's just what I needed for a scraper I'm making.

When you deployed, did you ever run into these errors, even though everything works locally with Docker?

~~selenium.common.exceptions.WebDriverException: Message: Service chromedriver unexpectedly exited. Status code was: 127~~

selenium.common.exceptions.WebDriverException: Message: chrome not reachable

selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable may have wrong permissions.

If so, how did you solve it?

I saw that you specified the path of the Chrome binary, but not the chromedriver. Does this need to be added as well?

Setting up debugging in PyCharm

Hello there!

First of all, thanks for this amazing project!

I was wondering if you had any instructions for setting up debugging with PyCharm?

It has been really helpful when working with Selenium locally. I am now trying to set it up with the Docker environment, but I have not succeeded yet.

Has anyone done this before? It would be really helpful!

Thank you so much!

lambda_function.py example times out

Running this vanilla on OS X 10.13.6, the example script can't find the input box on google.es and fails with AttributeError: 'module' object has no attribute 'find_element_by_xpath'. The same happens with various URLs and patterns. I thought it might be a DNS/port issue with Docker on Mac, but changing that didn't solve it either.

Couldn't connect to Docker at http+docker://localunixsocket

When I run it locally, it prints this error:

Building lambda
ERROR: Couldn't connect to Docker daemon at http+docker://localunixsocket - is it running?

If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.
Makefile:20: recipe for target 'docker-build' failed
make: *** [docker-build] Error 1

What is the proper way to handle this error?

yaml.parser.ParserError

ERROR: yaml.parser.ParserError: expected '', but found ''
in "./docker-compose.yml", line 3, column 1

What should I do?

Running pychromeless locally on Docker

Me again.

For those of us who are unfamiliar with Docker, can you please tell us how to test our app locally in it, or how to set up environment variables? I followed the Docker tutorial, but it's not clear.

For example I keep getting these errors:

When I run:
docker-compose run lambda

I get:
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.

Or when I run:
docker run --rm -v "$PWD":/var/task lambci/lambda:python3.6 lambda_function.lambda_handler

I get:
Unable to import module 'lambda_function': No module named 'selenium'

Thanks for any additional help you can spare.

Question: Downloading files addition

Hi there. First off, this utility is amazing! Like the ultimate prepackaged webscraping environment. So thanks for making it public!

I noticed while using the package that headless chromium currently has some issues downloading files, but there are workarounds online. I have added these to my own fork, but wanted to ask if you would be interested in a PR to add these workarounds to this repo?

Maybe something like a download flag in the WebDriverWrapper that sets the preferences for downloading and adds the headless workaround?

Problem with insecure https

Is anyone else having problems with insecure HTTPS sites?
When I try to connect to an insecure HTTPS site, all I get back is:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

How to send Event Data

To mimic a Lambda function, how do I send event data to the container for local development?

Thanks,
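
One way to exercise a handler with event data, independent of the container, is to call it directly with a parsed JSON event. This is only a sketch: the handler body, the "query" key, and the invoke_locally helper are hypothetical and not taken from the repository, whose example handler is declared as lambda_handler(*args, **kwargs) and would receive the event as its first positional argument.

```python
import json


def lambda_handler(event, context=None):
    # Hypothetical handler: reads a "query" key from the event, with a default.
    query = event.get("query", "21 buttons")
    return {"query": query}


def invoke_locally(handler, event_json):
    """Parse a JSON string into an event dict and call the handler with it."""
    event = json.loads(event_json)
    # Lambda passes (event, context); no real context object exists locally.
    return handler(event, None)


if __name__ == "__main__":
    print(invoke_locally(lambda_handler, '{"query": "pychromeless"}'))
```

The lambci/lambda images used in the docker run command elsewhere on this page accept the event as a JSON string after the handler name; whether the repository's compose service forwards an extra argument the same way is an assumption worth checking.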

Problem with running JavaScript and HTML DOM rendering...?

Description

First of all, I wanted to thank you guys for sharing this great repo - the deployed example function works as expected. However, when I use basically the same code, I seem to encounter issues with HTML DOM rendering. Well, at least to my understanding. Could this be related to the driver options you proposed? The following function works fine with a non-headless desktop Chrome driver but fails when deployed and using the headless mode.

Steps to reproduce

Update the lambda function:

def lambda_handler(*args, **kwargs):
    driver = WebDriverWrapper()

    url = r'https://www.idealo.de/preisvergleich/OffersOfProduct/3214257_-cobra-paket-3-tlg-00-20-09-v02-knipex.html'
    driver.get_url(url)

    driver.click('//button[@class="productOffers-listLoadMore button-ghost button-ghost--blue"]')

    print("--------------------------")
    print("Success!")
    print("--------------------------")

    driver.close()

Deploy or run the function inside a docker container.

Observed result

Once I deploy and run the function above I get the infamous selenium.common.exceptions.NoSuchElementException error:

Message: no such element: Unable to locate element: {"method":"xpath","selector":"//button[@class="productOffers-listLoadMore button-ghost button-ghost--blue"]"}
  (Session info: headless chrome=62.0.3202.94)
  (Driver info: chromedriver=2.32.498513 (2c63aa53b2c658de596ed550eb5267ec5967b351),platform=Linux 4.14.138-99.102.amzn2.x86_64 x86_64)
: NoSuchElementException
Traceback (most recent call last):
  File "/var/task/src/lambda_function.py", line 13, in lambda_handler
    driver.click('//button[@class="productOffers-listLoadMore button-ghost button-ghost--blue"]')
  File "/var/task/src/webdriver_wrapper.py", line 68, in click
    elem_click = self._driver.find_element_by_xpath(xpath)
  File "/var/task/lib/selenium/webdriver/remote/webdriver.py", line 290, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "/var/task/lib/selenium/webdriver/remote/webdriver.py", line 744, in find_element
    {'using': by, 'value': value})['value']
  File "/var/task/lib/selenium/webdriver/remote/webdriver.py", line 233, in execute
    self.error_handler.check_response(response)
  File "/var/task/lib/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//button[@class="productOffers-listLoadMore button-ghost button-ghost--blue"]"}
  (Session info: headless chrome=62.0.3202.94)
  (Driver info: chromedriver=2.32.498513 (2c63aa53b2c658de596ed550eb5267ec5967b351),platform=Linux 4.14.138-99.102.amzn2.x86_64 x86_64)

Expected result

I expect the "Load more" button element to be found in order to be clicked.

Additional environment details

Lambda runtime Python 3.6
xPath retrieved using Google Chrome (on Ubuntu 18.04 desktop) Version 79.0.3945.130 (Official Build) (64-bit)

Things I have tried so far:

install older Google Chrome browser (62.0.3202.75) and check if the xPath is different somehow after the rendering of the HTML DOM;
give the driver more time to run the JavaScript by setting time.sleep(120);
try passing absolute or relative xPath to driver.click() method;
inspect the page source by adding the following method to the WebDriverWrapper class:

    def get_source_html(self):
        return self._driver.page_source

Interestingly, this get_source_html() method returned unrendered content for both the original and the modified function. Despite this, headless Chrome was still able to find the elements by XPath in the original function, but fails to do so in the modified version. This is what bugs me the most at the moment: the desktop version of the Chrome driver returns nicely rendered HTML when driver.page_source is used, and I can easily verify that the button is present on the page. But since the same method on the headless version returns a bunch of JavaScript (as far as I am aware this can happen with the HTML DOM), I cannot see what the headless Chrome driver is "looking at" or what kind of HTML it is working with. I read the list of available options for the driver, but I could not find an explicit "run JavaScript" option (nor should there be one; as far as I'm concerned, the driver should be doing this automatically).

Any ideas on how to proceed here are very welcome.
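
One way to see what headless Chrome is actually working with is to dump the live DOM and a screenshot to a writable directory. A minimal sketch, assuming a Selenium WebDriver object such as the wrapper's _driver attribute; the helper name dump_rendered_state and the output file names are hypothetical. It uses only the WebDriver methods execute_script and save_screenshot.

```python
import os


def dump_rendered_state(driver, out_dir="/tmp", name="page"):
    """Save the current DOM and a screenshot so the headless view can be inspected.

    `driver` is expected to be a Selenium WebDriver (e.g. wrapper._driver).
    Returns the paths of the HTML file and the PNG file.
    """
    html_path = os.path.join(out_dir, name + ".html")
    png_path = os.path.join(out_dir, name + ".png")
    # outerHTML reflects the DOM at the moment of the call, after any
    # scripts that have already run.
    html = driver.execute_script("return document.documentElement.outerHTML;")
    with open(html_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    driver.save_screenshot(png_path)
    return html_path, png_path
```

Calling it just before the failing driver.click(...) and comparing the output with the desktop browser may show whether the site serves different content to the headless browser.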

Update to Python 3.9

Hi,

Can you update to Python 3.9?

Thanks.

Error Deploying on AWS Lambda

When I deploy on AWS Lambda and test it, I get this error:

{ "errorMessage": "Unable to import module 'lambda_function': No module named 'lambda_function'", "errorType": "Runtime.ImportModuleError" }

Please let me know what I'm doing wrong. Thank you.

Boto3

Overwriting PYTHONPATH loses LAMBDA_RUNTIME_DIR (where packages such as boto3 are).

Any hints on how to get around this?

Thanks!
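
One possible workaround is to put the runtime directory back on the import path at the top of the handler module. A sketch under assumptions: the helper name restore_runtime_path is hypothetical, and it relies on the Lambda runtime exposing its own directory through the LAMBDA_RUNTIME_DIR environment variable, as the issue describes.

```python
import os
import sys


def restore_runtime_path(environ=os.environ, path=sys.path):
    """Append LAMBDA_RUNTIME_DIR to the import path if it is set and missing.

    Returns True if the path was changed, False otherwise.
    """
    runtime_dir = environ.get("LAMBDA_RUNTIME_DIR")
    if runtime_dir and runtime_dir not in path:
        # Appending (not prepending) keeps the packaged libraries first.
        path.append(runtime_dir)
        return True
    return False


# Call this before `import boto3` in the handler module.
restore_runtime_path()
```

An alternative with the same effect would be to add the runtime directory to the PYTHONPATH value itself instead of overwriting it.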

"exec: \"src.lambda_function.lambda_handler\": executable file not found in $PATH"

Hi, after I execute "make docker-run" I get the following error:

my_lambda aburkov$ make docker-run
docker-compose run lambda src.lambda_function.lambda_handler
Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"src.lambda_function.lambda_handler\": executable file not found in $PATH": unknown
make: *** [docker-run] Error 1

lambda_function.py with method "lambda_handler" is in the "my_lambda/src" directory. I have "Dockerfile", "Makefile" and "docker-compose.yml" in the "my_lambda" directory. I run "make docker-run" from the "my_lambda" directory.

What should I change to make it work?

WebDriverWrapper Subclass

Is it possible to have WebDriverWrapper subclass webdriver.Chrome so the methods of webdriver.Chrome are available?

Tried doing so myself and can't quite get it right for whatever reason.
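
An alternative to subclassing is delegation: keep the wrapped driver in _driver and forward any attribute the wrapper does not define. The class below is a hypothetical stand-in for illustration, not the repository's WebDriverWrapper; only the _driver attribute name and the get_url method name are taken from the code shown on this page.

```python
class DelegatingWrapper:
    """Stand-in wrapper that forwards unknown attributes to `_driver`."""

    def __init__(self, driver):
        self._driver = driver

    def get_url(self, url):
        # Wrapper-specific method, kept as an explicit method.
        self._driver.get(url)

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so methods defined
        # on the wrapper take precedence over the driver's.
        if name == "_driver":
            # Avoid infinite recursion if `_driver` has not been set yet.
            raise AttributeError(name)
        return getattr(self._driver, name)
```

With this in place, calls such as wrapper.find_element_by_xpath(...) would reach the underlying driver without going through wrapper._driver explicitly.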

not able to install package chromedriver-installer

For some reason the package chromedriver-installer cannot be installed. I used pip search and found another package named chromedriver_installer, and the error is the same.

Running setup.py install for chromedriver-installer: started
Running setup.py install for chromedriver-installer: finished with status 'error'

Exception: Unable to get latest chromedriver version from https://sites.google.com/a/chromium.org/chromedriver/downloads

ERROR: Service 'lambda' failed to build: The command '/bin/sh -c pip3 install -r requirements.txt -t /var/task/lib' returned a non-zero code: 1
Makefile:23: recipe for target 'docker-run' failed
make: *** [docker-run] Error 1

Doesn't work anymore

Lambda fails with status code 127 when run with upgraded Python versions and everything else.
If anyone is struggling, ping here and I'll at least share the updated dependencies that can be used.

ElementNotVisibleException from clean run of repo

I have cloned the repo and run make fetch-dependencies as written. When I go to run make docker-run as written in the readme I get this error:

{"errorType":"ElementNotVisibleException","errorMessage":"Message: element not visible\n (Session info: headless chrome=62.0.3202.94)\n (Driver info: chromedriver=2.32.498513 (2c63aa53b2c658de596ed550eb5267ec5967b351),platform=Linux 5.4.0-52-generic x86_64)\n","stackTrace":[" File "/var/task/src/lambda_function.py", line 15, in lambda_handler\n button.send_keys(Keys.TAB)\n"," File "/var/task/lib/selenium/webdriver/remote/webelement.py", line 322, in send_keys\n self._execute(Command.SEND_KEYS_TO_ELEMENT, {'value': keys_to_typing(value)})\n"," File "/var/task/lib/selenium/webdriver/remote/webelement.py", line 457, in _execute\n return self._parent.execute(command, params)\n"," File "/var/task/lib/selenium/webdriver/remote/webdriver.py", line 233, in execute\n self.error_handler.check_response(response)\n"," File "/var/task/lib/selenium/webdriver/remote/errorhandler.py", line 194, in check_response\n raise exception_class(message, screen, stacktrace)\n"]}
make: *** [Makefile:35: docker-run] Error 1

ERR_CONNECTION_RESET

I get this error when loading a URL: "Message: unknown error: net::ERR_CONNECTION_RESET\n (Session info: headless chrome=86.0.4240.0)\n". The strange thing is that it occurs randomly. Do you have any idea what could be happening? Has it happened to you?

aws lambda error: 'chromedriver' executable may have wrong permissions.

Even after setting the environment variables, I get the error below when I upload the zip file to AWS Lambda and test it.
It works fine on my desktop in Docker.

{
  "errorMessage": "Message: 'chromedriver' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home\n",
  "errorType": "WebDriverException",
  "stackTrace": [
    [
      "/var/task/src/lambda_function.py",
      8,
      "lambda_handler",
      "driver = WebDriverWrapper()"
    ],
    [
      "/var/task/src/webdriver_wrapper.py",
      55,
      "__init__",
      "self._driver = webdriver.Chrome(chrome_options=chrome_options)"
    ],
    [
      "/var/task/lib/selenium/webdriver/chrome/webdriver.py",
      61,
      "__init__",
      "self.service.start()"
    ],
    [
      "/var/task/lib/selenium/webdriver/common/service.py",
      74,
      "start",
      "os.path.basename(self.path), self.start_error_message)"
    ]
  ]
}
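
A permissions error like this can be checked for at build time, before the package is zipped. A minimal sketch, assuming the chromedriver binary sits in the package's bin directory as the PATH setting above suggests; the helper name ensure_executable is hypothetical. Because /var/task is read-only inside Lambda, the fix has to happen before upload, and the archive has to be created in a way that preserves file modes.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission for user, group and others if it is missing.

    Returns True if the mode was changed, False if the file was already
    executable by the current user.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    if os.access(path, os.X_OK):
        return False
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return True
```

For example, ensure_executable("bin/chromedriver") could be run from the project root before building the package.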
