ljanyst / scrapy-do
A daemon for scheduling Scrapy spiders
Home Page: https://jany.st/scrapy-do.html
License: BSD 3-Clause "New" or "Revised" License
Hi @ljanyst,
First of all, thanks for creating this awesome daemon. It has some great scheduling capabilities that the official scrapyd is missing.
Recently I started using scrapy-do and found out that spiders running under the scrapy-do daemon don't have access to OS environment variables.
My use case is this: I have a Scrapy project with a Scrapy extension that manages the DB connection. The DB credentials for that extension are provided by the Scrapy settings module, which in turn reads them from environment variables.
This setup works fine when I run Scrapy directly with the scrapy crawl ... command, but it fails when running under the scrapy-do daemon.
I saw issue #19 and the corresponding PR #20. I don't agree with the statement that "The server environment really should not spill to the spiders". There are valid use cases for it; mine is one example. In fact, scrapyd also supports passing OS environment variables to spiders. See:
https://github.com/scrapy/scrapyd/blob/master/scrapyd/environ.py#L12
https://github.com/scrapy/scrapyd/blob/master/scrapyd/environ.py#L25
https://github.com/scrapy/scrapyd/blob/master/scrapyd/launcher.py#L41
So can we have this feature anytime soon? Optionally, it could sit behind a feature flag to prevent accidental exposure of OS environment variables to spiders.
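For illustration, here is a minimal sketch of what such an opt-in could look like. This is not scrapy-do's actual code, and the flag name is made up; it only shows the idea of an explicit allow-list:

import os
import subprocess

# Hypothetical opt-in flag: only these variables are exposed to spiders.
ALLOWED_ENV_VARS = ['DB_HOST', 'DB_USER', 'DB_PASSWORD']

def run_spider(project_dir, spider_name):
    # Start from a near-empty environment and copy only the allowed
    # variables over, so nothing else spills from the server.
    env = {'PATH': os.environ.get('PATH', '')}
    for name in ALLOWED_ENV_VARS:
        if name in os.environ:
            env[name] = os.environ[name]
    return subprocess.Popen(['scrapy', 'crawl', spider_name],
                            cwd=project_dir, env=env)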
I get this error:
Arguments contain a non-string value
when I try to publish a new spider (project), both when using the CLI and when using the web interface. Does anyone know why this happens and how I can fix it?
Hi there, I'm using scrapy-do in a Docker container and was asking myself about contributing the Docker files plus an automatic Docker image build and push for this repository (triggered on git tag creation).
What do you think about this idea?
Wondering if it's a bug or not that the payload is passed to the spider's kwargs literally, under the key payload. What I expected is a dict of arguments, as Scrapy documents: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Payload:
{
"start_url": "https://books.toscrape.com/"
}
Spider:
import scrapy

class BooksSpider(scrapy.Spider):  # class wrapper and name are illustrative
    name = 'books'

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print("args")
        print(args)
        print("kwargs")
        print(kwargs)
        if not start_url:
            raise ValueError("argument string 'start_url' must be provided!")
        self.start_urls = [start_url]
Output:
args
()
kwargs
{'payload': '{"start_url": "https://books.toscrape.com/"}'}
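Until this behaviour changes, a workaround is to decode the JSON string inside the spider itself. This is only a sketch; the class and spider names are illustrative:

import json
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'  # illustrative

    def __init__(self, payload=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapy-do currently passes the whole payload as one JSON string
        # under the 'payload' keyword argument, so decode it here.
        params = json.loads(payload) if payload else {}
        start_url = params.get('start_url')
        if not start_url:
            raise ValueError("argument string 'start_url' must be provided!")
        self.start_urls = [start_url]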
When the job runs, the spider can't get the OS environment variables.
I'm using os.getenv('COLLECTION_NAME', None) in settings.py, and it always returns None.
Could you please add the OS environment variables to run_process?
Thanks.
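As a stopgap until the daemon forwards its environment, one option is to load the variables from a file in settings.py, for example with the python-dotenv package (an extra dependency, not part of scrapy-do):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from a .env file into os.environ
COLLECTION_NAME = os.getenv('COLLECTION_NAME', None)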
I get the following message when opening http://127.0.0.1:7654 in the web browser:
"The UI files have not been built."
I followed the described instructions. Does anyone know what I missed or did wrong?
scrapy-do-cl remove-project --project somename
error: unrecognized arguments: --project somename
Is there any way to use port 80 and a subdirectory URL for the web interface? Heroku can't use anything other than port 80.
Great tool -- thank you.
Hello, is it possible to schedule a script that contains a CrawlerProcess in order to run multiple spiders?
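For reference, a standalone multi-spider script of the kind being asked about looks like this in plain Scrapy (the project and spider names are placeholders); the open question is whether scrapy-do can schedule such a script rather than a single spider:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.first import FirstSpider    # placeholder spiders
from myproject.spiders.second import SecondSpider

process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.start()  # blocks until both crawls have finished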
This one looks nice.
Hi, would it be possible to add some documentation on how to use the payload? How is it passed to the spider, etc.?
It would be useful to describe how to do it both via the API and via the UI.
schedule-job.json
thank you
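For what it's worth, here is a guess at what scheduling with a payload might look like over the HTTP API. The endpoint name comes from the attachment above, and the field names ('project', 'spider', 'when', 'payload') are assumptions based on this issue tracker, not verified documentation:

import json
import requests

response = requests.post(
    'http://localhost:7654/schedule-job.json',
    data={
        'project': 'my-project',       # placeholder values
        'spider': 'books',
        'when': 'every 25 minutes',
        'payload': json.dumps({'start_url': 'https://books.toscrape.com/'}),
    })
print(response.json())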
[!] Server responded with an error: Arguments contain a non-string
and the "can't list projects" error are misleading.
The cause turned out to be the Scrapy project itself not working.
After the Scrapy spider was fixed, I was able to push the project OK.
{"status": "error", "msg": "'str' object has no attribute 'spiders'"}
This issue was observed in v0.4.0.
Nothing is shown when trying to view a large log file from the Completed Jobs page, and the Chrome inspector shows the error "Maximum call stack size exceeded" on the link.
Stack Overflow has suggestions on how to handle this error:
https://stackoverflow.com/questions/49123222/converting-array-buffer-to-string-maximum-call-stack-size-exceeded
I hope this issue can be fixed soon. For now I have reverted to v0.3.2.
Only allow pushing a project if all the scheduled jobs remain valid afterwards. This means that if a spider was removed in a new version, there cannot be a scheduled job still referencing that spider.
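A sketch of the proposed check, with hypothetical helper and attribute names, could look like this:

def validate_push(project_name, new_spiders, scheduled_jobs):
    # Reject the push if any scheduled job references a spider that is
    # missing from the new version of the project.
    stale = [job for job in scheduled_jobs
             if job.project == project_name and job.spider not in new_spiders]
    if stale:
        names = ', '.join(sorted({job.spider for job in stale}))
        raise ValueError('scheduled jobs still reference removed spiders: '
                         + names)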
I installed the package using pip install scrapy-do
When I try to run the command scrapy-do -n scrapy-do,
I get the error:
scrapy-do is not recognized as an internal or external command,
operable program or batch file.
The documentation explains how to zip and push a project.
How do you package and upload a project that relies on external dependencies?
E.g., assuming libs X, Y, and Z are not part of Scrapy:
when running push-project on a project that uses those external dependencies, the upload fails.
scrapy-do-cl --url http://ip:7654 push-project --project-path /path/to/my-project
The above command returns:
[!] Server responded with an error: Unable to get the list of spiders
It does not give much detail as to why it could not list the spiders. It took a bit of debugging to understand that the cause was the missing libs, since scrapy list worked locally.
When restarting scrapy-do, all active (scheduled) jobs throw an error, since the payload is not passed to the spider again:
AttributeError: 'Spider' object has no attribute 'payload'
I cannot find this behaviour documented anywhere.
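A defensive workaround (sketch only) is to treat the payload as optional inside the spider, so that jobs re-run after a daemon restart do not crash:

import json

def get_payload(spider, default=None):
    # getattr avoids the AttributeError when scrapy-do did not set
    # 'payload' on this run; fall back to a default instead.
    raw = getattr(spider, 'payload', None)
    return json.loads(raw) if raw else (default or {})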
Hi, thanks for creating scrapy-do.
I tried to create a job for running a spider, but scrapy-do hit an error in /scrapy_do/webservice.py at line 256.
I suppose this is why my jobs don't work and exit with code 1 seconds after being created.
Also, I can't get logs either on the command line or in the web interface, but I don't know if there is something I need to configure.
Thank you for this framework/API.