ljanyst / scrapy-do
A daemon for scheduling Scrapy spiders
Home Page: https://jany.st/scrapy-do.html
License: BSD 3-Clause "New" or "Revised" License
Hi @ljanyst,
First of all, thanks for creating this awesome daemon. It has some great scheduling capabilities that the official scrapyd is missing.
Recently I started using scrapy-do and found out that spiders running under the scrapy-do daemon don't have access to OS environment variables.
My use case is this: I have a Scrapy project with a Scrapy extension that manages the DB connection. The DB credentials for that extension are provided by the Scrapy settings module, which in turn reads them from environment variables.
This setup works fine when I run Scrapy directly with the scrapy crawl ... command, but it fails when running under the scrapy-do daemon.
I saw issue #19 and the corresponding PR #20. I don't agree with the statement that "The server environment really should not spill to the spiders". There are valid use cases for it; mine is one example. In fact, scrapyd also supports passing OS environment variables to spiders. See:
https://github.com/scrapy/scrapyd/blob/master/scrapyd/environ.py#L12
https://github.com/scrapy/scrapyd/blob/master/scrapyd/environ.py#L25
https://github.com/scrapy/scrapyd/blob/master/scrapyd/launcher.py#L41
So can we have this feature anytime soon? Optionally, it could sit behind a feature flag to prevent accidental exposure of OS environment variables to spiders.
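For illustration, here is a minimal sketch of what such an opt-in could look like. This is not scrapy-do's actual code, and the flag name is made up; it only shows the idea of an explicit allow-list:

import os
import subprocess

# Hypothetical opt-in flag: only these variables are exposed to spiders.
ALLOWED_ENV_VARS = ['DB_HOST', 'DB_USER', 'DB_PASSWORD']

def run_spider(project_dir, spider_name):
    # Start from a near-empty environment and copy only the allowed
    # variables over, so nothing else spills from the server.
    env = {'PATH': os.environ.get('PATH', '')}
    for name in ALLOWED_ENV_VARS:
        if name in os.environ:
            env[name] = os.environ[name]
    return subprocess.Popen(['scrapy', 'crawl', spider_name],
                            cwd=project_dir, env=env)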
I get this error:
Arguments contain a non-string value
when I try to publish a new spider (project), both when using the CLI and when using the web interface. Does anyone know why this happens and how I can fix it?
Hi there, I'm using scrapy-do in a Docker container and was asking myself about contributing the Docker files plus an automatic Docker image build and push for this repository (triggered on git tag creation).
What do you think about this idea?
Wondering if it's a bug or not that the payload is passed to the spider's kwargs literally, under the key payload. What I expected is a dict of arguments, as Scrapy documents: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Payload:
{
"start_url": "https://books.toscrape.com/"
}
Spider:
import scrapy

class BooksSpider(scrapy.Spider):  # class wrapper and name are illustrative
    name = 'books'

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print("args")
        print(args)
        print("kwargs")
        print(kwargs)
        if not start_url:
            raise ValueError("argument string 'start_url' must be provided!")
        self.start_urls = [start_url]
Output:
args
()
kwargs
{'payload': '{"start_url": "https://books.toscrape.com/"}'}
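Until this behaviour changes, a workaround is to decode the JSON string inside the spider itself. This is only a sketch; the class and spider names are illustrative:

import json
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'  # illustrative

    def __init__(self, payload=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapy-do currently passes the whole payload as one JSON string
        # under the 'payload' keyword argument, so decode it here.
        params = json.loads(payload) if payload else {}
        start_url = params.get('start_url')
        if not start_url:
            raise ValueError("argument string 'start_url' must be provided!")
        self.start_urls = [start_url]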
When the job runs, the spider can't get the OS environment variables.
I'm using os.getenv('COLLECTION_NAME', None) in settings.py, and it always returns None.
Could you please add the OS environment variables to run_process?
Thanks.
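As a stopgap until the daemon forwards its environment, one option is to load the variables from a file in settings.py, for example with the python-dotenv package (an extra dependency, not part of scrapy-do):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from a .env file into os.environ
COLLECTION_NAME = os.getenv('COLLECTION_NAME', None)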
I get the following message when opening http://127.0.0.1:7654 in the web browser:
"The UI files have not been built."
I followed the described instructions. Does anyone know what I missed or did wrong?
scrapy-do-cl remove-project --project somename
error: unrecognized arguments: --project somename
Is there any way to use port 80 and a subdirectory URL for the web interface? Heroku can't use anything other than port 80.
Great tool -- thank you.
Hello, is it possible to schedule a script that contains a CrawlerProcess in order to run multiple spiders?
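For reference, a standalone multi-spider script of the kind being asked about looks like this in plain Scrapy (the project and spider names are placeholders); the open question is whether scrapy-do can schedule such a script rather than a single spider:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.first import FirstSpider    # placeholder spiders
from myproject.spiders.second import SecondSpider

process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.start()  # blocks until both crawls have finished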
This one looks nice.
Hi, would it be possible to add some documentation on how to use the payload? How is it passed to the spider, etc.?
It would be useful to describe how to do it both via the API and via the UI.
schedule-job.json
thank you
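For what it's worth, here is a guess at what scheduling with a payload might look like over the HTTP API. The endpoint name comes from the attachment above, and the field names ('project', 'spider', 'when', 'payload') are assumptions based on this issue tracker, not verified documentation:

import json
import requests

response = requests.post(
    'http://localhost:7654/schedule-job.json',
    data={
        'project': 'my-project',       # placeholder values
        'spider': 'books',
        'when': 'every 25 minutes',
        'payload': json.dumps({'start_url': 'https://books.toscrape.com/'}),
    })
print(response.json())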
[!] Server responded with an error: Arguments contain a non-string
and the "can't list projects" error are misleading.
The cause turned out to be the Scrapy project itself not working.
After the Scrapy spider was fixed, I was able to push the project OK.
{"status": "error", "msg": "'str' object has no attribute 'spiders'"}
This issue was observed in v0.4.0.
Nothing is shown when trying to view a large log file from the Completed Jobs page, and the Chrome inspector shows the error "Maximum call stack size exceeded" on the link.
Stack Overflow has suggestions on how to handle this error:
https://stackoverflow.com/questions/49123222/converting-array-buffer-to-string-maximum-call-stack-size-exceeded
I hope this issue can be fixed soon. For now I have reverted to v0.3.2.
Only allow pushing a project if all the scheduled jobs remain valid afterwards. This means that if a spider was removed in a new version, there cannot be a scheduled job still referencing that spider.
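A sketch of the proposed check, with hypothetical helper and attribute names, could look like this:

def validate_push(project_name, new_spiders, scheduled_jobs):
    # Reject the push if any scheduled job references a spider that is
    # missing from the new version of the project.
    stale = [job for job in scheduled_jobs
             if job.project == project_name and job.spider not in new_spiders]
    if stale:
        names = ', '.join(sorted({job.spider for job in stale}))
        raise ValueError('scheduled jobs still reference removed spiders: '
                         + names)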
I installed the package using pip install scrapy-do
When I try to run the command scrapy-do -n scrapy-do,
I get the error:
scrapy-do is not recognized as an internal or external command,
operable program or batch file.
The documentation explains how to zip and push a project.
How do you package and upload a project that relies on external dependencies?
E.g., assuming libs X, Y, and Z are not part of Scrapy:
when running push-project on a project that uses those external dependencies, the upload fails.
scrapy-do-cl --url http://ip:7654 push-project --project-path /path/to/my-project
The above command returns:
[!] Server responded with an error: Unable to get the list of spiders
It does not give much detail as to why it could not list the spiders. It took a bit of debugging to understand that the cause was the missing libs, since scrapy list worked locally.
When restarting scrapy-do, all active (scheduled) jobs throw an error, since the payload is not passed to the spider again:
AttributeError: 'Spider' object has no attribute 'payload'
I cannot find this behaviour documented anywhere.
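A defensive workaround (sketch only) is to treat the payload as optional inside the spider, so that jobs re-run after a daemon restart do not crash:

import json

def get_payload(spider, default=None):
    # getattr avoids the AttributeError when scrapy-do did not set
    # 'payload' on this run; fall back to a default instead.
    raw = getattr(spider, 'payload', None)
    return json.loads(raw) if raw else (default or {})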
Hi, thanks for creating scrapy-do.
I tried to create a job for running a spider, but scrapy-do hit an error in /scrapy_do/webservice.py at line 256.
I suppose this is why my jobs don't work and exit with code 1 seconds after being created.
Also, I can't get logs either on the command line or in the web interface, but I don't know if there is something I need to configure.
Thank you for this framework/API.