
scrapy-do's Issues

Access to OS Environment Variables for Spiders

Hi @ljanyst ,

First of all, thanks for creating this awesome daemon. It has some great scheduling capabilities that the official scrapyd is missing.

Recently I started using scrapy-do and found out that spiders running under the scrapy-do daemon don't have access to OS environment variables.

My use case is as follows: I have a Scrapy project with an extension that manages the DB connection. The DB credentials for that extension are provided by the Scrapy settings module, which in turn reads them from environment variables.
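For reference, the pattern in settings.py looks roughly like this (the extension path and variable names are illustrative, not taken from the actual project):

# settings.py -- credentials come from the environment (illustrative names)
import os

DB_HOST = os.environ.get("DB_HOST", "localhost")
DB_USER = os.environ.get("DB_USER")
DB_PASSWORD = os.environ.get("DB_PASSWORD")

# hypothetical extension that opens the DB connection using the values above
EXTENSIONS = {
    "myproject.extensions.DbConnectionExtension": 500,
}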

This setup works fine when I run Scrapy directly with the scrapy crawl ... command, but it breaks when the spider is run by the scrapy-do daemon.

I saw issue #19 and the corresponding PR #20. I don't agree with the statement that "The server environment really should not spill to the spiders"; there are valid use cases for it, and mine is one example. In fact, scrapyd also supports passing OS environment variables to spiders. See:
https://github.com/scrapy/scrapyd/blob/master/scrapyd/environ.py#L12
https://github.com/scrapy/scrapyd/blob/master/scrapyd/environ.py#L25
https://github.com/scrapy/scrapyd/blob/master/scrapyd/launcher.py#L41

So could we have this feature anytime soon? Optionally, there could be a feature flag to prevent accidental exposure of OS environment variables to spiders.

Docker option?

Hi there, I'm using scrapy-do in a Docker container and was asking myself whether I could contribute the Docker files and an automatic Docker image build and push to this repository (triggered on git tag creation).

What do you think about this idea?

payload is literally provided as payload in the kwargs of spider

I'm wondering whether it's a bug that the payload is passed verbatim to the spider's kwargs under the 'payload' key. I expected a dict of arguments, as documented by Scrapy: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments

Payload:

{
    "start_url": "https://books.toscrape.com/"
}

Spider:

    def __init__(self, start_url=None, *args, **kwargs):
        print("args")
        print(args)
        print("kwargs")
        print(kwargs)
        if not start_url:
            raise ValueError("argument string 'start_url' must be provided!")
        self.start_urls = [start_url]

Output:

args
()
kwargs
{'payload': '{"start_url": "https://books.toscrape.com/"}'}
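Until the payload is unpacked into individual spider arguments, a possible workaround (a sketch, assuming the payload always arrives as a JSON string under the payload keyword) is to parse it in the spider itself:

import json

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"  # illustrative name

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapy-do currently passes the whole payload as a single JSON
        # string under the 'payload' keyword argument, so unpack it here
        if start_url is None and "payload" in kwargs:
            start_url = json.loads(kwargs["payload"]).get("start_url")
        if not start_url:
            raise ValueError("argument 'start_url' must be provided!")
        self.start_urls = [start_url]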

OS environment variables

When the job runs, the spider can't access the OS environment variables.

I'm using os.getenv('COLLECTION_NAME', None) in settings.py, and it always returns None.

Could you please pass the OS environment variables to run_process?

Thanks.
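A minimal sketch of the idea, assuming the job is launched as a child process (the names below are illustrative, not the actual scrapy-do internals, which use Twisted):

import os
import subprocess

def run_spider_process(project_dir, spider_name):
    # Start from a copy of the daemon's environment instead of an empty
    # one, so the spider can read variables such as COLLECTION_NAME.
    env = os.environ.copy()
    return subprocess.Popen(
        ["scrapy", "crawl", spider_name],
        cwd=project_dir,
        env=env,
    )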

how to use payload when scheduling a job?

Hi, would it be possible to add some documentation on how to use the payload? How is it passed to the spider, etc.?
It would be useful to describe how to do it via either the API or the UI.
schedule-job.json
thank you
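In the meantime, here is a sketch of what scheduling a job with a payload over the REST API might look like. The endpoint name follows the schedule-job.json reference above, but the field names (project, spider, when, payload) are assumptions, so please check them against the server documentation:

import json

import requests

response = requests.post(
    "http://localhost:7654/schedule-job.json",
    data={
        "project": "my-project",  # hypothetical project name
        "spider": "books",        # hypothetical spider name
        "when": "now",            # or a schedule specification
        "payload": json.dumps({"start_url": "https://books.toscrape.com/"}),
    },
)
print(response.json())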

Errors

scrapy-do-cl push-project

[!] Server responded with an error: Arguments contain a non-string

This error, like the "can't list projects" error, is misleading.

The cause turned out to be the Scrapy spider itself not working. After the spider was fixed, I was able to push the project fine.

"Maximum call stack size exceeded" error reported by Chrome when viewing large file on Completed Jobs page

This issue is observed in v0.4.0.

Nothing is shown when trying to view a large log file from the Completed Jobs page, and the Chrome inspector shows a "Maximum call stack size exceeded" error on the link.

Stack Overflow has suggestions on how to handle this error:
https://stackoverflow.com/questions/49123222/converting-array-buffer-to-string-maximum-call-stack-size-exceeded

I hope this issue can be fixed soon. For now I have reverted to v0.3.2.

Consistency check for pushing a project

Only allow a project to be pushed if all the scheduled jobs remain valid afterwards. That means that if a spider was removed in the new version, there cannot be a scheduled job still referencing that spider.
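A sketch of what such a check could look like (not the actual scrapy-do code), run before the new version of the project is accepted:

def check_push_consistency(scheduled_jobs, new_spiders):
    # scheduled_jobs: iterable of (job_id, spider_name) pairs
    # new_spiders:    spider names present in the pushed project version
    new_spiders = set(new_spiders)
    orphaned = [(job_id, spider) for job_id, spider in scheduled_jobs
                if spider not in new_spiders]
    if orphaned:
        details = ", ".join(f"{spider} (job {job_id})"
                            for job_id, spider in orphaned)
        raise ValueError("Push rejected; scheduled jobs reference spiders "
                         "missing from the new version: " + details)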

project deployment relying on external dependencies

The documentation explains how to zip and push a project.

How do you package and upload a project that relies on external dependencies?
e.g.:

  • project_A uses lib X
  • project_B uses lib Y and Z

assuming lib X, Y and Z are not part of Scrapy.

When running push-project on a project that uses external dependencies, the upload fails.

scrapy-do-cl --url http://ip:7654 push-project --project-path /path/to/my-project

The above command returns:

[!] Server responded with an error: Unable to get the list of spiders

It does not give much detail as to why it could not list the spiders. It took a bit of debugging to understand that the problem came from missing libs, since scrapy list worked fine locally.

Errors after scrapy-do restart with payload

After restarting scrapy-do, all active (scheduled) jobs throw an error, since the payload is not passed to the spider again.

AttributeError: 'Spider' object has no attribute 'payload'
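Until the payload is re-attached on restart, a defensive default in the spider avoids the crash. A small sketch, assuming the spider reads the payload from an attribute as in the traceback above:

import json

def get_payload(spider):
    # Fall back to an empty dict when the daemon restarted the job
    # without re-passing the payload.
    raw = getattr(spider, "payload", None)
    return json.loads(raw) if isinstance(raw, str) else (raw or {})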

Error after creating a job

Hi, thanks for creating scrapy-do.

I tried to create a job to run a spider, but scrapy-do hit an error in /scrapy_do/webservice.py at line 256.
I suppose this is why my jobs don't work and exit with code 1 seconds after being created.

Also, I can't get logs either on the command line or in the web interface, but I don't know if there is something I need to configure.

Thank you for this framework-API.
