queuelib's Introduction

queuelib

Queuelib is a Python library that implements object collections which are stored in memory or persisted to disk, provide a simple API, and run fast.

Queuelib provides collections for queues (FIFO), stacks (LIFO), queues sorted by priority and queues that are emptied in a round-robin fashion.

Note

Queuelib collections are not thread-safe.

Queuelib supports Python 3.5+ and has no dependencies.

Installation

You can install Queuelib either via the Python Package Index (PyPI) or from source.

To install using pip:

$ pip install queuelib

To install using easy_install:

$ easy_install queuelib

If you have downloaded a source tarball you can install it by running the following (as root):

# python setup.py install

FIFO/LIFO disk queues

Queuelib provides FIFO and LIFO queue implementations.

Here is an example usage of the FIFO queue:

>>> from queuelib import FifoDiskQueue
>>> q = FifoDiskQueue("queuefile")
>>> q.push(b'a')
>>> q.push(b'b')
>>> q.push(b'c')
>>> q.pop()
b'a'
>>> q.close()
>>> q = FifoDiskQueue("queuefile")
>>> q.pop()
b'b'
>>> q.pop()
b'c'
>>> q.pop()
>>>

The LIFO queue has an identical API; just import LifoDiskQueue instead.
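
For completeness, here is a minimal sketch of the LIFO variant; only the import changes, so the last item pushed is the first one popped:

>>> from queuelib import LifoDiskQueue
>>> q = LifoDiskQueue("queuefile-lifo")
>>> q.push(b'a')
>>> q.push(b'b')
>>> q.pop()
b'b'
>>> q.close()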

PriorityQueue

A discrete-priority queue implemented by combining multiple FIFO/LIFO queues (one per priority).

First, select the type of queue to be used per priority (FIFO or LIFO):

>>> from queuelib import FifoDiskQueue
>>> qfactory = lambda priority: FifoDiskQueue('queue-dir-%s' % priority)

Then instantiate the Priority Queue with it:

>>> from queuelib import PriorityQueue
>>> pq = PriorityQueue(qfactory)

And use it:

>>> pq.push(b'a', 3)
>>> pq.push(b'b', 1)
>>> pq.push(b'c', 2)
>>> pq.push(b'd', 2)
>>> pq.pop()
b'b'
>>> pq.pop()
b'c'
>>> pq.pop()
b'd'
>>> pq.pop()
b'a'
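
If items remain in the queue when your program stops, the priority queue can be persisted and reopened. The sketch below assumes that close() returns the priorities that still hold items and that the constructor accepts them back as a startprios argument (a startprios loop is visible in one of the issue reports further down this page); treat it as illustrative rather than documented behaviour:

>>> pq.push(b'e', 5)
>>> active = pq.close()
>>> pq = PriorityQueue(qfactory, startprios=active)
>>> pq.pop()
b'e'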

RoundRobinQueue

The round-robin queue has nearly the same interface and implementation as the priority queue, except that each element must be pushed with a (mandatory) key. Popping from the queue cycles through the keys in round-robin order.

Instantiate the Round Robin Queue similarly to the Priority Queue:

>>> from queuelib import RoundRobinQueue
>>> rr = RoundRobinQueue(qfactory)

And use it:

>>> rr.push(b'a', '1')
>>> rr.push(b'b', '1')
>>> rr.push(b'c', '2')
>>> rr.push(b'd', '2')
>>> rr.pop()
b'a'
>>> rr.pop()
b'c'
>>> rr.pop()
b'b'
>>> rr.pop()
b'd'

Mailing list

Use the scrapy-users mailing list for questions about Queuelib.

Bug tracker

If you have any suggestions, bug reports or annoyances, please report them to our issue tracker at: http://github.com/scrapy/queuelib/issues/

Contributing

Development of Queuelib happens at GitHub: http://github.com/scrapy/queuelib

You are highly encouraged to participate in the development. If you don't like GitHub (for some reason) you're welcome to send regular patches.

All changes must include tests in order to be merged.

Tests

Tests are located in the queuelib/tests directory. They can be run using nosetests with the following command:

nosetests

The output should be something like the following:

$ nosetests
.............................................................................
----------------------------------------------------------------------
Ran 77 tests in 0.145s

OK

License

This software is licensed under the BSD License. See the LICENSE file in the top distribution directory for the full license text.

Versioning

This software follows Semantic Versioning.

queuelib's People

Contributors

dangra, elacuesta, foresightyj, gallaecio, kmike, kseistrup, laerte, lookfwd, noviluni, pablohoffman, tianhuil, wrar


queuelib's Issues

multithread unit-test performance test

Do you have unit tests or performance tests that use multiple threads or multiple processes?

How can you be sure that it will work in those cases? I see that the SQLite connection is not synchronized in your code.

FifoDiskQueue appears empty after process exit without closing file

I was looking at using FifoDiskQueue to write incoming messages to disk before processing them. In case of a crash I could then re-open the queue and continue processing. But it seems that the queue thinks it's empty.

Am I correct to conclude that this scenario is not a use case for FifoDiskQueue?
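
A possible workaround, sketched under the assumption that queue contents are only guaranteed to be on disk once close() has been called: close and reopen the queue periodically, so a crash loses at most one checkpoint interval of messages. The directory name, interval and message stream below are placeholders.

from queuelib import FifoDiskQueue

CHECKPOINT_EVERY = 100  # hypothetical interval

messages = (b'msg-%d' % i for i in range(1000))  # stand-in for the incoming stream
queue = FifoDiskQueue("incoming-messages")
for i, message in enumerate(messages):
    queue.push(message)
    if (i + 1) % CHECKPOINT_EVERY == 0:
        queue.close()                                # persist what we have so far
        queue = FifoDiskQueue("incoming-messages")   # reopen and keep appending
queue.close()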

some problems with PriorityQueue

in the PriorityQueue :

        for p in startprios:
            self.queues[p] = self.qfactory(p)

in queuelib.queue.LifoMemoryQueue

    def __init__(self):
        self.q = deque()
        self.push = self.q.append

so how can it be called like this:

        for p in startprios:
            self.queues[p] = self.qfactory(p)

=============================
Here is a simple test:

if __name__ == '__main__':
    d = PriorityQueue(qfactory=LifoMemoryQueue)
    d.push(b'a',1)

and it raises this:

Traceback (most recent call last):
  File "D:/ProgramData/Anaconda3/Lib/site-packages/queuelib/pqueue.py", line 67, in <module>
    d.push(b'a',1)
  File "D:/ProgramData/Anaconda3/Lib/site-packages/queuelib/pqueue.py", line 33, in push
    self.queues[priority] = self.qfactory(priority)
TypeError: __init__() takes 1 positional argument but 2 were given
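
A likely fix, sketched on the assumption that the factory is simply expected to accept (and may ignore) the priority argument that PriorityQueue passes to it:

>>> from queuelib import PriorityQueue
>>> from queuelib.queue import LifoMemoryQueue
>>> # Wrap the class in a factory that discards the priority argument.
>>> d = PriorityQueue(qfactory=lambda priority: LifoMemoryQueue())
>>> d.push(b'a', 1)
>>> d.pop()
b'a'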

Failed write corrupts LIFO queue file

If a write fails when pushing an item to the LIFO queue, some data may actually be written before the exception is raised. When trying to pop, the end of the file is supposed to hold the size of the last item, but it instead contains whatever was left over from the partially written data of the failed push() call.

>>> from queuelib import LifoDiskQueue
>>> q = LifoDiskQueue("./small_fs/queue")  # Filesystem has less than 100KB available.
>>> for i in range(100): 
...     q.push(b'a' * 1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "queuelib/queue.py", line 152, in push
    self.f.write(string)
IOError: [Errno 28] No space left on device
>>> q.pop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "queuelib/queue.py", line 162, in pop
    self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
IOError: [Errno 22] Invalid argument

The error above comes from the value of size, which is decoded from the last SIZE_SIZE bytes of the file. When that value is larger than the file itself the seek will fail.
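
A possible workaround, sketched on the assumption that the queue's underlying file handle is exposed as q.f (an implementation detail visible in the traceback above, not documented API): remember where the last complete record ends before pushing, and truncate back to that point if the write fails.

import os
from queuelib import LifoDiskQueue

def safe_push(queue, data):
    # Record the end of the last complete record (queue.f is an
    # implementation detail, not a documented attribute).
    queue.f.seek(0, os.SEEK_END)
    end_of_last_record = queue.f.tell()
    try:
        queue.push(data)
    except OSError:
        # Discard any partially written bytes and restore the file
        # position so later pushes append at the right place.
        queue.f.truncate(end_of_last_record)
        queue.f.seek(end_of_last_record)
        raise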

Getting Unpacking Error

Hi,
I am getting this error while initialising the LIFO queue.
error('unpack requires a buffer of 4 bytes')

Any ideas or solutions would be appreciated.

Thanks.

FifoDiskQueue Size is Getting Bigger

I'm trying to implement a fixed-size queue using FifoDiskQueue.
My code is:

self._images_data = FifoDiskQueue(r"c:\temp")
...
# this code is inside a loop
# implement a fixed-size queue
if len(self._images_data) >= 400:
      self._images_data.pop()
self._images_data.push(frame.tobytes()) # add the new frame to the main database
...
# after the loop finishes:
self._images_data.close()

I'm seeing two issues:

  1. While the main loop is running, the queue file q00000 is always 0 KB.
  2. My free memory is decreasing, meaning the queue keeps getting bigger and bigger, even though I pop an old element before adding a new one.

Can someone help me figure out why the size keeps growing, or point me to the right way to implement a fixed-size queue?
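
One way to keep the queue bounded, sketched using only the push/pop/close calls and len() support shown above; the directory name and the frame loop are placeholders. Note that the on-disk chunk files are only guaranteed to reflect pushed data after close(), which may explain the 0 KB file observed while the loop runs.

from queuelib import FifoDiskQueue

MAX_ITEMS = 400

def bounded_push(queue, data, max_items=MAX_ITEMS):
    # Drop the oldest entries until there is room, then add the new one.
    while len(queue) >= max_items:
        queue.pop()
    queue.push(data)

queue = FifoDiskQueue("images-queue")         # hypothetical directory name
try:
    for i in range(1000):                     # stand-in for the frame-capture loop
        bounded_push(queue, b"frame-%d" % i)
finally:
    queue.close()                             # flushes buffered data to disk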

Too many open files

Adding lots of stuff to the queue results in this error:

Test case:

from queuelib import FifoDiskQueue, PriorityQueue as DiskPriorityQueue
# pq was missing from the snippet; reconstructed from the traceback paths below
qfactory = lambda priority: FifoDiskQueue('queue-dir-%s' % priority)
pq = DiskPriorityQueue(qfactory)
for i in xrange(0, 100000):
    pq.push(str(i), i)

Traceback (most recent call last):
File "fit.py", line 595, in
File "fit.py", line 557, in init
File "/usr/local/lib/python2.7/site-packages/queuelib/pqueue.py", line 33, in push
File "fit.py", line 541, in
File "/usr/local/lib/python2.7/site-packages/queuelib/queue.py", line 47, in init
File "/usr/local/lib/python2.7/site-packages/queuelib/queue.py", line 67, in _openchunk
IOError: [Errno 24] Too many open files: 'queue-dir-113/q00000'

Scrapy fails with `ImportError: cannot import name suppress`

Description

Scrapy fails with ImportError: cannot import name suppress

Steps to Reproduce

  1. Create a Python virtual environment that uses Python 2.7.16
  2. pip install readability-lxml scrapy
  3. scrapy runspider scrapy_test.py (code is included at the bottom)

Expected behavior:

The script should run successfully.

Actual behavior:

 $ pip install scrapy pandas readability-lxml
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Collecting scrapy
  Using cached Scrapy-1.8.0-py2.py3-none-any.whl (238 kB)
Collecting pandas
  Using cached pandas-0.24.2-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (16.7 MB)
Processing /Users/redacted/Library/Caches/pip/wheels/cf/80/76/f6eaec8f1622db6af7ceaeef9e4481e9dc766ccfc16b1cbd0b/readability_lxml-0.8.1-py2-none-any.whl
Collecting queuelib>=1.4.2
  Using cached queuelib-1.6.1-py2.py3-none-any.whl (12 kB)
Collecting cryptography>=2.0
  Using cached cryptography-3.3.2-cp27-cp27m-macosx_10_10_x86_64.whl (1.8 MB)
Collecting w3lib>=1.17.0
  Using cached w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting zope.interface>=4.1.3
  Using cached zope.interface-5.4.0-cp27-cp27m-macosx_10_14_x86_64.whl (208 kB)
Collecting pyOpenSSL>=16.2.0
  Using cached pyOpenSSL-20.0.1-py2.py3-none-any.whl (54 kB)
Processing /Users/redacted/Library/Caches/pip/wheels/50/41/57/228635c140878de06d942d072c9924afa56a86bb8fc2d319a4/Protego-0.1.16-py2-none-any.whl
Collecting six>=1.10.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting parsel>=1.5.0
  Using cached parsel-1.6.0-py2.py3-none-any.whl (13 kB)
Collecting service-identity>=16.0.0
  Using cached service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting Twisted>=16.0.0; python_version == "2.7"
  Using cached Twisted-20.3.0-cp27-cp27m-macosx_10_6_intel.whl (3.2 MB)
Collecting lxml>=3.5.0
  Using cached lxml-4.6.3-cp27-cp27m-macosx_10_9_x86_64.whl (4.5 MB)
Processing /Users/redacted/Library/Caches/pip/wheels/35/5f/0f/474144aca7e2624be7670cdd9c6eca4979713cee237d16b464/PyDispatcher-2.0.5-py2-none-any.whl
Collecting cssselect>=0.9.1
  Using cached cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting pytz>=2011k
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting numpy>=1.12.0
  Using cached numpy-1.16.6-cp27-cp27m-macosx_10_9_x86_64.whl (13.9 MB)
Collecting python-dateutil>=2.5.0
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting chardet
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting cffi>=1.12
  Using cached cffi-1.14.5-cp27-cp27m-macosx_10_9_x86_64.whl (175 kB)
Collecting enum34; python_version < "3"
  Using cached enum34-1.1.10-py2-none-any.whl (11 kB)
Collecting ipaddress; python_version < "3"
  Using cached ipaddress-1.0.23-py2.py3-none-any.whl (18 kB)
Requirement already satisfied: setuptools in /Users/redacted/.ve/censorednews-headlines/lib/python2.7/site-packages (from zope.interface>=4.1.3->scrapy) (44.1.1)
Processing /Users/redacted/Library/Caches/pip/wheels/c2/ea/a3/25af52265fad6418a74df0b8d9ca8b89e0b3735dbd4d0d3794/functools32-3.2.3.post2-py2-none-any.whl
Collecting pyasn1
  Using cached pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
Collecting pyasn1-modules
  Using cached pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
Collecting attrs>=19.1.0
  Using cached attrs-21.2.0-py2.py3-none-any.whl (53 kB)
Collecting Automat>=0.3.0
  Using cached Automat-20.2.0-py2.py3-none-any.whl (31 kB)
Collecting incremental>=16.10.1
  Using cached incremental-21.3.0-py2.py3-none-any.whl (15 kB)
Collecting hyperlink>=17.1.1
  Using cached hyperlink-21.0.0-py2.py3-none-any.whl (74 kB)
Processing /Users/redacted/Library/Caches/pip/wheels/f5/8c/e2/f0cea19d340270166bbfd4a2e9d8b8c132e26ef7e1376a0890/PyHamcrest-1.10.1-py2-none-any.whl
Collecting constantly>=15.1
  Using cached constantly-15.1.0-py2.py3-none-any.whl (7.9 kB)
Collecting pycparser
  Using cached pycparser-2.20-py2.py3-none-any.whl (112 kB)
Collecting idna>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting typing; python_version < "3.5"
  Using cached typing-3.10.0.0-py2-none-any.whl (26 kB)
Installing collected packages: queuelib, pycparser, cffi, enum34, six, ipaddress, cryptography, w3lib, zope.interface, pyOpenSSL, protego, lxml, functools32, cssselect, parsel, pyasn1, pyasn1-modules, attrs, service-identity, Automat, incremental, idna, typing, hyperlink, PyHamcrest, constantly, Twisted, PyDispatcher, scrapy, pytz, numpy, python-dateutil, pandas, chardet, readability-lxml
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-1.10.1 Twisted-20.3.0 attrs-21.2.0 cffi-1.14.5 chardet-4.0.0 constantly-15.1.0 cryptography-3.3.2 cssselect-1.1.0 enum34-1.1.10 functools32-3.2.3.post2 hyperlink-21.0.0 idna-2.10 incremental-21.3.0 ipaddress-1.0.23 lxml-4.6.3 numpy-1.16.6 pandas-0.24.2 parsel-1.6.0 protego-0.1.16 pyOpenSSL-20.0.1 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 python-dateutil-2.8.1 pytz-2021.1 queuelib-1.6.1 readability-lxml-0.8.1 scrapy-1.8.0 service-identity-21.1.0 six-1.16.0 typing-3.10.0.0 w3lib-1.22.0 zope.interface-5.4.0

 $ scrapy runspider scrapy_test.py
/Users/redacted/.ve/headlines/lib/python2.7/site-packages/OpenSSL/crypto.py:14: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in the next release.
  from cryptography import utils, x509
2021-06-19 18:04:09 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2021-06-19 18:04:09 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 2.7.16 (default, Jan 27 2020, 04:46:15) - [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.2, Platform Darwin-18.7.0-x86_64-i386-64bit
2021-06-19 18:04:09 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2021-06-19 18:04:09 [scrapy.extensions.telnet] INFO: Telnet Password: 861e55f8a3b0e3bb
2021-06-19 18:04:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2021-06-19 18:04:09 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 184, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 188, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 104, in crawl
    six.reraise(*exc_info)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 86, in crawl
    self.engine = self._create_engine()
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 111, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/engine.py", line 67, in __init__
    self.scheduler_cls = load_object(self.settings['SCHEDULER'])
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/utils/misc.py", line 46, in load_object
    mod = import_module(module)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/scheduler.py", line 7, in <module>
    from queuelib import PriorityQueue
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/__init__.py", line 1, in <module>
    from queuelib.queue import FifoDiskQueue, LifoDiskQueue
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/queue.py", line 7, in <module>
    from contextlib import suppress
exceptions.ImportError: cannot import name suppress

2021-06-19 18:04:09 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 104, in crawl
    six.reraise(*exc_info)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 86, in crawl
    self.engine = self._create_engine()
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 111, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/engine.py", line 67, in __init__
    self.scheduler_cls = load_object(self.settings['SCHEDULER'])
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/utils/misc.py", line 46, in load_object
    mod = import_module(module)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/scheduler.py", line 7, in <module>
    from queuelib import PriorityQueue
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/__init__.py", line 1, in <module>
    from queuelib.queue import FifoDiskQueue, LifoDiskQueue
  File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/queue.py", line 7, in <module>
    from contextlib import suppress
ImportError: cannot import name suppress

How to fix the issue:

pip install queuelib==1.4.2

Reproduces how often:

Always

Versions

scrapy version --verbose
/Users/redacted/.ve/headlines/lib/python2.7/site-packages/OpenSSL/crypto.py:14: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in the next release.
  from cryptography import utils, x509
Scrapy       : 1.8.0
lxml         : 4.6.3.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 2.7.16 (default, Jan 27 2020, 04:46:15) - [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.3.2
Platform     : Darwin-18.7.0-x86_64-i386-64bit

Additional context

scrapy_test.py code:

'''
headline_scraper.py
A simple scrapy spider to collect web page titles
'''

import scrapy
from pandas import read_csv
from readability.readability import Document

PATH_TO_DATA = 'https://gist.githubusercontent.com/jackbandy/208028b404d8c6a6f822397e306a5a34/raw/ef7f73357e77c29c63b5b7632d840a923327e179/100_urls_sample.csv'


class HeadlineSpider(scrapy.Spider):
    name = "headline_spider"
    start_urls = read_csv(PATH_TO_DATA).url.tolist()

    def parse(self, response):
        doc = Document(response.text)
        yield {
            'short_title': doc.short_title(),
            'full_title': doc.title(),
            'url': response.url
        }

some problems with PriorityQueue

    def open(self, spider):
        self.spider = spider
        self.mqs = self.pqclass(self._newmq)
        self.dqs = self._dq() if self.dqdir else None
        return self.df.open()
    def _newmq(self, priority):
        return self.mqclass()
class FifoMemoryQueue(object):
    """In-memory FIFO queue, API compliant with FifoDiskQueue."""

    def __init__(self):
        self.q = deque()
        self.push = self.q.append

so how can it be called like this:

self.queues[priority] = self.qfactory(priority)

README example opens a FIFO queue but result shown is a LIFO queue

The first example, a FIFO queue, pushes a, b, and c, then pops c, b, and a, which is LIFO behaviour. I believe the README should have looked like this:

diff --git a/README.rst b/README.rst
index e3dd7ae..0b2dbb5 100644
--- a/README.rst
+++ b/README.rst
@@ -48,13 +48,13 @@ Here is an example usage of the FIFO queue::
     >>> q.push(b'b')
     >>> q.push(b'c')
     >>> q.pop()
-    'c'
+    b'a'
     >>> q.close()
     >>> q = FifoDiskQueue("queuefile")
     >>> q.pop()
     b'b'
     >>> q.pop()
-    b'a'
+    b'c'
     >>> q.pop()
     >>>

PickleFifoDiskQueue and FifoMemoryQueue throw EOF when used to persist a crawl

AUTHOR: @Varriount

When using scrapy's PickleFifoDiskQueue and FifoMemoryQueue objects for persistence and request scheduling, an EOF exception is thrown:

Traceback (most recent call last):
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\crawler.py", line 253, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 1194, in run
    self.mainLoop()
  File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 1203, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "C:\x64\python27\lib\site-packages\twisted\internet\base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\engine.py", line 105, in _next_request
    if not self._next_request_from_scheduler(spider):
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\engine.py", line 132, in _next_request_from_scheduler
    request = slot.scheduler.next_request()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\scheduler.py", line 68, in next_request
    request = self._dqpop()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\core\scheduler.py", line 98, in _dqpop
    d = self.dqs.pop()
  File "C:\x64\python27\lib\site-packages\queuelib\pqueue.py", line 43, in pop
    m = q.pop()
  File "C:\x64\python27\lib\site-packages\scrapy-1.1.0dev1-py2.7.egg\scrapy\squeues.py", line 21, in pop
    return deserialize(s)
exceptions.EOFError:

There isn't any problem when not persisting the crawl (omitting '-s JOBDIR=crawl-1'), so my best guess is that the problem lies mainly with PickleFifoDiskQueue.
I'm running Python 2.7 x64 on Windows 8 x64, using the latest Scrapy from the master branch. This bug affects the latest stable Scrapy build as well.

Edit: After some investigation, it seems that a race condition is occurring when the FifoDiskQueue object's headf and tailf file descriptors point to the same file. Adding a 'sleep' to the pop() method greatly decreases the occurrence of the EOF exception over multiple runs.

Is there anything wrong with the example code of PriorityQueue?

from queuelib import FifoDiskQueue
q = FifoDiskQueue("/home/lf")
from queuelib import PriorityQueue
pq = PriorityQueue(q)
pq.push(b'a', 3)
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist-packages/queuelib/pqueue.py", line 33, in push
self.queues[priority] = self.qfactory(priority)
TypeError: 'FifoDiskQueue' object is not callable

Should q be a factory that generates FifoDiskQueue objects?
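
Yes; judging by the README example earlier on this page, the constructor expects a factory callable that builds one queue per priority, not an already-created queue instance. A minimal sketch:

>>> from queuelib import FifoDiskQueue, PriorityQueue
>>> # Pass a factory that creates one FifoDiskQueue per priority.
>>> qfactory = lambda priority: FifoDiskQueue('/home/lf/queue-%s' % priority)
>>> pq = PriorityQueue(qfactory)
>>> pq.push(b'a', 3)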
