
parsel's Introduction

Scrapy

Overview

Scrapy is a BSD-licensed, fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, Twitter, mailing list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

parsel's People

Contributors

akshayphilar, anthonychougit, curita, dangra, digenis, echoshoot, elacuesta, eliasdorneles, felipeboffnunes, gallaecio, golewski, harshasrinivas, immerrr, kmike, laerte, langdi, lopuhin, lufte, malloxpb, nirzak, pablohoffman, pcorpet, redapple, sortafreel, starrify, stav, stummjr, victor-torres, void, wrar

parsel's Issues

Work around incorrect extraction of "reserved" HTML entities

The entities marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas parsel should probably strive for more compatibility with browsers, which interpret them according to CP1252.

A quick example:

In [13]: etree.fromstring('<p>&#133;</p>').text
Out[13]: u'\x85'

whereas modern browsers usually show it as an ellipsis:

In [5]: u'\u2026'
Out[5]: '…'
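
A minimal post-processing sketch (an assumption on my part, not anything parsel does today): remap the C1 control characters produced for these "reserved" numeric entities to what a browser would display, via the cp1252 codec.

def fix_cp1252_entities(text):
    # remap U+0080..U+009F characters to their CP1252 interpretations
    def fix(ch):
        if u'\x80' <= ch <= u'\x9f':
            try:
                return ch.encode('latin-1').decode('cp1252')
            except UnicodeDecodeError:
                return ch  # a few code points are undefined in CP1252
        return ch
    return u''.join(fix(ch) for ch in text)

print(fix_cp1252_entities(u'\x85'))  # u'\u2026', an ellipsis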

I cannot get blank text in a td tag.

# coding=utf-8

from parsel import Selector

html = u'''
                        <table class="table table-bordered table-hover table-condensed">
                            <thead>
                            <tr>
                                <th>#</th>
                                <th>code</th>
                                <th>vendor</th>
                                <th>name</th>
                                <th>num</th>
                            </tr>
                            </thead>
                            <tbody>
                                <tr>
                                    <th scope="row">1750</th>
                                    <td><a href="/exam/000-643">000-643</a></td>
                                    <td>IBM</td>
                                    <td></td>
                                    <td>45</td>
                                </tr>
                                </tbody>
'''

sel = Selector(text=html)
print(sel.xpath('//tbody/tr//td/text()').extract())
print(sel.xpath('//tbody/tr//td//text()').extract())

Output:

[u'000-643', u'IBM', u'45']
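
A hedged workaround sketch, reusing sel from above: the empty <td> has no text node at all, so select the cells themselves and take each cell's XPath string value, which keeps the blank as an empty string.

row = sel.xpath('//tbody/tr')[0]
print([td.xpath('string(.)').extract_first() for td in row.xpath('.//td')])
# roughly: ['000-643', 'IBM', '', '45']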

Run doctests in tox

After the switch from nosetests to pytest we stopped running doctests. I think they should be re-enabled; at least parsel.utils relies on doctests.

What do you think about Selector(response).xpath().map() ?

Sometimes I'd like to apply some function after extracting something, and I do something like this:

In[32]: map(json.loads, sel.xpath("//@data-p13n-asin-metadata").extract())

Out[32]: [{u'asin': u'B00Y2863GQ', u'price': 221.99, u'ref': u'pd_bxgy_75_1_1'},
 {u'asin': u'B008J3UD2U', u'price': 9.22, u'ref': u'pd_bxgy_75_2_2'},
 {u'asin': u'B008J3UD2U', u'ref': u'pd_sim_75_1'}]

what do you think about adding support for map on selector result level? So that I could do

 sel.xpath("//@data-p13n-asin-metadata").map(json.loads)

or even allow to pass list of functions

 sel.xpath("//@data-p13n-asin-metadata").map([json.loads, lambda d: d.get('asin'))

?
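
For what it's worth, a hedged sketch of what a standalone helper could look like today (map_extracted is a hypothetical name, not parsel API):

import json
from parsel import Selector

def map_extracted(selector_list, func):
    # apply func to every extracted string of a SelectorList
    return [func(value) for value in selector_list.extract()]

sel = Selector(text=u"""<div data-p13n-asin-metadata='{"asin": "B00Y2863GQ"}'></div>""")
print(map_extracted(sel.xpath("//@data-p13n-asin-metadata"), json.loads))
# [{'asin': 'B00Y2863GQ'}]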

project description

Currently project description on github is the following:

Parsel lets you extract text from XML/HTML documents using XPath or CSS selectors

I think this is confusing: Parsel can be used not only to extract text; it can also extract parts of XML/HTML documents with markup, or attributes of elements.

Technically, any part of HTML is text (including markup and tags), but I think it is confusing to use "text" in this context - usually "extract text from HTML" means "remove tags".

Content after null byte is dropped

For some URLs there is a null byte (\x00) inside the response body, and all content after it gets dropped in the lxml element tree.
How about removing null bytes before sending the body to lxml, so that we no longer need to add this logic in every project?
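
A simple workaround sketch in the meantime (an assumption about the use case, not a proposed API): strip the null bytes yourself before handing the body to parsel.

from parsel import Selector

body = '<html><body><p>before\x00after</p></body></html>'
sel = Selector(text=body.replace('\x00', ''))
print(sel.xpath('//p/text()').extract_first())  # 'beforeafter'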

Release Parsel 1.1.0

I think it's time to do a release. :)

Things to do before releasing:

  • document css2xpath
  • document using named variables
  • update NEWS file
  • Document CSS selectors extensions

parsel-cli - parsel as a cli application

I've been working on a small CLI parsel wrapper for interpreting CSS and XPath selectors (inspired by scrapy shell).

https://github.com/granitosaurus/parsel-cli

It puts you straight into a CSS or XPath interpreter mode (or an embedded Python shell) and evaluates input CSS/XPath selectors using parsel.

I think it's better off as a standalone tool, but maybe it's worth mentioning somewhere in the README :)

Huge text extraction

When dealing with huge text inside a tag, parsel seems to close tags incorrectly.
Here's what I've done in console (scrapy shell https://www.immobilienscout24.de/Suche/S-/P-46/Haus-Kauf/Bayern):

from parsel import Selector
s = Selector(response.text)
# Get last 100 symbols of the script tag containing 'resultListModel:'
s.xpath(f'//script[contains(text(),"resultListModel:")]').get()[-100:]
# We will get this string:
# 'istings\\/da3d4373-9dcc-4dbe-84bb-ae31d05dd057-1263509954.jpg\\/ORIG\\/legacy_thumbnail\\/%WIDT</script>'

# Now let's try to find the line with 'resultListModel:' string
lines = response.text.split('\n')
data_str = next(l for l in lines if 'resultListModel:' in l)
# And here let's find where we can find the last 100 symbols of the script tag contents
# It gave me 9993218
data_str.find('istings\\/da3d4373-9dcc-4dbe-84bb-ae31d05dd057-1263509954.jpg\\/ORIG\\/legacy_thumbnail\\/%WIDT')
# Let's look at this part of the text
data_str[9993218:9993400]
# We will get this string (correct):
# 'istings\\/da3d4373-9dcc-4dbe-84bb-ae31d05dd057-1263509954.jpg\\/ORIG\\/legacy_thumbnail\\/%WIDTH%x%HEIGHT%\\/format\\/jpg\\/quality\\/50"},{"@scale":"WHITE_FILLING","@href":"https:\\/\\/pictur'

The whole line that is stored in data_str is inside the <script> tag, but somehow it turns out that it is longer than all the contents:

In [20]: len(s.xpath(f'//script[contains(text(),"resultListModel:")]').get())
Out[20]: 9999243

In [21]: len(data_str)
Out[21]: 12004005

[Feature Request] Add support for JMESPath

Building a Selector based on JMESPath in parsel would help ease parsing JSON.
This would also help Scrapy add methods like add_json and get_json to the ItemLoader. I got this idea from scrapy/scrapy#1005.
From what I understand, the Selector in parsel has been built using lxml; how about using jmespath to build a JsonSelector?

I am not sure if this is the right feature for this library, as Parsel describes itself as a parser for XML/HTML. But adding this feature would add great value to the project.

PS: If the maintainers would like to have this feature, then I'd like to contribute it myself.
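
As an illustration of the appeal, here is a hedged sketch that combines parsel with the existing jmespath package by hand (jmespath.search is the real jmespath API; a JsonSelector wrapping it is only the idea being proposed):

import json
import jmespath
from parsel import Selector

sel = Selector(text=u'<script>{"product": {"name": "foo", "price": 42}}</script>')
data = json.loads(sel.xpath('//script/text()').extract_first())
print(jmespath.search('product.price', data))  # 42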

SelectorList.css argument is named 'xpath'

What do you think about renaming the SelectorList.css argument to 'query'? It would be a backwards-incompatible change, but I doubt there is a lot of code which uses a keyword argument for the sel.css method.

There is a similar problem in Scrapy.

Support Zorba as an alternative XML/HTML processing engine

This has been troubling me for some time now, but I would like this project to support a more powerful XML/HTML processing engine as an alternative to lxml. The only contender for lxml in Python is Zorba. But why?

  • Zorba supports XQuery technology as well as JSONiq.
  • Zorba has Python bindings. I know they are not precisely the best bindings ever but at least they exist.
  • I think XPath 1.0 is very limited for more complex structures.
  • lxml extensions are OK, but not much when compared to XQuery's capabilities out of the box.
  • Zorba can be hosted as a service.

Ideally, we should be able to use selectors with Zorba in this way:

Selector(response=response).xquery('...').extract()
or
response.selector.xquery('...').extract()

Define module name

Is there a module name suggested for this new baby? Feel free to dump ideas :)

Add option to retrieve text content

As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to either use the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. But it requires overhead to work around it, and it often creates frustration for new users. There are a number of questions about this on Stack Overflow as well as on the Scrapy bug tracker.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:

  • Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
  • Or add a parameter to .extract*()/.get(), similar to the proposal in #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.
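
A minimal workaround sketch in the meantime (extract_text is a hypothetical helper, not parsel API): join all descendant text nodes of an element, similar to lxml's .text_content().

from parsel import Selector

def extract_text(selector, separator=u''):
    # join every descendant text node of the selected element
    return separator.join(selector.xpath('.//text()').extract())

sel = Selector(text=u'<h2>This is the <em>new</em> trend!</h2>')
print(extract_text(sel.css('h2')[0]))  # 'This is the new trend!'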

Allow passing over kwargs in .extract()

Since the extract() method is just a wrapper around lxml.etree.tostring, it would make sense to allow kwargs passthrough. This would enable more flexibility for getting string data.

For example, it would allow usage of the pretty_print kwarg:

>>> foo = '<body><div>hi</div></body>'
>>> from parsel import Selector
>>> Selector(text=foo).extract()
'<html><body><div>hi</div></body></html>'
# vs
>>> print(Selector(text=foo).extract(pretty_print=True, method='xml'))
<html>
  <body>
    <div>hi</div>
  </body>
</html>

See my attempt at patch for this: #101
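
Until something like that lands, a hedged workaround is to call lxml directly on the selector's underlying element via Selector.root:

from lxml import etree
from parsel import Selector

sel = Selector(text='<body><div>hi</div></body>')
print(etree.tostring(sel.root, pretty_print=True, method='xml').decode())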

Misleading "data=" in Selector representation

Motivation: https://stackoverflow.com/questions/44407581/python-scrapy-output-is-cut-off-hence-wount-let-me-correctly-build-queries

With

<Selector xpath='//div[@id="all_game_info"]' data=u'<div id="all_game_info" class="table_wrapper columns'>

the user thought that the selector had only extracted u'<div id="all_game_info" class="table_wrapper columns'

Suggestions:

  • change data= to something like data-preview=
  • or add ... at the end, and maybe indicate the length of the extracted data

remove Scrapy dependency

Selectors shouldn't import from Scrapy; otherwise splitting them into a separate library doesn't provide benefits.

LookupError: unknown encoding: 'unicode'

When running on Windows, the error "LookupError: unknown encoding: 'unicode'" is thrown on line 226 of selector.py. This happened when scraping a webpage.

Bad HTML parsing

Given the same HTML code, here is what different parsers see:

=== HTML ===
<li>
 one
 <div>
</li>
<li>
 two
</li>
=== parsel (lxml) (marginal interpretation) ===
<html><body><li>
 one
 <div>

<li>
 two
</li></div></li></body></html>
=== html.parser ===
<li>
 one
 <div>
 </div>
</li>
<li>
 two
</li>
=== lxml (same problem as parsel of course) ===
<html>
 <body>
  <li>
   one
   <div>
    <li>
     two
    </li>
   </div>
  </li>
 </body>
</html>
=== html5lib (Parses pages the same way a web browser does) ===
<html>
 <head>
 </head>
 <body>
  <li>
   one
   <div>
   </div>
  </li>
  <li>
   two
  </li>
 </body>
</html>

It is very annoying when the parsing differs from what a web browser does. It would be a good addition to provide a way to use something other than lxml.

#!/usr/bin/env python

from parsel import Selector
from bs4 import BeautifulSoup

print('=== HTML ===')
html = '''<li>
 one
 <div>
</li>
<li>
 two
</li>'''
print(html)

print('=== parsel (lxml) (marginal interpretation) ===')
sel = Selector(text=html)
print(sel.extract())

print('=== html.parser ===')
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

print('=== lxml (same problem as parsel of course) ===')
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

print('=== html5lib (Parses pages the same way a web browser does) ===')
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())

functools32 dependency?

Isn't the functools32 dependency missing as a PY2 requirement?
I'm wondering how that builds correctly.

removing text when `<` is inside

>>> from parsel import Selector
>>> s = Selector(text=u'<html><body>Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (<2cm), Belt Length: 93cm</body></html>')

>>> s.extract()
u'<html><body>Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (</body></html>'

The text after < is removed.

custom xpath support

Could custom XPath functions be added here? For example:

# Original Source: https://gist.github.com/shirk3y/458224083ce5464627bc
from lxml import etree

CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"

def has_class(context, *classes):
    """
    This lxml extension allows to select by CSS class more easily
    >>> ns = etree.FunctionNamespace(None)
    >>> ns['has-class'] = has_class
    >>> root = etree.XML('''
    ... <a>
    ...     <b class="one first text">I</b>
    ...     <b class="two text">LOVE</b>
    ...     <b class="three text">CSS</b>
    ... </a>
    ... ''')
    >>> len(root.xpath('//b[has-class("text")]'))
    3
    >>> len(root.xpath('//b[has-class("one")]'))
    1
    >>> len(root.xpath('//b[has-class("text", "first")]'))
    1
    >>> len(root.xpath('//b[not(has-class("first"))]'))
    2
    >>> len(root.xpath('//b[has-class("not-exists")]'))
    0
    """

    expressions = ' and '.join([CLASS_EXPR.format(c) for c in classes])
    xpath = 'self::*[@class and {}]'.format(expressions)
    return bool(context.context_node.xpath(xpath))

I think it is common practice to create custom XPath functions in different projects.
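
A hedged usage sketch for the function above, registering it in lxml's default function namespace and calling it through a parsel Selector:

from lxml import etree
from parsel import Selector

etree.FunctionNamespace(None)['has-class'] = has_class  # has_class as defined above

sel = Selector(text=u'<a><b class="one first text">I</b><b class="two text">LOVE</b></a>')
print(sel.xpath('//b[has-class("text", "first")]').extract())
# ['<b class="one first text">I</b>']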

Selector.root is not an instance of lxml.html.HtmlElement even if parser is html

I'm trying to use lxml's Cleaner without parsing the response multiple times:

from lxml.html.clean import Cleaner
cleaner = Cleaner()
sel = parsel.Selector("<html><body><style>.p {width:10px}</style>hello</body></html>")
cleaner.clean_html(sel.root)

This doesn't work because Cleaner needs a lxml.html.HtmlElement instance, while Selector.root is always lxml.etree._Element, so it doesn't have a required .rewrite_links method.

Why is lxml.etree.HTMLParser used for HTML and not lxml.html.HTMLParser?

Missing copyright/license info/file

Hello,

I am working on packaging parsel for Debian. I got from setup.py that it has a BSD license, but I am not sure which BSD license. The other Scrapy-related projects seem to use the BSD 3-clause license.

Could you confirm this or add the license file to the parsel repo?

Thanks in advance!

Document behavior of HTML comments inside script tags

Currently, the parser ignores HTML comments inside script tags, treating them as part of the text element:

>>> import parsel
>>> parsel.__version__
'1.0.2'
>>>
>>> # comments inside random nodes...
>>> node_with_comments = parsel.Selector(u'<node><!-- comment -->text here</node>')
>>> node_with_comments.xpath('//comment()')
[<Selector xpath='//comment()' data=u'<!-- comment -->'>]
>>> node_with_comments.xpath('//text()')
[<Selector xpath='//text()' data=u'text here'>]
>>> # okay, looks good
>>>
>>> # now, with comments inside a script tag:
>>> script_with_comments = parsel.Selector(u'<script><!-- comment -->alert("hello")</script>')
>>> script_with_comments.xpath('//comment()')
[]
>>> # oops, can't find the comments!
>>> script_with_comments.xpath('//text()')
[<Selector xpath='//text()' data=u'<!-- comment -->alert("hello")'>]

It looks like the parser ignores comments inside the <script> tag, considering them all part of the text.

This is a problem because people are unable to easily strip HTML comments from the JavaScript code (which is often fed to a JS parser, like js2xml).

Changing this would break backwards compatibility, but this looks like a bug to me.

Thoughts? Concerns?

Caching css_to_xpath()'s recently used patterns to improve efficiency

I profiled the scrapy-bench spider which uses response.css() for extracting information.

The profiling results are here. The function css_to_xpath() takes 5% of the total time.

When response.xpath() was used (profiling result), the items extracted per second (benchmark result) were higher.

Hence, I'm proposing caching for recently used patterns, so that the function takes less time.
I'm working on a prototype and will add the results for it too.
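
A hedged sketch of such a cache, wrapping cssselect's translator with functools.lru_cache (parsel would presumably wire this into its own translator; the names here are just for illustration):

from functools import lru_cache
from cssselect import HTMLTranslator

_translator = HTMLTranslator()

@lru_cache(maxsize=256)
def cached_css_to_xpath(css_query):
    # repeated CSS patterns hit the cache instead of being re-translated
    return _translator.css_to_xpath(css_query)

print(cached_css_to_xpath('h1 > a.title'))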

Selector mis-parses html when elements have very large bodies

This was discovered by a Reddit user, concerning an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down into a reproducible test case.

what is expected

Selector(html).css('h1') should produce all h1 elements within the document

what actually happens

Selector(html).css('h1') produces only the h1 elements that appear before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.


pip install html5lib==1.0.1
pip install parsel==1.4.0
import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{} = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

produces the output

Selector.time=0.3661611080169678
h1[0 = <h1>pre-div h1</h1>
h1[1 = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>

Make sel.xpath('.') work the same for text elements

Given:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
...         <body>
...             <h1>Hello, Parsel!</h1>
...         </body>
...         </html>""")

For text, you get:

>>> subsel = sel.css('h1::text')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1/text()' data=u'Hello, Parsel!'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[]

However, regular elements work as you would expect:

>>> subsel = sel.css('h1')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1' data=u'<h1>Hello, Parsel!</h1>'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[<Selector xpath='.' data=u'<h1>Hello, Parsel!</h1>'>]

I believe text elements should work the same. '.' should select them if they are the current element.

Text Starting with "<-" are Ignored

Text starting with "<-" within the HTML body is completely ignored; examples follow.

Note: XML tag names starting with a hyphen are invalid as per the W3C XML spec

Example 1

>>> html = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title></title><div>Release Date</div></body></html>'

Example 2

>>> html = '<html><body><title><-Thor></title></body></html>'
>>> Selector(html).extract()
'<html><body><title></title></body></html>'

Example 3

>>> html = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title>Avengers-&gt;</title><div>Release Date</div></body></html>'

Get lxml node (HtmlElement)

Is it possible to get a <class 'lxml.html.HtmlElement'>? For example:

from parsel import Selector
sel = Selector("<html><body> <div><h1>Header1</h1><p>any text..</p></div> </body></html>")
h1 = sel.xpath(".//div/h1").extract_first()
type(h1)  # Why str? How do I get an lxml.html.HtmlElement?
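
For reference, a hedged answer sketch: extract_first() serializes to a string by design; the underlying lxml element is available through a Selector's .root attribute instead (though, as noted in another issue here, it may be an lxml.etree._Element rather than lxml.html.HtmlElement):

from parsel import Selector

sel = Selector(text="<html><body> <div><h1>Header1</h1><p>any text..</p></div> </body></html>")
h1 = sel.xpath(".//div/h1")[0]
print(type(h1.root))  # the underlying lxml element class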

Implement re_first for Selector

Copied from scrapy/scrapy#1907

Currently only SelectorList supports the re_first shortcut method. It would be useful to have this method in Selector too.

from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).re_first
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'Selector' object has no attribute 're_first'
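
A hedged workaround until Selector grows re_first: call .re() (which Selector does have) and take the first match yourself.

from parsel import Selector

body = '<html><body><span>good</span></body></html>'
matches = Selector(text=body).re(r'<span>(.*?)</span>')
print(matches[0] if matches else None)  # 'good'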

Interactive demo for `parsel` code examples

Hello folks!

I've been using parsel lately and found it really interesting. I'm even thinking about adding it to our curriculum to teach our students at https://rmotr.com/ (co-founder and teacher here). In our DS program we have a section on data scraping (we love Scrapy) and data cleaning, and this seems to be a very convenient library for that.

While looking at the code examples in the README file and the docs, I thought it would be great to provide an interactive demo that people can use to play with the library before committing to downloading and installing it locally.

A demo is worth a thousand words 😉

I spent some time compiling parsel examples into a Jupyter Notebook file, and adapted it in a way that we can open it with a small service we have at RMOTR to launch Jupyter environments online. Note that parsel and requests (both required in examples) are already installed when the env is loaded, so people can start using it right away.

The result is what you can see in my fork of the repo (see the new "Demo" section):
https://github.com/martinzugnoni/parsel

Do you think that having such an interactive demo would help people get to know and use parsel?
Let's use this issue as a kick-off to start a discussion. Nothing here is written in stone, and we can change everything as you wish.

I hope you like it, and I truly appreciate any feedback.

Thanks.

[idea] Flexible `replace_entities` on `Selector.re` and `Selector.extract`

As of the latest commit (2c87fe4), Selector.extract does not replace entities, while Selector.re does:

>>> import parsel
>>> sel = parsel.Selector(text=u'<script>{"foo":"bar&quot;baz&quot;"}</script>')
>>> sel.css('script::text').extract_first()
u'{"foo":"bar&quot;baz&quot;"}'
>>> sel.css('script::text').re_first('(.*)')
u'{"foo":"bar"baz""}'

The related code is at line 72 of parsel/utils.py

However, in some specific cases (e.g. the example above), we may want to control this behavior.

What do you think about adding optional arguments to both methods to control the replace_entities behavior?
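
For illustration, a hedged sketch that reproduces the asymmetry by applying w3lib's replace_entities (as far as I can tell, the helper used by the code referenced above) to an extracted value by hand:

from w3lib.html import replace_entities
from parsel import Selector

sel = Selector(text=u'<script>{"foo":"bar&quot;baz&quot;"}</script>')
raw = sel.css('script::text').extract_first()
print(raw)                    # u'{"foo":"bar&quot;baz&quot;"}'
print(replace_entities(raw))  # u'{"foo":"bar"baz""}'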

Documentation

Documentation in the docs/ folder is for Scrapy selectors, not for parsel. We should update it and set up ReadTheDocs integration.

Is it possible to modify the response content through Scrapy Selector?

I am using Scrapy to copy some content from one page: crawl the content, download the images in that content, and update each image's original attribute accordingly.

For example I have:

<div class="A">
    <img original="example1.com/1/1.png"></img>
</div>

I need to download the image and update the image's original value (for example to mysite.com/1/1.png), then save the content.

What I will have finally is:

<div class="A">
    <img original="mysite.com/1/1.png"></img>
</div>

and image on my disk.

Is it possible to modify the value through Selector?

Or must I download the image first and update the "original" value separately? Is there any better solution?

Another use case:
On some pages, content is hidden by a CSS setting (style="display:none"), or we want to do some preprocessing of the content, so we need to inspect the content and update it.
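
A hedged sketch of one possible approach (not official guidance): mutate the underlying lxml tree through each selector's .root element, then re-serialize; extract() serializes from the live tree, so the change shows up.

from parsel import Selector

html = '<div class="A"><img original="example1.com/1/1.png"></div>'
sel = Selector(text=html)
for img in sel.xpath('//div[@class="A"]/img'):
    old = img.root.get('original')
    img.root.set('original', old.replace('example1.com', 'mysite.com'))
print(sel.extract())  # the serialization reflects the modified attribute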
