
parsel's Introduction

Scrapy

Overview

Scrapy is a BSD-licensed, fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, Twitter, mailing list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

parsel's People

Contributors

akshayphilar, anthonychougit, curita, dangra, digenis, echoshoot, elacuesta, eliasdorneles, felipeboffnunes, gallaecio, golewski, harshasrinivas, immerrr, kmike, laerte, langdi, lopuhin, lufte, malloxpb, nirzak, pablohoffman, pcorpet, redapple, sortafreel, starrify, stav, stummjr, victor-torres, void, wrar

parsel's Issues

Work around incorrect extraction of "reserved" HTML entities

The entities marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas parsel should probably strive for more compatibility with browsers, which interpret them according to CP1252.

A quick example:

In [13]: etree.fromstring('<p>&#133;</p>').text
Out[13]: u'\x85'

whereas modern browsers usually show it as an ellipsis:

In [5]: u'\u2026'
Out[5]: '…'
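
A minimal post-processing sketch (an assumption on my part, not anything parsel does today): remap the C1 control characters produced for these "reserved" numeric entities to what a browser would display, via the cp1252 codec.

def fix_cp1252_entities(text):
    # remap U+0080..U+009F characters to their CP1252 interpretations
    def fix(ch):
        if u'\x80' <= ch <= u'\x9f':
            try:
                return ch.encode('latin-1').decode('cp1252')
            except UnicodeDecodeError:
                return ch  # a few code points are undefined in CP1252
        return ch
    return u''.join(fix(ch) for ch in text)

print(fix_cp1252_entities(u'\x85'))  # u'\u2026', an ellipsis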

I cannot get blank text in a td tag.

# coding=utf-8

from parsel import Selector

html = u'''
                        <table class="table table-bordered table-hover table-condensed">
                            <thead>
                            <tr>
                                <th>#</th>
                                <th>code</th>
                                <th>vendor</th>
                                <th>name</th>
                                <th>num</th>
                            </tr>
                            </thead>
                            <tbody>
                                <tr>
                                    <th scope="row">1750</th>
                                    <td><a href="/exam/000-643">000-643</a></td>
                                    <td>IBM</td>
                                    <td></td>
                                    <td>45</td>
                                </tr>
                                </tbody>
'''

sel = Selector(text=html)
print(sel.xpath('//tbody/tr//td/text()').extract())
print(sel.xpath('//tbody/tr//td//text()').extract())

Output:

[u'000-643', u'IBM', u'45']
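
A hedged workaround sketch, reusing sel from above: the empty <td> has no text node at all, so select the cells themselves and take each cell's XPath string value, which keeps the blank as an empty string.

row = sel.xpath('//tbody/tr')[0]
print([td.xpath('string(.)').extract_first() for td in row.xpath('.//td')])
# roughly: ['000-643', 'IBM', '', '45']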

Run doctests in tox

After the switch from nosetests to pytest we stopped running doctests. I think they should be re-enabled; at least parsel.utils relies on doctests.

What do you think about Selector(response).xpath().map() ?

Sometimes I'd like to apply some function after extracting something, and I do something like this:

In[32]: map(json.loads, sel.xpath("//@data-p13n-asin-metadata").extract())

Out[32]: [{u'asin': u'B00Y2863GQ', u'price': 221.99, u'ref': u'pd_bxgy_75_1_1'},
 {u'asin': u'B008J3UD2U', u'price': 9.22, u'ref': u'pd_bxgy_75_2_2'},
 {u'asin': u'B008J3UD2U', u'ref': u'pd_sim_75_1'}]

what do you think about adding support for map on selector result level? So that I could do

 sel.xpath("//@data-p13n-asin-metadata").map(json.loads)

or even allow to pass list of functions

 sel.xpath("//@data-p13n-asin-metadata").map([json.loads, lambda d: d.get('asin'))

?
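
For what it's worth, a hedged sketch of what a standalone helper could look like today (map_extracted is a hypothetical name, not parsel API):

import json
from parsel import Selector

def map_extracted(selector_list, func):
    # apply func to every extracted string of a SelectorList
    return [func(value) for value in selector_list.extract()]

sel = Selector(text=u"""<div data-p13n-asin-metadata='{"asin": "B00Y2863GQ"}'></div>""")
print(map_extracted(sel.xpath("//@data-p13n-asin-metadata"), json.loads))
# [{'asin': 'B00Y2863GQ'}]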

project description

Currently project description on github is the following:

Parsel lets you extract text from XML/HTML documents using XPath or CSS selectors

I think this is confusing: Parsel can be used not only to extract text; it can also extract parts of XML/HTML documents with markup, or attributes of elements.

Technically, any part of HTML is text (including markup and tags), but I think it is confusing to use "text" in this context - usually "extract text from HTML" means "remove tags".

Content after null byte is dropped

For some URLs there is a null byte (\x00) inside the response body, and all content after it gets dropped in the lxml element tree.
How about removing null bytes before sending the body to lxml, so that we no longer need to add this logic in every project?
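
A simple workaround sketch in the meantime (an assumption about the use case, not a proposed API): strip the null bytes yourself before handing the body to parsel.

from parsel import Selector

body = '<html><body><p>before\x00after</p></body></html>'
sel = Selector(text=body.replace('\x00', ''))
print(sel.xpath('//p/text()').extract_first())  # 'beforeafter'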

Release Parsel 1.1.0

I think it's time to do a release. :)

Things to do before releasing:

  • document css2xpath
  • document using named variables
  • update NEWS file
  • Document CSS selectors extensions

parsel-cli - parsel as a cli application

I've been working on a small CLI parsel wrapper for interpreting CSS and XPath selectors (inspired by scrapy shell).

https://github.com/granitosaurus/parsel-cli

It puts you straight into a CSS or XPath interpreter mode (or an embedded Python shell) and evaluates input CSS/XPath selectors using parsel.

I think it's better off as a standalone tool, but maybe it's worth mentioning somewhere in the README :)

Huge text extraction

When dealing with huge text inside a tag, parsel seems to close tags incorrectly.
Here's what I've done in console (scrapy shell https://www.immobilienscout24.de/Suche/S-/P-46/Haus-Kauf/Bayern):

from parsel import Selector
s = Selector(response.text)
# Get last 100 symbols of the script tag containing 'resultListModel:'
s.xpath(f'//script[contains(text(),"resultListModel:")]').get()[-100:]
# We will get this string:
# 'istings\\/da3d4373-9dcc-4dbe-84bb-ae31d05dd057-1263509954.jpg\\/ORIG\\/legacy_thumbnail\\/%WIDT</script>'

# Now let's try to find the line with 'resultListModel:' string
lines = response.text.split('\n')
data_str = next(l for l in lines if 'resultListModel:' in l)
# And here let's find where we can find the last 100 symbols of the script tag contents
# It gave me 9993218
data_str.find('istings\\/da3d4373-9dcc-4dbe-84bb-ae31d05dd057-1263509954.jpg\\/ORIG\\/legacy_thumbnail\\/%WIDT')
# Let's look at this part of the text
data_str[9993218:9993400]
# We will get this string (correct):
# 'istings\\/da3d4373-9dcc-4dbe-84bb-ae31d05dd057-1263509954.jpg\\/ORIG\\/legacy_thumbnail\\/%WIDTH%x%HEIGHT%\\/format\\/jpg\\/quality\\/50"},{"@scale":"WHITE_FILLING","@href":"https:\\/\\/pictur'

The whole line that is stored in data_str is inside the <script> tag, but somehow it turns out that it is longer than all the contents:

In [20]: len(s.xpath(f'//script[contains(text(),"resultListModel:")]').get())
Out[20]: 9999243

In [21]: len(data_str)
Out[21]: 12004005

[Feature Request] Add support for JMESPath

Building a Selector based on JMESPath in parsel would help ease parsing JSON.
This would also help Scrapy add methods like add_json and get_json to the ItemLoader. I got this idea from scrapy/scrapy#1005.
From what I understand, the Selector in parsel has been built using lxml; how about using jmespath to build a JsonSelector?

I am not sure if this is the right feature for this library, as Parsel describes itself as a parser for XML/HTML. But adding this feature would add great value to the project.

PS: If the maintainers would like to have this feature, then I'd like to contribute it myself.
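
As an illustration of the appeal, here is a hedged sketch that combines parsel with the existing jmespath package by hand (jmespath.search is the real jmespath API; a JsonSelector wrapping it is only the idea being proposed):

import json
import jmespath
from parsel import Selector

sel = Selector(text=u'<script>{"product": {"name": "foo", "price": 42}}</script>')
data = json.loads(sel.xpath('//script/text()').extract_first())
print(jmespath.search('product.price', data))  # 42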

SelectorList.css argument is named 'xpath'

What do you think about renaming the SelectorList.css argument to 'query'? It would be a backwards-incompatible change, but I doubt there is a lot of code which uses a keyword argument for the sel.css method.

There is a similar problem in Scrapy.

Support Zorba as an alternative XML/HTML processing engine

This has been troubling me for some time now, but I would like this project to support a more powerful XML/HTML processing engine as an alternative to lxml. The only contender for lxml in Python is Zorba. But why?

  • Zorba supports XQuery technology as well as JSONiq.
  • Zorba has Python bindings. I know they are not precisely the best bindings ever but at least they exist.
  • I think XPath 1.0 is very limited for more complex structures.
  • lxml extensions are OK, but not much when compared to XQuery's capabilities out of the box.
  • Zorba can be hosted as a service.

Ideally, we should be able to use selectors with Zorba in this way:

Selector(response=response).xquery('...').extract()
or
response.selector.xquery('...').extract()

Define module name

Is there a module name suggested for this new baby? Feel free to dump ideas :)

Add option to retrieve text content

As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to either use the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. But it requires overhead to work around it, and it often creates frustration for new users. There are a number of questions about this on Stack Overflow as well as on the Scrapy bug tracker.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:

  • Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
  • Or add a parameter to .extract*()/.get(), similar to the proposal in #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.
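
A minimal workaround sketch in the meantime (extract_text is a hypothetical helper, not parsel API): join all descendant text nodes of an element, similar to lxml's .text_content().

from parsel import Selector

def extract_text(selector, separator=u''):
    # join every descendant text node of the selected element
    return separator.join(selector.xpath('.//text()').extract())

sel = Selector(text=u'<h2>This is the <em>new</em> trend!</h2>')
print(extract_text(sel.css('h2')[0]))  # 'This is the new trend!'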

Allow passing over kwargs in .extract()

Since the extract() method is just a wrapper around lxml.etree.tostring, it would make sense to allow kwargs passthrough. This would enable more flexibility for getting string data.

For example, it would allow usage of the pretty_print kwarg:

>>> foo = '<body><div>hi</div></body>'
>>> from parsel import Selector
>>> Selector(text=foo).extract()
'<html><body><div>hi</div></body></html>'
# vs
>>> print(Selector(text=foo).extract(pretty_print=True, method='xml'))
<html>
  <body>
    <div>hi</div>
  </body>
</html>

See my attempt at patch for this: #101
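
Until something like that lands, a hedged workaround is to call lxml directly on the selector's underlying element via Selector.root:

from lxml import etree
from parsel import Selector

sel = Selector(text='<body><div>hi</div></body>')
print(etree.tostring(sel.root, pretty_print=True, method='xml').decode())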

Misleading "data=" in Selector representation

Motivation: https://stackoverflow.com/questions/44407581/python-scrapy-output-is-cut-off-hence-wount-let-me-correctly-build-queries

With

<Selector xpath='//div[@id="all_game_info"]' data=u'<div id="all_game_info" class="table_wrapper columns'>

the user thought that the selector had only extracted u'<div id="all_game_info" class="table_wrapper columns'

Suggestions:

  • change data= to something like data-preview=
  • or add ... at the end, and maybe indicate the length of the extracted data

remove Scrapy dependency

Selectors shouldn't import from Scrapy; otherwise splitting them into a separate library doesn't provide benefits.

LookupError: unknown encoding: 'unicode'

When running on Windows, the error "LookupError: unknown encoding: 'unicode'" is thrown on line 226 of selector.py. This happened when scraping a webpage.

Bad HTML parsing

Given the same HTML code, here is what different parsers see:

=== HTML ===
<li>
 one
 <div>
</li>
<li>
 two
</li>
=== parsel (lxml) (marginal interpretation) ===
<html><body><li>
 one
 <div>

<li>
 two
</li></div></li></body></html>
=== html.parser ===
<li>
 one
 <div>
 </div>
</li>
<li>
 two
</li>
=== lxml (same problem as parsel of course) ===
<html>
 <body>
  <li>
   one
   <div>
    <li>
     two
    </li>
   </div>
  </li>
 </body>
</html>
=== html5lib (Parses pages the same way a web browser does) ===
<html>
 <head>
 </head>
 <body>
  <li>
   one
   <div>
   </div>
  </li>
  <li>
   two
  </li>
 </body>
</html>

It is very annoying when the parsing differs from what a web browser does. It would be a good addition to provide a way to use something other than lxml.

#!/usr/bin/env python

from parsel import Selector
from bs4 import BeautifulSoup

print('=== HTML ===')
html = '''<li>
 one
 <div>
</li>
<li>
 two
</li>'''
print(html)

print('=== parsel (lxml) (marginal interpretation) ===')
sel = Selector(text=html)
print(sel.extract())

print('=== html.parser ===')
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

print('=== lxml (same problem as parsel of course) ===')
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

print('=== html5lib (Parses pages the same way a web browser does) ===')
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())

functools32 dependency?

Isn't the functools32 dependency missing as a PY2 requirement?
I'm wondering how that builds correctly.

removing text when `<` is inside

>>> from parsel import Selector
>>> s = Selector(text=u'<html><body>Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (<2cm), Belt Length: 93cm</body></html>')

>>> s.extract()
u'<html><body>Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (</body></html>'

The text after < is removed.

custom xpath support

Could custom XPath functions be added here? For example:

# Original Source: https://gist.github.com/shirk3y/458224083ce5464627bc
from lxml import etree

CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"

def has_class(context, *classes):
    """
    This lxml extension allows to select by CSS class more easily
    >>> ns = etree.FunctionNamespace(None)
    >>> ns['has-class'] = has_class
    >>> root = etree.XML('''
    ... <a>
    ...     <b class="one first text">I</b>
    ...     <b class="two text">LOVE</b>
    ...     <b class="three text">CSS</b>
    ... </a>
    ... ''')
    >>> len(root.xpath('//b[has-class("text")]'))
    3
    >>> len(root.xpath('//b[has-class("one")]'))
    1
    >>> len(root.xpath('//b[has-class("text", "first")]'))
    1
    >>> len(root.xpath('//b[not(has-class("first"))]'))
    2
    >>> len(root.xpath('//b[has-class("not-exists")]'))
    0
    """

    expressions = ' and '.join([CLASS_EXPR.format(c) for c in classes])
    xpath = 'self::*[@class and {}]'.format(expressions)
    return bool(context.context_node.xpath(xpath))

I think it is common practice to create custom XPath functions in different projects.
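
A hedged usage sketch for the function above, registering it in lxml's default function namespace and calling it through a parsel Selector:

from lxml import etree
from parsel import Selector

etree.FunctionNamespace(None)['has-class'] = has_class  # has_class as defined above

sel = Selector(text=u'<a><b class="one first text">I</b><b class="two text">LOVE</b></a>')
print(sel.xpath('//b[has-class("text", "first")]').extract())
# ['<b class="one first text">I</b>']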

Selector.root is not an instance of lxml.html.HtmlElement even if parser is html

I'm trying to use lxml's Cleaner without parsing the response multiple times:

from lxml.html.clean import Cleaner
cleaner = Cleaner()
sel = parsel.Selector("<html><body><style>.p {width:10px}</style>hello</body></html>")
cleaner.clean_html(sel.root)

This doesn't work because Cleaner needs a lxml.html.HtmlElement instance, while Selector.root is always lxml.etree._Element, so it doesn't have a required .rewrite_links method.

Why is lxml.etree.HTMLParser used for HTML and not lxml.html.HTMLParser?

Missing copyright/license info/file

Hello,

I am working on packaging parsel for Debian. I got from setup.py that it has a BSD license, but I am not sure which BSD license. The other Scrapy-related projects seem to use the BSD 3-clause license.

Could you confirm this or add the license file to the parsel repo?

Thanks in advance!

Document behavior of HTML comments inside script tags

Currently, the parser ignores HTML comments inside script tags, treating them as part of the text element:

>>> import parsel
>>> parsel.__version__
'1.0.2'
>>>
>>> # comments inside random nodes...
>>> node_with_comments = parsel.Selector(u'<node><!-- comment -->text here</node>')
>>> node_with_comments.xpath('//comment()')
[<Selector xpath='//comment()' data=u'<!-- comment -->'>]
>>> node_with_comments.xpath('//text()')
[<Selector xpath='//text()' data=u'text here'>]
>>> # okay, looks good
>>>
>>> # now, with comments inside a script tag:
>>> script_with_comments = parsel.Selector(u'<script><!-- comment -->alert("hello")</script>')
>>> script_with_comments.xpath('//comment()')
[]
>>> # oops, can't find the comments!
>>> script_with_comments.xpath('//text()')
[<Selector xpath='//text()' data=u'<!-- comment -->alert("hello")'>]

It looks like the parser ignores comments inside the <script> tag, considering them all part of the text.

This is a problem because people are unable to easily strip HTML comments from the JavaScript code (which is often fed to a JS parser, like js2xml).

Changing this would break backwards compatibility, but this looks like a bug to me.

Thoughts? Concerns?

Caching css_to_xpath()'s recently used patterns to improve efficiency

I profiled the scrapy-bench spider which uses response.css() for extracting information.

The profiling results are here. The function css_to_xpath() takes 5% of the total time.

When response.xpath() was used (profiling result), the items extracted per second (benchmark result) were higher.

Hence, I'm proposing caching for recently used patterns, so that the function takes less time.
I'm working on a prototype and will add the results for it too.
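
A hedged sketch of such a cache, wrapping cssselect's translator with functools.lru_cache (parsel would presumably wire this into its own translator; the names here are just for illustration):

from functools import lru_cache
from cssselect import HTMLTranslator

_translator = HTMLTranslator()

@lru_cache(maxsize=256)
def cached_css_to_xpath(css_query):
    # repeated CSS patterns hit the cache instead of being re-translated
    return _translator.css_to_xpath(css_query)

print(cached_css_to_xpath('h1 > a.title'))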

Selector mis-parses html when elements have very large bodies

This was discovered by a Reddit user, concerning an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down into a reproducible test case.

what is expected

Selector(html).css('h1') should produce all h1 elements within the document

what actually happens

Selector(html).css('h1') produces only the h1 elements that appear before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.


pip install html5lib==1.0.1
pip install parsel==1.4.0
import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{} = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

produces the output

Selector.time=0.3661611080169678
h1[0 = <h1>pre-div h1</h1>
h1[1 = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>

Make sel.xpath('.') work the same for text elements

Given:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
...         <body>
...             <h1>Hello, Parsel!</h1>
...         </body>
...         </html>""")

For text, you get:

>>> subsel = sel.css('h1::text')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1/text()' data=u'Hello, Parsel!'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[]

However, regular elements work as you would expect:

>>> subsel = sel.css('h1')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1' data=u'<h1>Hello, Parsel!</h1>'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[<Selector xpath='.' data=u'<h1>Hello, Parsel!</h1>'>]

I believe text elements should work the same. '.' should select them if they are the current element.

Text Starting with "<-" are Ignored

Text starting with "<-" within the HTML body is completely ignored; examples follow.

Note: XML tag names starting with a hyphen are invalid as per the W3C XML spec

Example 1

>>> html = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title></title><div>Release Date</div></body></html>'

Example 2

>>> html = '<html><body><title><-Thor></title></body></html>'
>>> Selector(html).extract()
'<html><body><title></title></body></html>'

Example 3

>>> html = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title>Avengers-&gt;</title><div>Release Date</div></body></html>'

Get lxml node (HtmlElement)

Is it possible to get a <class 'lxml.html.HtmlElement'>? For example:

from parsel import Selector
sel = Selector("<html><body> <div><h1>Header1</h1><p>any text..</p></div> </body></html>")
h1 = sel.xpath(".//div/h1").extract_first()
type(h1)  # Why str? How do I get an lxml.html.HtmlElement?
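
For reference, a hedged answer sketch: extract_first() serializes to a string by design; the underlying lxml element is available through a Selector's .root attribute instead (though, as noted in another issue here, it may be an lxml.etree._Element rather than lxml.html.HtmlElement):

from parsel import Selector

sel = Selector(text="<html><body> <div><h1>Header1</h1><p>any text..</p></div> </body></html>")
h1 = sel.xpath(".//div/h1")[0]
print(type(h1.root))  # the underlying lxml element class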

Implement re_first for Selector

Copied from scrapy/scrapy#1907

Currently only SelectorList supports the re_first shortcut method. It would be useful to have this method in Selector too.

from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).re_first
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'Selector' object has no attribute 're_first'
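
A hedged workaround until Selector grows re_first: call .re() (which Selector does have) and take the first match yourself.

from parsel import Selector

body = '<html><body><span>good</span></body></html>'
matches = Selector(text=body).re(r'<span>(.*?)</span>')
print(matches[0] if matches else None)  # 'good'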

Interactive demo for `parsel` code examples

Hello folks!

I've been using parsel lately and found it really interesting. I'm even thinking about adding it to our curriculum to teach our students at https://rmotr.com/ (co-founder and teacher here). In our DS program we have a section on data scraping (we love Scrapy) and data cleaning, and this seems to be a very convenient library for that.

While looking at the code examples in the README file and the docs, I thought it would be great to provide an interactive demo that people can use to play with the library before committing to downloading and installing it locally.

A demo is worth a thousand words 😉

I spent some time compiling parsel examples into a Jupyter Notebook file, and adapted it in a way that we can open it with a small service we have at RMOTR to launch Jupyter environments online. Note that parsel and requests (both required in examples) are already installed when the env is loaded, so people can start using it right away.

The result is what you can see in my fork of the repo (see the new "Demo" section):
https://github.com/martinzugnoni/parsel

Do you think that having such an interactive demo would help people get to know and use parsel?
Let's use this issue as a kick-off to start a discussion. Nothing here is written in stone, and we can change everything as you wish.

I hope you like it, and I truly appreciate any feedback.

Thanks.

[idea] Flexible `replace_entities` on `Selector.re` and `Selector.extract`

As of the latest commit (2c87fe4), Selector.extract does not replace entities, while Selector.re does:

>>> import parsel
>>> sel = parsel.Selector(text=u'<script>{"foo":"bar&quot;baz&quot;"}</script>')
>>> sel.css('script::text').extract_first()
u'{"foo":"bar&quot;baz&quot;"}'
>>> sel.css('script::text').re_first('(.*)')
u'{"foo":"bar"baz""}'

The related code is at line 72 of parsel/utils.py

However, in some specific cases (e.g. the example above), we may want to control this behavior.

What do you think about adding optional arguments to both methods to control the replace_entities behavior?
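
For illustration, a hedged sketch that reproduces the asymmetry by applying w3lib's replace_entities (as far as I can tell, the helper used by the code referenced above) to an extracted value by hand:

from w3lib.html import replace_entities
from parsel import Selector

sel = Selector(text=u'<script>{"foo":"bar&quot;baz&quot;"}</script>')
raw = sel.css('script::text').extract_first()
print(raw)                    # u'{"foo":"bar&quot;baz&quot;"}'
print(replace_entities(raw))  # u'{"foo":"bar"baz""}'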

Documentation

Documentation in the docs/ folder is for Scrapy selectors, not for parsel. We should update it and set up ReadTheDocs integration.

Is it possible to modify the response content through Scrapy Selector?

I am using Scrapy to copy some content from one page: crawl the content, download the images in that content, and update each image's original attribute accordingly.

For example I have:

<div class="A">
    <img original="example1.com/1/1.png"></img>
</div>

I need to download the image and update the image's original value (for example to mysite.com/1/1.png), then save the content.

What I will have finally is:

<div class="A">
    <img original="mysite.com/1/1.png"></img>
</div>

and image on my disk.

Is it possible to modify the value through Selector?

Or must I download the image first and update the "original" value separately? Is there any better solution?

Another use case:
On some pages, content is hidden by a CSS setting (style="display:none"), or we want to do some preprocessing of the content, so we need to inspect the content and update it.
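
A hedged sketch of one possible approach (not official guidance): mutate the underlying lxml tree through each selector's .root element, then re-serialize; extract() serializes from the live tree, so the change shows up.

from parsel import Selector

html = '<div class="A"><img original="example1.com/1/1.png"></div>'
sel = Selector(text=html)
for img in sel.xpath('//div[@class="A"]/img'):
    old = img.root.get('original')
    img.root.set('original', old.replace('example1.com', 'mysite.com'))
print(sel.extract())  # the serialization reflects the modified attribute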
