rdflib / pyrdfa3

RDFa 1.1 distiller/parser library: can extract RDFa 1.1 (and RDFa 1.0, if properly set via a @version attribute) from (X)HTML, SVG, or XML in general. The module can be used to produce serialized versions of the extracted graph, or simply an RDFLib Graph.

Home Page: http://www.w3.org/2012/pyRdfa/

License: Other

HTML 77.66% CSS 0.89% JavaScript 0.29% Shell 0.01% Python 21.15%

pyrdfa3's Introduction

The package can be downloaded from PyPI via pip install pyRdfa3.

Note: since I retired a few months ago, I no longer really maintain this package. I would be more than happy if an interested party took over. In the meantime, I have "archived" the repository to clearly signal that there is no maintenance. I would be happy to unarchive it and transfer ownership if someone is interested.
@iherman

PyRDFA

What is it

pyRdfa distiller/parser library. The distribution contains:

./pyRdfa: the Python library. You should copy the directory somewhere into your PYTHONPATH. Alternatively, you can also run the

python setup.py install

script in the directory.
./scripts/CGI_RDFa.py: can be used as a CGI script to invoke the library. It has to be adapted to the local server setup, namely by setting the right paths.
./scripts/localRdfa.py: script that can be run locally to transform a file into RDF (on the standard output). Run the script with "-h" to get the available flags.
./Doc-pyRdfa: (epydoc) documentation of the classes and functions
./Additional_Packages: some additional packages that are necessary for the library, added here for easier distribution. Each of those libraries must be installed separately. The exception is RDFLib, which should be installed directly from the server.

The package primarily depends on:

RDFLib: http://rdflib.net. Version 3.2.0 or higher is strongly recommended.
html5lib: https://github.com/html5lib/html5lib-python
simplejson: http://undefined.org/python/#simplejson (in the additional packages folder), needed if the JSON serialization is used and if the underlying python version is 2.5 or lower
isodate: http://hg.proclos.com/isodate (in the additional packages folder) which, in some cases, is missing and RDFLib complains (?)

At the moment, the JSON-LD serialization depends on an external JSON-LD serializer. The package comes with a simple one, but if Niklas Lindström's rdflib_jsonld package is available, then this will be used. The former is not really maintained; the latter is in github: https://github.com/RDFLib/rdflib-jsonld. Note that, eventually, this serializer will find its way to the core RDFLib distribution.

The package has been tested on Python version 2.6 and higher and has been adapted to Python 3.

For the details on RDFa 1.1, see:

possibly:

http://www.w3.org/TR/rdfa-primer/

pyrdfa3's Issues

missing documentation

I am looking for documentation on how to use this. I tried using it like other parsers:

import rdflib


rdfa_text = """
    <div vocab="http://schema.org/" typeof="Order">
      <div property="seller" typeof="Organization">
        <b property="name">ACME Supplies</b>
      </div>
      <div property="customer" typeof="Person">
        <b property="name">Jane Doe</b>
      </div>
      <div property="orderedItem" typeof="OrderItem">
        Item number: <span property="orderItemNumber">abc123</span>
        <span property="orderQuantity">1</span>
        <div property="orderedItem" typeof="Product">
          <span property="name">Widget</span>
        </div>
        <link property="orderItemStatus" href="http://schema.org/OrderDelivered" />Delivered
        <div property="orderDelivery" typeof="ParcelDelivery">
          <time property="expectedArrivalFrom">2015-03-10</time>
        </div>
      </div>
      <div property="orderedItem" typeof="OrderItem">
        Item number: <span property="orderItemNumber">def456</span>
        <span property="orderQuantity">4</span>
        <div property="orderedItem" typeof="Product">
          <span property="name">Widget accessories</span>
        </div>
        <link property="orderItemStatus" href="http://schema.org/OrderInTransit" />Shipped
        <div property="orderDelivery" typeof="ParcelDelivery">
          <time property="expectedArrivalFrom">2015-03-15</time>
          <time property="expectedArrivalUntil">2015-03-18</time>
        </div>
      </div>
    </div>
"""

g = rdflib.Graph()
g.parse(data=rdfa_text, format="rdfa")

But that won't work:

Traceback (most recent call last):
  File "rdfa.py", line 54, in <module>
    g.parse(data=rdfa_text, format="rdfa")
  File "/home/f.ludwig/projects/rdflib/rdflib/graph.py", line 1075, in parse
    parser.parse(source, self, **args)
  File "/home/f.ludwig/.local/share/virtualenvs/dbe-FuKuixvd/lib/python3.8/site-packages/pyRdfa/rdflibparsers.py", line 138, in parse
    self._process(graph, pgraph, baseURI, orig_source,
  File "/home/f.ludwig/.local/share/virtualenvs/dbe-FuKuixvd/lib/python3.8/site-packages/pyRdfa/rdflibparsers.py", line 180, in _process
    _check_error(processor_graph)
  File "/home/f.ludwig/.local/share/virtualenvs/dbe-FuKuixvd/lib/python3.8/site-packages/pyRdfa/rdflibparsers.py", line 60, in _check_error
    raise Exception("RDFa parsing Error! %s" % msg)
Exception: RDFa parsing Error! name 'file' is not defined

how hard would it be to extend pyRdfa3 to lxml.etree?

Hey there,

Just tried to feed graph_from_DOM an already-parsed lxml.etree document and I tripped over the fact that it only speaks xml.dom.minidom. Since both these APIs give access to roughly the same information (at least as far as RDFa is concerned), I'd be okay with trying to make it handle both—unless it was too much of a snarl, or you didn't want it to for some reason.

Thoughts?
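
Until the library accepts other tree types, one possible bridge is to serialize the already-parsed tree and re-parse it with xml.dom.minidom, at the cost of a second parse. The sketch below uses the standard library's xml.etree.ElementTree so that it runs anywhere; lxml.etree offers a tostring function with the same basic usage. The helper name etree_to_minidom is made up for this illustration and is not part of pyRdfa.

```python
import xml.etree.ElementTree as ET  # stand-in; lxml.etree offers a similar tostring()
from xml.dom import minidom


def etree_to_minidom(root):
    """Serialize an ElementTree element and re-parse it as a minidom Document."""
    return minidom.parseString(ET.tostring(root))


root = ET.fromstring(
    '<div vocab="http://schema.org/" typeof="Person">'
    '<span property="name">Jane Doe</span></div>'
)
dom = etree_to_minidom(root)
print(dom.documentElement.getAttribute("typeof"))
# The resulting Document is the kind of object graph_from_DOM expects.
```

This keeps the library untouched, but large documents pay for serialization and a second parse, so native support would still be the cleaner fix.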

pyrdfa3 Adaptation

How can I adapt the pyrdfa3 parser to get the dynamic XPath?

Bug in JSON-LD serialization

There's a bug in the JSON-LD serialization. Type coercions are applied to the full IRIs instead of applying them to the compact IRIs used in the document. Example:

{
  "@context": {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#isDefinedBy": {  <-- should be rdfs:isDefinedBy
      "@type": "@id"
    },
    "http://www.w3.org/2000/01/rdf-schema#seeAlso": {      <-- should be rdfs:seeAlso
      "@type": "@id"
    }
  },
  "@id": "http://www.w3.org/ns/json-ld#context",
  "@type": "rdf:Property",
  "rdfs:label": {
    "@value": "JSON-LD context",
    "@language": "en"
  },
  "rdfs:isDefinedBy": {
    "@id": "http://www.w3.org/ns/json-ld",
    "@type": "owl:Ontology",
    "rdfs:label": {
      "@value": "The JSON-LD Vocabulary",
      "@language": "en"
    },
    "rdfs:comment": {
      "@value": "This is a vocabulary document and is used to achieve certain features of the JSON-LD language.",
      "@language": "en"
    }
  },
  "rdfs:seeAlso": "http://www.w3.org/TR/json-ld-syntax/#referencing-contexts-from-json-documents",
  "rdfs:comment": {
    "@value": "This link relation is used to reference a JSON-LD context from a JSON document so that it can be interpreted as JSON-LD.",
    "@language": "en"
  }
}
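
To make the expected behaviour concrete, here is a minimal sketch of rewriting the coercion keys of a context as compact IRIs. It is not the serializer's actual code. The function names are hypothetical, and it assumes that every string-valued entry in the context is a prefix definition.

```python
def compact_iri(iri, prefixes):
    """Return prefix:suffix for the longest matching namespace, else the IRI unchanged."""
    best = None
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return iri
    return best[0] + ":" + iri[len(best[1]):]


def compact_coercion_keys(context):
    """Rewrite full-IRI keys of term definitions (dict values) as compact IRIs."""
    # Assumption: string values are prefix definitions, dict values are coercions.
    prefixes = {k: v for k, v in context.items() if isinstance(v, str)}
    result = dict(prefixes)
    for key, value in context.items():
        if isinstance(value, dict):
            result[compact_iri(key, prefixes)] = value
    return result


context = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2000/01/rdf-schema#seeAlso": {"@type": "@id"},
}
fixed = compact_coercion_keys(context)
print(sorted(fixed))
```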

html5lib seems to have python3 version now

The readme says that html5lib does not support Python 3, but the html5lib documentation says otherwise. Can we assume that pyrdfa3 has thorough Python 3 support now?

Language markup (`lang="en"`) gets injected into strings with `rdf:HTML` datatype

Given this (rather silly) RDFa markup:

<section lang="de" resource="#">
<h2>Keywords</h2>
<ul>
  <li property="schema:keywords">Gift</li>
  <li property="schema:keywords" lang="en">Gift</li>
  <li property="schema:keywords" datatype="rdf:HTML">CO<sub>2</sub> Gift</li>
</ul>
</section>

results in:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#> schema:keywords "CO<sub lang=\"de\">2</sub> Gift"^^rdf:HTML,
        "Gift"@de,
        "Gift"@en .

The lang="de" (in this case) is injected into any outer-most markup. In the scenario above that generates some nonsensical markup--which the author didn't intend. 😃

I don't believe this is intended behavior (the Ruby Distiller lacks this bug), so thought I'd report it.

Cheers!
🎩

prettyXMLserializer_3_2.py: inconsistent use of tabs and spaces

Hi,

While attempting to byte compile this package with compileall, it fails like:

RdfaExtras/serializers/XMLWriter.py'...
Compiling '/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3/lib/python3.9/site-packages/pyRdfaExtras/serializers/__init__.py'...
Compiling '/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3/lib/python3.9/site-packages/pyRdfaExtras/serializers/jsonserializer.py'...
Compiling '/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3/lib/python3.9/site-packages/pyRdfaExtras/serializers/prettyXMLserializer.py'...
Compiling '/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3/lib/python3.9/site-packages/pyRdfaExtras/serializers/prettyXMLserializer_3.py'...
Compiling '/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3/lib/python3.9/site-packages/pyRdfaExtras/serializers/prettyXMLserializer_3_2.py'...
*** Sorry: TabError: inconsistent use of tabs and spaces in indentation (prettyXMLserializer_3_2.py, line 219)
Compiling '/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3/lib/python3.9/site-packages/pyRdfaExtras/serializers/turtleserializer.py'...
error: in phase 'install': uncaught exception:
%exception #<&invoke-error program: "python" arguments: ("-m" "compileall" "--invalidation-mode=unchecked-hash" "/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3") exit-status: 1 term-signal: #f stop-signal: #f> 
phase `install' failed after 0.3 seconds
command "python" "-m" "compileall" "--invalidation-mode=unchecked-hash" "/gnu/store/5p5wl4z8391ykvxf2v6nwnp81k25n58v-python-pyrdfa3-3.5.3" failed with status 1

This is with Python 3.9.9.
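
For anyone who wants to check a file before packaging, the same failure can be reproduced with the built-in compile() function, which raises TabError on the kind of mixed indentation reported above. This is only a sketch; has_tab_error is a hypothetical helper name, and the sample source is invented for the demonstration.

```python
def has_tab_error(source):
    """Return True if compiling `source` fails because tabs and spaces are mixed."""
    try:
        compile(source, "<string>", "exec")
    except TabError:
        return True
    return False


# One line indented with a tab, the next with eight spaces, in the same block.
mixed = "if True:\n\tx = 1\n        y = 2\n"
clean = "if True:\n    x = 1\n    y = 2\n"
print(has_tab_error(mixed), has_tab_error(clean))
```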

html5lib bug resolved?

Regarding this warning:

Warning (2013-07-16): the latest version of the html5lib package has a bug. This bug manifests itself if 
the source HTML file contains non-ASCII Unicode characters. Until the bug is handled, users should 
use the older, 0.95 version. It can be downloaded at 
https://code.google.com/p/html5lib/downloads/detail?name=html5lib-0.95.tar.gz

I've tried with a recent version of html5lib and it seems to handle Unicode OK. Is there a test case I can use to confirm this?

Thank you.

pyRdfa fails on case-sensitive systems

Problem:
"import pyRdfa" on a Linux system fails.

Python 2.7.5 (default, Jul 8 2013, 09:48:59)
[GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pyRdfa
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dan/schema_rdfa_py2/lib/python2.7/site-packages/pyRdfa/__init__.py", line 132, in <module>
    from rdflib.Graph import Graph
ImportError: No module named Graph

The filesystem shows that rdflib contains a module named "graph.py", but nothing starting with a capital G:
$ ls rdflib/g*py
rdflib/graph.py

$ ls rdflib/G*.py

This particular error can be resolved by changing the following line from:

from rdflib.Graph import Graph

to:

from rdflib.graph import Graph

... however, it is only one example import of many that are subject to this case-sensitivity problem.
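
Since many imports are affected, a small fallback helper may be less error-prone than editing each line by hand. The sketch below tries a list of module names in order and returns the first that imports. The name import_first is hypothetical, and the demonstration uses standard-library names so that it runs without rdflib installed.

```python
import importlib


def import_first(candidates):
    """Import and return the first module in `candidates` that can be imported."""
    last_error = None
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as error:
            last_error = error
    raise ImportError("none of %r could be imported" % (candidates,)) from last_error


# Demonstrated with standard-library names; for this issue the candidates
# would be ["rdflib.graph", "rdflib.Graph"].
module = import_first(["no_such_module_for_demo", "json"])
print(module.__name__)
```

With the module in hand, the class would be picked up with an attribute access such as module.Graph.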

Distiller does not handle IRIs nor UTF-8 characters well when using direct input

I have an RDFa document using IRIs. The distiller web page messes up the encoding - at least when using direct input:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" prefixes="skos: http://www.w3.org/2004/02/skos/core#" lang="cs">
<head>
<title>Kódy oborů OECD FORD - Frascati manualFields of Research and Development classification (FORD) - Frascati manual</title>
</head>
<body about="https://data.mvcr.gov.cz/zdroj/číselníky/ford" typeof="skos:ConceptScheme">
<h1>Kódy oborů OECD FORD - Frascati manualFields of Research and Development classification (FORD) - Frascati manual <a href="https://data.mvcr.gov.cz/zdroj/číselníky/ford">🔗</a></h1>
<p>
Toto je HTML zobrazení číselníku <a href="https://data.mvcr.gov.cz/zdroj/číselníky/ford"><span property="skos:prefLabel" lang="cs">Kódy oborů OECD FORD - Frascati manual</span><span property="skos:prefLabel" lang="en">Fields of Research and Development classification (FORD) - Frascati manual</span></a> identifikovaného <a href="https://data.mvcr.gov.cz/zdroj/číselníky/ford">https://data.mvcr.gov.cz/zdroj/číselníky/ford</a> a publikovaného dle <a href="https://ofn.gov.cz/číselníky/">Otevřené formální normy (OFN) pro číselníky.</a>
</p>
<table rev="skos:inScheme">
<tr>
<th>IRI</th>
<th>Kód</th>

<th>Název anglicky</th>

<th>Tranzitivně širší položka</th></tr>
<tr id="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000" about="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000" typeof="skos:Concept">
<td><a href="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000">https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000</a></td>
<td property="skos:notation">10000</td>

<td property="skos:prefLabel" lang="en">1. Natural Sciences</td>
<td></td>

</tr>
<tr id="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10100" about="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10100" typeof="skos:Concept">
<td><a href="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10100">https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10100</a></td>
<td property="skos:notation">10100</td>

<td property="skos:prefLabel" lang="en">1.1 Mathematics</td>
<td><a href="#https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000" rel="skos:broaderTransitive" resource="https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000">https://data.mvcr.gov.cz/zdroj/číselníky/ford/položky/10000</a></td>

</tr>
</table>
</body>
</html>

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯/poloÅ¾ky/dpp> a skos:Concept ;
    skos:inScheme <https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯> ;
    skos:prefLabel "Dohoda o provedenÃ prÃ¡ce"@cs .

<https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯/poloÅ¾ky/dpÄ�> a skos:Concept ;
    skos:inScheme <https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯> ;
    skos:prefLabel "Dohoda o pracovnÃ Ä�innosti"@cs .

<https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯/poloÅ¾ky/plnÃ½-Ãºvazek> a skos:Concept ;
    skos:inScheme <https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯> ;
    skos:prefLabel "PracovnÃ pomÄ›r - plnÃ½ Ãºvazek"@cs .

<https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯/poloÅ¾ky/sluÅ¾ebnÃ-pomÄ›r> a skos:Concept ;
    skos:inScheme <https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯> ;
    skos:prefLabel "SluÅ¾ebnÃ pomÄ›r"@cs .

<https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯/poloÅ¾ky/zkrÃ¡cenÃ½-Ãºvazek> a skos:Concept ;
    skos:inScheme <https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯> ;
    skos:prefLabel "PracovnÃ pomÄ›r - zkrÃ¡cenÃ½ Ãºvazek"@cs .

<https://data.mvcr.gov.cz/zdroj/Ä�ÃselnÃky/typy-pracovnÃch-vztahÅ¯> a skos:ConceptScheme ;
    skos:prefLabel "Typy pracovnÃch vztahÅ¯"@cs,
        "Employment relation types"@en .

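The garbled output is consistent with UTF-8 bytes being decoded as a single-byte encoding somewhere between the form input and the parser. That is a guess at the cause, not a confirmed diagnosis of the distiller. The short sketch below reproduces the same pattern with one word from the document.

```python
original = "číselníky"

# UTF-8 bytes misread as a single-byte encoding (Latin-1 here) turn every
# non-ASCII character into two characters, as in the output above.
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Reversing the wrong decoding recovers the text, which supports the guess.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == original)
```
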
Please consider formally issuing a new code release

It has been quite some time since the last formal code release, and interesting changes have happened since then which would be nice to promote more widely.

Newest release tagged here on Github is 3.5.2, issued in April 2019.
Newest release on PyPI is 2.3.7, issued in August 2010.
Code and embedded documentation were changed to list the version as 4.0.0, in January 2020.

Please consider issuing a formal release, and to publish that to PyPI.

Whitespace-insertion missing for structured data that spans multiple block elements

Scenario: Whitespace-insertion missing for structured data that spans multiple block elements
  Given a simple (X)HTML5 document with structured data that spans multiple block elements without whitespace in between like this:
    """
      <!DOCTYPE html>
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" vocab="http://schema.org/">
      <head><title>Structured Data Example</title></head>
      <body typeof="BlogPosting">
      <main property="articleBody">
      <p>Foo</p><p>Bar</p>
      </main>
      </body>
      </html>
    """
  When extracting the structured data of articleBody
  Then the structured data should be: "Foo Bar"

The actual structured data currently is "FooBar" (space missing).

Although I haven't yet tried to trace it down to the spec, I understand that this might actually be a problem with the RDFa (and microdata, which I expect to have the same behavior) specification, as RDFa itself cannot tell the difference between inline elements and block elements. So I do not know whether this actually should be fixed. However, when the parser knows that it's HTML5+RDFa or XHTML+RDFa, the parser knows what elements are block and what elements are inline. I think it's actually a safer behavior (although I don't know whether that would be spec compliant) to insert a space between elements always, assuming that normally words are not broken by elements, and make an exception for those few elements known to break words, like all inline elements.

The practical consequence of this issue is that minified documents don't work unless they insert a single space between block-level elements, so that words in different elements are not joined.
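
The proposed behaviour can be sketched with xml.dom.minidom as follows. This is an illustration of the idea only, not pyRdfa's literal-generation code. The function name and the (deliberately short) list of block elements are assumptions, and the sketch also collapses all runs of whitespace, which the real parser may not want to do.

```python
from xml.dom import minidom

# Assumption: a small sample of HTML block-level elements, not a complete list.
BLOCK_ELEMENTS = {"p", "div", "li", "h1", "h2", "h3", "section", "main"}


def text_with_block_spacing(node):
    """Concatenate text content, separating block-level elements by a space."""
    parts = []

    def walk(n):
        if n.nodeType == n.TEXT_NODE:
            parts.append(n.data)
        elif n.nodeType == n.ELEMENT_NODE:
            is_block = n.tagName.lower() in BLOCK_ELEMENTS
            if is_block:
                parts.append(" ")
            for child in n.childNodes:
                walk(child)
            if is_block:
                parts.append(" ")

    walk(node)
    # Collapse whitespace runs and trim the ends.
    return " ".join("".join(parts).split())


doc = minidom.parseString("<main><p>Fo<b>o</b></p><p>Bar</p></main>")
print(text_with_block_spacing(doc.documentElement))
```

Inline elements such as b add no space, so a word split across them stays intact, while adjacent paragraphs are separated.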

pyrdfa3 repository vs. rdflib's copy of pyrdfa3

Repeating the question I just asked on #rdflib - why is there both a top-level pyrdfa3 repository, and a separate copy of pyrdfa3 under the rdflib repository's plugins/parsers subdirectory?

If rdflib was using git submodules to pull in pyrdfa3, that would make some sense; it would enable us to focus on pyrdfa3 as a separate module and hopefully avoid bugs like #7 where rdflib's setup.py handles the 2to3 conversion, while pyrdfa3's own does not.

As an entirely separate copy, however, the risk is that fixes to one repo won't get into the other. And the repos are in fact slightly out of sync with one another (a couple of lines concerning type checking with isinstance()).

Message 'No handlers could be found for logger "rdflib.term"' clutters output

Running localRDFa.py on Python 2.7.5, I get invalid output like

No handlers could be found for logger "rdflib.term"
@prefix cc: <http://creativecommons.org/ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

As a non-Python programmer, I'm not sure how I could avoid it. The hint at http://stackoverflow.com/questions/17393664/no-handlers-could-be-found-for-logger-rdflib-term gave me an idea, but I'm still not sure what's missing, and how I should add it.
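
A common way to avoid this message is to give the named logger a handler before the library is used, for example near the top of the script. The lines below are a sketch using only the standard logging module. Where exactly they belong in localRDFa.py is an assumption.

```python
import logging

# Option 1: silence the message by attaching a do-nothing handler.
logging.getLogger("rdflib.term").addHandler(logging.NullHandler())

# Option 2 (instead of option 1): install a default handler, which writes
# log records to stderr rather than dropping them.
# logging.basicConfig()
```

Option 1 hides rdflib's warnings entirely, while option 2 keeps them visible on stderr, so redirecting stdout to a file still gives clean Turtle.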

DeprecationWarning: the imp module is deprecated in favour of importlib

[…]/pyRdfa/utils.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import os, os.path, sys, imp, datetime

— https://travis-ci.org/scrapinghub/extruct/jobs/594543651
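
I have not checked what utils.py actually uses imp for. Assuming it is a module-availability probe in the style of imp.find_module, the importlib equivalent could look like the sketch below (module_available is a hypothetical name).

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


print(module_available("json"))
print(module_available("no_such_module_xyz_demo"))
```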

Only ASCII support in rdf:HTML datatype

Thanks for this very useful tool! I’m trying to turn this RDFa into RDF/XML using scripts/localRDFa.py (note the Unicode ellipsis characters):

<!DOCTYPE html>
<html lang="en">
  <body prefix="schema: http://schema.org/">
    <div class="entry" resource="http://example.com/blog/1" typeof="schema:BlogPosting">
      <h2 property="schema:headline">Unicode is accepted here…</h2>
      <div property="schema:articleBody" datatype="rdf:HTML">… but not here!</div>
    </div>
  </body>
</html>

It fails with these error messages:

[digicol@timsdcxvm pyrdfa3-master]$ scripts/localRDFa.py -p /tmp/unicode.html 
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 648, in graph_from_source
    return self.graph_from_DOM(dom, graph, pgraph)
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 501, in graph_from_DOM
    parse_one_node(topElement, default_graph, None, state, [])
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 67, in parse_one_node
    _parse_1_1(node, graph, parent_object, incoming_state, parent_incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 289, in _parse_1_1
    _parse_1_1(n, graph, object_to_children, state, incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 289, in _parse_1_1
    _parse_1_1(n, graph, object_to_children, state, incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 289, in _parse_1_1
    _parse_1_1(n, graph, object_to_children, state, incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 275, in _parse_1_1
    ProcessProperty(node, graph, current_subject, state, typed_resource).generate_1_1()
  File "/usr/lib/python2.6/site-packages/pyRdfa/property.py", line 126, in generate_1_1
    object = Literal(self._get_HTML_literal(self.node), datatype=HTMLLiteral)                       
  File "/usr/lib/python2.6/site-packages/rdflib-4.0.1-py2.6.egg/rdflib/term.py", line 564, in __new__
    _value, _datatype = _castPythonToLiteral(value)
  File "/usr/lib/python2.6/site-packages/rdflib-4.0.1-py2.6.egg/rdflib/term.py", line 1386, in _castPythonToLiteral
    return castFunc(obj), dType
  File "/usr/lib/python2.6/site-packages/rdflib-4.0.1-py2.6.egg/rdflib/term.py", line 1319, in _writeXML
    if s.startswith(b(u'<?xml version="1.0" encoding="utf-8"?>')):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 38: ordinal not in range(128)
Traceback (most recent call last):
  File "scripts/localRDFa.py", line 126, in <module>
    print processor.rdf_from_sources(value, outputFormat = format, rdfOutput = rdfOutput)
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 685, in rdf_from_sources
    self.graph_from_source(name, graph, rdfOutput)
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 657, in graph_from_source
    if not rdfOutput : raise b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 38: ordinal not in range(128)

If I remove the Unicode ellipsis character from the schema:articleBody, the HTML parses fine. It doesn’t hurt in the schema:headline.

I don’t know Python (yet) so I’m reporting this here, hoping that someone has the time for a hopefully quick fix. Thanks for looking into this!

Looking for a new maintainer

Dear all,

many years ago I developed this Python module. The library seems to be fairly solid; the only change I made a while ago was to adapt it to Python3 as well. The library is also what drives an RDFa extraction service at W3C.

However… I have recently retired and, although I still maintain some activities at the W3C, I do it on greatly reduced hours. Maintaining this library, possibly developing it further as RDFLib evolves, etc, is not something I can commit myself to do anymore. I am looking for a person (or persons) who would be willing to take over this responsibility.

Any takers?

Cc @RDFLib/rdflib

Error simple rdfa

I just want to get a working pyRdfa setup, but running Python 3.7 with pyRdfa3==3.5 gives the following error.

from pyRdfa import pyRdfa
import rdflib
print(pyRdfa().rdf_from_source('pyrdfa.xml'))

Input file pyrdfa.xml

<html xmlns="http://www.w3.org/1999/xhtml"
      prefix="cal: http://www.w3.org/2002/12/cal/ical#">
  <head>
    <title>Jo's Friends and Family Blog</title>
    <link rel="foaf:primaryTopic" href="#bbq" />
    <meta property="dc:creator" content="Jo" />
  </head>
  <body>
    <p about="#bbq" typeof="cal:Vevent">
      I'm holding
      <span property="cal:summary">
        one last summer barbecue
      </span>,
      on
      <span property="cal:dtstart" content="2015-09-16T16:00:00-05:00" 
            datatype="xsd:dateTime">
        September 16th at 4pm
      </span>.
    </p>
  </body>
</html>

error:

> NameError                                 Traceback (most recent call last)
> c:\users\dijkstrr\appdata\local\programs\python\python37-32\lib\site-packages\pyRdfa\__init__.py in _get_input(self, name)
>     448                                                 self.options.set_host_language(self.media_type)
> --> 449                                         return file(name)
>     450                         else :
> 
> NameError: name 'file' is not defined

What is going wrong?
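
The traceback points at the Python 2 built-in file(), which no longer exists in Python 3; open() is its replacement. A minimal sketch of what the failing line could become is shown below. The helper name open_local_source is hypothetical, and binary mode is an assumption about what the parser expects.

```python
import os
import tempfile


def open_local_source(name):
    """Python 3 replacement for the Python 2 call file(name): open for reading."""
    return open(name, "rb")  # binary mode is an assumption; the parser may expect text


# Small demonstration with a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as handle:
    handle.write(b"<html/>")
    path = handle.name

with open_local_source(path) as source:
    content = source.read()
os.remove(path)
print(content)
```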

How to handle timeout exception?

Hi,

I am using pyrdfa3 for parsing 100000 URLs but I am getting a timeout exception. I tried several methods but couldn't fix it.

Can you suggest some method?

Thanks
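
One approach, assuming each page can be fetched separately before it is handed to the parser, is to download with an explicit timeout and skip the URLs that fail. The sketch below uses only the standard library. The fetch helper and its opener parameter are made up for this illustration, and the fake opener exists so that the failure path can be shown without network access.

```python
import socket
from urllib.request import urlopen


def fetch(url, timeout=10.0, opener=urlopen):
    """Return the response body, or None if the request fails or times out."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:  # covers urllib.error.URLError and socket.timeout
        return None


def always_times_out(url, timeout):
    """Fake opener used to show the failure path without network access."""
    raise socket.timeout("timed out")


print(fetch("http://example.org/page.html", timeout=0.1, opener=always_times_out))
```

Returning None lets a batch job record the failed URL and continue with the next one instead of aborting the whole run.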

Python 3 import error in pyRdfaExtras/__init__.py

Hi,

This import fallback:

pyrdfa3/pyRdfaExtras/__init__.py

Line 35 in 404bd41

from StringIO import StringIO

is broken on Python 3.

It should be from io import StringIO.

Thanks.
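
If Python 2 still has to be supported, the usual pattern is a guarded import that tries the old module first and falls back to the new one. A minimal sketch:

```python
try:
    from StringIO import StringIO  # Python 2 location
except ImportError:
    from io import StringIO  # Python 3 location

buffer = StringIO()
buffer.write("serialized graph goes here")
print(buffer.getvalue())
```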

Error installing pyRdfaExtras due to tuple unpacking using Python 3.3

Although README.txt claims "The package has been adapted to Python 3", attempting to install pyrdfa3 under Python 3.3 fails with a syntax error in the pyRdfaExtras directory.

Steps to reproduce:

Setting up the environment:
virtualenv --python=/usr/bin/python3.3 ~/schema_rdfa
git clone https://github.com/RDFLib/pyrdfa3.git
cd pyrdfa3
Installing the package:
~/schema_rdfa/bin/python setup.py install
Error:
...
byte-compiling /home/dan/schema_rdfa/lib/python3.3/site-packages/pyRdfaExtras/__init__.py to __init__.cpython-33.pyc
File "/home/dan/schema_rdfa/lib/python3.3/site-packages/pyRdfaExtras/__init__.py", line 112
def add(self, (s,p,o)) :
^
SyntaxError: invalid syntax

It looks like Python 3 dropped support for using tuples directly in method signatures like this five years ago via PEP 3113 (http://www.python.org/dev/peps/pep-3113/) with warnings added as of Python 2.6. I expect the alternative would be something like "def add(self, triple)" and then checking to ensure that triple[0], [1], and [2] were all defined.
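
The alternative described above could look like the following sketch. TripleStore is a toy class invented for the illustration, not the class in pyRdfaExtras. Unpacking in the body also gives the length check for free.

```python
class TripleStore:
    """Toy container used only to illustrate the signature change."""

    def __init__(self):
        self.triples = []

    def add(self, triple):
        # Python 2 allowed `def add(self, (s, p, o))`; unpack in the body instead.
        s, p, o = triple  # raises ValueError if the sequence is not of length three
        self.triples.append((s, p, o))


store = TripleStore()
store.add(("subject", "predicate", "object"))
print(store.triples)
```

Callers keep passing a single tuple, so existing call sites do not need to change.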

Looking over RDFlib itself, it seems that the API is rife with tuple unpacking behaviour :/
