
eurlex's Introduction

eurlex: Retrieve Data on European Union Law


The eurlex R package reduces the overhead associated with using SPARQL and REST APIs made available by the EU Publication Office and other EU institutions. Compared to pure web-scraping, the package provides more efficient and transparent access to data on European Union laws and policies.

See the vignette for a basic walkthrough on how to use the package. Check the function documentation for the most up-to-date overview of features. Example use cases are shown in this paper.

You can use eurlex to create automatically updated overviews of EU decision-making activity, as shown here.

Installation

Install from CRAN via install.packages("eurlex").

The development version is available via remotes::install_github("michalovadek/eurlex").

Cite

Michal Ovádek (2021) Facilitating access to data on European Union laws, Political Research Exchange, 3:1, DOI: 10.1080/2474736X.2020.1870150

Basic usage

The eurlex package is designed around a typical use case: getting bulk information about EU legislation into R as fast as possible. The package contains three core functions to achieve that objective: elx_make_query() to create pre-defined or customized SPARQL queries; elx_run_query() to execute pre-made or manually written queries; and elx_fetch_data() to fire GET requests for certain metadata to the REST API.

The function elx_make_query() takes as its first argument the type of resource to be retrieved (such as "directive" or "any") from the semantic database, called Cellar, that powers Eur-Lex (and other publications). If you are familiar with SPARQL, you can always specify your own queries and execute them with elx_run_query().
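If you prefer to write SPARQL directly, a hand-rolled query can be passed straight to elx_run_query(). In this sketch, the cdm namespace URI and the cdm:work_date_document property are assumptions based on the common data model linked under "Useful resources", not package defaults:

```r
# load library
library(eurlex)

# a minimal hand-written SPARQL query; the cdm prefix and the
# cdm:work_date_document property are assumptions, not package defaults
query <- "
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT DISTINCT ?work ?date
WHERE { ?work cdm:work_date_document ?date . }
LIMIT 10
"

# execute the custom query on the default endpoint
results <- elx_run_query(query)
```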

elx_run_query() executes SPARQL queries on a pre-specified endpoint of the EU Publication Office. It outputs a data.frame where each column corresponds to one of the requested variables, while the rows accumulate observations of the resource type satisfying the query criteria. The more data is requested, the longer the execution time, ranging from a few seconds to several hours depending also on your connection. The first column always contains the unique URI of a "work" (usually a legislative act or court judgment) which identifies each resource in Cellar. Several human-readable identifiers are normally associated with each "work", but the most useful one tends to be CELEX, which is retrieved by default.

# load library
library(eurlex)

# create query
query <- elx_make_query("directive", include_date_transpos = TRUE)

# execute query
results <- elx_run_query(query)

One of the most useful things about the API is that we obtain a comprehensive list of identifiers that we can subsequently use to obtain more data relating to the document in question. While the results of the SPARQL queries can also be useful for web-scraping, the function elx_fetch_data() makes it possible to fire GET requests to retrieve data on documents with known identifiers (including Cellar URI). The function for example enables downloading the title and the full text of a document in all available languages.
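For instance, a known CELEX number can be passed directly; the "title" request type appears elsewhere in this document, while "text" is an assumption here, so check the function documentation for the exact accepted values:

```r
# load library
library(eurlex)

# retrieve the title of a document by its CELEX identifier
title <- elx_fetch_data("32001E0555", "title")

# the same identifier can be reused for other data, e.g. the full text
# ("text" is assumed to be a valid request type; see ?elx_fetch_data)
text <- elx_fetch_data("32001E0555", "text")
```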

Note

Neither this package nor its author is in any way affiliated with the EU, its institutions, offices or agencies. Please refer to the applicable data reuse policies.

Please consider contributing to the maintenance and development of the package by reporting bugs or suggesting new features.

Latest changes

eurlex 0.4.5

  • breaking change: elx_run_query() now strips URIs (except Eurovoc ones) by default and keeps only the identifier to reduce object size
  • where elx_fetch_data() is used to retrieve texts from an HTML document, it now uses rvest::html_text2() by default instead of rvest::html_text(). This is slower but in some cases better reflects how the page renders. The new argument html_text = "text2" controls this setting.
  • new feature: elx_make_query(..., include_court_origin = TRUE) retrieves the country of origin of a court case. As per Eur-Lex documentation, this is primarily intended to be the country of the national court referring a preliminary question, but other countries are also present in the data at the moment. It is recommended to use this in conjunction with the court procedure.
  • new feature: elx_make_query(..., include_original_language = TRUE) retrieves the authentic language of a document, typically a court case

eurlex 0.4.3

  • all date variables retrieved through elx_make_query(include_... = TRUE) are now properly named
  • new experimental feature: elx_make_query(include_citations_detailed = TRUE) retrieves additional details about the citation where available; the retrieval is currently slow
  • elx_make_query(include_directory = TRUE) now retrieves the directory code instead of URI

eurlex 0.4.2

  • new feature: elx_make_query(include_proposal = TRUE) retrieves the CELEX of a proposal of a requested legal act
  • the returned results from elx_make_query() no longer include previous versions of the same record (new versions typically fix incorrect or missing metadata)

eurlex 0.4.1

  • elx_fetch_data(type = "notice", notice = c("tree","branch", "object")) now mirrors the behaviour of elx_download_xml() but instead of saving to path gives access to XML notice in R
  • retrieve data on the Judge-Rapporteur, Advocate-General, court formation and court-curated scholarship using new include_ options in elx_make_query()
  • fixed bug in elx_download_xml() parameter checking
  • elx_download_xml(notice = "object") now retrieves metadata correctly

Useful resources

Guide to CELEX numbers: https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html

List of resource types in Cellar (NAL): http://publications.europa.eu/resource/authority/resource-type

NAL of corporate bodies: http://publications.europa.eu/resource/authority/corporate-body

Query builder: https://op.europa.eu/en/advanced-sparql-query-editor

Common data model: https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/cdm

SPARQL endpoint: http://publications.europa.eu/webapi/rdf/sparql

eurlex's People

Contributors

michalovadek, olivroy

eurlex's Issues

Proposal for enhancements to elx_fetch_data

Dear Michal,

I use the eurlex package on a regular basis to extract EU policy documents
with the purpose of mapping terms related to the UN 2030 SDGs in these EU policy documents.
I use elx_fetch_data to batch download the raw text of the documents.

I would like to propose two enhancements for this function:

  1. Rather than returning just 'out' for the requested resource type, the function could return both 'out' and the HTTP status code of the request, as a named list where the first element is 'out' and the second is the HTTP code.
    This would make it easy to check whether a resource was not retrieved, which is useful when dealing with a large number of documents.

  2. Insert the document XML notice among the resource type options.
    This could be a useful and efficient way to get a wealth of information for each document. The XML could then be parsed locally to extract data of interest such as directory codes, the subject matter, the instruments cited, related documents, etc.
    In many cases it might be easier and faster to work with the XML notice than to develop and run a (complex) SPARQL query.
    For an easier implementation, this option could ignore the language parameters, so that one would get the document XML notice the same way it is obtained from EUR-Lex.
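A rough sketch of the first proposal, using a hypothetical wrapper around httr (the function name fetch_with_status is made up for illustration and is not part of eurlex):

```r
library(httr)

# hypothetical wrapper illustrating proposal 1: return both the body
# ("out") and the HTTP status code as a named list
fetch_with_status <- function(url) {
  resp <- GET(url)
  list(
    out = content(resp, as = "text", encoding = "UTF-8"),
    status = status_code(resp)
  )
}

# a batch run could then filter out failures by status code:
# results <- lapply(urls, fetch_with_status)
# failed  <- Filter(function(x) x$status != 200, results)
```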

What do you think about these enhancements? Would it be difficult to implement them?

Once again, many thanks for developing and releasing such a useful and easy to use package.

Many thanks and have a nice day.

Best

Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: 32001E0555

I am suddenly getting a weird problem after using the eurlex package for a few months. When I run the elx_fetch_data() function, I get the following error:

Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: 32001E0555

Reprex:

# Load package
library(eurlex)

# Run function
elx_fetch_data("32001E0555", "title")

Thanks for all your work. Amazing package.

On another note, is it possible to gather the CELEX IDs for all acts in a given directory code, e.g. CC = 18 (Common Foreign and Security Policy), via REST instead of SPARQL?

suggestions for improving elx_download_xml and make_query

Hi Michal,

Thanks for releasing v0.4.0; I updated R and eurlex and am using it.
I recently used elx_download_xml() and wanted to suggest some improvements:

  1. Line 28 should likely be: "notice type must be correctly specified" = notice %in% c("tree", "branch", "object") (this is more of an issue)
  2. file = basename(url) could be file = paste0(basename(url), ".xml")
  3. With the current settings, when "object" is passed to notice, the object expression notice is retrieved (p. 44 of the Cellar documentation); however, this does not contain metadata. I'd suggest dropping the language header and using ?language= at the end of the URL when "object" is passed (p. 42 of the Cellar documentation), so that the object notice with the object metadata is retrieved.
  4. elx_download_xml could encapsulate a function that returns the XML notice as a string. A user could then decide whether to directly download the XML notice, or to get it as a string and parse it to extract other fields, complementing the make_query and run_query functions.
  5. About elx_make_query: you remember the issue of the 10e6 limit? A workaround/improvement could be to group together multiple items of the same property of a work. E.g. if I pass include_authors = TRUE, it could help to use (group_concat(distinct ?author_;separator=", ") as ?author) in the SELECT statement and OPTIONAL{?work cdm:work_created_by_agent ?author_.} in the WHERE statement of the SPARQL query. The URI would still be inside, but I see that as less of an issue to clean afterwards. This would help avoid duplicated works when running queries.
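The grouping idea in point 5 might look roughly as follows when written out as a full query; the GROUP BY clause (required for group_concat) is my addition to the fragment quoted in the issue, and the query as a whole is an untested sketch:

```r
# sketch of the group_concat approach from point 5: collapse multiple
# authors of the same work into a single comma-separated cell
query <- "
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT ?work (GROUP_CONCAT(DISTINCT ?author_; separator=\", \") AS ?author)
WHERE {
  ?work cdm:work_created_by_agent ?author_ .
}
GROUP BY ?work
LIMIT 10
"
```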

What do you think about these?

All the best

SPARQL query by directory code (CC)

Like the EUR-Lex expert search, is it possible to add a directory code (CC) argument to the elx_make_query() function?

This would be incredibly useful for finding all legal acts in a larger policy area. For example, tracking a country's EU defence policy requires finding all acts relating to Common Foreign and Security Policy (CC = 18).

On the expert search function of the EUR-Lex website, you can already find EU legal acts by directory code. I have attached a screenshot of this below.

EUR-Lex Expert Search: https://eur-lex.europa.eu/expert-search-form.html

EUR-Lex Expert Search

Query results are limited to 10e6

Dear Michal,

Thank you for developing such a useful package, writing useful and clear documentation, and congratulations on the very interesting article published in Political Research Exchange.

I tried your package and noticed that when I run a large query, the results are limited to 10e6 rows. Is there a way to resolve this limit?

A reproducible example is provided here:

library(eurlex)
library(dplyr)
library(ggplot2)

legal <- elx_make_query(resource_type = "any", sector = 3,
                        include_celex = TRUE, include_force = TRUE,
                        include_date = TRUE, include_date_force = TRUE,
                        include_date_endvalid = TRUE, include_eurovoc = TRUE,
                        include_directory = TRUE, include_citations = TRUE) %>%
  elx_run_query()

preparatory <- elx_make_query(resource_type = "any", sector = 5,
                              include_celex = TRUE, include_date = TRUE,
                              include_eurovoc = TRUE, include_directory = TRUE,
                              include_citations = TRUE) %>%
  elx_run_query()

dat <- as_tibble(data.frame(X = rep(0, 16000000), y = rep(0, 16000000), z = rep(0, 16000000)))

I also provide the sessionInfo() output:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252

  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     
  
  other attached packages:
  [1] ggplot2_3.3.3        dplyr_1.0.5          eurlex_0.3.5         RevoUtils_11.0.2     RevoUtilsMath_11.0.0
  
  loaded via a namespace (and not attached):
   [1] rstudioapi_0.11  xml2_1.3.2       magrittr_1.5     tidyselect_1.1.0 munsell_0.5.0    colorspace_2.0-0
   [7] R6_2.4.1         rlang_0.4.10     httr_1.4.1       tools_4.0.2      grid_4.0.2       gtable_0.3.0    
  [13] withr_2.2.0      ellipsis_0.3.1   digest_0.6.25    tibble_3.0.2     lifecycle_1.0.0  crayon_1.3.4    
  [19] farver_2.1.0     tidyr_1.1.3      purrr_0.3.4      vctrs_0.3.7      curl_4.3         glue_1.4.1      
  [25] compiler_4.0.2   pillar_1.4.6     generics_0.1.0   scales_1.1.1     pkgconfig_2.0.3 

"

Another very useful feature would be the possibility to define a start date and an end date for the query.

Thank you once again.

Best regards.

alternative identifiers

Provide options for alternative identifiers, in particular the Official Journal number. Many documents are not CELEX-indexed (especially preparatory documents, e.g. COM proposals, and sector 5 more generally).

event data

In the latest iterations of Eur-Lex there seems to be an increasing focus on event data. It would be useful to be able to retrieve these, but this is likely to require a completely new function and a new type of SPARQL query.

return title via SPARQL query

It should be possible to return document titles via SPARQL queries, but this requires moving from the WORK level to the EXPRESSION (language) level.

"summary" support for elx_fetch_data

First of all, thanks for this great project.

I wonder if there is any way of getting summaries of regulation documents using the elx_fetch_data() function?
