arxiv.py's Introduction

arxiv.py

Python wrapper for the arXiv API.

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Installation

$ pip install arxiv

In your Python script, include the line

import arxiv

Examples

Fetching results

import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
  print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)

Downloading papers

To download a PDF of the paper with ID "1605.08386v1," run a Search and then use Result.download_pdf():

import arxiv

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")

The same interface is available for downloading .tar.gz files of the paper source:

import arxiv

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")

Fetching results with a custom client

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10.0,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)

Logging

To inspect this package's network behavior and API logic, configure a DEBUG-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979

Types

Client

A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.

Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.

Search

A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.

Result

The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.

Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.

arxiv.py's People

Contributors

arkel23, jacquerie, japoneris, lukasschwab, mdamien, mhils, miguel-asm, msoelch, natfarleydev, rishabh-bhargava, santosh-gupta, windisch, ziadmodak

arxiv.py's Issues

Missing documentation of expected compound-query encoding

Motivation

We need to run advanced arXiv queries such as ?search_query=au:del_maestro+AND+ti:checkerboard

The problem is that urlencode percent-encodes certain key characters, such as the colon. @IceKhan13

This is so we can use compound queries.

Solution

Quick and dirty patch solution.
WARNING: Not backward compatible

from urllib.parse import urlencode

import arxiv


class ClientZ(arxiv.Client):
    def _format_url(self, search: arxiv.Search, start: int, page_size: int) -> str:
        """
        Construct a request API URL for search that returns up to `page_size`
        results starting with the result at index `start`.

        PATCH: leave `search_query` unencoded so that Boolean expressions work.
        """
        url_args = search._url_args()
        url_args.update({
            "start": start,
            "max_results": page_size,
        })
        # Replaces: return self.query_url_format.format(urlencode(url_args))
        # Pop search_query and append it verbatim; urlencode only the rest.
        search_query = url_args.pop('search_query')
        text = f"search_query={search_query}&" + urlencode(url_args)
        return self.query_url_format.format(text)
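
A less invasive variation, sketched here with only the standard library, is to tell urlencode which characters to leave alone through its safe parameter, rather than splicing the raw query into the URL. The helper name build_query_string is hypothetical and not part of arxiv.py, and the sketch does not verify how the arXiv API responds to either encoding.

```python
from urllib.parse import urlencode


def build_query_string(search_query: str, start: int, max_results: int) -> str:
    """Encode API parameters while leaving ':' in field prefixes unescaped.

    Hypothetical helper for illustration; not part of arxiv.py.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # safe=":" keeps colons literal; spaces are still encoded as '+'.
    return urlencode(params, safe=":")


# Default behavior: the colon is percent-encoded.
default = urlencode({"search_query": "au:del_maestro AND ti:checkerboard"})
# With safe=":" the field prefixes survive intact.
patched = build_query_string("au:del_maestro AND ti:checkerboard", 0, 10)
print(default)
print(patched)
```

Because only the colon is exempted, every other reserved character is still escaped, which keeps the change narrower than appending the query verbatim.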

download preprint source files

Hi! First of all, thanks a lot for the great job! I think this package is really useful.

I am using this package to perform some analytics on arXiv papers related to my research field. I would need to be able to download the source file (.tex and figures) of the papers instead of the pdf. Would that be possible?

Cheers!

Update test cases

  • Clean up existing test cases
    • Individuate tests.
    • Nicer logging.
  • Write more test cases.

Blocking 0.4.0 release.

sort_by argument for function query

Hello, thank you for sharing this code.
The pip version (pip install arxiv) raises an error when calling the function query() after installation.
The code generating the error is:

>>> import arxiv
>>> arxiv.query(search_query="a query",
... id_list=[],
... prune=True,
... start=0,
... max_results=10,
... sort_by="submittedDate",
... sort_order="descending")
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
TypeError: query() got an unexpected keyword argument 'sort_by'

However, this is solved by simply downloading and installing the current master. I guess the PyPI release and master are out of sync?
Thanks

Download the chosen chapter

Hi,
thank you for the great package!
Is there a way to download only, e.g., the first chapter or a chosen one, similar to how you can access the summary?
Or do you know of any tool that I could use for that after downloading the full PDF?

Support date queries

Thanks for the great code!

Would it be possible to specify a to/from date range for queries?

In the arXiv API documentation I surprisingly don't see support for this.

My use case is that I'd like to use arxiv.py for regularly checking new arxiv articles given search criteria. My plan is to run this periodically so I'd just like to run my search queries for articles since the last run.

Thanks for any help!
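
One client-side workaround for the periodic-check use case can be sketched as follows. It assumes the results arrive sorted newest-first (for example with sort_by = arxiv.SortCriterion.SubmittedDate) and that each result exposes a timezone-aware published datetime. FakeResult and results_since are hypothetical names used only for illustration, so the sketch runs without network access.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import takewhile
from typing import Iterable, Iterator


@dataclass
class FakeResult:
    """Stand-in for a search result; only the fields used below."""
    title: str
    published: datetime


def results_since(results: Iterable, last_run: datetime) -> Iterator:
    """Yield results published after `last_run`.

    Assumes `results` is sorted newest-first, so iteration can stop at the
    first result that is too old instead of paging through everything.
    """
    return takewhile(lambda r: r.published > last_run, results)


last_run = datetime(2021, 3, 1, tzinfo=timezone.utc)
feed = [
    FakeResult("newest", datetime(2021, 3, 3, tzinfo=timezone.utc)),
    FakeResult("newer", datetime(2021, 3, 2, tzinfo=timezone.utc)),
    FakeResult("old", datetime(2021, 2, 1, tzinfo=timezone.utc)),
]
new_titles = [r.title for r in results_since(feed, last_run)]
print(new_titles)
```

Since the input is consumed lazily, passing a results generator here would stop requesting further pages once the first too-old result is reached.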

long id_list not allowed by the API

When making a query with an id_list of 642 article IDs, I get:

File "arxiv.py/arxiv/arxiv.py", line 34, in query
raise Exception("HTTP Error " + str(results.get('status', 'no status')) + " in query")
Exception: HTTP Error 414 in query

This is probably due to some limit in the API. Does anybody know more about this?
Should the library deal with this issue or is it more appropriate to leave it to the user?
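
If the limit is on URL length, one way to stay under it is to split the ID list into smaller batches and issue one query per batch. The sketch below shows only the batching step. The helper name chunked and the batch size of 100 are assumptions, not a documented API limit.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(ids), size):
        yield list(ids[i:i + size])


# 642 placeholder IDs, matching the list length reported above.
ids = [f"1707.{n:05d}" for n in range(642)]
batches = list(chunked(ids, 100))
print(len(batches), len(batches[-1]))
```

Each batch can then be passed as its own id_list, and the per-batch results concatenated.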

ConnectionResetError encountered

I don't know what I did, but the API seems to be malfunctioning for me today.

The specific error is the following:
URLError: <urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>

System: Windows 10
Python: 3.6

I was basically running the example code

import arxiv
# Query for a paper of interest, then download
paper = arxiv.query(id_list=["1707.08567"])[0]
arxiv.download(paper)
# You can skip the query step if you have the paper info!
paper2 = {"pdf_url": "http://arxiv.org/pdf/1707.08567v1",
          "title": "The Paper Title"}
arxiv.download(paper2)

Would you mind helping me figure out what was going on? Thank you!

UnexpectedEmptyPageError at abrupt intervals

Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. Basically, the abstracts will be features for a machine learning model and it will be tasked to predict the associated terms making it a multi-label classification problem.

I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.

Here's what I am doing:

  1. Define a list of query strings I want to involve in the dataset:
query_keywords = ["image recognition", 
    "self-supervised learning", 
    "representation learning", 
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]
  2. Define a utility function:
import arxiv
from tqdm import tqdm

def query_with_keywords(query):
    search = arxiv.Search(query=query, 
                        max_results=3000,
                        sort_by=arxiv.SortCriterion.LastUpdatedDate)
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(search.results()):
        if res.primary_category=="cs.CV" or \
            res.primary_category=="stat.ML" or \
                res.primary_category=="cs.LG":

            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts
  3. Loop the above function over the list defined in step 1:
import time

wait_time = 3

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

    time.sleep(wait_time)

Now, while executing this I am abruptly running into:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in __try_parse_feed(self, url, first_page, retries_left, last_err)
    687             # Feed was never returned in self.num_retries tries. Raise the last
    688             # exception encountered.
--> 689             raise err
    690         return feed
    691 

UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=representation+learning&id_list=&sortBy=lastUpdatedDate&sortOrder=descending&start=800&max_results=100)

It's not that the search keyword has run out of pages: I have verified this, because in a new run the exception happens for a different keyword.

I was wondering if there's a way to circumvent this. Thanks so much in advance.

Missing 'title' attribute causes "AttributeError: object has no attribute 'title'" error

Description

In some edge cases, the entry returned by arXiv does not contain a valid 'title' tag (e.g. https://arxiv.org/abs/2104.12255v1). This causes an error in arxiv.py line 116:

Traceback (most recent call last):
  File ".\retrieve_arxiv.py", line 21, in <module>
    for result in big_slow_client.get(unrestricted_search):
  File "E:\Dropbox\Coding\arXiv\arxiv\arxiv.py", line 547, in get
    yield Result._from_feed_entry(entry)
  File "E:\Dropbox\Coding\arXiv\arxiv\arxiv.py", line 116, in _from_feed_entry
    title=re.sub(r'\s+', ' ', entry.title),
  File "C:\Users\gerry\Miniconda3\envs\arxiv\lib\site-packages\feedparser\util.py", line 158, in __getattr__
    raise AttributeError("object has no attribute '%s'" % key)
AttributeError: object has no attribute 'title'

Steps to reproduce

from arxiv import arxiv
import csv

search = arxiv.Search(
    id_list=['2104.12255v1'],
    sort_by=arxiv.SortCriterion.LastUpdatedDate
)

for result in search.get():
    print(result.entry_id)
    print(list(result.__dir__()))
    print()

Expected behavior

A missing title attribute should be checked for and imputed, since it is a key field.

Versions

  • python version:
    Python 3.8
  • arxiv.py version:
    arxiv.py == 1.2.0

Additional context

A workaround patch on my local copy worked (lines 541 onwards):
# Yield query results until page is exhausted.
for entry in feed.entries:
    # BUG: works around entries that are sometimes returned without a title in the feed.
    # E.g. https://arxiv.org/abs/2104.12255v1
    if not hasattr(entry, 'title'):
        entry['title'] = ''
    yield Result._from_feed_entry(entry)

Include `start` as an argument so that we can use paging for large numbers of results

Is your feature request related to a problem? Please describe.

The arXiv API has an argument called start which, when used in conjunction with max_results, allows you to page through a large number of results. The maximum number of results is 30,000, and they recommend 1,000.

https://arxiv.org/help/api/user-manual#paging

Many times there are hundreds of results for an API query. Rather than download information about all the results at once, the API offers a paging mechanism through start and max_results that allows you to download chunks of the result set at a time. Within the total results set, start defines the index of the first returned result, using 0-based indexing. max_results is the number of results returned by the query. For example, if we wanted to step through the results of a search_query of all:electron, we would construct the urls:

Describe the solution you'd like

Add start to the list of arguments.
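
The paging arithmetic described above can be sketched without touching the network. The helper page_params is a hypothetical name, and the total of 25 results and page size of 10 are arbitrary values chosen for illustration.

```python
from typing import Iterator, Tuple
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def page_params(total: int, page_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, max_results) pairs that cover `total` results."""
    for start in range(0, total, page_size):
        # The final page may be shorter than page_size.
        yield start, min(page_size, total - start)


urls = [
    API_URL + "?" + urlencode(
        {"search_query": "all:electron", "start": start, "max_results": count},
        safe=":",
    )
    for start, count in page_params(total=25, page_size=10)
]
for url in urls:
    print(url)
```

Here start uses 0-based indexing, so the second page begins at index 10 rather than 11.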

AttributeError: nonexistent IDs in `id_list`s yield invalid entries

Description

When a specified ID doesn't correspond to an arXiv paper, the results feed includes an entry element missing expected fields (id).

The status is 200, but feedparser chokes, and this package's error handling tries to access the nonexistent ID, yielding a raw AttributeError.

Steps to reproduce

Example API feed: http://export.arxiv.org/api/query?id_list=2208.05394

>>> import arxiv
>>> pub = next(arxiv.Search(id_list=["2208.05394"]).get())
Traceback (most recent call last):
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 156, in __getattr__
    return self.__getitem__(key)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 113, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 586, in results
    yield Result._from_feed_entry(entry)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 122, in _from_feed_entry
    entry.id
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 158, in __getattr__
    raise AttributeError("object has no attribute '%s'" % key)
AttributeError: object has no attribute 'id'

Expected behavior

This package's error handling should return a neatly handleable error.

Versions

  • python version: 3.7.9
  • arxiv.py version: 1.4.1

Unreliable results: pages from API are unexpectedly empty

Describe the bug
When running the query below, I receive an inconsistent count of results nearly every run. The list below contains the record counts generated by the code provided in the "To Reproduce" section. As you can see, I receive wildly different results each time. Is there a parameter setting I can adjust to receive more reliable results?

  • 1: 10,000
  • 2: 10,000
  • 3: 14,800
  • 4: 14,800
  • 5: 14,800
  • 6 (no max chunk results): 23,000
  • 7 (no max chunk results): 8,000

To Reproduce
import arxiv
import pandas as pd

test = arxiv.query(
    query="quantum",
    id_list=[],
    max_results=None,
    start=0,
    sort_by="relevance",
    sort_order="descending",
    prune=True,
    iterative=False,
    max_chunk_results=1000,
)
test_df = pd.DataFrame(test)
print(len(test_df))

Expected behavior
I am expecting a consistent count of results from this query when run back to back (say within a few minutes or so of each other).

Versions

  • python version: 3.7.4

  • arxiv.py version: 0.5.3

Use export.arxiv.org for non-API requests

Is your feature request related to a problem? Please describe.
Programmatically slamming arxiv.org is bad for their site performance, and their documentation asks us to please avoid doing so.

Describe the solution you'd like
Non-API requests––e.g. the arxiv.download functionality––should hit export.arxiv.org instead of arxiv.org: https://arxiv.org/help/bulk_data#play-nice

Describe alternatives you've considered
None

Additional context
Before making the switch, benchmark download performance.

Return dates as date time objects with time zones

Motivation

The api currently returns time.struct_time objects, but datetime objects are often nicer to work with (subjectively). For instance, it would be good to set the time zone so the dates are unambiguous.

Solution

The .updated field should be of type datetime.datetime
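
A conversion along these lines could look like the following sketch. It assumes the time.struct_time values are expressed in UTC, which is an assumption about the feed parser's output rather than something stated above. The function name to_utc_datetime is hypothetical.

```python
import calendar
import time
from datetime import datetime, timezone


def to_utc_datetime(parsed: time.struct_time) -> datetime:
    """Convert a UTC struct_time into a timezone-aware datetime."""
    # calendar.timegm interprets the struct as UTC; time.mktime would
    # interpret it as local time and shift the result.
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


parsed = time.strptime("2021-04-25T17:30:00Z", "%Y-%m-%dT%H:%M:%SZ")
updated = to_utc_datetime(parsed)
print(updated.isoformat())
```

Attaching timezone.utc explicitly is what makes the resulting date unambiguous when it is compared with, or converted to, other time zones.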

Stop requiring pytest-runner

Why is pytest-runner an installation requirement? It doesn't seem to provide anything of value for getting the upstream functionality to work.

Error when requesting query with high max_results

First of all, thank you for publishing this Python wrapper for the arXiv API.

I'm trying to use it to make a dataset of Abstract-Title pairs from arXiv papers. Ideally, this dataset would contain a lot of data, meaning that I'd need to run queries with a high max_results. However, I came across an HTTP Error 400 in query when requesting max_results = 10000.

Traceback (most recent call last):
  File "/Users/gwena/PycharmProjects/ArXivAbsTitleDataset/modules/main.py", line 125, in <module>
    make_dataset_in_group_of_queries(search_queries, max_results, min_num_words)
  File "/Users/gwena/PycharmProjects/ArXivAbsTitleDataset/modules/main.py", line 110, in make_dataset_in_group_of_queries
    make_dataset_in_query(search_query, max_results, min_num_words)
  File "/Users/gwena/PycharmProjects/ArXivAbsTitleDataset/modules/main.py", line 58, in make_dataset_in_query
    articles = arxiv.query(search_query=search_query, max_results=max_results)
  File "/Users/gwena/PycharmProjects/Tutorial/venv/lib/python3.6/site-packages/arxiv/arxiv.py", line 28, in query
    raise Exception("HTTP Error " + str(results.get('status', 'no status')) + " in query")
Exception: HTTP Error 400 in query

Enable user to use .export for PDF download

Motivation

The arxiv library uses the export.arxiv.org subdomain for querying a paper, but downloads the paper directly from arxiv.org. As a result, a user who downloads too many papers can get blocked by arXiv.

Solution

A solution would be to modify the paper PDF url to point to the corresponding .export subdomain. In the code for my personal use I simply use:

idx = paper.pdf_url.index('arxiv')
paper.pdf_url = paper.pdf_url[:idx] + 'export.' + paper.pdf_url[idx:]

where paper is a Result instance. This solution is lacking, though, since the export subdomain does not have to exist; this would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced for users who wish to download directly from arxiv.org, even though that is not advised according to the "Play Nice" section of https://arxiv.org/help/bulk_data.
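
A slightly more defensive version of the same rewrite can be sketched with urllib.parse, which replaces only the host and leaves the rest of the URL untouched. The function name to_export_url is hypothetical. Like the snippet above, it does not check that the export host actually serves the file.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(url: str) -> str:
    """Point an arxiv.org URL at export.arxiv.org; leave other hosts alone."""
    parts = urlsplit(url)
    if parts.netloc != "arxiv.org":
        # Already an export URL, or not an arXiv URL at all.
        return url
    return urlunsplit(parts._replace(netloc="export.arxiv.org"))


print(to_export_url("http://arxiv.org/pdf/1707.08567v1"))
```

Matching on the parsed host rather than on the first occurrence of the substring 'arxiv' avoids rewriting URLs that merely contain that word in their path.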

extraneous newlines and whitespace in output

Describe the bug

Various string parameters contain \n where they really shouldn't: title and abstract most prominently. In the title there's also extra whitespace.

To Reproduce

import arxiv
print("\n  " in arxiv.query(id_list=["2008.01734"])[0].title)
# >>> True

Expected behavior

No extraneous whitespace

Versions

  • python version: 3.8.3
  • arxiv.py version: 0.5.3
  • feedparser version: 5.2.1 (the problem may well be due to feedparser).
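
Until this is handled upstream, callers can normalize the strings themselves. The sketch below collapses every run of whitespace, including newlines, into a single space. The function name normalize_whitespace and the sample title are made up for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


raw_title = "A Title Broken\n  Across Two Lines"
clean_title = normalize_whitespace(raw_title)
print(clean_title)
```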

Consistently add release notes

Please consistently add release notes on the Releases tab as were previously added. The last two releases don't have any release notes. Thank you.

Use Python logging best practices

Motivation

  • Change how logging is exposed. Setting the logging level is difficult. There must be a standard Python pattern for this, e.g. using a named logger with a name other than __name__.
  • Change internal logging usage: logging upon error construction can be misleading, esp. when errors are nonterminal (i.e. resolved in retries). #43 (comment)

Solution

TODO

`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive

Error:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in results(self, search)
    552             ))
    553             page_url = self._format_url(search, offset, page_size)
--> 554             feed = self._parse_feed(page_url, first_page)
    555             if first_page:
    556                 # NOTE: this is an ugly fix for a known bug. The totalresults

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in _parse_feed(self, url, first_page)
    635         # Feed was never returned in self.num_retries tries. Raise the last
    636         # exception encountered.
--> 637         raise err
    638 
    639 

HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

Code for parsing the ID from an arXiv result object:
id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]

Code to reproduce:

ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109', '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630', '0503148', '1111.6170', '1612.04479', '0307110', '0306127', '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']

papers = arxiv.Search(id_list=ids).get()

The invalid IDs are '0112019', '0205137', etc.

The respective PDF URLs are still accessible, for example: https://arxiv.org/pdf/quant-ph/0112019.pdf

The same error is referenced in another open issue, but from the perspective of huge ID arrays (issue #15).

Apologies if I simply lack sufficient knowledge about identifier naming conventions, but it should download from all research fields, right?
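
The parsing one-liner above keeps only the last path segment, which drops the archive prefix (quant-ph in the example URL) from old-style identifiers. A sketch that keeps the prefix follows, assuming entry IDs have the form http://arxiv.org/abs/<identifier>v<version>. The function name short_id is hypothetical.

```python
import re
from urllib.parse import urlparse


def short_id(entry_id: str, keep_version: bool = False) -> str:
    """Extract the arXiv identifier, keeping any archive prefix."""
    path = urlparse(entry_id).path
    # Everything after "/abs/" is the identifier, e.g. "quant-ph/0112019v1".
    identifier = path.split("/abs/", 1)[-1]
    if keep_version:
        return identifier
    # Strip a trailing version suffix such as "v1" or "v12".
    return re.sub(r"v\d+$", "", identifier)


print(short_id("http://arxiv.org/abs/quant-ph/0112019v1"))
print(short_id("http://arxiv.org/abs/1911.10854v2"))
```

Anchoring the version pattern at the end of the string means only a trailing suffix such as v1 is stripped, never a 'v' that occurs earlier in the identifier.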

How to group boolean operators to form complex queries?

How can I use multiple ANDs? For example, if I want to get papers that have the concepts "semantic parsing" and "parsers" in their abstracts, how can I form a query for that? I only want to retrieve the papers that have both concepts. I have tried the following, but I don't get the correct results:

query = 'abs:"semantic parsing" AND "abs:parsers"'

And how can this be achieved for multiple concepts? Thanks!

Query string helpers

Atomic conditions

condition(field: "all"|"au"|..., value: string):

  • condition("au", "Balents Leon") → "au:\"Balents Leon\""
  • condition("au", "balents_leon") → "au:balents_leon"
  • condition("cat", "cond-mat.str-el") → "cat:cond-mat.str-el"
  • Open question: how to enumerate the available fields and values, when they're enumerable.

prefix  explanation
ti      Title
au      Author
abs     Abstract
co      Comment
jr      Journal Reference
cat     Subject Category
rn      Report Number
id      Id (use id_list instead)
all     All of the above

Boolean assembly

These correspond to the three Boolean operators supported by the arXiv API.

and(cond1, cond2) → "$(cond1) AND $(cond2)"

or(cond1, cond2) → "$(cond1) OR $(cond2)"

andnot(cond1, cond2) → "$(cond1) ANDNOT $(cond2)"

Grouping

group(cond) → "($(cond))"
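
A minimal Python sketch of the helpers proposed above might look like this. None of these functions exist in arxiv.py. The trailing underscores in and_ and or_ are needed because and and or are Python keywords. Quoting any value that contains whitespace is an assumption about the intended behavior.

```python
def condition(field: str, value: str) -> str:
    """Build an atomic condition such as 'au:balents_leon'."""
    # Quote multi-word values so they are treated as a single phrase.
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"{field}:{value}"


def and_(left: str, right: str) -> str:
    return f"{left} AND {right}"


def or_(left: str, right: str) -> str:
    return f"{left} OR {right}"


def andnot(left: str, right: str) -> str:
    return f"{left} ANDNOT {right}"


def group(cond: str) -> str:
    return f"({cond})"


query = and_(
    condition("au", "balents_leon"),
    group(or_(condition("cat", "cond-mat.str-el"), condition("ti", "checkerboard"))),
)
print(query)
```

The resulting string can be passed as the query argument of arxiv.Search. The helpers only assemble text and do not validate field prefixes.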

Author affiliations missing from `Result.Author`s

Description

Author affiliations are available in raw arXiv API feeds, but are not exposed by this package's Result objects.

Steps to reproduce

Apparent for any result set.

  • There's no mention of affiliations in this package's documentation or in the source code.
  • (Result)._raw.arxiv_affiliation is often defined, but it's a single string––the affiliation of one author among several.

Expected behavior

Author affiliations should be exposed by the Result.Author class.

Versions

  • python version: *
  • arxiv.py version: >= 1.0.0

Additional context

This is a long-open issue in feedparser, perhaps open since 2015: kurtmckee/feedparser#24. There's a detailed breakdown of the interaction with arXiv results here: kurtmckee/feedparser#145 (comment). I suspect arXiv will release their JSON API ––and this client library will be rewritten to use the JSON API––before this feedparser bug is resolved.

This client library could expose the single author affiliation extracted by feedparser, but this has negative impacts:

  • It may misleadingly suggest that a certain author or institution led the publication in question, which sucks from an ethical perspective.
  • Which affiliation is extracted may depend on the order of the authors, which arXiv may not guarantee. The extracted affiliation of a paper may vary.
  • The affiliation may not apply to all of the authors for a paper; exposing it is misleading.

If the single author affiliation is useful in your application, despite the noted downsides, access it with (Result)._raw.get('arxiv_affiliation').

Weird url encoding problem of arxiv API

This issue may not be a bug of this package but instead something related to how arXiv API accepts the encoded URL.

Say I want to make a query with multiple search fields, e.g. sq="au:balents_leon+AND+cat:cond-mat.str-el", and pass it to the wrapper function arxiv.query(search_query=sq) to get the results. This doesn't work. The reason lies in the urlencode() function from urllib, which encodes the URL and turns : into %3A and + into %2B. This should be fine, since the encoded URL is equivalent to the original one. However, it turns out arXiv responds differently to the original URL and the encoded URL. The experiments are as follows, tested directly in Chrome.

  1. http://export.arxiv.org/api/query?search_query=au%3Abalents_leon+AND%2Bcat%3Acond-mat.str-el: return items.
  2. http://export.arxiv.org/api/query?search_query=au%3Abalents_leon%2BAND+cat:cond-mat.str-el: return items.
  3. http://export.arxiv.org/api/query?search_query=au%3Abalents_leon%2BAND%2Bcat:cond-mat.str-el: return with only info on atom feed without any real items of papers.

Note the subtle difference in the encoding: only when both + characters are encoded does the arXiv API react unexpectedly.
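
The encoding side of this can be reproduced with the standard library alone. The sketch shows only what urlencode emits for the two spellings of the query and does not contact the API: a literal + in the input becomes %2B, whereas a space becomes +.

```python
from urllib.parse import urlencode

# Query written with literal '+' separators, as in the report above.
with_plus = urlencode({"search_query": "au:balents_leon+AND+cat:cond-mat.str-el"})
# Same query written with spaces between the terms and the operator.
with_spaces = urlencode({"search_query": "au:balents_leon AND cat:cond-mat.str-el"})

print(with_plus)
print(with_spaces)
```

This suggests writing compound queries with spaces around the Boolean operators, so that the encoded URL contains + separators rather than %2B.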

Add `__eq__` methods to result classes

Motivation

Makes it easier to find papers that match certain properties among results, esp. to look up by author.

Solution

Implement __eq__: take the naive approach and compare author names.
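
The naive approach could be sketched as follows. The Author class here is a hypothetical stand-in with only a name attribute, not the package's actual Result.Author. Defining __hash__ alongside __eq__ is a design choice that keeps instances usable in sets and as dictionary keys.

```python
class Author:
    """Minimal stand-in for an author record; compared by exact name."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other) -> bool:
        if not isinstance(other, Author):
            return NotImplemented
        return self.name == other.name

    def __hash__(self) -> int:
        # Keep hashing consistent with equality.
        return hash(self.name)

    def __repr__(self) -> str:
        return f"Author({self.name!r})"


authors = [Author("Leon Balents"), Author("Adrian Del Maestro")]
print(Author("Leon Balents") in authors)
```

With equality defined, a membership test such as the one above is enough to check whether a given author appears on a paper.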

Considered alternatives

Could try to fuzzy-match on author names to try to handle initials, but that seems noisy.

Additional context

Came up while looking into EPS-Libraries-Berkeley/volt#161.

Tune default behaviors

  • Make sure arxiv's rate-limiting doesn't silently truncate results.
  • Look into request count/results length tradeoff––might tune the default down from max_results_per_call=1000.

Blocks release 0.4.0.

Invalid entries in multi-member ID lists cause entry repetition

Description

If id_list consists of a single nonexistent––but valid––ID, arXiv returns an empty feed which is interpreted to mean "no results."

If id_list consists of both existent and nonexistent valid IDs (["0000.0000", "1707.08567"]), the feed is non-empty––it contains a single item––but it has feed.feed.opensearch_totalresults == 2. The client takes this to be a partial page, and requests a page with offset 1... which lists paper 1707.08567 again. This is an API bug.

Notably, this behavior differs depending on the nonexistent ID. Nonexistent ID 1507.58567 yields an entry with missing fields (covered in #80, fixed by #82), whereas 1407.58567 yields no entries at all (covered here).

Example: https://export.arxiv.org/api/query?id_list=1407.58567,1707.08567

Steps to reproduce

def test_invalid_id(self):
    results = list(arxiv.Search(id_list=["0000.0000"]).results())
    self.assertEqual(len(results), 0)
    results = list(arxiv.Search(id_list=["0000.0000", "1707.08567"]).results())
    print(len(results))
    self.assertEqual(len(results), 1)  # Fails: 1707.08567 appears twice.

Expected behavior

Results should not be duplicated.

Searching for ["0000.0000", "1707.08567"] should yield a single result.
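
As a caller-side stopgap, repeated entries can be filtered out while iterating. The sketch assumes each result exposes an entry_id attribute and uses SimpleNamespace objects as stand-ins for real results. The function name unique_by_entry_id is hypothetical.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def unique_by_entry_id(results: Iterable) -> Iterator:
    """Yield each result once, keyed on its entry_id, preserving order."""
    seen = set()
    for result in results:
        if result.entry_id in seen:
            continue
        seen.add(result.entry_id)
        yield result


# Simulates the repeated entry described above.
feed = [
    SimpleNamespace(entry_id="http://arxiv.org/abs/1707.08567v1"),
    SimpleNamespace(entry_id="http://arxiv.org/abs/1707.08567v1"),
]
unique = list(unique_by_entry_id(feed))
print(len(unique))
```

This hides the duplication from the caller but does not avoid the extra page request that produces it.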

Versions

  • python version: 3.7.9
  • arxiv.py version: 1.4.1

v0.5.4 broke tags

Is there any reason why the tags key is not returned in a result in v0.5.4? This works fine in v0.5.3.

"sort_by" option makes error

When I add the sort_by option, this error message is shown: TypeError: query() got an unexpected keyword argument 'sort_by'

Fetch Latest paper in certain category

Hello, thanks for your excellent work.

Is it possible to fetch the latest papers in a certain category? In the "search_query" documentation, I cannot find a field prefix for date filtering.
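
One way to approach this is to sort by submission date rather than filter on it. The sketch below only builds a query URL against the arXiv API using the standard library; it assumes the API's `cat:` field prefix and its `sortBy`/`sortOrder` parameters, and the helper name `latest_in_category_url` is hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"


def latest_in_category_url(category: str, max_results: int = 10) -> str:
    """Build a query URL for the most recently submitted papers in a category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


# "cs.LG" is just an example category.
print(latest_in_category_url("cs.LG", max_results=5))
```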

Thanks

Author comments on `Results` are incorrectly `None`

Hello, I used your code, but result.comment and result.doi are None. There's a bug on lines 131 and 133 of arxiv.py. I think the correct code is

comment=entry.arxiv_comment,

doi=entry.arxiv_doi,

Please check. Thanks.

HTTPError: HTTP Error 403: Forbidden

Describe the bug
Download error.

HTTPError: HTTP Error 403: Forbidden

To Reproduce

arxiv.download({'pdf_url':'http://arxiv.org/pdf/1706.03762'})

Versions

  • python version: 3.7.8

  • arxiv.py version: 0.5.3

Parallelize CI tests

Motivation

GitHub Actions will eventually cost money corresponding to their duration.

Solution

In CI, run several unit tests in parallel. pytest-xdist should do the trick. Things to modify:

  • requirements.txt – add pytest-xdist.
  • Makefile – need to add a separate test-ci target that runs tests in parallel.
  • .github/workflows/python-package.yml – need to install pytest-xdist rather than stock pytest.

Considered alternatives

  • Not making this change: it complicates dev dependencies, and this CI config isn't costing me anything yet.
  • Mocking requests: this would be good for the performance of all but a couple of tests (which should retain the network logic for sensitivity to API changes).

Additional context

I'll hold off on this until CI starts costing money.

Package not updated in pip?

I just ran
> pip install arxiv

as per instructions in readme and had issues with the sort_by argument in the query method.

Looking back through the closed issues, it looks like this is because pip installs an old version of the query function. The pip package hasn't been updated, so installing with pip still yields a rather old version.

It looks like the issue was closed after master was released as 0.2.3 (see issue #16). I'm not sure what is supposed to happen with the pip package when a new tag is added, but the default version installed by pip is still old and doesn't seem to have been updated.

I thought I'd open a new issue to flag that the original sort_by argument issue isn't actually fixed.

Thanks,
Andrew

Include feed in errors representing non-200 responses

Motivation

arXiv (sometimes?) includes a valid Atom feed explaining non-200 responses. For example, see #74.

Solution

_parse_feed should make a best-effort at parsing the response body when it's about to raise an HTTPError.

If it's available, include that response body in the HTTPError and in that error's representations.

Considered alternatives

N/A

Additional context

There's an open question: if a body is available, and it's a valid feed, what should we do with it?

  1. Include it as a string; don't bother to parse the XML. Let the user determine how best to do that.
  2. Include it as a feedparser object.
  3. Parse with feedparser, but process the feed entries into differentiated errors.

I suspect the right approach is to begin with option 1, just to start representing the dropped context to users, and then move on to option 3 if there are requests to differentiate HTTP errors programmatically.
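
A minimal sketch of option 1 might look like the following. The class name `FeedHTTPError`, the helper `check_response`, and the placeholder body are all hypothetical; they illustrate carrying the raw body as a string and are not the library's actual error type.

```python
from typing import Optional


class FeedHTTPError(Exception):
    """Hypothetical error for a non-200 response that keeps the raw body."""

    def __init__(self, url: str, status: int, body: Optional[str] = None):
        self.url = url
        self.status = status
        self.body = body  # Unparsed response body, if one was available.
        message = f"{url} returned HTTP {status}"
        if body:
            message += f": {body.strip()}"
        super().__init__(message)


def check_response(url: str, status: int, body: Optional[str]) -> None:
    """Raise FeedHTTPError, including the body, for any non-200 status."""
    if status != 200:
        raise FeedHTTPError(url, status, body)


try:
    # The body here is a placeholder, not a real arXiv response.
    check_response("https://export.arxiv.org/api/query", 400, "<feed>placeholder</feed>")
except FeedHTTPError as err:
    print(err)
    print(err.body)
```

Keeping the body as a plain string leaves parsing to the caller, which matches the "let the user determine how best to do that" framing of option 1.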

what's the normal time for downloading a paper?

Hi,
Thanks for your great work!
I was wondering: what's the normal time for downloading a paper?
I would like to download as many papers as possible for some research, perhaps 10K to 100K.
For now, each paper takes 10 seconds to download; is it possible to speed this up?
Thanks very much!

Add a release script

Just filing this as a to-do item for myself.

Is your feature request related to a problem? Please describe.

I have a gitignore'd markdown file with my release steps outlined; this is error-prone, and the process is somewhat complex now that I intend to more consistently mint GitHub releases: #38.

Describe the solution you'd like

A bash script; Makefile if it's sufficiently complex.

The script can simply read the package version rather than incrementing it automatically; it can then error out if the version number is the same as the most recent tag.

Describe alternatives you've considered

I could opt to automatically increment the version number, but this adds some interface complexity, e.g. --fixup, --minor, --major flags.
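
The version check itself is small. The issue proposes a bash script; the sketch below shows only the comparison logic in Python for illustration. The function name `check_release` and the assumption that tags look like "1.2.3" or "v1.2.3" are hypothetical.

```python
def check_release(package_version: str, latest_tag: str) -> None:
    """Raise ValueError if the package version matches the most recent tag."""
    # Assumed tag format: "1.2.3" or "v1.2.3".
    tagged_version = latest_tag[1:] if latest_tag.startswith("v") else latest_tag
    if package_version == tagged_version:
        raise ValueError(
            f"version {package_version} is already tagged; bump it before releasing"
        )


# A new version passes silently; a repeated version is rejected.
check_release("1.0.1", "v1.0.0")
try:
    check_release("1.0.0", "v1.0.0")
except ValueError as err:
    print(err)
```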
