lukasschwab / arxiv.py

Python wrapper for the arXiv API

License: MIT License

Python 98.72% Makefile 1.28%

arxiv arxiv-api pdf python-wrapper

arxiv.py's Issues

Results for moderate-length id_list truncated

Results for an id_list of more than 10 entries seem to be silently truncated to the first 10. This is easy to work around, but maybe a warning should be issued?

Use export.arxiv.org for non-API requests

Is your feature request related to a problem? Please describe.

Programmatically slamming arxiv.org is bad for their site performance, and their documentation asks us to please avoid doing so.

Describe the solution you'd like

Non-API requests––e.g. the arxiv.download functionality––should hit export.arxiv.org instead of arxiv.org: https://arxiv.org/help/bulk_data#play-nice

Describe alternatives you've considered

None

Additional context

Before making the switch, benchmark download performance.

download preprint source files

Hi! First of all, thanks a lot for the great job! I think this package is really useful.

I am using this package to perform some analytics on arXiv papers related to my research field. I would need to be able to download the source file (.tex and figures) of the papers instead of the pdf. Would that be possible?

Cheers!

Download the chosen chapter

Hi,
Thank you for the great package!
Is there a way to download only the first chapter, or a chosen one, similarly to how you can access the summary?
Or do you know of any tool that I could use for that after downloading the full PDF?

Weird url encoding problem of arxiv API

This issue may not be a bug of this package but instead something related to how arXiv API accepts the encoded URL.

Say I want to make a query with multiple search fields, e.g. sq="au:balents_leon+AND+cat:cond-mat.str-el", and pass it to the wrapper function arxiv.query(search_query=sq) to get the results. This doesn't work. The cause is the urlencode() function from urllib, which encodes the URL, turning : into %3A and + into %2B. This should be fine, since the encoded URL should be equivalent to the original one. However, it turns out that arXiv responds differently to the raw URL and the encoded URL. The following experiments were run directly in Chrome.

http://export.arxiv.org/api/query?search_query=au%3Abalents_leon+AND%2Bcat%3Acond-mat.str-el: returns items.
http://export.arxiv.org/api/query?search_query=au%3Abalents_leon%2BAND+cat:cond-mat.str-el: returns items.
http://export.arxiv.org/api/query?search_query=au%3Abalents_leon%2BAND%2Bcat:cond-mat.str-el: returns only the Atom feed metadata, without any paper entries.

Note the subtle difference in the encoding: the arXiv API reacts unexpectedly only when both + characters are encoded.
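
The encoding difference can be reproduced offline with urllib alone. This is a minimal sketch, assuming the query is written with literal spaces instead of being pre-joined with +; the variable names are illustrative, and whether the API accepts the second form should still be checked against a live request.

```python
from urllib.parse import urlencode

# A query already joined with "+" has its plus signs percent-encoded,
# so both separators arrive at the API as %2B.
pre_joined = urlencode({"search_query": "au:balents_leon+AND+cat:cond-mat.str-el"})

# Writing the query with spaces lets urlencode emit "+" separators itself.
with_spaces = urlencode({"search_query": "au:balents_leon AND cat:cond-mat.str-el"})

print(pre_joined)
print(with_spaces)
```

The second string keeps unencoded + separators between the terms, which is the shape of the URLs that returned items in the experiments above.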

HTTPError: HTTP Error 403: Forbidden

Describe the bug
Download error.

HTTPError: HTTP Error 403: Forbidden

To Reproduce

arxiv.download({'pdf_url':'http://arxiv.org/pdf/1706.03762'})

Versions

python version: 3.7.8
arxiv.py version: 0.5.3

Enable user to use .export for PDF download

Motivation

The arxiv library uses the export.arxiv.org subdomain for querying a paper, but downloads the paper directly from arxiv.org. As a result, users who download too many papers can get blocked by arXiv.

Solution

A solution would be to modify the paper PDF url to point to the corresponding .export subdomain. In the code for my personal use I simply use:

idx = paper.pdf_url.index('arxiv')
paper.pdf_url = paper.pdf_url[:idx] + 'export.' + paper.pdf_url[idx:]

where paper is a Result instance. This solution is incomplete, though, since the export subdomain is not guaranteed to exist; this would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced in case some users wish to download directly from arxiv.org, even though this is not advised according to the "Play Nice" section of https://arxiv.org/help/bulk_data.
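
A slightly more defensive variant of the string slicing above could parse the URL and swap only the host. This is a sketch, not the package's implementation: to_export_url is a hypothetical helper name, and it assumes the only hosts worth rewriting are arxiv.org and www.arxiv.org.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(pdf_url: str) -> str:
    """Point an arxiv.org URL at export.arxiv.org; leave other hosts alone."""
    parts = urlsplit(pdf_url)
    if parts.netloc in ("arxiv.org", "www.arxiv.org"):
        parts = parts._replace(netloc="export.arxiv.org")
    return urlunsplit(parts)


print(to_export_url("http://arxiv.org/pdf/1706.03762"))
```

Because only the host is compared, a URL that already points at the export subdomain passes through unchanged, so the helper is safe to apply twice.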

Invalid entries in multi-member ID lists cause entry repetition

Description


If id_list consists of a single nonexistent––but valid––ID, arXiv returns an empty feed which is interpreted to mean "no results."

If id_list consists of both existent and nonexistent valid IDs (["0000.0000", "1707.08567"]), the feed is non-empty––it contains a single item––but it has feed.feed.opensearch_totalresults == 2. The client takes this to be a partial page, and requests a page with offset 1... which lists paper 1707.08567 again. This is an API bug.

Notably, this behavior differs depending on the nonexistent ID. Nonexistent ID 1507.58567 yields an entry with missing fields (covered in #80, fixed by #82), whereas 1407.58567 yields no entries at all (covered here).

Example: https://export.arxiv.org/api/query?id_list=1407.58567,1707.08567

Steps to reproduce


def test_invalid_id(self):
    results = list(arxiv.Search(id_list=["0000.0000"]).results())
    self.assertEqual(len(results), 0)
    results = list(arxiv.Search(id_list=["0000.0000", "1707.08567"]).results())
    print(len(results))
    self.assertEqual(len(results), 1)  # Fails: 1707.08567 appears twice.

Expected behavior


Results should not be duplicated.

Searching for ["0000.0000", "1707.08567"] should yield a single result.
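
Until the paging logic is fixed, a caller could filter repeats on its side. The sketch below assumes each result exposes an entry_id attribute; unique_results is a hypothetical helper, and the SimpleNamespace objects only stand in for real results.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def unique_results(results: Iterable) -> Iterator:
    """Yield each result once, keyed on its entry_id, preserving order."""
    seen = set()
    for result in results:
        if result.entry_id in seen:
            continue
        seen.add(result.entry_id)
        yield result


# Stand-ins for result objects: the same paper listed twice.
duplicated = [
    SimpleNamespace(entry_id="http://arxiv.org/abs/1707.08567v1"),
    SimpleNamespace(entry_id="http://arxiv.org/abs/1707.08567v1"),
]
deduplicated = list(unique_results(duplicated))
print(len(deduplicated))
```

This hides the symptom only; the extra page request described above is still made.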

Versions

python version: 3.7.9

arxiv.py version: 1.4.1

AttributeError: nonexistent IDs in `id_list`s yield invalid entries

Description


When a specified ID doesn't correspond to an arXiv paper, the results feed includes an entry element missing expected fields (id).

The status is 200, but feedparser chokes and the error handling in this package tries to access the nonexistent ID, yielding a raw AttributeError.

Steps to reproduce


Example API feed: http://export.arxiv.org/api/query?id_list=2208.05394

>>> import arxiv
>>> pub = next(arxiv.Search(id_list=["2208.05394"]).get())
Traceback (most recent call last):
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 156, in __getattr__
    return self.__getitem__(key)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 113, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 586, in results
    yield Result._from_feed_entry(entry)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 122, in _from_feed_entry
    entry.id
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 158, in __getattr__
    raise AttributeError("object has no attribute '%s'" % key)
AttributeError: object has no attribute 'id'

Expected behavior


This package's error handling should return a neatly handleable error.

Versions

python version: 3.7.9

arxiv.py version: 1.4.1

what's the normal time for downloading a paper?

Hi,
Thanks for your great work!
I was wondering what the normal time for downloading a paper is.
I would like to download as many papers as possible to do some research, perhaps 10K to 100K of them.
For now, each paper takes about 10 seconds to download. Is it possible to speed this up?
Thanks very much!

Include `start` as an argument so that we can use paging for large numbers of results

Is your feature request related to a problem? Please describe.

The arXiv API has an argument called start which, used in conjunction with max_results, allows you to page through a large number of results. The maximum number of results is 30,000, and they recommend 1,000.

https://arxiv.org/help/api/user-manual#paging

Many times there are hundreds of results for an API query. Rather than download information about all the results at once, the API offers a paging mechanism through start and max_results that allows you to download chunks of the result set at a time. Within the total results set, start defines the index of the first returned result, using 0-based indexing. max_results is the number of results returned by the query. For example, if we wanted to step through the results of a search_query of all:electron, we would construct the urls:

Describe the solution you'd like

Include start to the list of arguments.
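
The paging scheme quoted above can be sketched without touching the network. page_urls is a hypothetical helper, the total and page size are arbitrary example values, and the base URL is the export.arxiv.org endpoint used elsewhere in these issues.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def page_urls(search_query: str, total: int, page_size: int) -> list:
    """Build one request URL per page, using 0-based `start` offsets."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        urls.append(f"{API_URL}?{urlencode(params)}")
    return urls


urls = page_urls("all:electron", total=25, page_size=10)
for url in urls:
    print(url)
```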

Stop requiring pytest-runner

Why is pytest-runner an installation requirement? It doesn't seem to provide anything of value for getting the upstream functionality to work.

Add `eq` methods to result classes

Motivation


Makes it easier to find papers that match certain properties among results, esp. to look up by author.

Solution


Implement __eq__: take the naive approach and compare author names.
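
A minimal sketch of the naive approach, using a hypothetical stand-in class rather than the package's own author type. The names are placeholders.

```python
class Author:
    """Hypothetical stand-in for the package's author class."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other) -> bool:
        if not isinstance(other, Author):
            return NotImplemented
        return self.name == other.name

    def __hash__(self) -> int:
        # Keep hashing consistent with equality so authors work in sets.
        return hash(self.name)


authors = [Author("A. Author"), Author("B. Author")]
found = Author("A. Author") in authors
print(found)
```

Defining __hash__ alongside __eq__ matters here: a class that overrides __eq__ without it becomes unhashable, which would break lookups in sets and dict keys.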

Considered alternatives


Could try to fuzzy-match on author names to try to handle initials, but that seems noisy.

Additional context


Came up while looking into EPS-Libraries-Berkeley/volt#161.

Get affiliations for each author

Hi:

It looks like only the lead author's affiliation is returned. Is it possible to get the affiliations for each author?

Thanks,
H

Use Python logging best practices

Motivation


Change how logging is exposed. Setting the logging level is difficult. There must be a standard Python pattern for this, e.g. using a named logger with a name other than __name__.
Change internal logging usage: logging upon error construction can be misleading, esp. when errors are nonterminal (i.e. resolved in retries). #43 (comment)
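
One common pattern for the first point is a fixed, named logger that the library never configures itself. This is a sketch under assumptions: the logger name "arxiv" and the fetch_page function are hypothetical, not the package's actual names.

```python
import logging

# Library side: one named logger, with no handlers or levels forced on users.
logger = logging.getLogger("arxiv")
logger.addHandler(logging.NullHandler())


def fetch_page(url: str) -> None:
    """Hypothetical library function that reports progress via the logger."""
    logger.debug("Requesting page: %s", url)


# Application side: opt in to verbose output for this one library.
logging.getLogger("arxiv").setLevel(logging.DEBUG)
fetch_page("http://export.arxiv.org/api/query?id_list=1707.08567")
```

With this split, setting the level is a one-liner for the user, and the library stays silent unless the application attaches a handler.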

Solution


TODO

Error when requesting query with high max_results

First of all, thank you for publishing this Python wrapper for the arXiv API.

I'm trying to use it to make a dataset of Abstract-Title from arXiv papers, and ideally, this dataset would have a lot of data, meaning that I'd need to request queries with a high max_results. However, I came across an HTTP Error 400 in query when requested with max_results = 10000.

Traceback (most recent call last):
  File "/Users/gwena/PycharmProjects/ArXivAbsTitleDataset/modules/main.py", line 125, in <module>
    make_dataset_in_group_of_queries(search_queries, max_results, min_num_words)
  File "/Users/gwena/PycharmProjects/ArXivAbsTitleDataset/modules/main.py", line 110, in make_dataset_in_group_of_queries
    make_dataset_in_query(search_query, max_results, min_num_words)
  File "/Users/gwena/PycharmProjects/ArXivAbsTitleDataset/modules/main.py", line 58, in make_dataset_in_query
    articles = arxiv.query(search_query=search_query, max_results=max_results)
  File "/Users/gwena/PycharmProjects/Tutorial/venv/lib/python3.6/site-packages/arxiv/arxiv.py", line 28, in query
    raise Exception("HTTP Error " + str(results.get('status', 'no status')) + " in query")
Exception: HTTP Error 400 in query

Unreliable results: pages from API are unexpectedly empty

Describe the bug
When running the query below, I receive an inconsistent count of results on nearly every run. The results below contain the record counts generated by the code provided in the "To Reproduce" section. As you can see, I receive wildly different results each time. Is there a parameter setting I can adjust to receive more reliable results?

1: 10,000
2: 10,000
3: 14,800
4: 14,800
5: 14,800
6(no max chunk results): 23,000
7 (no max chunk results): 8,000

To Reproduce
import arxiv
import pandas as pd

test = arxiv.query(
    query="quantum",
    id_list=[],
    max_results=None,
    start=0,
    sort_by="relevance",
    sort_order="descending",
    prune=True,
    iterative=False,
    max_chunk_results=1000,
)
test_df = pd.DataFrame(test)
print(len(test_df))

Expected behavior
I am expecting a consistent count of results from this query when run back to back (say within a few minutes or so of each other).

Versions

python version: 3.7.4
arxiv.py version: 0.5.3

Move dependencies to requirements.txt from setup.py

Including dev dependencies pdoc and pytest.

Update test cases

Clean up existing test cases
- Individuate tests.
- Nicer logging.
Write more test cases.

Blocking 0.4.0 release.

Confirm documentation is up-to-date

Progress:

How to group boolean operators to form complex queries?

How can I use multiple ANDs? For example, if I want to get papers that have the concepts "semantic parsing" and "parsers" in their abstracts, how can I form a query for that? I only want to retrieve the papers that have both concepts. I have tried the following, but I don't get the correct results:

query = 'abs:"semantic parsing" AND "abs:parsers"'

And how can this be achieved for multiple concepts? Thanks!
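
One likely culprit in the attempt above is that the second field prefix sits inside the quotes. A small sketch of building such a query for any number of concepts, assuming the prefix belongs outside the quotes and that only multi-word phrases need quoting; all_in_abstract is a hypothetical helper.

```python
def all_in_abstract(*concepts: str) -> str:
    """AND together one abs: condition per concept, quoting multi-word phrases."""
    conditions = []
    for concept in concepts:
        value = f'"{concept}"' if " " in concept else concept
        conditions.append(f"abs:{value}")
    return " AND ".join(conditions)


query = all_in_abstract("semantic parsing", "parsers")
print(query)
```

The resulting string can be passed as the query argument in place of the hand-written one.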

how can I query a specific category

Is there a simple way to query only the AI category, for example, instead of all the fields that exist on arXiv?

Fetch Latest paper in certain category

Hello, thanks for your excellent work.

Is it possible to fetch the latest papers in a certain category? In the search_query, we cannot find a field prefix for date filtering.

Thanks
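
Date filtering aside, sorting by submission date in descending order gets the newest papers of a category first. The sketch below only builds the request URL; latest_in_category_url is a hypothetical helper, and cs.AI is used as the example category.

```python
from urllib.parse import urlencode


def latest_in_category_url(category: str, max_results: int = 10) -> str:
    """Build a query URL for the most recently submitted papers in a category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


url = latest_in_category_url("cs.AI", max_results=5)
print(url)
```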

"sort_by" option makes error

When I add sort_by option, this error message is shown: TypeError: query() got an unexpected keyword argument 'sort_by'

Tune default behaviors

Make sure arxiv's rate-limiting doesn't silently truncate results.
Look into request count/results length tradeoff––might tune the default down from max_results_per_call=1000.

Blocks release 0.4.0.

Missing documentation of expected compound-query encoding

Motivation


Need to do advanced query for arxiv such as ?search_query=au:del_maestro+AND+ti:checkerboard

The problem is that urlencode encodes certain key characters such as colon. @IceKhan13

This is so we can use compound queries and

Solution


Quick and dirty patch solution.
WARNING: Not backward compatible

from urllib.parse import urlencode

import arxiv


class ClientZ(arxiv.Client):
    def _format_url(self, search: arxiv.Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.

        PATCH: so that we can do Boolean expression.
        """
        url_args = search._url_args()
        url_args.update({
            "start": start,
            "max_results": page_size,
        })
        # return self.query_url_format.format(urlencode(url_args)) # REPLACED THIS
        search_query = url_args.pop('search_query')  # Pop out and treat separate
        text = f"search_query={search_query}&" + urlencode(url_args) # recombine
        return self.query_url_format.format(text)

Author comments on `Results` are incorrectly `None`

Hello, I used your code, but result.comment and result.doi are None. There's a bug on lines 131 and 133 of arxiv.py. I think the right code is

comment=entry.arxiv_comment,

doi=entry.arxiv_doi,

Please check. Thanks.

v0.5.4 broke tags

Is there any reason why the tags key is not returned in a result in v0.5.4? This works fine in v0.5.3.

long id_list not allowed by the API

When making a query with an id_list of 642 article names, I get:

File "arxiv.py/arxiv/arxiv.py", line 34, in query
raise Exception("HTTP Error " + str(results.get('status', 'no status')) + " in query")
Exception: HTTP Error 414 in query

This is probably due to some limit in the API. Does anybody know more about this?
Should the library deal with this issue or is it more appropriate to leave it to the user?
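
If this is left to the user, one workaround is to split the ID list into batches and issue one query per batch. This is a sketch: chunked is a hypothetical helper, the batch size of 100 is an arbitrary assumption (not a documented API limit), and the IDs are generated placeholders.

```python
from typing import Iterator, List


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of `ids` with at most `size` entries each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder identifiers standing in for the 642 real ones.
ids = [f"1707.{n:05d}" for n in range(642)]
batches = list(chunked(ids, 100))
print(len(batches), len(batches[-1]))
```

Each batch can then be passed as its own id_list, keeping every request URL short.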

Package not updated in pip?

I just ran
> pip install arxiv

as per instructions in readme and had issues with the sort_by argument in the query method.

Looking back through the closed issues, it looks like this is because pip installs an old version of the query function. The package on pip hasn't been updated, so installing with pip still gives a rather old version.

It looks like the issue was closed after master was released as 0.2.3 (see issue #16). I'm not sure what is supposed to happen on pip when a new tag is added, but the default version installed by pip is still old and doesn't seem to be updated.

Thought I'd open a new issue to flag that original sort_by arg issue isn't actually fixed.

Thanks,
Andrew

Add a release script

Just filing this as a to-do item for myself.

Is your feature request related to a problem? Please describe.

I have a gitignore'd markdown file with my release steps outlined; this is error-prone, and the process is somewhat complex now that I intend to more consistently mint GitHub releases: #38.

Describe the solution you'd like

A bash script; Makefile if it's sufficiently complex.

This can only extract the package version and avoid automatically incrementing it; then the script can just error if the version number is the same as the most recent tag.

Describe alternatives you've considered

I could opt to automatically increment the version number, but this adds some interface complexity––e.g. --fixup, --minor, --major flags.

ConnectionResetError encountered

I don't know what I did, but my API seems to be malfunctioning today.

The specific error is the following:
URLError: <urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>

System: Windows 10
Python: 3.6

I was basically running the example code

import arxiv
# Query for a paper of interest, then download
paper = arxiv.query(id_list=["1707.08567"])[0]
arxiv.download(paper)
# You can skip the query step if you have the paper info!
paper2 = {"pdf_url": "http://arxiv.org/pdf/1707.08567v1",
          "title": "The Paper Title"}
arxiv.download(paper2)

Would you mind helping me figure out what was going on? Thank you!

`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive

Error:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in results(self, search)
    552             ))
    553             page_url = self._format_url(search, offset, page_size)
--> 554             feed = self._parse_feed(page_url, first_page)
    555             if first_page:
    556                 # NOTE: this is an ugly fix for a known bug. The totalresults

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in _parse_feed(self, url, first_page)
    635         # Feed was never returned in self.num_retries tries. Raise the last
    636         # exception encountered.
--> 637         raise err
    638 
    639 

HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

Code for parsing id from arxiv result object-
id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]

code to reproduce -

ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109', '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630', '0503148', '1111.6170', '1612.04479', '0307110', '0306127', '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']

papers = arxiv.Search(id_list=ids).get()

invalid ids are '0112019', '0205137' etc

respective pdf urls still accessible, for example : https://arxiv.org/pdf/quant-ph/0112019.pdf

The same error is referenced in another open issue but from the perspective of huge id arrays. [ issue ID : #15]

Apologies if I simply lack sufficient knowledge about identifier naming conventions but it should download from all research fields right?
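
The parsing line above keeps only the last path segment, which drops the archive part (quant-ph/) of old-style identifiers. A sketch of a parser that keeps it, assuming entry IDs are URLs of the form http://arxiv.org/abs/<identifier>; short_id is a hypothetical helper, not the package's get_short_id.

```python
import re
from urllib.parse import urlparse


def short_id(entry_id: str) -> str:
    """Return the identifier without its version, keeping any archive prefix."""
    path = urlparse(entry_id).path
    # Everything after "/abs/" is the identifier, e.g. "quant-ph/0112019v1".
    identifier = path.split("/abs/", 1)[-1]
    # Strip a trailing version suffix such as "v1" or "v12".
    return re.sub(r"v\d+$", "", identifier)


print(short_id("http://arxiv.org/abs/quant-ph/0112019v1"))
print(short_id("http://arxiv.org/abs/1707.08567v1"))
```

Stripping the version with an anchored pattern also avoids cutting at the first "v" that happens to appear in the identifier.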

Missing 'title' attribute causes "AttributeError: object has no attribute 'title'" error

Description

In some edge cases, the entry returned by arXiv does not contain a valid 'title' tag (e.g. https://arxiv.org/abs/2104.12255v1). This causes an error in arxiv.py line 116:

Traceback (most recent call last):
  File ".\retrieve_arxiv.py", line 21, in <module>
    for result in big_slow_client.get(unrestricted_search):
  File "E:\Dropbox\Coding\arXiv\arxiv\arxiv.py", line 547, in get
    yield Result._from_feed_entry(entry)
  File "E:\Dropbox\Coding\arXiv\arxiv\arxiv.py", line 116, in _from_feed_entry
    title=re.sub(r'\s+', ' ', entry.title),
  File "C:\Users\gerry\Miniconda3\envs\arxiv\lib\site-packages\feedparser\util.py", line 158, in __getattr__
    raise AttributeError("object has no attribute '%s'" % key)
AttributeError: object has no attribute 'title'

Steps to reproduce


from arxiv import arxiv
import csv

search = arxiv.Search(
    id_list=['2104.12255v1'],
    sort_by=arxiv.SortCriterion.LastUpdatedDate
)

for result in search.get():
    print(result.entry_id)
    print(list(result.__dir__()))
    print()

Expected behavior

Missing title attribute should be checked and imputed since it is a key field

Versions

python version:
Python 3.8

arxiv.py version:
arxiv.py == 1.2.0

Additional context

A workaround patch on my local copy worked:
Lines 541 onwards
# Yield query results until page is exhausted.
for entry in feed.entries:
    # BUG: Fixes a bug where sometimes the entry does not return with a title in the feed
    # E.g. https://arxiv.org/abs/2104.12255v1
    if not hasattr(entry, 'title'):
        entry['title'] = ''
    yield Result._from_feed_entry(entry)

Parallelize CI tests

Motivation


GitHub Actions will eventually cost money corresponding to their duration.

Solution


In CI, run several unit tests in parallel. pytest-xdist should do the trick. Things to modify:

requirements.txt – add pytest-xdist.
Makefile – need to add a separate test-ci target that runs tests in parallel.
.github/workflows/python-package.yml – need to install pytest-xdist rather than stock pytest.

Considered alternatives


Not making this change: it complicates dev dependencies, and this CI config isn't costing me anything yet.
Mocking requests: this would be good for the performance of all but a couple of tests (which should retain the network logic for sensitivity to API changes).

Additional context


I'll hold off on this until CI starts costing money.

Possible to return a list of citations for an article?

extraneous newlines and whitespace in output

Describe the bug

Various string parameters contain \n where they really shouldn't: title and abstract most prominently. In the title there's also extra whitespace.

To Reproduce

import arxiv
print("\n  " in arxiv.query(id_list=["2008.01734"])[0].title)
# >>> True

Expected behavior

No extraneous whitespace
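
Until this is handled in the package, the strings can be normalized on the caller's side. A minimal sketch using the same whitespace pattern that appears in a traceback elsewhere in these issues; collapse_whitespace is a hypothetical helper and the sample title is made up.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


print(collapse_whitespace("A title\n  split across lines "))
```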

Versions

python version: 3.8.3
arxiv.py version: 0.5.3
feedparser version: 5.2.1 (the problem may well be due to feedparser).

sort_by argument for function query

Hello, thank you for sharing this code.
The pip version (pip install arxiv) returns an error when running the function query() after installation:
The code generating the error is

>>> import arxiv
>>> arxiv.query(search_query="a query",
... id_list=[],
... prune=True,
... start=0,
... max_results=10,
... sort_by="submittedDate",
... sort_order="descending")
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
TypeError: query() got an unexpected keyword argument 'sort_by'

However, this is solved by simply downloading and installing the current master. I guess there is a mismatch?
Thanks

Very slow to extract more than 1000 articles

When max_results is more than 1000, it is very slow to retrieve data. Why is that?

Support date queries

Thanks for the great code!

Would it be possible to specify a to/from date range for queries?

Surprisingly, I don't see support for this in the arXiv API documentation.

My use case is that I'd like to use arxiv.py for regularly checking new arxiv articles given search criteria. My plan is to run this periodically so I'd just like to run my search queries for articles since the last run.

Thanks for any help!
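
One way to approximate this on the client side is to sort by submission date, newest first, and stop at the first result that predates the last run. The sketch assumes each result exposes a timezone-aware published datetime (an assumption about the result objects); newer_than is a hypothetical helper and the SimpleNamespace objects are stand-ins.

```python
from datetime import datetime, timezone
from itertools import takewhile
from types import SimpleNamespace


def newer_than(results, last_run: datetime):
    """Take results until the first one published at or before `last_run`.

    Assumes `results` is sorted by submission date, newest first.
    """
    return takewhile(lambda r: r.published > last_run, results)


# Stand-in results published on 3, 2 and 1 May 2021 (UTC), newest first.
results = [
    SimpleNamespace(published=datetime(2021, 5, day, tzinfo=timezone.utc))
    for day in (3, 2, 1)
]
fresh = list(newer_than(results, datetime(2021, 5, 1, 12, tzinfo=timezone.utc)))
print(len(fresh))
```

Because takewhile stops consuming the iterator at the first old result, a lazily paged result stream would not be fetched past that point.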

Author affiliations missing from `Result.Author`s

Description


Author affiliations are available in raw arXiv API feeds, but are not exposed by this package's Result objects.

Steps to reproduce


Apparent for any result set.

There's no mention of affiliations in this package's documentation or in the source code.
(Result)._raw.arxiv_affiliation is often defined, but it's a single string––the affiliation of one author among several.

Expected behavior


Author affiliations should be exposed by the Result.Author class.

Versions

python version: *
arxiv.py version: >= 1.0.0

Additional context


This is a long-open issue in feedparser, perhaps open since 2015: kurtmckee/feedparser#24. There's a detailed breakdown of the interaction with arXiv results here: kurtmckee/feedparser#145 (comment). I suspect arXiv will release their JSON API ––and this client library will be rewritten to use the JSON API––before this feedparser bug is resolved.

This client library could expose the single author affiliation extracted by feedparser, but this has negative impacts:

It may misleadingly suggest that a certain author or institution led the publication in question, which sucks from an ethical perspective.
Which affiliation is extracted may depend on the order of the authors, which arXiv may not guarantee. The extracted affiliation of a paper may vary.
The affiliation may not apply to all of the authors for a paper; exposing it is misleading.

If the single author affiliation is useful in your application, despite the noted downsides, access it with (Result)._raw.get('arxiv_affiliation').

Consistently add release notes

Please consistently add release notes on the Releases tab as were previously added. The last two releases don't have any release notes. Thank you.

Update README with changes/better documentation

Usage with pandas dataframe

Could you give an example of using the wrapper with a pandas DataFrame? The previous version could do it!
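
A possible approach is to flatten each result into a plain dict first. The sketch below assumes results expose entry_id, title and summary attributes; to_rows is a hypothetical helper and the SimpleNamespace object stands in for a real result.

```python
from types import SimpleNamespace


def to_rows(results, fields=("entry_id", "title", "summary")):
    """Turn result objects into plain dicts, one per paper."""
    return [{field: getattr(r, field, None) for field in fields} for r in results]


example = [
    SimpleNamespace(
        entry_id="http://arxiv.org/abs/1707.08567v1",
        title="The Paper Title",
        summary="...",
    )
]
rows = to_rows(example)
print(rows[0]["title"])
```

A list of dicts like rows can be passed straight to pandas.DataFrame(rows), which uses the dict keys as column names.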

Any plans for submission (sword) support?

https://arxiv.org/help/submit_sword

Include feed in errors representing non-200 responses

Motivation


arXiv (sometimes?) includes a valid Atom feed explaining non-200 responses. For example, see #74.

Solution


_parse_feed should make a best-effort at parsing the response body when it's about to raise an HTTPError.

If it's available, include that response body in the HTTPError and in that error's representations.

Considered alternatives


N/A

Additional context


There's an open question: if a body is available, and it's a valid feed, what should we do with it?

1. Include it as a string; don't bother to parse the XML. Let the user determine how best to do that.
2. Include it as a feedparser object.
3. Parse with feedparser, but process the feed entries into differentiated errors.

I suspect the right approach is to begin with 1––just to start representing the dropped context to users––and then move on to 3 if there are requests to differentiate HTTP errors programmatically.
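
The first option could look roughly like the sketch below. FeedHTTPError is a hypothetical class, not the package's HTTPError, and the 200-character cut-off for the message is an arbitrary choice.

```python
class FeedHTTPError(Exception):
    """Hypothetical error type that keeps the raw response body as a string."""

    def __init__(self, url: str, status: int, body: str = ""):
        self.url = url
        self.status = status
        self.body = body
        message = f"Page request resulted in HTTP {status} ({url})"
        if body:
            # Show only the start of the body; the full text stays on .body.
            message += f": {body[:200]}"
        super().__init__(message)


err = FeedHTTPError("http://export.arxiv.org/api/query", 400, "<feed>bad id</feed>")
print(err)
```

Keeping the untouched body on the exception leaves the door open for parsing it into differentiated errors later without changing the constructor.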

Set max_results for very long example queries in README.md

Some full results lists––e.g. for query="quantum"––take a long time to fetch, especially when the number of results per response is set to 10.

I should make all the README examples relatively snappy.

Return dates as date time objects with time zones

Motivation

The api currently returns time.struct_time objects, but datetime objects are often nicer to work with (subjectively). For instance, it would be good to set the time zone so the dates are unambiguous.

Solution

The .updated field should be of type datetime.datetime
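
The conversion itself is short. This sketch assumes the struct_time values are expressed in UTC, which is an assumption about how the feed dates are parsed; to_datetime is a hypothetical helper.

```python
import calendar
import time
from datetime import datetime, timezone


def to_datetime(parsed: time.struct_time) -> datetime:
    """Convert a UTC struct_time into a timezone-aware datetime."""
    # calendar.timegm interprets the struct_time as UTC, unlike time.mktime.
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


stamp = time.strptime("2021-04-25 12:30:00", "%Y-%m-%d %H:%M:%S")
print(to_datetime(stamp).isoformat())
```

Using calendar.timegm rather than time.mktime avoids silently applying the local time zone of the machine running the code.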

UnexpectedEmptyPageError at abrupt intervals

Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. Basically, the abstracts will be features for a machine learning model and it will be tasked to predict the associated terms making it a multi-label classification problem.

I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.

Here's what I am doing:

Define a list of query strings I want to involve in the dataset:

query_keywords = ["image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]

Define a utility function:

def query_with_keywords(query):
    search = arxiv.Search(query=query, 
                        max_results=3000,
                        sort_by=arxiv.SortCriterion.LastUpdatedDate)
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(search.results()):
        if res.primary_category=="cs.CV" or \
            res.primary_category=="stat.ML" or \
                res.primary_category=="cs.LG":

            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts

Loop the above function over the list defined in step 1:

import time

wait_time = 3

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

    time.sleep(wait_time)

Now, while executing this I am abruptly running into:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in __try_parse_feed(self, url, first_page, retries_left, last_err)
    687             # Feed was never returned in self.num_retries tries. Raise the last
    688             # exception encountered.
--> 689             raise err
    690         return feed
    691 

UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=representation+learning&id_list=&sortBy=lastUpdatedDate&sortOrder=descending&start=800&max_results=100)

It's not like the underlying keyword for search does not have any more pages, I have verified that because in a new run the exception happens for a different keyword.

Was wondering if there's a way to circumvent this. Thanks so much in advance.
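
Since the empty page shows up for a different keyword on each run, one blunt workaround is to retry the whole keyword after a pause. The helper below is a generic sketch, not part of the package: with_retries, its defaults, and the idea of retrying an entire query are assumptions.

```python
import time


def with_retries(fetch, attempts: int = 3, delay_seconds: float = 3.0,
                 retry_on=(Exception,)):
    """Call `fetch()` until it succeeds or `attempts` calls have failed."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on as err:
            last_error = err
            # Wait before trying again, but not after the final failure.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

In the loop above this could wrap each call, for example with_retries(lambda: query_with_keywords(query)), ideally with retry_on narrowed to the empty-page error. Note that a retry restarts that keyword from its first page, so results are not duplicated but the earlier requests are repeated.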

Query string helpers

Atomic conditions

condition(field: "all"|"au"|..., value: string):

condition("au", "Balents Leon") → "au:\"Balents Leon\""
condition("au", "balents_leon") → "au:balents_leon"
condition("cat", "cond-mat.str-el") → "cat:cond-mat.str-el"
Open question: how to enumerate the available fields, values when they're enumerable.

prefix	explanation
ti	Title
au	Author
abs	Abstract
co	Comment
jr	Journal Reference
cat	Subject Category
rn	Report Number
id	Id (use id_list instead)
all	All of the above

Boolean assembly

These correspond to the three Boolean operators supported by the arXiv API.

and(cond1, cond2) → "$(cond1) AND $(cond2)"

or(cond1, cond2) → "$(cond1) OR $(cond2)"

andnot(cond1, cond2) → "$(cond1) ANDNOT $(cond2)"

Grouping

group(cond) → "($(cond))"
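
The helpers outlined above could be sketched in Python as follows. The function names and_ and or_ carry a trailing underscore because and and or are reserved words; quoting only values that contain whitespace is an assumption drawn from the examples, and none of this is the package's API.

```python
def condition(field: str, value: str) -> str:
    """Build `field:value`, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"{field}:{value}"


def and_(left: str, right: str) -> str:
    return f"{left} AND {right}"


def or_(left: str, right: str) -> str:
    return f"{left} OR {right}"


def andnot(left: str, right: str) -> str:
    return f"{left} ANDNOT {right}"


def group(cond: str) -> str:
    return f"({cond})"


query = and_(
    condition("au", "del_maestro"),
    group(or_(condition("ti", "checkerboard"), condition("cat", "cond-mat.str-el"))),
)
print(query)
```

Keeping every helper a plain string-to-string function means the result can be passed anywhere a hand-written query string is accepted.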
