`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive about arxiv.py HOT 5 CLOSED

lukasschwab commented on August 27, 2024

`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive

from arxiv.py.

Comments (5)

lukasschwab commented on August 27, 2024

Huh, interesting. Confirmed I can reproduce this with a minimal case:

>>> list(arxiv.Search(id_list=['0112019']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

And that this isn't caused by stripping the version indicators from the IDs:

>>> list(arxiv.Search(id_list=['0112019v1']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

I'm pretty confident this is a bug in the underlying API. The client passes these IDs (which we know are valid: they're listed on arxiv.org) directly to the arXiv API. It doesn't do any preprocessing besides comma-separating them.

I reproduced this issue in the browser by generating the query URL:

>>> s = arxiv.Search(id_list=['0112019v1'])
>>> arxiv.Client()._format_url(s, 0, 10)
'http://export.arxiv.org/api/query?search_query=&id_list=0112019v1&sortBy=relevance&sortOrder=descending&start=0&max_results=10'
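For anyone who wants to reproduce the request without the client library, here is a minimal sketch of how such a URL can be assembled with the standard library. `build_query_url` is a hypothetical helper, not part of arxiv.py; the parameter names and their order are copied from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the generated URL above.
API_BASE = "http://export.arxiv.org/api/query"


def build_query_url(id_list, start=0, max_results=10):
    """Hypothetical helper: comma-separate the IDs and encode the query string."""
    params = {
        "search_query": "",
        "id_list": ",".join(id_list),  # the only preprocessing: join with commas
        "sortBy": "relevance",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return API_BASE + "?" + urlencode(params)


print(build_query_url(["0112019v1"]))
```

Note that `urlencode` percent-encodes characters such as `/` and `,`, so a URL built this way for several IDs or for archive-qualified IDs may look different from what the client generates.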

The API gives a 400 response, but with a non-empty feed body:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D0112019v1%26start%3D0%26max_results%3D10%26sortBy%3Drelevance%26sortOrder%3Ddescending" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=&amp;id_list=0112019v1&amp;start=0&amp;max_results=10&amp;sortBy=relevance&amp;sortOrder=descending</title>
  <id>http://arxiv.org/api/ICCqNwWyrQkMAErZidA/EoTr7/o</id>
  <updated>2021-07-12T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
    <updated>2021-07-12T00:00:00-04:00</updated>
    <link href="http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1" rel="alternate" type="text/html"/>
    <author>
      <name>arXiv api core</name>
    </author>
  </entry>
</feed>

At least this confirms the issue: the API believes the ID (0112019v1) is of an incorrect format. My hunch is that this was an early ID format (0112019 and 0205137 are papers from 2001 and 2002, respectively) that the API just doesn't support.
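Since the 400 response still carries a parseable Atom feed, the error message can be pulled out programmatically. This is a rough sketch using only the standard library; `find_api_errors` is a hypothetical name, and treating any entry whose `<id>` starts with `http://arxiv.org/api/errors#` as an error is an assumption based on the single response shown above.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ERROR_PREFIX = "http://arxiv.org/api/errors#"  # assumed marker for error entries

# Trimmed copy of the feed body shown above.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
  </entry>
</feed>"""


def find_api_errors(feed_xml):
    """Return the summary text of every entry that looks like an API error."""
    root = ET.fromstring(feed_xml)
    errors = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="")
        if entry_id.startswith(ERROR_PREFIX):
            errors.append(entry.findtext(ATOM + "summary", default=""))
    return errors


print(find_api_errors(FEED))
```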

I'll shoot a message to the Google Group. Unfortunately, this doesn't seem like an issue a client library can fix ☹️

lukasschwab commented on August 27, 2024

Aha! It looks like old-form IDs can be requested; they just need to be fully-qualified with the archive and (where applicable) subject class.

Explanation

The old-form arXiv ID is a combination of a subject component, a date component, and a counter component.

0112019 is the 19th paper submitted in the 12th month of 2001... but, because the counts are archive-specific, the numeric component isn't unique. There is a 0112019 in quantum physics, but there may also be a 0112019 in astrophysics and a 0112019 in math.

This old format only uniquely identifies a paper if we specify which archive's count it refers to. In this case, we want quant-ph.
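To make the structure concrete, here is a small sketch that splits a fully-qualified old-form ID into its parts. The regular expression is my own approximation of the format described above (archive, optional two-letter subject class, YYMM date, three-digit counter, optional version); it is not taken from arXiv or from this library, and `parse_old_id` is a hypothetical helper.

```python
import re

# Approximate pattern for old-form IDs such as "quant-ph/0112019v1".
# Assumption: lowercase archive name, optional ".XX" subject class,
# then YYMM + 3-digit counter, then an optional "vN" version suffix.
OLD_ID = re.compile(
    r"^(?P<archive>[a-z-]+)(?:\.(?P<subject_class>[A-Z]{2}))?/"
    r"(?P<yy>\d{2})(?P<mm>\d{2})(?P<number>\d{3})(?:v(?P<version>\d+))?$"
)


def parse_old_id(identifier):
    """Return the ID's components as a dict, or None if it isn't fully qualified."""
    match = OLD_ID.match(identifier)
    return match.groupdict() if match else None


print(parse_old_id("quant-ph/0112019v1"))
print(parse_old_id("0112019v1"))  # no archive, so this does not match
```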

The fully-qualified ID for 0112019 is quant-ph/0112019. Accordingly, the following code works:

>>> import arxiv
>>> list(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
[arxiv.Result(entry_id='http://arxiv.org/abs/quant-ph/0112019v1', updated=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), published=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), title='Classical entanglement', authors=[arxiv.Result.Author('Douglas G. Danforth')], summary='Classical systems can be entangled. Entanglement is defined by coincidence\ncorrelations. Quantum entanglement experiments can be mimicked by a mechanical\nsystem with a single conserved variable and 77.8% conditional efficiency.\nExperiments are replicated for four particle entanglement swapping and GHZ\nentanglement.', comment=None, journal_ref=None, doi=None, primary_category='quant-ph', categories=['quant-ph'], links=[arxiv.Result.Link('http://arxiv.org/abs/quant-ph/0112019v1', title=None, rel='alternate', content_type=None), arxiv.Result.Link('http://arxiv.org/pdf/quant-ph/0112019v1', title='pdf', rel='related', content_type=None)])]

But the short ID reported by this client library is incorrect:

>>> r = next(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
>>> r.entry_id
'http://arxiv.org/abs/quant-ph/0112019v1'

Instead of just taking the last path element here, I should be taking the full contents of the path following http://arxiv.org/abs/:

arxiv.py/arxiv/arxiv.py

Lines 169 to 176 in ea93efa

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"0201082v1"`.
        """
        return self.entry_id.split('/')[-1]
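A minimal sketch of the approach described above, written as a standalone function so it can be used as a workaround. `short_id_from_entry_id` is a hypothetical name, and this is not necessarily the exact code that ends up in the library; it assumes every `entry_id` contains `arxiv.org/abs/`.

```python
ABS_PREFIX = "arxiv.org/abs/"  # assumed to appear in every entry_id


def short_id_from_entry_id(entry_id):
    """Return everything after 'arxiv.org/abs/', keeping any archive prefix."""
    return entry_id.split(ABS_PREFIX)[-1]


# Old-form ID: the archive is preserved.
print(short_id_from_entry_id("http://arxiv.org/abs/quant-ph/0112019v1"))
# New-form ID: same result as taking the last path element.
print(short_id_from_entry_id("http://arxiv.org/abs/2107.05580v1"))
```

Splitting on the prefix rather than on `/` means new-form IDs are unaffected, while old-form IDs keep the archive they need to be re-queried.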

@sidphbot if you're working from hardcoded IDs, adding the archives should solve this issue for you.

If you're re-querying incorrect IDs returned by this client library, I'll have a patch out shortly.

lukasschwab commented on August 27, 2024

@sidphbot patch is included in 1.4.0.


lukasschwab commented on August 27, 2024

These IDs use an old (pre-March 2007) identifier format. The structure of that old identifier and the motivation for the 2007 change are described here:

All existing articles retain their original identifiers but newly announced articles have identifiers following the new scheme.


sidphbot commented on August 27, 2024

Hi, thank you for looking into it. Yes, unfortunately I am re-querying after stripping the ID from the result.entry_id field. Thanks for the information; I will try parsing the ID as you mentioned. I will also gladly wait for the patch 😇

