Comments (5)

lukasschwab commented on August 27, 2024

Huh, interesting. Confirmed I can reproduce this with a minimal case:

>>> list(arxiv.Search(id_list=['0112019']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

And that this isn't caused by stripping the version indicators from the IDs:

>>> list(arxiv.Search(id_list=['0112019v1']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

I'm pretty confident this is a bug in the underlying API. The client passes these IDs (which we know are valid: they're listed on arxiv.org) directly to the arXiv API. It doesn't do any preprocessing besides comma-separating them.

I reproduced this issue in the browser by generating the query URL:

>>> s = arxiv.Search(id_list=['0112019v1'])
>>> arxiv.Client()._format_url(s, 0, 10)
'http://export.arxiv.org/api/query?search_query=&id_list=0112019v1&sortBy=relevance&sortOrder=descending&start=0&max_results=10'
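
For anyone who wants to reproduce the request without the client library, here is a minimal sketch that assembles the same URL with the standard library. The helper name build_query_url is hypothetical (it is not part of arxiv.py); the endpoint and parameter names are copied from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the generated URL above.
API_BASE = "http://export.arxiv.org/api/query"


def build_query_url(id_list, start=0, max_results=10):
    """Hypothetical helper: assemble an arXiv API query URL for a list of IDs."""
    params = {
        "search_query": "",
        # The IDs are only comma-joined; nothing else is done to them.
        "id_list": ",".join(id_list),
        "sortBy": "relevance",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return API_BASE + "?" + urlencode(params)


print(build_query_url(["0112019v1"]))
```

Pasting the printed URL into a browser should show the same response the client receives.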

The API gives a 400 response, but with a non-empty feed body:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D0112019v1%26start%3D0%26max_results%3D10%26sortBy%3Drelevance%26sortOrder%3Ddescending" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=&amp;id_list=0112019v1&amp;start=0&amp;max_results=10&amp;sortBy=relevance&amp;sortOrder=descending</title>
  <id>http://arxiv.org/api/ICCqNwWyrQkMAErZidA/EoTr7/o</id>
  <updated>2021-07-12T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
    <updated>2021-07-12T00:00:00-04:00</updated>
    <link href="http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1" rel="alternate" type="text/html"/>
    <author>
      <name>arXiv api core</name>
    </author>
  </entry>
</feed>

At least this confirms the issue: the API believes the ID (0112019v1) is of an incorrect format. My hunch is that this was an early ID format (0112019 and 0205137 are papers from 2001 and 2002, respectively) that the API just doesn't support.
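
Since the 400 response still carries a readable feed, the error message can be pulled out of it. The following is a rough sketch using only the standard library; api_errors is a hypothetical helper, the embedded feed is a trimmed copy of the response above, and treating any entry whose id contains "/api/errors#" as an error is my assumption based on that one response.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the feed body returned with the HTTP 400 above.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
  </entry>
</feed>"""


def api_errors(feed_xml):
    """Return the summaries of all entries that look like API error entries."""
    root = ET.fromstring(feed_xml)
    errors = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="")
        # Assumption: error entries are marked by an id under /api/errors#.
        if "/api/errors#" in entry_id:
            errors.append(entry.findtext(ATOM + "summary", default=""))
    return errors


print(api_errors(FEED))
```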

I'll shoot a message to the Google Group. Unfortunately, this doesn't seem like an issue a client library can fix ☹️

from arxiv.py.

lukasschwab commented on August 27, 2024

Aha! It looks like old-form IDs can be requested; they just need to be fully-qualified with the archive and (where applicable) subject class.

Explanation

The old-form arXiv ID is a combination of a subject component, a date component, and a counter component.

Diagram breaking down the old-form arXiv ID into its components

0112019 is the 019th paper submitted in the 12th month of 2001... but, because the counts are archive-specific, the numeric component isn't unique. There is a 0112019 in quantum physics, but there may also be a 0112019 in astrophysics and a 0112019 in math.

This old format only uniquely identifies a paper if we specify which archive's count it refers to. In this case, we want quant-ph.
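
To make the breakdown concrete, here is a small sketch that splits a fully-qualified old-form ID into its parts. The regular expression and the helper parse_old_form_id are my own illustration, not something from the client library. It assumes a lowercase archive name, an optional two-letter subject class, a YYMMNNN numeric part, and two-digit years of 91 or later belonging to the 1900s.

```python
import re

# Assumed shape: archive[.SC]/YYMMNNN[vN], e.g. quant-ph/0112019v1
OLD_FORM_ID = re.compile(
    r"^(?P<archive>[a-z-]+)"
    r"(?:\.(?P<subject_class>[A-Z]{2}))?"
    r"/(?P<yy>\d{2})(?P<mm>\d{2})(?P<number>\d{3})"
    r"(?:v(?P<version>\d+))?$"
)


def parse_old_form_id(arxiv_id):
    """Return the components of an old-form ID, or None if it is not fully qualified."""
    m = OLD_FORM_ID.match(arxiv_id)
    if m is None:
        return None
    yy = int(m.group("yy"))
    return {
        "archive": m.group("archive"),
        "subject_class": m.group("subject_class"),
        # Assumption: 91-99 means 1991-1999, anything lower means 20xx.
        "year": (1900 if yy >= 91 else 2000) + yy,
        "month": int(m.group("mm")),
        "number": int(m.group("number")),
        "version": int(m.group("version")) if m.group("version") else None,
    }


print(parse_old_form_id("quant-ph/0112019v1"))
print(parse_old_form_id("0112019"))  # no archive, so the ID is ambiguous
```

A bare 0112019 is rejected here for the same reason the API rejects it: without the archive there is no way to tell which paper is meant.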

The fully-qualified ID for 0112019 is quant-ph/0112019. Accordingly, the following code works:

>>> import arxiv
>>> list(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
[arxiv.Result(entry_id='http://arxiv.org/abs/quant-ph/0112019v1', updated=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), published=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), title='Classical entanglement', authors=[arxiv.Result.Author('Douglas G. Danforth')], summary='Classical systems can be entangled. Entanglement is defined by coincidence\ncorrelations. Quantum entanglement experiments can be mimicked by a mechanical\nsystem with a single conserved variable and 77.8% conditional efficiency.\nExperiments are replicated for four particle entanglement swapping and GHZ\nentanglement.', comment=None, journal_ref=None, doi=None, primary_category='quant-ph', categories=['quant-ph'], links=[arxiv.Result.Link('http://arxiv.org/abs/quant-ph/0112019v1', title=None, rel='alternate', content_type=None), arxiv.Result.Link('http://arxiv.org/pdf/quant-ph/0112019v1', title='pdf', rel='related', content_type=None)])]

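But the short ID reported by this client library is incorrect, because it is derived from only the last path element of the entry ID and so drops the quant-ph/ archive: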

>>> r = next(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
>>> r.entry_id
'http://arxiv.org/abs/quant-ph/0112019v1'

Instead of just taking the last path element here, I should be taking the full contents of the path following http://arxiv.org/abs/:

arxiv.py/arxiv/arxiv.py

Lines 169 to 176 in ea93efa

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.
        If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"0201082v1"`.
        """
        return self.entry_id.split('/')[-1]
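
As a rough illustration of the difference (a sketch only, not necessarily how the patch is written), the two approaches can be compared on plain strings. Both function names are hypothetical, and the "/abs/" prefix is taken from the entry IDs shown above.

```python
ABS_PREFIX = "/abs/"


def last_segment_id(entry_id):
    """Current behavior: keep only the last path element."""
    return entry_id.split("/")[-1]


def full_short_id(entry_id):
    """Sketch of the fix: keep everything after '/abs/'.

    If '/abs/' is missing, the input is returned unchanged.
    """
    return entry_id.split(ABS_PREFIX, 1)[-1]


old_form = "http://arxiv.org/abs/quant-ph/0112019v1"
new_form = "http://arxiv.org/abs/2107.05580v1"

print(last_segment_id(old_form))  # drops the archive
print(full_short_id(old_form))    # keeps the archive
print(full_short_id(new_form))    # new-form IDs are unaffected
```

Only the second form of the old-style ID can be passed back to the API, which is what matters when re-querying results.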

@sidphbot if you're working from hardcoded IDs, adding the archives should solve this issue for you.

If you're re-querying incorrect IDs returned by this client library, I'll have a patch out shortly.

lukasschwab commented on August 27, 2024

@sidphbot patch is included in 1.4.0.

lukasschwab commented on August 27, 2024

These IDs use an old identifier format (in use through March 2007). The structure of that old identifier and the motivation for the 2007 change are described here:

All existing articles retain their original identifiers but newly announced articles have identifiers following the new scheme.

sidphbot commented on August 27, 2024

Hi, thank you for looking into it. Yes, unfortunately I am re-querying after stripping the ID from the result.entry field. Thanks for the information; I will try parsing the ID as you mentioned. Also, I will gladly wait for the patch 😇
