
Comments (6)

crflynn commented on June 16, 2024

In my experience, the null fields in the aggregations of download records are mostly due to downloads made directly by the requests library, rather than by pip, bandersnatch itself, or another dependency management client that wraps pip and/or properly sets a parsable user-agent header.

I've done a similar investigation on two small utility packages I maintain, databricks-dbapi and databricks-api, which had a few thousand downloads per day for several days made by the requests client.

A basic query on BigQuery shows that a significant number of requests-based downloads started on Nov 26 for bandersnatch, so requests appears to be the culprit here as well. I also noticed that these jumps correspond with the latest releases of bandersnatch. As you know, new releases usually correspond with jumps in mirror downloads, but I still don't know why they would cause requests-based downloads to jump as well.

That being said, I'm not really sure why someone would be downloading these packages with the requests library rather than something like pip or bandersnatch. My best guess is that something is scraping PyPI at regular intervals, such as a software archive or some security research download automation (?). I honestly don't know.

As an aside, I did some digging into how exactly the records are generated. Since you work with bandersnatch, you might know these details already, but I'm going to put my findings here (for my own recollection at least):

Linehaul populates the record fields by parsing the user-agent header from the PyPI logs. Pip sets its own user-agent, which is what makes these aggregations possible.
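To make the parsing step concrete, here is a minimal sketch of how a user-agent header could be turned into record fields. This is not Linehaul's actual code: the function name `parse_user_agent`, the regular expression, and the assumption that pip sends `pip/<version>` followed by a JSON blob containing a `python` key are all illustrative simplifications.

```python
import json
import re

# Hypothetical, simplified pattern: "<name>/<version>" followed by optional extra data.
_PREFIX = re.compile(r"^(?P<name>[^/\s]+)/(?P<version>\S+)\s*(?P<rest>.*)$")


def parse_user_agent(user_agent):
    """Return installer fields parsed from a user-agent string.

    Fields that cannot be recovered from the header stay None, which is
    how null values end up in the aggregations.
    """
    record = {"installer": None, "version": None, "python": None}
    match = _PREFIX.match(user_agent)
    if match is None:
        return record
    name = match.group("name")
    # requests announces itself as "python-requests".
    record["installer"] = "requests" if name == "python-requests" else name
    record["version"] = match.group("version")
    rest = match.group("rest")
    if rest.startswith("{"):
        # Assumption: a pip-style header carries a JSON object after the prefix.
        try:
            record["python"] = json.loads(rest).get("python")
        except ValueError:
            pass
    return record


requests_record = parse_user_agent("python-requests/2.18.4")
pip_record = parse_user_agent('pip/18.1 {"python": "3.6.6"}')
```

With a bare `python-requests/2.18.4` header there is nothing beyond the installer name and version to parse, so `python` stays `None`, whereas the pip-style header yields a Python version.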

It looks like pipenv wraps pip for installs. Poetry also uses the venv's pip for installs under the hood. Similarly, bandersnatch sets its own user-agent.

I guess some follow-up questions here might be:

  • why are there so many downloads using the requests library instead of pip?
  • for what purpose are these requests-based downloads?
  • how should pypistats handle downloads made by requests?

from pypistats.org.

crflynn commented on June 16, 2024

SELECT
  details.installer.name,
  details.installer.version,
  COUNT(*)
FROM
  `the-psf.pypi.downloads20181125`
WHERE
  file.project = 'bandersnatch'
GROUP BY
  1, 2
ORDER BY
  3 DESC

11/25

name          version      count
pip           1.5.4           48
(null)        (null)          27
bandersnatch  2.2.1            4
pip           18.1             1

11/26

name          version      count
bandersnatch  2.0.0          274
bandersnatch  1.11           124
bandersnatch  2.2.1          100
requests      2.19.1          83
pip           1.5.4           50
bandersnatch  3.0.1           44
bandersnatch  2.2.0           32
pip           8.1.2           31
pip           18.1            24
Browser       (null)          18
bandersnatch  3.0.0.dev0      16
bandersnatch  2.1.3            8
pip           9.0.1            5
bandersnatch  3.1.0            4
bandersnatch  3.1.1            4
bandersnatch  1.1              4
(null)        (null)           4
bandersnatch  3.1.0.dev1       4
bandersnatch  1.4              4
pip           10.0.1           3
requests      2.6.0            2
pip           9.0.3            2
requests      2.13.0           1
pip           18               1
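As a rough way to read the 11/26 numbers, the rows can be rolled up per installer to see what share of the day's downloads each client accounts for. The sketch below copies the rows from the table above; the names `ROWS` and `downloads_by_installer` are illustrative, and `None` stands in for an empty field.

```python
from collections import Counter

# (installer, version, downloads) rows copied from the 11/26 results above.
# None marks a field that was empty in the query output.
ROWS = [
    ("bandersnatch", "2.0.0", 274), ("bandersnatch", "1.11", 124),
    ("bandersnatch", "2.2.1", 100), ("requests", "2.19.1", 83),
    ("pip", "1.5.4", 50), ("bandersnatch", "3.0.1", 44),
    ("bandersnatch", "2.2.0", 32), ("pip", "8.1.2", 31),
    ("pip", "18.1", 24), ("Browser", None, 18),
    ("bandersnatch", "3.0.0.dev0", 16), ("bandersnatch", "2.1.3", 8),
    ("pip", "9.0.1", 5), ("bandersnatch", "3.1.0", 4),
    ("bandersnatch", "3.1.1", 4), ("bandersnatch", "1.1", 4),
    (None, None, 4), ("bandersnatch", "3.1.0.dev1", 4),
    ("bandersnatch", "1.4", 4), ("pip", "10.0.1", 3),
    ("requests", "2.6.0", 2), ("pip", "9.0.3", 2),
    ("requests", "2.13.0", 1), ("pip", "18", 1),
]


def downloads_by_installer(rows):
    """Sum download counts per installer name, ignoring the version."""
    totals = Counter()
    for name, _version, count in rows:
        totals[name or "(null)"] += count
    return totals


totals = downloads_by_installer(ROWS)
grand_total = sum(totals.values())

# Print each installer with its count and share of the day's downloads.
for name, count in totals.most_common():
    print(f"{name:<14}{count:>5}  {count / grand_total:6.1%}")
```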


cooperlees commented on June 16, 2024

Thanks for the detailed reply! Should we do a PR to requests and see if they'll accept adding the Python runtime into the default User Agent? I could try if you wish.

From:

  • "python-requests/2.18.4"

To:

  • "python-requests/2.18.4 cpython 3.6.6-final0"

Via:

import sys

# Interpreter name, e.g. "cpython"
python = sys.implementation.name
# Append the version as major.minor.micro-releaselevelserial, e.g. " 3.6.6-final0"
python += " {}.{}.{}-{}{}".format(*sys.version_info)

  • Why are there so many downloads using the requests library instead of pip?
    There have been a lot of new package resolution systems, and I bet many are built on top of requests and do not set their User Agent with the information we'd like.

  • For what purpose are these requests-based downloads?
    Mirroring, package resolution, metadata generation, who knows ...

  • How should pypistats handle downloads made by requests?
    I think we should try and get a better default User Agent.
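Until a better default exists, a tool built on an HTTP client can set a descriptive User Agent itself. The following is only a sketch of that idea: the helper `build_user_agent`, the tool name `example-mirror`, and the exact string layout are hypothetical, and whether Linehaul would recognise this particular layout is not something this example establishes. It uses the standard library's `urllib.request` so that it runs without extra dependencies, and it only constructs the request without sending it.

```python
import sys
import urllib.request


def build_user_agent(tool, version):
    """Build "<tool>/<version> (<implementation> <python version>)"."""
    implementation = sys.implementation.name
    runtime = "{}.{}.{}-{}{}".format(*sys.version_info)
    return "{}/{} ({} {})".format(tool, version, implementation, runtime)


# Hypothetical tool name and version, for illustration only.
ua = build_user_agent("example-mirror", "0.1.0")

# Attach the header to a request object; nothing is sent over the network here.
request = urllib.request.Request(
    "https://pypi.org/simple/",
    headers={"User-Agent": ua},
)
```

With requests, the same string could be passed through the `headers` argument of a call, or set once on a session via `session.headers.update({"User-Agent": ua})`.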

Thoughts?


hugovk commented on June 16, 2024

There are also archivers like https://www.softwareheritage.org/2018/10/10/pypi-available-on-software-heritage/ (I've not checked this one in detail).


cooperlees commented on June 16, 2024

Well, requests is not going to add the Python version. I understand the kernel version being risky, but I don't see the Python version as risky (which is why I removed only the kernel from bandersnatch's User Agent). We're going to be left shooting up a dark tunnel here with requests. Oh well, we'll need to try to hunt down major PyPI users and see if they'll set a nice User Agent.


crflynn commented on June 16, 2024

I think you're right. It's difficult to tell what the purpose of the requests-based downloads is. If the downloads are related to software archives, I would prefer to filter them from pypistats aggregations. On the other hand, if they are part of a newer package management tool, I would try to encourage that tool to either wrap pip or use a more detailed user-agent, since those downloads should be counted as user downloads. This discussion more or less raises the question of whether to restrict the segmented aggregations to pip only. I'll have to do some research on exactly what proportion of the downloads are null-valued because requests is the agent.
