Comments (6)
In my experience, the null fields in the aggregations of download records are mostly due to downloads made directly with the requests library, rather than with pip, bandersnatch itself, or another dependency management client that wraps pip and/or properly sets a parsable user-agent header.
I've done a similar investigation on two small utility packages I maintain, databricks-dbapi and databricks-api, which had a few thousand downloads per day for several days made by the requests client.
A basic query on BigQuery shows that significant requests installs started on Nov 26 for bandersnatch, so it appears to be the culprit here as well. What I also noticed is that these jumps correspond with the latest releases of bandersnatch. New releases usually correspond with jumps in mirror downloads, as you know, but I still don't know why they would cause requests-based installs to jump too.
That being said, I'm not really sure why someone would be downloading these packages with the requests library over something like pip or bandersnatch. My best guess is that it could be something scraping PyPI at regular intervals, like some software archive or some security research download automation (?). I honestly don't know.
As an aside, I did some digging about how exactly the records are generated. Working with bandersnatch, you might know these details already but I'm going to put my findings here (for my own recollection at least):
Linehaul populates record fields by parsing the user-agent header here from the pypi logs. Pip sets the user-agent here which is what makes these aggregations possible.
It looks like pipenv wraps pip for installs here. Poetry also uses the venv's pip for installs under the hood. Similarly, bandersnatch sets it here.
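To make the mechanism above concrete, here is a minimal sketch of how a user-agent header might be turned into record fields. It is not Linehaul's actual parser: the `DownloadRecord` type, the `parse_user_agent` function, the regular expression, and the abbreviated pip header in the usage note are all assumptions made for illustration. The mapping from `python-requests` to `requests` is inferred from the header and installer names quoted in this thread.

```python
import json
import re
from typing import NamedTuple, Optional


class DownloadRecord(NamedTuple):
    """Hypothetical record type; field names are illustrative only."""
    installer_name: Optional[str]
    installer_version: Optional[str]
    python: Optional[str]


# Assumption: the header starts with "<product>/<version>".
_PRODUCT = re.compile(r"^(?P<name>[A-Za-z0-9_.-]+)/(?P<version>\S+)")


def parse_user_agent(user_agent: str) -> DownloadRecord:
    """Extract installer details from a user-agent header, if present."""
    user_agent = user_agent.strip()
    match = _PRODUCT.match(user_agent)
    if match is None:
        # Nothing parsable: every field ends up null in the aggregation.
        return DownloadRecord(None, None, None)

    name = match.group("name")
    if name == "python-requests":
        name = "requests"

    # Assumption: a client that wants richer stats appends a JSON payload
    # with a "python" key after the product token.
    python = None
    rest = user_agent[match.end():].strip()
    if rest.startswith("{"):
        try:
            python = json.loads(rest).get("python")
        except json.JSONDecodeError:
            python = None

    return DownloadRecord(name, match.group("version"), python)
```

With this sketch, `parse_user_agent("python-requests/2.18.4")` yields an installer name and version but no Python version, which is the kind of gap that shows up as null fields in the segmented aggregations.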
I guess some follow-up questions here might be
- why are there so many downloads using the requests library instead of pip?
- for what purpose are these requests-based downloads?
- how should pypistats handle downloads made by requests?
from pypistats.org.
```sql
SELECT
  details.installer.name,
  details.installer.version,
  COUNT(*)
FROM
  `the-psf.pypi.downloads20181125`
WHERE
  file.project = 'bandersnatch'
GROUP BY
  1, 2
ORDER BY
  3 DESC
```
11/25
| name | version | count |
|---|---|---|
| pip | 1.5.4 | 48 |
|  |  | 27 |
| bandersnatch | 2.2.1 | 4 |
| pip | 18.1 | 1 |
11/26
| name | version | count |
|---|---|---|
| bandersnatch | 2.0.0 | 274 |
| bandersnatch | 1.11 | 124 |
| bandersnatch | 2.2.1 | 100 |
| requests | 2.19.1 | 83 |
| pip | 1.5.4 | 50 |
| bandersnatch | 3.0.1 | 44 |
| bandersnatch | 2.2.0 | 32 |
| pip | 8.1.2 | 31 |
| pip | 18.1 | 24 |
| Browser |  | 18 |
| bandersnatch | 3.0.0.dev0 | 16 |
| bandersnatch | 2.1.3 | 8 |
| pip | 9.0.1 | 5 |
| bandersnatch | 3.1.0 | 4 |
| bandersnatch | 3.1.1 | 4 |
| bandersnatch | 1.1 | 4 |
|  |  | 4 |
| bandersnatch | 3.1.0.dev1 | 4 |
| bandersnatch | 1.4 | 4 |
| pip | 10.0.1 | 3 |
| requests | 2.6.0 | 2 |
| pip | 9.0.3 | 2 |
| requests | 2.13.0 | 1 |
| pip | 18 | 1 |
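The two result sets above come from the same query run against consecutive daily tables. A small helper along these lines could generate that query for any day. This is only a sketch: the function name `installer_breakdown_query` and the project-name check are assumptions, and the table naming simply follows the `the-psf.pypi.downloadsYYYYMMDD` pattern used in the query above.

```python
import re
from datetime import date

# Assumption: a conservative pattern for project names, used only to avoid
# interpolating arbitrary text into the SQL string.
_PROJECT_NAME = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?$")


def installer_breakdown_query(project: str, day: date) -> str:
    """Build the per-installer download count query for one daily table."""
    if not _PROJECT_NAME.match(project):
        raise ValueError(f"unexpected project name: {project!r}")

    # Daily tables are suffixed with the date as YYYYMMDD.
    table = f"the-psf.pypi.downloads{day:%Y%m%d}"
    return (
        "SELECT details.installer.name, details.installer.version, COUNT(*)\n"
        f"FROM `{table}`\n"
        f"WHERE file.project = '{project}'\n"
        "GROUP BY 1, 2\n"
        "ORDER BY 3 DESC"
    )
```

For example, `installer_breakdown_query("bandersnatch", date(2018, 11, 26))` produces the query for the second table.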
from pypistats.org.
Thanks for the detailed reply! Should we do a PR to requests and see if they'll accept adding the Python runtime into the default User Agent? I could try if you wish.
From:
"python-requests/2.18.4"
To:
"python-requests/2.18.4 cpython 3.6.6-final0"
Via:
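```python
import sys

# Implementation name, e.g. "cpython"
python = sys.implementation.name
# version_info unpacks to (major, minor, micro, releaselevel, serial)
python += " {}.{}.{}-{}{}".format(*sys.version_info)
```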
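Until a richer default exists, a tool that downloads from PyPI can set a descriptive header itself. The sketch below uses only the standard library so it stays self-contained, and it builds a request without sending it. The names `descriptive_user_agent`, `build_request`, and the client name `my-mirror-tool` are hypothetical.

```python
import sys
import urllib.request


def descriptive_user_agent(client: str, version: str) -> str:
    """Return '<client>/<version> <implementation> <python version>'."""
    runtime = "{} {}.{}.{}-{}{}".format(
        sys.implementation.name, *sys.version_info
    )
    return f"{client}/{version} {runtime}"


def build_request(url: str, client: str, version: str) -> urllib.request.Request:
    """Create (but do not send) a request carrying the descriptive header."""
    headers = {"User-Agent": descriptive_user_agent(client, version)}
    return urllib.request.Request(url, headers=headers)


# Hypothetical client name and version, for illustration only.
request = build_request(
    "https://pypi.org/simple/bandersnatch/", "my-mirror-tool", "0.1"
)
```

A tool built on requests could pass the same string through its own `headers` argument instead.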
- Why are there so many downloads using the requests library instead of pip?
  There have been a lot of new package resolution systems, and I bet many are built on top of requests and do not set their User Agent with the information we'd like.
- For what purpose are these requests-based downloads?
  Mirroring, package resolution, metadata generation, who knows ...
- How should pypistats handle downloads made by requests?
  I think we should try and get a better default User Agent.
Thoughts?
from pypistats.org.
There are also archivers like https://www.softwareheritage.org/2018/10/10/pypi-available-on-software-heritage/ (I've not checked this one in detail).
from pypistats.org.
Well, requests is not going to add the Python version. I understand the kernel being risky, but I don't see the Python version as risky (which is why I removed only the kernel from bandersnatch). We're going to be left shooting up a dark tunnel here with requests. Oh well, we need to try to hunt down major PyPI users and see if they'll set a nice User Agent.
from pypistats.org.
I think you're right. It's difficult to tell what the purpose of the requests-based downloads is. If the downloads are related to software archives, I would prefer to filter them from pypistats aggregations. On the other hand, if they are part of a newer package management tool, then I would try to encourage that tool to either wrap pip or use a more detailed user-agent, as they should be included as user downloads. This discussion more or less prompts the question of whether to restrict the segmented aggregations to pip only. I'll have to do some research on exactly what proportion of the downloads are actually null-valued due to requests as the agent.
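As a rough starting point for that research, the share of downloads per installer can be computed from the 11/26 bandersnatch table earlier in this thread. This is only a sketch on one day of one project: the name `share_by_installer` is made up for illustration, and `None` stands for the rows whose installer fields were empty.

```python
from collections import Counter

# (installer name, installer version, count) rows copied from the 11/26 table.
ROWS_1126 = [
    ("bandersnatch", "2.0.0", 274), ("bandersnatch", "1.11", 124),
    ("bandersnatch", "2.2.1", 100), ("requests", "2.19.1", 83),
    ("pip", "1.5.4", 50), ("bandersnatch", "3.0.1", 44),
    ("bandersnatch", "2.2.0", 32), ("pip", "8.1.2", 31),
    ("pip", "18.1", 24), ("Browser", None, 18),
    ("bandersnatch", "3.0.0.dev0", 16), ("bandersnatch", "2.1.3", 8),
    ("pip", "9.0.1", 5), ("bandersnatch", "3.1.0", 4),
    ("bandersnatch", "3.1.1", 4), ("bandersnatch", "1.1", 4),
    (None, None, 4), ("bandersnatch", "3.1.0.dev1", 4),
    ("bandersnatch", "1.4", 4), ("pip", "10.0.1", 3),
    ("requests", "2.6.0", 2), ("pip", "9.0.3", 2),
    ("requests", "2.13.0", 1), ("pip", "18", 1),
]


def share_by_installer(rows):
    """Return each installer's fraction of the total download count."""
    totals = Counter()
    for name, _version, count in rows:
        totals[name] += count
    grand_total = sum(totals.values())
    return {name: count / grand_total for name, count in totals.items()}


shares = share_by_installer(ROWS_1126)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name or '(null)'}: {share:.1%}")
```

The same function could be applied to query results for other projects and dates to see how much of the traffic comes from requests compared with pip and mirror clients.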
from pypistats.org.