
On November 26 a lot more `null` Entries appeared about pypistats.org HOT 6 CLOSED

crflynn commented on July 18, 2024

On November 26 a lot more `null` Entries appeared

from pypistats.org.

Comments (6)

crflynn commented on July 18, 2024

In my experience, the null fields in the aggregations of download records are mostly due to downloads made directly with the requests library, rather than with pip, bandersnatch, or another dependency management client that wraps pip and/or sets a parsable user-agent header.

I've done a similar investigation on two small utility packages I maintain, databricks-dbapi and databricks-api, which for several days had a few thousand downloads per day made by the requests client.

A basic query on BigQuery shows that a significant number of requests-based installs of bandersnatch started on Nov 26, so requests appears to be the culprit here as well. I also noticed that these jumps correspond with the latest releases of bandersnatch. New releases usually correspond with jumps in mirror downloads, as you know, but I still don't know why they would cause requests-based installs to jump too.

That being said, I'm not really sure why someone would download these packages with the requests library rather than something like pip or bandersnatch. My best guess is that something is scraping PyPI at regular intervals, such as a software archive or some security-research download automation (?). I honestly don't know.

As an aside, I did some digging into how exactly the records are generated. Since you work on bandersnatch, you might know these details already, but I'm going to put my findings here (for my own recollection at least):

Linehaul populates the record fields by parsing the user-agent header from the PyPI logs. Pip sets its own user-agent header, which is what makes these aggregations possible.

It looks like pipenv wraps pip for installs. Poetry also uses the venv's pip for installs under the hood. Similarly, bandersnatch sets its own user-agent header.
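To make the parsing step concrete, here is a rough sketch of how an installer name and version might be pulled out of a user-agent header. It is not Linehaul's actual implementation: the function name `parse_installer`, the regex, and the sample header strings are assumptions for illustration only, and real headers carry more detail than this handles.

```python
import re

# Hypothetical, simplified pattern: "<name>/<version>" at the start of the header.
# Linehaul's real parser is more elaborate; this only illustrates the idea.
_UA_PATTERN = re.compile(r"^(?P<name>[A-Za-z][\w.-]*)/(?P<version>\S+)")


def parse_installer(user_agent):
    """Return (name, version), or (None, None) when nothing can be parsed."""
    if not user_agent:
        # A missing or empty header is what ends up as null fields.
        return (None, None)
    match = _UA_PATTERN.match(user_agent)
    if match is None:
        return (None, None)
    name = match.group("name")
    # The requests library identifies itself as "python-requests".
    if name == "python-requests":
        name = "requests"
    return (name, match.group("version"))


# Illustrative header strings (assumed shapes, not copied from real logs).
print(parse_installer('pip/18.1 {"cpu": "x86_64"}'))
print(parse_installer("python-requests/2.19.1"))
print(parse_installer(""))
```

Note that a default requests header carries only the library version, so even when it parses, fields such as the Python version stay empty.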

I guess some follow-up questions here might be:

- Why are there so many downloads using the requests library instead of pip?
- For what purpose are these requests-based downloads?
- How should pypistats handle downloads made by requests?


crflynn commented on July 18, 2024

SELECT
  details.installer.name,
  details.installer.version,
  COUNT(*)
FROM
  `the-psf.pypi.downloads20181125`
WHERE
  file.project = 'bandersnatch'
GROUP BY
  1, 2
ORDER BY
  3 DESC

11/25

name	version	count
pip	1.5.4	48
		27
bandersnatch	2.2.1	4
pip	18.1	1

11/26

name	version	count
bandersnatch	2.0.0	274
bandersnatch	1.11	124
bandersnatch	2.2.1	100
requests	2.19.1	83
pip	1.5.4	50
bandersnatch	3.0.1	44
bandersnatch	2.2.0	32
pip	8.1.2	31
pip	18.1	24
Browser		18
bandersnatch	3.0.0.dev0	16
bandersnatch	2.1.3	8
pip	9.0.1	5
bandersnatch	3.1.0	4
bandersnatch	3.1.1	4
bandersnatch	1.1	4
		4
bandersnatch	3.1.0.dev1	4
bandersnatch	1.4	4
pip	10.0.1	3
requests	2.6.0	2
pip	9.0.3	2
requests	2.13.0	1
pip	18	1
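To see how much of the 11/26 traffic each installer accounts for, the rows above can be summed per installer. This is only a sketch: the list `ROWS_1126` is hand-copied from the table above (with `None` standing in for empty fields), and the function name `totals_by_installer` is made up for this example.

```python
from collections import Counter

# (installer name, installer version, count) rows copied from the 11/26 table.
# None marks a field that was empty in the query result.
ROWS_1126 = [
    ("bandersnatch", "2.0.0", 274), ("bandersnatch", "1.11", 124),
    ("bandersnatch", "2.2.1", 100), ("requests", "2.19.1", 83),
    ("pip", "1.5.4", 50), ("bandersnatch", "3.0.1", 44),
    ("bandersnatch", "2.2.0", 32), ("pip", "8.1.2", 31),
    ("pip", "18.1", 24), ("Browser", None, 18),
    ("bandersnatch", "3.0.0.dev0", 16), ("bandersnatch", "2.1.3", 8),
    ("pip", "9.0.1", 5), ("bandersnatch", "3.1.0", 4),
    ("bandersnatch", "3.1.1", 4), ("bandersnatch", "1.1", 4),
    (None, None, 4), ("bandersnatch", "3.1.0.dev1", 4),
    ("bandersnatch", "1.4", 4), ("pip", "10.0.1", 3),
    ("requests", "2.6.0", 2), ("pip", "9.0.3", 2),
    ("requests", "2.13.0", 1), ("pip", "18", 1),
]


def totals_by_installer(rows):
    """Sum download counts per installer name, ignoring the version."""
    totals = Counter()
    for name, _version, count in rows:
        totals[name] += count
    return totals


totals = totals_by_installer(ROWS_1126)
grand_total = sum(totals.values())
for name, count in totals.most_common():
    print(f"{name}: {count} ({count / grand_total:.1%})")
```

On these rows, requests accounts for 86 of 842 downloads, roughly 10%, while only 4 rows' worth of downloads had no installer name at all.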


cooperlees commented on July 18, 2024

Thanks for the detailed reply! Should we do a PR to requests and see if they'll accept adding the Python runtime into the default User Agent? I could try if you wish.

From:

"python-requests/2.18.4"

To:

"python-requests/2.18.4 cpython 3.6.6-final0"

Via:

import sys
python = sys.implementation.name
python += " {}.{}.{}-{}{}".format(*sys.version_info)
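Independently of what requests does by default, a tool built on requests could set such a header itself. A minimal sketch, assuming a made-up tool name and version (`my-mirror-tool`, `0.1.0`) and a helper name `build_user_agent` that is not part of any library:

```python
import sys


def build_user_agent(tool, version):
    """Build a user-agent string that includes the Python runtime.

    Mirrors the format proposed above: "<tool>/<version> <impl> <x.y.z-levelN>".
    """
    implementation = sys.implementation.name
    runtime = "{}.{}.{}-{}{}".format(*sys.version_info)
    return "{}/{} {} {}".format(tool, version, implementation, runtime)


# "my-mirror-tool" and "0.1.0" are placeholders for illustration.
user_agent = build_user_agent("my-mirror-tool", "0.1.0")
print(user_agent)
```

With requests, the resulting string could then be assigned to the `User-Agent` key of a `requests.Session`'s `headers` mapping so that every download made through that session carries it.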

Why are there so many downloads using the requests library instead of pip?
There have been a lot of new package resolution systems, and I bet many are built on top of requests and do not set their User Agent with the information we'd like.
For what purpose are these requests-based downloads?
Mirroring, package resolution, metadata generation, who knows ...
How should pypistats handle downloads made by requests?
I think we should try and get a better default User Agent.

Thoughts?


hugovk commented on July 18, 2024

There's also archivers like https://www.softwareheritage.org/2018/10/10/pypi-available-on-software-heritage/ (I've not checked this one in detail).


cooperlees commented on July 18, 2024

Well, requests is not going to add the Python version. I understand the kernel version being risky, but I don't see the Python version as risky (which is why I removed only the kernel from bandersnatch). We're going to be left shooting up a dark tunnel here with requests. Oh well, we need to try to hunt down the major PyPI users and see if they'll set a nice User Agent.


crflynn commented on July 18, 2024

I think you're right. It's difficult to tell what the purpose of the requests-based downloads is. If the downloads are related to software archives, I would prefer to filter them from the pypistats aggregations. On the other hand, if they are part of a newer package management tool, then I would try to encourage that tool to either wrap pip or use a more detailed user-agent, as those downloads should be counted as user downloads. This discussion more or less prompts the question of whether to restrict the segmented aggregations to pip only. I'll have to do some research on exactly what proportion of the downloads are null-valued due to requests being the agent.
