
pypidb

PyPI client side database with SCM/VCS URLs

This project provides a client-side database of PyPI project metadata, primarily for the purpose of finding an SCM URL for any PyPI project. More of the database's internals will be exposed; time has been the main limiting factor in doing so.

Most programming languages created in the last 10 years directly connect every library to an SCM. PyPI offers several mechanisms for package uploads to provide URLs, including for the SCM; however these URLs are frequently omitted, often invalid, and frequently outdated as projects move their development activities between free hosting services, especially when services are discontinued. There are also projects which have been deleted from a hosting service and not republished at a new location (perhaps due to "right to vanish" provisions).

This project attempts to locate the current development URL, and has deep analysis in the test suite to verify the resolution process is correct for thousands of projects.

Each resolution process stops after a limited number of web fetches; almost all projects tested require less than one minute, and disk caching is used so that subsequent resolutions of the same project are almost instantaneous.

The objective is to always give an appropriate URL for any project, if there is one, and if there is a credible rationale that the project in question is, or was, an important project.

If you encounter a project which returns the wrong result, or no result, first check the PyPI metadata for a suitable SCM link. If none exists, try to find the development project manually, and create an issue in that project asking for the metadata it submits to PyPI to be enhanced.

Only if the target project's maintainers are uncooperative should you create an issue in the pypidb project for assistance.

Details

There are over 8000 tests; however, a few projects appear in multiple tests, so the total number of projects checked is slightly lower.

Tests currently cover all PyPI projects in

Of those, approximately 340 projects do not return a URL, such as mysql-connector-python with over 55,000 downloads per day.

There are a few situations where the returned result may not be stable, alternating between two URLs; the fluctuation is due to how URLs are queued and fetched. There are no currently known cases where this happens, as override rules have been added to avoid them. It is a high priority for any such occurrence to be fixed so that results are always stable. Please raise an issue if you encounter one.

There are many rules which drive the resolution, and each package can have an associated unified patch URL, which will be fetched and used to guide the resolution. This is used for packages which have moved, but have not yet been re-released to PyPI with updated metadata.

The rules for projects may also exclude URLs in the metadata from the resolution process. The rules do not allow explicitly setting the target URL. For projects which do not have an SCM, and only have a webpage, that webpage can be added as a 'fake' SCM so that it will be used; however this approach is reserved for moribund projects where no SCM can be found.

Usage

$ pip install pypidb
$ pypidb requests-threads
https://github.com/requests/requests-threads
$ pypidb does-not-exist
Invalid package name does-not-exist
>>> from pypidb import Database

>>> db = Database()
>>> db.find_project_scm_url("requests-threads")
'https://github.com/requests/requests-threads'
>>> db.find_project_scm_url("mercurial")
'https://www.mercurial-scm.org/repo/hg'
>>> db.find_project_scm_url("cffi")
'https://foss.heptapod.net/pypy/cffi'
>>> db.find_project_scm_url("mysql-connector-python")
Traceback (most recent call last):
    ...
pypidb._exceptions.IncompletePackageMetadata: mysql-connector-python has no email in PyPI metadata
https://pypi.org/project/mysql-connector-python/8.0.19/: 500 Server Error: HTTPS Everywhere for url: https://pypi.org/project/mysql-connector-python/8.0.19/
https://pypi.org/project/mysql-connector-python/: 500 Server Error: HTTPS Everywhere for url: https://pypi.org/project/mysql-connector-python/
https://downloads.mysql.com/docs/licenses/connector-python-8.0-com-en.pdf: 500 Server Error: HTTPS Everywhere for url: https://downloads.mysql.com/docs/licenses/connector-python-8.0-com-en.pdf
https://downloads.mysql.com/docs/licenses/connector-python-8.0-gpl-en.pdf: 500 Server Error: HTTPS Everywhere for url: https://downloads.mysql.com/docs/licenses/connector-python-8.0-gpl-en.pdf
https://downloads.mysql.com/docs/licenses/connector-python-gpl-en.pdf: 500 Server Error: HTTPS Everywhere for url: https://downloads.mysql.com/docs/licenses/connector-python-gpl-en.pdf
https://downloads.mysql.com/docs/connector-python-en.pdf: 500 Server Error: HTTPS Everywhere for url: https://downloads.mysql.com/docs/connector-python-en.pdf
https://downloads.mysql.com/docs/licenses/connector-python-com-en.pdf: 500 Server Error: HTTPS Everywhere for url: https://downloads.mysql.com/docs/licenses/connector-python-com-en.pdf

Resolution of many packages uses Read the Docs metadata, which performs better with a token; one can be obtained from https://readthedocs.org/accounts/tokens/

The token should be stored in the default .netrc file in the user's home directory, in the following format.

machine readthedocs.io
    login deadbeef
    password x-oauth-basic

To a lesser extent, the GitHub API is also used. Depending on the volume of lookups, it may be necessary to add a GitHub token, also stored in .netrc.
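Below is a minimal sketch of how such tokens can be read with the standard library's netrc module; the GitHub machine name shown is an assumption for illustration, as pypidb's exact lookup keys are not documented here.

import netrc

auth = netrc.netrc()  # parses ~/.netrc by default

# Read the Docs token; per the format above, the login field holds the token.
rtd = auth.authenticators("readthedocs.io")
if rtd:
    rtd_token, _account, _password = rtd

# A GitHub token could be stored the same way (hypothetical machine name).
gh = auth.authenticators("api.github.com")
if gh:
    gh_token = gh[0]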

Testing

Testing requires a GitHub token in .netrc. Without a GitHub token, many tests will be skipped, and some will fail. (The tests could easily be fixed to detect that the API limit was reached.)

git clone https://github.com/jayvdb/pypidb
cd pypidb
tox

A complete test run takes several hours. There is aggressive caching of web content using CacheControl and of DNS results using dns-cache, so subsequent runs should complete in a little over an hour.
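As a rough illustration of that style of caching, here is a minimal sketch wrapping a requests session in CacheControl with an on-disk FileCache; the cache directory name is an assumption, not pypidb's actual path.

import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

# Cacheable responses are written to .web_cache, so repeat fetches of the
# same URLs are served from disk, which is what makes re-runs much faster.
session = CacheControl(requests.Session(), cache=FileCache(".web_cache"))
response = session.get("https://pypi.org/pypi/requests/json")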

Running only tests on the top 360 most popular PyPI packages can be done without any tokens, and completes within approximately five minutes.

tox -- -k TestTop360

Similarly, running the tests on the top 4000 most popular PyPI packages can be done without any tokens and completes within approximately twenty minutes; tests requiring a GitHub token will be skipped.

tox -- tests/test_top.py

Because the tests inspect and validate results against live project metadata, the projects are constantly on the move, and resolution often involves accessing websites which may be temporarily inoperative for various reasons (usually certificate expiration!), it is not unusual for tests to fail.

For example, there are approximately 700 expected URLs in tests/data.py, divided into four subsets. When a project has moved, and the algorithm has correctly followed the move, those URLs need to be updated.

There is rudimentary support for marking PyPI projects as untestable.

Contributors

hugovk, jayvdb


Issues

pyup.io/repos/github/x

perfume-bench includes a pyup.io URL in its description, which causes that URL to be fetched, slowing resolution down.

Find Azure repos

#18 blocked incorrect repo resolution, so many azure packages now return no repository because none can be found.

MicrosoftDocs/azure-docs#51067 tried to get the public metadata to be better, but that fizzled.

There are too many azure packages, spread across lots of repos, to 'work around' the problem.

Upstream needs to be fixed.

Resolve jupyter without any tokens

The only package in the top 360 which depends on GitHub or Read the Docs tokens is jupyter. Some fallback is needed when those tokens are missing, so that the top 360 tests can be run without any tokens.

Block network traffic to *.travis-ci.org

The package misspellings causes requests to *.travis-ci.org - there is nothing useful to be found there.

Blocks usually occur in _caching.py

There are also basic handlers for travis-ci.org in _scm_url_cleaner.py - they could return False to block network requests, as sketched below.
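A hypothetical sketch of that idea; the function name and the return-value convention are assumptions for illustration, not pypidb's actual API.

from urllib.parse import urlparse

def clean_travis_ci(url):
    # Reject travis-ci.org URLs outright; nothing useful lives there.
    host = urlparse(url).hostname or ""
    if host == "travis-ci.org" or host.endswith(".travis-ci.org"):
        return False  # signal the caller: do not fetch this URL
    return None  # URL not handled by this rule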

GitHub setuppy check could use raw

Spawned from #9:

Using raw.githubusercontent.com probably bypasses the rate limits on unauthenticated API use; however it might only buy a little more access, so it should be a failover.
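A minimal sketch of that failover, assuming a slug like "owner/repo" and guessing the branch name (real code would need to discover the default branch):

import requests

def fetch_setuppy_raw(slug, branch="master"):
    # raw.githubusercontent.com serves file contents without the REST
    # API's unauthenticated rate limits.
    url = "https://raw.githubusercontent.com/%s/%s/setup.py" % (slug, branch)
    r = requests.get(url, timeout=15)
    if r.status_code == 200:
        return r.text
    return None  # fall back to the API-based check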

gitlab.gnome.org: Max retries exceeded: connect timeout=15

pypillowfight & paperwork-backend & pyocr fail during the tests (not in the library):

tests/utils.py:268: in _test_names
    r = web_session.get(url, timeout=get_timeout(url))
...
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='gitlab.gnome.org', port=443): Max retries exceeded with url: /World/OpenPaperwork/libpillowfight (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f97179c4640>, 'Connection to gitlab.gnome.org timed out. (connect timeout=15)'))

The timeouts for known SCM hosts should be much higher.

A problem with higher timeouts is that the test runner also has a time limit, and that isn't flexible; pygtk encountered this.
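A hypothetical sketch of per-host timeouts; the host set and the values are illustrative assumptions, not pypidb's actual configuration:

from urllib.parse import urlparse

KNOWN_SCM_HOSTS = {"gitlab.gnome.org", "foss.heptapod.net", "github.com"}

def get_timeout(url, default=15, scm_timeout=60):
    # Known SCM hosts are worth waiting for; arbitrary homepages are not.
    host = urlparse(url).hostname or ""
    return scm_timeout if host in KNOWN_SCM_HOSTS else default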

Add openSUSE check for mis-normalised names

openSUSE package names need to be matched exactly in RPM .spec files. Given they are being matched to PyPI names, it is worth also building a dataset of which package names are not using the same name as PyPI, even if they normalise to the same name - e.g. different capitalisation, or differing use of _, - and . characters. A sketch of the check follows.
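This sketch uses the PEP 503 normalisation rule to flag names that match only after normalisation; the function names are assumptions for illustration.

import re

def normalise(name):
    # PEP 503: collapse runs of -, _ and . to a single - and lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

def is_mis_normalised(spec_name, pypi_name):
    # True when the names differ but normalise to the same PyPI name.
    return spec_name != pypi_name and normalise(spec_name) == normalise(pypi_name)

# Example: is_mis_normalised("Django_extras", "django-extras") -> True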

mwlib.*

e.g. mwlib.ext

https://www.reportlab.org & http://www.reportlab.org fail.

ERROR    https_everywhere.adapter:adapter.py:124 handle_error requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='www.reportlab.org', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f097d74bf40>, 'Connection to www.reportlab.org timed out. (connect timeout=15)'))
(the above ERROR line is repeated six times)
WARNING  pypidb._pypi:_pypi.py:459 http://www.reportlab.org: HTTPSConnectionPool(host='www.reportlab.org', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f097d74bf40>, 'Connection to www.reportlab.org timed out. (connect timeout=15)'))

There are a lot of links to https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Collection, which has no chance of helping.

They also load http://blog.pediapress.com/

A timeout occurs in the Read the Docs lookup, but only after the RTD resolution has already completed.

pypidb/_rtd.py:30: in __init__
    token = _get_token("readthedocs.io")

Prevent http errors during tests

gertty currently causes this log output:

ERROR    https_everywhere.adapter:adapter.py:124 handle_error requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='ttygroup.org', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb191ddc580>, 'Connection to ttygroup.org timed out. (connect timeout=15)'))
(the above ERROR line appears twice)

While it is useful for the library to be able to skip past those problems, the tests should highlight them so they can be raised upstream.

Another, more critical, error would be TooManyRedirects.

tvb-gdist & www.thevirtualbrain.org

https://www.thevirtualbrain.org causes

ERROR https_everywhere.adapter:adapter.py:126 handle_error requests.exceptions.SSLError: HTTPSConnectionPool(host='www.thevirtualbrain.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

But http redirects to https, so we can't downgrade it without causing too many redirects; this caused a timeout on https://ci.appveyor.com/project/jayvdb/pypidb/builds/31795479.

the-virtual-brain/tvb-gdist#24 already solves this for that package.

The maximum number of redirects should also be reduced.
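A minimal sketch of capping redirects with requests; the limit of 5 is an illustrative assumption (requests defaults to 30):

import requests

session = requests.Session()
session.max_redirects = 5  # raise TooManyRedirects sooner

try:
    session.get("https://www.thevirtualbrain.org", timeout=15)
except requests.exceptions.TooManyRedirects:
    pass  # treat the site as unusable instead of spinning until CI times out
except requests.exceptions.RequestException:
    pass  # SSL failures, timeouts, etc.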

Fix Python 2.7

Python 2.7 has lots of failures. While Python 2.7 support will only be minimal, the tox py27 environment should succeed, skipping any aspects which are not supported.

azure failures

TestTopFirstThousandTail has two expected failures for azure-*

There are also many more in tests/data.py

Package django-hijack

django-hijack-admin is handled with a rule; however django-hijack doesn't return a result.

GitHub connection failure during tests

Occurring in rv = self._check_github_repo(slug):

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

This should result in a skip, or a retry.
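A sketch of the retry approach using urllib3's Retry via an HTTPAdapter; the retry counts and status list are illustrative assumptions:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=(500, 502, 503))
session.mount("https://api.github.com", HTTPAdapter(max_retries=retries))

# Connection resets such as "Remote end closed connection without response"
# are now retried a few times before being raised to the test.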

uuid website zesty.ca regularly fails

Resolution of https://pypi.org/project/uuid/ fails when http://zesty.ca/python/ is down, especially http://zesty.ca/python/uuid.html, which is really slow.

INFO     pypidb._pypi:_pypi.py:507 looking up uuid
INFO     pypidb._cache:_cache.py:82 domain http://zesty.ca/python/ lookup error: -11

It doesn't appear in @zestyping's https://github.com/zestyping?tab=repositories

And https://zesty.ca/python/ (https) always fails.

https://github.com/search?q=%22Get+the+hardware+address+on+Windows+by+running+ipconfig.exe%22&type=Code shows lots of copies of it.

https://github.com/python/cpython/blob/master/Lib/uuid.py is very similar - I believe this package was the basis of the CPython stdlib module.

protorpc

This has only occurred once:
https://cirrus-ci.com/task/4782984914010112?command=main#L449

self = <tests.test_google_code.TestRedirectGitHub testMethod=test_protorpc>
    def test_protorpc(self):
        url = self.converter.get_vcs("protorpc")
>       self.assertInsensitiveEqual(url, "https://github.com/google/protorpc")
tests/test_google_code.py:157: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/utils.py:72: in assertInsensitiveEqual
    self.assertEqual(s.lower(), s2.lower())
E   AssertionError: 'https://github.com/xzer/run-jetty-run' != 'https://github.com/google/protorpc'
E   - https://github.com/xzer/run-jetty-run
E   + https://github.com/google/protorpc
------------------------------ Captured log call -------------------------------
INFO     pypidb._pypi:_pypi.py:507 looking up protorpc
INFO     pypidb._pypi:_pypi.py:351 protorpc: from None added urls ['https://github.com/xzer/run-jetty-run']
INFO     pypidb._rules:_rules.py:854 email split: rafek @ google.com
INFO     pypidb._pypi:_pypi.py:435 r https://pypi.org/project/protorpc/0.12.0/
WARNING  pypidb._pypi:_pypi.py:446 https://pypi.org/project/protorpc/0.12.0/: 500 Server Error: HTTPS Everywhere for url: https://pypi.org/project/protorpc/0.12.0/
INFO     pypidb._pypi:_pypi.py:435 r https://pypi.org/project/protorpc/
WARNING  pypidb._pypi:_pypi.py:446 https://pypi.org/project/protorpc/: 500 Server Error: HTTPS Everywhere for url: https://pypi.org/project/protorpc/
INFO     pypidb._db:_db.py:71 Adding mapping protorpc = https://github.com/xzer/run-jetty-run

Other top 4000 expected failures

In addition to #20

  • "bindep" (opendev)
  • "clearbit"
  • "comet-ml"
  • "dbus-python" (fedora)
  • "dm.xmlsec-binding" (fedora iirc)
  • "pycuda" (repo is auth protected)
  • "scons"

Relative URLs

Currently, relative URLs are extracted when get_html_hrefs or _url_extract_both is used in a rule in https://github.com/jayvdb/pypidb/blob/master/pypidb/_rules.py, instead of the default _url_extractor_wrapper.

Some of these explicit controlling rules might be removed by always running urlextract first, and following with get_html_hrefs if urlextract yields no useful values.

Regular server errors for code.google.com

DEBUG    pypidb._scm_url_cleaner:__init__.py:157 Calling pypidb._scm_url_cleaner.SCMURLCleaner.get_root(self = <pypidb._scm_url_cleaner.SCMURLCleaner object at 0x7fb7ea9671f0>, url = 'http://code.google.com/p/modwsgi/wiki/VirtualEnvironments')
DEBUG    pypidb._scm_url_cleaner:_scm_url_cleaner.py:291 google code http://code.google.com/p/modwsgi/wiki/VirtualEnvironments
DEBUG    pypidb._scm_url_cleaner:_scm_url_cleaner.py:309 google code p modwsgi/wiki/VirtualEnvironments
DEBUG    pypidb._adapters:_adapters.py:57 domain block of https://code.google.com/p/modwsgi skipped
DEBUG    pypidb._adapters:_adapters.py:33 IPblock of https://code.google.com/p/modwsgi skipped
DEBUG    https_everywhere.adapter:adapter.py:86 No implementation for get_redirect('https://code.google.com/p/modwsgi')
DEBUG    https_everywhere.adapter:adapter.py:117 no redirection of https://code.google.com/p/modwsgi occurred
DEBUG    cachecontrol.controller:controller.py:126 Looking up "https://code.google.com/p/modwsgi" in the cache
DEBUG    cachecontrol.controller:controller.py:141 No cache entry available
DEBUG    urllib3.connectionpool:connectionpool.py:955 Starting new HTTPS connection (1): code.google.com:443
DEBUG    urllib3.connectionpool:connectionpool.py:428 https://code.google.com:443 "GET /p/modwsgi HTTP/1.1" 500 290
DEBUG    cachecontrol.controller:controller.py:257 Status code 500 not in (200, 203, 300, 301, 302, 401, 404)
DEBUG    pypidb._scm_url_cleaner:__init__.py:198 Exception HTTPError occurred in pypidb._scm_url_cleaner.SCMURLCleaner.get_root, "500 Server Error: Internal Server Error for url: https://code.google.com/p/modwsgi"

80 missing_repos

There are 80 entries in tests/data.py "missing_repos". These need debugging. Creating issues about what is occurring is a helpful contribution; fixing them is even better.

repo->PyPI mappings

While rummaging around the internet looking for the repo for package-a, a package-b repo is very often found; when it can be reliably verified, it should be stored as a mapping.

This would remove the need for many of the rules which preload packages that are known to cause incorrect mappings.

The most reliable way to verify a repo->PyPI mapping is to look for a link to PyPI in the readme, as sketched below.
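A minimal sketch of that verification, assuming a GitHub slug and guessing the readme location and branch:

import re
import requests

def readme_links_to_pypi(slug, pypi_name, branch="master"):
    # Fetch the readme and look for a link back to the PyPI project page.
    url = "https://raw.githubusercontent.com/%s/%s/README.md" % (slug, branch)
    r = requests.get(url, timeout=15)
    if r.status_code != 200:
        return False
    pattern = r"pypi\.org/project/%s(?![\w.-])" % re.escape(pypi_name)
    return re.search(pattern, r.text, re.IGNORECASE) is not None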

A readme analysis project which might be useful is https://github.com/austinhle/readme-analysis - I haven't looked into it yet.

openstacksdk

tests/test_fedora.py::TestFedora::test_package__<'openstacksdk'> FAILED  [ 84%]
__________________ TestFedora.test_package__<'openstacksdk'> ___________________
self = <tests.test_fedora.TestFedora testMethod=test_package__<'openstacksdk'>>
name = 'openstacksdk'
    @foreach(names)
    def test_package(self, name):
>       self._test_names([name])
tests/test_fedora.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/utils.py:269: in _test_names
    r.raise_for_status()
...
E           requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://storyboard.openstack.org/project/openstack/openstacksdk

Store CI caches

The caches should be kept, and re-used to speed up future runs.

mysql-connector

There are three mysql-connector* packages listed in test_top.py's expected_failures.

  • "mysql-connector-python-rf",
  • "mysql-connector",
  • "mysql-connector-python",

Switch to urlextract gen_urls

Currently we use URLExtract's find_urls, which collects all URLs, whereas we only want to process some of them, stopping once we have a valid result. Using gen_urls would reduce the amount of work done in URLExtract, which would be especially helpful for msgpack-python.

However, currently the algorithm also processes all of the urlextract results in order to measure the distance of them all and find the 'best' one. That needs to be altered so it knows when the best URL has been reached even while more URLs are yet to be found. That can be done by using the GitHub repo checker, and the other SCM checkers, rather than relying on the guessing approach; see the sketch below.
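A sketch of the lazy approach; looks_like_scm stands in for the repo checkers mentioned above and is a placeholder assumption:

from urlextract import URLExtract

def first_scm_url(text, looks_like_scm):
    extractor = URLExtract()
    # gen_urls is lazy, so extraction stops as soon as a URL is verified,
    # instead of scanning the whole text up front with find_urls.
    for url in extractor.gen_urls(text):
        if looks_like_scm(url):
            return url
    return None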

Split homepage, scm and repo links

For GitHub, GitLab, and many others, the SCM and repo link are the same.

However, even for GitHub, the homepage is often different, being a Read the Docs or GitHub Pages site.

SourceForge (and Bitbucket, until June?) supports multiple VCSs, so both the type of VCS and the link are important for interacting with the repo. This info can be gleaned from the data processed while finding the SCM, but validating that may be more trouble than using a specialised library. This library can at least indicate that the link is only useful for the SCM, and describe which SCM the project uses.

The most important case, however, is results which are a webpage where no SCM can be found, such as pychart. Many tools, like RPM spec builders (upt, py2pack, etc.), are looking for a homepage only.

Another cohort is Trac, which is an SCM, but where the code repository isn't readily available.

Also, Launchpad is often used only as the SCM/bug tracker, but isn't being used as the code repo.

Filter test run to be only Python 3 PyPI packages

One way to avoid many of the "harder"/"slower" problems in CI is to require Python 3 compatibility for the test cases being run.

It is reasonable to assume that there are very few Python 3 packages on code.google.com, so that SCM host could also be disabled in order to avoid #6.

Block favicon.ico

As this is so common, and so useless, it shouldn't go through the caching process; a sketch follows.
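A minimal sketch of the short-circuit, with an assumed hook point (the real change would likely live in _caching.py or the adapters):

def should_fetch(url):
    # Drop favicon requests before they reach the HTTP/caching layer.
    return not url.rstrip("/").endswith("favicon.ico")

assert not should_fetch("https://example.com/favicon.ico")
assert should_fetch("https://example.com/setup.py")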

Move datasets.py into package

A CLI list command could iterate through a dataset and print out the SCMs as a prettytable with colour, or in a format requested by the user.
