nexb / fetchcode

A library to reliably fetch code via HTTP, FTP and version control systems. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB and other generous sponsors!

Python 99.59% Batchfile 0.16% Shell 0.20% Makefile 0.04%


fetchcode's Issues

Add package registry support for pypi packages

Consider using black for code formatting

https://github.com/psf/black

Failure trace is hard to process

>>> from fetchcode import package
>>> list(package.info('pkg:rubygems/file'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fetchcode/src/fetchcode/package.py", line 315, in get_rubygems_data_from_purl
    response = get_response(api_url)
  File "fetchcode/src/fetchcode/package.py", line 47, in get_response
    return resp.json()
  File "fetchcode/tmp/lib/python3.6/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

This failure is cryptic and does not tell me anything that can help me fix it...
In contrast, if I update the code this way:

@router.route("pkg:rubygems/.*")
def get_rubygems_data_from_purl(purl):
    """
    Generate `Package` object from the `purl` string of rubygems type.
    """
    purl = PackageURL.from_string(purl)
    name = purl.name
    api_url = f"https://rubygems.org/api/v1/gems/{name}.json"
    try:
        response = get_response(api_url)
    except Exception as e:
        raise Exception(f'Failed to fetch: {api_url}') from e

    declared_license = response.get("licenses") or None
    homepage_url = response.get("homepage_uri")
    code_view_url = response.get("source_code_uri")
    bug_tracking_url = response.get("bug_tracker_uri")
    download_url = response.get("gem_uri")
    yield Package(
        homepage_url=homepage_url,
        api_url=api_url,
        bug_tracking_url=bug_tracking_url,
        code_view_url=code_view_url,
        declared_license=declared_license,
        download_url=download_url,
        **purl.to_dict(),
    )

I now have a clean and clear failure trace that is actionable:

>>> from fetchcode import package
>>> list(package.info('pkg:rubygems/file'))
Traceback (most recent call last):
  File "fetchcode/src/fetchcode/package.py", line 316, in get_rubygems_data_from_purl
    response = get_response(api_url)
  File "fetchcode/src/fetchcode/package.py", line 47, in get_response
    return resp.json()
  File "fetchcode/tmp/lib/python3.6/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fetchcode/src/fetchcode/package.py", line 318, in get_rubygems_data_from_purl
    raise Exception(f'Failed to fetch: {api_url}') from e
Exception: Failed to fetch: https://rubygems.org/api/v1/gems/file.json

We should test and ensure we have informative traces in all our functions
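A minimal sketch of how the same wrapping could be shared across functions instead of repeated in each one. `FetchError`, `fetch_json` and the injected `getter` are hypothetical names used for illustration, not part of fetchcode; the getter is passed in so the example runs without network access.

```python
class FetchError(Exception):
    """Hypothetical error type: the message names the URL that failed."""


def fetch_json(url, getter):
    """
    Call `getter(url)` and return its result. On any failure, raise a
    FetchError that names the URL and chains the original exception.
    """
    try:
        return getter(url)
    except Exception as e:
        raise FetchError(f"Failed to fetch: {url}") from e


def broken_getter(url):
    # Stand-in for a getter that receives a non-JSON response body.
    raise ValueError("Expecting value: line 1 column 1 (char 0)")


try:
    fetch_json("https://rubygems.org/api/v1/gems/file.json", broken_getter)
except FetchError as error:
    # The URL is in the message and the root cause stays attached.
    print(error)
    print(repr(error.__cause__))
```

A test for each function could then assert that the raised message contains the URL, which is what makes the trace actionable.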

Add VCS support for fetchcode

Add basic project skeleton

This should be modeled after something like https://github.com/nexB/license-expression/

Don't delete temp directory in fetch_via_vcs

This may cause collisions

https://github.com/nexB/scancode-toolkit/issues/2100#issue-650093067

nexB/scancode-toolkit#2100 (comment)

Originally posted by @danN4322 in nexB/scancode-toolkit#2100 (comment)

Release on Pypi

A subtask from #41: fetchcode is currently not released on PyPI. Please have a look at the skeleton repo https://github.com/nexB/skeleton and try to publish this on PyPI.

Version of rubygem not returned

>>> from fetchcode import package
>>> list(package.info('pkg:rubygems/files'))
[Package(type='rubygems', namespace=None, name='files', version=None)]

https://rubygems.org/api/v1/gems/files.json has version 0.4.0 listed... we should return this
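One possible fix, sketched with plain dicts rather than the real `Package` model: fall back to the `version` field of the API response when the purl itself carries no version. `rubygems_package_data` is a hypothetical helper, and the sample response below only contains the one field mentioned above.

```python
def rubygems_package_data(name, response, version=None):
    """
    Build package fields for a gem. If the purl had no version, use the
    version reported by the rubygems API response instead.
    """
    return {
        "type": "rubygems",
        "namespace": None,
        "name": name,
        # Prefer the version requested in the purl, else the API's version.
        "version": version or response.get("version"),
    }


# Sample response reduced to the field discussed in this issue.
sample_response = {"version": "0.4.0"}
print(rubygems_package_data("files", sample_response))
```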

Reuse/copy pip code for VCS and download

pip is a good starting point as https://github.com/pypa/pip/blob/master/src/pip/_internal/download.py is a solid and reliable download utility tested with billions of downloads.

There are a few ways to handle this:

use https://github.com/sarugaku/pip-shims and reuse pip code
copy and fork pip code
vendor pip code

Note pip also handles VCS URLs
See https://github.com/pypa/pip/tree/master/src/pip/_internal/vcs
The download location specified in SPDX is mostly derived from the pip URLs https://github.com/spdx/spdx-spec/blob/db06dc81e525e08035af34117127742337e1f1b6/chapters/3-package-information.md#37-package-download-location-

pip does not handle ftp AFAIK

Add support for Rust PURLs

Currently there is no way to convert a Rust PURL into a normalized, downloadable URL. Add support that converts the PURL into a URL and then downloads that package.
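A rough sketch of the conversion step only, with no download. It assumes the purl type for Rust crates is `cargo` and that crates.io serves downloads at `/api/v1/crates/{name}/{version}/download`; both should be checked against the purl spec and the crates.io API before relying on them. `cargo_purl_to_download_url` is a hypothetical name, and the parsing is deliberately minimal (a real implementation would use packageurl-python).

```python
def cargo_purl_to_download_url(purl):
    """
    Convert a purl such as 'pkg:cargo/rand@0.7.2' to a crates.io
    download URL. Raise ValueError for anything else.
    """
    prefix = "pkg:cargo/"
    if not purl.startswith(prefix):
        raise ValueError(f"Not a cargo purl: {purl}")

    # Drop any qualifiers (?...) and subpath (#...) before splitting.
    remainder = purl[len(prefix):].split("?", 1)[0].split("#", 1)[0]
    name, _, version = remainder.partition("@")
    if not name or not version:
        raise ValueError(f"A name and a version are required: {purl}")

    # Assumed URL shape of the crates.io download endpoint.
    return f"https://crates.io/api/v1/crates/{name}/{version}/download"


print(cargo_purl_to_download_url("pkg:cargo/rand@0.7.2"))
```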

Support detecting and reporting available versions/tag from VCS

See for instance this https://github.com/packit-service/ogr

Add purl to URL support for GitHub

Improve current API by adding more tests

In PR #4 a basic API was built, but it needs more tests. This issue can be solved by adding more tests for it and making the API more foolproof.

Add Package Registry support for Rust Packages

Download from http(s):// URLs

Part of phase 1 of GSoC 2020

fetchcode needs to be able to download HTTP and HTTPS urls in the form of http:// or https://
We will need to determine the behavior of what to do when the URL doesn't point to a specific file, among other things.
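One possible behavior for the "no specific file" case, as a sketch: derive the filename from the last path segment of the URL and fall back to a default name when that segment is empty. `filename_from_url` and the `index.html` default are assumptions for illustration, not a decision made in this issue.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="index.html"):
    """
    Return the last path segment of `url` as a filename, or `default`
    when the URL path is empty or ends with a slash.
    """
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    return name or default


print(filename_from_url("https://example.com/downloads/archive.tar.gz"))
print(filename_from_url("https://example.com/downloads/"))
```

A fuller version would probably also look at the Content-Disposition response header before falling back to the URL path.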

@pombredanne anything in particular to add here?

Add package registry support for Homebrew packages

Have a look at fetchcode/package.py, and try to implement the same for Homebrew packages. Sample API URL: https://formulae.brew.sh/api/formula/a2ps.json

Review pip vendoring

In #58 there are several fixes made to pip... yet pip was meant to be vendored as-is and unmodified.
We also lost history in some move... we should clean this up and vendor pip automatically (with vendy that we use already for typecode)

Add purl to URL support for Gitlab

fetchcode takes a long time to download

Sometimes we just need a small part of a repository such as https://github.com/github/advisory-database
We only need the github-reviewed part (57.2 MB), but fetch_via_vcs downloads 2.00 GiB.
The problem gets worse when we start rerunning the importer and debugging it.
At the least, we could have something like a cache that saves the repo so we can use it again.
https://github.com/github/advisory-database/tree/main/advisories/github-reviewed
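A sketch of the cache idea, assuming a cache directory keyed by a hash of the repository URL. `cached_repo_dir`, `fetch_with_cache` and the injected `clone` callable are hypothetical; the actual clone (for instance via fetch_via_vcs) is left to the caller so the example stays self-contained.

```python
import hashlib
import os


def cached_repo_dir(url, cache_root):
    """Return a stable per-URL directory path under `cache_root`."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_root, digest)


def fetch_with_cache(url, cache_root, clone):
    """
    Return the local directory for `url`. Call `clone(url, target)` only
    when the repository is not already present in the cache.
    """
    target = cached_repo_dir(url, cache_root)
    if not os.path.isdir(target):
        os.makedirs(cache_root, exist_ok=True)
        clone(url, target)
    return target
```

This only avoids repeated downloads on reruns. Fetching just the needed subdirectory would be a separate change.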

nexB/vulnerablecode/issues/902

Support git URLs

See for instance https://github.com/coala/git-url-parse

Make structure of data for package registry

Align models with ScanCode latest

The packagedcode models are vendored as a convenience, but they have changed a lot since.
I suggest that we externalize the SCTK packagedcode for reuse.
To discuss.

Add purl to URL support for BitBucket

RFC: Roadmap for FetchCode next steps

Here are some of the TODOs and directions we could go in no specific order, based on a chat between @TG1999 and @pombredanne

Release on PyPI
Integration in ScanCode-toolkit such that there could be a --from-url option that would fetch a URL with fetchcode (but then it may also need to be extracted before being scanned....)
... therefore an integration in https://github.com/nexB/scancode.io/ as a "pipe" to be used in pipelines may be better?
Add support for more API data fetching including package indexes.
Add support for Docker images download by moving the code from https://github.com/nexB/container-inspector/tree/develop/src/container_inspector/fetch ... or create some simple plugin architecture similar to that of scancode-toolkit such that there could be plugins for various "fetchers"
Later on an integration with an upcoming package data mining tool would be great
There is also some overlap between packageurl, fetchcode and scancode-toolkit packagedcode, and this would benefit from being re-architected and reorganized such that we have modules with clear responsibilities
https://github.com/nexB/dependentcode needs to be created and would be a major user of fetchcode (See nexB/dependentcode#1 )

take some inspiration from NixOS for different fetchers, mirrors, and so on.

NixOS has quite a few fetchers for different VCS systems as well as mirrored sites:

https://github.com/NixOS/nixpkgs/tree/master/pkgs/build-support

It might be worthwhile to spend some time understanding what they are doing there and get some inspiration. Some systems, such as GitHub, apparently do not return hashes in a consistent way per release, but it differs, so hashes that are used by NixOS are computed in a different way.

The NixOS fetchers have support for downloading from mirrors (example: sourceforge) and have supporting syntax in the Nix files for it, for example:

url = "mirror://sourceforge/scummvm/lure-${version}.zip";

which would automatically grab a mirror from the list of defined mirrors.
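To illustrate the idea in Python, here is a sketch that expands a `mirror://` URL into candidate download URLs. The mirror table is made up (the example.org hosts are placeholders), and `expand_mirror_url` is a hypothetical helper, not how Nix implements it.

```python
# Placeholder mirror list; real mirrors would come from configuration.
MIRRORS = {
    "sourceforge": [
        "https://mirror-one.example.org/sourceforge/",
        "https://mirror-two.example.org/sourceforge/",
    ],
}


def expand_mirror_url(url, mirrors=MIRRORS):
    """
    Expand 'mirror://<site>/<path>' into one candidate URL per known
    mirror of <site>. Other URLs are returned unchanged in a list.
    """
    prefix = "mirror://"
    if not url.startswith(prefix):
        return [url]
    site, _, path = url[len(prefix):].partition("/")
    if site not in mirrors:
        raise KeyError(f"No mirrors defined for: {site}")
    return [base + path for base in mirrors[site]]


for candidate in expand_mirror_url("mirror://sourceforge/scummvm/lure-1.1.zip"):
    print(candidate)
```

A fetcher could then try each candidate in order until one succeeds.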

There are also some fetchers that are a bit more obscure but which might be fun or useful to add.

Add package registry support for RPM

Have a look at fetchcode/package.py , and try to implement the same for RPM packages.

Only return requested versions...

list(package.info('pkg:npm/%40babel/[email protected]'))

returns ALL the versions even though I asked for only one version...
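A sketch of the expected behavior using plain dicts in place of `Package` objects: when the purl carries a version, only matching entries are yielded. `only_requested_version` is a hypothetical helper and the versions below are invented for the example.

```python
def only_requested_version(packages, version=None):
    """
    Yield every package when no version was requested, otherwise only
    the packages whose version matches the requested one.
    """
    for package in packages:
        if version is None or package.get("version") == version:
            yield package


# Invented sample data standing in for what the registry returns.
all_versions = [
    {"name": "example", "version": "1.0.0"},
    {"name": "example", "version": "1.1.0"},
]
print(list(only_requested_version(all_versions, version="1.1.0")))
```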

Request for new parameter to set user-defined directory

Description:

I would like to request the addition of a new parameter for the program, specifically for setting the directory to a user-defined directory. This would allow users to specify the location to fetch the files rather than being limited to the default directory /tmp. The location should default to /tmp if the user does not specify a location.

Reference:
https://github.com/nexB/fetchcode/blob/master/src/fetchcode/vcs/git.py#L39
https://github.com/nexB/fetchcode/blob/master/src/fetchcode/vcs/__init__.py#L70

The new parameter could be named directory and would be used like so: fetch_via_git/vcs(url, directory="/path/to/directory")
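A minimal sketch of how the destination could be resolved, assuming the parameter is called `directory` as proposed. `resolve_fetch_dir` is a hypothetical helper; it uses `tempfile.mkdtemp()` for the default, which creates a fresh directory under the system temp location (usually /tmp on Linux) rather than hard-coding the path.

```python
import os
import tempfile


def resolve_fetch_dir(directory=None):
    """
    Return the directory to fetch into: the user-supplied `directory`
    (created if missing) or a new temporary directory by default.
    """
    if directory is None:
        return tempfile.mkdtemp()
    os.makedirs(directory, exist_ok=True)
    return os.path.abspath(directory)
```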

I believe this feature would greatly improve the flexibility and usability of the program for many users. Please let me know if there is any additional information that you need from me. Thank you for your consideration.

Add package registry support for Debian

Have a look at fetchcode/package.py , and try to implement the same for Debian packages.

Add package registry support for Github

Consider switching to curl / wget for http(s) ftp

Why:

multipart support
auth/proxy support
inferring filenames ( what #56)

Issues:
@pombredanne mentioned it might be too heavy a tool for the task
@JonoYang mentioned overhead of doing it via subprocess

Great page comparing wget vs curl
https://daniel.haxx.se/docs/curl-vs-wget.html

@pombredanne mentioned https://aria2.github.io/ as a potential alternative which might be better than both curl and wget

Update README with development + testing instructions

fetchcode is currently missing development, installation and testing information.

Currently, I do not know how to set up fetchcode for development or how to run the test suite.

package_managers.py uses ET.ElementTree against untrusted data

The python docs mention:

Warning

The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

This is used in line 265 here:
https://github.com/nexB/vulnerablecode/blob/369897fb947584e44581df075c6e76638737f2ca/vulnerabilities/package_managers.py#L250-L266

The docs further suggest to use defusedxml instead:

defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation. Use of this package is recommended for any server code that parses untrusted XML data. The package also ships with example exploits and extended documentation on more XML exploits such as XPath injection.

Add Package Registry support for Homebrew Packages

Add package registry support for npm packages

Replace packaging with packvers

See https://github.com/nexB/fetchcode/tree/42110e9cb6017c8f85b52fe6daab0038d35e4a84/src/fetchcode/vcs/pip/_vendor/packaging

To avoid the issues of pypa/packaging#530
packaging should be replaced by https://github.com/nexB/packvers/

See also SCTK and nexB/univers#95 and nexB/python-inspector#108

Add CI test suite

We need to add CI actions that run the test suite, like we do on other repositories.

Do not duplicate scancode packagedcode models

Do not duplicate scancode packagedcode models... this was OK originally to avoid having FetchCode depend on the whole of ScanCode but we need to find a way either by vendoring them automatically or by importing them as a dep: packagedcode should be its own library?

Add package registry support for Bitbucket

Drop support for Python 2

We still have some Python2-ism like the many __future__ imports

Download from ftp:// URLs

Part of phase 1 of GSoC 2020

fetchcode needs to be able to download FTP urls in the form of ftp://

We will need to determine the behavior of what to do when the URL doesn't point to a specific file, among other things.

@pombredanne anything in particular to add here?

pip committed as a single commit

I'm creating this issue to try understanding the decisions & plans here:

Was there any reason pip was committed without using git features like subtree/submodules (or even Google's repo)?
The whole of pip is committed to the repository, while only the pip._internal.vcs package is used. Is there a plan to use more of pip's packages? Or is pip's vcs module waiting to be refactored for fetchcode, with the rest of pip to be removed?

Design initial API

The initial API should be a single fetch function that accepts a url argument, downloads the thing at that URL, and returns either the fetched data (as bytes? as text?) or the location of a file where the content has been saved, along with some extra data such as:

inferred package information (e.g. a Package URL?, etc.)
size, checksums, etc.
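To make the discussion concrete, here is one possible shape for the returned data, taking the "location of a saved file" option. `FetchResult` and `describe_fetched_file` are hypothetical names, the download itself is left out, and SHA-1 is just one checksum picked for the example.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class FetchResult:
    """Hypothetical return value of fetch(): where the content is, plus metadata."""
    url: str
    location: str
    size: int
    sha1: str


def describe_fetched_file(url, location):
    """Compute size and SHA-1 for an already-downloaded file at `location`."""
    sha1 = hashlib.sha1()
    with open(location, "rb") as fetched:
        # Read in chunks so large downloads are not loaded into memory.
        for chunk in iter(lambda: fetched.read(8192), b""):
            sha1.update(chunk)
    return FetchResult(
        url=url,
        location=location,
        size=os.path.getsize(location),
        sha1=sha1.hexdigest(),
    )
```

Inferred package information such as a Package URL could be added as another optional field later.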

Add package registry support for Maven

Have a look at fetchcode/package.py , and try to implement the same for Maven packages.

Package URL encoding should be dealt with correctly

>>> from fetchcode import package
>>> list(package.info('pkg:npm/@babel/babel-runtime'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "w421/fetchcode/src/fetchcode/package.py", line 117, in get_npm_data_from_purl
    purl = PackageURL.from_string(purl)
  File "w421/fetchcode/tmp/lib/python3.6/site-packages/packageurl/__init__.py", line 408, in from_string
    'name component: {}'.format(repr(purl)))
ValueError: purl is missing the required name component: 'pkg:npm/@babel/babel-runtime'

fails...
but IMHO it should work
this works:

list(package.info('pkg:npm/%40babel/babel-runtime'))
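As a stopgap, the input could be normalized before it reaches `PackageURL.from_string`. The sketch below only handles the npm scope case from this issue and is an assumption about where the fix might live; `encode_npm_scope` is a hypothetical helper, and the proper fix may belong in packageurl-python instead.

```python
def encode_npm_scope(purl):
    """
    Percent-encode the leading '@' of an npm scope so that
    'pkg:npm/@babel/babel-runtime' becomes 'pkg:npm/%40babel/babel-runtime'.
    Purls that are already encoded, or not npm, are returned unchanged.
    """
    prefix = "pkg:npm/"
    if purl.startswith(prefix + "@"):
        return prefix + "%40" + purl[len(prefix) + 1:]
    return purl


print(encode_npm_scope("pkg:npm/@babel/babel-runtime"))
```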

Add package registry support for rubygems packages

Do not pass silently on ClientResponseError

Several importers or fetchers fail silently on ClientResponseError such as here https://github.com/nexB/vulnerablecode/blob/24e33966bae6124e381556e520e18f1129c352fc/vulnerabilities/package_managers.py#L147

We should never fail silently unless it is really needed.

The Readme.rst formatting for code not showing correctly

The code block in the README file is not rendering correctly; checking the .rst format reported issues.
System Message: WARNING/2 (e:\bkp\linux\code\python\fetchcode\README.rst, line 15)

The code block shows
from fetchcode import fetch url = 'A Http or FTP URL' location = 'Location of file' # This returns a response object which has attributes # 'content_type' content type of the file # 'location' the absolute location of the files that was fetched # 'scheme' scheme of the URL # 'size' size of the retrieved content in bytes # 'url' fetched URL resp = fetch(url = url)

Split the functions that create URLs and fetch into two separate functions

We are going to have discrete functions each with a router that return a very specific URL or small piece of data like today. But here the functions that create URLs and fetch should be split in two so that we can have URL-only functions as explained in 2. that do not fetch anything.

Functions that only transform a PURL in a URL should be in one place (likely packageurl-python)

Then anything that fetches remote data should be of two kinds

One may return raw data from a JSON or XML API
One may return a ScanCode Package object converted from this raw data
Some basic functions that only return versions may just return lists of PURLs
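The split could look roughly like this, using rubygems as the example. All three function names are hypothetical, package data is a plain dict rather than a ScanCode Package object, and the HTTP call is injected as `get_json` so that the URL-only step clearly does no fetching.

```python
def rubygems_api_url(name):
    """URL-only step: build the API URL without any network access."""
    return f"https://rubygems.org/api/v1/gems/{name}.json"


def fetch_raw(url, get_json):
    """Fetch step: return the raw decoded JSON data for `url`."""
    return get_json(url)


def to_package_data(name, raw):
    """Conversion step: map raw API data to package fields."""
    return {
        "type": "rubygems",
        "name": name,
        "homepage_url": raw.get("homepage_uri"),
        "download_url": raw.get("gem_uri"),
    }


def fake_get_json(url):
    # Stand-in for an HTTP GET that returns decoded JSON.
    return {"homepage_uri": "https://example.org", "gem_uri": url + ".gem"}


url = rubygems_api_url("files")
raw = fetch_raw(url, fake_get_json)
print(to_package_data("files", raw))
```

Callers that only need a URL can stop after the first function, and callers that want raw API data can stop after the second.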

We need to account for repository_url and download_url qualifiers

see #93 (comment)