fetchcode's People

Contributors

aalexanderr, agustinhenze, arijitde92, armijnhemel, arnav-mandal1234, ayansinhamahapatra, bobquest33, chinyeungli, dependabot[bot], jimc404, jonoyang, keshav-space, mjherzog, pombredanne, steven-esser, swastkk, tg1999

fetchcode's Issues

Failure trace is hard to process

>>> from fetchcode import package
>>> list(package.info('pkg:rubygems/file'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fetchcode/src/fetchcode/package.py", line 315, in get_rubygems_data_from_purl
    response = get_response(api_url)
  File "fetchcode/src/fetchcode/package.py", line 47, in get_response
    return resp.json()
  File "fetchcode/tmp/lib/python3.6/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

This failure is cryptic: it tells me nothing that would help me fix it.
In contrast, if I update the code this way:

@router.route("pkg:rubygems/.*")
def get_rubygems_data_from_purl(purl):
    """
    Generate `Package` object from the `purl` string of rubygems type.
    """
    purl = PackageURL.from_string(purl)
    name = purl.name
    api_url = f"https://rubygems.org/api/v1/gems/{name}.json"
    try:
        response = get_response(api_url)
    except Exception as e:
        raise Exception(f'Failed to fetch: {api_url}') from e

    declared_license = response.get("licenses") or None
    homepage_url = response.get("homepage_uri")
    code_view_url = response.get("source_code_uri")
    bug_tracking_url = response.get("bug_tracker_uri")
    download_url = response.get("gem_uri")
    yield Package(
        homepage_url=homepage_url,
        api_url=api_url,
        bug_tracking_url=bug_tracking_url,
        code_view_url=code_view_url,
        declared_license=declared_license,
        download_url=download_url,
        **purl.to_dict(),
    )

I now have a clean and clear failure trace that is actionable:

>>> from fetchcode import package
>>> list(package.info('pkg:rubygems/file'))
Traceback (most recent call last):
  File "fetchcode/src/fetchcode/package.py", line 316, in get_rubygems_data_from_purl
    response = get_response(api_url)
  File "fetchcode/src/fetchcode/package.py", line 47, in get_response
    return resp.json()
  File "fetchcode/tmp/lib/python3.6/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".pyenv/versions/3.6.10/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fetchcode/src/fetchcode/package.py", line 318, in get_rubygems_data_from_purl
    raise Exception(f'Failed to fetch: {api_url}') from e
Exception: Failed to fetch: https://rubygems.org/api/v1/gems/file.json

We should test and ensure that all our functions produce informative traces.
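
One way such a test could look, as a minimal sketch: the wrapper and the fake fetcher below are hypothetical names, not part of fetchcode, and the fetcher is passed in as an argument so that the check runs without network access. It verifies that the raised message names the URL and that the original error is kept as the cause.

```python
import json


def fetch_json_with_context(api_url, get_response):
    """Call get_response(api_url) and re-raise any failure with the URL attached.

    `get_response` is injected (an assumption of this sketch) so the
    behaviour can be tested without hitting a real API.
    """
    try:
        return get_response(api_url)
    except Exception as e:
        raise Exception(f"Failed to fetch: {api_url}") from e


def failing_get_response(url):
    # Simulates an API that answers with an empty, non-JSON body.
    return json.loads("")


def check_trace_is_informative():
    api_url = "https://rubygems.org/api/v1/gems/file.json"
    try:
        fetch_json_with_context(api_url, failing_get_response)
    except Exception as e:
        # The message names the URL and the decoding error is kept as the cause.
        assert api_url in str(e)
        assert isinstance(e.__cause__, json.JSONDecodeError)
        return True
    return False
```

A check of this shape could be repeated for each routed function, swapping in the URL builder for that package type.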

Reuse/copy pip code for VCS and download

pip is a good starting point as https://github.com/pypa/pip/blob/master/src/pip/_internal/download.py is a solid and reliable download utility tested with billions of downloads.

There are a few ways to handle this:

  1. use https://github.com/sarugaku/pip-shims and reuse pip code
  2. copy and fork pip code
  3. vendor pip code

Note pip also handles VCS URLs
See https://github.com/pypa/pip/tree/master/src/pip/_internal/vcs
The download location specified in SPDX is mostly derived from the pip URLs https://github.com/spdx/spdx-spec/blob/db06dc81e525e08035af34117127742337e1f1b6/chapters/3-package-information.md#37-package-download-location-

pip does not handle ftp AFAIK

Add support for Rust PURLs

There is currently no way to convert a Rust PURL into a normalized download URL. We should add support for converting such a PURL into a URL and then downloading the package.
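
A minimal sketch of the conversion step, assuming Rust packages use the `cargo` PURL type and that crates.io serves downloads at `/api/v1/crates/<name>/<version>/download`. The function name is hypothetical and the parsing is deliberately simplistic (it drops qualifiers and subpaths); real code would more likely rely on packageurl-python.

```python
from urllib.parse import quote


def cargo_purl_to_download_url(purl):
    """Convert a versioned pkg:cargo PURL into a crates.io download URL.

    Illustration only: qualifiers and subpaths are ignored, and the
    URL layout is an assumption about the crates.io API.
    """
    prefix = "pkg:cargo/"
    if not purl.startswith(prefix):
        raise ValueError(f"Not a cargo PURL: {purl!r}")
    # Strip any qualifiers ("?...") and subpath ("#...").
    rest = purl[len(prefix):].split("?")[0].split("#")[0]
    name, sep, version = rest.partition("@")
    if not name or not sep or not version:
        raise ValueError(f"A name and a version are required: {purl!r}")
    return f"https://crates.io/api/v1/crates/{quote(name)}/{quote(version)}/download"
```

For example, `cargo_purl_to_download_url("pkg:cargo/rand@0.7.2")` builds a URL for version 0.7.2 of the `rand` crate; a PURL without a version is rejected because there is nothing specific to download.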

Download from http(s):// URLs

Part of phase 1 of GSoC 2020

fetchcode needs to be able to download HTTP and HTTPS URLs in the form of http:// or https://
We will need to determine what to do when the URL doesn't point to a specific file, among other things.
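
One possible answer to the "no specific file" question, as a sketch only: derive the file name from the URL path and fall back to a default when the path is empty or ends with a slash. The function name and the `index.html` default are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="index.html"):
    """Pick a local file name for an http(s) URL.

    Falls back to `default` when the URL path does not name a file,
    for example "https://example.com/" or "https://example.com/dir/".
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    name = posixpath.basename(unquote(parsed.path))
    return name or default
```

Other choices are just as defensible, such as refusing the download or reading the name from a `Content-Disposition` response header; the point is that the behavior should be decided explicitly.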

@pombredanne anything in particular to add here?

Review pip vendoring

In #58 there are several fixes made to pip... yet pip was meant to be vendored as-is and unmodified.
We also lost history in some move... we should clean this up and vendor pip automatically (with vendy that we use already for typecode)

fetchcode takes a long time to download

Sometimes we need only a small part of a repository, such as https://github.com/github/advisory-database
We need only the github-reviewed part (57.2 MB), but fetch_via_vcs downloads 2.00 GiB.
The problem gets worse when we start rerunning the importer and debugging it.
At the least, we could have something like a cache that saves the repo so we can use it again.
https://github.com/github/advisory-database/tree/main/advisories/github-reviewed

nexB/vulnerablecode/issues/902
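
The cache idea could be sketched as follows. This is not fetchcode's API: `cached_checkout` and its `clone` argument are hypothetical, with `clone(url, location)` standing in for whatever actually performs the VCS checkout. The checkout directory is keyed by a hash of the URL, so a rerun reuses it instead of downloading again.

```python
import hashlib
import os


def cached_checkout(url, cache_root, clone):
    """Return a local checkout of `url`, cloning only on the first request.

    `clone` is a hypothetical callable taking (url, location) that must
    create `location`; it is injected so the sketch stays VCS-agnostic.
    """
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    location = os.path.join(cache_root, key)
    if not os.path.isdir(location):
        os.makedirs(cache_root, exist_ok=True)
        clone(url, location)
    return location
```

A real implementation would also need to handle interrupted clones and stale checkouts (for instance by updating an existing checkout instead of trusting it), which this sketch leaves out.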

Align models with ScanCode latest

The packagedcode models are vendored as a convenience, but they have changed a lot since.
I suggest that we externalize the SCTK packagedcode for reuse.
To discuss.

RFC: Roadmap for FetchCode next steps

Here are some of the TODOs and directions we could go in no specific order, based on a chat between @TG1999 and @pombredanne

  • Release on PyPI
  • Integration in ScanCode-toolkit such that there could be a --from-url option that would fetch a URL with fetchcode (but then it may also need to be extracted before being scanned....)
  • ... therefore an integration in https://github.com/nexB/scancode.io/ as a "pipe" to be used in pipelines may be better?
  • Add support for more API data fetching including package indexes.
  • Add support for Docker images download by moving the code from https://github.com/nexB/container-inspector/tree/develop/src/container_inspector/fetch ... or create some simple plugin architecture similar to that of scancode-toolkit such that there could be plugins for various "fetchers"
  • Later on an integration with an upcoming package data mining tool would be great
  • There is also some overlap between packageurl, fetchcode and scancode-toolkit packagedcode; these would benefit from being re-architected and reorganized so that we have modules with clear responsibilities
  • https://github.com/nexB/dependentcode needs to be created and would be a major user of fetchcode (See nexB/dependentcode#1 )

take some inspiration from NixOS for different fetchers, mirrors, and so on.

NixOS has quite a few fetchers for different VCS systems as well as mirrored sites:

https://github.com/NixOS/nixpkgs/tree/master/pkgs/build-support

It might be worthwhile to spend some time understanding what they are doing there and get some inspiration. Some systems, such as GitHub, apparently do not return consistent hashes for a given release, so the hashes used by NixOS are computed in a different way.

The NixOS fetchers have support for downloading from mirrors (example: sourceforge) and have supporting syntax in the Nix files for it, for example:

url = "mirror://sourceforge/scummvm/lure-${version}.zip";

which would automatically grab a mirror from the list of defined mirrors.
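
A rough Python equivalent of that idea, as a sketch: expand a `mirror://` URL into candidate URLs taken from a table of mirrors. The table contents below are placeholder hosts, not real SourceForge mirrors, and the function name is hypothetical.

```python
# Placeholder mirror table: the hosts are illustrative, not real mirrors.
MIRRORS = {
    "sourceforge": [
        "https://mirror-a.example.org/sourceforge/",
        "https://mirror-b.example.org/sourceforge/",
    ],
}


def expand_mirror_url(url, mirrors=MIRRORS):
    """Return the candidate download URLs for a mirror:// URL.

    URLs with any other scheme are returned unchanged, as a one-item list.
    """
    prefix = "mirror://"
    if not url.startswith(prefix):
        return [url]
    name, _, path = url[len(prefix):].partition("/")
    if name not in mirrors:
        raise KeyError(f"Unknown mirror group: {name!r}")
    return [base + path for base in mirrors[name]]
```

A fetcher could then try each candidate in turn and stop at the first one that succeeds.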

There are also some fetchers that are a bit more obscure but which might be fun or useful to add.

Request for new parameter to set user-defined directory

Description:

I would like to request a new parameter that sets the target directory to a user-defined location. This would allow users to specify where the files are fetched rather than being limited to the default directory /tmp. The location should still default to /tmp if the user does not specify one.

Reference:
https://github.com/nexB/fetchcode/blob/master/src/fetchcode/vcs/git.py#L39
https://github.com/nexB/fetchcode/blob/master/src/fetchcode/vcs/__init__.py#L70

The new parameter could be named directory and would be used like so: fetch_via_git/vcs(url, directory="/path/to/directory")
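
A minimal sketch of how the default could be handled, assuming the current behavior amounts to creating a fresh directory under the system temporary directory. `resolve_target_directory` is a hypothetical helper, not an existing fetchcode function.

```python
import os
import tempfile


def resolve_target_directory(directory=None):
    """Return the directory to fetch into.

    Without `directory`, a new directory is created under the system
    temporary directory (usually /tmp on Linux). Otherwise the given
    directory is created if needed and its absolute path is returned.
    """
    if directory is None:
        return tempfile.mkdtemp()
    os.makedirs(directory, exist_ok=True)
    return os.path.abspath(directory)
```

The fetch functions would call this helper once at the start and pass the result to the underlying VCS backend, so callers that omit `directory` see no change.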

I believe this feature would greatly improve the flexibility and usability of the program for many users. Please let me know if there is any additional information that you need from me. Thank you for your consideration.

package_managers.py uses ET.ElementTree against untrusted data

The python docs mention:

Warning

The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

This is used in line 265 here:
https://github.com/nexB/vulnerablecode/blob/369897fb947584e44581df075c6e76638737f2ca/vulnerabilities/package_managers.py#L250-L266

The docs further suggest using defusedxml instead:

defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation. Use of this package is recommended for any server code that parses untrusted XML data. The package also ships with example exploits and extended documentation on more XML exploits such as XPath injection.

Add CI test suite

We need to add CI actions that run the test suite, like we do on other repositories.

Do not duplicate scancode packagedcode models

Do not duplicate scancode packagedcode models... this was OK originally to avoid having FetchCode depend on the whole of ScanCode but we need to find a way either by vendoring them automatically or by importing them as a dep: packagedcode should be its own library?

Download from ftp:// URLs

Part of phase 1 of GSoC 2020

fetchcode needs to be able to download FTP URLs in the form of ftp://

We will need to determine the behavior of what to do when the URL doesn't point to a specific file, among other things.

@pombredanne anything in particular to add here?

pip committed as a single commit

I'm creating this issue to try understanding the decisions & plans here:

  • Was there any reason pip was committed without using git features like subtree/submodules (or even Google's repo)?
  • The whole of pip is committed to the repository, while only the pip._internal.vcs package is used. Is there a plan to use more of pip's packages? Or is pip's vcs module waiting to be refactored for fetchcode, after which the rest of pip will be removed?

Design initial API

The initial API should be a single fetch function that accepts a url argument, downloads the thing at that URL, and returns either the fetched data (as bytes? as text?) or the location of a file where the content has been saved, along with some extra data such as:

  • inferred package information (e.g. a Package URL?, etc.)
  • size, checksums, etc.
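
As a sketch of what the returned object might carry for the "location of a file" variant: the `FetchResult` name and its fields are assumptions made for illustration, and the download itself is left out so that only the metadata step is shown.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FetchResult:
    """Hypothetical return value of fetch(): where the content is, plus metadata."""
    url: str
    location: str
    size: int
    sha1: str
    sha256: str


def describe_fetched_file(url, location):
    """Compute size and checksums for an already-downloaded file."""
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    size = 0
    with open(location, "rb") as f:
        # Read in chunks so that large downloads are not loaded in memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            size += len(chunk)
            sha1.update(chunk)
            sha256.update(chunk)
    return FetchResult(
        url=url,
        location=location,
        size=size,
        sha1=sha1.hexdigest(),
        sha256=sha256.hexdigest(),
    )
```

Inferred package information, such as a Package URL, could be added as further optional fields once it is clear how it would be derived.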

Package URL encoding should be dealt with correctly

>>> from fetchcode import package
>>> list(package.info('pkg:npm/@babel/babel-runtime'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "w421/fetchcode/src/fetchcode/package.py", line 117, in get_npm_data_from_purl
    purl = PackageURL.from_string(purl)
  File "w421/fetchcode/tmp/lib/python3.6/site-packages/packageurl/__init__.py", line 408, in from_string
    'name component: {}'.format(repr(purl)))
ValueError: purl is missing the required name component: 'pkg:npm/@babel/babel-runtime'

fails...
but IMHO it should work
this works:

list(package.info('pkg:npm/%40babel/babel-runtime'))
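
Until the parser accepts the unencoded form, a caller-side workaround could look like the sketch below. `encode_npm_scope` is a hypothetical helper that percent-encodes only the leading `@` of an npm scope; it does not attempt to be a general PURL normalizer.

```python
def encode_npm_scope(purl):
    """Percent-encode the leading "@" of an npm scope in a PURL string.

    "pkg:npm/@babel/babel-runtime" becomes "pkg:npm/%40babel/babel-runtime".
    Anything else, including a version separator "@", is left untouched.
    """
    prefix = "pkg:npm/"
    if purl.startswith(prefix + "@"):
        return prefix + "%40" + purl[len(prefix) + 1:]
    return purl
```

The result can then be passed to `package.info` in place of the original string.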

The Readme.rst formatting for code not showing correctly

The code block in the README file is not rendered correctly. Checking the .rst format reported the following issue:
System Message: WARNING/2 (e:\bkp\linux\code\python\fetchcode\README.rst, line 15)

The code block shows
from fetchcode import fetch url = 'A Http or FTP URL' location = 'Location of file' # This returns a response object which has attributes # 'content_type' content type of the file # 'location' the absolute location of the files that was fetched # 'scheme' scheme of the URL # 'size' size of the retrieved content in bytes # 'url' fetched URL resp = fetch(url = url)

Split the functions that create URLs and fetch into two separate functions

We are going to have discrete functions each with a router that return a very specific URL or small piece of data like today. But here the functions that create URLs and fetch should be split in two so that we can have URL-only functions as explained in 2. that do not fetch anything.

Functions that only transform a PURL into a URL should be in one place (likely packageurl-python)

Then anything that fetches remote data should be of two kinds:

  • One may return raw data from a JSON or XML API
  • One may return a ScanCode Package object converted from this raw data

Some basic functions that only return versions may just return lists of PURLs.
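
The split could be sketched like this, using the rubygems API URL that appears elsewhere in these issues. Both function names are hypothetical, and `opener` is injected so that the fetching half can be tested offline.

```python
import json
from urllib.request import urlopen


def rubygems_api_url(name):
    """URL-only function: builds the API URL and fetches nothing."""
    return f"https://rubygems.org/api/v1/gems/{name}.json"


def fetch_json(url, opener=urlopen):
    """Fetching function: returns the raw decoded JSON found at `url`.

    `opener` defaults to urllib's urlopen; any callable returning a
    file-like context manager works, which keeps this testable offline.
    """
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

A third layer that converts the raw data into a Package object would then take the output of `fetch_json` as its input, without knowing how the URL was built.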

We need to account for repository_url and download_url qualifiers

see #93 (comment)
