clearcode-toolkit's Issues

Clearcode toolkit connection breaks

I am running clearcode following the quick start at https://github.com/nexB/clearcode-toolkit#quick-start-using-a-database-storage. It stops partway through because the database connection breaks.
Python - 3.7
OS - Debian GNU/Linux
Postgres - PostgreSQL 12.5

Saved 0 defs and harvests, in: 4 sec.
TOTAL cycles: 5603 with: 161972 defs and combined harvests, in: 6611 sec.
Cycle completed at: 2021-02-20T13:28:29.950845 Sleeping for 60 seconds...
Fetched definitions from : npm/npmjs/@popperjs/core/2.8.0 to: npm/npmjs/@fluentui/react/7.161.0
TOTAL cycles: 5604 with: 161972 defs and combined harvests, in: 6611 sec.
Traceback (most recent call last):
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
SSL connection has been closed unexpectedly
 
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "/home/tg1999/clearcode-toolkit/bin/clearsync", line 11, in <module>
    load_entry_point('clearcode-toolkit', 'console_scripts', 'clearsync')()
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 514, in cli
    for coordinate, file_path in definitions:
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 137, in fetch_and_save_latest_definitions
    saver=saver)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 258, in save_def
    return blob_path, saver(content=content, output_dir=output_dir, blob_path=blob_path)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 234, in db_saver
    cditem, created = models.CDitem.objects.get_or_create(path=blob_path)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 573, in get_or_create
    return self.get(**kwargs), False
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 425, in get
    num = len(clone)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 269, in __len__
    self._fetch_all()
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 1308, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1156, in execute_sql
    cursor.execute(sql, params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
SSL connection has been closed unexpectedly

Some definitions are not fetched correctly from ClearlyDefined and are empty

There are a few definitions that are not fetched correctly from ClearlyDefined and are stored empty.

We need to track all these oddities, as they seem to be mostly API errors that silently return an empty payload with no HTTP error instead of a correct value. An example:
https://api.clearlydefined.io/definitions/npm/npmjs/@zxing/ngx-scanner/1.0.0-dev.json

The symptom in the ClearCode DB is a mostly empty payload that is still stored as gzip-compressed content but fails to deserialize from JSON because it is empty.

There are two things to do here:

  1. One-time fixup of the data: find the empty ones that we have in the DB and refetch them from the CD API.
  2. Permanent fix: when we receive a payload, check that we can load it as JSON and that it is not empty (or only whitespace). If the check fails, wait 30 to 60 seconds and retry the fetch. If it still fails, we could set a boolean flag saying that this fetch needs to be restarted later and/or record an error for it.
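
A minimal sketch of the check-and-retry idea in item 2, not the toolkit's actual code. The names `is_valid_payload` and `fetch_with_retry` are hypothetical, `fetch` stands in for whatever callable performs the HTTP request and returns the raw (already decompressed) bytes, and the 30-second default is only the lower bound suggested above:

```python
import json
import time


def is_valid_payload(raw):
    """Return True if raw bytes hold JSON that loads and is not empty."""
    # Reject missing, empty or whitespace-only payloads up front.
    if raw is None or not raw.strip():
        return False
    try:
        data = json.loads(raw)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return False
    # Treat "{}", "[]" and "null" as empty too.
    return bool(data)


def fetch_with_retry(fetch, url, retries=1, wait=30, sleep=time.sleep):
    """
    Call fetch(url) up to retries + 1 times, waiting `wait` seconds
    between attempts. Return (raw, True) on the first valid payload,
    or (None, False) so the caller can flag the item for a later refetch.
    """
    for attempt in range(retries + 1):
        raw = fetch(url)
        if is_valid_payload(raw):
            return raw, True
        if attempt < retries:
            sleep(wait)
    return None, False
```

The `False` result is where a "needs refetch" flag or an error record could be set. The same `is_valid_payload` check could also drive the one-time fixup in item 1, by running it over the stored content.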

Note that the most likely cause may be the Cloudflare front end that CD uses, which is at best capricious.
One possible workaround to consider is to fetch from multiple IPs and hosts.

@MaJuRG how many such empty CDitems do we have roughly?

Make the cdutils.py usable as a standalone

I'd like to use the functionality of cdutils.py externally.

I think the architecture of this module could be improved to make it portable and usable outside of the whole clearcode-toolkit context.
First, the module should not depend on any external libraries (except package-url for the related features). attr, click and requests should not be requirements for this low-level functionality.

Index data from ClearCode for matching

Changes need to be made to ClearCode to facilitate indexing.

A few changes to start would be:

  • Adding last_map_date and is_mappable fields to the CDitem model
    • This is to help our matching tools know which CDitems can be mapped and which have already been processed

Add script to populate the initial DB

We need a script (possibly for one-time use) that will:

  1. walk the directory tree that contains all the (million++) files already fetched
  2. save them in the DB
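
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: `walk_and_save` and its `saver` callback are hypothetical names. In the toolkit, the callback could wrap a call such as `models.CDitem.objects.get_or_create(path=...)`, but the DB part is left out here so the sketch stays self-contained:

```python
import os


def walk_and_save(root_dir, saver):
    """
    Walk root_dir and call saver(path=..., content=...) for every file.
    Return the number of files handed to the saver.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            # Use a path relative to the root, with "/" separators,
            # so it does not depend on where the tree lives on disk.
            rel_path = os.path.relpath(full_path, root_dir).replace(os.sep, "/")
            with open(full_path, "rb") as f:
                saver(path=rel_path, content=f.read())
            count += 1
    return count
```

With millions of files, the saver would likely need to batch its writes. That is a design choice this sketch does not address.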

When running 'clearsync', --max-defs argument doesn't work

When running clearsync, the --max-defs command line argument doesn't work. This is because of a logic error here:

                    if max_def and max_def >= cycle_defs_count:
                        break

                if max_def and (max_def >= cycle_defs_count or max_def >= total_defs_count):
                    break

should be

                    if max_def and max_def <= cycle_defs_count:
                        break

                if max_def and (max_def <= cycle_defs_count or max_def <= total_defs_count):
                    break
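
To illustrate why the comparison direction matters, here is a small standalone sketch. It is not the code from sync.py: `take_until_max` is a hypothetical helper, and `count` plays the role of cycle_defs_count:

```python
def take_until_max(items, max_def=0):
    """Yield items and stop once max_def were yielded (0 means no limit)."""
    count = 0
    for item in items:
        yield item
        count += 1
        # Corrected test: stop when the count has reached the limit.
        # With ">=" instead, this would break after the first item
        # for any max_def of 1 or more.
        if max_def and max_def <= count:
            break
```

For example, `list(take_until_max(range(10), max_def=3))` stops after three items, while `max_def=0` lets all ten through.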

Not able to get any data to map CD definition with SWH

{'described': {'releaseDate': '2014-07-16', 'tools': ['scancode/3.2.2'], 'toolScore': {'total': 30, 'date': 30, 'source': 0}, 'score': {'total': 30, 'date': 30, 'source': 0}}, 'coordinates': {'type': 'sourcearchive', 'provider': 'mavencentral', 'namespace': 'za.co.monadic', 'name': 'scopus_2.10', 'revision': '0.1.5'}, 'licensed': {'toolScore': {'total': 0, 'declared': 0, 'discovered': 0, 'consistency': 0, 'spdx': 0, 'texts': 0}, 'facets': {'core': {'attribution': {'unknown': 6, 'parties': ['Copyright David Weber 2014']}, 'discovered': {'unknown': 11}, 'files': 11}}, 'score': {'total': 0, 'declared': 0, 'discovered': 0, 'consistency': 0, 'spdx': 0, 'texts': 0}}, '_meta': {'schemaVersion': '1.6.1', 'updated': '2019-05-11T20:31:18.538Z'}, 'scores': {'effective': 15, 'tool': 15}}

I got this CD definition from the clearcode_cditem table in the clearcode-toolkit DB. I was expecting some data such as hashes or a source location to map it to SWH. For other definitions I was able to find such data, but I could not find any here, so I wanted to ask why this one is different from the others.

Also synchronize or process file "attachments"

We do not sync attachments for now. These are typically top-level key files in a package that ClearlyDefined mirrors by fetching a copy directly into their database. They are problematic because they represent millions of plain text files that are serialized to JSON and that already exist in the original package anyway.
These attachments may represent a significant size and have little direct value.
If we really want them, there are two options:

  1. fetch and sync them like the other CDitems (and migrate them to the DB, as we have already fetched the vast majority of these)
  2. use a pointer to an external source such as SWH

Add script to export and import CDitems updated after a certain date

We need to add a script to export all CDitems that were added or updated after a given date. Ideally, we would want to export this data in JSON format, streamed as JSON lines.

However, JSON streaming will not work as we store binary data in the content field. We will need to find an alternative method that will:

  1. serialize binary data
  2. preferably stream that data in a segmented fashion

Two possible solutions are pickle, which is part of the standard library, or the protobuf library. pickle does not support streaming the way a JSON-lines solution would, and I have yet to look deeply into protobuf.
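
A third option, not mentioned above and offered only as a sketch, is to keep JSON lines and base64-encode the binary content field so each record is plain text. The function names and the `(path, content)` record shape are assumptions for illustration. Filtering by date is left out:

```python
import base64
import json


def export_items(items, out):
    """Write (path, content_bytes) pairs to out, one JSON object per line."""
    for path, content in items:
        record = {
            "path": path,
            # base64 turns arbitrary bytes into ASCII text that JSON can carry.
            "content": base64.b64encode(content).decode("ascii"),
        }
        out.write(json.dumps(record) + "\n")


def import_items(lines):
    """Yield (path, content_bytes) pairs back from JSON lines."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        yield record["path"], base64.b64decode(record["content"])
```

Both functions work one record at a time, so the export can be streamed or split into segments. The trade-off is that base64 makes the content roughly a third larger.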
