nexb / clearcode-toolkit
ClearCode is a simple tool to fetch and sync all ClearlyDefined data locally.
I am running ClearCode following the quick start at https://github.com/nexB/clearcode-toolkit#quick-start-using-a-database-storage. It stops partway through because the database connection breaks.
Python - 3.7
OS - Debian GNU/Linux
Postgres - PostgreSQL 12.5
Saved 0 defs and harvests, in: 4 sec.
TOTAL cycles: 5603 with: 161972 defs and combined harvests, in: 6611 sec.
Cycle completed at: 2021-02-20T13:28:29.950845 Sleeping for 60 seconds...
Fetched definitions from : npm/npmjs/@popperjs/core/2.8.0 to: npm/npmjs/@fluentui/react/7.161.0
TOTAL cycles: 5604 with: 161972 defs and combined harvests, in: 6611 sec.
Traceback (most recent call last):
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
SSL connection has been closed unexpectedly

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tg1999/clearcode-toolkit/bin/clearsync", line 11, in <module>
    load_entry_point('clearcode-toolkit', 'console_scripts', 'clearsync')()
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 514, in cli
    for coordinate, file_path in definitions:
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 137, in fetch_and_save_latest_definitions
    saver=saver)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 258, in save_def
    return blob_path, saver(content=content, output_dir=output_dir, blob_path=blob_path)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 234, in db_saver
    cditem, created = models.CDitem.objects.get_or_create(path=blob_path)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 573, in get_or_create
    return self.get(**kwargs), False
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 425, in get
    num = len(clone)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 269, in __len__
    self._fetch_all()
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 1308, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1156, in execute_sql
    cursor.execute(sql, params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
SSL connection has been closed unexpectedly
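One possible mitigation for a long-running sync is to retry the failing save after the connection drops, instead of letting the whole process exit. The sketch below is only an illustration and is not what clearsync currently does: `call_with_retry` and all of its parameters are hypothetical names. In a Django setup the retriable exception would presumably be `django.db.utils.DatabaseError` (the one in the traceback), and the `on_retry` hook could close the stale connection so that a fresh one is opened on the next query.

```python
import time


def call_with_retry(operation, retriable=(ConnectionError,), attempts=3,
                    delay=1.0, on_retry=None):
    """Call operation(), retrying when it raises one of the retriable errors.

    Hypothetical helper: none of these names come from clearcode-toolkit.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                # Out of attempts: let the caller see the original error.
                raise
            if on_retry is not None:
                # For example, drop the broken database connection here.
                on_retry()
            # Simple linear backoff between attempts.
            time.sleep(delay * attempt)


# Usage with a stand-in operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_save():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection closed unexpectedly")
    return "saved"


result = call_with_retry(flaky_save, delay=0)
```

Retrying only helps with transient drops; if the server or a connection pooler closes connections systematically, the root cause still needs to be found.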
We should set up the included Django test runner to run tests.
There are a few definitions that are not fetched correctly from ClearlyDefined and are stored empty.
We need to track all these oddities, as they seem to be mostly API errors that silently return an empty payload with no HTTP error instead of a correct value. One example:
https://api.clearlydefined.io/definitions/npm/npmjs/@zxing/ngx-scanner/1.0.0-dev.json
The symptom in the ClearCode DB is a mostly empty payload that is still stored gzip-compressed but fails to deserialize from JSON because it is empty.
There are two things to do there:
Note that the most likely cause may be the Cloudflare front end that CD uses, which is at best capricious.
One possible workaround to consider is to fetch from multiple IPs and hosts.
@MaJuRG how many such empty CDitems do we have roughly?
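To get a rough count of the affected items, a check along these lines could flag payloads that decompress but do not deserialize to a usable definition. This is a minimal sketch under assumptions: `is_empty_definition` is a hypothetical helper, the content is assumed to be gzip-compressed UTF-8 JSON as described above, and an empty JSON object is treated as "empty" too.

```python
import gzip
import json
import zlib


def is_empty_definition(compressed_content):
    """Return True if the stored payload holds no usable JSON definition.

    Hypothetical helper; assumes gzip-compressed, UTF-8 encoded JSON.
    """
    try:
        text = gzip.decompress(compressed_content).decode("utf-8")
        data = json.loads(text)
    except (OSError, EOFError, zlib.error, UnicodeDecodeError, ValueError):
        # Not valid gzip, truncated, not UTF-8, or not valid JSON.
        return True
    # Assumption: an empty object or list is as useless as no payload.
    return not data


# Stand-in payloads; in practice these would come from the CDitem content.
payloads = [
    gzip.compress(b""),
    gzip.compress(b"{}"),
    gzip.compress(b'{"coordinates": {"type": "npm"}}'),
]
empty_count = sum(is_empty_definition(p) for p in payloads)
```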
I'd like to use the functionality of cdutils.py externally.
I think the architecture of this module could be improved to make it portable and usable outside of the whole clearcode-toolkit context.
First, the module should not depend on any external libraries (except package-url for the related features). attr, click, and requests should not be requirements for this low-level functionality.
Changes need to be made to ClearCode to facilitate indexing.
A few changes to start would be:
Add last_map_date and is_mappable fields to the CDitem model
We need a script (for possibly a one time usage) that will:
When running clearsync, the --max-defs command line argument doesn't work. This is because of a logic error here:
clearcode-toolkit/src/clearcode/sync.py
Line 539 in 77e00c4
    if max_def and max_def >= cycle_defs_count:
        break
    if max_def and (max_def >= cycle_defs_count or max_def >= total_defs_count):
        break
should be
    if max_def and max_def <= cycle_defs_count:
        break
    if max_def and (max_def <= cycle_defs_count or max_def <= total_defs_count):
        break
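The difference between the two comparisons can be checked in isolation. The sketch below reuses the variable names from the snippet above, but the two helper functions are hypothetical and exist only to compare the conditions; they are not part of sync.py.

```python
def buggy_should_stop(max_def, cycle_defs_count, total_defs_count):
    # Original comparison: true as soon as the limit is at least the count,
    # which is already the case when nothing has been fetched yet.
    return bool(max_def) and (
        max_def >= cycle_defs_count or max_def >= total_defs_count
    )


def fixed_should_stop(max_def, cycle_defs_count, total_defs_count):
    # Proposed comparison: stop only once a count has reached the limit.
    return bool(max_def) and (
        max_def <= cycle_defs_count or max_def <= total_defs_count
    )


# With a limit of 10 and nothing fetched yet, the original condition
# already asks to stop, while the proposed one keeps going.
stops_too_early = buggy_should_stop(10, 0, 0)
keeps_going = not fixed_should_stop(10, 0, 0)
stops_at_limit = fixed_should_stop(10, 10, 10)
```

A `max_def` of 0 or None disables the limit in both versions, since the first operand of `and` is falsy.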
When running two instances of clearsync, django.db.utils.IntegrityError is raised when attempting to save a CDitem with the same path value.
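A common way to tolerate this kind of race is to attempt the insert, and if the unique constraint rejects it, read back the row that the other process created. The sketch below illustrates the pattern with the standard library `sqlite3` module rather than Django and PostgreSQL, so that it runs on its own; the `cditem` table layout and the `get_or_create_path` helper are assumptions made for the example, not the toolkit's schema or code.

```python
import sqlite3


def get_or_create_path(conn, path):
    """Return (row_id, created) for path, tolerating a concurrent insert.

    Hypothetical helper illustrating the insert-then-fallback pattern.
    """
    try:
        # "with conn" commits on success and rolls back on error.
        with conn:
            cursor = conn.execute(
                "INSERT INTO cditem (path) VALUES (?)", (path,)
            )
        return cursor.lastrowid, True
    except sqlite3.IntegrityError:
        # Another writer saved the same path first: reuse its row.
        row = conn.execute(
            "SELECT id FROM cditem WHERE path = ?", (path,)
        ).fetchone()
        return row[0], False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cditem (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
)

first = get_or_create_path(conn, "npm/npmjs/@popperjs/core/2.8.0")
second = get_or_create_path(conn, "npm/npmjs/@popperjs/core/2.8.0")
```

Here the second call hits the unique constraint and falls back to the existing row instead of raising.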
Some packages have been curated by someone at ClearlyDefined (https://github.com/clearlydefined/curated-data/pulls). It would be useful to know which of the packages we collect in clearsync have curated data.
For example, https://api.clearlydefined.io/curations/pypi/pypi/-/numpy/1.17.5 has been curated by someone. We would treat the license the curator set for this Package as more correct than the detected license.
{'described': {'releaseDate': '2014-07-16', 'tools': ['scancode/3.2.2'], 'toolScore': {'total': 30, 'date': 30, 'source': 0}, 'score': {'total': 30, 'date': 30, 'source': 0}}, 'coordinates': {'type': 'sourcearchive', 'provider': 'mavencentral', 'namespace': 'za.co.monadic', 'name': 'scopus_2.10', 'revision': '0.1.5'}, 'licensed': {'toolScore': {'total': 0, 'declared': 0, 'discovered': 0, 'consistency': 0, 'spdx': 0, 'texts': 0}, 'facets': {'core': {'attribution': {'unknown': 6, 'parties': ['Copyright David Weber 2014']}, 'discovered': {'unknown': 11}, 'files': 11}}, 'score': {'total': 0, 'declared': 0, 'discovered': 0, 'consistency': 0, 'spdx': 0, 'texts': 0}}, '_meta': {'schemaVersion': '1.6.1', 'updated': '2019-05-11T20:31:18.538Z'}, 'scores': {'effective': 15, 'tool': 15}}
I got this CD definition from the clearcode_cditem table in the ClearCode toolkit DB. I was expecting some data, such as hashes or a source location, to map it to SWH. For other definitions I was able to find this kind of data, but I could not find any here, so I wanted to ask why this one is different from the others.
The deps are outdated; once they are updated we should publish to PyPI.
We do not sync attachments for now. These are typically top-level key files in a package that ClearlyDefined mirrors, fetching a copy directly into its database. They are problematic because they represent millions of plain text files that are serialized to JSON and that already exist in the original package.
These attachments may represent a significant size and have little direct value.
If we really want them there are two options:
We need to add a script to export all CDitems that were added or updated after a given date. Ideally, we would want to export this data in JSON format, streamed via JSON lines.
However, JSON streaming will not work as we store binary data in the content field. We will need to find an alternative method that will:
Two possible solutions are pickle, which is a part of the standard library, or the protobuf library. pickle does not support streaming like a JSON-lines solution would, and I have yet to look deeply into protobuf.
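A further option, not among the two listed above, would be to keep JSON lines and base64-encode the binary content field so that each record stays valid JSON. This is a rough sketch under assumptions: the `export_items` and `import_items` helpers are hypothetical, items are represented as plain dicts with `path` and `content` keys, and the size overhead of base64 (roughly one third) is accepted.

```python
import base64
import io
import json


def export_items(items, stream):
    """Write one JSON object per line, with binary content as base64 text."""
    for item in items:
        record = {
            "path": item["path"],
            "content": base64.b64encode(item["content"]).decode("ascii"),
        }
        stream.write(json.dumps(record) + "\n")


def import_items(stream):
    """Yield items back from a JSON lines stream, decoding the content."""
    for line in stream:
        record = json.loads(line)
        record["content"] = base64.b64decode(record["content"])
        yield record


# Round trip through an in-memory stream with a non-text payload.
items = [{"path": "npm/npmjs/@popperjs/core/2.8.0", "content": b"\x1f\x8b\x00\xff"}]
buffer = io.StringIO()
export_items(items, buffer)
buffer.seek(0)
restored = list(import_items(buffer))
```

Because each line is an independent record, the export can be written and read incrementally without loading everything in memory.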
It would be convenient to have a function that converts from Clearlydefined coordinates to Package URLs and from Package URLs to Clearlydefined coordinates.
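Such a conversion could look roughly like the following. This is a simplified sketch that deliberately avoids the package-url library and handles only the basic `pkg:type/namespace/name@version` form; the function names and the type and provider mapping are assumptions made for illustration and cover only a few ecosystems. Since a Package URL has no provider, the reverse direction has to assume a default provider per type.

```python
from urllib.parse import quote, unquote

# Assumed, partial mapping: purl type -> (ClearlyDefined type, default provider).
PURL_TO_CD = {
    "npm": ("npm", "npmjs"),
    "pypi": ("pypi", "pypi"),
    "maven": ("maven", "mavencentral"),
    "gem": ("gem", "rubygems"),
    "cargo": ("crate", "cratesio"),
}
CD_TO_PURL = {cd_type: purl_type for purl_type, (cd_type, _) in PURL_TO_CD.items()}


def coordinates_to_purl(coordinates):
    """Convert 'type/provider/namespace/name/revision' to a Package URL."""
    parts = coordinates.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"Unsupported coordinates: {coordinates!r}")
    cd_type, _provider, namespace, name, revision = parts
    segments = [CD_TO_PURL[cd_type]]
    if namespace != "-":  # "-" marks an empty namespace in coordinates
        segments.append(quote(namespace, safe=""))
    segments.append(quote(name, safe=""))
    return "pkg:" + "/".join(segments) + "@" + quote(revision, safe="")


def purl_to_coordinates(purl):
    """Convert a simple Package URL to coordinates (default provider assumed)."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"Not a Package URL: {purl!r}")
    # Qualifiers and subpath are ignored in this sketch.
    remainder = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    purl_type, _, path = remainder.partition("/")
    path, _, version = path.rpartition("@")
    if not path or not version:
        raise ValueError("A version is required to build coordinates")
    *namespace_parts, name = path.split("/")
    namespace = "/".join(unquote(part) for part in namespace_parts) or "-"
    cd_type, provider = PURL_TO_CD[purl_type]
    return "/".join([cd_type, provider, namespace, unquote(name), unquote(version)])


npm_purl = coordinates_to_purl("npm/npmjs/@popperjs/core/2.8.0")
pypi_purl = coordinates_to_purl("pypi/pypi/-/numpy/1.17.5")
```

The round trip is lossy in one direction only: converting coordinates with a non-default provider to a Package URL and back would yield the default provider.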