samduy / provenance-analysis

Program Provenance Analysis
Package A includes some other packages, B and C, from other developers as its sub-directories:
/path/to/packageA/files
/path/to/packageA/packageB
/path/to/packageA/packageB/files
/path/to/packageA/packageC
/path/to/packageA/packageC/files
...
E.g., veil-evasion, which bundles gems as sub-directories.
Currently, only package A is detected, and the whole /path/to/packageA is considered a single package. That is correct, but it would be better if packages B and C were also detected and checked for their latest versions.
Reason: the developer of package A may be slow to update his package when B or C is updated, so there is a window for an attacker to exploit package A once security bugs in B or C have been published.
Add summary information to the result file, such as: how many packages are up to date, how many are still under active development, how many are outdated, and so on.
Scan for .git in specified directories: git_list.sh
Scan for configure.ac and Makefile.am in specified directories: autotool_list.sh
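The actual implementations are the shell scripts named above; below is a minimal Python sketch of the same scan, with the marker lists taken from the two scripts' descriptions:

```python
#!/usr/bin/env python
# Minimal sketch of the marker-file scan (the real implementations are
# git_list.sh / autotool_list.sh; output format here is an assumption).
import os
import sys

def scan(root, markers):
    """Yield directories under `root` that contain any of the marker names."""
    for dirpath, dirnames, filenames in os.walk(root):
        if any(m in dirnames or m in filenames for m in markers):
            yield dirpath

if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    for d in scan(root, ['.git']):                         # git_list.sh equivalent
        print(d)
    for d in scan(root, ['configure.ac', 'Makefile.am']):  # autotool_list.sh equivalent
        print(d)
```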
Version: v0.5
There are some duplicate items in the report that should not be there.
It would be better to show a graph (pie charts, for instance) to give the user a visual overview of the machine's current status (e.g., what percentage of programs are outdated vs. up to date).
Some steps take a very long time to finish, and sometimes the process is interrupted in the middle.
When it starts again, it should be able to resume from the point where it was forced to quit.
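A minimal resume-from-checkpoint sketch (the checkpoint file name and the string item IDs are assumptions for illustration):

```python
import os

CHECKPOINT = 'progress.checkpoint'  # assumed file name

def load_done():
    """Return the set of item IDs already processed in a previous run."""
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return set(line.strip() for line in f)

def process_all(items, process_one):
    done = load_done()
    with open(CHECKPOINT, 'a') as ckpt:
        for item in items:
            if item in done:
                continue              # already handled before the interruption
            process_one(item)
            ckpt.write(item + '\n')   # record immediately: a crash loses at most one item
            ckpt.flush()
```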
Given a set of files (e.g., an archive or a directory) find the ones with sufficiently unique-looking paths and search them on GitHub.
Current approach: choose the 3 longest paths in the package directory. This does not seem very effective, since it often picks very common names.
Even though the directory has 2 files (README*, BSD_LICENSE), it was missed in the detected-output list.
Solution: add a prefix wildcard to the LICENSE pattern so that it also detects *LICENSE.
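A sketch of the relaxed pattern (the variable names are assumptions):

```python
import re

# Before: only names starting with "LICENSE" matched. After: any prefix is
# allowed, so BSD_LICENSE, MIT-LICENSE, etc. are detected too.
OLD_PATTERN = re.compile(r'^LICENSE')    # misses BSD_LICENSE
NEW_PATTERN = re.compile(r'^.*LICENSE')  # matches *LICENSE

for name in ['LICENSE', 'LICENSE.txt', 'BSD_LICENSE', 'MIT-LICENSE']:
    print('%s -> %s' % (name, NEW_PATTERN.match(name) is not None))
```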
Each of those tools can output its own result. It would be better to integrate them into our final result as well.
$ apt list --installed
List only outdated packages:
$ apt list --installed | sed -nr 's_(.*)/(.*) (.*) (.*) (.*)upgradable to: (.*)]_\1,\3,\4,\6_p'
Implemented in: apt_check.sh
$ pip list -o
When?
$ make apt.list
Error:
dpkg-query: error: --listfiles needs a valid package name but 'gcc-4.9-base' is not: ambiguous package name 'gcc-4.9-base' with more than one installed instance
...
Traceback (most recent call last):
File "./github_latest.py", line 103, in <module>
print result
UnicodeEncodeError: 'ascii' codec can't encode characters in position 61-62: ordinal not in range(128)
When?
$ make internet_info.dat
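A minimal sketch of a likely fix, assuming github_latest.py runs under Python 2 (as the print statement in the traceback suggests): encode explicitly before printing instead of relying on the default ASCII codec.

```python
# -*- coding: utf-8 -*-
# Python 2 sketch: `print result` implicitly encodes with the ASCII codec and
# fails on non-ASCII characters (e.g. in repo descriptions). Encoding
# explicitly avoids that. `result` is a stand-in for the variable in
# github_latest.py line 103.
result = u"dnsruby \u2013 example description"
print result.encode('utf-8')
```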
First analysis:
The conditions should be combined with AND, not OR. The returned results should be very accurate, even if there are only a few.
Traceback (most recent call last):
File "./github_latest.py", line 86, in <module>
result += ",latest_release:"+latest_release+",released_date:"+rel_published_date
TypeError: cannot concatenate 'str' and 'NoneType' objects
When?
$ make internet_info.dat
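A minimal guard sketch for the line in the traceback (Python 2, matching the script; falling back to an empty string is an assumption):

```python
# Python 2 sketch: the GitHub API may return no release, so latest_release
# and rel_published_date can be None. Guard before concatenating strings.
result = "repo:example"
latest_release, rel_published_date = None, None  # e.g. no release published
result += ",latest_release:" + (latest_release or '') \
        + ",released_date:" + (rel_published_date or '')
print result
```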
[Version 0.5]
The current algorithm (A) matches both the content (a search for 3 files) and the name of the local directory against the online GitHub repository. It returns a result only if both match (the files exist and the repo name is identical).
However, there are many cases where they differ. For example, the local directory
.../dnsruby-09c3890ccfae
is different from the online repo name dnsruby.
The algorithm should be improved so that it can also detect the corresponding GitHub repo in such cases:
https://github.com/alexdalitz/dnsruby
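One possible improvement, sketched below: strip a trailing "-&lt;hex&gt;" suffix (which looks like an embedded commit hash in the dnsruby example) before comparing the local directory name with the repo name. The suffix pattern is an assumption:

```python
import re

def normalize_dir_name(name):
    """Strip a trailing '-<hex>' suffix (e.g. an embedded commit hash) before
    comparing a local directory name against a GitHub repo name."""
    return re.sub(r'-[0-9a-f]{6,}$', '', name)

print(normalize_dir_name('dnsruby-09c3890ccfae'))  # -> dnsruby
```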
Extend the coverage on Linux to other classes of installed files. What about Python libraries (e.g., installed with pip), PHP, or other web-application libraries?
Even if some files in a directory were installed manually, if the directory itself is managed by APT (or any other package manager), it is out of scope; we don't care about that case.
[xx]: a number.
Many lines of this error are printed while running the directory-detection process.
The same applies as for other long tasks, even though this task is not that long (1 or 2 minutes to finish).
Some PIP packages do not have a directory with the same name as the package, which causes errors when searching for their installed files:
find: ‘/usr/lib/python2.7/dist-packages/backports-abc*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/backports.shutil-get-terminal-size*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/backports.ssl-match-hostname*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/CouchDB-1.0-py2.7.egg/CouchDB*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/file-magic*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/fuse-python*’: No such file or directory
find: ‘2.1,/GeoIP*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/guess-language-spirit*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/ipcalc-1.1.3-py2.7.egg/ipcalc*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/ipython-genutils*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/msgpack-python*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/ndg-httpsclient*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/NoSQLMap-0.5-py2.7.egg/NoSQLMap*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/oauthlib-1.1.2-py2.7.egg/oauthlib*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/pbkdf2-1.3-py2.7.egg/pbkdf2*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/prompt-toolkit*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/pyasn1-modules*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/pymongo-2.7.2-py2.7-linux-x86_64.egg/pymongo*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/pysnmp-apps*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/pysnmp-mibs*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-apt*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-dateutil*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-debian*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-debianbts*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-ntlm*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/requests_oauthlib-0.6.2-py2.7.egg/requests-oauthlib*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/requests-toolbelt*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/service-identity*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/tweepy-3.6.0-py2.7.egg/tweepy*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/WordHound-0.1-py2.7.egg/WordHound*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/wxPython-common*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/yara-python*’: No such file or directory
Full log here.
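Most of the failures above are separator mismatches: the PyPI distribution name uses "-" or "." while the installed directory uses "_" or no separator at all. A sketch of candidate-name generation (the strategy is an assumption; the directory roots come from the log):

```python
import glob
import os

SITE_DIRS = ['/usr/lib/python2.7/dist-packages',
             '/usr/local/lib/python2.7/dist-packages']

def candidate_dirs(dist_name):
    """PyPI distribution names use '-' and '.', but the importable package
    directory usually uses '_' or drops the separator entirely."""
    names = {dist_name,
             dist_name.replace('-', '_').replace('.', '_'),
             dist_name.replace('-', '').replace('.', '')}
    for root in SITE_DIRS:
        for name in names:
            for path in glob.glob(os.path.join(root, name + '*')):
                if os.path.isdir(path):
                    yield path

for p in candidate_dirs('backports-abc'):
    print(p)
```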
On some machines, soft links are used, so two paths point to the same file.
e.g.
In the list of files installed by PIP (pip_sorted.list):
/usr/lib/python2.7/dist-packages/olefile/olefile.py
/usr/lib/python2.7/dist-packages/olefile/__init__.pyc
/usr/lib/python2.7/dist-packages/olefile/olefile.pyc
In all_files.list:
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/__init__.py
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/olefile.py
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/olefile.pyc
As a result, they cannot eliminate each other (which they should).
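A sketch of normalizing both lists before comparison. os.path.realpath resolves symlinks; note that if the duplication comes from an overlay/bind mount rather than a symlink, a prefix-mapping table would be needed instead (this fix is an assumption):

```python
import os

def normalize(paths):
    """Resolve symlinks so the same file always yields the same key."""
    return set(os.path.realpath(p) for p in paths)

pip_files = ['/usr/lib/python2.7/dist-packages/olefile/olefile.py']
all_files = ['/lib/live/mount/persistence/mmcblk0p3/rw'
             '/usr/lib/python2.7/dist-packages/olefile/olefile.py']

# Only entries that survive normalization on both sides should be reported.
leftover = normalize(all_files) - normalize(pip_files)
```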
Beef has README, VERSION and INSTALL.txt
Pyamf has README.txt, CHANGES.txt, INSTALL.txt, LICENSE.txt, MAINTAINERS.txt, THANKS.txt
They should not be missed by the current algorithm.
Expected output result:
Path | Name | Source | Updated | Active | Local version | Latest version |
---|---|---|---|---|---|---|
/path/to/package1 | Package1 | Github | Y | Y | 1.0.2 | 1.0.2 |
/path/to/package2 | Package2 | Pip | N | Y | 0.0.2 | 1.0.5 |
Path | Name | Updated | Local version | Latest version |
---|---|---|---|---|
/path/to/package3 | Package3 | Y | 1.0.2 | 1.0.2 |
/path/to/package4 | Package4 | N | 0.0.2 | 1.0.5 |
Path | Name | Updated | Local version | Latest version |
---|---|---|---|---|
/path/to/package5 | Package5 | Y | 1.0.2 | 1.0.2 |
/path/to/package6 | Package6 | N | 0.0.2 | 1.0.5 |
Impacted version: 0.5
Phenomenon: the time to finish the whole process is too long (a few hours). One of the reasons is that there are many programs to check with Internet searches (GitHub).
The current algorithm for package-directory detection is not very good. It misses packages installed in the system, because it is based on the Modification date only.
e.g.
/path/to/directory-A/package-B
/path/to/directory-A/package-C
If both package-B and package-C were installed on the same day, it mis-recognizes directory-A as a package (which it actually is not) instead of B or C.
For modules installed by PIP, there is a better (more straightforward) way: get the information directly from PyPI:
https://pypi.org/project/<module_name>
Related Issue: #2
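PyPI also exposes a JSON endpoint, https://pypi.org/pypi/&lt;module_name&gt;/json, which is easier to consume programmatically than the project page. A minimal sketch:

```python
import json
from urllib.request import urlopen

def pypi_latest_version(module_name):
    """Query PyPI's JSON API for the latest released version of a module."""
    with urlopen('https://pypi.org/pypi/%s/json' % module_name) as resp:
        return json.load(resp)['info']['version']

print(pypi_latest_version('requests'))
```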
[BUG: when running on Eurecom machine]
In some cases, when the returned committed date is empty, it causes an error:
Traceback (most recent call last):
File "./report.py", line 64, in <module>
latest_datetime = datetime.strptime(item['committed_date'], DATETIME_FORMAT_IN)
KeyError: 'committed_date'
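A minimal guard sketch (the key name comes from the traceback; the date format and the skip-the-item policy are assumptions):

```python
from datetime import datetime

DATETIME_FORMAT_IN = '%Y-%m-%dT%H:%M:%SZ'  # assumed; the real one is in report.py
item = {}                                  # an item with no 'committed_date' key

# Use .get() instead of item['committed_date'] so a missing key does not raise.
committed = item.get('committed_date')
if committed:
    latest_datetime = datetime.strptime(committed, DATETIME_FORMAT_IN)
else:
    print('skipping item with empty committed date')
```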
Some .pyc files were actually built from .py files that were installed by APT. All of those files should therefore be considered under APT's management, and thus not of interest to us.
It should write processed data to the output file as soon as it is available, instead of waiting until the whole process is finished.
Those directories could have been mis-recognized as package directories in the previous phase, so we don't need to care about them.
Also, they generate a not-so-beautiful HTML file. It may be better to find a way to export to XML without depending on an external library.
The current report type (.txt) has some limitations: it displays only minimal information, does not support user interaction, etc.
It would be better to export the result data to XML (or JSON) to store more information. If necessary, the user could then view it in a web browser (as HTML, perhaps transformed with CSS or XSL) and click through for detailed information on each item.
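A minimal JSON-export sketch (the field names mirror the expected-output table above and are assumptions):

```python
import json

# Assumed schema, following the "Expected output result" table.
results = [
    {'path': '/path/to/package1', 'name': 'Package1', 'source': 'Github',
     'updated': True, 'active': True,
     'local_version': '1.0.2', 'latest_version': '1.0.2'},
]

with open('report.json', 'w') as f:
    json.dump(results, f, indent=2)
```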
Currently there is no effective way to auto-identify which level of sub-directory under the scanned DIR is the package folder.
We currently assume that the directory for each package is 2 levels below DIR.
For example, when we scan directory '/', the package directories we extract can be:
/usr/share/program1
/usr/share/program2
But it will also mis-recognize sub-directories of other packages as packages, such as:
/opt/program3/sub-dir1
/opt/program3/sub-dir2
The following should be the proper extraction:
/usr/share/program1
/usr/share/program2
/opt/program3
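One possible direction (a sketch only; the marker list and depth limit are assumptions): instead of hard-coding the depth at 2, treat a directory as a package root as soon as it contains a recognizable top-level marker, and only then stop descending.

```python
import os

# Assumed markers for a package root; replaces the fixed "2 levels below DIR" rule.
MARKERS = {'.git', 'setup.py', 'configure.ac', 'Makefile.am', 'README', 'LICENSE'}

def find_package_roots(root, max_depth=4):
    roots = []
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if MARKERS & (set(filenames) | set(dirnames)):
            roots.append(dirpath)
            dirnames[:] = []   # don't descend into a detected package
        elif depth >= max_depth:
            dirnames[:] = []   # give up below max_depth
    return roots
```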
Use other ways to locate files online besides GitHub (e.g., maybe we can simply search Google for the filename).
Due to a programming mistake, the process was (unnecessarily) extremely slow.
When?
$ make programs_info.dat
Current issue: the Creation date of the directory (which could serve as the Installation date of the package; there is no way to extract it directly from the Linux system, as it is not supported yet) may be the same as the Modify date of the vast majority of files and sub-folders inside it.
What can be improved: maybe combine file-search with repo-name-search? In some cases file-search does not give the correct result, but repo-name-search (by package name) could give the exact result.
E.g., in the case of CMSMap, the search query
filename:multipartpost.py path:thirdparty/multipart
...
does not give the result we wanted. (Also because the file paths, even though chosen as the longest ones in the directory, are still very common, since many repos use the same files.) The expected repo:
https://github.com/Dionach/CMSmap
Some user interaction is needed during the operation, which is bad for autonomy.
We need to find a solution to bypass this username/password mechanism.
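If the prompts come from GitHub authentication or rate limiting, one option is a personal access token sent in the Authorization header, which requires no interactive login. A sketch (the GITHUB_TOKEN environment variable is an assumption):

```python
import json
import os
from urllib.request import Request, urlopen

# A personal access token (assumed to be provided via the environment)
# authenticates GitHub API requests without any interactive prompt.
token = os.environ['GITHUB_TOKEN']
req = Request('https://api.github.com/repos/Dionach/CMSmap',
              headers={'Authorization': 'token %s' % token})
with urlopen(req) as resp:
    print(json.load(resp)['full_name'])
```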