samduy / provenance-analysis

Program Provenance Analysis
Package A includes some other packages, B and C, from other developers as its sub-directories:
/path/to/packageA/files
/path/to/packageA/packageB
/path/to/packageA/packageB/files
/path/to/packageA/packageC
/path/to/packageA/packageC/files
...
E.g., veil-evasion, which bundles gems as sub-directories.
Currently, only package A is detected, and the whole /path/to/packageA is considered a single package. That is correct, but it would be better if packages B and C were also detected and checked for their latest versions.
Reason: the developer of package A may be slow to update his package when B or C is updated, so there is a window for an attacker to exploit package A once security bugs in B or C have been published.
Add summary information to the result file, such as: how many packages are up to date, how many are still under active development, how many are outdated, and so on.
Scan for .git in specified directories: git_list.sh
Scan for configure.ac and Makefile.am in specified directories: autotool_list.sh
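The actual implementations are the shell scripts named above; below is a minimal Python sketch of the same scan, with the marker lists taken from the two scripts' descriptions:

```python
#!/usr/bin/env python
# Minimal sketch of the marker-file scan (the real implementations are
# git_list.sh / autotool_list.sh; output format here is an assumption).
import os
import sys

def scan(root, markers):
    """Yield directories under `root` that contain any of the marker names."""
    for dirpath, dirnames, filenames in os.walk(root):
        if any(m in dirnames or m in filenames for m in markers):
            yield dirpath

if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    for d in scan(root, ['.git']):                         # git_list.sh equivalent
        print(d)
    for d in scan(root, ['configure.ac', 'Makefile.am']):  # autotool_list.sh equivalent
        print(d)
```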
Version: v0.5
There are some duplicate items in the report that should not be there.
It would be better to show a graph (pie charts, for instance) to give the user a visual overview of the machine's current status (e.g., what percentage of programs are outdated vs. up to date).
Some steps take a very long time to finish, and sometimes the process is interrupted in the middle.
When it starts again, it should be able to resume from the point where it was forced to quit.
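A minimal resume-from-checkpoint sketch (the checkpoint file name and the string item IDs are assumptions for illustration):

```python
import os

CHECKPOINT = 'progress.checkpoint'  # assumed file name

def load_done():
    """Return the set of item IDs already processed in a previous run."""
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return set(line.strip() for line in f)

def process_all(items, process_one):
    done = load_done()
    with open(CHECKPOINT, 'a') as ckpt:
        for item in items:
            if item in done:
                continue              # already handled before the interruption
            process_one(item)
            ckpt.write(item + '\n')   # record immediately: a crash loses at most one item
            ckpt.flush()
```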
Given a set of files (e.g., an archive or a directory) find the ones with sufficiently unique-looking paths and search them on GitHub.
Current approach: choose the 3 longest paths in the package directory. This does not seem very effective, since it often picks very common names.
Even though the directory has 2 files (README*, BSD_LICENSE), it was missed in the detected-output list.
Solution: add a prefix wildcard to the LICENSE pattern so that it also detects *LICENSE.
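A sketch of the relaxed pattern (the variable names are assumptions):

```python
import re

# Before: only names starting with "LICENSE" matched. After: any prefix is
# allowed, so BSD_LICENSE, MIT-LICENSE, etc. are detected too.
OLD_PATTERN = re.compile(r'^LICENSE')    # misses BSD_LICENSE
NEW_PATTERN = re.compile(r'^.*LICENSE')  # matches *LICENSE

for name in ['LICENSE', 'LICENSE.txt', 'BSD_LICENSE', 'MIT-LICENSE']:
    print('%s -> %s' % (name, NEW_PATTERN.match(name) is not None))
```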
Each of those tools can output its own result. It would be better to integrate them into our final result as well.
$ apt list --installed
List only outdated packages:
$ apt list --installed | sed -nr 's_(.*)/(.*) (.*) (.*) (.*)upgradable to: (.*)]_\1,\3,\4,\6_p'
Implemented in: apt_check.sh
$ pip list -o
When?
$ make apt.list
Error:
dpkg-query: error: --listfiles needs a valid package name but 'gcc-4.9-base' is not: ambiguous package name 'gcc-4.9-base' with more than one installed instance
...
Traceback (most recent call last):
File "./github_latest.py", line 103, in <module>
print result
UnicodeEncodeError: 'ascii' codec can't encode characters in position 61-62: ordinal not in range(128)
When?
$ make internet_info.dat
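A minimal sketch of a likely fix, assuming github_latest.py runs under Python 2 (as the print statement in the traceback suggests): encode explicitly before printing instead of relying on the default ASCII codec.

```python
# -*- coding: utf-8 -*-
# Python 2 sketch: `print result` implicitly encodes with the ASCII codec and
# fails on non-ASCII characters (e.g. in repo descriptions). Encoding
# explicitly avoids that. `result` is a stand-in for the variable in
# github_latest.py line 103.
result = u"dnsruby \u2013 example description"
print result.encode('utf-8')
```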
First analysis:
The conditions should be combined with AND, not OR. The returned results should be very accurate, even if there are only a few.
Traceback (most recent call last):
File "./github_latest.py", line 86, in <module>
result += ",latest_release:"+latest_release+",released_date:"+rel_published_date
TypeError: cannot concatenate 'str' and 'NoneType' objects
When?
$ make internet_info.dat
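A minimal guard sketch for the line in the traceback (Python 2, matching the script; falling back to an empty string is an assumption):

```python
# Python 2 sketch: the GitHub API may return no release, so latest_release
# and rel_published_date can be None. Guard before concatenating strings.
result = "repo:example"
latest_release, rel_published_date = None, None  # e.g. no release published
result += ",latest_release:" + (latest_release or '') \
        + ",released_date:" + (rel_published_date or '')
print result
```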
[Version 0.5]
The current algorithm (A) matches both the content (a search for 3 files) and the name of the local directory against the online GitHub repository. It returns a result only if both match (the files exist and the repo name is identical).
However, there are many cases where they differ. For example, the local directory
.../dnsruby-09c3890ccfae
is different from the online repo name dnsruby.
The algorithm should be improved so that it can also detect the corresponding GitHub repo in such cases:
https://github.com/alexdalitz/dnsruby
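One possible improvement, sketched below: strip a trailing "-&lt;hex&gt;" suffix (which looks like an embedded commit hash in the dnsruby example) before comparing the local directory name with the repo name. The suffix pattern is an assumption:

```python
import re

def normalize_dir_name(name):
    """Strip a trailing '-<hex>' suffix (e.g. an embedded commit hash) before
    comparing a local directory name against a GitHub repo name."""
    return re.sub(r'-[0-9a-f]{6,}$', '', name)

print(normalize_dir_name('dnsruby-09c3890ccfae'))  # -> dnsruby
```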
Extend the coverage on Linux to other classes of installed files. What about Python libraries (e.g., installed with pip), PHP, or other web-application libraries?
Even if some files in a directory were installed manually, if the directory itself is managed by APT (or any other package manager), it is out of scope; we don't care about that case.
[xx]: a number.
Many lines of this error are printed while running the directory-detection process.
The same applies as for other long tasks, even though this task is not that long (1 or 2 minutes to finish).
Some PIP packages do not have a directory with the same name as the package, which causes errors when searching for their installed files:
find: ‘/usr/lib/python2.7/dist-packages/backports-abc*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/backports.shutil-get-terminal-size*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/backports.ssl-match-hostname*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/CouchDB-1.0-py2.7.egg/CouchDB*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/file-magic*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/fuse-python*’: No such file or directory
find: ‘2.1,/GeoIP*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/guess-language-spirit*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/ipcalc-1.1.3-py2.7.egg/ipcalc*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/ipython-genutils*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/msgpack-python*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/ndg-httpsclient*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/NoSQLMap-0.5-py2.7.egg/NoSQLMap*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/oauthlib-1.1.2-py2.7.egg/oauthlib*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/pbkdf2-1.3-py2.7.egg/pbkdf2*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/prompt-toolkit*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/pyasn1-modules*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/pymongo-2.7.2-py2.7-linux-x86_64.egg/pymongo*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/pysnmp-apps*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/pysnmp-mibs*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-apt*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-dateutil*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-debian*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-debianbts*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-ntlm*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/requests_oauthlib-0.6.2-py2.7.egg/requests-oauthlib*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/requests-toolbelt*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/service-identity*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/tweepy-3.6.0-py2.7.egg/tweepy*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/WordHound-0.1-py2.7.egg/WordHound*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/wxPython-common*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/yara-python*’: No such file or directory
Full log here.
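Most of the failures above are separator mismatches: the PyPI distribution name uses "-" or "." while the installed directory uses "_" or no separator at all. A sketch of candidate-name generation (the strategy is an assumption; the directory roots come from the log):

```python
import glob
import os

SITE_DIRS = ['/usr/lib/python2.7/dist-packages',
             '/usr/local/lib/python2.7/dist-packages']

def candidate_dirs(dist_name):
    """PyPI distribution names use '-' and '.', but the importable package
    directory usually uses '_' or drops the separator entirely."""
    names = {dist_name,
             dist_name.replace('-', '_').replace('.', '_'),
             dist_name.replace('-', '').replace('.', '')}
    for root in SITE_DIRS:
        for name in names:
            for path in glob.glob(os.path.join(root, name + '*')):
                if os.path.isdir(path):
                    yield path

for p in candidate_dirs('backports-abc'):
    print(p)
```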
On some machines, soft links are used, so two paths point to the same file.
e.g.
In the list of files installed by PIP (pip_sorted.list):
/usr/lib/python2.7/dist-packages/olefile/olefile.py
/usr/lib/python2.7/dist-packages/olefile/__init__.pyc
/usr/lib/python2.7/dist-packages/olefile/olefile.pyc
In all_files.list:
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/__init__.py
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/olefile.py
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/olefile.pyc
As a result, they cannot eliminate each other (which they should).
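A sketch of normalizing both lists before comparison. os.path.realpath resolves symlinks; note that if the duplication comes from an overlay/bind mount rather than a symlink, a prefix-mapping table would be needed instead (this fix is an assumption):

```python
import os

def normalize(paths):
    """Resolve symlinks so the same file always yields the same key."""
    return set(os.path.realpath(p) for p in paths)

pip_files = ['/usr/lib/python2.7/dist-packages/olefile/olefile.py']
all_files = ['/lib/live/mount/persistence/mmcblk0p3/rw'
             '/usr/lib/python2.7/dist-packages/olefile/olefile.py']

# Only entries that survive normalization on both sides should be reported.
leftover = normalize(all_files) - normalize(pip_files)
```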
Beef has README, VERSION and INSTALL.txt
Pyamf has README.txt, CHANGES.txt, INSTALL.txt, LICENSE.txt, MAINTAINERS.txt, THANKS.txt
They should not be missed by the current algorithm.
Expected output result:
Path | Name | Source | Updated | Active | Local version | Latest version |
---|---|---|---|---|---|---|
/path/to/package1 | Package1 | Github | Y | Y | 1.0.2 | 1.0.2 |
/path/to/package2 | Package2 | Pip | N | Y | 0.0.2 | 1.0.5 |
Path | Name | Updated | Local version | Latest version |
---|---|---|---|---|
/path/to/package3 | Package3 | Y | 1.0.2 | 1.0.2 |
/path/to/package4 | Package4 | N | 0.0.2 | 1.0.5 |
Path | Name | Updated | Local version | Latest version |
---|---|---|---|---|
/path/to/package5 | Package5 | Y | 1.0.2 | 1.0.2 |
/path/to/package6 | Package6 | N | 0.0.2 | 1.0.5 |
Impacted version: 0.5
Phenomenon: the time to finish the whole process is too long (a few hours). One of the reasons is that there are many programs to check with Internet searches (GitHub).
The current algorithm for package-directory detection is not very good. It misses packages installed in the system, because it is based on the Modification date only.
e.g.
/path/to/directory-A/package-B
/path/to/directory-A/package-C
If both package-B and package-C were installed on the same day, it mis-recognizes directory-A as a package (which it actually is not) instead of B or C.
For modules installed by PIP, there is a better (more straightforward) way: get the information directly from PyPI:
https://pypi.org/project/<module_name>
Related Issue: #2
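PyPI also exposes a JSON endpoint, https://pypi.org/pypi/&lt;module_name&gt;/json, which is easier to consume programmatically than the project page. A minimal sketch:

```python
import json
from urllib.request import urlopen

def pypi_latest_version(module_name):
    """Query PyPI's JSON API for the latest released version of a module."""
    with urlopen('https://pypi.org/pypi/%s/json' % module_name) as resp:
        return json.load(resp)['info']['version']

print(pypi_latest_version('requests'))
```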
[BUG: when running on Eurecom machine]
In some cases, when the returned committed date is empty, it causes an error:
Traceback (most recent call last):
File "./report.py", line 64, in <module>
latest_datetime = datetime.strptime(item['committed_date'], DATETIME_FORMAT_IN)
KeyError: 'committed_date'
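A minimal guard sketch (the key name comes from the traceback; the date format and the skip-the-item policy are assumptions):

```python
from datetime import datetime

DATETIME_FORMAT_IN = '%Y-%m-%dT%H:%M:%SZ'  # assumed; the real one is in report.py
item = {}                                  # an item with no 'committed_date' key

# Use .get() instead of item['committed_date'] so a missing key does not raise.
committed = item.get('committed_date')
if committed:
    latest_datetime = datetime.strptime(committed, DATETIME_FORMAT_IN)
else:
    print('skipping item with empty committed date')
```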
Some .pyc files were actually built from .py files that were installed by APT. All of those files should therefore be considered under APT's management, and thus not of interest to us.
It should write processed data to the output file as soon as it is available, instead of waiting until the whole process is finished.
Those directories could have been mis-recognized as package directories in the previous phase, so we don't need to care about them.
Also, they generate a not-so-beautiful HTML file. It may be better to find a way to export to XML without depending on an external library.
The current report type (.txt) has some limitations: it displays only minimal information, does not support user interaction, etc.
It would be better to export the result data to XML (or JSON) to store more information. If necessary, the user could then view it in a web browser (as HTML, perhaps transformed with CSS or XSL) and click through for detailed information on each item.
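A minimal JSON-export sketch (the field names mirror the expected-output table above and are assumptions):

```python
import json

# Assumed schema, following the "Expected output result" table.
results = [
    {'path': '/path/to/package1', 'name': 'Package1', 'source': 'Github',
     'updated': True, 'active': True,
     'local_version': '1.0.2', 'latest_version': '1.0.2'},
]

with open('report.json', 'w') as f:
    json.dump(results, f, indent=2)
```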
Currently there is no effective way to auto-identify which level of sub-directory under the scanned DIR is the package folder.
We currently assume that the directory for each package is 2 levels below DIR.
For example, when we scan directory '/', the package directories we extract can be:
/usr/share/program1
/usr/share/program2
But it will also mis-recognize sub-directories of other packages as packages, such as:
/opt/program3/sub-dir1
/opt/program3/sub-dir2
The following should be the proper extraction:
/usr/share/program1
/usr/share/program2
/opt/program3
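One possible direction (a sketch only; the marker list and depth limit are assumptions): instead of hard-coding the depth at 2, treat a directory as a package root as soon as it contains a recognizable top-level marker, and only then stop descending.

```python
import os

# Assumed markers for a package root; replaces the fixed "2 levels below DIR" rule.
MARKERS = {'.git', 'setup.py', 'configure.ac', 'Makefile.am', 'README', 'LICENSE'}

def find_package_roots(root, max_depth=4):
    roots = []
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if MARKERS & (set(filenames) | set(dirnames)):
            roots.append(dirpath)
            dirnames[:] = []   # don't descend into a detected package
        elif depth >= max_depth:
            dirnames[:] = []   # give up below max_depth
    return roots
```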
Use other ways to locate files online besides GitHub (e.g., maybe we can simply search Google for the filename).
Due to a programming mistake, the process was (unnecessarily) extremely slow.
When?
$ make programs_info.dat
Current issue: the Creation date of the directory (which could serve as the Installation date of the package; there is no way to extract it directly from the Linux system, as it is not supported yet) may be the same as the Modify date of the vast majority of files and sub-folders inside it.
What can be improved: maybe combine file-search with repo-name-search? In some cases file-search does not give the correct result, but repo-name-search (by package name) could give the exact result.
E.g., in the case of CMSMap, the search query
filename:multipartpost.py path:thirdparty/multipart
...
does not give the result we wanted. (Also because the file paths, even though chosen as the longest ones in the directory, are still very common, since many repos use the same files.) The expected repo:
https://github.com/Dionach/CMSmap
Some user interaction is needed during the operation, which is bad for autonomy.
We need to find a solution to bypass this username/password mechanism.
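If the prompts come from GitHub authentication or rate limiting, one option is a personal access token sent in the Authorization header, which requires no interactive login. A sketch (the GITHUB_TOKEN environment variable is an assumption):

```python
import json
import os
from urllib.request import Request, urlopen

# A personal access token (assumed to be provided via the environment)
# authenticates GitHub API requests without any interactive prompt.
token = os.environ['GITHUB_TOKEN']
req = Request('https://api.github.com/repos/Dionach/CMSmap',
              headers={'Authorization': 'token %s' % token})
with urlopen(req) as resp:
    print(json.load(resp)['full_name'])
```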