sourcecode-ai / aura
Python source code auditing and static analysis on a large scale
License: GNU General Public License v3.0
Hi @RootLUG ,
I need a few clarifications on the questions below:
Your documentation says that Aura can analyse both binary and Python files. If I give it a source file (.py), does Aura perform its analysis by compiling the source code to bytecode (.py to .pyc), or can it work on the source code alone?
How is Aura able to construct an AST for both Python versions (2.x and 3.x) within the same installation?
When I give source code as input, it correctly finds all the detections. However, when I give the corresponding bytecode file, it shows zero (0) detections. Why is that?
Sample Case:
if test.py is an input file, aura finds 3 detections.
similarly, if test.pyc is an input file, aura finds 0 detections.
Thank you.
There is already an experimental ngram.py
in the repository root that can extract n-gram features from source code in JSON format. This extractor needs to be finished and refactored to port the changes from the new Aura v2.
The extractor should be disabled by default, as it would produce a huge amount of data that is not needed during a standard scan, but it can be enabled when collecting a dataset for ML.
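As an illustration of the idea (not the actual ngram.py implementation), extracting token-level n-grams from Python source and emitting them as JSON could look like this:

```python
import json
import tokenize
from io import BytesIO

def extract_ngrams(source: str, n: int = 3) -> str:
    """Tokenize Python source and emit token-type n-grams as JSON."""
    tokens = [
        tokenize.tok_name[tok.type]
        for tok in tokenize.tokenize(BytesIO(source.encode()).readline)
    ]
    ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return json.dumps({"n": n, "ngrams": ngrams})
```

A real extractor would likely operate on AST node types rather than raw tokens, but the sliding-window construction is the same.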
As part of the ML roadmap, add a new feature extractor to the AST visitor that extracts data suitable for code2vec and related ML tasks.
Hey @RootLUG, hope you're well. Long time, no talk.
Question for you: Is it possible to scan a specific version of a package from PyPI without downloading it locally?
For instance, something like:
docker run -ti --rm sourcecodeai/aura:dev scan pypi://requests:1.2.3 -f html > output.html
Binwalk integration is currently causing a lot of failures (both installation-wise and when running aura scan) and also incurs a big performance penalty on the overall scan time. It was mostly integrated to gather additional research data and file information, but it is not critical to Aura's functionality.
It was therefore decided that the binwalk integration will be removed from the default Aura installation. The plan is to create a new repository to which such plugins/integrations will be moved, giving users a way to install them as optional plugins, or serving as an archive of no-longer-maintained integrations/plugins.
I don't get any source or taint path in the output file. Is there a way to get the taint path (flow) from the tainted source to the sink? That would make it easier to find and fix the vulnerability.
from reproducible builds:
{
"operation": "M",
"diff": null,
"similarity": 0.0,
"a_ref": "click-8.1.3-py3-none-any.whl$click/py.typed",
"a_md5": "d41d8cd98f00b204e9800998ecf8427e",
"a_mime": "inode/x-empty",
"a_size": 0,
"b_ref": "click-8.1.3-py3-none-any.whl$click/py.typed",
"b_md5": "d41d8cd98f00b204e9800998ecf8427e",
"b_mime": "inode/x-empty",
"b_size": 0
},
An empty file is reported as differing, probably because being empty yields an empty, disjoint set in the comparison algorithm.
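One way to guard this edge case is to compare checksums before computing any set-based similarity, so identical (including identical-empty) files short-circuit to similarity 1.0. This is an illustrative sketch, not Aura's actual comparison algorithm:

```python
import hashlib

def file_similarity(a: bytes, b: bytes) -> float:
    """Similarity in [0, 1]; handle the empty/identical edge case first."""
    if hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest():
        return 1.0  # identical content, covers the two-empty-files case
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a or not set_b:
        return 0.0  # one side empty, nothing to compare
    return len(set_a & set_b) / len(set_a | set_b)
```

With this guard, the py.typed pair above (same md5, size 0 on both sides) would be classified as unchanged instead of modified.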
This repository: https://github.com/Yara-Rules/rules looks like a very good candidate for including built-in yara rules, especially the packer and obfuscation detection rules.
As this is a third-party repo, an update mechanism should be in place to provide the latest signatures without manually checking the yara rules for updates. This could ideally be accomplished by extending the aura update
command with update hooks that would allow installed plugins/analyzers to run their own update operations.
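A sketch of what such update hooks could look like; the registry and function names here are hypothetical and not Aura's actual plugin API:

```python
# Hypothetical update-hook registry; names are illustrative only.
UPDATE_HOOKS = []

def register_update_hook(func):
    """Plugins call this at import time to register their update step."""
    UPDATE_HOOKS.append(func)
    return func

def run_update():
    """Run the core update, then every plugin-provided hook."""
    results = {"core": "updated"}
    for hook in UPDATE_HOOKS:
        results[hook.__name__] = hook()
    return results

@register_update_hook
def yara_rules_update():
    # a real hook would pull https://github.com/Yara-Rules/rules
    # and recompile the signatures
    return "yara rules refreshed"
```

The core `aura update` would then only need to iterate the registry, and any installed analyzer could ship its own refresh logic.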
@RootLUG ,
If the source is in file test1.py (class A) and the sink is in another file test2.py (class B), and the user input passes from one file to the other, will it be considered a vulnerability? In short, does Aura perform taint analysis across files and across classes?
Describe the bug
The HTML report for the PyPI package faiss needs a bit more explanation. When there are no detections, it is probably worth giving the user a bit more information, something like "There were no detections."
To Reproduce
docker run -ti --rm sourcecodeai/aura:dev scan pypi://faiss -f html > output.html
Expected behavior
Expected a bit more information to provide context.
Additional context
Additionally, faiss has a pre-built binary in it. You might consider adding a detection in Aura that alerts on pre-built binaries. A user might want to know about that.
Thanks for your help, @RootLUG.
I need to embed Aura in my project. I want to run Aura from within my project without installing it. For that, I need to know the main entry-point file. Which file do I run to use Aura without installing it?
Thank you.
The default SuspiciousFile analyzer currently runs for all input, even when scanning a directory, which triggers false positives. Most SuspiciousFile detections are due to a hidden file (starting with a dot) being detected, which is suspicious inside a Python package (sdist, wheel, etc.) but completely normal when scanning, for example, a GitHub repo.
The suspicious-file scan should be triggered only when the input is an archive or a package scan, i.e. mirror:// or pypi:// URIs.
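The gating logic could be as simple as checking the input URI scheme before flagging hidden files; a minimal sketch (illustrative only, not Aura's implementation):

```python
from pathlib import Path

# URI schemes that indicate a package scan, where hidden files are suspicious
PACKAGE_SCHEMES = ("pypi://", "mirror://")

def is_suspicious_hidden_file(path: str, input_uri: str) -> bool:
    """Flag hidden files only when the scan target is a package,
    not a plain directory or repository checkout."""
    if not input_uri.startswith(PACKAGE_SCHEMES):
        return False
    return Path(path).name.startswith(".")
```

A `.travis.yml` inside a wheel would still be flagged, while the same file in a cloned GitHub repo would not.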
Aura currently has an experimental cache system that caches all input data: package/URL downloads or copies from an offline PyPI mirror. However, the cache is never cleaned up automatically; this must currently be done manually.
This feature adds an automatic cleanup system that would purge old entries from the cache under specific conditions such as:
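Such conditions could include a maximum entry age and a cap on total cache size; a sketch of one possible purge policy (illustrative only, Aura's actual conditions may differ):

```python
import time
from pathlib import Path

def purge_cache(cache_dir: Path, max_age_days: int = 30, max_total_mb: int = 500):
    """Drop entries older than max_age_days, then evict oldest-first
    until the cache fits under max_total_mb."""
    now = time.time()
    removed = []
    # age-based expiry
    for entry in sorted(cache_dir.iterdir(), key=lambda p: p.stat().st_mtime):
        if now - entry.stat().st_mtime > max_age_days * 86400:
            removed.append(entry.name)
            entry.unlink()
    # size-based eviction, oldest first
    remaining = sorted(cache_dir.iterdir(), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in remaining)
    for entry in remaining:
        if total <= max_total_mb * 1024 ** 2:
            break
        total -= entry.stat().st_size
        removed.append(entry.name)
        entry.unlink()
    return removed
```

Running this on scan startup (or via a dedicated cleanup command) would keep the cache bounded without manual intervention.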
Currently, when running aura diff
on input files/directories, everything is always scanned (mainly the AST analysis), even if nothing changed in the files. This can be optimized further by excluding files from processing when the diff has not detected any change in them.
The bug
When I run the Docker container and attempt to get HTML output, I get an error. The resulting HTML document contains "No such output format html" wrapped in ANSI colour escape codes (\x1b[31m ... \x1b[0m).
To Reproduce
Run this command:
docker run -ti --rm sourcecodeai/aura:dev scan pypi://requests -f html > output.html
Expected behavior
Expected a nicely formatted HTML document.
Additional context
I'm excited to use this feature. Nice work! Sorry if I'm doing something silly.
The documentation states that if no URI is given in the input, Aura considers the current local directory as the input folder by default. I tried this using the following command:
sudo docker run -ti --rm sourcecodeai/aura:dev scan file:///home/local/ZOHOCORP/karthick-pt5811/Downloads/hgtools-default
But it shows this error:
Invalid location provided from URI: 'file:///home/local/ZOHOCORP/karthick-pt5811/Downloads/hgtools-default
Kindly fix it. @RootLUG
(aura) blue@BluedeMacBook-Pro ~/Downloads/aura-dev aura scan /Users/blue/Documents/Malicious/dataset/pypi_unzip/raw-tool/2.0.1/raw_tool-2.0.1/setup.py
Traceback (most recent call last):
File "/Users/blue/opt/anaconda3/envs/aura/bin/aura", line 5, in <module>
from aura.cli import main
File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/cli.py", line 16, in <module>
from . import commands
File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/commands.py", line 16, in <module>
from .package_analyzer import Analyzer
File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/package_analyzer.py", line 13, in <module>
from . import utils
File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/utils.py", line 21, in <module>
from . import config
File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/config.py", line 310, in <module>
load_config()
File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/config.py", line 240, in load_config
resource.setrlimit(resource.RLIMIT_RSS, (rss, rss))
ValueError: current limit exceeds maximum limit
Why does this happen, and how can I deal with the error?
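The ValueError is raised because the requested soft RSS limit is higher than the hard limit already in effect for the process (common inside conda environments or containers). A sketch of clamping the requested limit to the hard limit before calling setrlimit, which avoids the error:

```python
import resource

def set_rss_limit(desired_bytes: int) -> int:
    """Set the soft RSS limit, clamped so it never exceeds the hard limit.

    resource.setrlimit raises ValueError when the requested limit is
    above the current hard limit; clamping sidesteps that.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_RSS)
    if hard == resource.RLIM_INFINITY:
        rss = desired_bytes
    else:
        rss = min(desired_bytes, hard)
    resource.setrlimit(resource.RLIMIT_RSS, (rss, hard))  # keep hard limit as-is
    return rss

set_rss_limit(4 * 1024 ** 3)  # e.g. request 4 GiB
```

Alternatively, lowering the configured RSS limit in Aura's config file below the environment's hard limit should have the same effect.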
Aura's diff functionality currently uses git as the underlying mechanism: it creates a repo in a temporary directory and then makes two commits with the "left-hand side" and "right-hand side" content. Diffing is then done by leveraging git's native ability to diff those two commits and detect changes between the two input sources.
Although this is simple, it is not very performant or resource-efficient, since it creates several copies of the input data. The diff functionality should be migrated from git to tlsh, which is already used elsewhere in Aura. Using tlsh, we can compute similarity pairs between input files and thus detect, in the same manner, which files are the same, changed, renamed, removed, or added. A similar approach is also used in the diffoscope project to diff input data.
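A minimal sketch of the similarity-pair classification; difflib stands in for tlsh here (the pairing idea is the same: compute a similarity score per file pair, then classify the operation), and it omits rename detection, which would additionally compare added against removed files:

```python
from difflib import SequenceMatcher

def classify_files(left: dict, right: dict) -> dict:
    """Classify each path as added (A), deleted (D), unchanged (=)
    or modified (M), with a similarity score for modified pairs.

    left/right map relative path -> file content (str).
    """
    ops = {}
    for path in set(left) | set(right):
        if path not in left:
            ops[path] = ("A", 0.0)
        elif path not in right:
            ops[path] = ("D", 0.0)
        elif left[path] == right[path]:
            ops[path] = ("=", 1.0)
        else:
            sim = SequenceMatcher(None, left[path], right[path]).ratio()
            ops[path] = ("M", round(sim, 2))
    return ops
```

Unlike the git approach, this never copies the input data; it only needs one pass to hash/compare each file.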
I'm sure you've thought of this and it would probably be a pain. But I find myself clicking on the indicators in the HTML view hoping that I get taken to a view of the code GitHub-style, so that I can do deeper code investigation. Thought I would mention it.
JS
How can I get the tainted sink for a vulnerability in the output, based on the signatures.yaml file?
For example, if subprocess.call() is configured as a tainted sink in the signatures file, where can I fetch the sink subprocess.call(...) from, so I can view it in the output (either JSON or SARIF)?
With the recent news about attacks leveraging non-ASCII characters, implement a new analyzer that would flag such characters as suspicious, namely:
This should preferably be configurable in the config file, as it can produce a lot of false positives or uninteresting results in some codebases: for example, turning it off/on completely as well as setting minimum and maximum occurrence thresholds for non-ASCII characters.
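A minimal sketch of such a detector, flagging the bidirectional control characters used in "Trojan Source" style attacks as well as any other non-ASCII character (illustrative only, not the proposed analyzer's final form):

```python
import unicodedata

# Bidirectional control characters abused in Trojan Source attacks
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

def flag_suspicious_chars(source: str):
    """Yield (line_no, col, char, reason) for suspicious characters."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                yield line_no, col, ch, "bidi control character"
            elif ord(ch) > 127:
                yield line_no, col, ch, unicodedata.name(ch, "non-ascii")
```

The min/max occurrence thresholds from the config would then be applied over the yielded hits before emitting detections.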
Hi @RootLUG ,
I am invoking Aura through Java's ProcessBuilder as 30 processes with the same zips as input. When doing this, the analysis takes much longer. If the same zip is processed by a single process, it completes within 3 minutes, but running 30 zips as 30 processes takes more than an hour.
Moreover, the zips contain further nested zips, so I used a ThreadPoolExecutor with max_workers set to 10 for the extraction alone. I have also changed max-depth in the aura_config.yaml file to 50.
Below is my modified ThreadPoolExecutor code from the package_analyzer.py file. Kindly check it and let me know why the analysis takes so much time when invoked from Java with 30 processes.
Thanks in advance!
`
@staticmethod
def scan_directory(item: base.ScanLocation):
    print(f"Collecting files in a directory '{item.str_location}'")
    # NOTE: the original code submitted the scan to the executor AND then
    # called the same function again synchronously, running the whole
    # directory scan twice per invocation
    with futures.ThreadPoolExecutor(max_workers=10) as dir_executor:
        collected = dir_executor.submit(Analyzer.scan_dir_by_ThreadPool, item).result()
    return collected
@staticmethod
def scan_dir_by_ThreadPool(item: base.ScanLocation):
"""Scanning input directory"""
topo = TopologySort()
collected = []
for f in utils.walk(item.location):
if str(f).endswith((".py", ".zip", ".jar", ".war", ".whl", ".egg", ".gz", ".tgz")):
new_item = item.create_child(f,
parent=item.parent,
strip_path=item.strip_path
)
collected.append(new_item)
topo.add_node(Path(new_item.location).absolute())
logger.debug("Computing import graph")
for x in collected:
if not x.metadata.get('py_imports'):
continue
node = Path(x.location).absolute()
topo.add_edge(node, x.metadata['py_imports']['dependencies'])
topology = topo.sort()
collected.sort(
key=lambda x: topology.index(x.location) if x.location in topology else 0
)
logger.debug("Topology sorting finished")
return collected
`
Hi @RootLUG,
One thing that could be helpful is to place a numeric count of each indicator severity level next to the filter buttons in the HTML output. It would help me as a user to know which categories have many detections and which have few. Of course I can filter and then scroll, but I would prefer to see that number at a glance.
I'll keep experimenting. Feel free to ignore these ideas! Thought any feedback would be better than none.
JS
Add a raw file analyzer to the data pipeline that integrates with ClamAV for scanning input files; this would be particularly helpful during global PyPI scans.
Preliminary research, however, shows that most of the Python ClamAV bindings are very outdated and have not been updated in some time. PyClamd (https://xael.org/pages/pyclamd-en.html) appears to be the most widely used, but its Bitbucket repo is a dead end (404); the best bet might be to fork it or create our own ClamAV binding.
Aura supports defining output plugins to emit data in various formats. Several output formats are already built in, such as JSON, SQLite, and text; however, documentation on how to write output plugins or extend Aura with them is currently missing. This should be documented with an example/tutorial.
I'm wondering if the current implementation of the project supports value-level taint analysis. It seems that the propagation of tainted values is only done through the TaintLog object. Based on my analysis, it appears that the current implementation only propagates a binary "tainted or not" flag, but I'm curious if it's possible to perform more fine-grained analysis, such as tracking the actual values of tainted data.
If the goal is to analyze more complex models, I'm concerned that the current logging mechanism may not be sufficient. Can you provide more information on how the project handles taint analysis, and whether value-level analysis is supported? If not, are there any plans to add this functionality in the future?
When conducting a scan of the source code, Aura has a feature to "extract" blobs from the code and scan them separately.
This is very useful for, for example, scanning the content of a string passed to the eval
or exec
function, as it is parsed and analyzed in the same way as the input Python source file. Blob extraction is triggered for strings longer than the threshold specified in the config file, and in some cases it can cause performance degradation if the source file contains many long string definitions.
This process can be further optimized by not extracting blobs at locations where the diff functionality has not detected any changes.
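The core of the extraction step, collecting long string literals from the AST for separate scanning, can be sketched like this (an illustration of the mechanism, not Aura's actual implementation):

```python
import ast

def extract_blobs(source: str, min_length: int = 32):
    """Collect (lineno, value) for string constants at or above
    min_length, so each blob can be scanned as its own input file."""
    blobs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if len(node.value) >= min_length:
                blobs.append((node.lineno, node.value))
    return blobs
```

With the diff-based optimization, this walk would simply be skipped for any file the diff reports as unchanged.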
@RootLUG ,
I need the following details about the Aura AST Parsing...
1. Can I get information about how the AST is constructed in Aura?
2. What are the classes used for constructing the AST? If possible, provide the code flow.
3. Can I print an AST node or the whole AST?
Thanks in Advance!