sourcecode-ai / aura Goto Github PK

View Code? Open in Web Editor NEW

486.0 9.0 31.0 10.69 MB

Python source code auditing and static analysis on a large scale

License: GNU General Public License v3.0

Dockerfile 0.34% Python 93.78% YARA 1.33% Shell 0.40% HTML 0.53% JavaScript 3.59% CSS 0.03%

static-analysis python security-audit taint-analysis

aura's People

Contributors

Stargazers

Watchers

aura's Issues

Analyses Byte code(.pyc) or source code(.py)?

Hi @RootLUG ,

I need few clarifications on below mentioned questions:

In your document, you're entered like Aura can analyse both the binary and python files. If I'm giving a source file(.py), whether aura perform its analysis by converting(compiling) the source code to byte code(.py to .pyc) or it can perform over source code alone?
How Aura can be able to construct AST for both the Python version (py 2k & 3k) in same installation of Aura?
While giving source code as input it correctly finds all the detection. Meanwhile, I'm giving the respective byte code file, it shows zero(0) detection. Why it is so?
Sample Case:
if test.py is an input file, aura finds 3 detections.
similarly, if test.pyc is an input file, aura finds 0 detections.

Thank you.

Add an AST ngram extractor to the Aura framework

There is already an experimental ngram.py in the repository root that is able to extract n-gram features from the source code in the JSON format. This extractor needs to be finished & refactored to port the changes from the new Aura v2.

This extractor should be disabled by default as it would produce huge amounts of data that is not needed during a standard scan but can be enabled when collecting the dataset for the ML.

Add code2vec extractor for source code

As part of the ML roadmap, add a new feature extractor to the AST visitor to extract the data suitable for code2vec and related ML tasks.

Question related to scanning specific package version

Hey @RootLUG, hope you're well. Long time, no talk.

Question for you: Is it possible to scan a specific version of a package from PyPI without downloading it locally?

For instance, something like:

 docker run -ti --rm sourcecodeai/aura:dev scan pypi://requests:1.2.3 -f html > output.html

remove binwalk integration

Binwalk integration is currently causing a lot of failures (both installation wise as well as running aura scan) and also has a big performance drawback on the overall scan time. It was mostly integrated to gather additional research data and file information but is not critical to aura's functionality.

It was hence decided that binwalk integration is going to be removed from the default aura installation. The plan is to create a new repository where such plugins/integrations will be moved to provide a way for users to install them as optional plugins or as an archive of no longer maintained integrations/plugins for users.

Regarding Taint Analysis Report in output file

@RootLUG

I haven't got any source and Tainted path in the output file. Is there any possibility to get the tainted path (flow) from the tainted source to sink? So that It may be easier to find the vulnerability and fix that issue.

aura diff incorectly reports empty file

from reproducible builds:

  {
    "operation": "M",
    "diff": null,
    "similarity": 0.0,
    "a_ref": "click-8.1.3-py3-none-any.whl$click/py.typed",
    "a_md5": "d41d8cd98f00b204e9800998ecf8427e",
    "a_mime": "inode/x-empty",
    "a_size": 0,
    "b_ref": "click-8.1.3-py3-none-any.whl$click/py.typed",
    "b_md5": "d41d8cd98f00b204e9800998ecf8427e",
    "b_mime": "inode/x-empty",
    "b_size": 0
  },

empty file is reported as diffing, probably due to being empty which yields an empty disjoin set in the algorithm when comparing

include update hooks for plugins/analyzers + extend yara signatures

This repository: https://github.com/Yara-Rules/rules looks like a very good candidate for including built-in yara rules, especially the packer and obfuscation detection rules.

As this is a third-party repo, an update mechanism should be in place to provide the latest signatures without manually checking for updates in the yara rules. This could be accomplished (ideally) by extending the aura update with update hooks that would allow installed plugins/analyzers to call their own update operations.

Taint Analysis - Across file and class

@RootLUG ,

If source is on file test1.py (class A) and sink is in another file test2.py (class B), the user input passes from one file to another file, will it be considered as a vulnerability? In short, will it perform taint analysis across file and across class?

Problem Generating Report for PyPI Package faiss

Describe the bug
The HTML report for PyPI package faiss needs a bit more explanation. When there are no detections, it is probably worth providing the user a bit more information, something like "There were no detections."

To Reproduce
docker run -ti --rm sourcecodeai/aura:dev scan pypi://faiss -f html > output.html

Expected behavior
Expected a bit more information to provide context.

Additional context
Additionally, faiss has a pre-build binary in it. You might consider adding a detection in Aura that alerts for pre-built binaries. A user might want to know about that.

Thanks for your help, @RootLUG.

Main file to run AURA

@RootLUG @mirzazulfan

I need to embed AURA with my project. I want to run aura through my project without installing it. For that I need to know which is the main file to run aura. May I know which is the main file to run aura without installing it?

Thank you.

Trigger suspicious files scan only for python package scans

The default SuspiciousFile analyzer currently runs for all input, even when scanning a directory triggering false positives. Most of the SuspiciousFile detections are due to a hidden file being detected (starting with a dot) which is suspicious when inside the python package (sdist, wheel etc.) but completely normal when scanning for example a GitHub repo.

Suspicious file scan should be triggered only when the input data is an archive or a package scan - mirror:// or pypi:// URIs.

Add purging mechanism to the cache system

Aura has currently an experimental cache system that cache all the input data - package/url downloads or copies from offline pypi mirror. However the cache is never cleaned up right now and must be done to do so manually.

This feature adds an automatic cleanup system that would purge old entries from the cache under specific conditions such as:

removing entries older than X days
removing entries if the cache capacity is filled above threshold as configured in the main aura config

Skip files scanning in `aura diff` if they haven' changed

Currently, when running aura diff on input files/directories, everything is always scanned ( mainly talking about the AST analysis) even if there wasn't any change in the files. This can be optimized further optimized to exclude files from being processed by Aura if the diff hasn't detected any change.

An Issue with HTML Output

The bug
When I run the Docker container and attempt to get HTML output, I get an error. The resulting HTML document says, "�[31mNo such output format html�[0m"

To Reproduce
Run this command:
docker run -ti --rm sourcecodeai/aura:dev scan pypi://requests -f html > output.html

Expected behavior
Expected a nicely formatted HTML document.

Desktop (please complete the following information):

OS: Catalina 10.15.7
Docker: Docker version 19.03.5, build 633a0ea

Additional context
I'm excited to use this feature. Nice work! Sorry if I'm doing something silly.

Unable to run the application from a local directory in terminal

In the document, it is stated that, if we haven't given the URI in the input, by default the AURA consider the current local directory as the input folder. I have tried this using the following command.
sudo docker run -ti --rm sourcecodeai/aura:dev scan file:///home/local/ZOHOCORP/karthick-pt5811/Downloads/hgtools-default
But, it shows error as
Invalid location provided from URI: 'file:///home/local/ZOHOCORP/karthick-pt5811/Downloads/hgtools-default

Kindly fix it. @RootLUG

ValueError: current limit exceeds maximum limit

(aura)  blue@BluedeMacBook-Pro  ~/Downloads/aura-dev  aura scan /Users/blue/Documents/Malicious/dataset/pypi_unzip/raw-tool/2.0.1/raw_tool-2.0.1/setup.py
Traceback (most recent call last):
  File "/Users/blue/opt/anaconda3/envs/aura/bin/aura", line 5, in <module>
    from aura.cli import main
  File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/cli.py", line 16, in <module>
    from . import commands
  File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/commands.py", line 16, in <module>
    from .package_analyzer import Analyzer
  File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/package_analyzer.py", line 13, in <module>
    from . import utils
  File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/utils.py", line 21, in <module>
    from . import config
  File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/config.py", line 310, in <module>
    load_config()
  File "/Users/blue/opt/anaconda3/envs/aura/lib/python3.10/site-packages/aura/config.py", line 240, in load_config
    resource.setrlimit(resource.RLIMIT_RSS, (rss, rss))
ValueError: current limit exceeds maximum limit

Why? How to deal with the error?

Migrate diff functionality from git to tlsh

Aura diff functionality currently uses git as an underlying mechanism. Creating a repo in a temporary directory and then making two commits with "left-hand side content" and "right-hand side content". Diffing is then done by leveraging native git functionality of diffing those two commits to detect changes between those two input sources.

Although this is simple it is not very performant and resource-effective since it creates several copies of the input data. The diff functionality should be migrated from using diff to using tlsh which is already used in Aura at other places. Using tlsh, we can compute similarity pairs between input files and thus detect in the same manner which files are the same, changed, renamed, removed, or added. A similar approach is also used in the diffoscope project to diff input data.

Feature Request: Code Viewing Capability

@RootLUG,

I'm sure you've thought of this and it would probably be a pain. But I find myself clicking on the indicators in the HTML view hoping that I get taken to a view of the code GitHub-style, so that I can do deeper code investigation. Thought I would mention it.

Tainted Sink in output

@RootLUG

How to get the tainted sink for a vulnerability in output from the signatures.yaml file?
For Example, if subprocess.call() is a tainted sink as configured in the signatures file, from where I can fetch the sink subprocess.call(...) to view in output(either json or sarif)?

Add analyzer to report non-ascii character

With the recent news on attacks leveraging the non-ascii characters, implement a new analyzer that would flag such characters as suspicious, namely:

strings containing non-ascii characters
variable names and attribute names containing non-ascii characters

This should be preferably configurable in a config file as it can produce a lot of false-positives or uninteresting results in some codebases, for example to turn it off/on completely as well as setting a trigger for min and max occurence of non-ascii characters

Taking more time to analyse with many processes

Hi @RootLUG ,

I am invoking Aura through Java ProcessBuilder as 30 processes with same zips as input. While doing this it is taking more time for analysis. If the same zip is invoked with a single process, it is completed within 3 mins. But doing the same for 30 zips as 30 processes, it is taking more than an hour.

Moreover, The zip contains more recursive zips. So that I have used the ThreadPoolExecutors with max_workers as 10 for extraction alone. I have also changed the max-depth in aura_config.yaml file to 50.

Here, I have given the modified ThreadPoolExecutor in package_analyzer.py file. Kindly check this and let me know why it is taking too much time for analysis while invoking through Java with 30 processes.

Thanks in advance!

`
@staticmethod
def scan_directory(item: base.ScanLocation):
print(f"Collecting files in a directory '{item.str_location}")
dir_executor = futures.ThreadPoolExecutor(max_workers=10)
dir_executor.submit(Analyzer.scan_dir_by_ThreadPool, item)
collected = Analyzer.scan_dir_by_ThreadPool(item=item)
dir_executor.shutdown()
return collected

@staticmethod
def scan_dir_by_ThreadPool(item: base.ScanLocation):
    """Scanning input directory"""
    topo = TopologySort()
    collected = []
    for f in utils.walk(item.location):
        if str(f).endswith((".py",".zip",".jar",".war", ".whl", ".egg",".gz",".tgz")):
            new_item = item.create_child(f,
                parent=item.parent,
                strip_path=item.strip_path
                )
            collected.append(new_item)
            topo.add_node(Path(new_item.location).absolute())
            logger.debug("Computing import graph")
            for x in collected:
                if not x.metadata.get('py_imports'):
                    continue
                node = Path(x.location).absolute()
                topo.add_edge(node, x.metadata['py_imports']['dependencies'])
            topology = topo.sort()
            collected.sort(
                key=lambda x: topology.index(x.location) if x.location in topology else 0
            )
            logger.debug("Topology sorting finished")
    return collected

Feature Request: Show Numeric Count of Detections by Type in Color-Coded Filter Section

Hi @RootLUG,

One thing that could be helpful is to place a numeric count of each indicator severity level by the filter buttons in the HTML output. It would be helpful for me as user to know which categories have a lot and which have a little. Of course I can do the filtering and then scroll but I would prefer to be provided that number at a glance.

I'll keep experimenting. Feel free to ignore these ideas! Thought any feedback would be better than none.

Add ClamAV integration

Add raw file analyzer to data pipeline that integrates with the ClamAV for scanning input files, this would be particularly helpful during global PyPI scans.

Preliminary research however shows that most of the python ClamAV bindings are very outdated and have not been updated in some time. PyClamd (https://xael.org/pages/pyclamd-en.html) appears to be somewhat most used out there but the bitbucket repo is a dead-end (404), best bet might be to fork or create our own ClamAV binding perhaps?

Add documentation for writing output plugins

Aura has support for defining output plugins to output the data in various formats. There are already several output formats built-in such as JSON, SQLite, text etc... however the documentation on how to write or extend Aura using the output plugins is missing as of now. This should be documented with an example/tutorial.

Does the project support value-level taint analysis?

I'm wondering if the current implementation of the project supports value-level taint analysis. It seems that the propagation of tainted values is only done through the TaintLog object. Based on my analysis, it appears that the current implementation only propagates a binary "tainted or not" flag, but I'm curious if it's possible to perform more fine-grained analysis, such as tracking the actual values of tainted data.

If the goal is to analyze more complex models, I'm concerned that the current logging mechanism may not be sufficient. Can you provide more information on how the project handles taint analysis, and whether value-level analysis is supported? If not, are there any plans to add this functionality in the future?

Don't export blobs in unchanged locations

When conducting a scan of the source code, Aura has a feature to "extract" blobs from the code and scan them separately.
This is very useful to for example scan the content of a string passed to the eval or exec function as it is parsed and analyzed in the same way as the input python source file. This blob extraction is triggered for string longer than the threshold specified in the config file and in some cases could cause performance degradation if the source code file contains a lot of long string definitions.

This process can be further optimized by not extracting blobs at the locations that the diff functionality has not detected any changes.

About AST Construction

@RootLUG ,

I need the following details about the Aura AST Parsing...

1. Can I get the information about how the AST tree is constructed in Aura?
2. What are classes used for constructing the AST? If possible, provide code flow.
3. Can I print the AST node or AST tree?

Thanks in Advance!

sourcecode-ai / aura Goto Github PK

aura's People

Contributors

Stargazers

Watchers

Forkers

aura's Issues

Recommend Projects

Recommend Topics

Recommend Org