
Nosey Parker: Find secrets in textual data

Overview

Nosey Parker is a command-line tool that finds secrets and sensitive information in textual data. It is useful both for offensive and defensive security testing.

Key features:

  • It can natively scan files, directories, and Git repository history
  • It uses regular expression matching with a set of 141 patterns chosen for high signal-to-noise based on experience and feedback from offensive security engagements
  • It deduplicates its findings, grouping matches together that share the same secret, which in practice can reduce review burden by 100x or more
  • It is fast: it can scan at hundreds of megabytes per second on a single core, and is able to scan 100GB of Linux kernel source history in less than 2 minutes on an older MacBook Pro
  • It scales: it has scanned inputs as large as 20TiB during security engagements

An internal version of Nosey Parker has found secrets in hundreds of offensive security engagements at Praetorian. The internal version has additional capabilities for false positive suppression and a rule-free machine learning-based detection engine. Read more in blog posts here and here.

Installation

Homebrew formula

Nosey Parker is available in Homebrew:

$ brew install noseyparker

Prebuilt binaries

Prebuilt binaries are available for x86_64 Linux and x86_64/aarch64 macOS on the latest release page. This is a simple way to get started and will give good performance.

Docker images

A prebuilt multiplatform Docker image is available for the latest release for x86_64 and aarch64:

$ docker pull ghcr.io/praetorian-inc/noseyparker:latest

Additionally, a prebuilt Docker image is available for the most recent commit, for x86_64 only:

$ docker pull ghcr.io/praetorian-inc/noseyparker:edge

Finally, a prebuilt Alpine-based Docker image is available for the most recent commit, for x86_64 only:

$ docker pull ghcr.io/praetorian-inc/noseyparker-alpine:edge

Note: The Docker images run noticeably slower than a native binary, particularly on macOS.

Arch Linux package

Nosey Parker is available in the Arch User Repository.

Building from source

1. Install prerequisites

This has been tested with several versions of Ubuntu Linux on x86_64 and with macOS on both x86_64 and aarch64.

Required dependencies:

  • cargo: recommended approach: install from https://rustup.rs
  • cmake: needed for building the vectorscan-sys crate and some other dependencies
  • boost: needed for building the vectorscan-sys crate (supported version >=1.57)
  • git: needed for embedding version information into the noseyparker CLI
  • patch: needed for building the vectorscan-sys crate
  • pkg-config: needed for building the vectorscan-sys crate
  • sha256sum: needed for computing digests (often provided by the coreutils package)
  • zsh: needed for build scripts

2. Build using the create-release.zsh script

$ rm -rf release && ./scripts/create-release.zsh

If successful, this will produce a directory structure at release populated with release artifacts. The command-line program will be at release/bin/noseyparker.

Usage

Overview

Nosey Parker is essentially a special-purpose grep-like tool for detection of secrets. The typical workflow is three phases:

  1. Scan inputs of interest using the scan command
  2. Report details of scan results using the report command
  3. Review and triage findings

The scanning and reporting steps are implemented as separate commands because you may wish to generate several reports from one expensive scan run.

Getting help

Running the noseyparker binary without arguments prints top-level help and exits. You can get abbreviated help for a particular command by running noseyparker COMMAND -h. More detailed help is available with the help command or long-form --help option.
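For example, either of the following shows help for the scan command, following the conventions just described:

$ noseyparker scan -h
$ noseyparker help scan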

The prebuilt releases also include manpages that collect the command-line help in one place. Markdown-converted versions of these manpages are also included in the repository.

If you have a question that's not answered by this documentation, feel free to start a discussion.

Terminology and data model

The datastore

Most Nosey Parker commands use a datastore, which is a special directory that Nosey Parker uses to record its findings and maintain its internal state. A datastore will be implicitly created by the scan command if needed.
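For example, a command like the following will create the datastore np.myproject if it does not already exist (the datastore name and input path here are just placeholders):

$ noseyparker scan --datastore np.myproject some-directory/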

Blobs

Each input that Nosey Parker scans is called a blob, and has a unique blob ID, which is a SHA-1 digest computed the same way git does.

Provenance

Each blob has one or more provenance entries associated with it. A provenance entry is metadata that describes how the input was discovered, such as a file on the filesystem or an entry in Git repository history.

Rules

Nosey Parker is a rule-based system that uses regular expressions. Each rule has a single pattern with at least one capture group that isolates the match content from the surrounding context. You can list available rules with noseyparker rules list.

Rulesets

A collection of rules is organized into a ruleset. Nosey Parker's default ruleset includes rules that detect things that appear to be hardcoded secrets. Other rulesets are available; you can list them with noseyparker rules list.

Matches

When a rule's pattern matches an input, it produces a match. A match is defined by a rule, blob ID, start byte offset, and end byte offset; these fields are used to determine a unique match identifier.

Findings

Matches that were produced by the same rule and share the same capture groups are grouped into a finding. In other words, a finding is a group of matches. This is Nosey Parker's top-level unit of reporting.

Usage examples

NOTE: When using Docker...

If you are using the Docker image, replace noseyparker in the following commands with a Docker invocation that uses a mounted volume:

docker run -v "$PWD":/scan ghcr.io/praetorian-inc/noseyparker:latest <ARGS>

The Docker container runs with /scan as its working directory, so mounting $PWD at /scan in the container will make tab completion and relative paths in your command-line invocation work.
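For example, the CPython scan shown later in this section would look like this when run via Docker, assuming the cpython.git clone is in your current directory:

$ docker run -v "$PWD":/scan ghcr.io/praetorian-inc/noseyparker:latest scan --datastore np.cpython cpython.git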

Scan inputs for secrets

Filesystem content, including local Git repos

Screenshot showing Nosey Parker's workflow for scanning the filesystem for secrets

Nosey Parker has built-in support for scanning files, recursively scanning directories, and scanning the entire history of Git repositories.

For example, if you have a Git clone of CPython locally at cpython.git, you can scan its entire history with the scan command. Nosey Parker will create a new datastore at np.cpython and save its findings there. (The name np.cpython is arbitrary; it can be whatever you want.)

$ noseyparker scan --datastore np.cpython cpython.git
Found 28.30 GiB from 18 plain files and 427,712 blobs from 1 Git repos [00:00:04]
Scanning content  ████████████████████ 100%  28.30 GiB/28.30 GiB  [00:00:53]
Scanned 28.30 GiB from 427,730 blobs in 54 seconds (538.46 MiB/s); 4,904/4,904 new matches

 Rule                      Distinct Groups   Total Matches
───────────────────────────────────────────────────────────
 PEM-Encoded Private Key             1,076           1,192
 Generic Secret                        331             478
 netrc Credentials                      42           3,201
 Generic API Key                         2              31
 md5crypt Hash                           1               2

Run the `report` command next to show finding details.

Git repos given URL, GitHub username, or GitHub organization name

Nosey Parker can also scan Git repos that have not already been cloned to the local filesystem. The --git-url URL, --github-user NAME, and --github-org NAME options to scan allow you to specify repositories of interest.

For example, to scan the Nosey Parker repo itself:

$ noseyparker scan --datastore np.noseyparker --git-url https://github.com/praetorian-inc/noseyparker

For example, to scan accessible repositories belonging to octocat:

$ noseyparker scan --datastore np.noseyparker --github-user octocat

These input specifiers will use an optional GitHub token if available in the NP_GITHUB_TOKEN environment variable. Providing an access token gives a higher API rate limit and may make additional repositories accessible to you.
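For example, with a token set in the environment (the token value and organization name below are placeholders):

$ export NP_GITHUB_TOKEN=<your GitHub personal access token>
$ noseyparker scan --datastore np.myorg --github-org my-example-org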

See noseyparker help scan for more details.

Report findings

To see details of Nosey Parker's findings, use the report command. This prints out a text-based report designed for human consumption:

$ noseyparker report --datastore np.cpython
Finding 1/1452: Generic API Key
Match: QTP4LAknlFml0NuPAbCdtvH4KQaokiQE
Showing 3/29 occurrences:

    Occurrence 1:
    Git repo: clones/cpython.git
    Blob: 04144ceb957f550327637878dd99bb4734282d07
    Lines: 70:61-70:100

        e buildbottest

        notifications:
          email: false
          webhooks:
            urls:
              - https://python.zulipchat.com/api/v1/external/travis?api_key=QTP4LAknlFml0NuPAbCdtvH4KQaokiQE&stream=core%2Ftest+runs
            on_success: change
            on_failure: always
          irc:
            channels:
              # This is set to a secure vari

    Occurrence 2:
    Git repo: clones/cpython.git
    Blob: 0e24bae141ae2b48b23ef479a5398089847200b3
    Lines: 174:61-174:100

        j4 -uall,-cpu"

        notifications:
          email: false
          webhooks:
            urls:
              - https://python.zulipchat.com/api/v1/external/travis?api_key=QTP4LAknlFml0NuPAbCdtvH4KQaokiQE&stream=core%2Ftest+runs
            on_success: change
            on_failure: always
          irc:
            channels:
              # This is set to a secure vari
...

(Note: the findings above are synthetic, invalid secrets.) Additional output formats are supported, including JSON, JSON lines, and SARIF (experimental), via the --format=FORMAT option.
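For example, to write the findings as JSON to a file (the output filename is arbitrary):

$ noseyparker report --datastore np.cpython --format=json > findings.json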

Human-readable text format

Screenshot showing Nosey Parker's workflow for rendering its findings in human-readable format

JSON format

Screenshot showing Nosey Parker's workflow for rendering its findings in JSON format

Summarize findings

Nosey Parker prints out a summary of its findings when it finishes scanning. You can also run this step separately:

$ noseyparker summarize --datastore np.cpython

 Rule                      Distinct Groups   Total Matches
───────────────────────────────────────────────────────────
 PEM-Encoded Private Key             1,076           1,192
 Generic Secret                        331             478
 netrc Credentials                      42           3,201
 Generic API Key                         2              31
 md5crypt Hash                           1               2

Additional output formats are supported, including JSON and JSON lines, via the --format=FORMAT option.
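For example, to emit the summary as JSON instead of the table shown above (assuming the same json format name used by the report command):

$ noseyparker summarize --datastore np.cpython --format=json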

Enumerate repositories from GitHub

To list URLs for repositories belonging to GitHub users or organizations, use the github repos list command. This command uses the GitHub REST API to enumerate repositories belonging to one or more users or organizations. For example:

$ noseyparker github repos list --user octocat
https://github.com/octocat/Hello-World.git
https://github.com/octocat/Spoon-Knife.git
https://github.com/octocat/boysenberry-repo-1.git
https://github.com/octocat/git-consortium.git
https://github.com/octocat/hello-worId.git
https://github.com/octocat/linguist.git
https://github.com/octocat/octocat.github.io.git
https://github.com/octocat/test-repo1.git

An optional GitHub Personal Access Token can be provided via the NP_GITHUB_TOKEN environment variable. Providing an access token gives a higher API rate limit and may make additional repositories accessible to you.

Additional output formats are supported, including JSON and JSON lines, via the --format=FORMAT option.
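For example, to emit JSON Lines output for a single organization (praetorian-inc is used here purely as an example organization):

$ noseyparker github repos list --org praetorian-inc --format=jsonl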

See noseyparker help github for more details.

Integrations

Nosey Parker has a few third-party integrations:

If you have an integration you'd like to share that's not listed here, please create a PR.

Contributing

Feel free to ask questions or share ideas in the Discussions page.

Contributions are welcome, especially new regex rules. Developing new regex rules is detailed in a separate document.

If you are considering making significant code changes, please open an issue or start a discussion first.

This project has a number of pre-commit hooks enabled that you are encouraged to use. To install them in your local repo, make sure you have pre-commit installed and run:

$ pre-commit install

These checks will help to quickly detect simple errors.

License

Nosey Parker is licensed under the Apache License, Version 2.0.

Any contribution intentionally submitted for inclusion in Nosey Parker by you, as defined in the Apache 2.0 license, shall be licensed as above, without any additional terms or conditions.

Nosey Parker also includes vendored copies of several other packages released under the Apache License and other permissive licenses; see LICENSE for details.


noseyparker's Issues

Improve caching in the Docker build

The Docker image currently takes quite a long time to build (~20 minutes in GitHub Actions). A large part of that time is spent running cargo, fetching the crates.io index and downloading packages:

noseyparker/Dockerfile

Lines 29 to 30 in 9438ab2

# XXX it would be nice if this could store crates.io index and dependency builds in the Docker cache
RUN cargo build --release

Is it possible and feasible to expose the contents of the GitHub Actions cache to docker build to speed it up?

Support git cloning in parallel

The scan command currently is able to automatically clone Git repositories when invoked with the --git-url, --github-user, or --github-org arguments. This runs sequentially, and when you cast a large net (e.g., end up indirectly specifying 1000 repositories), cloning the input repositories takes the majority of the total time.

Cloning a single Git repo at a time does not appear to be network-, CPU-, or memory-bound on any system I've used. It seems that the remote server that we are cloning the repo from is the bottleneck.

It would be better if Nosey Parker could clone Git repositories in parallel — maybe a limit of 4 at a time by default.

Update vendored `vectorscan` dependency

Nosey Parker currently bundles Vectorscan 5.4.8, released in September 2022.

Vectorscan has put out three new releases in the meantime: 5.4.9, 5.4.10, and 5.4.10.1. These include additional features ported from Hyperscan, performance improvements, and bug fixes.

Nosey Parker should be updated to use the latest version. While addressing this issue, we should also take the opportunity to address #71.

Add SARIF output format for `report`

If Nosey Parker could output findings in SARIF format, it could be more easily integrated into other tools, like GitHub Code Scanning.

This would probably be exposed as a --format=sarif option for the report (and possibly summarize) commands.

Some references:

Add a command-line option to `report` to control the number of displayed matches

noseyparker report currently limits the number of displayed matches per finding to 3 in the default human format. This is to prevent reporting thousands of matches in degenerate cases (which do exist!). However, it would be better if this setting could be controlled at runtime with a new CLI option (perhaps --max-matches=N).
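A hypothetical invocation, assuming such an option were added (the option name and datastore path are illustrative only):

$ noseyparker report -d np.cpython --max-matches=10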

Add an improved rule selection mechanism

The current mechanism for selecting rules to be used when scanning is very simplistic: you can use all the default rules, and you can specify additional YAML-format rules to load with the --rules FILE_OR_DIR option.

What are the problems with this?

  • There is no way to disable a particular rule at scan time
  • There is no way to enable only a particular rule at scan time
  • Some of the default rules are much noisier than others (the Generic * rules in particular) and result in the largest proportion of reported findings

Let's improve the rule selection mechanism. I'm thinking that this would involve a new "ruleset" mechanism, which is an explicitly-specified set of available rules. Perhaps a YAML list of rule names to enable. Or perhaps gitignore format.

We will also want a new rules list CLI command, which will print out the set of selected rules according to some ruleset.
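A hypothetical sketch of how this might look on the command line; neither the --ruleset option nor the ruleset file shown here exists yet:

$ noseyparker rules list --ruleset my-ruleset.yml
$ noseyparker scan -d datastore.np --ruleset my-ruleset.yml some-input/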

Improve SARIF output

SARIF support was recently added (#33, #4), adding a new output format to Nosey Parker's report command. This support is preliminary, but good enough that viewers like the VSCode SARIF plugin can do something useful with the output in some cases.

However, I want Nosey Parker to do something useful in all cases. The end goal is that Nosey Parker's SARIF output is complete enough that common viewers can usefully render all findings.

Viewers of particular interest:

  • GitHub Code Analysis (so that SARIF output can be automatically shown in pull requests)
  • VSCode SARIF Viewer
  • The sarif-fmt command-line program

Rough edges and opportunities for improvement:

  1. Findings in blobs from Git repositories don't have useful location information associated with them.
  2. Nosey Parker rules don't have a stable and machine-friendly ID associated with them, just a name.
  3. Nosey Parker rules don't have a long description, severity, or precision associated with them.
  4. Currently, the VSCode SARIF Viewer's functionality to annotate findings as false positives crashes with Nosey Parker-generated output, probably due to some missing field.
  5. The location info in SARIF results is for the entire regex match rather than just the match group.

Vendor additional parts of vectorscan

Vectorscan is now vendored in the Nosey Parker source tree (#41), making it simpler to build and distribute the noseyparker binary. However, there are still a couple of complications beyond simply doing cargo build -r.

  1. The build environment needs Python to be available. This is used by the Vectorscan build itself (see vectorscan-sys/vectorscan/CMakeLists.txt), I believe for formatting a timestamp for reproducible builds, and it looks like it could easily be stubbed out.

  2. The build environment needs libclang to be available, for the use of bindgen in vectorscan-sys/build.rs. This is necessary to run bindgen, but that could be done once and the result committed, with the invocation of bindgen put behind a feature flag.

If these two items were addressed, then someone trying to build Nosey Parker from source would only need the usual Rust toolchain stuff and cmake.

Add commits range for `scan` in Git repositories

Hi 👋

A great option in a secret scanner is the ability to scan a range of commits, for example by adding an option to scan.

In my case, we use scanners for very large repositories. Once reported, in future runs there will be no need to scan previously scanned commits. Only new commits are relevant. That saves a lot of time in large repositories.

Gitleaks has this feature, and Trufflehog does too.

For example, a since_commits option, scanning between a specific commit and HEAD. And why not an until_commits option too?
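A hypothetical invocation sketching the proposed option (the option name, commit hash, and repository path are illustrative only):

$ noseyparker scan -d np.myrepo --since-commit a1b2c3d myrepo.git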

Do you see any blocking issues for this enhancement?

😄

Add Bitbucket enumeration support

Hello,
Since my company works with Bitbucket, I would be glad to work on adding the same features that exist for GitHub to Bitbucket: for example, listing Bitbucket project repositories, listing user repositories, and ensuring that each feature of noseyparker works with it.

I could take inspiration from the existing code for Github and make a Bitbucket version.
Why not GitLab too in the future?
😄

Investigate switching global allocator to mimalloc

Using musl instead of glibc when building Nosey Parker results in a significant drop in scan performance, presumably due to the allocator implementation in musl not supporting threaded workloads very well (see here).

It may be possible to sidestep this by using a different global allocator in Nosey Parker. In particular, it appears that jemalloc does not build with musl. But mimalloc does build there, and there is a Rust crate for it already.

Is it easy to switch Nosey Parker to use mimalloc as its global allocator?

How does switching impact performance of native-code builds? How does it affect performance of Docker-based builds, particularly the Alpine-based build in #77?

Support parallel enumeration of Git repositories

Currently, the scan command runs in two main phases: input enumeration and content scanning. Each of these phases runs in parallel internally, but the phases do not run concurrently with each other; the input enumeration phase completes entirely before the content scanning phase begins.

However, within the input enumeration phase, when a Git repository is discovered on the filesystem, that repository is enumerated sequentially, by a single thread. This becomes noticeable when you are scanning just a single huge repository, such as the Linux kernel, which has over a million commits, several million objects, and can take over a hundred GB of space when uncompressed.

It would be better if Nosey Parker did not have this sequential bottleneck, and was instead able to enumerate a single Git repository in parallel, using all available cores.

The implementation of this will be a bit tricky, requiring rework of the parallelism mechanism in the input enumerator code. That currently uses the ignore crate to do parallel filesystem walking, but that does not seem to expose its thread pool. We would want the proposed parallel Git enumerator to not oversubscribe the system running scan; the total number of enumeration threads should be controllable.

An additional complication will be figuring out how to build up the Git metadata graph that is being added in #66 (to address #16): the core graph data structure there is not designed for out-of-the-box mutation from many threads.

Cache scan results across runs

Nosey Parker should keep track of which inputs have already been scanned, and avoid rescanning them if possible on future scanning runs.

Currently, noseyparker scan -d DATASTORE INPUT will completely enumerate and scan INPUT from scratch. Nosey Parker is fast, but for large repositories (like the Linux kernel, with 100+GB of blobs), it still takes a couple of minutes. However, simply enumerating contents goes pretty quickly, especially in Git repositories (e.g., the Linux kernel repo can be enumerated in 13-25 seconds, depending on filesystem cache). If Nosey Parker kept track of which blobs it had scanned and with which set of rules, it could avoid re-scanning things.

Caching is tricky to get right. The information about which inputs have already been scanned should probably be persisted in the datastore sqlite database. An entry would be a (blob id, ruleset id, nosey parker version id) tuple, indicating that the particular blob had been scanned with a particular set of rules and Nosey Parker version.

If the context size for reported findings in Nosey Parker becomes runtime-configurable, that parameter would also need to be taken into account for caching.

The cache could be a fixed-size LRU cache, 128MB of entries for example, loaded into a fast in-memory structure at enumeration time, and then updated in bulk at the end of scanning. (Some implementation like this may be necessary to avoid tanking Nosey Parker speed.)

One complication: a Nosey Parker datastore sqlite database is currently a very simple, totally denormalized schema with a single table. There is also no such thing currently as a ruleset id; that notion would need to be added (perhaps sha512 of all the loaded rules).

Support older versions of Linux with prebuilt release binaries

@bradlarsen Can we please provide a `musl` binary as well?

I get the following GLIBC error when trying to run noseyparker on Ubuntu 20:

❯ noseyparker
noseyparker: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by noseyparker)
noseyparker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by noseyparker)
noseyparker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by noseyparker)
noseyparker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by noseyparker)

I seem to have GLIBC 2.31:

❯ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31

Originally posted by @dufferzafar in #28 (comment)

Rework vendoring of `vectorscan` to use patchfiles

vectorscan is currently vendored, built from source as part of Nosey Parker. For the sake of expediency, I implemented that by adding a copy of all of vectorscan 5.4.8 in a subdirectory, and then modifying those files in place.

It would be better for maintainability if the vendored build process instead extracted a pristine shasum-verified release tarball and applied patchfiles to that prior to building. This will make it clearer in the future what modifications were made to the Vectorscan sources.

Additionally, being able to get rid of the enormous vectorscan and boost source trees from this repository would make continued development of Nosey Parker a bit nicer, as things like fzf-based fuzzy search and whole-source-tree textual search wouldn't get confounded by all the stuff in there (which we mostly don't care about).

Add an extensible filtering mechanism for reporting

Use case

You are reviewing a bunch of findings from Nosey Parker and want to focus on particular categories of findings, eliminate findings from certain files or repos, etc.

Existing limitations

At present, the only way to do this is to write some postprocessing script. This has some downsides:

  • If you postprocess the human-readable text-based report, you have to do grungy parsing
  • If you postprocess the machine-readable JSON report, you afterwards have to write your own logic to assemble that into a human-readable report from the filtered result
  • Any postprocessing you do is strictly "post"; the filtered results cannot be fed back into Nosey Parker. (Example use case: you filter results and want the noseyparker summarize output only on the filtered results.)

Proposal

The report and summarize commands should have an additional --filter CMD option that behaves as follows:

  • CMD is the name of a program that takes a JSON array of findings on stdin and emits a JSON array of findings on stdout
  • If CMD returns a non-zero exit code, prints unintelligible output, or emits findings that did not appear in the original input, noseyparker exits with an error
  • When --filter CMD is given to either the report or summarize commands, the filter program is run on Nosey Parker's findings as soon as they are loaded, but prior to downstream processing within noseyparker
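A hypothetical invocation of such a --filter option (the filter program name is illustrative only):

$ noseyparker report -d np.cpython --filter ./my-filter-script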

Use a default datastore path when not provided

Currently you have to explicitly specify the datastore to most Nosey Parker commands. For example, to scan something, you have to say noseyparker scan -d DATASTORE SOME_INPUT, and to report findings, you have to say noseyparker report -d DATASTORE.

It can be a hassle to keep specifying the datastore over and over. Additionally, the -d/--datastore argument must be given after the subcommand name, which complicates command-line editing. For example, to scan and then report, you recall the previous scan command from shell history and change the subcommand to report, but you have to edit in the middle of the line rather than being able to move the cursor to scan, delete to the end of the line, and type report.

You can kind of work around this hassle at present by setting the NP_DATASTORE=DATASTORE environment variable, which lets you get away with saying just noseyparker scan SOME_INPUT and noseyparker report.
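For example, with that environment variable workaround (the datastore and input names are placeholders):

$ export NP_DATASTORE=np.myproject
$ noseyparker scan some-input/
$ noseyparker report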

It would be nicer if Nosey Parker would choose a default datastore name if it is not explicitly specified. The proposal is this:

  • Use a default value of datastore.np for the datastore
  • To avoid a UX hazard of accidentally polluting an existing datastore, modify the scan command so that if no datastore is explicitly given via -d/--datastore or the NP_DATASTORE environment variable — i.e., the default datastore path is used — and the datastore already exists, exit with an error

Azure Devops Support?

I would be HAPPY to test Azure DevOps support (via ssh would be REALLY nice!). Thanks for making something better than the steaming pile o' crap that's out there for password scanners. It's not perfect, but it's a WHOLE lot better than everything else!

Add ability to enumerate repositories of GitHub users and organizations

This feature was demoed at Black Hat EU 2022 from the internal version of Nosey Parker, but is not yet implemented here.

The listed repos should include user gists and repo wikis.

This should exist as both a new github list [--github-user USER | --github-org ORG]... command, and as new --github-user USER and --github-org ORG arguments for scan.

How to ignore some strings ?

Hi,

Thank you for your tool which works very well. However, I have a question about the best way to ignore certain strings.

For example, I would like to be able to ignore some AWS keys:

AKIA111111111EXAMPLE
AKIA111111112EXAMPLE

So I've created a YAML file with this pattern (~/tmp/nosey/aws.yml):

rules:
- name: AWS API Key Key
  pattern: '\b(?:A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16})\b'
  references:
  - https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html
  - https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html
  - https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html
  - https://docs.aws.amazon.com/accounts/latest/reference/credentials-access-keys-best-practices.html

  examples:
  - 'A3T0ABCDEFGHIJKLMNOP'
  - AKIADEADBEEFDEADBEEF'

  negative_examples:
  - 'AKIA111111111EXAMPLE'
  - 'AKIA222222222EXAMPLE'
  - 'AKIAI44QH8DHBEXAMPLE'
  - '======================'
  - '//////////////////////'
  - '++++++++++++++++++++++'

Then :

noseyparker scan -r ~/tmp/nosey/ --datastore ~/gitlab-dump.db ~/gitlab-dump/

But these keys are still detected by the tool.
I guess I'm doing something wrong but I don't see how to solve my issue.

Thanks in advance for your help !

Add a scanning option to skip the initial input enumeration

noseyparker scan currently always does an initial enumeration of the filesystem inputs. The only user-facing reason for doing this currently is to show a progress bar when scanning. This is detrimental in a couple cases:

  1. When running without a terminal (like with output directed to a file), the progress bars are not shown, and the initial filesystem enumeration is not useful.
  2. When scanning large inputs from slow filesystems (like old magnetic disks or Docker bind mounts on macOS), simply enumerating the inputs can be as slow as actually scanning everything!

It would be useful to automatically avoid enumeration when progress bars are not displayed. It would also be beneficial to have an explicit control to avoid enumeration, to help in cases of slow I/O.

Minor issue (unhandled exception when datastore doesn't exist)

This shouldn't work anyway, but I suspect you may want to handle the exception more gracefully

Specifying a datastore that doesn't exist when using the report command causes a full stack trace; it appears to just be an uncaught exception.

14:51:44 › noseyparker report -d /tmp/no-such-file
Error: Failed to open database at "/tmp/no-such-file/datastore.db"

Caused by:
    0: unable to open database file: /tmp/no-such-file/datastore.db
    1: Error code 14: Unable to open the database file

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: noseyparker::datastore::Datastore::open
   2: noseyparker::cmd_report::run
   3: noseyparker::main
   4: std::sys_common::backtrace::__rust_begin_short_backtrace
   5: std::rt::lang_start::{{closure}}
   6: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/core/src/ops/function.rs:609:13
   7: std::panicking::try::do_call
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/panicking.rs:483:40
   8: std::panicking::try
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/panicking.rs:447:19
   9: std::panic::catch_unwind
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/panic.rs:137:14
  10: std::rt::lang_start_internal::{{closure}}
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/rt.rs:148:48
  11: std::panicking::try::do_call
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/panicking.rs:483:40
  12: std::panicking::try
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/panicking.rs:447:19
  13: std::panic::catch_unwind
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/panic.rs:137:14
  14: std::rt::lang_start_internal
             at /rustc/e75aab045fc476f176a58c408f6b06f0e275c6e1/library/std/src/rt.rs:148:20
  15: main
  16: __libc_start_main
             at ./csu/../csu/libc-start.c:308:16
  17: _start

Obviously I don't feel too strongly about this being "fixed" but FYI. A pretty error is probably preferred?

Great tool by the way, thanks for publishing

Make release builds from CI usable

The new Release Build job in GitHub Actions is intended to produce a release-mode binary for noseyparker for x86_64 Linux.

It doesn't quite work at present as desired, because Hyperscan does not in fact get statically linked:

$ ldd /scan/noseyparker
        libhs.so.5 => not found
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000004001f27000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000000400200e000)
        /lib64/ld-linux-x86-64.so.2 (0x0000004000000000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000004002236000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000004002256000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000000400225b000)

See the reference to libhs.so.5 there. What this means is that someone who downloads a prebuilt noseyparker binary will also have to install Hyperscan.

It would also be really nice for ease-of-distribution if the binary could be linked statically, not requiring any dynamic libraries whatsoever, but that's probably a bigger job.

Validation of secrets

I'd love to use Nosey Parker for defensive purposes, and being able to filter results to only show secrets which are confirmed valid is essential.

It should be an optional feature, likely occurring during the report command to keep scans performant.

Each rule could have an optional section listing a URL plus parameters/headers specifying what network request to make in order to verify the credential. The slack rule https://github.com/praetorian-inc/noseyparker/blob/74098e8881bd68a6368032b78b56ddad5ecbbad4/crates/noseyparker/data/default/rules/slack.yml could have a section which specifies curl https://slack.com/api/auth.test -H "Authorization: Bearer $(1)" as the request to make, with $(1) as a placeholder for the first capture group from the regex.

Many projects re-invent their own request/response matching format; I am very curious whether you might be able to integrate an existing system like https://hurl.dev with a flexible way to test a response to confirm a valid secret.

Generate manpages and include them in prebuilt releases

The prebuilt releases could include man pages generated from the CLI documentation.

See clap_mangen. This would possibly need to be called from the build.rs script. Then, in the scripts/create_release.zsh script, we would want to copy those generated man pages into somewhere like release/share/man/1/.

Add ability to scan a Git repo given its url

Something like scan -d DATASTORE --git-url https://github.com/python/cpython should work. This would automatically clone the requested repo and then scan it.

A new clones subdirectory within the datastore can be used for storing automatically cloned repos.

An automatically cloned repo should be cloned using the --mirror option, which pulls down additional objects compared to a regular git clone invocation.

If a Git URL is requested that was already automatically cloned, the existing clone state should be reused and updated if possible.

Provide prebuilt release binaries for Nosey Parker

Currently, Nosey Parker comes in source form, and has an automatically built Docker image available.

I'd like to provide a new distribution mechanism: a prebuilt binary distributed as a single file for popular platforms. To start with:

  • GNU x86_64 Linux
  • macOS x86_64
  • macOS aarch64

The biggest obstacle to this currently is the use of Hyperscan in Rust:

  • Hyperscan proper only builds on x86, but the Vectorscan fork of that project builds on other platforms (see #5)
  • When installed from system package managers, static linking against Hyperscan/Vectorscan doesn't work as expected (there is still a dynamic dependency on libhs; see #22)
  • Building Hyperscan or Vectorscan from source is nontrivial, requiring several additional dependencies (ragel, cmake, boost, ...)

Improve GitHub repository enumeration with filtering mechanisms

A few commands enumerate GitHub repositories:

  • scan --github-user=USER
  • scan --github-org=ORG
  • github repos list --user=USER
  • github repos list --org=ORG

These currently do not offer any filtering mechanism on the set of resulting repo URLs. It would be useful to have an option to ignore forked repos.

Provide prebuilt Docker images for ARM

The Docker images that are currently built in GitHub Actions are built only for x86_64. When someone runs one of these on a different architecture, such as ARM, some kind of binary translation or runtime emulation is performed to get the image running.

It would be better if instead of relying on emulation, Nosey Parker provided multi-architecture Docker images.

Error: Failed to enumerate GitHub repositories (SSL/TLS Cert Issues)

Hello, I'm probably doing something wrong here, but when I run NoseyParker I'm getting errors related to SSL certificates. When I search for solutions, most mention corporate firewalls or importing CA certificates, both of which seem too involved given that I'm running this on a home network and it's trying to connect to api.github.com. Any help/suggestions would be appreciated.


Add mechanism for per-rule filtering of matches

There are certain rule types in particular where it would be helpful to have an expressive postprocessing/filtering mechanism. For example:

  • Matches from the JWT rule could be postprocessed to decode the JWT and only keep those that have no expiration time
  • Matches from PEM-encoded keys could be postprocessed to filter out matches where the payload doesn't PEM-decode

Support non-x86_64 by switching from Hyperscan to Vectorscan

Hyperscan only supports x86. It has been forked, however, to support ARM and other architectures as well: https://github.com/vectorcamp/vectorscan

I experimented in a local copy on an M1 MacBook Pro, and was able to get Nosey Parker building and running there using vectorscan instead of hyperscan. The build process for that experiment was rather manual:

  • Install boost, ragel, cmake, etc in order to build vectorscan from source
  • Build vectorscan
  • Ensure vectorscan's test suite passed
  • export HYPERSCAN_ROOT=$PATH_TO_VECTORSCAN_BUILD (causes the hyperscan-sys crate to use vectorscan instead)
  • cargo build --release for Nosey Parker

After this process, I was able to run Nosey Parker on the M1 MacBook Pro, and it seemed to behave as expected. The resulting binary was also statically linked against vectorscan, and didn't have a dynamic dependency on libhs.
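A consolidated sketch of those manual steps, assuming a macOS/Homebrew environment; the package commands, build directory, and paths below are illustrative only:

$ brew install boost ragel cmake
$ git clone https://github.com/vectorcamp/vectorscan
$ cd vectorscan && cmake -B build && cmake --build build
$ export HYPERSCAN_ROOT=$PWD/build
$ cd /path/to/noseyparker && cargo build --release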

I'd like to figure out how to streamline this build process so that nothing more than a cargo build in Nosey Parker is required.

Enhance the `version/-V` command

It would be useful for troubleshooting if Nosey Parker's -V command, used to show version info, included more detail than just the declared version number, such as the Git commit hash.

A library that would be helpful for implementing this: https://docs.rs/vergen/latest/vergen/

It would also be useful to add a top-level version command that would emit more information than the simple -V option.

Refresh README examples

The example snippets and usage instructions in the README are out-of-date now. They could also stand to be generally improved.

Some ideal properties of the example usage snippets:

  • they can be automatically regenerated from a script
  • they include short animation screen captures, in addition to static terminal command/output text

Useful references:

Add support for JSON output format for report

Looks like it might have been forgotten. The README and help mention it, but attempting to use it produces:

Error: The `report` command currently only supports the `human` output format.
Support for other formats is coming soon.

Report detailed Git provenance information for matches

Like #15, this was also demoed at Black Hat EU 2022.

When a match is found within a blob in a Git repository, detailed provenance information should be reported, including:

  • The commit(s) that first introduced the match, along with author, date, and commit message
  • The path(s) that the introducing blob appeared as
  • The repository origin URL, if available

With all this information, it is possible to generate permalinks to GitHub for matches.

Add support for precompiled rules

Compiling the rules into a Vectorscan database currently is very expensive in debug builds, where I see it taking some 6 seconds. In release builds, the compilation is much faster (perhaps 1/3 of a second). The slow compilation in debug builds is especially irksome, because that's the development configuration where you'd like to be able to quickly iterate.

It would be ideal if, instead of compiling the rules database from scratch each run, Nosey Parker could load a precompiled rules database.

Last I looked at Vectorscan, there were APIs for serialization of databases, but the serialized format did not sound machine-independent. So it would be difficult to precompile that database and include it as an asset in the release binary.

An alternative would be to have Nosey Parker optionally (and by default) look in a special location at runtime for a machine-specific precompiled rule file that it could load, and generate that file if needed. Though this approach is fraught with its own perils, such as: what if that special location ends up in a network-mounted home directory shared across systems?

Add "category" metadata to each rule

Currently, Nosey Parker rules are just a bag of rules, undifferentiated from each other in terms of severity or the kind of thing they detect.

As noted by @CameronLonsdale in #52 (comment):

Another useful category would be whether the finding is for a "secret" like an API key, or something more informational like an s3 bucket name (https://github.com/praetorian-inc/noseyparker/blob/main/crates/noseyparker/data/default/rules/aws.yml#L120-L121)

This is related to #51: if we had useful metadata attached to each rule, we could use that for filtering.

I propose adding a category field to each rule. This field would hold a list of values from a set of notable types. I'm not quite sure what the taxonomy should be, but some ideas:

  • secret: indicating that the rule detects things that are in fact secrets (e.g., GitHub Personal Access Token). These are the most severe, as they can give an attacker nearly immediate access to additional resources.
  • identifier: indicating that the rule detects things that are not secrets but could be used by an attacker to enumerate additional resources (e.g., AWS API Key, S3 buckets).
  • hashed, encrypted: indicating that the rule detects things that are hashed or encrypted payloads (e.g., bcrypt hashes). Things like password hashes probably shouldn't be leaked, as an attacker could use them to brute force credentials.
  • test: indicating that the rule detects things that are explicitly used for test deployments (e.g., stripe test keys)
  • api: indicating that the rule detects things that are API tokens

Make `scan --ignore FILENAME` apply to blobs in Git repositories

The scan command currently has a --ignore FILENAME option, which allows one to specify a gitignore-style rules files for paths to ignore when scanning. Those ignore rules are only applied to plain files that are scanned, and not blobs found within Git repositories. Those rules should also apply to Git blobs.
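For example (the ignore file and input path here are placeholders):

$ noseyparker scan -d np.myproject --ignore ignore-rules.txt some-directory/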

This is probably dependent on #16 being completed first.

Create a schema for the `report` JSON output

report --format json and report --format jsonl emit output in a currently unspecified JSON/JSON Lines format. It would be helpful for downstream users if these formats were documented with a schema.
