

HashR: Generate your own set of hashes




About

HashR allows you to build your own hash sets based on your data sources. It's a tool that extracts files and hashes out of input sources (e.g. raw disk image, GCE disk image, ISO file, Windows update package, .tar.gz file, etc.).

HashR consists of the following components:

  1. Importers, which are responsible for copying the source to local storage and doing any required preprocessing.
  2. Core, which takes care of extracting content from the source using image_export.py (Plaso), handles caching and repository-level deduplication, and prepares the extracted files for the exporters.
  3. Exporters, which are responsible for exporting files, metadata and hashes to given data sinks.

Currently implemented importers:

  1. GCP, which extracts files from base GCP disk images.
  2. Windows, which extracts files from Windows installation media in ISO-13346 format.
  3. WSUS, which extracts files from Windows Update packages.
  4. GCR, which extracts files from container images stored in Google Container Registry.
  5. TarGz, which extracts files from .tar.gz archives.
  6. Deb, which extracts files from Debian software packages.
  7. RPM, which extracts files from RPM software packages.
  8. Zip, which extracts files from .zip (and zip-like) archives.

Once files are extracted and hashed, the results are passed to the exporters. Currently implemented exporters:

  1. PostgreSQL, which uploads the data to a PostgreSQL instance.
  2. Cloud Spanner, which uploads the data to a GCP Spanner instance.

You can choose which importers you want to run; each one has different requirements. More about this can be found in the sections below.

Requirements

HashR requires a Linux OS to run; this can be a physical, virtual or cloud machine. Below are the optimal hardware requirements:

  1. 8-16 cores
  2. 128GB memory
  3. 2TB fast local storage (SSDs preferred)

HashR can likely run on machines with lower specifications, but this has not been thoroughly tested.

Building HashR binary and running tests

In order to build a hashr binary run the following command:

env GOOS=linux GOARCH=amd64 go build hashr.go

In order to run tests for the core hashR package, you need to run the Spanner emulator:

gcloud emulators spanner start

Then to execute all tests run the following command:

go test -timeout 2m ./...

Setting up HashR

HashR using docker

To run HashR in a Docker container, see the Docker-specific guide.

OS configuration & required 3rd party tooling

HashR takes care of the heavy lifting (parsing disk images, volumes, file systems) by using Plaso. You need to pull the Plaso docker container using the following command:

docker pull log2timeline/plaso

We also need 7z, which is used by the WSUS importer for recursive extraction of Windows Update packages, installed on the machine running HashR:

sudo apt install p7zip-full

You need to allow the user under which HashR will run to run certain commands via sudo. Assuming that your user is hashr, create a file /etc/sudoers.d/hashr with the following content:

hashr ALL = (root) NOPASSWD: /bin/mount,/bin/umount,/sbin/losetup,/bin/rm

The user under which HashR will run will also need to be able to run docker. Assuming that your user is hashr, add them to the docker group like this:

sudo usermod -aG docker hashr
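
To confirm the group change took effect (you may need to log out and back in first), a quick sanity check is to run any container as the hashr user, for example (hello-world is just a throwaway test image, not part of HashR):

docker run --rm hello-world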

Setting up storage for processing tasks

HashR needs to store information about processed sources. It also stores additional telemetry about processing tasks: processing times, number of extracted files, etc. You can choose between using:

  1. PostgreSQL
  2. Cloud (GCP) Spanner

Setting up PostgreSQL storage

There are many ways you can run and maintain your PostgreSQL instance, one of the simplest ways would be to run it in a Docker container. Follow the steps below to set up a PostgreSQL Docker container.

Step 1: Pull the PostgreSQL docker image.

docker pull postgres

Step 2: Initialize and run the PostgreSQL container in the background. Make sure to adjust the password.

docker run -itd -e POSTGRES_DB=hashr -e POSTGRES_USER=hashr -e POSTGRES_PASSWORD=hashr -p 5432:5432 -v /data:/var/lib/postgresql/data --name hashr_postgresql postgres

Step 3: Create a table that will be used to store processing jobs.

cat scripts/CreateJobsTable.sql | docker exec -i hashr_postgresql psql -U hashr -d hashr
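
To confirm that the jobs table was created, you can list the tables in the hashr database, for example:

docker exec -i hashr_postgresql psql -U hashr -d hashr -c '\dt'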

In order to use PostgreSQL to store information about processing tasks you need to specify the following flags: -storage postgres -postgres_host <host> -postgres_port <port> -postgres_user <user> -postgres_password <pass> -postgres_db <db_name>
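
For example, a full invocation that combines this storage backend with the TarGz importer and the Postgres exporter (both described later) might look like the following; the host, credentials and repository path are placeholders for your own setup:

hashr -storage postgres -postgres_host localhost -postgres_port 5432 -postgres_user hashr -postgres_password hashr -postgres_db hashr -importers targz -targz_repo_path /data/targz -exporters postgres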

Setting up Cloud Spanner

You can choose to store the data about processing jobs in Cloud Spanner. You'll need a Google Cloud project for that. The main advantage of this setup is that you can easily create dashboards using Google Data Studio that connect directly to the Cloud Spanner instance, which allows monitoring and debugging without running queries against your PostgreSQL instance.

Assuming that your gcloud tool is configured with your target hashr GCP project, you'll need to follow the steps below to enable Cloud Spanner.

Create HashR service account:

gcloud iam service-accounts create hashr-sa --description="HashR SA key." --display-name="hashr"

Create a service account key and store it in your home directory. Set <project_name> to your project name:

gcloud iam service-accounts keys create ~/hashr-sa-private-key.json --iam-account=hashr-sa@<project_name>.iam.gserviceaccount.com

Point GOOGLE_APPLICATION_CREDENTIALS env variable to your service account key:

export GOOGLE_APPLICATION_CREDENTIALS=/home/hashr/hashr-sa-private-key.json

Create Spanner instance, adjust the config and processing-units value if needed:

gcloud spanner instances create hashr --config=regional-us-central1 --description="hashr" --processing-units=100

Create Spanner database:

gcloud spanner databases create hashr --instance=hashr

Allow the service account to use the Spanner database; set <project_name> to your project name:

gcloud spanner databases add-iam-policy-binding hashr --instance hashr --member="serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com" --role="roles/spanner.databaseUser"

Update Spanner database schema:

gcloud spanner databases ddl update hashr --instance=hashr --ddl-file=scripts/CreateJobsTable.ddl

In order to use Cloud Spanner to store information about processing tasks you need to specify the following flags: -jobStorage cloudspanner -spannerDBPath <spanner_db_path>
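
The <spanner_db_path> follows the standard Spanner resource path format, so with the instance and database created above it would typically look like this (substitute your project name):

-jobStorage cloudspanner -spannerDBPath projects/<project_name>/instances/hashr/databases/hashr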

Setting up importers

In order to specify which importers you want to run, use the -importers flag. Possible values: GCP,aws,targz,windows,wsus,deb,rpm,zip,gcr,iso9660

GCP (Google Cloud Platform)

This importer can extract files from GCP disk images. This is done in a few steps:

  1. Check for new images in the target project (e.g. ubuntu-os-cloud)
  2. Copy new/unprocessed image to the hashR GCP project
  3. Run Cloud Build, which creates a temporary VM, runs dd on the copied image and saves the output to a .tar.gz file.
  4. Export raw_disk.tar.gz to the GCS bucket in hashR GCP project
  5. Copy raw_disk.tar.gz from GCS to local hashR storage
  6. Extract raw_disk.tar.gz and pass the disk image to Plaso

A list of GCP projects containing public GCP images can be found here. In order to use this importer you need to have a GCP project and follow these steps:

Step 1: Create the HashR service account. If this was done while setting up Cloud Spanner, skip to step 4.

gcloud iam service-accounts create hashr-sa --description="HashR SA key." --display-name="hashr"

Step 2: Create a service account key and store it in your home directory. Make sure to set <project_name> to your project name:

gcloud iam service-accounts keys create ~/hashr-sa-private-key.json --iam-account=hashr-sa@<project_name>.iam.gserviceaccount.com

Step 3: Point GOOGLE_APPLICATION_CREDENTIALS env variable to your service account key:

export GOOGLE_APPLICATION_CREDENTIALS=~/hashr-sa-private-key.json

Step 4: Create a GCS bucket that will be used to store disk images in .tar.gz format. Set <project_name> to your project name and <gcs_bucket_name> to your new GCS bucket name:

gsutil mb -p <project_name> gs://<gcs_bucket_name>

Step 5: Make the service account admin of this bucket:

gsutil iam ch serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com:objectAdmin gs://<gcs_bucket_name>

Step 6: Enable the Compute and Cloud Build APIs:

gcloud services enable compute.googleapis.com cloudbuild.googleapis.com

Step 7: Create an IAM role and assign it the required permissions:

gcloud iam roles create hashr --project=<project_name> --title=hashr --description="Permissions required to run hashR" --permissions=compute.images.create,compute.images.delete,compute.globalOperations.get

Step 8: Bind IAM role to the service account:

gcloud projects add-iam-policy-binding <project_name> --member="serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com" --role="projects/<project_name>/roles/hashr"

Step 9: Grant the service accounts the access required to run Cloud Build. Make sure to change the <project_name> and <project_id> values:

gcloud projects add-iam-policy-binding <project_name> --member='serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com' --role='roles/storage.admin'

gcloud projects add-iam-policy-binding <project_name> \
  --member='serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com' \
  --role='roles/viewer'

gcloud projects add-iam-policy-binding <project_name> \
  --member='serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com' \
  --role='roles/resourcemanager.projectIamAdmin'

gcloud projects add-iam-policy-binding <project_name> \
  --member='serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com' \
  --role='roles/cloudbuild.builds.editor'


gcloud projects add-iam-policy-binding <project_name> \
   --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
   --role='roles/compute.admin'

gcloud projects add-iam-policy-binding <project_name> \
   --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
   --role='roles/iam.serviceAccountUser'

gcloud projects add-iam-policy-binding <project_name> \
   --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
   --role='roles/iam.serviceAccountTokenCreator'

gcloud projects add-iam-policy-binding <project_name> \
   --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
   --role='roles/compute.networkUser'

gcloud projects add-iam-policy-binding <project_name> \
  --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
  --role='roles/compute.storageAdmin'

gcloud projects add-iam-policy-binding <project_name> \
  --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
  --role='roles/storage.objectViewer'

gcloud projects add-iam-policy-binding <project_name> \
  --member='serviceAccount:<project_id>@cloudbuild.gserviceaccount.com' \
  --role='roles/storage.objectAdmin'

To use this importer you need to specify the following flag(s):

  1. -gcpProjects which is a comma-separated list of cloud projects containing disk images. If you'd like to import public images take a look here
  2. -hashrGCPProject GCP project that will be used to store a copy of the disk images for processing and also to run Cloud Build
  3. -hashrGCSBucket GCS bucket that will be used to store the output of Cloud Build (disk images in .tar.gz format)
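
For example, a run that imports the public Ubuntu base images might look like the following (combine it with the storage and exporter flags for your setup; the bucket name is a placeholder):

hashr -importers GCP -gcpProjects ubuntu-os-cloud -hashrGCPProject <project_name> -hashrGCSBucket <gcs_bucket_name>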

AWS

This importer processes Amazon-owned AMIs and generates hashes. The importer requires at least one HashR worker (an EC2 instance).

AWS HashR Workers

An AWS HashR worker is an EC2 instance to which the AMI's volume is attached; a disk archive is created on the worker and then uploaded to an S3 bucket. It is recommended to have at least two AWS HashR workers. If your setup uses a single AWS worker, use -processing_worker_count 1.

An AWS HashR worker needs to meet the following requirements:

  • EC2 instances must have the tag InUse: false. If the value is true, the worker is not used for processing.
    aws ec2 describe-instances --instance-id INSTANCE_ID | jq -r '.Reservations[].Instances[0].Tags'
  • The system running hashr must be able to SSH to the EC2 instance using:

    • The SSH key as described in KeyName.
    aws ec2 describe-instances --instance-id INSTANCE_ID | jq -r '.Reservations[].Instances[0].KeyName'
    • The FQDN as described in PublicDnsName.
    aws ec2 describe-instances --instance-id INSTANCE_ID | jq -r '.Reservations[].Instances[0].PublicDnsName'
  • scripts/hashr-archive must be copied to the AWS HashR worker as /usr/local/sbin/hashr-archive

  • An AWS account with permission to upload files to the HashR bucket. The AWS configuration and credentials should be stored in the $HOME/.aws/ directory:

aws configure

HashR Application

On the system that runs hashr, the following is required:

  • An AWS account with permissions to call the following APIs:
    • EC2
      • AttachVolume
      • CopyImage
      • CreateTags
      • CreateVolume
      • DeleteVolume
      • DescribeAvailabilityZones
      • DescribeImages
      • DescribeInstances
      • DescribeSnapshots
      • DescribeVolumes
      • DetachVolume
    • S3
      • DeleteObject
  • The AWS account configuration and credentials file must be located in the $HOME/.aws/ directory.
  • The SSH private key used for AWS HashR must be located in the $HOME/.ssh/ directory. It must match the value of KeyName returned by: aws ec2 describe-instances --instance-id INSTANCE_ID | jq -r '.Reservations[].Instances[0].KeyName'

Setting up AWS EC2 Instance

This section describes how to create EC2 instances to use with HashR. Ideally we want two AWS accounts: hashr.uploader and hashr.worker.

hashr.uploader is used on EC2 instances and needs permission to upload archived disk images to the S3 bucket. scripts/aws/AwsHashrUploaderPolicy.json contains a sample policy for the S3 bucket hashr-bucket.

hashr.worker is used on the computer running HashR commands. The account needs EC2 and S3 permissions. scripts/aws/AwsHashrWorkerPolicy.json contains a sample policy for the hashr.worker account.

hashr_setup.sh is a script that helps create the EC2 instances. Edit hashr_setup.sh and review and update the following fields as required:

  • AWS_PROFILE
  • AWS_REGION
  • SECURITY_SOURCE_CIDR
  • WORKER_AWS_CONFIG_FILE

Note: The file specified by WORKER_AWS_CONFIG_FILE must exist in the same directory as hashr_setup.sh.

Note: hashr_setup.sh must be executed from the directory that contains it.

Run the following commands to create and set up the EC2 instances.

$ git clone https://github.com/google/hashr
$ cd hashr/scripts/aws
$ aws configure
$ cp -r ~/.aws ./
$ tar -zcf hashr.uploader.tar.gz .aws
$ ./hashr_setup.sh setup

HashR AWS Importer Workflow

The AWS importer takes the following high-level steps:

  1. Copies a new/unprocessed Amazon-owned AMI to the HashR project
  2. Creates a volume based on the copied AMI
  3. Attaches the volume to an available AWS HashR worker
  4. On the AWS HashR worker:
     a. Creates a disk archive (.tar.gz)
     b. Uploads the disk archive to the HashR S3 bucket
  5. Downloads the disk archive from HashR S3 bucket
  6. Unarchives the disk image
  7. Processes the raw disk using Plaso

HashR AWS Importer Command

The command below processes debian-12 images and stores the resulting hashes in a PostgreSQL database.

hashr -storage postgres -exporters postgres -importers aws -aws_bucket aws-hashr-bucket -aws_os_filter debian-12

Note: Amazon Linux (al2023-*) was used as a worker while developing the importer, so the default value for -aws_ssh_user is set to ec2-user. A different distro may have a different default SSH user; use -aws_ssh_user to set the appropriate SSH user.

GCR (Google Container Registry)

This importer extracts files from container images stored in GCR repositories. In order to set it up, follow these steps:

Step 1: Create the HashR service account; skip to step 4 if this was done while setting up other GCP-dependent components.

gcloud iam service-accounts create hashr-sa --description="HashR SA key." --display-name="hashr"

Step 2: Create a service account key and store it in your home directory. Make sure to set <project_name> to your project name:

gcloud iam service-accounts keys create ~/hashr-sa-private-key.json --iam-account=hashr-sa@<project_name>.iam.gserviceaccount.com

Step 3: Point GOOGLE_APPLICATION_CREDENTIALS env variable to your service account key:

export GOOGLE_APPLICATION_CREDENTIALS=~/hashr-sa-private-key.json

Step 4: Grant the hashR service account the permissions required to access the given GCR repository:

gsutil iam ch serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com:objectViewer gs://artifacts.<project_name_hosting_gcr_repo>.appspot.com

To use this importer you need to specify the following flag(s):

  1. -gcr_repos which should contain a comma-separated list of GCR repositories from which you want to import the container images.

Windows

This importer extracts files from official Windows installation media in ISO-13346 format, e.g. the images you can download from the official Microsoft website. One ISO file can contain multiple WIM images:

  1. Windows10ProEducation
  2. Windows10Education
  3. Windows10EducationN
  4. Windows10ProN
  5. etc.

This importer will extract files from all images it can find in the install.wim file.
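
A minimal invocation, assuming the ISO files are stored in a local directory and using the -windows_iso_repo_path flag (also visible in the troubleshooting example further below), might look like this; combine it with your storage and exporter flags:

hashr -importers windows -windows_iso_repo_path /data/windows_iso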

WSUS

This importer utilizes 7z to recursively extract the contents of Windows Update packages. It will look for Windows Update files in the provided GCS bucket. The easiest way to automatically populate the GCS bucket with new updates is the following:

  1. Set up a GCE VM running Windows Server in the hashr GCP project.
  2. Configure it with the WSUS role and select the Windows Update packages that you'd like to process.
  3. Configure WSUS to automatically approve and download updates to local storage.
  4. Set up a Windows task to automatically sync the content of the local storage to the GCS bucket: gsutil -m rsync -r D:/WSUS/WsusContent gs://hashr-wsus/ (remember to adjust the paths)
  5. If you'd like to use the filename of the update package (which usually contains the KB number) as the ID, together with its description (by default the ID is the SHA-1 hash, which is how MS stores WSUS updates), this information can be dumped from the internal WID WSUS database. You can use the following PowerShell script and run it as a task:
#SQL Query
$delimiter = ";"
$SqlQuery = 'select DISTINCT CONVERT([varchar](512), tbfile.FileDigest, 2) as sha1, tbfile.[FileName], vu.[KnowledgebaseArticle], vu.[DefaultTitle]  from [SUSDB].[dbo].[tbFile] tbfile
  left join [SUSDB].[dbo].[tbFileForRevision] ffrev
  on tbfile.FileDigest = ffrev.FileDigest
  left join [SUSDB].[dbo].[tbRevision] rev
  on ffrev.RevisionID = rev.RevisionID
  left join [SUSDB].[dbo].[tbUpdate] u
  on rev.LocalUpdateID = u.LocalUpdateID
  left join [SUSDB].[PUBLIC_VIEWS].[vUpdate] vu
  on u.UpdateID = vu.UpdateId'
$SqlConnection = New-Object System.Data.SqlClient.SqlConnection
$SqlConnection.ConnectionString = 'server=\\.\pipe\MICROSOFT##WID\tsql\query;database=SUSDB;trusted_connection=true;'
$SqlCmd = New-Object System.Data.SqlClient.SqlCommand
$SqlCmd.CommandText = $SqlQuery
$SqlCmd.Connection = $SqlConnection
$SqlCmd.CommandTimeout = 0
$SqlAdapter = New-Object System.Data.SqlClient.SqlDataAdapter
$SqlAdapter.SelectCommand = $SqlCmd
#Creating Dataset
$DataSet = New-Object System.Data.DataSet
$SqlAdapter.Fill($DataSet)
$DataSet.Tables[0] | export-csv -Delimiter $delimiter -Path "D:\WSUS\WsusContent\export.csv" -NoTypeInformation

gsutil -m rsync -r D:/WSUS/WsusContent gs://hashr-wsus/

This will dump the relevant information from the WSUS DB, store it in the export.csv file and sync the contents of the WSUS folder with the GCS bucket. The WSUS importer will check whether an export.csv file is present in the root of the WSUS repo and, if so, will use it.

TarGz

This is a simple importer that traverses repositories and looks for .tar.gz files. Once found it will hash the first and the last 10MB of the file to check if it was already processed. This is done to prevent hashing the whole file every time the repository is scanned for new sources. To use this importer you need to specify the following flag(s):

  1. -targz_repo_path which should point to the path on the local file system that contains .tar.gz files
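
Conceptually, the quick check described above is similar to hashing only the head and tail of the archive instead of the whole file. A rough shell equivalent (an illustration only, not HashR's actual code; the exact hash algorithm is an implementation detail) would be:

head -c 10M archive.tar.gz | sha256sum
tail -c 10M archive.tar.gz | sha256sum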

Deb

This is very similar to the TarGz importer except that it looks for .deb packages. Once found it will hash the first and the last 10MB of the file to check if it was already processed. This is done to prevent hashing the whole file every time the repository is scanned for new sources. To use this importer you need to specify the following flag(s):

  1. -deb_repo_path which should point to the path on the local file system that contains .deb files

RPM

This is very similar to the TarGz importer except that it looks for .rpm packages. Once found it will hash the first and the last 10MB of the file to check if it was already processed. This is done to prevent hashing the whole file every time the repository is scanned for new sources. To use this importer you need to specify the following flag(s):

  1. -rpm_repo_path which should point to the path on the local file system that contains .rpm files

Zip (and other zip-like formats)

This is very similar to the TarGz importer except that it looks for .zip archives. Once found it will hash the first and the last 10MB of the file to check if it was already processed. This is done to prevent hashing the whole file every time the repository is scanned for new sources. To use this importer you need to specify the following flag(s):

  1. -zip_repo_path which should point to the path on the local file system that contains .zip files

Optionally, you can also set the following flag(s):

  1. -zip_file_exts comma-separated list of file extensions to treat as zip files, eg. "zip,whl,jar". Default: "zip"
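
For example, to also treat Python wheels and Java archives as zip files (the repository path is a placeholder; combine with your storage and exporter flags):

hashr -importers zip -zip_repo_path /data/zips -zip_file_exts zip,whl,jar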

ISO 9660

This is very similar to the TarGz importer except that it looks for .iso files. Once found it will hash the first and the last 10MB of the file to check if it was already processed. This is done to prevent hashing the whole file every time the repository is scanned for new sources. To use this importer you need to specify the following flag(s):

  1. -iso_repo_path which should point to the path on the local file system that contains .iso files

Setting up exporters

Setting up Postgres exporter

The Postgres exporter sends hashes, file metadata and the actual content of the files to a PostgreSQL instance. For best performance it's advised to set it up on a separate, dedicated machine. If you already set up PostgreSQL as the processing jobs storage you're almost good to go; just run the following command to create the required tables:

cat scripts/CreatePostgresExporterTables.sql | docker exec -i hashr_postgresql psql -U hashr -d hashr

If you didn't choose Postgres for processing job storage follow steps 1 & 2 from the Setting up PostgreSQL storage section.

This is currently the default exporter, so you don't need to explicitly enable it. By default the actual file content won't be uploaded to the PostgreSQL DB; if you wish to change that, use the -upload_payloads true flag.

In order for the Postgres exporter to work you need to set the following flags: -exporters postgres -postgresHost <host> -postgresPort <port> -postgresUser <user> -postgresPassword <pass> -postgresDBName <db_name>

Setting up GCP exporter

The GCP exporter sends hashes and file metadata to a GCP Spanner instance. Optionally, you can upload the extracted files to a GCS bucket. If you haven't set up Cloud Spanner for storing processing jobs, follow the steps in Setting up Cloud Spanner and, instead of the last step, run the following command to create the necessary tables:

gcloud spanner databases ddl update hashr --instance=hashr --ddl-file=scripts/CreateCloudSpannerExporterTables.ddl

If you have already set up Cloud Spanner for storing jobs data, you just need to run the command above and you're ready to go.

If you'd like to upload the extracted files to GCS you need to create the GCS bucket:

Step 1: Create the GCS bucket:

gsutil mb -p <project_name> gs://<gcs_bucket_name>

Step 2: Make the service account admin of this bucket:

gsutil iam ch serviceAccount:hashr-sa@<project_name>.iam.gserviceaccount.com:objectAdmin gs://<gcs_bucket_name>

To use this exporter you need to provide the following flags: -exporters GCP -gcp_exporter_gcs_bucket <gcs_bucket_name>
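
A combined invocation using the Cloud Spanner job storage from earlier and the GCP exporter might look like the following (flag names as documented above; all values are placeholders):

hashr -importers targz -targz_repo_path /data/targz -jobStorage cloudspanner -spannerDBPath projects/<project_name>/instances/hashr/databases/hashr -exporters GCP -gcp_exporter_gcs_bucket <gcs_bucket_name>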

Additional flags

  1. -processing_worker_count: This flag controls the number of parallel processing workers. Processing is CPU- and I/O-heavy; during testing, 2 workers turned out to be the optimal setting.
  2. -cache_dir: Location of the local cache used for deduplication. It's advised to change this from /tmp to e.g. the home directory of the user that will be running hashr.
  3. -export: When set to false, hashr will save the results to disk, bypassing the exporters (see the example after this list).
  4. -export_path: If export is set to false, this is the folder where samples will be saved.
  5. -reprocess: Allows reprocessing of a given source (e.g. one that errored out) based on the sha256 value stored in the jobs table.
  6. -upload_payloads: Controls whether the actual content of the files will be uploaded by the defined exporters.
  7. -gcp_exporter_worker_count: Number of workers/goroutines that the GCP exporter will use to upload the data.
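
For example, a run that skips the exporters and writes the extracted samples to a local directory might look like this (note that Go-style boolean flags are usually passed in the -flag=false form; the paths are placeholders):

hashr -importers deb -deb_repo_path /data/debs -cache_dir /home/hashr/cache -export=false -export_path /data/hashr-samples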

This is not an officially supported Google product.


hashr's Issues

Crash with concurrent map read/write

I ran hashr using the new deb importer to hash a large number of files, specifically every single Ubuntu 22 package, and after about 8 hours of runtime it crashed with the following message:

Nov 13 09:19:43 deb-hashing bash[29850]: fatal error: concurrent map read and map write

I have attached the log file which shows what happened before and the full stack trace.
hashr-crash.log

Add --rm when running plaso container

There is a --rm missing when running the plaso container. This leaves a bunch of stopped containers after using that importer.

args := []string{"run", "-v", "/tmp/:/tmp", "log2timeline/plaso", "image_export", "--logfile", logFile, "--partitions", "all", "--volumes", "all", "-w", exportDir, sourcePath}

Properly clean up temporary files on importer error

Apr 03 13:30:00 hashr2 bash[1362]: I0403 13:30:00.043981    1362 hashr.go:200] Preprocessing linux-tools-4.15.0-106-generic_4.15.0-106.107_amd64.deb
Apr 03 13:30:00 hashr2 bash[1362]: I0403 13:30:00.044380    1362 common.go:142] Copying linux-tools-4.15.0-106-generic_4.15.0-106.107_amd64.deb to /tmp/hashr-linux-tools-4.15.0-106-generic_4.15.0-106.107_amd64.deb-309297636/linux-tools-4.15.0-106-generic_4.15.0-106.107_amd64.deb
Apr 03 13:30:00 hashr2 bash[1362]: E0403 13:30:00.045723    1362 hashr.go:308] deb: skipping source linux-tools-4.15.0-101-generic_4.15.0-101.102_amd64.deb: error while preprocessing: error while opening tar archive in deb package: xz: data is truncated or corrupt
Apr 03 13:30:00 hashr2 bash[1362]: I0403 13:30:00.046763    1362 hashr.go:233] Deleting

The "Deleting" log line is missing a path

glog.Infof("Deleting %s", path)
which is probably also why the removal never happens. This could have been dangerous as well considering the command below is "sudo rm -rf ".

Consider storing hashes as binary data in SQL database

Currently, all hashes are stored as hex-encoded VARCHAR(100) in the database. This means that every hash takes up roughly twice as much space as it should need if stored optimally.

Unfortunately Postgres does not have a fixed length binary data type so the closest would be the BYTEA data type. I think we should investigate how changing to this for hashes affects storage requirements and lookup times.

Large memory consumptions with large number of input files

I suspect that the cache map starts consuming a very large amount of memory after a while. I ran it on a machine with 32gb RAM and nothing other than postgres and hashr running and it killed hashr for OOM.

Should we consider adding support for offloading the cache to the database or something like redis?

hashr-crash3.log

Add SELinux policy

If one tries to run hashr using Linux with enforced SELinux, there will be an access violation when it comes to the preprocessing.
This is due to incompatibility in SELinux contexts of /tmp and docker

AVC Events:
scontext=system_u:system_r:container_t:
tcontext=unconfined_u:object_r:user_tmp_t

General solutions (maybe there are more):

  1. Custom SELinux policy
  2. Change default preprocessing directory or provide flag to do so
  3. Disable SELinux: Don't think this should be a general solution

mount failed: Operation not permitted

trying to hash several windows iso's, getting mount failed every time, even when running as SUDO
docker run -it --network hashr_net -v /opt/hashr/windows_iso:/data/windows us-docker.pkg.dev/osdfir-registry/hashr/release/hashr -storage postgres -postgres_host hashr_postgresql -postgres_port 5432 -postgres_user XXX -postgres_password XXX -postgres_db hashr -importers windows -windows_iso_repo_path /data/windows -exporters postgres

error:
Stderr: mount: /tmp/hashr-server2022.iso-3119761407/mnt: mount failed: Operation not permitted.

thanks

Explore alternatives to `mount` for Docker compatibility in importers

Currently, some hashr importers rely on directly mounting ISO files within Docker containers (e.g. windows or iso importer). This approach is not generally supported by Docker due to security restrictions. Explore and implement alternatives to eliminate the need for the --privileged flag. (see #61)

Considerations:

  • Tools: Investigate libraries or tools like 7z, xorriso, or others that can extract or access contents of ISO files without requiring a direct mount.
  • Performance Impact: Evaluate any trade-offs in terms of performance or resource usage between the mounting approach and potential alternatives.
  • Importer Scope: Identify the specific importers within hashr that rely on mount and will need to be refactored.

Run containers as hashr user to remove need for "sudo rm"

Currently when cleaning up the temp files we use sudo:

cmd := exec.Command("sudo", "rm", "-rf", path)
I presume this is because some of the importers run a docker image (as root) which means the produced files are owned by root. If we run that container as the same user hashr is running as they will be owned by that user and we don't need to do "sudo rm ..." which could be dangerous if there is a bug.
