certcc / labyrinth

Come inside, and have a nice cup of tea.

License: Other
See for example
https://github.com/CERTCC/labyrinth/actions/runs/5527573575/job/14968312185
Log snippet follows
2023-07-12T05:42:25.0531884Z ##[group]Run repo_deep_dive --verbose --mod 3 --divisor 10 --results_dir results/2023/07/11 --max_age 7200
2023-07-12T05:42:25.0532364Z repo_deep_dive --verbose --mod 3 --divisor 10 --results_dir results/2023/07/11 --max_age 7200
2023-07-12T05:42:25.0634838Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2023-07-12T05:42:25.0635111Z env:
2023-07-12T05:42:25.0635399Z pythonLocation: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0635778Z PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib/pkgconfig
2023-07-12T05:42:25.0636381Z Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0636726Z Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0637048Z Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0637388Z LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib
2023-07-12T05:42:25.0637974Z GH_TOKEN: ***
2023-07-12T05:42:25.0638187Z ##[endgroup]
2023-07-12T05:42:25.7345778Z INFO root - log level: INFO
2023-07-12T05:42:25.7353976Z INFO labyrinth.repo_processor - Reading 1 search result summaries
2023-07-12T05:42:25.7835639Z INFO labyrinth.repo_processor - Found 16 search results to process
2023-07-12T05:42:25.7888462Z INFO labyrinth.repo_processor - Cloning https://github.com/bha-vin/HTB-Beep.git 1 of 16
2023-07-12T05:42:25.9948417Z INFO labyrinth.repo_processor - Cloning https://github.com/codingcore12/SILENT-DOC-EXPLOIT-CLEAN-v5.git 2 of 16
2023-07-12T05:42:26.1897741Z INFO labyrinth.repo_processor - Cloning https://github.com/gcarrilao/hook.git 3 of 16
2023-07-12T05:42:26.3880162Z Traceback (most recent call last):
2023-07-12T05:42:26.3887717Z File "/opt/hostedtoolcache/Python/3.9.17/x64/bin/repo_deep_dive", line 57, in <module>
2023-07-12T05:42:26.3888708Z process_modulo(args.results_dir, args.mod, args.divisor)
2023-07-12T05:42:26.3889799Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 276, in process_modulo
2023-07-12T05:42:26.3890321Z df = scan_repos(top_dir, mod, divisor)
2023-07-12T05:42:26.3891004Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 223, in scan_repos
2023-07-12T05:42:26.3891563Z results = df.apply(process_row, axis=1).to_list()
2023-07-12T05:42:26.3892230Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/frame.py", line 9423, in apply
2023-07-12T05:42:26.3899802Z return op.apply().__finalize__(self, method="apply")
2023-07-12T05:42:26.3900537Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 678, in apply
2023-07-12T05:42:26.3901242Z return self.apply_standard()
2023-07-12T05:42:26.3901838Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 798, in apply_standard
2023-07-12T05:42:26.3903187Z results, res_index = self.apply_series_generator()
2023-07-12T05:42:26.3904231Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 814, in apply_series_generator
2023-07-12T05:42:26.3905168Z results[i] = self.f(v)
2023-07-12T05:42:26.3905773Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 148, in process_row
2023-07-12T05:42:26.3906211Z _df = process_git_url(clone_url, workdir)
2023-07-12T05:42:26.3906927Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 49, in process_git_url
2023-07-12T05:42:26.3907353Z df = process_dir(workdir, workdir)
2023-07-12T05:42:26.3908020Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/file_processor.py", line 94, in process_dir
2023-07-12T05:42:26.3908443Z _df = process_file(fpath, workdir)
2023-07-12T05:42:26.3909108Z File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/file_processor.py", line 28, in process_file
2023-07-12T05:42:26.3909691Z with open(fpath, "r", encoding="ISO-8859-1") as fp:
2023-07-12T05:42:26.3910183Z PermissionError: [Errno 13] Permission denied: '/tmp/git-clone-svldfy0g/link'
2023-07-12T05:42:26.4827113Z ##[error]Process completed with exit code 1.
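One hedged way to keep a single unreadable file (here, apparently a restricted symlink in a cloned repo) from killing the whole run is to treat it as skippable inside process_file. A minimal sketch; the real labyrinth function does more than this, and the logging style is an assumption:

```python
import logging

logger = logging.getLogger(__name__)

def process_file(fpath, workdir):
    """Defensive sketch of labyrinth's process_file: skip files we
    cannot read (dangling or permission-restricted symlinks are common
    in freshly cloned repos) instead of crashing the whole scan."""
    try:
        with open(fpath, "r", encoding="ISO-8859-1") as fp:
            return fp.read()
    except (PermissionError, OSError) as e:
        # PermissionError is what the traceback above shows; OSError
        # also covers dangling symlinks and similar filesystem oddities.
        logger.warning("Skipping unreadable file %s: %s", fpath, e)
        return None
```

The caller (process_dir) would then need to tolerate a None return for skipped files.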
See for example https://github.com/CERTCC/labyrinth/actions/runs/3363048531
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout, actions/setup-python, actions/setup-python, actions/checkout
See for example https://github.com/CERTCC/labyrinth/actions/runs/3363048531
The set-output command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
The code seems to work, but I just noticed in this segment of code that quals is both the thing being iterated over AND something that is assigned to inside the loop. That seems like a bad idea even if it isn't broken. Need to change the name inside the loop to something else.
Lines 53 to 65 in 53986fe
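A hypothetical reconstruction of the pattern described above (the actual snippet is not shown here), illustrating the rename:

```python
quals = ["q1", "q2"]

# Before (the pattern described above): the loop body reassigns the
# very name it iterates over. Legal in Python, because the iterator is
# bound once at loop entry, but confusing to read:
#
#   for q in quals:
#       quals = transform(q)   # shadows/overwrites the outer list
#
# After: use a distinct name inside the loop:
results = []
for q in quals:
    transformed = q.upper()    # stand-in for whatever the loop computes
    results.append(transformed)
```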
We need some unit tests.
Tests go into <top_dir>/test
Rough naming convention: test_<module_name>.py
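Following the convention above, a pytest skeleton for one module might look like this. The function under test is a stand-in; the real import path into labyrinth is an assumption to be adjusted:

```python
# test/test_file_processor.py -- illustrative skeleton only.

def process_file_stub(text):
    """Stand-in for something like labyrinth.file_processor.process_file;
    here it just extracts lines mentioning a CVE ID."""
    return [line for line in text.splitlines() if "CVE-" in line]

def test_process_file_finds_cve_lines():
    assert process_file_stub("x\nCVE-2023-38646\n") == ["CVE-2023-38646"]

def test_process_file_empty_input():
    assert process_file_stub("") == []
```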
It's taking over 20 minutes for each action to check out the repository before it can do anything. This results in every run of SearchRepos using basically 12 hours of compute time, and the vast bulk of that is checking out the repository. Probably time to optimize the process and see if we can make it more efficient.
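If the checkout time is dominated by the repository's accumulated data files and history, a shallow, sparse checkout may help. A sketch assuming actions/checkout@v4 (the paths listed are illustrative; note fetch-depth: 1 is already the default since v2, so the sparse-checkout input is the likely win here):

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 1          # no history
    sparse-checkout: |      # only materialize what the job touches
      results
      labyrinth
```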
The only licensing info I could find is in the setup "license="all rights reserved",... can you clarify what is the license for this repo?
There's a project similar to this one that is doing per-CVE searches on Github. Our choice here is to either
Of the two of these, item 2 seems the easier one to incorporate, although certainly 1 is more robust to future change.
See for example https://github.com/CERTCC/labyrinth/actions/runs/6536547242
Prepare all required actions
Run ./.github/actions/deep_dive
Run repo_deep_dive --verbose --mod 5 --divisor 10 --results_dir results/2023/10/15 --max_age 7200
INFO root - log level: INFO
INFO labyrinth.repo_processor - Reading 1 search result summaries
INFO labyrinth.repo_processor - Found 13 search results to process
INFO labyrinth.repo_processor - Cloning https://github.com/ExploitRc3/ExploitRc3.git 1 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/codingcore12/Extremely-Silent-JPG-Exploit-NEW-nk.git 2 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/PrasoonPratham/Simple-XSS-exploit-example.git 3 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/Pyr0sec/CVE-2023-38646.git 4 of 13
INFO labyrinth.file_processor - Found 1 matches in 1 out of 3 files
INFO labyrinth.repo_processor - Cloning https://github.com/iotwar/AntiQbot.git 5 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/Latrodect/EATER-offensive-security-frameowork.git 6 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/Anthony-T-N/CTF-Binary-Exploitation.git 7 of 13
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.18/x64/bin/repo_deep_dive", line 66, in <module>
process_modulo(args.results_dir, args.mod, args.divisor)
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 285, in process_modulo
df = scan_repos(top_dir, mod, divisor)
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 232, in scan_repos
results = df.apply(process_row, axis=1).to_list()
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py", line 10037, in apply
return op.apply().__finalize__(self, method="apply")
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 837, in apply
return self.apply_standard()
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 963, in apply_standard
results, res_index = self.apply_series_generator()
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 979, in apply_series_generator
results[i] = self.func(v, *self.args, **self.kwargs)
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 136, in process_row
gh_has_newer = _check_repo_newer(ts, repo_name)
File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 77, in _check_repo_newer
if m_ts < repo.pushed_at:
TypeError: can't compare offset-naive and offset-aware datetimes
Error: Process completed with exit code 1.
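The TypeError above comes from comparing a timezone-naive timestamp with GitHub's timezone-aware pushed_at. One minimal fix is to normalize the naive side before comparing, assuming the naive timestamps are in fact UTC (true of GitHub API data, but worth verifying for m_ts):

```python
from datetime import datetime, timezone

def ensure_aware(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC so they can be compared with
    offset-aware values like PyGithub's repo.pushed_at."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts

m_ts = datetime(2023, 10, 15, 12, 0)                      # naive
pushed_at = datetime(2023, 10, 15, 13, 0, tzinfo=timezone.utc)
assert ensure_aware(m_ts) < pushed_at                     # no TypeError
```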
omg
For example, there shouldn't be a file called data/repo_id/(2/84/03/(284034948,)/(284034948,).csv
but there is.
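One plausible way such a directory name arises (an assumption, not confirmed against the code) is str()-formatting a one-element tuple where a plain int was expected, e.g. a value pulled from itertuples() or carrying a stray trailing comma:

```python
repo_id = (284034948,)   # hypothetically, a tuple where an int was expected

# A one-element tuple interpolated into an f-string renders with the
# parentheses and trailing comma, matching the malformed path above:
path = f"data/repo_id/{repo_id}/{repo_id}.csv"

# Unwrapping the tuple gives the intended layout:
fixed = f"data/repo_id/{repo_id[0]}/{repo_id[0]}.csv"
```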
May I ask, what is this repository for? My name is mentioned a few times here.
Having both the code and the data it is collecting in a single repository was neat when we started but the data has grown so much (and is updated so frequently) that it's impossible to follow the git commit history for the code anymore.
My proposal is to move /results and /data to a separate repository or repositories (there's no reason they can't be in the same repo, even though the process that generates data into /results is distinct from the process that generates data into /data).
Note that implementing this would probably be a good time to consider implementing #1 as well.
Describe the bug
2024-05-13T17:04:27.0321574Z Starting 1 queries
2024-05-13T17:04:27.8388184Z Search: vulnerability poc pushed:2024-05-12..2024-05-13
2024-05-13T17:04:28.1586607Z Traceback (most recent call last):
2024-05-13T17:04:28.1596262Z File "/opt/hostedtoolcache/Python/3.9.19/x64/bin/search_github", line 211, in <module>
2024-05-13T17:04:28.1597379Z main(search_str, args.start_date, args.end_date, args.overwrite)
2024-05-13T17:04:28.1598433Z File "/opt/hostedtoolcache/Python/3.9.19/x64/bin/search_github", line 56, in main
2024-05-13T17:04:28.1599312Z data = do_search(query, start_date, end_date)
2024-05-13T17:04:28.1600814Z File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/labyrinth/search.py", line 75, in do_search
2024-05-13T17:04:28.1601900Z for r in result:
2024-05-13T17:04:28.1603074Z File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/github/PaginatedList.py", line 84, in __iter__
2024-05-13T17:04:28.1604200Z newElements = self._grow()
2024-05-13T17:04:28.1605455Z File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/github/PaginatedList.py", line 95, in _grow
2024-05-13T17:04:28.1606921Z newElements = self._fetchNextPage()
2024-05-13T17:04:28.1608379Z File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/github/PaginatedList.py", line 244, in _fetchNextPage
2024-05-13T17:04:28.1609730Z headers, data = self.__requester.requestJsonAndCheck(
2024-05-13T17:04:28.1611192Z File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/github/Requester.py", line 548, in requestJsonAndCheck
2024-05-13T17:04:28.1612793Z return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
2024-05-13T17:04:28.1614422Z File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/github/Requester.py", line 609, in __check
2024-05-13T17:04:28.1615639Z raise self.createException(status, responseHeaders, data)
2024-05-13T17:04:28.1619712Z github.GithubException.RateLimitExceededException: 403 {"documentation_url": "https://docs.github.com/free-pro-team@latest/rest/overview/rate-limits-for-the-rest-api#about-secondary-rate-limits", "message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again. If you reach out to GitHub Support for help, please include the request ID B470:3F0D3F:14DBB1E9:21B8E305:6642481B."}
2024-05-13T17:04:28.2168365Z ##[error]Process completed with exit code 1.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It should be able to catch the secondary rate limit failure, sleep for a few minutes and try again with an exponential backoff or something.
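A sketch of that retry behavior, written with the exception class passed in so the helper stays decoupled from PyGithub (in labyrinth, retry_on would be github.RateLimitExceededException; the delays and retry counts below are illustrative, not tuned):

```python
import time

def with_backoff(fn, retry_on, max_tries=5, base_delay=60, sleep=time.sleep):
    """Call fn(), retrying on `retry_on` exceptions with exponential
    backoff (base_delay, 2*base_delay, 4*base_delay, ...). Re-raises
    after the final attempt so genuine failures still surface."""
    for attempt in range(max_tries):
        try:
            return fn()
        except retry_on:
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The tricky part for labyrinth's do_search is that the exception fires lazily while iterating PaginatedList, so the retry likely needs to wrap the iteration loop, not just the initial search call.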
Describe the solution you'd like
Instead of having the search and deep dive workflows just create CSV, JSON, and Markdown files, we should add an mkdocs/material site that can publish the results as a browsable website on certcc.github.io
Describe alternatives you've considered
Status quo is just to put CSV, JSON, and Markdown back into the repository, so it's quasi-browsable but not as useful as if it were constructing a proper static site.
See failed job in https://github.com/CERTCC/labyrinth/actions/runs/5527573575/job/14968312185
Log snippet follows
search_github --gh_token *** --start_date 2023-07-09 --end_date 2023-07-10 exploit
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.9.17/x64
PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib/pkgconfig
Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib
Starting 1 queries
Search: exploit pushed:2023-07-09..2023-07-10
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.17/x64/bin/search_github", line 202, in <module>
main(search_str, args.start_date, args.end_date, args.overwrite)
File "/opt/hostedtoolcache/Python/3.9.17/x64/bin/search_github", line 47, in main
data = do_search(query, start_date, end_date)
File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/search.py", line 79, in do_search
data = r.raw_data
File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/GithubObject.py", line 160, in raw_data
self._completeIfNeeded()
File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/GithubObject.py", line 390, in _completeIfNeeded
self.__complete()
File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/GithubObject.py", line 395, in __complete
headers, data = self._requester.requestJsonAndCheck("GET", self._url.value)
File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/Requester.py", line 442, in requestJsonAndCheck
return self.__check(
File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/Requester.py", line 487, in __check
raise self.__createException(status, responseHeaders, data)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/repos#get-a-repository"}
Error: Process completed with exit code 1.
All the workflows are currently operating on the main branch. This results in a lot of small commits every time a workflow runs.
There are a few related tasks here:
adapt the search and update_summaries jobs in SearchRepos workflow to do their work in a branch or branches, then squash-merge the results back to main. Remove the working branch when done.
subtask of the above, or could be treated separately: Once update_summaries has done its job, the intermediate per-search result json files can be deleted. So they can exist on the working branch, but would never need to make it to main. Only summaries would get into main. Note, however, that this will require changes to generate_summaries so that we can continue to do monthly and yearly summaries too. (It's not as simple as adding a remove-all-non-summaries method.)
adapt the deep_dive and repo2vulid jobs in SearchRepos workflow to do their work in a branch or branches, then squash-merge the results back to main. Unlike the search/summaries items above, we want to retain both the repo and vul-id centric views, so in this case there is no post-action cleanup to be done.
Prepare all required actions
Run ./.github/actions/single_search
Run search_github --gh_token *** --start_date 2023-06-04 --end_date 2023-06-05 attack poc
Starting 1 queries
Search: attack poc pushed:2023-06-04..2023-06-05
Found 3 results for attack poc pushed:2023-06-04..2023-06-05
df has 3 rows
df has 3 rows after dropna
df has 3 rows after drop out of range dates
Search found 3 results for 2023-06-05
Read 1 records from results/2023/06/05/2023-06-05_attack_poc.json.
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.16/x64/bin/search_github", line 202, in <module>
main(search_str, args.start_date, args.end_date, args.overwrite)
File "/opt/hostedtoolcache/Python/3.9.16/x64/bin/search_github", line 139, in main
out_df = json_df.append(new_df, ignore_index=True)
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/generic.py", line 5989, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
Error: Process completed with exit code 1.
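The failure is pandas removing DataFrame.append (deprecated in 1.4, removed in 2.0); pd.concat is the documented replacement. A minimal sketch of the fix, with variable names mirroring the traceback:

```python
import pandas as pd

json_df = pd.DataFrame({"url": ["a"]})   # previously read search results
new_df = pd.DataFrame({"url": ["b"]})    # results from the new search

# Old (removed in pandas 2.0):
#   out_df = json_df.append(new_df, ignore_index=True)
out_df = pd.concat([json_df, new_df], ignore_index=True)
```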
labyrinth/labyrinth/repo_processor.py
Line 201 in 207dbce
This line uses the repo id and a modulus to decide how to split repos across parallel runs of the script. The problem is that sometimes individual runs can fail repeatedly, meaning that the same block of repos never gets worked on.
We can't just randomize it, because then we will have more than one process handling a repo.
So I'm thinking we need to add in some other factor that is constant for an individual run, but changes between runs.
Could be hour of the day, or maybe there's some run ID that can be converted to an int? The former can come from within the Python code directly, whereas the latter might require modification to the workflow scripts, unless there is some environment variable already there for the python code to use.
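On the latter point, GITHUB_RUN_ID is in fact already exposed as an environment variable on Actions runners, so no workflow changes would be needed. A sketch of salting the shard assignment with it (hour-of-day as the local fallback; the real selection logic in repo_processor presumably differs):

```python
import os
from datetime import datetime, timezone

def handles_repo(repo_id: int, mod: int, divisor: int) -> bool:
    """Decide whether this run handles repo_id. The per-run salt shifts
    which block of repos each parallel job gets, so a shard that fails
    repeatedly is not stuck with the same repos forever. Because every
    parallel job in one workflow run sees the same GITHUB_RUN_ID, each
    repo is still claimed by exactly one job."""
    salt = int(os.environ.get("GITHUB_RUN_ID",
                              datetime.now(timezone.utc).hour))
    return (repo_id + salt) % divisor == mod
```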