urlstechie / urlchecker-action
GitHub action to extract and check urls in code and documentations.
Home Page: https://urlchecker-python.readthedocs.io
License: MIT License
`from termcolor import colored` is present in two scripts. In one it is unused and in the other it can be replaced by a regular print.
I can't understand how or why this might be happening, so I'll just describe what I'm seeing. I'm using `urlstechie/urlchecker-action@master`.
urlchecker-action is truncating some URLs, trying to reach those truncated URLs, and reporting them as broken. Examples:
[![linux](https://raw.githubusercontent.com/devicons/devicon/master/icons/linux/linux-original.svg)](https://www.linux.org/)
truncated to https://www.linux
url = "https://codepen.io/rootwork/"
truncated to https://codepen.io/root
[lazy-render things](https://www.drupal.org/node/1982024)
truncated to https://www.drupal.org/node/19
urlchecker is looking at `.md`, `.toml`, `.yml` and `.scss` files. The ones that are being truncated are in the Markdown and TOML files but that might just be because there are many more of them.
These URLs are not split between lines; other than the first example above they are all actually pretty short URLs. The Markdown/TOML files validate. Many URLs, including other ones in those same files, are not truncated and pass urlchecker just fine.
Adjusting timeout and retry values on the linkchecker doesn't affect anything, because they're trying to check the wrong URLs.
All URLs were passing until July 12, at which point I introduced some broken URLs. After fixing them on July 13, I started getting the truncated URLs. (And to be clear the truncated ones aren't the ones that had been broken; they don't seem to have anything to do with each other.)
Any idea what might be happening?
Currently, we are checking out a git repo directly:
https://github.com/urlstechie/URLs-checker/blob/master/check.py#L15
However, this has no concept of branch - we would want to check out the branch for whatever PR is being run. We noticed this when running a PR for the devel branch, but seeing the current content of master.
I can open a PR to fix this up, maybe we can get it integrated with the next release that also includes #19 ?
this way it'd be easier to copy-paste the workflow file from one repo to another (a lazy question, yes).
Currently, we do one run that does basic checks for the testing repository. We would want to add additional runs that:
We absolutely don't want any PRs being merged that possibly break any previous functionality.
Currently, I can add _config.yml and README.md to my white_listed_patterns, but the files /github/workspace/_config.yml and /github/workspace/README.md are still checked. I suspect it's looking for the full path or starts with, and as a user I'd expect it to do more of a re.search (using my pattern). E.g., this should work without /github/workspace
# Cannot check private GitHub settings
white_listed_patterns: _config.yml,README.md,SocialNetworks.yml,.github/workflows,tests
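To illustrate the difference, here is a minimal sketch. It assumes the current behaviour is a prefix-style comparison, which is only my guess from the symptoms; `is_excluded_prefix` and `is_excluded_search` are hypothetical names, not functions from the action.

```python
import re

def is_excluded_prefix(path, patterns):
    # Assumed current behaviour: a pattern only matches at the start of the path.
    return any(path.startswith(p) for p in patterns)

def is_excluded_search(path, patterns):
    # Behaviour a user would expect: the pattern may match anywhere in the path.
    return any(re.search(p, path) for p in patterns)

patterns = ["_config.yml", "README.md", ".github/workflows"]
path = "/github/workspace/_config.yml"

print(is_excluded_prefix(path, patterns))
print(is_excluded_search(path, patterns))
```

With `re.search`, the user does not need to know that the workspace is mounted under /github/workspace. Note that the patterns are then treated as regular expressions, so a `.` matches any character.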
It seems that codecov is not linked/configured correctly.
The move from my personal folder to @urlstechie seems to be the cause.
CodeFactor found an issue: Starting a process with a shell, possible injection detected, security issue.
It's currently on:
check.py:21
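The usual way to address this kind of finding is to pass the command as a list of arguments instead of a shell string. The sketch below is generic: I have not reproduced what check.py:21 actually runs, and `run_command` is a hypothetical helper.

```python
import subprocess
import sys

def run_command(args):
    # A list of arguments with shell=False (the default) is never parsed by a
    # shell, so characters such as ';' are passed through literally.
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout

# A hostile-looking value arrives as one literal argument, not as a second command.
untrusted = "repo; echo injected"
output = run_command([sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted])
print(output.strip())
```

For a git checkout, the equivalent would be an argument list such as `["git", "clone", url, destination]`.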
Presume we build a website with a static website generator (jekyll, jbake, hugo, ...)
The output website is in directory `output`.
How do we check that there are no new broken URLs in a pull request?
After building it from a pull request, by checking the files in the output directory.
It always checks out a new branch, because parameter `branch` defaults to `master`, so it ignores the changes in the pull request.
Document how to tell it to use the current branch (don't check anything out).
hi @SuperKogito
I seem to have a link reported as broken (see https://github.com/HPC-buildtest/buildtest-framework/runs/428310246?check_suite_focus=true), but in reality it is working.
The actual link is https://www.hpcwire.com/2019/01/17/pfizer-hpc-engineer-aims-to-automate-software-stack-testing/ which is found but the url-checker is stating it is not found. I see a red X next to it.
It's confusing that the `white_listed_` inputs for URLs, patterns, and files exclude things rather than including them; usually you'd use the term "whitelist" to describe things you are explicitly including that would otherwise not be included.
Additionally, in the output "whitelist" is misspelled for the URL option:
url whitetlist: []
My suggestion would be to change these three input options to `exclude_urls`, `exclude_patterns` and `exclude_files`, to match the `include_files` option that already exists. (And also update those values in the output itself.)
If you don't want to break existing configs, you could leave the `white_listed_` options as valid along with the new ones, but only list the latter in the docs.
Hey, me again
I ran urlchecker-python v0.22 on a directory with dotfiles (`.editorconfig`, etc.) using this command:
urlchecker check --file-types '.*' .
That works fine.
Then I ran urlchecker-action v0.2.3, which as I understand includes the new version of the python script, with the following option:
file_types: '.*'
And the action simply skips all files and "passes" with "Done. No urls were collected."
Notably, the output of the action's build (on GitHub) says:
file types: ['.']
So I think it is removing the asterisk entirely. I tried using double quotes, using no quotes, escaping the asterisk with a backslash, and using a double-asterisk, but the result was the same each time.
Any thoughts?
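I have not looked at how the action parses `file_types`, so the following is only a sketch of a parser that would keep the asterisk. `parse_file_types` is a hypothetical name, and the last line shows one way (a guess, not a diagnosis) that `'.*'` could end up as `'.'`.

```python
def parse_file_types(value):
    # Split on commas and trim surrounding whitespace only, so a
    # wildcard such as '*' is passed through untouched.
    return [item.strip() for item in value.split(",") if item.strip()]

print(parse_file_types(".*"))
print(parse_file_types(".md, .py"))

# For contrast: stripping '*' as well would turn '.*' into '.'.
print([item.strip("*") for item in ".*".split(",")])
```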
This is just a random idea I thought would be cool: we could have an action that responds to issue comments, looks for a particular string (like @urlchecker-action check), and then does some special check for a URL.
@maelle you just gave me a really cool idea - we could have some kind of action / bot that responded to /check-urls (or something like @urlchecker-action check-urls). I've never made a bot before, but I'll note it because it would be cool to have!
Definitely not priority - but would be fun to figure out how to do!
@vsoch I updated urlchecker-python, fixed some bugs, and tested it. I also automated the quay.io docker builds and released a new version, 0.1.14, on PyPI. I edited the action to use this version and tested it on https://github.com/urlstechie/urlchecker-test-repo/ but it is failing. I am not sure what I missed, and since you are more familiar with the new structure, can you please take a look? Here are the logs: https://github.com/urlstechie/urlchecker-test-repo/runs/562209780?check_suite_focus=true
When we have more than one repository (this one here) and we refactor the action, we might possibly want to consider renaming this repository (while it's still early and we can track down users and make sure they update names) to something that strongly indicates "I'm a GitHub action." E.g.,
And then the namespace is a bit more clear.
Hey guys!
Not sure if I'm doing something wrong, but on my end, a simple check on a Jekyll website gave me:
WARNING:fake_useragent:Error occurred during loading data. Trying to use cache server https://fake-useragent.herokuapp.com/browsers/0.1.12
In the end, the checker won't find any URL to check, ends in an error and the CI passes. Here's the full log: https://github.com/kiegroup/kogito-website/runs/7903901907?check_suite_focus=true
My yaml:
name: Check URLs
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.1'
          bundler-cache: true
      - uses: actions/cache@v3
        with:
          path: vendor/bundle
          key: ${{ runner.os }}-gems-${{ hashFiles('**/Gemfile') }}
          restore-keys: |
            ${{ runner.os }}-gems-
      - run: |
          gem install bundler jekyll
          bundle check || bundle install
          bundle exec jekyll build
      - name: urls-checker
        uses: urlstechie/[email protected]
        with:
          # subfolder with files to test
          subfolder: _site
          # A comma-separated list of file types to cover in the URL checks
          file_types: .html,.js,.css,.xml
          # Choose whether to include file with no URLs in the prints.
          print_all: false
          # The timeout seconds to provide to requests, defaults to 5 seconds
          timeout: 5
          # How many times to retry a failed request (each is logged, defaults to 1)
          retry_count: 3
          # choose if the force pass or not
          force_pass : false
The cache action can be ignored, I guess the ruby action is already doing this work, I'll review later.
In https://github.com/urlstechie/urlchecker-action#example-with-checkout there's a mention of v0.1.9 but the latest release is 0.1.8. :-)
e.g. https://github.com/pybamm-team/PyBaMM/runs/7115511512?check_suite_focus=true#step:4:631
Are there different settings we can use to make it work?
I'm not actually sure how to do this, so @SuperKogito it's all you! When you make changes and PR, I'm interested to see how to go about this!
Some more user feedback/questions. With the current action & library, URLs are checked but not links: if you typed `[a cool post](htttps://blabla.org)` by mistake, the wrong link (should be `https://blabla.org`) won't be caught. In Markdown/html files, "links" are well-defined (in comments in code, I agree, less so).
Other actions focus on link checking, I wonder whether this could be added to this action somehow, or as a limitation to the docs maybe.
(I suppose this is partly the Commonmark debate again ;-))
Maybe a good workflow for a website/thing is:
- check links and URLs when content is created;
- check URLs once in a while (the URL can get broken, a link that's not malformed won't change).
Advantages of your library/action: retry, artifact, UA. But do I use another library/action for checking links?
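To make the distinction concrete, here is a minimal sketch of a link check (as opposed to a URL check) for Markdown. It is not how urlchecker works: the regular expression, the list of accepted schemes and the name `find_suspicious_links` are all assumptions for illustration, and the pattern only covers simple inline links.

```python
import re
from urllib.parse import urlparse

# Inline Markdown links of the form [text](target); deliberately simple.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")

def find_suspicious_links(markdown):
    suspicious = []
    for text, target in LINK_RE.findall(markdown):
        scheme = urlparse(target).scheme
        # Relative links have no scheme and are skipped; anything with an
        # unexpected scheme (e.g. a typo like 'htttps') is flagged.
        if scheme and scheme not in {"http", "https", "mailto"}:
            suspicious.append((text, target))
    return suspicious

sample = "See [a cool post](htttps://blabla.org) and [docs](https://blabla.org)."
print(find_suspicious_links(sample))
```

A URL extractor that only looks for `http://` or `https://` never sees the first target at all, which is why the typo goes unreported.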
I'm not sure if this is intentional, but the list of check_results is returned after just parsing one of the urls in the list:
Assuming that there is one list of urls, wouldn't we want to go through all of them, update check_results as we go, and return the final two lists? I'm working on a PR now, I can update this to fix the issue (if it isn't intentional!)
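Roughly what I have in mind is sketched below. The names are placeholders: `check_one` stands in for whatever performs the actual request, and the fake status table replaces the network so the example runs on its own.

```python
def check_urls(urls, check_one):
    # check_one is a hypothetical callable returning True when a URL responds.
    passed, failed = [], []
    for url in urls:
        if check_one(url):
            passed.append(url)
        else:
            failed.append(url)
    # Return only after every URL has been visited.
    return passed, failed

fake_status = {
    "https://a.example": True,
    "https://b.example": False,
    "https://c.example": True,
}
passed, failed = check_urls(list(fake_status), fake_status.get)
print(passed, failed)
```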
Thanks, very cool! My plan is to write a workflow uploading the csv as artifact. :-)
Originally posted by @maelle in urlstechie/urlchecker-python#24 (comment)
I'm trying to write a workflow uploading the artifact and have a question. I tried various things and might be missing something obvious regarding paths.
https://github.com/r-hub/docs/runs/582541285?check_suite_focus=true
Do you have any tip regarding where it might be best to tweak things?
No worries if you don't answer, I'm not sure this is the right place to ask. :-)
Could the list of examples at the end of the README be a table with
[workflow running blablabla every blablabla](permalink to workflow file)
I'm asking because that's what I go look for in example repos.
I don't think it's relevant here anymore - it's completely represented in https://github.com/urlstechie/urlchecker-python. I wasn't 100% sure so I didn't want to delete.
@SuperKogito what are your thoughts on adding black for code formatting? It will be easier to enforce a standard for code style, and make the code a bit easier to read. We are doing a fairly good job of keeping it neat, but some of the sections with input arguments are a bit crunched and would benefit from black. There are a few levels we can add it:
Let me know your thoughts! I'm working on the two issues I opened this morning now, but if you like the idea of black I can do a PR with a workflow for that after.
Great tool, thanks!
But it seems like it throws an error if there are no files to check. Shouldn't it just quietly succeed in that case?
A little ugly, but here's a snip from the log file.
urlchecker check --branch master --no-print --file-types .md,.py,.rst,.html --exclude-urls https://en.w,https://github.com/bssw-tutorial/presentations/blob/master,https://doi.org/10.1126/science.aah6168,https://doi.org/10.1002/spe.2220 --retry-count 1 --timeout 5 --files .github/workflows/check-pr-urls.yml .
Traceback (most recent call last):
original path: .
final path: /github/workspace
subfolder: None
branch: master
File "/opt/conda/bin/urlchecker", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.9/site-packages/urlchecker/client/__init__.py", line 191, in main
main(args=args, extra=extra)
File "/opt/conda/lib/python3.9/site-packages/urlchecker/client/check.py", line 83, in main
check_results = checker.run(
File "/opt/conda/lib/python3.9/site-packages/urlchecker/core/check.py", line 193, in run
for file_name, result in results.items():
AttributeError: 'NoneType' object has no attribute 'items'
cleanup: False
file types: ['.md', '.py', '.rst', '.html']
files: ['.github/workflows/check-pr-urls.yml']
print all: False
verbose: False
urls excluded: ['https://en.w', 'https://github.com/bssw-tutorial/presentations/blob/master', 'https://doi.org/10.1126/science.aah6168', 'https://doi.org/10.1002/spe.2220']
url patterns excluded: []
file patterns excluded: []
force pass: False
retry count: 1
save: None
timeout: 5
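Judging from the traceback, the loop over `results.items()` runs even when nothing was collected. A guard along these lines would let the run finish quietly. This is a sketch only: `summarize` and the shape of the results dictionary are assumptions, not the actual code in urlchecker/core/check.py.

```python
def summarize(results):
    # Treat a missing result set (None) as "nothing to check".
    results = results or {}
    failed = []
    for file_name, result in results.items():
        failed.extend(result.get("failed", []))
    return {"checked_files": len(results), "failed": failed}

print(summarize(None))
print(summarize({"README.md": {"passed": ["https://a.example"], "failed": []}}))
```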
This action currently cannot run due to additional security measures.
Run PR=$(jq --raw-output .pull_request.number "${GITHUB_EVENT_PATH}")
Error: Unable to process command '::set-env name=PR::291' successfully.
Error: The `set-env` command is disabled. Please upgrade to using Environment Files or opt into unsecure command execution by setting the `ACTIONS_ALLOW_UNSECURE_COMMANDS` environment variable to `true`. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
Adding the following to the jobs is a workaround:
env:
  ACTIONS_ALLOW_UNSECURE_COMMANDS: true
However, ideally the examples are updated to use the new `$GITHUB_ENV` file method or similar. For example,
files=$(curl --request GET \
--url https://api.github.com/repos/${{ github.repository }}/pulls/$PR/files \
--header 'authorization: Bearer ${{ secrets.GITHUB_TOKEN }}' \
--header 'Accept: application/vnd.github.antiope-preview+json' \
--header 'content-type: application/json' | jq --raw-output .[].filename | sed 's/^\|$/"/g'|paste -sd, - | tr -d \" | tr -d \')
echo "files=$files" >> $GITHUB_ENV
env:
PR: ${{ github.event.issue.number }}
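If a step is written in Python rather than shell, the same environment-file mechanism applies: append a `NAME=value` line to the file named by `GITHUB_ENV`. Below is a small sketch; `export_env` is a hypothetical helper, and a temporary file stands in for `$GITHUB_ENV` so it can be run locally.

```python
import os
import tempfile

def export_env(name, value, env_file=None):
    # GitHub Actions reads NAME=value lines appended to the file named by GITHUB_ENV.
    env_file = env_file or os.environ["GITHUB_ENV"]
    with open(env_file, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")

# Local demonstration with a temporary file standing in for $GITHUB_ENV.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    demo_path = tmp.name
export_env("files", "README.md,docs/index.md", env_file=demo_path)
with open(demo_path, encoding="utf-8") as handle:
    content = handle.read()
print(content)
os.remove(demo_path)
```

This form only covers single-line values, which is enough for a comma-separated file list.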
CodeFactor found an issue: Starting a process with a partial executable path
It's currently on:
tests\test_check.py:53
Commit 0d77054
A possible improvement is to export the checks results into a `.txt` report that will be committed to the folder.
I'm going to open a PR to fix this ASAP
buildtest-framework/README.rst
------------------------------
Traceback (most recent call last):
File "/check.py", line 99, in <module>
check_results = check_repo(file_paths, print_all, white_listed_urls,
File "/check.py", line 71, in check_repo
check_results = urlproc.check_urls(file, urls, retry_count, timeout)
File "/core/urlproc.py", line 99, in check_urls
do_retry = check_response_status_code(response, print_format)
TypeError: check_response_status_code() missing 1 required positional argument: 'print_format'
I wish GitHub actions had some way to check this outside of actions.
Feel free to take a look at: https://github.com/tinova/docs/runs/335326174
Hi,
Using this action I have problems verifying `nersc.gov` and `ornl.gov` certificates in a standard Ubuntu-20.04 GH action instance:
HTTPSConnectionPool(host='docs-dev.nersc.gov', port=443): Max retries exceeded with url: /cgpu/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://docs-dev.nersc.gov/cgpu/
HTTPSConnectionPool(host='www.olcf.ornl.gov', port=443): Max retries exceeded with url: /summit/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://www.olcf.ornl.gov/summit/
Googling the interwebs, I think that we can fix this by also pre-installing `certifi` into the actions:
I just noticed it failed to build docker container today as part of the github workflow. I have been using the same version (v0.2.3) for some time and it was working fine.
See https://github.com/buildtesters/buildtest/pull/699/checks?check_run_id=2170371755
My urlcheck workflow can be found at https://github.com/buildtesters/buildtest/blob/devel/.github/workflows/urlchecker.yml
CodeFactor found an issue: Do not use apt-get upgrade or dist-upgrade
It's currently on:
Dockerfile:6
For instance "https://x.x.x.x/mob/?moid=yyyyy" is an example URL and shouldn't give a warning (it would be nice to be able to curate a whitelist of URLs in the action conf file)
CodeFactor found an issue: Comparison to True should be just 'expr'
It's currently on:
tests\test_check.py:21
Commit 0d77054
Consider adding tests that include e-mail addresses (maybe you already have them somewhere and I missed them). As you know, there are corner cases when using regular expressions, so it is good to find them :)
Maybe take a look at this repo and tests, although this one is not bulletproof either lipoja/URLExtract#13.
Sorry for asking without trying first!
I found this repo via https://vsoch.github.io/2020/urlchecker/, it looks really cool
See https://developer.github.com/v3/checks/runs/#output-object
And e.g. https://github.com/ropensci/dev_guide/pull/258/checks?check_run_id=555253517
It'd be something nicer to read than the log.
Not sure if very useful. I thought of this when seeing other link checker actions mentioning "Markdown reports".
Today we are deprecating the following versions:
The reason is issue #88 - basically we changed the versioning to match urlchecker-python (meaning latest is now 0.0.27 and the previous 0.2.x and 0.1.x are older) and I did not realize dependabot would go around suggesting an update to an older version. I've manually searched for repos using urlchecker-action to update their versions, but I apologize if I missed yours! If I could go back in time, knowing that dependabot now does this I would _not_ change the versioning scheme, but since it's already done (and for a few releases) I'm doing my best to make it right! But I apologize if you've come here wondering why your previous version was removed - I'm aware this isn't good practice and typically wouldn't do it if dependabot wasn't actively trying to deprecate versions.
So please update to 0.0.27 for a 7x speed up and to ensure your workflows do not break! To be clear, 0.0.27 is actually the newest release. https://github.com/urlstechie/urlchecker-action/releases/tag/0.0.27
Hello,
We have a PR in progress in easybuild, see easybuilders/easybuild#591. One of the issues we have is that URL-checker can't handle variable substitution. We have configuration files (`.eb`) that build the HTTP URL from a few parameters, including `version`, to fetch the tarball for compilation.
Is it possible to fix this issue, or is the only workaround to ignore these files using the white list? Can we white list by extension instead of by each filename?
Take a look at berlin-hack-and-tell/bhnt.c-base.org#320 and you'll see @dependabot trying to merge the update from 0.0.27 to 0.2.31.
It is confused by the version scheme change.
Since you cannot remove the "old" versions, I suggest changing the version scheme again. Since you want to match URLchecker, you should use major.minor versions only, e.g. 0.0.27 becomes 0.27.
That will fix the issue since 0.27 > 0.2.31 > 0.0.27.
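The ordering can be checked with a quick sketch, assuming versions are compared numerically component by component (I am assuming Dependabot's ordering behaves this way for these tags); `version_key` is a hypothetical helper.

```python
def version_key(version):
    # Compare versions numerically, component by component.
    return tuple(int(part) for part in version.split("."))

versions = ["0.0.27", "0.2.31", "0.27"]
print(sorted(versions, key=version_key))
print(max(versions, key=version_key))
```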
Just a thought: the URLchecker is a nice feature, but to be honest it is not practical to go into the output of the Action to see the error report. It would be nice to have a bot report all the broken links in the PR as part of the check. Code reviewers and authors would be keen on getting this information in their PR so they can fix their code.
In this case, the bot would report only failures; if there are none, it would post a custom message of your choosing, such as SUCCESS: all URL links are valid!!
Once this is implemented, I am sure it would be very useful for lots of folks.
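The comment body itself would be simple to build. Here is a sketch with a hypothetical `build_comment` function; how the failed URLs are collected and how the comment gets posted to the PR are left out.

```python
def build_comment(failed_urls):
    # No failures: post the success message instead of an empty report.
    if not failed_urls:
        return "SUCCESS: all URL links are valid!!"
    lines = ["The following URLs could not be reached:", ""]
    lines.extend(f"- {url}" for url in sorted(failed_urls))
    return "\n".join(lines)

print(build_comment([]))
print(build_comment(["https://b.example/missing", "https://a.example/gone"]))
```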
CodeFactor found an issue: The if statement can be replaced with 'return bool(test)'
It's currently on:
core\fileproc.py:22
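For reference, the pattern the linter flags and its replacement look like this. It is a generic sketch; the function names are made up and this is not the code at core\fileproc.py:22.

```python
import re

def matches_verbose(pattern, text):
    # The shape the linter flags: an if/else that only returns True or False.
    if re.search(pattern, text):
        return True
    else:
        return False

def matches(pattern, text):
    # Equivalent and shorter: convert the match object (or None) to a bool.
    return bool(re.search(pattern, text))

print(matches_verbose(r"\.md$", "README.md"), matches(r"\.md$", "README.md"))
print(matches_verbose(r"\.md$", "check.py"), matches(r"\.md$", "check.py"))
```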
hey @SuperKogito I saw there was lots of good work today! Could you ping me on this issue when the latest two commits are released? I can then test the released version on the repos where they are needed. Then we can chat more about creating a module, if you are still interested.
hey @SuperKogito ! We are using your action for buildtest, and I'd like to also test it for usrse USRSE/usrse.github.io#171 where we've been using html-proofer. The issue with proofer is that it doesn't have any understanding / implementation for a retry - many of the links in our static site are old HPC documentation or other servers that don't always respond reliably, and might need a retry with exponential backoff up to some number of maximum attempts. Is this something we could look into adding here? I could definitely open a PR if you want to discuss how to go about it (and expose variables to the user).
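As a starting point for that discussion, here is a minimal sketch of retrying with exponential backoff. Everything in it is an assumption for illustration: `request` stands in for the real HTTP call, and `max_attempts` and `base_delay` are the kind of variables that could be exposed to the user.

```python
import time

def retry_with_backoff(request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    # request is a hypothetical callable that returns True on success.
    delays = []
    for attempt in range(max_attempts):
        if request():
            return True, delays
        if attempt < max_attempts - 1:
            # Double the wait after each failure: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            delays.append(delay)
            sleep(delay)
    return False, delays

# Demo: a fake request that fails twice and then succeeds; sleeping is stubbed out.
outcomes = iter([False, False, True])
ok, waited = retry_with_backoff(lambda: next(outcomes), max_attempts=5, sleep=lambda s: None)
print(ok, waited)
```

Passing `sleep` in as a parameter keeps the function testable without actually waiting.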
We've been doing releases with the `latest` tag, but since this is a moving target and the available features are starting to change, I want to transition to having releases here associated with a particular tag/release of urlchecker. For the next round of changes, which adds more tests, we can do a release associated with 0.0.19.