urlstechie / urlchecker-action

:octocat: :link: GitHub action to extract and check URLs in code and documentation.

Home Page: https://urlchecker-python.readthedocs.io

License: MIT License

Dockerfile 5.31% Shell 94.69%

urls link-checker actions urls-checker url-checker url-check github-action github-actions-ci github-actions-python continuous-testing

urlchecker-action's People

say researchapps superkogito shahzebsiddiqui pombredanne ax3l shyndman yemo-memeda mrmundt neilhanlon ricardozanini

urlchecker-action's Issues

Remove termcolor dependency

from termcolor import colored is present in two scripts. In one it is unused and in the other it can be replaced by a regular print.

URLs are being truncated and then failing in Markdown, TOML files

I can't understand how or why this might be happening, so I'll just describe what I'm seeing. I'm using urlstechie/urlchecker-action@master.

urlchecker-action is truncating some URLs, trying to reach those truncated URLs, and reporting them as broken. Examples:

[![linux](https://raw.githubusercontent.com/devicons/devicon/master/icons/linux/linux-original.svg)](https://www.linux.org/)

truncated to https://www.linux

  url = "https://codepen.io/rootwork/"

truncated to https://codepen.io/root

[lazy-render things](https://www.drupal.org/node/1982024)

truncated to https://www.drupal.org/node/19

urlchecker is looking at .md, .toml, .yml and .scss files. The ones that are being truncated are in the Markdown and TOML files but that might just be because there are many more of them.

These URLs are not split between lines; other than the first example above they are all actually pretty short URLs. The Markdown/TOML files validate. Many URLs, including other ones in those same files, are not truncated and pass urlchecker just fine.

Adjusting timeout and retry values on the linkchecker doesn't affect anything, because they're trying to check the wrong URLs.

All URLs were passing until July 12, at which point I introduced some broken URLs. After fixing them on July 13, I started getting the truncated URLs. (And to be clear the truncated ones aren't the ones that had been broken; they don't seem to have anything to do with each other.)

Any idea what might be happening?

URLchecker has no concept of branch

Currently, we are checking out a git repo directly:

https://github.com/urlstechie/URLs-checker/blob/master/check.py#L15

However, this has no concept of branch - we would want to check out the branch for whatever PR is being run. We noticed this when running a PR for the devel branch, but seeing the current content of master.

I can open a PR to fix this up, maybe we can get it integrated with the next release that also includes #19 ?

Should git_path default to current repo?

this way it'd be easier to copy-paste the workflow file from one repo to another (a lazy question, yes).

Additional tests needed for .github workflow

Currently, we do one run that does basic checks for the testing repository. We would want to add additional runs that:

use save and verify that the file exists
use cleanup to verify that the repository is deleted
others that might be useful

We absolutely don't want any PRs being merged that possibly break any previous functionality.

Add white listed files and patterns variable

Currently, I can add _config.yml and README.md to my white_listed_patterns, but the files /github/workspace/_config.yml and /github/workspace/README.md are still checked. I suspect it's looking for the full path or starts with, and as a user I'd expect it to do more of a re.search (using my pattern). E.g., this should work without /github/workspace

        # Cannot check private GitHub settings
        white_listed_patterns: _config.yml,README.md,SocialNetworks.yml,.github/workflows,tests
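
The matching behavior requested above could be sketched as follows. This is a minimal illustration, not the action's actual code: the helper name `is_excluded` is hypothetical, and each pattern is assumed to be a regular expression searched anywhere in the path.

```python
import re


def is_excluded(file_path, patterns):
    """Return True if any pattern matches anywhere in file_path.

    Hypothetical helper for illustration; not part of urlchecker.
    """
    return any(re.search(pattern, file_path) for pattern in patterns)


patterns = ["_config.yml", "README.md", ".github/workflows", "tests"]
paths = [
    "/github/workspace/_config.yml",
    "/github/workspace/README.md",
    "/github/workspace/docs/index.md",
]
# Keep only the files that no pattern matches.
kept = [p for p in paths if not is_excluded(p, patterns)]
print(kept)
```

Because the patterns are treated as regular expressions, a dot matches any character; wrapping each pattern in `re.escape` would give literal substring matching instead.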

fix code coverage reports generation

It seems that codecov is not linked/configured correctly.
The move from my personal folder to @urlstechie seems to be the cause.

Fix 'Starting a process with a shell, possible injection detected, security issue.' issue in check.py

CodeFactor found an issue: Starting a process with a shell, possible injection detected, security issue.

It's currently on:
check.py:21
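
The call at check.py:21 is not shown here, so the following is only a sketch of the usual remedy for this class of warning: pass the command as a list of arguments and avoid the shell. The helper name `run_command` is hypothetical.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments, without a shell."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# A value containing shell metacharacters is passed through as plain text,
# because no shell ever interprets it.
untrusted = "hello; echo injected"
output = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
)
print(output)
```

With a list of arguments, the semicolon in `untrusted` is just another character in a single argument rather than a command separator.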

Document how to use url checker for pull request changes

Presume we build a website with a static website generator (jekyll, jbake, hugo, ...)
The output website is in directory output.

How do we check that there are no new broken URLs in a pull request?
After building it from a pull request, by checking the files in the output directory.

It always checks out a new branch, because parameter branch defaults to master, so it ignores the changes in the pull request.
Document how to tell it to use the current branch (don't check anything out).

false positive link

hi @SuperKogito

A link is being reported as broken (see https://github.com/HPC-buildtest/buildtest-framework/runs/428310246?check_suite_focus=true), but in reality it is working.

The actual link is https://www.hpcwire.com/2019/01/17/pfizer-hpc-engineer-aims-to-automate-software-stack-testing/ which is found but the url-checker is stating it is not found. I see a red X next to it.

I am using v0.1.2

Change name of "whitelist" options to "exclude" options

It's confusing that the white_listed_ inputs for URLs, patterns, and files excludes things rather than including them; usually you'd use the term "whitelist" to describe things you are explicitly including that would otherwise not be included.

Additionally, in the output "whitelist" is misspelled for the URL option:

url whitetlist: []

My suggestion would be to change these three input options to exclude_urls, exclude_patterns and exclude_files, to match the include_files option that already exists. (And also update those values in the output itself.)

If you don't want to break existing configs, you could leave the white_listed_ options as valid along with the new ones, but only list the latter in the docs.

Can't use action to check only dotfiles

Hey, me again 😄

I ran urlchecker-python v0.22 on a directory with dotfiles (.editorconfig, etc.) using this command:

urlchecker check --file-types '.*' .

That works fine.

Then I ran urlchecker-action v0.2.3, which as I understand includes the new version of the python script, with the following option:

file_types: '.*'

And the action simply skips all files and "passes" with "Done. No urls were collected."

Notably, the output of the action's build (on GitHub) says:

file types: ['.']

So I think it is removing the asterisk entirely. I tried using double quotes, using no quotes, escaping the asterisk with a backslash, and using a double-asterisk, but the result was the same each time.

Any thoughts?

Question: how to implement action to respond to comments?

This is just a random idea I thought would be cool - we could have an action that responds to issue comments, and looks for a particular string (like @urlchecker-action check) and then it could do some special check for a url.

@maelle you just gave me a really cool idea - we could have some kind of action / bot that responded to /check-urls (or something like @urlchecker-action check-urls). I've never made a bot before, but I'll note it because it would be cool to have!

Definitely not priority - but would be fun to figure out how to do!

fix 0.1.8 to use urlchecker 0.1.14

@vsoch I updated, fixed some bugs and tested urlchecker-python. I also automated the quay.io docker builds and released a new version on pypi, 0.1.14. I edited the action to use this version and tested it on https://github.com/urlstechie/urlchecker-test-repo/ but it is failing. I am not sure what I missed, and since you are more familiar with the new structure, can you please take a look? Here are the logs: https://github.com/urlstechie/urlchecker-test-repo/runs/562209780?check_suite_focus=true

When python library exists, consider renaming for action

When we have more than one repository (this one here) and we refactor the action, we might possibly want to consider renaming this repository (while it's still early and we can track down users and make sure they update names) to something that strongly indicates "I'm a GitHub action." E.g.,

urlschecker-python: --> deploys urlschecker python module
urlschecker-action -> GitHub action

And then the namespace is a bit more clear.

fake_useragent error

Hey guys!

Not sure if I'm doing something wrong, but on a simple check of a Jekyll website I got:

WARNING:fake_useragent:Error occurred during loading data. Trying to use cache server https://fake-useragent.herokuapp.com/browsers/0.1.12

In the end, the checker won't find any URL to check, ends in an error and the CI passes. Here's the full log: https://github.com/kiegroup/kogito-website/runs/7903901907?check_suite_focus=true

My yaml:

name: Check URLs

on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: ruby/setup-ruby@v1
      with:
        ruby-version: '3.1'
        bundler-cache: true
    - uses: actions/cache@v3
      with:
        path: vendor/bundle
        key: ${{ runner.os }}-gems-${{ hashFiles('**/Gemfile') }}
        restore-keys: |
          ${{ runner.os }}-gems-
    - run: |
          gem install bundler jekyll
          bundle check || bundle install
          bundle exec jekyll build
    - name: urls-checker
      uses: urlstechie/[email protected]
      with:
        # subfolder with files to test
        subfolder: _site
        # A comma-separated list of file types to cover in the URL checks
        file_types: .html,.js,.css,.xml
        # Choose whether to include file with no URLs in the prints.
        print_all: false
        # The timeout seconds to provide to requests, defaults to 5 seconds
        timeout: 5
        # How many times to retry a failed request (each is logged, defaults to 1)
        retry_count: 3
        # choose if the force pass or not
        force_pass : false

The cache action can be ignored, I guess the ruby action is already doing this work, I'll review later.

Example in README

In https://github.com/urlstechie/urlchecker-action#example-with-checkout there's a mention of v0.1.9 but the latest release is 0.1.8. :-)

URL checker fails on URLs that work

e.g. https://github.com/pybamm-team/PyBaMM/runs/7115511512?check_suite_focus=true#step:4:631

Are there different settings we can use to make it work?

Update Codacy to only check code files

I'm not actually sure how to do this, so @SuperKogito it's all you! When you make changes and PR, I'm interested to see how to go about this!

URL vs link checking

Some more user feedback/questions 😉 With the current action&library, URLs are checked but not links: if you typed [a cool post](htttps://blabla.org) by mistake, the wrong link (should be https://blabla.org) won't be caught. In Markdown/html files, "links" are well-defined (in comments in code, I agree, less so).

Other actions focus on link checking, I wonder whether this could be added to this action somehow, or as a limitation to the docs maybe.

(I suppose this is partly the Commonmark debate again ;-))

Maybe a good workflow for a website/thing is

check links and URLs when content is created;
check URLs once in a while (the URL can get broken, a link that's not malformed won't change).

Advantages of your library/action: retry, artifact, UA. But do I use another library/action for checking links? 🤔
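
The distinction drawn above between URLs and links could be illustrated with a small check on Markdown link targets. This is a sketch under stated assumptions, not a feature of the action: the helper `malformed_links` and its regular expression are hypothetical, and only inline links of the form `[text](target)` are handled.

```python
import re
from urllib.parse import urlparse

# Inline Markdown links: [text](target)
MARKDOWN_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def malformed_links(markdown):
    """Return link targets whose scheme is present but not http or https.

    Targets without a scheme (relative links, anchors) are ignored.
    """
    bad = []
    for _text, target in MARKDOWN_LINK.findall(markdown):
        scheme = urlparse(target).scheme
        if scheme and scheme not in ("http", "https"):
            bad.append(target)
    return bad


sample = "Read [a cool post](htttps://blabla.org) and [the docs](https://blabla.org)."
print(malformed_links(sample))
```

Note that this simple rule would also flag legitimate non-HTTP schemes such as `mailto:`, so a real implementation would likely need an allow-list.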

Check results returned before loop completed

I'm not sure if this is intentional, but the list of check_results is returned after just parsing one of the urls in the list:

https://github.com/urlstechie/URLs-checker/blob/3200baf59d73c3e8c0fd92e9ce73c375603b4b2b/core/urlproc.py#L59

Assuming that there is one list of urls, wouldn't we want to go through all of them, update check_results as we go, and return the final two lists? I'm working on a PR now, I can update this to fix the issue (if it isn't intentional!)
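
The intended control flow might look like the sketch below. The linked `urlproc.py` code is not reproduced here, so the function shape and the `is_ok` callable (a stand-in for the real HTTP check) are assumptions for illustration.

```python
def check_urls(urls, is_ok):
    """Check every URL and return (passed, failed) lists.

    `is_ok` is a hypothetical callable standing in for the real HTTP check.
    """
    passed, failed = [], []
    for url in urls:
        if is_ok(url):
            passed.append(url)
        else:
            failed.append(url)
    # Return only after the loop has visited every URL.
    return passed, failed


passed, failed = check_urls(
    ["https://a.example", "https://b.example/broken", "https://c.example"],
    is_ok=lambda url: "broken" not in url,
)
print(passed, failed)
```

The key point is that the `return` sits outside the loop body, so both lists are complete when they are handed back.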

Accessing results file as artifact

Thanks, very cool! My plan is to write a workflow uploading the csv as artifact. :-)

Originally posted by @maelle in urlstechie/urlchecker-python#24 (comment)

I'm trying to write a workflow uploading the artifact and have a question. I tried various things and might be missing something obvious regarding paths.

Workflow

https://github.com/r-hub/docs/runs/582541285?check_suite_focus=true

The URL-checks action is "Saving results to /github/workspace/output/urls.csv"
But then the upload-artifact function tries to find it in "/home/runner/work/docs/docs/output"

Do you have any tip regarding where it might be best to tweak things?

No worries if you don't answer, I'm not sure this is the right place to ask. :-)

Make it easier to look through real-life examples?

Could the list of examples at the end of the README be a table with

community name, linking to repo as it is now the case
workflow file presented as [workflow running blablabla every blablabla](permalink to workflow file)
example log?

I'm asking because that's what I go look for in example repos.

Consider removing docs folder

I don't think it's relevant here anymore - it's completely represented in https://github.com/urlstechie/urlchecker-python. I wasn't 100% sure so I didn't want to delete.

black linting

@SuperKogito what are your thoughts on adding black for code formatting? It will be easier to enforce a standard for code style, and make the code a bit easier to read. We are doing a fairly good job of keeping it neat, but some of the sections with input arguments are a bit crunched and would benefit from black. There are a few levels at which we can add it:

enforced as check with GitHub workflow action (and message to user if it fails with instructions to run it)
done locally, but not enforced

Let me know your thoughts! I'm working on the two issues I opened this morning now, but if you like the idea of black I can do a PR with a workflow for that after.

Error when no files to check

Great tool, thanks!

But it seems like it throws an error if there are no files to check. Shouldn't it just quietly succeed in that case?

A little ugly, but here's a snip from the log file.

urlchecker check --branch master --no-print --file-types .md,.py,.rst,.html --exclude-urls https://en.w,https://github.com/bssw-tutorial/presentations/blob/master,https://doi.org/10.1126/science.aah6168,https://doi.org/10.1002/spe.2220 --retry-count 1 --timeout 5 --files .github/workflows/check-pr-urls.yml .
Traceback (most recent call last):
original path: .
final path: /github/workspace
subfolder: None
branch: master
File "/opt/conda/bin/urlchecker", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.9/site-packages/urlchecker/client/__init__.py", line 191, in main
main(args=args, extra=extra)
File "/opt/conda/lib/python3.9/site-packages/urlchecker/client/check.py", line 83, in main
check_results = checker.run(
File "/opt/conda/lib/python3.9/site-packages/urlchecker/core/check.py", line 193, in run
for file_name, result in results.items():
AttributeError: 'NoneType' object has no attribute 'items'
cleanup: False
file types: ['.md', '.py', '.rst', '.html']
files: ['.github/workflows/check-pr-urls.yml']
print all: False
verbose: False
urls excluded: ['https://en.w', 'https://github.com/bssw-tutorial/presentations/blob/master', 'https://doi.org/10.1126/science.aah6168', 'https://doi.org/10.1002/spe.2220']
url patterns excluded: []
file patterns excluded: []
force pass: False
retry count: 1
save: None
timeout: 5
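
The traceback above fails on `results.items()` when `results` is `None`. One way to make that case succeed quietly is to treat a missing result as an empty mapping. The sketch below is not the library's code: the function name `summarize` and the per-file data layout are assumptions for illustration.

```python
def summarize(results):
    """Collect passed and failed URLs from a per-file results mapping.

    `results` may be None when no files were checked; treat that as empty.
    The data layout here is an assumption for illustration.
    """
    summary = {"passed": [], "failed": []}
    for file_name, result in (results or {}).items():
        summary["passed"].extend(result.get("passed", []))
        summary["failed"].extend(result.get("failed", []))
    return summary


print(summarize(None))
print(summarize({"README.md": {"passed": ["https://example.com"], "failed": []}}))
```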

The `set-env` command is disabled

This action currently cannot run due to additional security measures.

Run PR=$(jq --raw-output .pull_request.number "${GITHUB_EVENT_PATH}")
Error: Unable to process command '::set-env name=PR::291' successfully.
Error: The `set-env` command is disabled. Please upgrade to using Environment Files or opt into unsecure command execution by setting the `ACTIONS_ALLOW_UNSECURE_COMMANDS` environment variable to `true`. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/

Adding the following to the jobs: is a workaround.

env:
      ACTIONS_ALLOW_UNSECURE_COMMANDS: true

However, ideally the examples are updated to use the new $GITHUB_ENV file method or similar. For example,

          files=$(curl --request GET \
          --url https://api.github.com/repos/${{ github.repository }}/pulls/$PR/files \
          --header 'authorization: Bearer ${{ secrets.GITHUB_TOKEN }}' \
          --header 'Accept: application/vnd.github.antiope-preview+json' \
          --header 'content-type: application/json' | jq --raw-output .[].filename | sed 's/^\|$/"/g'|paste -sd, - | tr -d \" | tr -d \')
          echo "files=$files" >> $GITHUB_ENV

env:
          PR: ${{ github.event.issue.number }}
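
For steps written in Python rather than shell, the same environment-file mechanism can be used by appending a `NAME=value` line to the file named by `GITHUB_ENV`. This is a minimal sketch: the helper `export_env` is hypothetical, and the demonstration writes to a temporary file rather than a real runner's environment file.

```python
import os
import tempfile


def export_env(name, value, env_file=None):
    """Append NAME=value to the GitHub Actions environment file."""
    env_file = env_file or os.environ["GITHUB_ENV"]
    with open(env_file, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")


# Demonstration against a temporary file instead of a real runner.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    demo_path = tmp.name
export_env("files", "README.md,docs/index.md", env_file=demo_path)
with open(demo_path, encoding="utf-8") as handle:
    written = handle.read()
os.remove(demo_path)
print(written)
```

This single-line form only suits values without newlines; multi-line values need the delimiter syntax described in the GitHub documentation.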

Fix 'Starting a process with a partial executable path' issue in tests\test_check.py

CodeFactor found an issue: Starting a process with a partial executable path

It's currently on:
tests\test_check.py:53
Commit 0d77054
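
The flagged line in tests\test_check.py is not shown here, so the following only sketches the common fix for this warning: resolve the program to an absolute path before starting it. The helper name `run_resolved` is hypothetical.

```python
import shutil
import subprocess
import sys


def run_resolved(program, *args):
    """Resolve `program` to an absolute path before running it."""
    full_path = shutil.which(program)
    if full_path is None:
        raise FileNotFoundError(f"{program} not found on PATH")
    return subprocess.run(
        [full_path, *args], capture_output=True, text=True, check=True
    )


# The current interpreter is used so the example does not depend on PATH.
result = run_resolved(sys.executable, "-c", "print('ok')")
print(result.stdout)
```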

Export results to a file and commit it to git repo

A possible improvement is to export the checks results into a .txt report that will be committed to the folder.

Missing parameter for recent retry update

I'm going to open a PR to fix this ASAP

 buildtest-framework/README.rst 
 ------------------------------
Traceback (most recent call last):
  File "/check.py", line 99, in <module>
    check_results = check_repo(file_paths, print_all, white_listed_urls,
  File "/check.py", line 71, in check_repo
    check_results = urlproc.check_urls(file, urls, retry_count, timeout)
  File "/core/urlproc.py", line 99, in check_urls
    do_retry = check_response_status_code(response, print_format)
TypeError: check_response_status_code() missing 1 required positional argument: 'print_format'

I wish GitHub actions had some way to check this outside of actions.

Some URLs break verification

Feel free to take a look at: https://github.com/tinova/docs/runs/335326174

Certificates: nersc.gov, ornl.gov

Hi,

Using this action I have problems verifying nersc.gov and ornl.gov certificates in a standard Ubuntu-20.04 GH action instance:

HTTPSConnectionPool(host='docs-dev.nersc.gov', port=443): Max retries exceeded with url: /cgpu/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://docs-dev.nersc.gov/cgpu/

HTTPSConnectionPool(host='www.olcf.ornl.gov', port=443): Max retries exceeded with url: /summit/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://www.olcf.ornl.gov/summit/

Googling the interwebs, I think that we can fix this by also pre-installing certifi into the actions:

Newlines not being stripped

Testing release 0.1.6, even though there is a strip (that in manual testing will remove a newline) it appears that there is still a newline character -

failed to build docker container

I just noticed it failed to build docker container today as part of the github workflow. I have been using the same version (v0.2.3) for some time and it was working fine.

See https://github.com/buildtesters/buildtest/pull/699/checks?check_run_id=2170371755

My urlcheck workflow can be found at https://github.com/buildtesters/buildtest/blob/devel/.github/workflows/urlchecker.yml

Fix 'Do not use apt-get upgrade or dist-upgrade' issue in Dockerfile

CodeFactor found an issue: Do not use apt-get upgrade or dist-upgrade

It's currently on:
Dockerfile:6

Whitelist urls

For instance "https://x.x.x.x/mob/?moid=yyyyy" is an example URL and shouldn't give a warning (it would be nice to be able to curate a whitelist of URLs in the action conf file)

Failed urls print should be visually separate

Currently, when we print the list of URLS that don't pass, it gets sort of lost in the same block as the last check. Here is an example:

We should have a newline (and possibly a header that stands out) for this section.

Fix 'Comparison to True should be just 'expr'' issue in tests\test_check.py

CodeFactor found an issue: Comparison to True should be just 'expr'

It's currently on:
tests\test_check.py:21
Commit 0d77054
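
The flagged test line is not reproduced here; as a generic illustration of what the linter asks for, the sketch below uses a hypothetical `has_urls` function and asserts on the expression directly instead of comparing it to `True`.

```python
def has_urls(urls):
    """Return True when the list contains at least one URL."""
    return len(urls) > 0


found = ["https://example.com"]

# Flagged style:   assert has_urls(found) == True
# Preferred style: rely on the expression's own truth value.
assert has_urls(found)
assert not has_urls([])
```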

Improving tests

Consider adding tests that include e-mail addresses (or maybe you made them somewhere and I missed them). As you know, there are corner cases when using regular expressions, so it is good to find them :)

Maybe take a look at this repo and its tests, although this one is not bulletproof either: lipoja/URLExtract#13.

Can the URL checks run on the files added/edited by a PR only?

Sorry for asking without trying first!

I found this repo via https://vsoch.github.io/2020/urlchecker/, it looks really cool 👏

Should the action also output a Markdown report in a check run?

See https://developer.github.com/v3/checks/runs/#output-object

And e.g. https://github.com/ropensci/dev_guide/pull/258/checks?check_run_id=555253517

It'd be something nicer to read than the log.

Not sure if very useful. I thought of this when seeing other link checker actions mentioning "Markdown reports".

NOTICE: deprecated 0.2.x and 0.1.x versions!

Today we are deprecating the following versions:

0.2.31
0.2.3
0.2.2
0.2.1
0.2.0
0.1.9
0.1.8
0.1.7
0.1.6
0.1.5
0.1.4
0.1.3
0.1.2
0.1.2

The reason is issue #88 - basically we changed the versioning to match urlchecker-python (meaning latest is now 0.0.27 and the previous 0.2.x and 0.1.x are older) and I did not realize dependabot would go around suggesting an update to an older version. I've manually searched for repos using urlchecker-action to update their versions, but I apologize if I missed yours! If I could go back in time, knowing that dependabot now does this, I would _not_ change the versioning schema, but since it's already done (and for a few releases) I'm doing my best to make it right! But I apologize if you've come here wondering why your previous version was removed - I'm aware this isn't good practice and typically wouldn't do it if dependabot wasn't actively trying to deprecate versions.

So please update to 0.0.27 for a 7x speed up and to ensure your workflows do not break! To be clear, 0.0.27 is actually the newest release. https://github.com/urlstechie/urlchecker-action/releases/tag/0.0.27

variable substitution in url field

Hello,

We have a PR in progress in easybuild (see easybuilders/easybuild#591). One of the issues we have is that URL-checker can't accept variable substitution. We have configuration files (.eb) that build the HTTP URL from a few parameters, including the version, to fetch the tarball for compilation.

Is it possible to fix this issue, or is the only workaround to ignore these files using the white list? Can we white list by extension instead of by each filename?

dependabot is trying to update from 0.0.27 to 0.2.31 after version scheme change

Take a look at berlin-hack-and-tell/bhnt.c-base.org#320 and you'll see @dependabot trying to merge the update from 0.0.27 to 0.2.31.

It is confused by the version scheme change.

Since you cannot remove "old" version, I suggest changing the version scheme again. Since you want to match URLchecker, then you should use major.minor versions only. e.g. 0.0.27 becomes 0.27.

That will fix the issue since 0.27 > 0.2.31 > 0.0.27.
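
The ordering claimed above can be checked with a simple numeric comparison of version components. This is a simplified sketch; how dependabot actually compares versions is not confirmed here, and the helper `version_key` is hypothetical.

```python
def version_key(version):
    """Turn '0.2.31' into (0, 2, 31) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


versions = ["0.27", "0.0.27", "0.2.31"]
# Sort from oldest to newest under component-wise numeric comparison.
ordered = sorted(versions, key=version_key)
print(ordered)
```

Under this comparison the second component decides the order: 27 is greater than 2, which is greater than 0.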

Add logo for the action?

This is my suggestion for the action logo. We cannot set it in the marketplace but it is still a nice addition imo.

url checker report in Github

Just a thought: the URLchecker is a nice feature, but to be honest it is not practical to go into the output of the Action to see the error report. It would be nice to have a bot report all the broken links as part of the check in the PR. The code reviewers and authors would be keen on getting this information in their PR so they can fix their code.

In this case, the bot reports only failures; if there are none, it could post a custom message of your liking, such as SUCCESS: all URL links are valid!!

Once this is implemented, i am sure it would be very useful for lots of folks.

Fix 'The if statement can be replaced with 'return bool(test)'' issue in core\fileproc.py

CodeFactor found an issue: The if statement can be replaced with 'return bool(test)'

It's currently on:
core\fileproc.py:22
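
The code at core\fileproc.py:22 is not shown here, so the example below only illustrates the general rewrite the linter suggests, using a hypothetical URL test as the condition.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def contains_url_verbose(text):
    """Flagged style: an if/else that only returns True or False."""
    if URL_PATTERN.search(text):
        return True
    else:
        return False


def contains_url(text):
    """Equivalent, shorter form suggested by the linter."""
    return bool(URL_PATTERN.search(text))


print(contains_url("see https://example.com"), contains_url("no links here"))
```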

Ping me when there is release 0.1.7

hey @SuperKogito I saw there was lots of good work today! Could you ping me on this issue when the latest two commits are released? I can then test the released version on the repos where they are needed. Then we can chat more about creating a module, if you are still interested.

Retry if failed parameters?

hey @SuperKogito ! We are using your action for buildtest, and I'd like to also test it for usrse USRSE/usrse.github.io#171 where we've been using html-proofer. The issue with proofer is that it doesn't have any understanding / implementation for a retry - many of the links in our static site are old HPC documentation or other servers that don't always respond reliably, and might need a retry with exponential backoff up to some number of maximum attempts. Is this something we could look into adding here? I could definitely open a PR if you want to discuss how to go about it (and expose variables to the user).
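
A retry with exponential backoff, as proposed above, could be sketched like this. It is an illustration only, not the action's implementation: the function name, its parameters, and the injectable `check` and `sleep` callables are all assumptions.

```python
import time


def retry_with_backoff(check, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `check()` until it returns True or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, retries + 1):
        if check():
            return True, attempt
        if attempt < retries:
            sleep(base_delay * 2 ** (attempt - 1))
    return False, retries


# Demonstration with a fake check that succeeds on the third call
# and a recording "sleep" so the example runs instantly.
calls = {"count": 0}
delays = []


def flaky_check():
    calls["count"] += 1
    return calls["count"] >= 3


result = retry_with_backoff(flaky_check, retries=5, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Exposing `retries` and `base_delay` as user-facing options would match the request to let users tune the maximum number of attempts.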

Next release should use tagged container version

We've been doing releases with the latest tag, but since this is a moving target and features available are starting to change, I want to transition to having releases here associated with a particular tag/release for urlchecker. For the next round of tests to go in to address adding more tests, we can do a release associated with 0.0.19.