
urlchecker-action's Issues

Remove termcolor dependency

from termcolor import colored is present in two scripts. In one it is unused and in the other it can be replaced by a regular print.

URLs are being truncated and then failing in Markdown, TOML files

I can't understand how or why this might be happening, so I'll just describe what I'm seeing. I'm using urlstechie/urlchecker-action@master.

urlchecker-action is truncating some URLs, trying to reach those truncated URLs, and reporting them as broken. Examples:

[![linux](https://raw.githubusercontent.com/devicons/devicon/master/icons/linux/linux-original.svg)](https://www.linux.org/)

truncated to https://www.linux

  url = "https://codepen.io/rootwork/"

truncated to https://codepen.io/root

[lazy-render things](https://www.drupal.org/node/1982024)

truncated to https://www.drupal.org/node/19

[screenshot: truncated-urls]

urlchecker is looking at .md, .toml, .yml and .scss files. The ones that are being truncated are in the Markdown and TOML files but that might just be because there are many more of them.

These URLs are not split between lines; other than the first example above they are all actually pretty short URLs. The Markdown/TOML files validate. Many URLs, including other ones in those same files, are not truncated and pass urlchecker just fine.

Adjusting timeout and retry values on the link checker doesn't change anything, because it's retrying the wrong, truncated URLs.

All URLs were passing until July 12, at which point I introduced some broken URLs. After fixing them on July 13, I started getting the truncated URLs. (And to be clear the truncated ones aren't the ones that had been broken; they don't seem to have anything to do with each other.)

Any idea what might be happening?
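For what it's worth, here's a hypothetical sketch of the class of bug this looks like - it is not urlchecker's actual extraction pattern, just an illustration of how a regex whose character class disallows certain characters will silently cut a URL short:

import re

# Illustrative pattern only: matching stops at the first character the
# class does not allow (here, digits and quotes), truncating the URL.
pattern = re.compile(r"https?://[A-Za-z./-]+")

for text in (
    'url = "https://codepen.io/rootwork/"',
    "[lazy-render things](https://www.drupal.org/node/1982024)",
):
    print(pattern.findall(text))
# ['https://codepen.io/rootwork/']
# ['https://www.drupal.org/node/']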

Additional tests needed for .github workflow

Currently, we do one run that does basic checks for the testing repository. We would want to add additional runs that:

  • use save and verify that the file exists
  • use cleanup to verify that the repository is deleted
  • others that might be useful

We don't want any PR merged that breaks previously working functionality.

Add white listed files and patterns variable

Currently, I can add _config.yml and README.md to my white_listed_patterns, but the files /github/workspace/_config.yml and /github/workspace/README.md are still checked. I suspect it's matching against the full path (or doing a startswith comparison); as a user I'd expect more of a re.search with my pattern. E.g., this should work without the /github/workspace prefix:

        # Cannot check private GitHub settings
        white_listed_patterns: _config.yml,README.md,SocialNetworks.yml,.github/workflows,tests
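For reference, a minimal sketch of the two matching behaviors (the internals here are a guess, not the action's actual code):

import re

path = "/github/workspace/_config.yml"
patterns = ["_config.yml", "README.md"]

# Suspected current behavior: prefix comparison against the full path.
print(any(path.startswith(p) for p in patterns))  # False -> file is still checked

# Expected behavior: search for the pattern anywhere in the path.
print(any(re.search(p, path) for p in patterns))  # True  -> file is excluded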

Document how to use url checker for pull request changes

Suppose we build a website with a static site generator (Jekyll, JBake, Hugo, ...).
The generated site ends up in the directory output.

How do we check that a pull request introduces no new broken URLs?
That is, after building the site from the pull request, by checking the files in the output directory.

It always checks out a new branch, because the branch parameter defaults to master, so it ignores the changes in the pull request.
Please document how to tell it to use the current branch (i.e., not check anything out).

Change name of "whitelist" options to "exclude" options

It's confusing that the white_listed_ inputs for URLs, patterns, and files exclude things rather than include them; usually you'd use the term "whitelist" for things you are explicitly including that would otherwise be excluded.

Additionally, in the output "whitelist" is misspelled for the URL option:

url whitetlist: []

My suggestion would be to change these three input options to exclude_urls, exclude_patterns and exclude_files, to match the include_files option that already exists. (And also update those values in the output itself.)

If you don't want to break existing configs, you could leave the white_listed_ options as valid along with the new ones, but only list the latter in the docs.
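One hypothetical way to keep both spellings valid during a deprecation window (the function and input names here are illustrative, not the action's actual code):

def resolve_exclude_patterns(inputs):
    # Prefer the new name, fall back to the legacy spelling.
    value = inputs.get("exclude_patterns") or inputs.get("white_listed_patterns", "")
    return [v.strip() for v in value.split(",") if v.strip()]

print(resolve_exclude_patterns({"white_listed_patterns": "_config.yml,README.md"}))
# ['_config.yml', 'README.md']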

Can't use action to check only dotfiles

Hey, me again 😄

I ran urlchecker-python v0.22 on a directory with dotfiles (.editorconfig, etc.) using this command:

urlchecker check --file-types '.*' .

That works fine.

Then I ran urlchecker-action v0.2.3, which as I understand it includes the new version of the Python script, with the following option:

file_types: '.*'

And the action simply skips all files and "passes" with "Done. No urls were collected."

Notably, the output of the action's build (on GitHub) says:

file types: ['.']

So I think it is removing the asterisk entirely. I tried double quotes, no quotes, escaping the asterisk with a backslash, and a double asterisk, but the result was the same each time.
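Purely a guess at the mechanism, but this two-liner reproduces the symptom if the action strips asterisks from the input before splitting it:

raw = ".*"
file_types = [t for t in raw.replace("*", "").split(",") if t]
print(file_types)  # ['.'] -- matches the "file types: ['.']" line in the log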

Any thoughts?

Question: how to implement action to respond to comments?

This is just a random idea I thought would be cool - we could have an action that responds to issue comments, looks for a particular string (like @urlchecker-action check), and then does some special check for a URL.

@maelle you just gave me a really cool idea - we could have some kind of action / bot that responds to /check-urls (or something like @urlchecker-action check-urls). I've never made a bot before, but I'll note it here because it would be cool to have!

Definitely not a priority - but it would be fun to figure out!
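A rough sketch of what the trigger side might look like (GITHUB_EVENT_PATH is a real Actions environment variable; the trigger string and everything else is just the idea sketched out):

import json
import os

# In a workflow triggered on issue_comment, the event payload is on disk.
with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)

body = event.get("comment", {}).get("body", "")
if body.strip().startswith("@urlchecker-action check"):
    print("comment asked for a URL check -- hand off to the checker here")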

fix 0.1.8 to use urlchecker 0.1.14

@vsoch I updated, fixed some bugs, and tested urlchecker-python. I also automated the quay.io Docker builds and released a new version on PyPI, 0.1.14. I edited the action to use this version and tested it on https://github.com/urlstechie/urlchecker-test-repo/, but it is failing, and I am not sure what I missed. Since you are more familiar with the new structure, can you please take a look? Here are the logs: https://github.com/urlstechie/urlchecker-test-repo/runs/562209780?check_suite_focus=true

When python library exists, consider renaming for action

When we have more than one repository (this one here) and we refactor the action, we might want to consider renaming this repository (while it's still early enough to track down users and make sure they update names) to something that strongly indicates "I'm a GitHub action." E.g.,

  • urlchecker-python --> deploys the urlchecker Python module
  • urlchecker-action --> GitHub action

That would make the namespace a bit clearer.

fake_useragent error

Hey guys!

Not sure if I'm doing something wrong, but running a simple check on a Jekyll website, I got:

WARNING:fake_useragent:Error occurred during loading data. Trying to use cache server https://fake-useragent.herokuapp.com/browsers/0.1.12

In the end, the checker doesn't find any URL to check and ends in an error, yet the CI passes. Here's the full log: https://github.com/kiegroup/kogito-website/runs/7903901907?check_suite_focus=true

My yaml:

name: Check URLs

on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: ruby/setup-ruby@v1
      with:
        ruby-version: '3.1'
        bundler-cache: true
    - uses: actions/cache@v3
      with:
        path: vendor/bundle
        key: ${{ runner.os }}-gems-${{ hashFiles('**/Gemfile') }}
        restore-keys: |
          ${{ runner.os }}-gems-
    - run: |
          gem install bundler jekyll
          bundle check || bundle install
          bundle exec jekyll build
    - name: urls-checker
      uses: urlstechie/[email protected]
      with:
        # subfolder with files to test
        subfolder: _site
        # A comma-separated list of file types to cover in the URL checks
        file_types: .html,.js,.css,.xml
        # Choose whether to include files with no URLs in the prints.
        print_all: false
        # The timeout in seconds to provide to requests (defaults to 5)
        timeout: 5
        # How many times to retry a failed request (each is logged, defaults to 1)
        retry_count: 3
        # Whether to force a pass even when some URLs fail
        force_pass: false

The cache action can be ignored; I guess the Ruby action is already doing this work. I'll review it later.
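In case it helps, a possible client-side workaround (not a confirmed fix) is to fall back to a static User-Agent when fake_useragent cannot load its data:

try:
    from fake_useragent import UserAgent
    user_agent = UserAgent().firefox
except Exception:
    # Static fallback if fake_useragent cannot reach its data server.
    user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0"

headers = {"User-Agent": user_agent}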

URL vs link checking

Some more user feedback/questions 😉 With the current action & library, URLs are checked but not links: if you typed [a cool post](htttps://blabla.org) by mistake, the wrong link (it should be https://blabla.org) won't be caught. In Markdown/HTML files, "links" are well-defined (in code comments, I agree, less so).

Other actions focus on link checking, I wonder whether this could be added to this action somehow, or as a limitation to the docs maybe.

(I suppose this is partly the Commonmark debate again ;-))
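A minimal sketch of the distinction (patterns are illustrative, not the library's):

import re

text = "[a cool post](htttps://blabla.org)"

# Link checking: pull the target out of the markdown syntax, malformed or
# not, so the typo can be flagged.
print(re.findall(r"\[[^\]]*\]\(([^)]+)\)", text))  # ['htttps://blabla.org']

# URL checking: scan only for well-formed URLs; the typo never matches.
print(re.findall(r"https?://\S+", text))  # []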

Maybe a good workflow for a website/thing is

  • check links and URLs when content is created;

  • check URLs once in a while (a URL can break over time; a link that's not malformed won't change).

Advantages of your library/action: retry, artifact, UA. But do I use another library/action for checking links? 🤔

Check results returned before loop completed

I'm not sure if this is intentional, but the list of check_results is returned after parsing just one of the urls in the list:

https://github.com/urlstechie/URLs-checker/blob/3200baf59d73c3e8c0fd92e9ce73c375603b4b2b/core/urlproc.py#L59

Assuming there is one list of urls, wouldn't we want to go through all of them, updating check_results as we go, and return the final two lists? I'm working on a PR now; I can fix this there (if it isn't intentional!).
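A schematic of the bug as described (not the actual urlproc.py source):

def check_urls_buggy(urls):
    check_results = [[], []]  # passed, failed
    for url in urls:
        # ... check the url, append to passed or failed ...
        return check_results  # BUG: returns after the first iteration

def check_urls_fixed(urls):
    check_results = [[], []]
    for url in urls:
        pass  # ... check the url, append to passed or failed ...
    return check_results  # return only after every url is processed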

Accessing results file as artifact

Thanks, very cool! My plan is to write a workflow uploading the csv as artifact. :-)

Originally posted by @maelle in urlstechie/urlchecker-python#24 (comment)

I'm trying to write a workflow uploading the artifact and have a question. I tried various things and might be missing something obvious regarding paths.

Workflow

https://github.com/r-hub/docs/runs/582541285?check_suite_focus=true

  • The URL-checks action is "Saving results to /github/workspace/output/urls.csv"
  • But then the upload-artifact step tries to find it in "/home/runner/work/docs/docs/output"

Do you have any tip regarding where it might be best to tweak things?

No worries if you don't answer, I'm not sure this is the right place to ask. :-)
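For what it's worth: in a Docker-based action the runner's workspace is mounted at /github/workspace, so /github/workspace/output/urls.csv and /home/runner/work/docs/docs/output/urls.csv are the same file. A small sketch of building the path from the variable both sides share (GITHUB_WORKSPACE is a real Actions variable):

import os

# Inside the container this is /github/workspace; on the runner it is
# /home/runner/work/docs/docs. Relative paths like output/urls.csv work in both.
workspace = os.environ.get("GITHUB_WORKSPACE", ".")
print(os.path.join(workspace, "output", "urls.csv"))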

Make it easier to look through real-life examples?

Could the list of examples at the end of the README be a table with

  • community name, linking to the repo, as is now the case
  • workflow file presented as [workflow running blablabla every blablabla](permalink to workflow file)
  • example log?

I'm asking because that's what I go look for in example repos.

black linting

@SuperKogito what are your thoughts on adding black for code formatting? It would make it easier to enforce a standard code style and make the code a bit easier to read. We are doing a fairly good job of keeping it neat, but some of the sections with input arguments are a bit crunched and would benefit from black. There are a few levels at which we can add it:

  • enforced as check with GitHub workflow action (and message to user if it fails with instructions to run it)
  • done locally, but not enforced

Let me know your thoughts! I'm working on the two issues I opened this morning now, but if you like the idea of black I can do a PR with a workflow for that after.

Error when no files to check

Great tool, thanks!

But it seems to throw an error if there are no files to check. Shouldn't it just quietly succeed in that case?

A little ugly, but here's a snip from the log file.

urlchecker check --branch master --no-print --file-types .md,.py,.rst,.html --exclude-urls https://en.w,https://github.com/bssw-tutorial/presentations/blob/master,https://doi.org/10.1126/science.aah6168,https://doi.org/10.1002/spe.2220 --retry-count 1 --timeout 5 --files .github/workflows/check-pr-urls.yml .
original path: .
final path: /github/workspace
subfolder: None
branch: master
cleanup: False
file types: ['.md', '.py', '.rst', '.html']
files: ['.github/workflows/check-pr-urls.yml']
print all: False
verbose: False
urls excluded: ['https://en.w', 'https://github.com/bssw-tutorial/presentations/blob/master', 'https://doi.org/10.1126/science.aah6168', 'https://doi.org/10.1002/spe.2220']
url patterns excluded: []
file patterns excluded: []
force pass: False
retry count: 1
save: None
timeout: 5
Traceback (most recent call last):
  File "/opt/conda/bin/urlchecker", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/urlchecker/client/__init__.py", line 191, in main
    main(args=args, extra=extra)
  File "/opt/conda/lib/python3.9/site-packages/urlchecker/client/check.py", line 83, in main
    check_results = checker.run(
  File "/opt/conda/lib/python3.9/site-packages/urlchecker/core/check.py", line 193, in run
    for file_name, result in results.items():
AttributeError: 'NoneType' object has no attribute 'items'
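A hypothetical sketch of the guard the traceback suggests is missing: return an empty mapping instead of None when no files match, so the caller's .items() loop is simply a no-op:

def run_checks(file_paths):
    if not file_paths:
        print("Done. No urls were collected.")
        return {}
    results = {}
    # ... check each file and record its results ...
    return results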

The `set-env` command is disabled

This action currently cannot run due to additional security measures.

Run PR=$(jq --raw-output .pull_request.number "${GITHUB_EVENT_PATH}")
Error: Unable to process command '::set-env name=PR::291' successfully.
Error: The `set-env` command is disabled. Please upgrade to using Environment Files or opt into unsecure command execution by setting the `ACTIONS_ALLOW_UNSECURE_COMMANDS` environment variable to `true`. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/

Adding the following to the job is a workaround:

env:
  ACTIONS_ALLOW_UNSECURE_COMMANDS: true

However, ideally the examples should be updated to use the new $GITHUB_ENV file method or similar. For example:

- run: |
    files=$(curl --request GET \
      --url https://api.github.com/repos/${{ github.repository }}/pulls/$PR/files \
      --header 'authorization: Bearer ${{ secrets.GITHUB_TOKEN }}' \
      --header 'Accept: application/vnd.github.antiope-preview+json' \
      --header 'content-type: application/json' | jq --raw-output .[].filename | sed 's/^\|$/"/g' | paste -sd, - | tr -d \" | tr -d \')
    echo "files=$files" >> $GITHUB_ENV
  env:
    PR: ${{ github.event.issue.number }}

Missing parameter for recent retry update

I'm going to open a PR to fix this ASAP

 buildtest-framework/README.rst 
 ------------------------------
Traceback (most recent call last):
  File "/check.py", line 99, in <module>
    check_results = check_repo(file_paths, print_all, white_listed_urls,
  File "/check.py", line 71, in check_repo
    check_results = urlproc.check_urls(file, urls, retry_count, timeout)
  File "/core/urlproc.py", line 99, in check_urls
    do_retry = check_response_status_code(response, print_format)
TypeError: check_response_status_code() missing 1 required positional argument: 'print_format'

I wish GitHub Actions had some way to test this outside of Actions.

Certificates: nersc.gov, ornl.gov

Hi,

Using this action, I have problems verifying nersc.gov and ornl.gov certificates in a standard ubuntu-20.04 GitHub Actions instance:

HTTPSConnectionPool(host='docs-dev.nersc.gov', port=443): Max retries exceeded with url: /cgpu/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://docs-dev.nersc.gov/cgpu/
HTTPSConnectionPool(host='www.olcf.ornl.gov', port=443): Max retries exceeded with url: /summit/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://www.olcf.ornl.gov/summit/

Googling around, I think we can fix this by also pre-installing certifi into the action:
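A hedged sketch of what that could look like on the requests side (certifi usage assumed as the fix, not confirmed; whether it resolves these particular hosts is untested):

import certifi
import requests

# Verify against certifi's CA bundle explicitly.
response = requests.get("https://docs-dev.nersc.gov/cgpu/",
                        verify=certifi.where(), timeout=5)
print(response.status_code)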

Newlines not being stripped

Testing release 0.1.6: even though there is a strip (which in manual testing does remove a newline), it appears that a newline character remains:

[screenshot: output showing the trailing newline]
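Purely a guess at the cause, but the classic way a strip() appears to do nothing is when its return value is discarded:

url = "https://example.com\n"
url.strip()        # no effect: str.strip() returns a new string
print(repr(url))   # 'https://example.com\n'

url = url.strip()  # the result has to be reassigned
print(repr(url))   # 'https://example.com'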

Failed urls print should be visually separate

Currently, when we print the list of URLs that don't pass, it gets lost in the same block as the last check. Here is an example:

[screenshot: failed URLs blending into the preceding check output]

We should have a newline (and possibly a header that stands out) for this section.
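A minimal sketch of the suggested formatting (names illustrative):

failed_urls = ["https://example.com/broken", "https://example.org/missing"]

# Blank line plus a distinct header before the failure block.
print("\n" + 30 * "-")
print("FAILED URLS")
print(30 * "-")
for url in failed_urls:
    print(url)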

Improving tests

Consider adding tests that include e-mail addresses - or maybe you already made them somewhere and I missed them. As you know, there are corner cases when using regular expressions, so it is good to find them :)

Maybe take a look at this repo and its tests, although it is not bulletproof either: lipoja/URLExtract#13.
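For example (the pattern below is illustrative only, not the project's), a loose URL regex can swallow the domain of an email address, which is exactly the corner case worth testing:

import re

pattern = re.compile(r"[\w.-]+\.[a-z]{2,}\S*")
print(pattern.findall("write to maintainer@example.com or see https://example.org"))
# ['example.com', 'example.org'] -- the first hit comes from the email address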

NOTICE: deprecated 0.2.x and 0.1.x versions!

Today we are deprecating the following versions:

  • 0.2.31
  • 0.2.3
  • 0.2.2
  • 0.2.1
  • 0.2.0
  • 0.1.9
  • 0.1.8
  • 0.1.7
  • 0.1.6
  • 0.1.5
  • 0.1.4
  • 0.1.3
  • 0.1.2

The reason is issue #88: we changed the versioning to match urlchecker-python (meaning the latest is now 0.0.27, and the previous 0.2.x and 0.1.x are older), and I did not realize dependabot would go around suggesting an "update" to an older version. I've manually searched for repos using urlchecker-action to update their versions, but I apologize if I missed yours! If I could go back in time, knowing what dependabot does, I would not have changed the versioning scheme; but since it's already done (and for a few releases), I'm doing my best to make it right. I apologize if you've come here wondering why your previous version was removed - I'm aware this isn't good practice, and I typically wouldn't do it if dependabot weren't actively pushing the deprecated versions.

So please update to 0.0.27 for a 7x speedup and to ensure your workflows do not break! To be clear, 0.0.27 is actually the newest release: https://github.com/urlstechie/urlchecker-action/releases/tag/0.0.27

variable substitution in url field

Hello,

We have a PR in progress in EasyBuild; see easybuilders/easybuild#591. One of the issues we have is that URL-checker can't handle variable substitution. We have configuration files (.eb) that build the http url from a few parameters, including the version, to fetch the tarball for compilation.

Is it possible to fix this issue, or is the only workaround to ignore these files using the white list? Can we white list by extension instead of by each filename?
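To illustrate the problem (an EasyBuild-style template; the exact file is hypothetical), the literal string in the .eb file is not a fetchable URL until the variables are substituted:

template = "https://ftp.gnu.org/gnu/gcc/gcc-%(version)s/gcc-%(version)s.tar.gz"
print(template % {"version": "9.3.0"})
# https://ftp.gnu.org/gnu/gcc/gcc-9.3.0/gcc-9.3.0.tar.gz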

dependabot is trying to update from 0.0.27 to 0.2.31 after version scheme change

Take a look at berlin-hack-and-tell/bhnt.c-base.org#320 and you'll see @dependabot trying to merge the update from 0.0.27 to 0.2.31.

It is confused by the version scheme change.

Since you cannot remove "old" versions, I suggest changing the version scheme again. Since you want to match urlchecker, you could use major.minor versions only, e.g. 0.0.27 becomes 0.27.

That will fix the issue since 0.27 > 0.2.31 > 0.0.27.
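A quick sanity check of that ordering under PEP 440 (assumes the packaging library; whether dependabot compares versions exactly this way is not verified here):

from packaging.version import Version

print(Version("0.27") > Version("0.2.31") > Version("0.0.27"))  # True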

Add logo for the action?

This is my suggestion for the action logo. We cannot set it in the marketplace, but it is still a nice addition imo.
[image: proposed action logo]

url checker report in Github

Just a thought: the URL checker is a nice feature, but to be honest it is not practical to go into the Action's output to see the error report. It would be nice to have a bot report all the broken links as part of the checks in the PR. Code reviewers and authors would be keen on getting this information in their PR so they can fix their code.

In this case, the bot would report only failures; if there are none, it could post a custom message to your liking, e.g. "SUCCESS: all URL links are valid!"

Once this is implemented, I am sure it would be very useful for lots of folks.
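A rough sketch of what the bot step could do (the REST endpoint is GitHub's real issues-comment API; failed_urls, PR_NUMBER, and GITHUB_TOKEN are assumed to be provided by the workflow):

import os
import requests

failed_urls = ["https://example.com/broken"]  # assume the checker collected these

body = ("SUCCESS: all URL links are valid!" if not failed_urls else
        "The following URLs are broken:\n" + "\n".join("- " + u for u in failed_urls))

requests.post(
    "https://api.github.com/repos/%s/issues/%s/comments"
    % (os.environ["GITHUB_REPOSITORY"], os.environ["PR_NUMBER"]),
    headers={"Authorization": "Bearer " + os.environ["GITHUB_TOKEN"]},
    json={"body": body},
    timeout=10,
)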

Ping me when there is release 0.1.7

hey @SuperKogito, I saw there was lots of good work today! Could you ping me on this issue when the latest two commits are released? I can then test the released version on the repos where it's needed. Then we can chat more about creating a module, if you are still interested.

Retry if failed parameters?

hey @SuperKogito! We are using your action for buildtest, and I'd like to also test it for US-RSE (USRSE/usrse.github.io#171), where we've been using html-proofer. The issue with proofer is that it has no understanding / implementation of a retry - many of the links in our static site are old HPC documentation or other servers that don't always respond reliably, and might need a retry with exponential backoff up to some maximum number of attempts. Is this something we could look into adding here? I could definitely open a PR if you want to discuss how to go about it (and which variables to expose to the user).
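A minimal sketch of what retry with exponential backoff could look like (parameter names are illustrative, not the action's actual inputs):

import time
import requests

def check_with_retry(url, retry_count=3, timeout=5):
    """Return True once the url responds OK, retrying with exponential backoff."""
    for attempt in range(retry_count):
        try:
            if requests.get(url, timeout=timeout).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    return False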

Next release should use tagged container version

We've been doing releases with the latest tag, but since this is a moving target and the available features are starting to change, I want to transition to associating releases here with a particular tag/release of urlchecker. Once the next round of tests goes in (to address adding more tests), we can do a release associated with 0.0.19.
