kevva / url-regex

Regular expression for matching URLs

License: MIT License

JavaScript 97.09% TypeScript 2.91%

regex url http nodejs

url-regex's Introduction

url-regex

Regular expression for matching URLs

Based on this gist by Diego Perini.

Install

$ npm install url-regex

Usage

const urlRegex = require('url-regex');

urlRegex().test('http://github.com foo bar');
//=> true

urlRegex().test('www.github.com foo bar');
//=> true

urlRegex({exact: true}).test('http://github.com foo bar');
//=> false

urlRegex({exact: true}).test('http://github.com');
//=> true

urlRegex({strict: false}).test('github.com foo bar');
//=> true

urlRegex({exact: true, strict: false}).test('github.com');
//=> true

'foo http://github.com bar //google.com'.match(urlRegex());
//=> ['http://github.com', '//google.com']

API

urlRegex([options])

Returns a RegExp for matching URLs.

options

exact

Type: boolean
Default: false

Only match an exact string. Useful with RegExp#test to check if a string is a URL.

strict

Type: boolean
Default: true

Force URLs to start with a valid protocol or www. If set to false it'll match the TLD against a list of valid TLDs.

Related

get-urls - Get all URLs in text
linkify-urls - Linkify URLs in text

License

MIT © Kevin Mårtensson and Diego Perini


url-regex's Issues

Spread operator issue on some browsers

Since the update to release 5.0.0, the spread operator at
https://github.com/kevva/url-regex/blob/master/index.js#L8
causes errors in browsers that don't support it. Node handles it, but the script fails to run in certain older browsers.

Could the spread operator be reverted back to Object.assign since it's only used in 1 place and would work on all browsers then?
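
For illustration, here is a minimal sketch of the two ways to merge defaults. The `defaults` object and the `withDefaults` helper are hypothetical names, and the code at the linked line may look different. Note that Object.assign is itself an ES2015 feature, so pre-ES2015 browsers would still need a polyfill.

```javascript
// Hypothetical defaults; the README documents strict as defaulting to true.
const defaults = {strict: true};

// Object spread is newer syntax and fails to parse where it is unsupported:
//   const merged = {...defaults, ...options};

// Equivalent merge that parses in any ES2015 environment.
function withDefaults(options) {
  return Object.assign({}, defaults, options);
}

console.log(withDefaults({exact: true}));
//=> { strict: true, exact: true }
```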

Prepack/precompile/regenerate to save bytes

url-regex is 24KB browserified and 7KB minified/gzipped. That's quite a bit for a regex.

This is the strict (default, for most users) compiled regex in 357 bytes:

/(?:(?:(?:[a-z]+:)?\/\/)|www\.)(?:\S+(?::\S*)?@)?(?:localhost|(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])){3}|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[\/?#][^\s"]*)?/gi

The non-strict version actually comes out at 10KB, but perhaps devongovett/regexgen can compress this better: tlds.join('|')

The exact/non-exact versions can still be generated on the fly by repacking the flat regex (shown above) at runtime: new RegExp(regex.source, ...modifications)
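
A rough sketch of that repacking idea follows. The `flat` regex is a deliberately simplified stand-in for the precompiled pattern, and `fromFlat` is a hypothetical helper, not part of the package.

```javascript
// Simplified stand-in for the precompiled flat regex (hypothetical).
const flat = /(?:https?:\/\/|www\.)[^\s"]+/gi;

// Hypothetical helper: anchor the flat source for exact matching,
// or reuse it unchanged with the global flag.
function fromFlat(options) {
  const exact = Boolean(options && options.exact);
  return exact
    ? new RegExp('(?:^' + flat.source + '$)', 'i')
    : new RegExp(flat.source, 'ig');
}

console.log(fromFlat({exact: true}).test('http://github.com foo bar'));
//=> false
console.log(fromFlat({exact: true}).test('http://github.com'));
//=> true
console.log('foo http://github.com bar'.match(fromFlat()));
//=> [ 'http://github.com' ]
```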

Credit Diego Perini and respect the license of his work

This package seems to be a tiny, unneeded wrapper around @dperini’s URL regex (https://gist.github.com/dperini/729294) (which was made based on my semi-arbitrary requirements). However, you’re not respecting its MIT license, or even crediting the original author in any way.

To add insult to injury, the README says “MIT © Kevin Mårtensson”.

Matching URLs in HTML is broken

When the quotes were removed from the regex (#18) it also resulted in a really bad matching for URLs with paths in HTML:

'<a href="http://example.com/with-path">example.com with path</a>'.match(urlRegex());
// ['http://example.com/with-path">example.com']

The tests I wrote were insufficient to cover this.

Even before the quotes were removed the regex also has some difficulties extracting URLs with paths from markdown:

'[example.com with path](http://example.com/with-path)'.match(urlRegex());
// ['http://example.com/with-path)']

As you can see by the existing tests it doesn't include the last parenthesis when there's no path in the URL.

I think the first issue could be solved by changing the path regex to:

var path = '(?:[/?#][^\\s"]*)?';

This should be ok to do because an unencoded " is not valid in a URL.

The issue with the closing parenthesis being matched in markdown is trickier though and I have no solution for it.

What do you think?
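
To make the difference concrete, here is a small sketch comparing the two path patterns on the HTML example. The `host` pattern is a simplified stand-in and is not the library's real host regex.

```javascript
// Simplified stand-in for the protocol and host part (hypothetical).
const host = 'https?://[^\\s/?#"]+';

const loosePath = '(?:[/?#][^\\s]*)?';   // path may run through a double quote
const quotedPath = '(?:[/?#][^\\s"]*)?'; // proposed: stop at a double quote

const html = '<a href="http://example.com/with-path">example.com with path</a>';

console.log(html.match(new RegExp(host + loosePath, 'gi')));
//=> [ 'http://example.com/with-path">example.com' ]
console.log(html.match(new RegExp(host + quotedPath, 'gi')));
//=> [ 'http://example.com/with-path' ]
```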

might be useful, scraped from Chromium (Kiwi) browser internal code

not 100% sure what those do.
"((http|https|file|ftp|ssh)://)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"

"(?:\\b|^)((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\\'\$\$\\,\\;\\?\\&\\=]|(?:\\%[&#x0061...x0041;-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\\'\$\$\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?(?:"

It is essentially low-level Java, but it might be useful.

Path parsed is inaccurate

e.g. https://datatracker.ietf.org/rg/cfrg/about/] gets parsed as https://datatracker.ietf.org/rg/cfrg/about/] instead of https://datatracker.ietf.org/rg/cfrg/about/

https://www.ietf.org/rfc/rfc3986.txt

Match URLs that start with 'www.'

First thing I noticed with this regex is that it doesn't match URLs that start with a 'www', and now I'm wondering if there was a reason why this was not included yet. Were there maybe some kind of issues? I've patched the regex to also match URLs that start with www (and also optional digits after the www, for URLs like www2.blah.com), and haven't noticed any issues yet.

So, was there a reason why the regex shouldn't match such URLs? If not then I'll gladly create a pull request for my patch. Let me know, thanks!

news: url is not recognised

One of the examples in https://www.ietf.org/rfc/rfc2396.txt is not recognised:
news:comp.infosystems.www.servers.unix

originally raised at sindresorhus/is-url-superb#5

Avoid matching ) after a slash at end of url

Example:

[and another](https://another.example.com/) and

first asked for in: yakyak/yakyak#578

Can't build v4 using browserify

Hi,

v4.0.0 doesn't build using browserify in some configurations because browserify doesn't apply transforms to node_modules by default. This means that a fairly typical browserify/babel/uglify chain won't work.

This problem goes away if we add this to the url-regex package.json, so that browserify knows to transpile this module:

    "browserify": {
        "transform": [
            [
                "babelify",
                {
                    "presets": "es2015"
                }
            ]
        ]
    }

Let me know and I can make a pull request for the above.

Export the regex directly

Currently you’re doing:

module.exports = function() {
  return regex;
};

What’s the point? Why not just export the regex directly?

module.exports = regex;
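
One possible reason for the factory, offered as an assumption since the issue does not state it: a regex with the g flag stores a lastIndex between calls, so a single shared instance is stateful. The patterns below are simplified stand-ins.

```javascript
// A single shared global regex remembers where its last match ended.
const shared = /https?:\/\/\S+/g;

console.log(shared.test('http://github.com'));
//=> true
console.log(shared.test('http://github.com'));
//=> false (the search resumed from lastIndex, at the end of the string)

// A factory returns a fresh instance, so every call starts at index 0.
const fresh = () => /https?:\/\/\S+/g;

console.log(fresh().test('http://github.com'));
//=> true
console.log(fresh().test('http://github.com'));
//=> true
```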

URL match for invalid email address

Hi there!

I am running into an edge case in our application where if a user inputs an incorrect email address, for example, m.me@ff, the library will return m.me as a match.

Is this the expected outcome? Is there anyway around this?

Thanks in advance!

Don't match URL that contains \s

Example http://6 6.6.6.6

False positives and negatives

I've got a couple of surprising matches from my testing. Without www it doesn't match, but if the protocol is 3 characters it does?

> weburl_regex.test('http://google.com')
false
> weburl_regex.test('htp://google.com')
true

The weburl regex here works as expected, though.

Avoid matching period at the end of url

A URL that ends with a period ("http://google.com.") would match. This isn't a huge deal but could cause some issues.

It might be related to #34.

test function returns true for text that is not a URL ("dhttp://~~~")

const urlRegex = require('url-regex');

urlRegex({exact: true}).test('dhttps://github.com/kevva/url-regex');
//=> true

I expected this to return false, but it returns true. Is that correct?
"dhttps" is not a URL protocol.

url-regex: 4.1.1
node: 6.13.0
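
As a possible workaround, a caller could check the scheme separately after the regex match. This is only a sketch: `hasAllowedProtocol` is a hypothetical helper, and it relies on the global URL class, which newer Node versions and browsers provide but the Node 6 version listed above does not.

```javascript
// Hypothetical post-check: accept a candidate only if its scheme is allowlisted.
function hasAllowedProtocol(candidate, allowed = ['http:', 'https:']) {
  try {
    // The WHATWG URL parser reports the scheme with its trailing colon.
    return allowed.includes(new URL(candidate).protocol);
  } catch (error) {
    // new URL() throws a TypeError for strings it cannot parse.
    return false;
  }
}

console.log(hasAllowedProtocol('dhttps://github.com/kevva/url-regex'));
//=> false
console.log(hasAllowedProtocol('https://github.com/kevva/url-regex'));
//=> true
```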

match(urlRegex({strict:false})) returns urls with TLDs truncated to 2 characters

Here's a runkit example https://runkit.com/593fd473d55e110011acb04c/593fd473d55e110011acb04d

By not being strict, I can understand why this might be the case. Might there be another way to test in a non-strict way but have the full urls in the output array?
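
A possible cause, stated as an assumption because the issue does not confirm it: in a regex alternation the first alternative that matches wins, so a short TLD listed before a longer one that starts with the same letters can cut the match short. The tiny `tlds` list below is hypothetical.

```javascript
// Hypothetical, tiny TLD list; a real list is much longer.
const tlds = ['co', 'com', 'io'];

const unsorted = new RegExp('example\\.(?:' + tlds.join('|') + ')', 'g');
console.log('example.com'.match(unsorted));
//=> [ 'example.co' ]

// Longest first, so 'com' is tried before its prefix 'co'.
const byLength = tlds.slice().sort((a, b) => b.length - a.length);
const sorted = new RegExp('example\\.(?:' + byLength.join('|') + ')', 'g');
console.log('example.com'.match(sorted));
//=> [ 'example.com' ]
```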


Bit.ly link not parsing from html code properly

The following html code when parsed leaves the closing p tag: <p>http://bit.ly/2ePIrDy</p>

Results in: http://bit.ly/2ePIrDy</p>

tld check should be disabled by default

Or preferably just removed.

New TLDs are coming out all the time and it's not feasible to keep the list up to date. Even if you can, people using it won't, and it will lead to annoying websites that don't accept [email protected]...

Support old browser

Hi,
This version does not work in old browsers. I think the package should include transpiled code. How about that?

package not transpiled

Please publish a transpiled version of the package. I had to add the following to my webpack.config.js

  rules: [{
    // Temp fix, because they have not transpiled it themselves
    test: /index\.js?$/i,
    include: /url-regex/,
    use: ['babel-loader?cacheDirectory']
  },

In order for my UglifyJsPlugin to complete without errors...

can not run test.js file

After cloning this project, I cannot run the test.js file.

Please suggest how to run it.

Don't match URLs that end with punctuation marks

Ex: http://facebook.com/text,
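
Until the regex handles this, one workaround is to clean each match afterwards. The helper below is a hypothetical sketch. The set of stripped characters is an assumption, and a URL that legitimately ends in one of them would be shortened too.

```javascript
// Hypothetical clean-up step applied to each match after extraction.
function stripTrailingPunctuation(url) {
  return url.replace(/[.,;:!?]+$/, '');
}

console.log(stripTrailingPunctuation('http://facebook.com/text,'));
//=> http://facebook.com/text
console.log(stripTrailingPunctuation('http://google.com.'));
//=> http://google.com
```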

Email addresses are returned

Hey there!

We're using this library to pull urls out of a user input string. We're currently running into a problem where it returns email addresses.

'This is an [email protected]'.match(urlRegex({ exact: false, strict: false }));
// returns ['[email protected]']

I think it should return no matches in this case.
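
A sketch of one way to filter these out on the caller's side: drop any match that touches an @ character. The `looseHost` regex is a simplified stand-in for a non-strict matcher, `matchWithoutEmails` is a hypothetical helper, and the sample addresses are made up.

```javascript
// Simplified stand-in for a non-strict matcher (not the library's regex).
const looseHost = /[a-z0-9-]+(?:\.[a-z0-9-]+)+/gi;

// Hypothetical filter: skip matches directly preceded or followed by '@'.
function matchWithoutEmails(text) {
  const results = [];
  for (const match of text.matchAll(looseHost)) {
    const before = text[match.index - 1];
    const after = text[match.index + match[0].length];
    if (before !== '@' && after !== '@') {
      results.push(match[0]);
    }
  }
  return results;
}

console.log(matchWithoutEmails('mail user@example.com or visit example.org'));
//=> [ 'example.org' ]
console.log(matchWithoutEmails('m.me@ff'));
//=> []
```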

Shortened URL's are not matched

Shortened urls like goo.gl/l45ry5 are not matched.

URL wrapped into quotes

Nice package! But I got strange output with a URL wrapped in '"':

> var urlRegex = require('url-regex');
undefined
> '"http://avatars.hosting.net/9f7793.jpg"'.match(urlRegex())
null

Is this expected to be so?

strings with escape characters are parsed differently in node vs browser

The following URL is parsed differently in Node and in the browser. As a result, the string is parsed as intended in the browser but gives a different result in our Jest unit tests.

string: http://\bwww.mywebsite.com/\fwww.evil-website.com/?loadexploit&not_a_threat.exe

browser output: [http://bwww.mywebsite.com//fwww.evil-website.com/?loadexploit&not_a_threat.exe]

node output: [www.mywebsite.com/, www.evil-website.com/?loadexploit&not_a_threat.exe]
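
One thing worth checking, as an assumption and not a diagnosis: inside a JavaScript string literal, \b and \f are escape sequences for the backspace and form feed control characters, so the text the regex receives depends on how each environment was given the string. The shortened URL below is only for illustration.

```javascript
// In an ordinary string literal, \b and \f become single control characters.
const literal = 'http://\bwww.mywebsite.com/\fwww.evil-website.com/';

// String.raw keeps the backslashes exactly as typed.
const raw = String.raw`http://\bwww.mywebsite.com/\fwww.evil-website.com/`;

console.log(literal.includes('\\'));
//=> false
console.log(raw.includes('\\'));
//=> true
console.log(/\s/.test(literal));
//=> true (form feed counts as whitespace)
console.log(raw.length - literal.length);
//=> 2
```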

spaces in matched urls

I think the matched text should not include space at the beginning

> "some text  http://192.168.1.1:123 some other text http://asd.it no more".match(a())
[ ' http://192.168.1.1:123', ' http://asd.it' ]

[VULNERABILITY] Parsing a long String will result in 100% CPU usage and `String.test` will never finish

IMPORTANT UPDATE (8/15/20)

Per my comment below, I have released my own package, url-regex-safe, which resolves this issue, and all (solvable) existing issues and pull requests here in this GitHub repository. The new package has 100% test coverage and is available at https://github.com/niftylettuce/url-regex-safe. It has more sensible defaults as well.

Example:

> require('url-regex')({ strict: false }).test('018137.113.215.4074.138.129.172220.179.206.94180.213.144.175250.45.147.1364868726sgdm6nohQ')

The only way to exit out is to SIGINT.

`exact` option

To do an exact match, e.g. if you want to test whether something is a URL.

Like I have here: https://github.com/sindresorhus/ip-regex/blob/ab230fdd415a1a4ff166d34aa4284bcf02c6cdd2/index.js#L8-L9

So I don't have to modify the regex here: https://github.com/sindresorhus/is-url-superb/blob/8d9002e297603c3c9f82d8d6f914f4b9d9c08924/index.js#L3

urls with `@` followed by `.` and number are parsed incorrectly

console.log("https://test.com/@foo.bar1baz".match(require('url-regex')({strict:true})))

[ 'https://test.com/@foo.bar' ]

Parenthesis are included as part of url

Hi!

I'm trying to extract links from a text like this:

const text = "Hello this is my url (https://www.miurl.org/en/test) googbye"
text.match(urlRegex({strict: false})) // ["https://www.uclg.org/en/node/27407)"]

It includes the last parenthesis as part of the link
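
A workaround sketch for callers: strip closing parentheses from the end of a match while they outnumber the opening ones. `trimUnbalancedParens` is a hypothetical helper, and the second URL is a made-up example of a path that legitimately contains parentheses.

```javascript
// Hypothetical clean-up: drop trailing ')' characters that have no matching '('.
function trimUnbalancedParens(url) {
  const count = (text, char) => text.split(char).length - 1;
  let result = url;
  while (result.endsWith(')') && count(result, ')') > count(result, '(')) {
    result = result.slice(0, -1);
  }
  return result;
}

console.log(trimUnbalancedParens('https://www.miurl.org/en/test)'));
//=> https://www.miurl.org/en/test
console.log(trimUnbalancedParens('http://example.com/a_(b)'));
//=> http://example.com/a_(b)
```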

Why not... without protocols and without www?

Hi,
Is there a reason for not adding one ? after

return /(?:^|\s)(["'])?(?:(?:(?:(?:https?|ftp|\w):)?\/\/)|(?:www.))

to allow urls without http(s) and without www?

I tested it and it does not break any other cases. Did you just miss this case, or did you avoid it deliberately?

Regular Expression Denial of Service

Dependency of postcss-color-rebeccapurple fix vulnerability
More info: https://www.npmjs.com/advisories/1550

urls that are enclosed with ' instead of " are not matched correctly

This happens because of the path regex that should be changed to const path = '(?:[/?#][^\\s"\']*)?';

localhost fails

Hello,
This local URL fails:
http://localhost/
The tld is required by the plugin (for example: localhost.com).
Thank you :)
David

Match URLs without whitespace boundaries.

Eg:

'This is a [test](http://www.google.com).'

<a href="https://github.com">GitHub</a>

matching non existent tld

Running urlRegex({strict: false}).test('something github.are foo bar');
returns true even though "are" isn't on the list of TLDs.

failing tests

Small thing. The following URLs fail:

http://مثال.إختبار
http://例子.测试
http://उदाहरण.परीक्षा

However, they pass with the latest version of the Regex by dperini:
https://gist.github.com/dperini/729294

duplicate test entry (should be unique test case?)

Line 52 in test.js shows a duplicate test case:

        'ws://foo.ws',
        'ws://foo.ws',

Is this a typo, or was it meant to be two different tests, for example:

        'ws://foo.ws',
        'ws://foo.ws/rainbows',

I just wanted to point this out in case the duplicate was meant to be an important test case that was left out.

Missing IPv6 support

Looks like ip-regex already provides the patterns:

https://github.com/sindresorhus/ip-regex/blob/605041b6a32ac7cca8b9c827bb9abc34e9336be0/index.js#L9-L20

failing example case (puts extra space at beginning of result)

input:

'Deep House and Garage can get tricky. I find it hard to create a bass sound that fits the genre, but also sets me apart from other producers. To combat that, I tend to use processing and effects. I used some creative reverbs and delays for this thick Deep House bass in Massive. The tutorial available from http://www.youtube.com/watch?v=luG3FIaEdcA'

expected output:

'http://www.youtube.com/watch?v=luG3FIaEdcA'

actual output:

' http://www.youtube.com/watch?v=luG3FIaEdcA'
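
Until the boundary handling is fixed, callers can trim each match. In this sketch `withBoundary` is a simplified stand-in that, like the reported behaviour, consumes the whitespace before the URL. It is not the library's actual regex.

```javascript
// Simplified stand-in whose match includes the preceding whitespace.
const withBoundary = /(?:^|\s)https?:\/\/\S+/g;

const text = 'The tutorial available from http://www.youtube.com/watch?v=luG3FIaEdcA';

const rawMatches = text.match(withBoundary) || [];
console.log(rawMatches);
//=> [ ' http://www.youtube.com/watch?v=luG3FIaEdcA' ]

// Trimming removes the captured boundary whitespace.
const trimmed = rawMatches.map(match => match.trim());
console.log(trimmed);
//=> [ 'http://www.youtube.com/watch?v=luG3FIaEdcA' ]
```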

4.0.0 also broke Safari

The v4 upgrade that switched to ES2015 broke Safari (it will throw errors indicating that it does not support const in strict mode, then give up). By default browserify+babelify does not transpile dependencies, so url-regex is being included as is. (That also breaks UglifyJS as it doesn't do ES2015 yet.)

Reverting to 3.2 fixes the problem. I don't know if you want to back out of this change to keep supporting Safari, or if you'd rather wait it out (or provide a client transpiled version). If you prefer to wait it out it might be helpful to point the issue out in the README.

http://[ipv4 ip address]:[port]/[path] are not correctly recognised

i.e.

> asd = require('url-regexp')
{ validate: [Function], match: [Function] }
> asd.match('http://192.168.1.1:1234/foo')
[]
> asd.match('http://192.168.1.12:1234/foo')
[]
> asd.match('http://123.123.123.123:1234/foo')
[ 'http://123.123.123.12' ]
> wtf?

Unexpected token: operator (>)

When building an angular 2 app for production using ng build --prod, build fails with this message:

ERROR in vendor.8a7e2466f2b7f474b403.bundle.js from UglifyJs
Unexpected token: operator (>) [/Users/user/Documents/git projects/angular-project/node_modules/url-regex/index.js:4,0][vendor.8a7e2466f2b7f474b403.bundle.js:888,23]

However, development server starts page successfully by using npm start.

Decode error

The presence of ).{% trailing a valid URL (e.g. http://cnn.com/).{%) causes the error:

[PATH]/node_modules/normalize-url/index.js:82
		urlObj.pathname = decodeURI(urlObj.pathname);
		                  ^

URIError: URI malformed
    at decodeURI (native)
    at module.exports ([PATH]/node_modules/normalize-url/index.js:82:21)
    at [PATH]/node_modules/get-urls/index.js:14:10
    at Array.map (native)
    at module.exports ([PATH]/node_modules/get-urls/index.js:13:24)
    at [PATH]/index.js:17:16
    at Array.forEach (native)
    at [PATH]/index.js:15:9
    at [PATH]/node_modules/recursive-readdir/index.js:64:22
    at [PATH]/node_modules/recursive-readdir/index.js:64:22
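
The error itself comes from decodeURI, which throws a URIError when a % is not followed by two hex digits. A defensive wrapper could look like the sketch below. `safeDecodeURI` is a hypothetical helper and is not part of normalize-url or get-urls.

```javascript
// Hypothetical wrapper: fall back to the undecoded text when decoding fails.
function safeDecodeURI(text) {
  try {
    return decodeURI(text);
  } catch (error) {
    if (error instanceof URIError) {
      return text;
    }
    throw error; // anything else is unexpected, so rethrow it
  }
}

console.log(safeDecodeURI('/a%20b'));
//=> /a b
console.log(safeDecodeURI('/).{%'));
//=> /).{%
```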

The tlds library bloats the bundle size of this library by a lot (6 KB minified according to Bundlephobia). Most of it is irrelevant to almost all users of the library, and it will grow over time, as highlighted in #12.
It isn't even used by default, only when strict is false.
As a client of the library, I'm happy to include a list of TLDs I consider valid rather than deferring to the bundled list.

So I think there ought to be a way for one of two things to happen:

add a separate entry point that doesn't include it, requires you to pass in a set of TLDs if you set strict to false, so that bundle-conscious clients can just include that instead
somehow modify the existing library so that code-splitting-enabled clients can profit from the existing module export without breaking changes, while only loading tlds if necessary. not sure how that would work, I assume the previous suggestion is more realistic
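
The first suggestion could be sketched roughly as follows. `looseUrlRegex` is a hypothetical entry point and the host pattern is heavily simplified. The point is only that the caller supplies the TLDs, so no list needs to be bundled.

```javascript
// Hypothetical entry point: the caller passes the TLDs it considers valid.
function looseUrlRegex(tlds) {
  if (!Array.isArray(tlds) || tlds.length === 0) {
    throw new TypeError('Expected a non-empty array of TLDs');
  }
  // Escape regex metacharacters in case an entry contains any.
  const escaped = tlds.map(tld => tld.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  // The trailing \b stops a short TLD from matching the start of a longer word.
  return new RegExp('\\b(?:[a-z0-9-]+\\.)+(?:' + escaped.join('|') + ')\\b', 'gi');
}

console.log('see github.com and github.are'.match(looseUrlRegex(['com', 'org'])));
//=> [ 'github.com' ]
```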