url-regex's Introduction

url-regex

Regular expression for matching URLs

Based on this gist by Diego Perini.

Install

$ npm install url-regex

Usage

const urlRegex = require('url-regex');

urlRegex().test('http://github.com foo bar');
//=> true

urlRegex().test('www.github.com foo bar');
//=> true

urlRegex({exact: true}).test('http://github.com foo bar');
//=> false

urlRegex({exact: true}).test('http://github.com');
//=> true

urlRegex({strict: false}).test('github.com foo bar');
//=> true

urlRegex({exact: true, strict: false}).test('github.com');
//=> true

'foo http://github.com bar //google.com'.match(urlRegex());
//=> ['http://github.com', '//google.com']

API

urlRegex([options])

Returns a RegExp for matching URLs.

options

exact

Type: boolean
Default: false

Only match an exact string. Useful with RegExp#test to check if a string is a URL.

strict

Type: boolean
Default: true

Force URLs to start with a valid protocol or www. If set to false it'll match the TLD against a list of valid TLDs.

License

MIT © Kevin Mårtensson and Diego Perini

url-regex's People

Contributors

bendingbender, joakimbeng, joostverdoorn, kevva, radiovisual, simonvizzini, sindresorhus

url-regex's Issues

Prepack/precompile/regenerate to save bytes

url-regex is 24KB browserified and 7KB minified/gzipped. That's quite a bit for a regex.

This is the strict (default, for most users) compiled regex in 357 bytes:

/(?:(?:(?:[a-z]+:)?\/\/)|www\.)(?:\S+(?::\S*)?@)?(?:localhost|(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])){3}|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[\/?#][^\s"]*)?/gi

The non-strict version actually comes out at 10KB, but perhaps devongovett/regexgen can compress this better: tlds.join('|')

The exact/non-exact versions can still be generated on the fly by repacking the flat regex (shown above) at runtime: new RegExp(regex.source, ...modifications)
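
The repacking step described above could look roughly like the sketch below. The `flat` pattern is a deliberately simplified stand-in for the precompiled regex, and `toExact` is a hypothetical helper name, not part of url-regex.

```javascript
// Simplified stand-in for the flat, precompiled pattern (an assumption for
// brevity; the 357-byte regex shown above would be repacked the same way).
const flat = /(?:[a-z]+:)?\/\/[^\s"]+/gi;

// Hypothetical helper: wrap the source in anchors and drop the global flag,
// so test() checks the whole string and keeps no lastIndex state.
function toExact(regex) {
  return new RegExp(`^(?:${regex.source})$`, regex.flags.replace('g', ''));
}

const exact = toExact(flat);
const wholeString = exact.test('http://github.com');          // whole string is a URL
const withTrailing = exact.test('http://github.com foo bar'); // extra text after the URL
const found = 'foo http://github.com bar //google.com'.match(flat);

console.log(wholeString, withTrailing, found);
```

Only the flat pattern would need to ship in the bundle under this approach; the anchored variant is derived when it is requested.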

Matching URLs in HTML is broken

When the quotes were removed from the regex (#18), it also resulted in really bad matching for URLs with paths in HTML:

'<a href="http://example.com/with-path">example.com with path</a>'.match(urlRegex());
// ['http://example.com/with-path">example.com']

The tests I wrote were insufficient to cover this.

Even before the quotes were removed, the regex had some difficulty extracting URLs with paths from markdown:

'[example.com with path](http://example.com/with-path)'.match(urlRegex());
// ['http://example.com/with-path)']

As the existing tests show, the match does not include the closing parenthesis when the URL has no path.

I think the first issue could be solved by changing the path regex to:

var path = '(?:[/?#][^\\s"]*)?';

This should be ok to do because an unencoded " is not valid in a URL.
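
A small sketch of the difference this makes, assuming a simplified host pattern (`host` below is a stand-in, not the library's real host expression) and comparing a path that accepts any non-whitespace character with the proposed one:

```javascript
// Simplified stand-in for the protocol and host part (an assumption).
const host = '(?:[a-z]+:)?//[a-z0-9.-]+';

// Path that accepts any non-whitespace character, quotes included.
const loosePath = '(?:[/?#]\\S*)?';
// Proposed path: stop at whitespace or at a double quote.
const quoteAwarePath = '(?:[/?#][^\\s"]*)?';

const html = '<a href="http://example.com/with-path">example.com with path</a>';

const looseMatches = html.match(new RegExp(host + loosePath, 'gi'));
const quoteAwareMatches = html.match(new RegExp(host + quoteAwarePath, 'gi'));

console.log(looseMatches);      // ['http://example.com/with-path">example.com']
console.log(quoteAwareMatches); // ['http://example.com/with-path']
```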

The issue with the closing parenthesis being matched in markdown is trickier though and I have no solution for it.

What do you think?

might be useful, scraped from Chromium (Kiwi) browser internal code

Not 100% sure what these do.
"((http|https|file|ftp|ssh)://)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"

"(?:\\b|^)((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[&#x0061...x0041;-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?(?:"

It is essentially low-level Java, but might be useful.

Match URLs that start with 'www.'

First thing I noticed with this regex is that it doesn't match URLs that start with a 'www', and now I'm wondering if there was a reason why this was not included yet. Were there maybe some kind of issues? I've patched the regex to also match URLs that start with www (and also optional digits after the www, for URLs like www2.blah.com), and haven't noticed any issues yet.

So, was there a reason why the regex shouldn't match such URLs? If not then I'll gladly create a pull request for my patch. Let me know, thanks!
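
The patch itself is not shown in this issue, so the following is only a guess at the shape of such a prefix: a protocol (or protocol-relative `//`) or `www` followed by optional digits and a dot. The name `prefix` is hypothetical.

```javascript
// Hypothetical prefix: "scheme://", "//", or "www" plus optional digits and a dot.
const prefix = /^(?:(?:[a-z]+:)?\/\/|www\d*\.)/i;

const samples = ['www.github.com', 'www2.blah.com', 'http://blah.com', 'blah.com'];
const results = samples.map((sample) => prefix.test(sample));

console.log(results); // [true, true, true, false]
```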

Can't build v4 using browserify

Hi,

v4.0.0 doesn't build using browserify in some configurations because browserify doesn't apply transforms to node_modules by default. This means that a fairly typical browserify/babel/uglify chain won't work.

This problem goes away if we add this to the url-regex package.json, so that browserify knows to transpile this module:

    "browserify": {
        "transform": [
            [
                "babelify",
                {
                    "presets": "es2015"
                }
            ]
        ]
    }

Let me know and I can make a pull request for the above.

Export the regex directly

Currently you’re doing:

module.exports = function() {
  return regex;
};

What’s the point? Why not just export the regex directly?

module.exports = regex;
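
One possible reason to keep the function, which the issue does not confirm as the author's motive, is that a regex with the `g` flag carries state in `lastIndex`. A single exported object would share that state between all callers. A minimal sketch with a simplified stand-in pattern:

```javascript
// Simplified stand-in pattern (an assumption; not the library's regex).
const shared = /https?:\/\/\S+/g;

// A single global regex remembers where the previous match ended.
const first = shared.test('http://github.com');  // true, lastIndex moves past the match
const second = shared.test('http://github.com'); // false, the search resumes at lastIndex

// A factory hands every caller a fresh object whose lastIndex is 0.
const createRegex = () => /https?:\/\/\S+/g;
const third = createRegex().test('http://github.com');  // true
const fourth = createRegex().test('http://github.com'); // true

console.log(first, second, third, fourth);
```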

URL match for invalid email address

Hi there!

I am running into an edge case in our application where if a user inputs an incorrect email address, for example, m.me@ff, the library will return m.me as a match.

Is this the expected outcome? Is there any way around this?

Thanks in advance!

False positives and negatives

I've got a couple of surprising matches from my testing. Without www it doesn't match, but if the protocol is 3 characters it does?

> weburl_regex.test('http://google.com')
false
> weburl_regex.test('htp://google.com')
true

The weburl regex here works as expected, though.

Support old browser

Hi,
This version does not work in old browsers. I think the package should include transpiled code. What do you think?

package not transpiled

Please publish a transpiled version of the package. I had to add the following to my webpack.config.js:

  rules: [{
    // Temporary fix, because the package authors have not transpiled it themselves
    test: /index\.js?$/i,
    include: /url-regex/,
    use: ['babel-loader?cacheDirectory']
  }, 

This was needed for my UglifyJsPlugin to complete without errors.

can not run test.js file

After cloning this project, I ran the test.js file and got the error shown in the attached screenshot (Screen Shot 2020-10-13 at 18 28 42).

Please provide any solutions to run it.

Email addresses are returned

Hey there!

We're using this library to pull urls out of a user input string. We're currently running into a problem where it returns email addresses.

'This is an [email protected]'.match(urlRegex({ exact: false, strict: false }));
// returns ['[email protected]']

I think it should return no matches in this case.
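
Until the pattern itself changes, a caller could filter matches after the fact. The sketch below is an assumption-laden illustration: `looseUrl` is a simplified stand-in for a non-strict pattern and `urlsWithoutEmails` is a hypothetical helper. It drops any match that contains `@` or is immediately followed by `@`, which also covers the `m.me@ff` case from the earlier issue.

```javascript
// Simplified stand-in for a non-strict URL pattern (an assumption):
// bare domains with an optional "user@" part in front.
const looseUrl = /(?:\S+@)?[a-z0-9-]+(?:\.[a-z0-9-]+)+/gi;

// Hypothetical helper: keep only matches that do not look like (part of) an email.
function urlsWithoutEmails(text) {
  const results = [];
  for (const match of text.matchAll(looseUrl)) {
    const nextChar = text[match.index + match[0].length];
    const looksLikeEmail = match[0].includes('@') || nextChar === '@';
    if (!looksLikeEmail) {
      results.push(match[0]);
    }
  }
  return results;
}

console.log(urlsWithoutEmails('mail m.me@ff or visit github.com')); // ['github.com']
console.log(urlsWithoutEmails('This is an user@example.com'));      // []
```

The trade-off is that legitimate URLs carrying credentials (`user:pass@host`) would be dropped as well, so this only suits input where such URLs are not expected.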

URL wrapped into quotes

Nice package! But I got strange output for a URL wrapped in double quotes:

> var urlRegex = require('url-regex');
undefined
> '"http://avatars.hosting.net/9f7793.jpg"'.match(urlRegex())
null

Is this expected to be so?

strings with escape characters are parsed differently in node vs browser

The following URL is parsed differently in Node than in the browser. As a result, the string is parsed as intended in the browser but gives a different result in our Jest unit tests.

string: http://\bwww.mywebsite.com/\fwww.evil-website.com/?loadexploit&not_a_threat.exe

browser output: [http://bwww.mywebsite.com//fwww.evil-website.com/?loadexploit&not_a_threat.exe]

node output: [www.mywebsite.com/, www.evil-website.com/?loadexploit&not_a_threat.exe]
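
One possible explanation, which the issue does not confirm, is that the two environments received different strings rather than parsing the same string differently. Written as a JavaScript string literal, `\b` and `\f` become the backspace and form feed characters, and a form feed counts as whitespace for `\s`, which would split the URL. Text typed into a browser input keeps the backslashes as ordinary characters. The sketch below only demonstrates that difference:

```javascript
// In a JavaScript string literal, \b and \f are escape sequences.
const literal = 'http://\bwww.mywebsite.com/\fwww.evil-website.com/';
// String.raw keeps the backslashes as ordinary characters.
const raw = String.raw`http://\bwww.mywebsite.com/\fwww.evil-website.com/`;

console.log(literal.includes('\\')); // false: no backslash survives in the literal
console.log(raw.includes('\\'));     // true
console.log(/\s/.test('\f'));        // true: form feed counts as whitespace
console.log(/\s/.test('\b'));        // false: backspace (U+0008) does not
```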

spaces in matched urls

I think the matched text should not include a space at the beginning:

> "some text  http://192.168.1.1:123 some other text http://asd.it no more".match(a())
[ ' http://192.168.1.1:123', ' http://asd.it' ]
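
A plausible cause, assuming the pattern starts with a whitespace-consuming group such as the `(?:^|\s)` quoted in another issue on this page, is that the boundary character becomes part of the full match. Capturing the URL separately avoids that without changing what is matched. The pattern below is a simplified stand-in:

```javascript
// Simplified stand-in (an assumption): a leading boundary group followed by
// the URL in its own capture group.
const withBoundary = /(?:^|\s)(https?:\/\/\S+)/g;

const text = 'some text  http://192.168.1.1:123 some other text http://asd.it no more';

// String#match with a global regex returns full matches, boundary included.
const fullMatches = text.match(withBoundary);
// Reading capture group 1 from matchAll leaves the whitespace out.
const urls = Array.from(text.matchAll(withBoundary), (match) => match[1]);

console.log(fullMatches); // [' http://192.168.1.1:123', ' http://asd.it']
console.log(urls);        // ['http://192.168.1.1:123', 'http://asd.it']
```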

[VULNERABILITY] Parsing a long String will result in 100% CPU usage and `String.test` will never finish

IMPORTANT UPDATE (8/15/20)

Per my comment below, I have released my own package, url-regex-safe, which resolves this issue, and all (solvable) existing issues and pull requests here in this GitHub repository. The new package has 100% test coverage and is available at https://github.com/niftylettuce/url-regex-safe. It has more sensible defaults as well.


Example:

> require('url-regex')({ strict: false }).test('018137.113.215.4074.138.129.172220.179.206.94180.213.144.175250.45.147.1364868726sgdm6nohQ')

The only way to exit out is to SIGINT.

Parenthesis are included as part of url

Hi!

I'm trying to extract links from a text like this:

const text = "Hello this is my url (https://www.miurl.org/en/test) googbye"
text.match(urlRegex({strict: false})) // ["https://www.uclg.org/en/node/27407)"]

It includes the last parenthesis as part of the link
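
A common workaround, sketched here as a post-processing step rather than a change to the regex, is to strip trailing closing parentheses that have no matching opening parenthesis inside the URL. `trimUnbalancedParens` is a hypothetical helper, not part of url-regex.

```javascript
// Hypothetical helper: remove trailing ")" characters while the URL contains
// more closing than opening parentheses.
function trimUnbalancedParens(url) {
  let result = url;
  while (result.endsWith(')')) {
    const opens = (result.match(/\(/g) || []).length;
    const closes = (result.match(/\)/g) || []).length;
    if (closes <= opens) {
      break; // balanced: the parenthesis belongs to the URL
    }
    result = result.slice(0, -1);
  }
  return result;
}

console.log(trimUnbalancedParens('https://www.miurl.org/en/test)'));
// 'https://www.miurl.org/en/test'
console.log(trimUnbalancedParens('https://en.wikipedia.org/wiki/Foo_(bar)'));
// 'https://en.wikipedia.org/wiki/Foo_(bar)'
```

This is a heuristic: it keeps balanced parentheses, which do occur in real URLs, but it cannot tell whether an unbalanced one was intended.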

Why not... without protocols and without www?

Hi,
Is there a reason for not adding a ? after

return /(?:^|\s)(["'])?(?:(?:(?:(?:https?|ftp|\w):)?\/\/)|(?:www.))

to allow URLs without http(s) and without www?

I tested it and it does not break any other cases. Did you just miss this case, or do you avoid it deliberately?

matching non existent tld

Running urlRegex({strict: false}).test('something github.are foo bar'); returns true, even though "are" isn't on the list of TLDs.
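
One way this can happen, offered as a guess rather than a diagnosis of the library, is prefix matching: if the TLD alternation is not followed by a boundary, a listed TLD such as "ar" matches the first two letters of "are". The list and patterns below are hypothetical stand-ins.

```javascript
// Hypothetical, shortened TLD list and simplified patterns (assumptions).
const tlds = ['ar', 'com', 'org'];
const alternation = tlds.join('|');

// Without a boundary, "ar" matches the start of "are".
const unbounded = new RegExp(`github\\.(?:${alternation})`, 'i');
// With \b, the TLD must end where the word ends.
const bounded = new RegExp(`github\\.(?:${alternation})\\b`, 'i');

const input = 'something github.are foo bar';
console.log(unbounded.test(input));               // true
console.log(bounded.test(input));                 // false
console.log(bounded.test('see github.com now'));  // true
```

Note that `\b` only knows ASCII word characters, so internationalized TLDs would need a different kind of boundary check.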

failing tests

Small thing. The following URLs fail:

http://مثال.إختبار
http://例子.测试
http://उदाहरण.परीक्षा

However, they pass with the latest version of the Regex by dperini:
https://gist.github.com/dperini/729294

duplicate test entry (should be unique test case?)

Line 52 in test.js shows a duplicate test case:

        'ws://foo.ws',
        'ws://foo.ws',

Is this a typo, or was it meant to be two different tests, for example:

        'ws://foo.ws',
        'ws://foo.ws/rainbows',

I just wanted to point this out in case the duplicate was meant to be an important test case that was left out.

failing example case (puts extra space at beginning of result)

input:

'Deep House and Garage can get tricky. I find it hard to create a bass sound that fits the genre, but also sets me apart from other producers. To combat that, I tend to use processing and effects. I used some creative reverbs and delays for this thick Deep House bass in Massive. The tutorial available from http://www.youtube.com/watch?v=luG3FIaEdcA'

expected output:

'http://www.youtube.com/watch?v=luG3FIaEdcA'

actual output:

' http://www.youtube.com/watch?v=luG3FIaEdcA'

4.0.0 also broke Safari

The v4 upgrade that switched to ES2015 broke Safari (it will throw errors indicating that it does not support const in strict mode, then give up). By default browserify+babelify does not transpile dependencies, so url-regex is being included as is. (That also breaks UglifyJS as it doesn't do ES2015 yet.)

Reverting to 3.2 fixes the problem. I don't know if you want to back out of this change to keep supporting Safari, or if you'd rather wait it out (or provide a client transpiled version). If you prefer to wait it out it might be helpful to point the issue out in the README.

Unexpected token: operator (>)

When building an angular 2 app for production using ng build --prod, build fails with this message:

ERROR in vendor.8a7e2466f2b7f474b403.bundle.js from UglifyJs
Unexpected token: operator (>) [/Users/user/Documents/git projects/angular-project/node_modules/url-regex/index.js:4,0][vendor.8a7e2466f2b7f474b403.bundle.js:888,23]

However, the development server serves the page successfully when started with npm start.

Decode error

The presence of ).{% trailing a valid URL (e.g. http://cnn.com/).{%) causes the error:

[PATH]/node_modules/normalize-url/index.js:82
		urlObj.pathname = decodeURI(urlObj.pathname);
		                  ^

URIError: URI malformed
    at decodeURI (native)
    at module.exports ([PATH]/node_modules/normalize-url/index.js:82:21)
    at [PATH]/node_modules/get-urls/index.js:14:10
    at Array.map (native)
    at module.exports ([PATH]/node_modules/get-urls/index.js:13:24)
    at [PATH]/index.js:17:16
    at Array.forEach (native)
    at [PATH]/index.js:15:9
    at [PATH]/node_modules/recursive-readdir/index.js:64:22
    at [PATH]/node_modules/recursive-readdir/index.js:64:22
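
The exception is thrown by `decodeURI`, which rejects a `%` that is not followed by two hex digits. Code that post-processes matched URLs could guard the call; `safeDecodeURI` below is a hypothetical helper shown as a sketch, not something normalize-url or get-urls provides.

```javascript
// Hypothetical helper: fall back to the raw value when decoding fails.
function safeDecodeURI(value) {
  try {
    return decodeURI(value);
  } catch (error) {
    if (error instanceof URIError) {
      return value; // malformed escape sequence: leave the input untouched
    }
    throw error;
  }
}

console.log(safeDecodeURI('/a%20b')); // '/a b'
console.log(safeDecodeURI('/).{%'));  // '/).{%'
```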

Support not requiring tlds, passing in list on your own

  • the tlds library bloats the bundle size of this library by a lot (6kb minified according to bundlephobia), most of which are irrelevant to almost all users of the library, and will increase over time as highlighted in #12
  • it isn't even used at all by default, unless strict is false
  • i'm happy as a client of the library to include a list of TLDs i consider valid, rather than deferring to the list

So I think there ought to be a way for one of two things to happen:

  1. add a separate entry point that doesn't include it, requires you to pass in a set of TLDs if you set strict to false, so that bundle-conscious clients can just include that instead
  2. somehow modify the existing library so that code-splitting-enabled clients can profit from the existing module export without breaking changes, while only loading tlds if necessary. not sure how that would work, I assume the previous suggestion is more realistic
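
The first option might look something like the sketch below. Everything in it is hypothetical: the function name `urlRegexWithTlds`, the `tlds` option, and the simplified host and path patterns are illustrations of the proposal, not an existing url-regex API.

```javascript
// Hypothetical entry point that never imports a TLD list; the caller supplies one.
function urlRegexWithTlds({tlds, exact = false} = {}) {
  if (!Array.isArray(tlds) || tlds.length === 0) {
    throw new TypeError('Expected a non-empty array of TLDs');
  }

  // Escape regex metacharacters in the caller-supplied values.
  const escaped = tlds.map((tld) => tld.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));

  // Simplified stand-ins for the real host and path patterns (assumptions).
  const host = '(?:[a-z0-9-]+\\.)*[a-z0-9-]+';
  // The lookahead stops a listed TLD from matching only a prefix of a longer label.
  const tld = `\\.(?:${escaped.join('|')})(?![a-z0-9-])`;
  const path = '(?:[/?#][^\\s"]*)?';
  const body = `(?:[a-z]+://)?${host}${tld}${path}`;

  return exact ? new RegExp(`^(?:${body})$`, 'i') : new RegExp(body, 'gi');
}

const myTlds = ['com', 'org'];
console.log(urlRegexWithTlds({tlds: myTlds, exact: true}).test('github.com')); // true
console.log(urlRegexWithTlds({tlds: myTlds, exact: true}).test('github.are')); // false
console.log('visit github.com/path and foo.invalid'.match(urlRegexWithTlds({tlds: myTlds})));
// ['github.com/path']
```

Because the list is an argument, a bundler has nothing to include unless the caller imports a TLD package on its own.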
