devongovett / regexgen Goto Github PK

View Code? Open in Web Editor NEW

3.3K 54.0 100.0 30 KB

Generate regular expressions that match a set of strings

Home Page: https://runkit.com/npm/regexgen

JavaScript 100.00%

regexgen's Introduction

regexgen

Generates regular expressions that match a set of strings.

Installation

regexgen can be installed using npm:

npm install regexgen

Example

The simplest use is to simply pass an array of strings to regexgen:

const regexgen = require('regexgen');

regexgen(['foobar', 'foobaz', 'foozap', 'fooza']); // => /foo(?:zap?|ba[rz])/

You can also use the Trie class directly:

const {Trie} = require('regexgen');

let t = new Trie;
t.add('foobar');
t.add('foobaz');

t.toRegExp(); // => /fooba[rz]/

CLI

regexgen also has a simple CLI to generate regexes using inputs from the command line.

$ regexgen
Usage: regexgen [-gimuy] string1 string2 string3...

The optional first parameter is the flags to add to the regex (e.g. -i for a case insensitive match).

ES2015 and Unicode

By default regexgen will output a standard JavaScript regular expression, with Unicode codepoints converted into UCS-2 surrogate pairs.

If desired, you can request an ES2015-compatible Unicode regular expression by supplying the -u flag, which results in those codepoints being retained.

$ regexgen 👩 👩‍💻 👩🏻‍💻 👩🏼‍💻 👩🏽‍💻 👩🏾‍💻 👩🏿‍💻
/\uD83D\uDC69(?:(?:\uD83C[\uDFFB-\uDFFF])?\u200D\uD83D\uDCBB)?/

$ regexgen -u 👩 👩‍💻 👩🏻‍💻 👩🏼‍💻 👩🏽‍💻 👩🏾‍💻 👩🏿‍💻
/\u{1F469}(?:[\u{1F3FB}-\u{1F3FF}]?\u200D\u{1F4BB})?/u

Such regular expressions are compatible with current versions of Node, as well as the latest browsers, and may be more transferrable to other languages.

How does it work?

Generate a Trie containing all of the input strings. This is a tree structure where each edge represents a single character. This removes redundancies at the start of the strings, but common branches further down are not merged.
A trie can be seen as a tree-shaped deterministic finite automaton (DFA), so DFA algorithms can be applied. In this case, we apply Hopcroft's DFA minimization algorithm to merge the nondistinguishable states.
Convert the resulting minimized DFA to a regular expression. This is done using Brzozowski's algebraic method, which is quite elegant. It expresses the DFA as a system of equations which can be solved for a resulting regex. Along the way, some additional optimizations are made, such as hoisting common substrings out of an alternation, and using character class ranges. This produces an an Abstract Syntax Tree (AST) for the regex, which is then converted to a string and compiled to a JavaScript RegExp object.

License

MIT

regexgen's People

Contributors

Stargazers

Watchers

Forkers

paulvanbrenk cuulee lynsei kikobeats backupmanager hhy5277 fatelei hapiman zhongshan1993 ndubey ygt-abooth longjohncoder 78juli bradparks mathiasbynens amoulkaf mengyou658 fewiki techscientist lymaple mitbots vijaynitrr giserh r0rs12ach traveler817 ghostlyman mikewlange haint jusss eminarcissus zachlungu rodrijuarez x-quad rlugojr pandahisham avatarxy wangyue7477 dengbupapapa slalx fengweijp kr-npm bukalov lyneliu chenli1989 rexfordkelly-on-nodejs cash2one kutim forked-repositories isao pedromonad templarundead lionsole huydeerpets kunaldawn es-collection prashanth726 bq1028 hixiaosan nuhmanp module-design nevinhappy pedoc nilportugues ktp-forked-repos asdbaihu jaishankarn 0xflotus hackerapple bonedaddy sringly marsnobel chiwenheng yangguangftlp fengzhibushang1101 rookiemonkey camilabodack cliffordfajardo tjdev7 dubzzz tiancheng-luo crocla darklingpeggys rakhithjk davgit admariner xzlinux martinsedm sts0mrg0 chapering nyhxiaoning david0718 unforswearing kingking888 abnerus prenigma wudeman mwxiaomao marianatorrestorres

regexgen's Issues

How to install it?

SyntaxError: Invalid regular expression: Range out of order in character class

I'm receiving this error when trying run match on a regex pattern built with this library. The regex pattern is generated using emoji text and emoji presentation unicode characters.

Here is the regex pattern: https://raw.githubusercontent.com/milesj/emojibase/v1/packages/emojibase-regex/unicode/index.js (using u flag)

And here is the file that generates the regex: https://github.com/milesj/emojibase/blob/v1/src/generators/generateRegex.js

regexgen bug in minimize() / output depends on order of Trie#addAll input

See mathiasbynens/emoji-test-regex-pattern#1.

There seems to be an issue where regexgen produces incorrect output. The exact output depends on the order of the Trie#addAll input, which seems like a bug in and of itself.

Test case:

const assert = require('assert');

const Trie = require('regexgen').Trie;
const trie = new Trie();

const strings = [
  '\u{1F9D1}\u{1F3FF}\u200D\u2695\uFE0F',
  '\u{1F468}\u{1F3FE}\u200D\u2708\uFE0F',
  '\u{1F468}\u{1F3FF}\u200D\u2708\uFE0F',
  '\u{1F469}\u200D\u2764\u200D\u{1F469}',
  '\u{1F9D1}\u{1F3FF}\u200D\u2695',
  '\u{1F468}\u{1F3FE}\u200D\u2708',
  '\u{1F468}\u{1F3FF}\u200D\u2708',
  '\u{1F468}\u200D\u2708',
  '\u{1F469}\u200D\u2708',
  '\u2708\uFE0F',
  '\u{1F469}',
  '\u2708',
];

const STRINGS_THAT_MUST_BE_INCLUDED = [
  '\u{1F9D1}\u200D\u2708\uFE0F', // 🧑‍✈️ E12.1 pilot
  '\u{1F468}\u200D\u2708\uFE0F', // 👨‍✈️ E4.0 man pilot
  '\u{1F469}\u200D\u2708\uFE0F', // 👩‍✈️ E4.0 woman pilot
];

strings.push(...STRINGS_THAT_MUST_BE_INCLUDED);

// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below, but is wrong in a different way:
//strings.sort();

trie.addAll(strings);
const pattern = trie.toString();
console.log(pattern);

const re = new RegExp(pattern, 'g');
assert(strings.includes('\u{1F9D1}\u200D\u2708\uFE0F')); // 🧑‍✈️ E12.1 pilot
assert(strings.includes('\u{1F468}\u200D\u2708\uFE0F')); // 👨‍✈️ E4.0 man pilot
assert(strings.includes('\u{1F469}\u200D\u2708\uFE0F')); // 👩‍✈️ E4.0 woman pilot
const text = `
\u{1F9D1}\u200D\u2708\uFE0F  # E12.1 pilot
\u{1F468}\u200D\u2708\uFE0F  # E4.0 man pilot
\u{1F469}\u200D\u2708\uFE0F  # E4.0 woman pilot
`;
console.log('3 pilots expected', text.match(re));
// → expected: 3 pilots expected [ '🧑‍✈️', '👨‍✈️', '👩‍✈️' ]
// → actual: 3 pilots expected [ '🧑‍✈️', '👨‍✈️', '👩', '✈️' ]

Uncomment the strings.sort() line results in a different pattern (which appears to work correctly for this particular case, but it actually still doesn't match all expected strings). Here are the patterns regexgen generates for both cases:

# Without sort(), not working correctly:
\uD83D\uDC69(?:\u200D\u2764\u200D\uD83D\uDC69)?|\uD83E\uDDD1(?:\uD83C\uDFFF\u200D\u2695\uFE0F?|\u200D\u2708\uFE0F)|(?:\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D|\uD83D\uDC69\u200D)?\u2708\uFE0F?
# With sort(), seemingly working correctly:
\uD83D\uDC69\u200D\u2764\u200D\uD83D\uDC69|(?:\uD83E\uDDD1\uD83C\uDFFF\u200D\u2695|\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D\u2708|\uD83D\uDC69\u200D\u2708|\u2708)\uFE0F?|\uD83E\uDDD1\u200D\u2708\uFE0F|\uD83D\uDC69

I think there's a bug in regexgen:

it should produce the same pattern regardless of the sort order of the input strings
it should correctly match all of the input strings

[question] revert the process

Do you think that could be more or less easy to exactly the invert operation?

For example, return an array of all variations:

r[ae]d → ['red', 'rad']

ES6 Escaping Support

The underlying jsesc library accepts an es6 parameter, which causes it to output ES6-compatible escapes.

Exposing this option would be useful for, among other things, making the output of regexen more applicable to languages whose backing string format isn’t UTF-16!

Repeating the same word in a string may produce a wrong result

Can the input strings contain multiple words? I was presuming that they can, so I started trying it out.

This seems ok.

./cli.js "hi" "hi friend"
/hi(?: friend)*/

But this does not.

./cli.js "hey" "hi" "hi friend"
/h(?:i|ey)/

How to generate regex for invalid strings

Seeking answer to play around the code.

Example
this returns regex that validates these strings to be true
regexgen(['foobar', 'foobaz', 'foozap', 'fooza']);

How can I create regex that takes invalid strings and gives me a regex that will return true for foozap and false for fooza
regexgen([ { string: 'foobar', validity: true }, { string: 'foobaz', validity: true }, { string: 'foozap', validity: true }, { string: 'fooza', validity: false } ]);

Sub-optimal regex containing unnecessary repetitions

While building a regex for the various possible formats of Creative Commons' Public Domain Mark (to assist in spdx/license-list-XML#988), I noticed that regexgen produces a more complex regex than what the input requires.

Here's what I provided:

> regexgen([
  "This work is free of known copyright restrictions.",
  "This work (WWW) is free of known copyright restrictions.",
  "This work (by AAA) is free of known copyright restrictions.",
  "This work, identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA) is free of known copyright restrictions.",
  "This work (WWW), identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions.",
  "This work (by AAA), identified by CCC, is free of known copyright restrictions."
]);

The result was:

/This work(?: (?:\((?:WWW(?:, by AAA)?\)(?:, identified by CCC,)?|by AAA\)(?:, identified by CCC,)?) )?|, identified by CCC, )is free of known copyright restrictions\./

Debuggex screenshot:

A regex produced by hand to match the same input shows that this could be simplified:

/This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\./

Debuggex diagram:

Generated pattern doesn’t match one of the input words in some cases

Is regexgen(arrayOfWords) supposed to match any word in arrayOfWords as a single unit?

The reproduction below demonstrates a case where this expectation does not hold true.

const Trie = require('regexgen').Trie;

// To aid debugging.
const jsesc = require('jsesc');
const esc = (value) => jsesc(value, { wrap: true, es6: false });

const testString = '\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F';
//                  \u{1F3CA}   \u{1F3FD}   \u200D\u2640\uFE0F

const sequences = [
	// Removing any of these three avoids the issue:
	'\uD83C\uDDF7\uD83C\uDDFC',
	'\uD83C\uDDF8\uD83C\uDDE6',
	'\uD83C\uDFCA\uD83C\uDFFD',

	// This is the string we’re testing (`=== testString`).
	'\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F'
];

const trie = new Trie();
for (const sequence of sequences) {
	trie.add(sequence);
}
const sequenceRegExp = trie.toRegExp();

console.log(sequenceRegExp);
// /\uD83C\uDFCA\uD83C\uDFFD|\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F|\uD83C\uDDF8\uD83C\uDDE6|\uD83C\uDDF7\uD83C\uDDFC/

console.log(`Expected: ${ esc([testString]) }`);
console.log(`Actual:   ${ esc(testString.match(sequenceRegExp)) }`);

// Expected: ['\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F']
// Actual:   ['\uD83C\uDFCA\uD83C\uDFFD']

The relevant part of the generated regex is at the start:

\uD83C\uDFCA\uD83C\uDFFD|\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F

Note that this is of the form:

a|ab

Having output of the form

a(?:b)?

…would fix it.

(This blocks mathiasbynens/emoji-regex#13.)

Wrong regular expressions with regard to unicode codepoints

Hi, even though I doubt that you will provide bug fixes and updates after two years of inactivity, I want you to know about the following issues with escaped and unescaped unicode codepoints:

// 1. correct
Input : "I ♥ cake"
Output: /I \u2665 cake/
Proof : "I ♥ cake".match(/I \u2665 cake/)
Result: Array [ "I ♥ cake" ] // OK

// 2. correct
Input : "I \\u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \\u2665 cake".match(/I \\u2665 cake/)
Result: Array [ "I \\u2665 cake" ] // OK

// 3. failure
Input : "I \u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \u2665 cake".match(/I \\u2665 cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 4. failure
Input : "I \u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 5. failure
Input : "I \\u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \\u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \\u\{2665\} cake/

Is there any chance for you to fix these issues? Thanks in advance.

Please mininify

So we can load it via cdn and use it in webworker

Use original character instead of id with unicode prefix

const regexgen = require('regexgen');

regexgen(['重大持股人', '重大持股方', '持股集团', '人民英雄']);

Generates /\u91CD\u5927\u6301\u80A1[\u4EBA\u65B9]|\u4EBA\u6C11\u82F1\u96C4|\u6301\u80A1\u96C6\u56E2/ which is difficult to inspect and debug, hope it can use original CJK character instead of u[0-9A-F]{4}. Thank you.

Different output result with different word order

Consider returning a string representing the pattern instead of a RegExp object

I’m assuming this project’s main usage scenario is in build scripts (rather than using it dynamically at run-time), just like Regenerate. Is this correct?

If so, how about returning a string representing the pattern instead of a RegExp object? This seems more useful:

Such a string can easily be turned into a RegExp object by calling new RegExp(string). (But this is only useful at run-time, which I think is not the main use case of this project.)

Strings compose more easily. Consider a build script that does something like…

const output = `/(?:${ regexgen(words) })+(${ regexgen(someOtherWords) })/g`;

Instead, currently you’d have to call RegExp#toString (which is surprisingly buggy) on each of regexgen’s return values and remove the first and last / as well as the flags, if any, or use .source.

You could then also remove the logic for supporting flags, as the user would just deal with that themselves.

Electron app and a thank you

This is not an issue, I just wanted to thank you for making this package. 👍

I wanted to make it even simpler to use, so I've made it into an electron app.
It's available here.

Changelog

Hi!

I'm using this library in my project and it works perfectly! Thank you!

Just wanted to tell that it would be really great if there was a small Changelog with each release so when people update they know what's in it, perhaps something to be aware of etc.

Q: Negative look aheads

Is it possible to use negative lookaheads or lookaheads in general?

Empty input string throws TypeError

$ regexgen ""
[...]\regexgen\src\minimize.js:72
    for (let [c, old] of first.transitions) {
                              ^

TypeError: Cannot read property 'transitions' of undefined
    at minimize ([...]\regexgen\src\minimize.js:72:31)
    at Trie.minimize ([...]\regexgen\src\trie.js:45:12)
    at Trie.toString ([...]\regexgen\src\trie.js:54:25)
    at Trie.toRegExp ([...]\regexgen\src\trie.js:63:28)
    at regexgen ([...]\regexgen\index.js:12:15)
    at Object.<anonymous> ([...]\regexgen\bin\cli.js:16:13)
    at Module._compile (module.js:570:32)
    at Object.Module._extensions..js (module.js:579:10)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)

FYI: Rust port of regexgen with additional functionality

Hi @devongovett, hi everyone who is reading this. I hope it is okay to open this issue as GitHub does not provide a better way of communication. If not, then feel free to delete this issue again.

When I discovered regexgen a few months ago, I thought to myself: Wow, this tool has great potential. And a lot of ideas came to my mind that I believed would make this tool even more useful. That's why I came up with grex, a Rust port of regexgen which consists of both a command-line tool and a Rust library.

https://github.com/pemistahl/grex

Compared to regexgen, it additionally provides the detection of non-overlapping repeated substrings and their conversion to {min,max} quantifier notation. The conversion to shorthand character classes such as \d and \w is also possible. I already have other ideas in mind for future versions, such as specyfing test cases that must not match the generated expression.

If you are curious, then please check grex out and let me know what you think. Thanks a lot.

Strange behaviour with Cyrillic regexes

Hello! Used your library via Minta Electron app and noticed some strange lack of optimization when it comes to working with Cyrillic alphabet.

Regexes generated are cumbersome and bulky. For example (I converted unicode codepoints to Russian letters for convenience):

/ня(?:[кн]!|[кн])|Ня(?:[кн]?!|[кн])?/ (well, I know that detecting a substring nya is silly, but it is a perfect test case - short, memorable and permutable)

Which could be minimized to the following:
/[Нн]я(?:[кн][!\?]|[кн]|[!\?])?/

Seems like the generator cannot understand that я (cyrillic ya) in both Uppercase and lowercase strings is the same letter (and it is indeed, which is verified by looking at Unicode code points generated and outputted)
I will be glad to provide more data if you need it.

What does the parameter mean?

$ regexgen
Usage: regexgen [-gimuy] string1 string2 string3...

[-gimuy] mean?

Produce illegal regexes when passed \b in the inputs

Hello,

While running some property based tests against the library, see commit or runkit, I found a bug causing a crash of regexgen when we pass some very precise strings to it.

Indeed if you try to run this code locally (at least with 1.3.0), you'll have a crash at runtime:

regexgen(["\b "," "])
// SyntaxError: Invalid regular expression: /\b? /: Nothing to repeat

Indeed ? cannot be used right after \b but it should be ok with [\b]?