Code Monkey home page Code Monkey logo

regexgen's Introduction

regexgen

Generates regular expressions that match a set of strings.

Installation

regexgen can be installed using npm:

npm install regexgen

Example

The simplest use is to simply pass an array of strings to regexgen:

const regexgen = require('regexgen');

regexgen(['foobar', 'foobaz', 'foozap', 'fooza']); // => /foo(?:zap?|ba[rz])/

You can also use the Trie class directly:

const {Trie} = require('regexgen');

let t = new Trie;
t.add('foobar');
t.add('foobaz');

t.toRegExp(); // => /fooba[rz]/

CLI

regexgen also has a simple CLI to generate regexes using inputs from the command line.

$ regexgen
Usage: regexgen [-gimuy] string1 string2 string3...

The optional first parameter is the flags to add to the regex (e.g. -i for a case insensitive match).

ES2015 and Unicode

By default regexgen will output a standard JavaScript regular expression, with Unicode codepoints converted into UCS-2 surrogate pairs.

If desired, you can request an ES2015-compatible Unicode regular expression by supplying the -u flag, which results in those codepoints being retained.

$ regexgen πŸ‘© πŸ‘©β€πŸ’» πŸ‘©πŸ»β€πŸ’» πŸ‘©πŸΌβ€πŸ’» πŸ‘©πŸ½β€πŸ’» πŸ‘©πŸΎβ€πŸ’» πŸ‘©πŸΏβ€πŸ’»
/\uD83D\uDC69(?:(?:\uD83C[\uDFFB-\uDFFF])?\u200D\uD83D\uDCBB)?/

$ regexgen -u πŸ‘© πŸ‘©β€πŸ’» πŸ‘©πŸ»β€πŸ’» πŸ‘©πŸΌβ€πŸ’» πŸ‘©πŸ½β€πŸ’» πŸ‘©πŸΎβ€πŸ’» πŸ‘©πŸΏβ€πŸ’»
/\u{1F469}(?:[\u{1F3FB}-\u{1F3FF}]?\u200D\u{1F4BB})?/u

Such regular expressions are compatible with current versions of Node, as well as the latest browsers, and may be more transferrable to other languages.

How does it work?

  1. Generate a Trie containing all of the input strings. This is a tree structure where each edge represents a single character. This removes redundancies at the start of the strings, but common branches further down are not merged.

  2. A trie can be seen as a tree-shaped deterministic finite automaton (DFA), so DFA algorithms can be applied. In this case, we apply Hopcroft's DFA minimization algorithm to merge the nondistinguishable states.

  3. Convert the resulting minimized DFA to a regular expression. This is done using Brzozowski's algebraic method, which is quite elegant. It expresses the DFA as a system of equations which can be solved for a resulting regex. Along the way, some additional optimizations are made, such as hoisting common substrings out of an alternation, and using character class ranges. This produces an an Abstract Syntax Tree (AST) for the regex, which is then converted to a string and compiled to a JavaScript RegExp object.

License

MIT

regexgen's People

Contributors

devongovett avatar gilmoreorless avatar mathiasbynens avatar ticky avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

regexgen's Issues

SyntaxError: Invalid regular expression: Range out of order in character class

I'm receiving this error when trying run match on a regex pattern built with this library. The regex pattern is generated using emoji text and emoji presentation unicode characters.

Here is the regex pattern: https://raw.githubusercontent.com/milesj/emojibase/v1/packages/emojibase-regex/unicode/index.js (using u flag)

And here is the file that generates the regex: https://github.com/milesj/emojibase/blob/v1/src/generators/generateRegex.js

regexgen bug in minimize() / output depends on order of Trie#addAll input

See mathiasbynens/emoji-test-regex-pattern#1.

There seems to be an issue where regexgen produces incorrect output. The exact output depends on the order of the Trie#addAll input, which seems like a bug in and of itself.

Test case:

const assert = require('assert');

const Trie = require('regexgen').Trie;
const trie = new Trie();

const strings = [
  '\u{1F9D1}\u{1F3FF}\u200D\u2695\uFE0F',
  '\u{1F468}\u{1F3FE}\u200D\u2708\uFE0F',
  '\u{1F468}\u{1F3FF}\u200D\u2708\uFE0F',
  '\u{1F469}\u200D\u2764\u200D\u{1F469}',
  '\u{1F9D1}\u{1F3FF}\u200D\u2695',
  '\u{1F468}\u{1F3FE}\u200D\u2708',
  '\u{1F468}\u{1F3FF}\u200D\u2708',
  '\u{1F468}\u200D\u2708',
  '\u{1F469}\u200D\u2708',
  '\u2708\uFE0F',
  '\u{1F469}',
  '\u2708',
];

const STRINGS_THAT_MUST_BE_INCLUDED = [
  '\u{1F9D1}\u200D\u2708\uFE0F', // πŸ§‘β€βœˆοΈ E12.1 pilot
  '\u{1F468}\u200D\u2708\uFE0F', // πŸ‘¨β€βœˆοΈ E4.0 man pilot
  '\u{1F469}\u200D\u2708\uFE0F', // πŸ‘©β€βœˆοΈ E4.0 woman pilot
];

strings.push(...STRINGS_THAT_MUST_BE_INCLUDED);

// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below, but is wrong in a different way:
//strings.sort();

trie.addAll(strings);
const pattern = trie.toString();
console.log(pattern);

const re = new RegExp(pattern, 'g');
assert(strings.includes('\u{1F9D1}\u200D\u2708\uFE0F')); // πŸ§‘β€βœˆοΈ E12.1 pilot
assert(strings.includes('\u{1F468}\u200D\u2708\uFE0F')); // πŸ‘¨β€βœˆοΈ E4.0 man pilot
assert(strings.includes('\u{1F469}\u200D\u2708\uFE0F')); // πŸ‘©β€βœˆοΈ E4.0 woman pilot
const text = `
\u{1F9D1}\u200D\u2708\uFE0F  # E12.1 pilot
\u{1F468}\u200D\u2708\uFE0F  # E4.0 man pilot
\u{1F469}\u200D\u2708\uFE0F  # E4.0 woman pilot
`;
console.log('3 pilots expected', text.match(re));
// β†’ expected: 3 pilots expected [ 'πŸ§‘β€βœˆοΈ', 'πŸ‘¨β€βœˆοΈ', 'πŸ‘©β€βœˆοΈ' ]
// β†’ actual: 3 pilots expected [ 'πŸ§‘β€βœˆοΈ', 'πŸ‘¨β€βœˆοΈ', 'πŸ‘©', '✈️' ]

Uncomment the strings.sort() line results in a different pattern (which appears to work correctly for this particular case, but it actually still doesn't match all expected strings). Here are the patterns regexgen generates for both cases:

# Without sort(), not working correctly:
\uD83D\uDC69(?:\u200D\u2764\u200D\uD83D\uDC69)?|\uD83E\uDDD1(?:\uD83C\uDFFF\u200D\u2695\uFE0F?|\u200D\u2708\uFE0F)|(?:\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D|\uD83D\uDC69\u200D)?\u2708\uFE0F?
# With sort(), seemingly working correctly:
\uD83D\uDC69\u200D\u2764\u200D\uD83D\uDC69|(?:\uD83E\uDDD1\uD83C\uDFFF\u200D\u2695|\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D\u2708|\uD83D\uDC69\u200D\u2708|\u2708)\uFE0F?|\uD83E\uDDD1\u200D\u2708\uFE0F|\uD83D\uDC69

I think there's a bug in regexgen:

  • it should produce the same pattern regardless of the sort order of the input strings
  • it should correctly match all of the input strings

[question] revert the process

Do you think that could be more or less easy to exactly the invert operation?

For example, return an array of all variations:

r[ae]d β†’ ['red', 'rad']

ES6 Escaping Support

The underlying jsesc library accepts an es6 parameter, which causes it to output ES6-compatible escapes.

Exposing this option would be useful for, among other things, making the output of regexen more applicable to languages whose backing string format isn’t UTF-16!

How to generate regex for invalid strings

Seeking answer to play around the code.

Example
this returns regex that validates these strings to be true
regexgen(['foobar', 'foobaz', 'foozap', 'fooza']);

How can I create regex that takes invalid strings and gives me a regex that will return true for foozap and false for fooza
regexgen([ { string: 'foobar', validity: true }, { string: 'foobaz', validity: true }, { string: 'foozap', validity: true }, { string: 'fooza', validity: false } ]);

Sub-optimal regex containing unnecessary repetitions

While building a regex for the various possible formats of Creative Commons' Public Domain Mark (to assist in spdx/license-list-XML#988), I noticed that regexgen produces a more complex regex than what the input requires.

Here's what I provided:

> regexgen([
  "This work is free of known copyright restrictions.",
  "This work (WWW) is free of known copyright restrictions.",
  "This work (by AAA) is free of known copyright restrictions.",
  "This work, identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA) is free of known copyright restrictions.",
  "This work (WWW), identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions.",
  "This work (by AAA), identified by CCC, is free of known copyright restrictions."
]);

The result was:

/This work(?: (?:\((?:WWW(?:, by AAA)?\)(?:, identified by CCC,)?|by AAA\)(?:, identified by CCC,)?) )?|, identified by CCC, )is free of known copyright restrictions\./

Debuggex screenshot:

Screenshot 2020-03-10 at 12 14 13

A regex produced by hand to match the same input shows that this could be simplified:

/This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\./

Debuggex diagram:

Screenshot 2020-03-10 at 12 22 50

Generated pattern doesn’t match one of the input words in some cases

Is regexgen(arrayOfWords) supposed to match any word in arrayOfWords as a single unit?

The reproduction below demonstrates a case where this expectation does not hold true.

const Trie = require('regexgen').Trie;

// To aid debugging.
const jsesc = require('jsesc');
const esc = (value) => jsesc(value, { wrap: true, es6: false });

const testString = '\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F';
//                  \u{1F3CA}   \u{1F3FD}   \u200D\u2640\uFE0F

const sequences = [
	// Removing any of these three avoids the issue:
	'\uD83C\uDDF7\uD83C\uDDFC',
	'\uD83C\uDDF8\uD83C\uDDE6',
	'\uD83C\uDFCA\uD83C\uDFFD',

	// This is the string we’re testing (`=== testString`).
	'\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F'
];

const trie = new Trie();
for (const sequence of sequences) {
	trie.add(sequence);
}
const sequenceRegExp = trie.toRegExp();

console.log(sequenceRegExp);
// /\uD83C\uDFCA\uD83C\uDFFD|\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F|\uD83C\uDDF8\uD83C\uDDE6|\uD83C\uDDF7\uD83C\uDDFC/

console.log(`Expected: ${ esc([testString]) }`);
console.log(`Actual:   ${ esc(testString.match(sequenceRegExp)) }`);

// Expected: ['\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F']
// Actual:   ['\uD83C\uDFCA\uD83C\uDFFD']

The relevant part of the generated regex is at the start:

\uD83C\uDFCA\uD83C\uDFFD|\uD83C\uDFCA\uD83C\uDFFD\u200D\u2640\uFE0F

Note that this is of the form:

a|ab

Having output of the form

a(?:b)?

…would fix it.

(This blocks mathiasbynens/emoji-regex#13.)

Wrong regular expressions with regard to unicode codepoints

Hi, even though I doubt that you will provide bug fixes and updates after two years of inactivity, I want you to know about the following issues with escaped and unescaped unicode codepoints:

// 1. correct
Input : "I β™₯ cake"
Output: /I \u2665 cake/
Proof : "I β™₯ cake".match(/I \u2665 cake/)
Result: Array [ "I β™₯ cake" ] // OK

// 2. correct
Input : "I \\u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \\u2665 cake".match(/I \\u2665 cake/)
Result: Array [ "I \\u2665 cake" ] // OK

// 3. failure
Input : "I \u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \u2665 cake".match(/I \\u2665 cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 4. failure
Input : "I \u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 5. failure
Input : "I \\u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \\u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \\u\{2665\} cake/

Is there any chance for you to fix these issues? Thanks in advance.

Use original character instead of id with unicode prefix

const regexgen = require('regexgen');

regexgen(['ι‡ε€§ζŒθ‚‘δΊΊ', 'ι‡ε€§ζŒθ‚‘ζ–Ή', 'ζŒθ‚‘ι›†ε›’', 'δΊΊζ°‘θ‹±ι›„']);

Generates /\u91CD\u5927\u6301\u80A1[\u4EBA\u65B9]|\u4EBA\u6C11\u82F1\u96C4|\u6301\u80A1\u96C6\u56E2/ which is difficult to inspect and debug, hope it can use original CJK character instead of u[0-9A-F]{4}. Thank you.

Consider returning a string representing the pattern instead of a RegExp object

I’m assuming this project’s main usage scenario is in build scripts (rather than using it dynamically at run-time), just like Regenerate. Is this correct?

If so, how about returning a string representing the pattern instead of a RegExp object? This seems more useful:

  • Such a string can easily be turned into a RegExp object by calling new RegExp(string). (But this is only useful at run-time, which I think is not the main use case of this project.)

  • Strings compose more easily. Consider a build script that does something like…

    const output = `/(?:${ regexgen(words) })+(${ regexgen(someOtherWords) })/g`;

Instead, currently you’d have to call RegExp#toString (which is surprisingly buggy) on each of regexgen’s return values and remove the first and last / as well as the flags, if any, or use .source.

You could then also remove the logic for supporting flags, as the user would just deal with that themselves.

Changelog

Hi!

I'm using this library in my project and it works perfectly! Thank you!

Just wanted to tell that it would be really great if there was a small Changelog with each release so when people update they know what's in it, perhaps something to be aware of etc.

Empty input string throws TypeError

$ regexgen ""
[...]\regexgen\src\minimize.js:72
    for (let [c, old] of first.transitions) {
                              ^

TypeError: Cannot read property 'transitions' of undefined
    at minimize ([...]\regexgen\src\minimize.js:72:31)
    at Trie.minimize ([...]\regexgen\src\trie.js:45:12)
    at Trie.toString ([...]\regexgen\src\trie.js:54:25)
    at Trie.toRegExp ([...]\regexgen\src\trie.js:63:28)
    at regexgen ([...]\regexgen\index.js:12:15)
    at Object.<anonymous> ([...]\regexgen\bin\cli.js:16:13)
    at Module._compile (module.js:570:32)
    at Object.Module._extensions..js (module.js:579:10)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)

FYI: Rust port of regexgen with additional functionality

Hi @devongovett, hi everyone who is reading this. I hope it is okay to open this issue as GitHub does not provide a better way of communication. If not, then feel free to delete this issue again.

When I discovered regexgen a few months ago, I thought to myself: Wow, this tool has great potential. And a lot of ideas came to my mind that I believed would make this tool even more useful. That's why I came up with grex, a Rust port of regexgen which consists of both a command-line tool and a Rust library.

https://github.com/pemistahl/grex

Compared to regexgen, it additionally provides the detection of non-overlapping repeated substrings and their conversion to {min,max} quantifier notation. The conversion to shorthand character classes such as \d and \w is also possible. I already have other ideas in mind for future versions, such as specyfing test cases that must not match the generated expression.

If you are curious, then please check grex out and let me know what you think. Thanks a lot.

Strange behaviour with Cyrillic regexes

Hello! Used your library via Minta Electron app and noticed some strange lack of optimization when it comes to working with Cyrillic alphabet.

Regexes generated are cumbersome and bulky. For example (I converted unicode codepoints to Russian letters for convenience):

/ня(?:[кн]!|[кн])|Ня(?:[кн]?!|[кн])?/ (well, I know that detecting a substring nya is silly, but it is a perfect test case - short, memorable and permutable)

Which could be minimized to the following:
/[Нн]я(?:[кн][!\?]|[кн]|[!\?])?/

Seems like the generator cannot understand that я (cyrillic ya) in both Uppercase and lowercase strings is the same letter (and it is indeed, which is verified by looking at Unicode code points generated and outputted)
I will be glad to provide more data if you need it.

Produce illegal regexes when passed \b in the inputs

Hello,

While running some property based tests against the library, see commit or runkit, I found a bug causing a crash of regexgen when we pass some very precise strings to it.

Indeed if you try to run this code locally (at least with 1.3.0), you'll have a crash at runtime:

regexgen(["\b "," "])
// SyntaxError: Invalid regular expression: /\b? /: Nothing to repeat

Indeed ? cannot be used right after \b but it should be ok with [\b]?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.