ARCHIVED: See https://github.com/sre-yield/sre-yield for continuing development.
google / sre_yield
Python module to generate all regular expression matches
License: Apache License 2.0
Output in pytest 5.2.4
sre_yield/tests/test_bigrange.py:34
sre_yield/tests/test_bigrange.py:34: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_all will be ignored
def test_all():
sre_yield/tests/test_fastdivmod.py:139
sre_yield/tests/test_fastdivmod.py:139: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_correctness_big_numbers will be ignored
def test_correctness_big_numbers():
sre_yield/tests/test_fastdivmod.py:168
sre_yield/tests/test_fastdivmod.py:168: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_powersum will be ignored
def test_powersum():
sre_yield/tests/test_slicing.py:52
sre_yield/tests/test_slicing.py:52: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_parity will be ignored
def test_parity():
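The warnings above mean those tests are silently skipped on pytest >= 4. A minimal stdlib-only sketch of the usual migration (the names and data here are illustrative, not the actual sre_yield tests): move the loop that used to drive the yielded checks inside the test body.

```python
# Old style, ignored by pytest >= 4.0:
#
#     def test_all():
#         for case in CASES:
#             yield check_case, case
#
# One simple migration: run the checks in a loop inside the test itself.
CASES = [(1, 2), (10, 11)]  # illustrative data, not the real test inputs

def check_case(x, expected):
    assert x + 1 == expected

def test_all():
    for case in CASES:
        check_case(*case)
```

(pytest's parametrize machinery is the more idiomatic replacement when each case should report separately, but the loop form needs no framework changes.)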
There are lots of regex expanders that provide only one feature, and it is a feature missing from this library: random values.
The result is that other similar codebases, typically not as well built (often with broken or incomplete sre handling that is "good enough" for an MVP), are getting more brainpower invested in them.
No doubt this library could be adapted to this easily, since it provides rather efficient slicing; it would be simple to take a random slice into the sequence to get a random value.
IMO that is worth building into this library, heralding it, and over time improving the performance by providing additional slicers that return a less-random value that is known to be cheaper to obtain.
If the random slicer can be used repeatedly, it can serve as a mechanism for thinning a large result space (#2).
FWIW, I am not suggesting that the larger use case of "fake data" be included in this library. I think there should be many libraries that approach that type of problem. I see the objective as adding to this library the tools they would need to generate fake data values with high performance using an almost complete regex syntax.
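The "random slice" idea can be sketched with the stdlib alone: if an integer index can be mapped to the i-th matching string (which sre_yield's slicing effectively does via repeated divmod), a random value is just a random index. The helper below is an illustration of that approach for the simple charset-repetition case, not sre_yield's API.

```python
import random

def nth_match(charset, length, i):
    """Return the i-th string (lexicographic) of charset repeated `length` times."""
    out = []
    for _ in range(length):
        i, digit = divmod(i, len(charset))
        out.append(charset[digit])
    return "".join(reversed(out))

total = len("abc") ** 2                               # 9 strings match [abc]{2}
value = nth_match("abc", 2, random.randrange(total))  # a uniformly random match
```

Because only one index-to-string conversion runs, the cost is independent of the size of the match space.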
sre_yield.AllStrings(r"\d{2}", max_count=1)
causes:
>>> list(sre_yield.AllStrings(r"\d{2}", max_count=1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "sre_yield/__init__.py", line 581, in AllStrings
return RegexMembershipSequence(
File "sre_yield/__init__.py", line 556, in __init__
self.raw = self.sub_values(pattern)
File "sre_yield/__init__.py", line 417, in sub_values
elements = [self.sub_values(p) for p in parsed]
File "sre_yield/__init__.py", line 417, in <listcomp>
elements = [self.sub_values(p) for p in parsed]
File "sre_yield/__init__.py", line 427, in sub_values
return self.backends[matcher](*arguments)
File "sre_yield/__init__.py", line 378, in max_repeat_values
return RepetitiveSequence(self.sub_values(items), min_count, max_count)
File "sre_yield/__init__.py", line 275, in __init__
if self.offsets[-1][0] > OFFSET_BREAK_THRESHOLD:
File "sre_yield/cachingseq.py", line 33, in __getitem__
raise IndexError()
IndexError
README says:
The re module docs say "Regular expression pattern strings may not contain null bytes" yet this appears to work fine.
But this sentence was removed in Python 3.6 (python/cpython@69ed5b6), and sre_yield
doesn't support older versions.
This will make the \w, \s, and \d categories support Unicode, as well as the dot (.). Properties support (from the regex module rather than re) probably won't be supported, and \b still won't be (it isn't currently either).
This is somewhat dependent on the thinning question in #2.
For a big regex I am seeing the error below. Probably it is because too many combinations are generated; I would need just one of the strings. Probably something like sre_yield.oneString needs to be developed.
list(sre_yield.AllStrings(r'((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}((([0-9a-fA-F]{0,4}:)?(:|[0-9a-fA-F]{0,4}))|(((25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])))(%[\p{N}\p{L}]+)?'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: cannot fit 'int' into an index-sized integer
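The OverflowError comes from the sheer size of the match space: list() and len() need a count that fits in an index-sized integer. A stdlib-only back-of-the-envelope count for just one piece of that pattern shows why the full product cannot:

```python
import sys

# [0-9a-fA-F] contains 10 + 6 + 6 = 22 characters, so [0-9a-fA-F]{0,4}
# matches this many strings on its own:
piece = sum(22 ** i for i in range(5))   # 245411

# The full address pattern multiplies several such factors together;
# even a rough lower bound dwarfs the largest index-sized integer:
rough_lower_bound = piece ** 7           # illustrative, not an exact count
print(rough_lower_bound > sys.maxsize)   # True
```

Iterating lazily (or picking single items by index) sidesteps the limit, since no single integer has to fit the total count.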
Version 1.0 (published 2014-02-14) on PyPI doesn't work with Python 3.
It appears that the code here in 'master' has been updated to do so, although I haven't tried running/testing it in both py2 and py3.
If it does work for both, could someone upload a new version to PyPI?
sre_yield.AllMatches("z([ab]{2})") has a capture group, but sre_yield.AllMatches("([ab]{2})") does not.
I am using sre_yield to expand regexes where the result is sane, and want the ridiculous parts of the regex passed through unexpanded in the results.
I am currently achieving this by pre-processing known ridiculous bits into '~~', and I intend to improve that by subclassing and catching & replacing them dynamically. I think this could be a common need, as a way to allow usage when some cases are too complex for sre_yield to generate all possibilities, even though most of the time it is fine.
sre_yield.AllStrings("x?").get_item(2) doesn't raise IndexError and ends up lost. It goes into divmod_iter(1, 1) -> divmod_iter_basic(1, 1) and never gets out.
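The non-termination is easy to see: divmod(1, 1) == (1, 0), so a digit-extraction loop in base 1 never shrinks its argument. A sketch (not the library's actual code) of a digit iterator that guards against the degenerate base:

```python
def divmod_iter_basic(x, by):
    """Yield the base-`by` digits of x, least significant first (sketch)."""
    if by < 2:
        # divmod(x, 1) == (x, 0) forever -- base 1 can never terminate
        raise ValueError("base must be >= 2")
    if x == 0:
        yield 0
        return
    while x:
        x, digit = divmod(x, by)
        yield digit
```

With the guard, the (1, 1) call above would fail loudly instead of spinning.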
This seems unnecessarily difficult, please leave a comment if it's useful to you (with an example).
Positive and negative lookaheads behave the same. This is as of [d997adf]
>>> x = sre_yield.AllStrings("(?!a)x?")
>>> len(x)
0
>>> x.raw.list_lengths
[([], 0), ({repeat base=1 low=0 high=1}, 2)]
I'm processing ~20,000 patterns, and I would rather not have them parsed/compiled a few times.
So I sre_parse them, and then use sre_compile.compile(p) to create the compiled pattern when needed. re.compile does those two steps anyway - the only difference is whether the compiled regex has the pattern attribute as a string containing the original regex.
The parsed (not compiled) version seems to be more suitable for keeping in memory for longer periods, as its size is more closely related to the string pattern length, while the compiled regex can be 8x the input string size.
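The two-step approach described above can be sketched like this (note that sre_parse and sre_compile are deprecated aliases of re internals since Python 3.11, though they still work):

```python
import sre_parse
import sre_compile

# Step 1: parse once, and keep the compact parsed form around.
parsed = sre_parse.parse(r"a[bc]{1,2}")

# Step 2: compile on demand when an actual matcher is needed.
compiled = sre_compile.compile(parsed)
assert compiled.fullmatch("abc")
# Unlike re.compile(r"..."), the compiled object produced this way does not
# carry the original pattern string.
```

The pattern r"a[bc]{1,2}" is just an example; the same two steps apply to any pattern.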
k=list(sre_yield.AllStrings('[a-zA-Z]\d{7}'))
Is there a way to limit the numbers? Generating every combination takes a lot of time.
It is a nice oddity of slices that they never raise IndexError. I've found two uses of slices that cause one here.
>>> [0, 1, 2, 3][slice(99,-99)]
[]
>>> AllStrings("[abcdef]")[slice(99,-99)]
Traceback (most recent call last):
File "sre_yield/__init__.py", line 137, in __getitem__
result = SlicedSequence(self, slicer=i)
File "sre_yield/__init__.py", line 167, in __init__
self.start, self.stop, self.step = slice_indices(slicer, raw.__len__())
File "sre_yield/__init__.py", line 97, in slice_indices
stop = _adjust_index(stop, size)
File "sre_yield/__init__.py", line 107, in _adjust_index
raise IndexError("Out of range")
IndexError: Out of range
>>> "abcdef"[99::-1]
'fedcba'
>>> AllStrings("[abcdef]")[99::-1]
Traceback (most recent call last):
File "sre_yield/__init__.py", line 140, in __getitem__
result = [item for item in result]
File "sre_yield/__init__.py", line 140, in <listcomp>
result = [item for item in result]
File "sre_yield/__init__.py", line 148, in __iter__
yield self.get_item(i)
File "sre_yield/__init__.py", line 176, in get_item
return self.raw[j]
File "sre_yield/__init__.py", line 144, in __getitem__
return self.get_item(i)
File "sre_yield/__init__.py", line 405, in get_item
return super().get_item(i)
File "sre_yield/__init__.py", line 126, in get_item
return self.raw.get_item(i, d)
File "sre_yield/__init__.py", line 217, in get_item
raise IndexError("Index %d out of bounds" % (i,))
IndexError: Index 6 out of bounds
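The reference behavior being violated comes from slice.indices(), which clamps out-of-range bounds rather than raising; any __getitem__ that wants list-like slicing can normalize through it first. A stdlib-only illustration:

```python
# slice.indices(length) clamps out-of-range bounds instead of raising.
assert slice(99, -99).indices(6) == (6, 0, 1)         # empty range, no error
assert slice(99, None, -1).indices(6) == (5, -1, -1)  # full reverse scan

# So "abcdef"[99::-1] walks indices 5, 4, 3, 2, 1, 0:
assert "abcdef"[99::-1] == "fedcba"
```

Both failing cases above would produce the built-in results if the start/stop values were passed through slice.indices() before any element lookup.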
#32 test case testBenchInputSlow shows repr can be very slow.
It looks like repeat sequence objects are the cause, but #30 fixes it for the PR 32 test cases only, so my gut feeling is that a wrapper concat/combine object uses repr(self.list_lengths), which hits list.__repr__, and that causes a full expansion to occur. I am pretty confident that PR doesn't really solve it - it only handles the most basic cases.
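One conventional fix is a __repr__ that summarizes instead of rendering elements. A minimal sketch (hypothetical class, not sre_yield's internals) of a lazy sequence whose repr stays O(1) regardless of length:

```python
class LazySeq:
    """Sketch: a sequence-like object whose repr never expands its contents."""
    def __init__(self, length):
        self._length = length

    def __len__(self):
        return self._length

    def __repr__(self):
        # Summarize rather than delegating to list.__repr__,
        # which would force every element to be rendered.
        return f"<LazySeq of {self._length} items>"
```

The stdlib reprlib module offers the same idea (size-bounded reprs) for debugging output.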
When the user passes in a charset currently, it's only used for dot. I'm expanding this to use the intersection between the charset passed in and categories like \w, \s, and \d as well, but I don't intend to apply it to literals.
Should it apply to character classes? I'm not sure.
For some, like [^\w], it's pretty clear it should (once Unicode support lands), but others like [a-z_] are already fairly limited.
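The intersection idea can be sketched with plain sets (illustrative only, not sre_yield's internals):

```python
# A user-supplied charset, say ASCII hex digits plus a separator:
user_charset = set("0123456789abcdef:")

# Intersecting it with a category like \d keeps only members of both,
# so \d expands to far fewer values than "all digits":
digits = set("0123456789")
d_within_charset = user_charset & digits   # just '0'..'9' here

# Literals, by contrast, would pass through unchanged even if they
# fall outside the user charset.
```

Character classes like [a-z_] could be intersected the same way, which is exactly the open question above.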
len(sre_yield.AllMatches('\d+')) raises OverflowError: cannot fit 'int' into an index-sized integer.
len(sre_yield.AllMatches('\d+', max_count=19)) raises the same OverflowError, but a lower max_count is ok.
But using slices it gets interesting. len(sre_yield.AllMatches('\d+')[:16]) raises TypeError: 'float' object cannot be interpreted as an integer - same result using AllStrings ([:15] is ok). In this case, using list is a workaround - len(list(sre_yield.AllStrings('\d+')[:16])) == 16.
len(sre_yield.AllMatches('\d+', max_count=1)[:16]) is ok, but len(sre_yield.AllMatches('\d+', max_count=2)[:16]) and higher is not.
Looking at the code, I see a note about using .__len__() instead, and the solution for most cases becomes obvious.
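The .__len__() note refers to a CPython limitation: the len() builtin must return a value that fits in an index-sized (Py_ssize_t) integer, while calling __len__ directly has no such cap. A stdlib-only demonstration:

```python
import sys

class Huge:
    """Toy object reporting a length one past what len() can return."""
    def __len__(self):
        return sys.maxsize + 1

h = Huge()
assert h.__len__() == sys.maxsize + 1   # the direct call is fine
try:
    len(h)
except OverflowError as e:
    print(e)   # cannot fit 'int' into an index-sized integer
```

That is why code which needs the true count of a huge sequence has to call obj.__len__() (or a dedicated count method) rather than len(obj).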
To prevent memory overflow errors, please provide an iterable object so each string can be processed separately.
Example:
Consider the following code:
import sre_yield
matches = list(sre_yield.AllStrings("[a-z]{1,20}"))
The above generates an OverflowError, but if an iterable object were provided then we could prevent this error. The expected outcome would be as follows:
import sre_yield
for combination in sre_yield.getIteratableCombinations("[a-z]{1,20}"):
    print(combination)
In this way we won't need memory for storing huge lists.
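For the simple pattern shown, lazy generation can be sketched with itertools alone (getIteratableCombinations above is the reporter's proposed name, not an existing function; AllStrings objects are themselves iterable, and it is the list() call that forces everything into memory at once):

```python
import itertools
import string

def iter_repeats(charset, lo, hi):
    """Lazily yield every string over charset with length lo..hi (sketch)."""
    for n in range(lo, hi + 1):
        for combo in itertools.product(charset, repeat=n):
            yield "".join(combo)

# Constant memory no matter how large the match space is:
gen = iter_repeats(string.ascii_lowercase, 1, 20)
first = next(gen)   # 'a'
```

Each next() call materializes exactly one string, so the full 26 + 26**2 + ... + 26**20 space never has to exist at once.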
>>> slice(0,1,0)
slice(0, 1, 0)
>>> [0, 1, 2, 3][slice(0,1,0)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: slice step cannot be zero
>>> AllStrings("[abcdef]")[0:1:0]
Traceback (most recent call last):
File "sre_yield/__init__.py", line 137, in __getitem__
result = SlicedSequence(self, slicer=i)
File "sre_yield/__init__.py", line 170, in __init__
self.length = (
ZeroDivisionError: integer division or modulo by zero
IMO max_count in AllStrings/AllMatches should be renamed max_repeat - gradually, of course, by introducing a new kwarg max_repeat and deprecating use of max_count.
Some regexes like ^(foo|bar)$ or ^^^ contain anchors that aren't strictly necessary (since it's fullmatch). It would be nice to accept these and not raise ParseError.
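Until then, a pre-processing workaround can strip redundant outer anchors before handing the pattern over (strip_outer_anchors is a hypothetical helper, not part of sre_yield, and deliberately ignores anchors inside groups and other corner cases):

```python
import re

def strip_outer_anchors(pattern):
    """Drop leading ^ runs and trailing unescaped $ runs (sketch only)."""
    pattern = re.sub(r"^\^+", "", pattern)       # leading ^^^...
    pattern = re.sub(r"(?<!\\)\$+$", "", pattern)  # trailing $, unless escaped
    return pattern

assert strip_outer_anchors("^(foo|bar)$") == "(foo|bar)"
assert strip_outer_anchors("^^^") == ""
assert strip_outer_anchors(r"price\$") == r"price\$"   # escaped $ is kept
```

Anchors appearing mid-pattern (e.g. inside alternations) would still reach the parser, so this only papers over the outermost cases.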
>>> "12345678"[99:-99:-1]
'87654321'
>>> AllStrings("[abcdef]")[99:-99:-1]
[]