laurikari / tre

The approximate regex matching library and agrep command line tool.

License: Other

C 62.09% Python 30.72% Shell 1.47% Makefile 1.18% M4 4.55%

tre's Issues

CVE-2016-8859 (attacker controlled integer overflow in tre_tnfa_run_parallel())

CVE-2016-8859 was assigned for an integer overflow in musl and, apparently, TRE that can potentially allow an attacker to achieve controlled heap corruption:

http://seclists.org/oss-sec/2016/q4/183

The reporter pointed out the fix applied in musl:

http://git.musl-libc.org/cgit/musl/commit/?id=c3edc06d1e1360f3570db9155d6b318ae0d0f0f7

Add install instructions to README

This is a great library, but I'd like to know how to build it from source rather than install it from the Ubuntu repository. Please add build instructions to the README (or to an INSTALL file).

thanks,

simon

errors and hangs for large literal patterns (with workaround)

Hi, we're using the tre library and python bindings on a project. We noticed that tre has problems with large literal patterns in some cases:

    import re
    import unittest

    import tre


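    class TreLargePatternTest(unittest.TestCase):
        def _tre_match(self, extract, truth):
            # Anchor the escaped literal so the whole input must match.
            pattern = '^' + re.escape(truth) + '$'
            matcher = tre.compile(pattern, tre.EXTENDED)
            # Generous error budget; identical inputs should still cost 0.
            fuzzyness = tre.Fuzzyness(maxerr=5000)
            match = matcher.search(extract, fuzzyness)
            return match.cost

        def _verify_match_cost(self, multiplier):
            # Pattern and input are the same run of 'X', so the expected cost is 0.
            truth = 'X' * multiplier
            ocr = 'X' * multiplier
            cost = self._tre_match(ocr, truth)
            self.assertEqual(0, cost)

        def test_good(self):
            # 510 characters: passes.
            self._verify_match_cost(510)

        def test_malloc_error(self):
            # 1462 characters: triggers the malloc error.
            self._verify_match_cost(1462)

        # Disabled because it hangs: 511 characters is enough to trigger it.
        #def test_hang(self):
        #    self._verify_match_cost(511)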

This Python test reproduces the problem. I'm running on Mac OS X, and have compiled using v0.8.0 (via Homebrew) and also from the source on GitHub.

The malloc error seems to be related to this line:

https://github.com/laurikari/tre/blob/master/lib/tre-compile.c#L1874

Setting the 10240 number to a much larger value (for instance 102400) stops the malloc error from happening for our input.

The hang problem seems to be related to these lines:

https://github.com/laurikari/tre/blob/master/lib/tre-match-approx.c#L472-L518

Setting the ringbuffer size (512) to a much larger value (for instance 262144) solves the problem for our input.

We thought about two possible fixes:

Best would be to dynamically allocate these buffers using malloc, increasing them if needed. This might be hard.
Quicker would be to make these two buffers configurable via configure so that the user could set them. If this sounds ok, we'd be willing to submit a pull request for this fix.

I'm wondering if you have thoughts on this?

additional kind of error

Great work! Would it be possible to add transposition (of two adjacent characters) as an additional type of error? AFAIK it is a really common type of error, and it is accounted for by the Damerau–Levenshtein distance.
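
For reference, the kind of distance this request refers to can be sketched in a few lines. This is a minimal illustration of the restricted Damerau–Levenshtein (optimal string alignment) distance, not TRE code; the function name edit_distance and its allow_transpositions parameter are made up for this example.

```python
def edit_distance(a, b, allow_transpositions=True):
    """Edit distance with insert, delete, substitute and, optionally,
    transposition of two adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (allow_transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Swap of two adjacent characters counts as one error.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

With transpositions enabled, "ab" versus "ba" costs one error instead of the two substitutions that plain Levenshtein distance would charge.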

--min-cost or --not-exact option

A --min-cost and/or a --not-exact option would be very useful to find just the "bad data" while omitting the exact matches where cost is zero.

Java bindings for TRE.

I have made Java bindings for the TRE library. See
https://github.com/ahomansikka/javatre

Found a simple expression that fails on OSX, in the latest TRE, but passes in ICU on OSX

This expression succeeds when it should fail:

ret = tre_regcomp(&trx, "^(?:(?:(?:(?:(?:[0-9]){1,4}):){0,4})(?:(?:[0-9]){1,4}))?::$", REG_EXTENDED | REG_NOSUB);
test = "1:2:3:4:5:6::";
ret = tre_regexec(&trx, test, 0, NULL, 0);
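
As a cross-check of the expected result, the same expression can be run through another engine. The sketch below uses Python's re module purely as a reference implementation (it is not TRE, and its syntax only happens to coincide for this pattern); the helper name matches is hypothetical.

```python
import re

# Same expression as in the report: up to five colon-separated groups of
# 1 to 4 digits, followed by "::" at the end of the string.
PATTERN = re.compile(
    r"^(?:(?:(?:(?:(?:[0-9]){1,4}):){0,4})(?:(?:[0-9]){1,4}))?::$"
)

def matches(text):
    """Return True if the whole reported pattern matches text."""
    return PATTERN.search(text) is not None
```

Under this reference engine the six-group input from the report is rejected, while a five-group input such as "1:2:3:4:5::" is accepted.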

import fixes from musl-libc?

musl uses TRE as the regex library in its libc, and its developers have made some fixes that seem worth upstreaming here:
http://git.musl-libc.org/cgit/musl/log/src/regex

Lazy matching using ? not enabled in bracketed subexpressions

This is something I experienced using R, which includes a copy of TRE which dates back to 2009. Sorry if it has been fixed since then.

The idea is that I want to remove the inner tag in the following expression, while keeping the outer one. I do not know beforehand what may appear in the tags after "class=" and "style=".

"ab"

Thus, the expected result is:

"ab"

The expression I tried is the following, and it works with PCRE. With TRE, the first .* is always greedy:

(gsub() matches the first pattern against the third string and replaces them with the second pattern.)

gsub("(?U)(._)", "\1", "ab")
[1] "b"

gsub("(._?)", "\1", "ab")
[1] "b"

// Use PCRE instead of TRE

gsub("(._?)", "\1", "ab", perl=TRUE)
[1] "ab"

Moreover, it looks like the parentheses around the second .* change the result:

gsub("(._?)", "", "ab")
[1] ""

gsub("._?", "", "ab")
[1] "b"

gsub("(?U)(._)", "", "ab")
[1] ""

gsub("(?U)._", "", "ab")
[1] "b"

Fail to build python extension in Windows with MSVC2012

First thanks very much for making such great software available for free!

I've checked out the latest code and built it successfully using the solution files in the win32 directory with MSVC2012. Then I tried to compile the Python extension, but the compiler couldn't locate tre.h:

tre-python.c(16) : fatal error C1083: Cannot open include file: 'tre/tre.h': No such file or directory

I added "../include" to include_dirs in python.py, and the compiler located the header this time, but came up with the following errors:

running install
running build
running build_ext
building 'tre' extension
D:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -DHAVE_CONFIG_H -I../lib -I../win32 -I../include -IC:\Anaconda\include -IC:\Anaconda\PC /Tctre-python.c /Fobuild\temp.win-amd64-2.7\Release\tre-python.obj
tre-python.c
tre-python.c(270) : warning C4028: formal parameter 2 different from declaration
tre-python.c(377) : warning C4133: 'function' : incompatible types - from 'PyObject *' to 'PyUnicodeObject *'
tre-python.c(508) : warning C4133: 'function' : incompatible types - from 'PyUnicodeObject *' to 'PyObject *'
D:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\BIN\amd64\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:C:\Anaconda\libs /LIBPATH:C:\Anaconda\PCbuild\amd64 ../win32/Release\tre.lib /EXPORT:inittre build\temp.win-amd64-2.7\Release\tre-python.obj /OUT:build\lib.win-amd64-2.7\tre.pyd /IMPLIB:build\temp.win-amd64-2.7\Release\tre.lib /MANIFESTFILE:build\temp.win-amd64-2.7\Release\tre.pyd.manifest
tre-python.obj : warning LNK4197: export 'inittre' specified multiple times; using first specification
   Creating library build\temp.win-amd64-2.7\Release\tre.lib and object build\temp.win-amd64-2.7\Release\tre.exp
tre-python.obj : error LNK2019: unresolved external symbol tre_regerror referenced in function _set_tre_err
tre-python.obj : error LNK2019: unresolved external symbol tre_regfree referenced in function newTreMatchObject
tre-python.obj : error LNK2019: unresolved external symbol tre_regncomp referenced in function newTrePatternObject
tre-python.obj : error LNK2019: unresolved external symbol tre_regwncomp referenced in function newTrePatternObject
tre-python.obj : error LNK2019: unresolved external symbol tre_reganexec referenced in function newTreMatchObject
tre-python.obj : error LNK2019: unresolved external symbol tre_regawnexec referenced in function newTreMatchObject
tre-python.obj : error LNK2019: unresolved external symbol tre_regaparams_default referenced in function _set_tre_err
build\lib.win-amd64-2.7\tre.pyd : fatal error LNK1120: 7 unresolved externals
error: command '"D:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\BIN\amd64\link.exe"' failed with exit status 1120

I've built TRE both statically and dynamically. I'm using Python 2.7 and the OS is Windows 7 64bit.

Thanks again!

Question about approximate matching

I am trying to do plain approximate matching of a small string within a given larger string and report whether there is a match, based on the specified maximum error/cost.

I found tre-agrep and the code I wrote based on the docs works fine. Thanks for making this open-source library.

I do not need any regular expression support. Is there a way to optimize the call to int tre_regncomp(regex_t *preg, const char *regex, size_t len, int cflags)? I would like to do this for two reasons: 1. Performance (the code is on the critical path). 2. I take arbitrary input and match against it without escaping the search string (escaping would slow things down further, and I do not know how to do it correctly). Pointers to the functions I should call or modify to accomplish this would be greatly appreciated.

Thanks,
George
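
On the escaping part of the question, one possible approach is to put a backslash in front of every character that has special meaning in POSIX extended regular expressions. The sketch below is an assumption-laden illustration: the metacharacter set and the names escape_ere_literal and round_trips are chosen for this example, and the sanity check runs against Python's re engine rather than TRE.

```python
import re

# Characters with special meaning in POSIX extended regular expressions.
# Assumption: this set is sufficient for the patterns of interest.
ERE_METACHARACTERS = frozenset(".[]()*+?{}|^$\\")

def escape_ere_literal(text):
    """Backslash-escape metacharacters so text is matched literally."""
    return "".join("\\" + ch if ch in ERE_METACHARACTERS else ch for ch in text)

def round_trips(text):
    """Sanity check with Python's re engine (not TRE): the escaped
    pattern should match exactly the original text."""
    return re.fullmatch(escape_ere_literal(text), text) is not None
```

Only metacharacters are escaped, because a backslash in front of an ordinary letter can itself have a special meaning in some regex dialects.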

Trouble with wide characters on Fedora 20

This is in relation to code I am trying to fix on C-ICAP Classify (LGPL, located at https://github.com/treveradams/C-ICAP-Classify/).

I do not speak Bulgarian and many other languages I use with this project, so this may not be unique to Bulgarian.

I need to save alt and title tags. I do this (the alt version only changes the tag part): tre_regwcomp(&title1, L" title=\s*((".*?"|'.*?')|[^'\">\s]+)", REG_EXTENDED | REG_ICASE);

The document which is in wchar_t is then searched using: tre_regwnexec(&title1, ...)

The problem is on Bulgarian language documents, the appropriate matchset is truncated. This doesn't happen on English and it seems to work on many other languages (I only speak two others, so it is a guess).

Package on Fedora 20:
tre-0.8.0-8.fc20.x86_64

This appears to be caused by the engine seeing some character sequences as double quote or single quote instead of the correct character.

[tre-compile.c:1249]: (warning) Logical disjunction always evaluates to true: EXPR >= 1 || EXPR <= 256.

Source code is

      assert(lit->code_max >= 1
         || lit->code_max <= ASSERT_LAST);

Maybe

      assert(lit->code_max >= 1 && lit->code_max <= ASSERT_LAST);

would be better code.
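
The warning is easy to confirm by brute force. The following sketch models the two conditions in Python; ASSERT_LAST is set to 256 only because that is the value shown in the cppcheck message, and the function names are hypothetical. Whether the && form matches the intent of the original assertion is the reporter's suggestion, not something confirmed here.

```python
ASSERT_LAST = 256  # value taken from the cppcheck message above

def check_with_or(code_max):
    """Condition as currently written: true for every integer."""
    return code_max >= 1 or code_max <= ASSERT_LAST

def check_with_and(code_max):
    """Suggested condition: true only inside the range 1..ASSERT_LAST."""
    return code_max >= 1 and code_max <= ASSERT_LAST
```

Every integer is either at least 1 or at most 256, so the || form can never fail, whereas the && form rejects values such as 0 and 257.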

Add option for '$' to match the empty string before the MS Windows default line ending (CR LF = \r\n)

In case of option REG_NEWLINE:
"... The match-end-of-line operator $ matches the empty string immediately before a newline ('\n') as well as the empty string at the end of the string (but see the REG_NOTEOL regexec() flag). ..."

The default line ending on MS Windows systems is two characters: CR LF (\r\n).
It would be nice to have an option (REG_WIN_NEWLINE) so that the match-end-of-line operator $ matches the empty string immediately before the Windows newline (CR LF; \r\n).
In this case, the dot (.) should match neither \r nor \n.

Maybe also an option for systems where the line-end character is CR (\r) only?
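
The effect being described can be illustrated with Python's re module, whose $ in MULTILINE mode also matches only before \n. This is an analogy under that assumption, not TRE behaviour; REG_WIN_NEWLINE is the proposal from this issue rather than an existing flag, and both function names are hypothetical.

```python
import re

def line_contents_lf_only(text):
    """'$' matches just before '\n', so a trailing '\r' stays in the line."""
    return re.findall(r"(?m)^.*$", text)

def line_contents_crlf_aware(text):
    """Emulates the requested behaviour: the line content excludes both
    '\r' and '\n', and the line may end in either LF or CR LF."""
    return re.findall(r"(?m)^[^\r\n]*(?=\r?$)", text)
```

With CR LF input the first helper leaves the carriage return attached to the line, which is the inconvenience the proposed option is meant to remove.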

node package

Hi, it seems like there should be a node package for this library - I'm assuming there is not. I think with "fuzzy regex" it would get wide use for bots.

PyPI package for tre

This is not really an issue, but I made a PyPI package for tre this morning:

https://pypi.python.org/pypi/tre/0.8.0

This is so the python bindings can easily be installed via pip:

$ pip install tre

If you'd like to be on the PyPI maintainers list along with (or instead of) my dev team, let me know and I'll add you.

cheers
adam

--show-position disagrees with --colour

The following command/regex:

$ paste - - - - < in.fastq | agrep --colour --show-position --show-cost -e ((TGGAATTCTCGGGTGC){#5}|(TGGAATTCTCGGGTG){#5}|(TGGAATTCTCGGGT){#4}|(TGGAATTCTCGGG){#4}|(TGGAATTCTCGG){#3}|(TGGAATTCTCG){#3}|(TGGAATTCTC){#2})\t

Results in this output:

0:64-78:@HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1       TTTTATTATGATCCATTTCGCG^[[01;31mTGGAATTCTCGGG    ^[[00m+HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1 gggggggdggeggeegggggggeggddggdgdeeg
0:67-78:@HWI-ST212_0173:2:1101:2933:1959#GCCAAT/1       ATTGACAGACTGAGAGCTCTTTCTT^[[01;31mTGGAATTCTC    ^[[00m+HWI-ST212_0173:2:1101:2933:1959#GCCAAT/1 ggggggggggggggggggggggggggdggfggggg
0:66-78:@HWI-ST212_0173:2:1101:3312:1971#GCCAAT/1       TCTTCAGATCCGGTGGTTGCCGAC^[[01;31mTGGAATTCTCG    ^[[00m+HWI-ST212_0173:2:1101:3312:1971#GCCAAT/1 ceeeeeeeeeeeeYed`dabcbbbcdccd\c^Xcc

For the first line, the colouring is from character 67-80 inclusive, not 64-78 as indicated by the --show-position information at the start of the line. Am I missing something obvious or is this a bug?

Infinite loop when matching patterns with back-references

This leads to an infinite loop in agrep:

echo 'xx' | ./src/agrep '(\0|x)+'

I don't know what a good patch for this would be.

Found using LLVM's LibFuzzer.

MinGW and WChar_t

Hi, I've compiled TRE with MinGW and enabled wchar_t. It works with English words, but when I tested it with Persian text it did not work. Is the problem related to MinGW? How can I fix it?

File libtre.a generated?

Hello,

Forgive me in advance if this is not libtre related. I am trying to install a program that requires the library libtre.a.
I followed the manual to install libtre:

Install libtre library

mkdir /data/mbp15ja/libtre
cd /data/mbp15ja/libtre
wget https://github.com/laurikari/tre/archive/master.zip
unzip master.zip
cd tre-master/
./utils/autogen.sh
mkdir /data/mbp15ja/tre-0.8.0
./configure --prefix /data/mbp15ja/tre-0.8.0
make
make install

Although the file lib/libtre.la seems to be created, the file lib/libtre.a doesn't exist.
Is it possible that this file was created in earlier versions but not in this one? Is it a particular file created by other packages such as libtre-devel?

Thanks in advance!

./configure expects a ChangeLog

Seems like ChangeLog was renamed to ChangeLog.old and that commit expected darcs to create a new ChangeLog file. But the darcs dependency was then removed. So I had to
cp ChangeLog.old ChangeLog
to get ./configure to work

missing install-sh?

stephan@tiny:~/cvs/fossil/cwal/th1ish/tre$ ./configure --prefix=$HOME
configure: error: cannot find install-sh, install.sh, or shtool in utils "."/utils

Lazy Modifier `?` Works Incorrectly in Capture Groups

This is using the library included in R 3.0.2:

text <- "abcdEEEEfg"

sub("c.+?E", "###", text)
# [1] "ab###EEEfg"                          <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg"                           <<< WEIRD
sub("c(.+?)E", "###", text, perl=T)
# [1] "ab###EEEfg"                          <<< OKAY

Notice how in the second example something very odd is happening. The capture is neither greedy nor lazy. It should be lazy and look like the first and third examples, but if it were greedy it would capture one more E than it did.

This is almost certainly related to issue 11, but I am posting it separately as the issue description is not quite the same.

Also, original post on SO for reference with some more details.
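
The results marked OKAY above can be reproduced with another engine. The sketch below uses Python's re module as a reference (it is neither TRE nor R, and the helper name replace_first is hypothetical); it mirrors what sub() does by replacing only the first match.

```python
import re

def replace_first(pattern, text):
    """Replace only the first match, like R's sub()."""
    return re.sub(pattern, "###", text, count=1)
```

In this reference engine the lazy quantifier gives the same result with and without the capture group, which is the behaviour the report expects.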

Exponential running time when chaining bounds

Expressions like the following cause TRE's run-time to be very high:

echo 'x' | agrep 'x?{100}{100}'
echo 'x' | agrep 'x?{5}{5}{5}{5}{5}{5}'

Found using LLVM's LibFuzzer.
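
A rough way to see why chained bounds are expensive is to count how many copies of the inner expression they ask for. The sketch below assumes that each bound {n} repeats whatever precedes it n times, so chained bounds multiply; it says nothing about how TRE handles this internally, and the function name expanded_copies is hypothetical.

```python
def expanded_copies(bounds):
    """Number of copies of the innermost expression requested by a chain
    of bounds, e.g. [100, 100] for 'x?{100}{100}'."""
    total = 1
    for bound in bounds:
        total *= bound
    return total
```

Under this assumption the first example requests 10000 copies of x? and the second 15625, and each additional {5} multiplies the count by five.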

cannot find input file: `doc/Makefile.in'

i@scheherezade:/opt/tre$ ./configure
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking target system type... i686-pc-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking whether gcc and cc understand -c and -o together... yes
checking for an ANSI C-conforming const... yes
checking for inline... inline
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for size_t... yes
checking for working alloca.h... yes
checking for alloca... yes
checking for isascii... yes
checking for isblank... yes
checking getopt.h usability... yes
checking getopt.h presence... yes
checking for getopt.h... yes
checking for getopt_long... yes
checking for libutf8... not needed
checking wchar.h usability... yes
checking wchar.h presence... yes
checking for wchar.h... yes
checking wctype.h usability... yes
checking wctype.h presence... yes
checking for wctype.h... yes
checking for wchar_t... yes
./configure: line 5433: AX_DECL_WCHAR_MAX: command not found
checking for wint_t... yes
checking for mbstate_t... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... 64
checking whether NLS is requested... yes
checking for msgfmt... /usr/bin/msgfmt
checking for gmsgfmt... /usr/bin/msgfmt
checking for xgettext... /usr/bin/xgettext
checking for msgmerge... /usr/bin/msgmerge
checking for ld used by GCC... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for shared library run path origin... /bin/bash: utils/config.rpath: No such file or directory
done
checking for CFPreferencesCopyAppValue... no
checking for CFLocaleCopyCurrent... no
checking for GNU gettext in libc... yes
checking whether to use NLS... yes
checking where the gettext function comes from... libc
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert i686-pc-linux-gnu file names to i686-pc-linux-gnu format... func_convert_file_noop
checking how to convert i686-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... dlltool
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... mt
checking if mt is a manifest tool... no
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... no
configure: creating ./config.status
config.status: creating Makefile
config.status: error: cannot find input file: `doc/Makefile.in'

one more "?" operator bug?

[0-9] repeated n times and [0-9]{n} behave differently when a ? operator comes before them:

a<-c("/Cajon_Criolla_20141024",
 "/Linon_20141115_20141130",
 "/Cat/LIQUID",
 "/c_puertas_20141206_20141107",
 "/C_Puertas_3_20141017_20141018",
 "/c_puertas_navidad_20141204_20141205"
 )
sub("(.*?)([0-9]{8})(.*)$","\\2",a)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*?)([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"

Likely related to bugs #11 and #21. See also the original post on SO:

http://stackoverflow.com/questions/28725115/r-regex-capture-numbers-in-string-and-replacing-them-in-another-column-captu/28725655

Null-pointer dereference in agrep for expression "{+}{7}"

Running agrep '{+}{7}' leads to a null-pointer dereference:

==24104==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x00000051f502 bp 0x000000000000 sp 0x7ffc522f13c0 T0)
==24104==The signal is caused by a READ memory access.
==24104==Hint: address points to the zero page.
    #0 0x51f501 in tre_match_empty lib/tre-compile.c:1256:17
    #1 0x511a3e in tre_compute_nfl lib/tre-compile.c:1488:12
    #2 0x511a3e in tre_compile lib/tre-compile.c:1997
    #3 0x530464 in tre_regncomp lib/regcomp.c:93:9
    #4 0x50c7af in main src/agrep.c:743:13

This issue was found using LLVM's LibFuzzer.

TRE Python bindings for Python 3

Howdy Laurikari!

I converted python bindings to python 3 and emailed the result to you.

Fork/pull request seemed too complicated )-:. Sorry.

Conversion from wide chars to multibyte chars can fail

In tre_parse_bracket_items, TRE calls wcsrtombs without checking its return value. However, that function can return (size_t)(-1) on failure. This leads to a buffer underflow.

The issue happens for convoluted regular expressions like [[:ÿ: (that's \x5b\x5b\x3a\xff\x3a).

A patch like the following would fix the issue:

diff --git a/lib/tre-parse.c b/lib/tre-parse.c
index ff27dbb..e113896 100644
--- a/lib/tre-parse.c
+++ b/lib/tre-parse.c
@@ -326,6 +326,12 @@ tre_parse_bracket_items(tre_parse_ctx_t *ctx, int negate,
 #else /* !TRE_WCHAR */
                  strncpy(tmp_str, (const char*)re + 2, len);
 #endif /* !TRE_WCHAR */
+                  if (len < 0) {
+                    /* Conversion to multibyte character failed... */
+                   status = REG_ECTYPE;
+                    break;
+                  }
+
                  tmp_str[len] = '\0';
                  DPRINT(("  class name: %s\n", tmp_str));
                  class = tre_ctype(tmp_str);

This issue was found using LLVM's LibFuzzer.

Segmentation Fault

I'm loving tre-agrep for matching DNA sequence strings. However, I keep getting segmentation faults when my regular expression grows:

$ tre-agrep -V
tre-agrep (TRE agrep) 0.8.0

Copyright (c) 2001-2009 Ville Laurikari <[email protected]>.

$ head -n 8 in.fastq
@HWI-ST212_0173:2:1101:1700:1921#GCCAAT/1
NCTGAATGTCAAAGTGAAGAAATTCAACCAAGCGC
+HWI-ST212_0173:2:1101:1700:1921#GCCAAT/1
BMUUMWWTSX^^X^VZYXYZ[TT\[Y]YY]^\V\[
@HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1
TTTTATTATGATCCATTTCGCGTGGAATTCTCGGG
+HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1
gggggggdggeggeegggggggeggddggdgdeeg

$ paste - - - - < in.fastq | tre-agrep -e '((TGGAATTCTCGGGTGC){#5}|(TGGAATTCTCGGGTG){#5}|(TGGAATTCTCGGGT){#4}|(TGGAATTCTCGGG){#4}|(TGGAATTCTCGG){#3}|(TGGAATTCTCG){#3}|(TGGAATTCTC){#2})'
Segmentation fault (core dumped)

If I shorten the regex, it works fine:

$  paste - - - - < in.fastq | tre-agrep -e '((TGGAATTCTCGGGTGC){#5}|(TGGAATTCTCGGGTG){#5}|(TGGAATTCTCGGGT){#4})'
@HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1       TTTTATTATGATCCATTTCGCGTGGAATTCTCGGG     +HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1       gggggggdggeggeegggggggeggddggdgdeeg

$  paste - - - - < in.fastq | tre-agrep -e '((TGGAATTCTCGG){#3}|(TGGAATTCTCG){#3}|(TGGAATTCTC){#2})'
@HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1       TTTTATTATGATCCATTTCGCGTGGAATTCTCGGG     +HWI-ST212_0173:2:1101:1723:1950#GCCAAT/1       gggggggdggeggeegggggggeggddggdgdeeg

TRE fail to parse this

Hello !
While testing TRE I found that it does not correctly match the following regular expression (note that grep does handle it correctly):

the total value is 12.345,89 with a discount of 5,50%

common_args=" -e '-?\d[\d.]*,\d\d' test-bug.txt"

cmd="grep -P $common_args"
echo $cmd
$cmd

cmd="../src/agrep --color $common_args"
echo $cmd
$cmd
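
One possible explanation, offered as an assumption rather than a diagnosis, is the [\d.] part: under POSIX bracket expression rules a backslash inside brackets is a literal character, so [\d.] would mean "backslash, d, or dot" instead of "digit or dot". The sketch below uses Python's re module (not TRE) to show the PCRE-style reading, a portable spelling with [0-9], and a model of the POSIX reading; the helper name find_amounts is hypothetical.

```python
import re

TEXT = "the total value is 12.345,89 with a discount of 5,50%"

def find_amounts(pattern):
    """Return all non-overlapping matches of pattern in TEXT."""
    return re.findall(pattern, TEXT)

# PCRE-style reading: \d inside brackets means "any digit".
PCRE_STYLE = r"-?\d[\d.]*,\d\d"

# Portable spelling that does not rely on escapes inside brackets.
PORTABLE = r"-?[0-9][0-9.]*,[0-9][0-9]"

# Model of the POSIX reading: the bracket holds a literal backslash,
# the letter d and a dot, so digits no longer match there.
POSIX_READING = r"-?[0-9][\\d.]*,[0-9][0-9]"
```

If this is the cause, rewriting the bracket expression as [0-9.] should behave the same in both tools.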

tre-agrep can drop records that have a match

tre-agrep has mysteriously failed to print some records that I know contain a match. I traced the program logic to a point where I knew that tre-agrep knew there was a match, but no record printed. I changed this line:

printf("%.*s", record_len, record);

to:

fwrite(record, record_len, 1, stdout);

and now it works. Go figure.

I have been programming in C for a long time, but I must confess that I do not think I have any experience with "%.*s" printf conversion. But it seems like it should work.

This is on Kubuntu Linux 15.04.
The original bug was encountered on the tre-agrep program that was installed using the packages supplied by Kubuntu. The "workaround" was applied to the source code as it came from the Debian package, and compiled with GCC and using glibc.

tab treated as printable character on Windows

The tab character (\t) is treated as printable ([:print:]) on Windows, even in the "C" locale. This is a bug (in violation of POSIX at least), and it happens when TRE_WCHAR is defined, because iswprint(L'\t') returns true in all locales, including the "C" locale. On Unix, \t seems to be treated as non-printable in all locales (certainly all I checked). Note that isprint('\t') on Windows in the "C" locale returns false, so using isprint() would have been fine. isprint('\t') returns true on Windows in some other locales, at least CP1252, which is surprising too, but maybe permissible. This has been worked around in R, which uses TRE.

Infinite loop for certain regular expressions

Calling agrep '\\)' results in an infinite loop. Agrep allocates memory inside the loop, and will quickly run out of memory.

Compiling agrep with debugging enabled gives the following output:

tre_compile: parsing '\\)'
tre_parse: parsing '\\)', len = 3
tre_parse:  bleep: '\\)'
tre_parse:     escaped: '\\)'
tre_mem_alloc: allocating new 1024 byte block
tre_parse:          empty: ')'
tre_parse:          empty: ')'
tre_parse:          empty: ')'
tre_parse:          empty: ')'
tre_parse:          empty: ')'
tre_parse:          empty: ')'
tre_parse:          empty: ')'
tre_mem_alloc: allocating new 1024 byte block
tre_parse:          empty: ')'
...

Upstream Project Zero security fixes from OS X

Comparing the 10.10.4 and 10.10.5 source to OS X's Libc (on opensource.apple.com) yields the following patch to TRE:

diff -ur Libc-1044.10.1/regex/TRE/lib/regexec.c Libc-1044.40.1/regex/TRE/lib/regexec.c
--- Libc-1044.10.1/regex/TRE/lib/regexec.c  2011-09-29 18:54:25.000000000 -0700
+++ Libc-1044.40.1/regex/TRE/lib/regexec.c  2015-07-09 15:15:19.000000000 -0700
@@ -10,6 +10,10 @@
 #include <config.h>
 #endif /* HAVE_CONFIG_H */

+/* Unset TRE_USE_ALLOCA to avoid using the stack to hold all the state
+   info while running */
+#undef TRE_USE_ALLOCA
+
 #ifdef TRE_USE_ALLOCA
 /* AIX requires this to be the first thing in the file.     */
 #ifndef __GNUC__
diff -ur Libc-1044.10.1/regex/TRE/lib/tre-match-backtrack.c Libc-1044.40.1/regex/TRE/lib/tre-match-backtrack.c
--- Libc-1044.10.1/regex/TRE/lib/tre-match-backtrack.c  2011-09-29 18:54:25.000000000 -0700
+++ Libc-1044.40.1/regex/TRE/lib/tre-match-backtrack.c  2015-07-09 15:15:19.000000000 -0700
@@ -274,7 +274,7 @@

   int num_tags = tnfa->num_tags;
   int touch = 1;
-  char *buf;
+  char *buf = NULL;
   int tbytes;

 #ifdef TRE_MBSTATE
diff -ur Libc-1044.10.1/regex/TRE/lib/tre-match-parallel.c Libc-1044.40.1/regex/TRE/lib/tre-match-parallel.c
--- Libc-1044.10.1/regex/TRE/lib/tre-match-parallel.c   2012-05-03 17:34:12.000000000 -0700
+++ Libc-1044.40.1/regex/TRE/lib/tre-match-parallel.c   2015-07-09 15:15:19.000000000 -0700
@@ -143,7 +143,7 @@
 #endif /* TRE_DEBUG */
   tre_tag_t *tmp_tags = NULL;
   tre_tag_t *tmp_iptr;
-  int tbytes;
+  size_t tbytes;
   int touch = 1;

 #ifdef TRE_MBSTATE
@@ -162,7 +162,7 @@
      everything in a single large block from the stack frame using alloca()
      or with malloc() if alloca is unavailable. */
   {
-    int rbytes, pbytes, total_bytes;
+    size_t rbytes, pbytes, total_bytes;
     char *tmp_buf;
     /* Compute the length of the block we need. */
     tbytes = sizeof(*tmp_tags) * num_tags;
@@ -177,11 +177,11 @@
 #ifdef TRE_USE_ALLOCA
     buf = alloca(total_bytes);
 #else /* !TRE_USE_ALLOCA */
-    buf = xmalloc((unsigned)total_bytes);
+    buf = xmalloc(total_bytes);
 #endif /* !TRE_USE_ALLOCA */
     if (buf == NULL)
       return REG_ESPACE;
-    memset(buf, 0, (size_t)total_bytes);
+    memset(buf, 0, total_bytes);

     /* Get the various pointers within tmp_buf (properly aligned). */
     tmp_tags = (void *)buf;
diff -ur Libc-1044.10.1/regex/TRE/lib/tre-parse.c Libc-1044.40.1/regex/TRE/lib/tre-parse.c
--- Libc-1044.10.1/regex/TRE/lib/tre-parse.c    2011-12-07 18:12:55.000000000 -0800
+++ Libc-1044.40.1/regex/TRE/lib/tre-parse.c    2015-07-09 15:15:19.000000000 -0700
@@ -717,7 +717,7 @@
 static reg_errcode_t
 tre_parse_bracket(tre_parse_ctx_t *ctx, tre_ast_node_t **result)
 {
-  tre_ast_node_t *node;
+  tre_ast_node_t *node = NULL;
   reg_errcode_t status = REG_OK;
   tre_bracket_match_list_t *items;
   int max_i = 32;
@@ -2016,6 +2016,8 @@
            ctx->re++;
            while (ctx->re_end - ctx->re >= 0)
              {
+               if (i == sizeof(tmp))
+               return REG_EBRACE;
                if (ctx->re[0] == CHAR_RBRACE)
                  break;
                if (tre_isxdigit_l(ctx->re[0], ctx->loc))

This patch resolves:

which may also affect upstream TRE.
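
The int to size_t changes in the patch above guard against a byte count that no longer fits in a 32-bit int. The sketch below models that truncation with plain Python integers; the element size and count are hypothetical numbers picked for illustration, the wrap-around shown is what common two's-complement platforms do rather than something the C standard guarantees, and the function names are made up.

```python
INT_MAX = 2**31 - 1  # assuming a 32-bit int

def to_int32(value):
    """Model storing value in a 32-bit two's-complement int."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value

def block_size(element_size, count):
    """Return (exact byte count, the same count after truncation to int)."""
    exact = element_size * count
    return exact, to_int32(exact)
```

For example, 300 million 8-byte elements need 2400000000 bytes, which becomes a negative number once squeezed into a 32-bit int; a size_t variable keeps the exact value on 64-bit platforms.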

Integer overflow and missing checks when parsing bounds

Expressions like x{9999999999999999,3} cause an integer overflow in tre_parse_int. This is undefined behavior in C. The following patch prevents this from happening. The overflow check is a bit conservative, but I don't think this matters.

diff --git a/lib/tre-parse.c b/lib/tre-parse.c
index e113896..4705795 100644
--- a/lib/tre-parse.c
+++ b/lib/tre-parse.c
@@ -588,16 +588,23 @@ static int
 tre_parse_int(const tre_char_t **regex, const tre_char_t *regex_end)
 {
   int num = -1;
+  int overflow = 0;
   const tre_char_t *r = *regex;
   while (r < regex_end && *r >= L'0' && *r <= L'9')
     {
       if (num < 0)
        num = 0;
-      num = num * 10 + *r - L'0';
+      if (num <= (INT_MAX - 9) / 10) {
+        num = num * 10 + *r - L'0';
+      } else {
+        /* This digit could cause an integer overflow. We do not return
+         * directly; instead, consume all remaining digits. */
+        overflow = 1;
+      }
       r++;
     }
   *regex = r;
-  return num;
+  return overflow ? -1 : num;
 }
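
For comparison, a minimal standalone sketch of the same idea with an exact (rather than conservative) overflow test. The function name `parse_decimal` and the plain `char` input are assumptions made for illustration; this is not code from TRE.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of TRE: parse a run of ASCII decimal digits
 * starting at *p, stopping at `end` or at the first non-digit.
 * Returns the value, or -1 if there were no digits or the value would
 * exceed INT_MAX. All digits are consumed even on overflow, as in the
 * patch above. */
static int parse_decimal(const char **p, const char *end)
{
    const char *r = *p;
    int num = -1;
    int overflow = 0;

    while (r < end && *r >= '0' && *r <= '9') {
        int digit = *r - '0';
        if (num < 0)
            num = 0;
        /* num * 10 + digit <= INT_MAX, rearranged so nothing can overflow. */
        if (num > (INT_MAX - digit) / 10)
            overflow = 1;
        else
            num = num * 10 + digit;
        r++;
    }
    *p = r;
    return overflow ? -1 : num;
}
```

Because the comparison is done before the multiplication, the signed arithmetic never leaves the range of `int`, so there is no undefined behavior to rely on.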

When parsing bounds, the minimum repeat count is not checked if no maximum repeat count is given. For instance, the expression x{999999999,} is accepted by TRE. The following patch fixes this:

diff --git a/lib/tre-parse.c b/lib/tre-parse.c
index 4705795..ebc4856 100644
--- a/lib/tre-parse.c
+++ b/lib/tre-parse.c
@@ -641,7 +641,7 @@ tre_parse_bound(tre_parse_ctx_t *ctx, tre_ast_node_t **result)
     }
 
   /* Check that the repeat counts are sane. */
-  if ((max >= 0 && min > max) || max > RE_DUP_MAX)
+  if ((max >= 0 && min > max) || max > RE_DUP_MAX || min > RE_DUP_MAX)
     return REG_BADBR;
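
The corrected condition can be read as a small predicate. The sketch below is illustrative only: `bound_is_valid` is a made-up name, a `max` of -1 is assumed to mean "no maximum given" (as the `max >= 0` test in the patch suggests), and the fallback value of 255 is the POSIX minimum for `RE_DUP_MAX`, not necessarily the value TRE uses.

```c
#include <limits.h>

/* Assumption: <limits.h> normally provides RE_DUP_MAX; fall back to the
 * POSIX minimum if it does not. */
#ifndef RE_DUP_MAX
#define RE_DUP_MAX 255
#endif

/* Hypothetical helper: returns 1 if the repeat counts are acceptable.
 * max == -1 is assumed to mean that no maximum was given. */
static int bound_is_valid(int min, int max)
{
    if (min > RE_DUP_MAX || max > RE_DUP_MAX)
        return 0;               /* either count is too large */
    if (max >= 0 && min > max)
        return 0;               /* e.g. x{3,1} */
    return 1;
}
```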

-d takes an argument, --delimiter does not

Program: tre-agrep

The man page leads me to believe that

-d PATTERN and --delimiter=PATTERN

are just the short form and long form
for the same underlying option, so,
allowing for this difference, they should
behave the same.

But, --delimiter=PATTERN causes tre-agrep
to complain that --delimiter does not take
an argument. But, -d PATTERN works.

I downloaded the source.

agrep.c, line 43

/* Short options. */
static char const short_options[] =
"cd:e:hiklnqsvwyBD:E:HI:MS:V0123456789-:";
  ^^

agrep.c, line 57

/* Long option equivalences. */
static struct option const long_options[] =
{
  {"best-match", no_argument, NULL, 'B'},
  {"color", no_argument, NULL, COLOR_OPTION},
  {"colour", no_argument, NULL, COLOR_OPTION},
  {"count", no_argument, NULL, 'c'},
  {"delete-cost", required_argument, NULL, 'D'},
  {"delimiter", no_argument, NULL, 'd'},
                ^^^^^^^^^^^
  {"delimiter-after", no_argument, NULL, 'M'},
  ...
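
A minimal sketch of what a consistent declaration might look like, using a trimmed-down, hypothetical option table rather than the real one from agrep.c. The only point is that "delimiter" is declared with required_argument so that it agrees with "d:" in the short options.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical, trimmed-down option table (not the one in agrep.c). */
static const struct option sketch_long_options[] = {
    {"count",     no_argument,       NULL, 'c'},
    {"delimiter", required_argument, NULL, 'd'},
    {NULL, 0, NULL, 0}
};

/* Returns the delimiter pattern given on the command line, or NULL. */
static const char *find_delimiter(int argc, char **argv)
{
    const char *delim = NULL;
    int c;

    optind = 1; /* restart scanning so the function can be called again */
    while ((c = getopt_long(argc, argv, "cd:", sketch_long_options, NULL)) != -1) {
        if (c == 'd')
            delim = optarg;
    }
    return delim;
}
```

With this table both `-d PATTERN` and `--delimiter=PATTERN` end up in the same `case 'd'` branch with `optarg` set. Resetting `optind` to 1 is enough on glibc; some BSD systems may also want `optreset` set.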

Escape sequences are not recognized within bracket expressions

I want to match all non-tab characters at the start, or end, of a string. For example, I have the following tab-delimited string where the two columns also contain a space:
a aataaa\tbbbtb bb.

If I want to match the first column, I should be able to do:

echo -e "a aataaa\tbbbtb bb" | tre-agrep --color --show-position -e '^[^\t]+'
0-4:a aataaa    bbbtb bb

It should match anything that is not a tab. However, as you can see, it matches up to, but not including, the "t" rather than the tab.

Similarly, trying to match the last column:

echo -e "a aataaa\tbbbtb bb" | tre-agrep --color --show-position -e '[^\t]+$'
13-17:a aataaa  bbbtb bb
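
For what it's worth, POSIX treats a backslash inside a bracket expression as an ordinary character, so `[^\t]` would mean "neither a backslash nor a t", which is consistent with the output above. A possible workaround is to put a real tab byte into the bracket expression. The sketch below does that from C, using the system's POSIX `<regex.h>` rather than TRE itself, and `first_column_length` is a made-up helper name.

```c
#include <regex.h>
#include <stddef.h>

/* Returns the length of the leading run of non-tab characters in `line`,
 * or -1 if the pattern fails to compile or does not match. */
static int first_column_length(const char *line)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    /* The C compiler turns \t into a real tab byte, so the regex engine
     * sees a bracket expression containing '^' and a literal tab. */
    if (regcomp(&re, "^[^\t]+", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, line, 1, &m, 0) == 0)
        len = (int)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

On the command line the same effect can usually be had by letting the shell insert the tab into the pattern before tre-agrep sees it.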

isolating tre to use on aws lambda

We are looking into using tre and the python bindings on an AWS Lambda but we need a precompiled package to be able to deploy it. I've tried a lot of different things which are mostly based on this stackoverflow post: http://stackoverflow.com/questions/34749806/using-moviepy-scipy-and-numpy-in-amazon-lambda

but with no success. Every possible solution results in the same problem:
ImportError: libtre.so.5: cannot open shared object file: No such file or directory

So I'm out of options now. Is anyone able to help or provide insights?

tre-agrep can print garbage instead of delimiter

When --delimiter (-d) is specified and the delimiter is to be printed before the matching record, tre-agrep can print delim_len bytes of garbage instead of the delimiter. This is because, just before printing the matching record, an adjustment is made to |record|, the start of the current record: delim_len is subtracted from record, but there are not necessarily that many bytes in the buffer, buf, between buf and record. So whatever bytes lie just below buf get printed. My experience so far is that it just prints garbage, but the behavior is undefined, so it could be worse.

My test version has a test to ensure that record - buf >= delim_len.
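
The check amounts to something like the sketch below. The names buf, record and delim_len follow the description above; the function `print_start` itself is hypothetical and is not taken from agrep.c. It assumes record never points below buf.

```c
#include <stddef.h>

/* Hypothetical helper: returns the address printing should start from.
 * Assumes buf <= record. */
static const char *print_start(const char *buf, const char *record,
                               size_t delim_len)
{
    /* Only step back over the delimiter if it actually fits in the buffer. */
    if ((size_t)(record - buf) >= delim_len)
        return record - delim_len;
    return record;
}
```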

Include R additions?

Apparently the R program includes the tre library with some local modifications, in particular the addition of some functions:

tre_regnexecb
tre_regexecb
tre_regncompb
tre_regaexecb
tre_regcompb

that apparently operate on raw bytes ("takes bytes literally"). Has anyone ever approached you about adding these changes to tre directly?

One reason that leads me to ask is that we're seeing issues with the tre regular expression handling on the arm architecture. I was hoping to try using the standalone tre library with it to see if that works, but cannot due to the missing symbols.

can't build on Ubuntu

Here is the error I get when I try to run ./configure:

checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking target system type... i686-pc-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking whether gcc and cc understand -c and -o together... yes
./configure: line 4274: syntax error near unexpected token `fi'
./configure: line 4274: `fi'

Segmentation faults 2017-05-15

Hello,
I was using American Fuzzy Lop (afl-fuzz) to fuzz input to a modified version of the agrep program on Linux. Is fixing the crashes from these input files something you're interested in? The input files can be found here: https://github.com/rwhitworth/tre-fuzz.

The repo contains a README that has instructions on how to execute the files to cause the segmentation faults, a modified copy of the agrep.c source to read a regex from stdin, and the random input file that is searched with that regex.

I understand if the changes made to agrep make this a bit convoluted, but it was the only way I could easily fuzz the program. I tried to keep the changes as minimal as possible.

Let me know if I can provide any more information to help narrow down this issue.

autopoint package required on Ubuntu 12.04 LTS

In order to build tre on Ubuntu 12.04, you also need to install autopoint, which is currently not documented in README.darcs. I ran the following to install the prerequisites before ./utils/autogen.sh would work:

sudo apt-get -y install darcs autoconf automake gettext libtool zip autopoint

Odd character problem with tre-agrep

When agrep’ing a file with some odd characters on one
line the search seems to stop at that line and nothing
is found thereafter.

Further info:
export LC_ALL=en_US.UTF-8

export LC_ALL=en_US.ISO8859-1
<work as expected!>
unset LC_ALL
<work as expected!>

so it seems to be a problem related to utf-8

I have reduced this to a testfile that I paste here in b64,
it is only 4 lines:
eW91dHViZSBuaWxzb2xhYXhlbAp2aW1lbyBuaWxzb2xhQGFiYy5zZSAKbz8/4oCZcmVpbGx5
IGhlbUBuaWxzb2xhLnNlIApkYXRvcm1hZ2F6aW4gZG16bGFicyBtZWRpYXNob3cgbmlsc29s
YQo=

The simple testcase is simply like this:
$ /usr/bin/grep datormagazin /tmp/test
datormagazin dmzlabs mediashow nilsola
$ echo $?
0
$ /usr/local/bin/agrep datormagazin /tmp/test
$ echo $?
1

Further details:
I have tested this on both openbsd 5.5 and 5.6 (when
installed with pkg_add) and on gentoo (added as a
package dev-libs/tre)
OpenBSD Version tre-0.8.0p0
gentoo version tre-0.8.0

Further tests
match on two lines before works as expected:
$ /usr/bin/grep nilsolaa /tmp/test
youtube nilsolaaxel
$ /usr/local/bin/agrep nilsolaa /tmp/test
youtube nilsolaaxel

match on line before, strange behaviour:
$ /usr/bin/grep vimeo /tmp/test
vimeo nilsola@abc.se
$ /usr/local/bin/agrep vimeo /tmp/test
vimeo nilsola@abc.se
o��reilly hem@nilsola.se
datormagazin dmzlabs mediashow nilsola
s%
$ echo $?
0

Partial match

Hello.

I need to do partial regex checks (from the beginning of a pattern). For example, if I have

a pattern "^a/b/c/.*/d" and a string "a/b", then it's a match
a pattern "^a/b/c/.*/d" and a string "b/c", then it's not a match.

Is this possible with TRE? If not, could you add such functionality? :)

Logical AND instead of bitwise AND, tre-match-backtrack.c line 603

lib/tre-match-backtrack.c

Line 603 reads:
if (stack->item.state->assertions && ASSERT_BACKREF)

It should use the bitwise operator &, as used elsewhere on line 415:
if (trans_i->state && trans_i->assertions & ASSERT_BACKREF)
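
A small standalone sketch of why the two operators differ here. The flag values are assumptions chosen for illustration; TRE defines its own in its internal headers.

```c
/* Assumed flag values, for illustration only. */
#define SKETCH_ASSERT_AT_BOL  1
#define SKETCH_ASSERT_BACKREF 256

/* Logical AND: true whenever `assertions` is nonzero, because the
 * constant on the right-hand side is always nonzero. */
static int has_backref_logical(int assertions)
{
    return assertions && SKETCH_ASSERT_BACKREF;
}

/* Bitwise AND: true only when the BACKREF bit itself is set. */
static int has_backref_bitwise(int assertions)
{
    return (assertions & SKETCH_ASSERT_BACKREF) != 0;
}
```

With only the AT_BOL bit set, the logical version reports a backreference while the bitwise version does not, so the `&&` form would take the backreference branch for any state that carries some other assertion.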

Wrong match when minimum value is omitted in repeating qualifier

Using R:

grepl("ab{,2}c", "abbbc")
# [1] TRUE

I would expect either FALSE or an "invalid regex" error.

More details can be found in my original Stack Overflow question.

possibly undefined macro

configure.ac:6: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:8: error: possibly undefined macro: AM_GNU_GETTEXT_VERSION
configure.ac:14: error: possibly undefined macro: AM_PROG_CC_C_O
configure.ac:40: error: possibly undefined macro: AM_CONDITIONAL
configure.ac:516: error: possibly undefined macro: AM_GNU_GETTEXT
configure.ac:517: error: possibly undefined macro: AC_LIBTOOL_TAGS
configure.ac:518: error: possibly undefined macro: AC_LIBTOOL_WIN32_DLL
configure.ac:519: error: possibly undefined macro: AM_DISABLE_STATIC
configure.ac:520: error: possibly undefined macro: AC_PROG_LIBTOOL

When I went looking for the Autoconf documentation, I found...

http://lists.gnu.org/archive/html/autoconf/2010-01/msg00050.html

That generally means that the package had an overquoted macro in its
configure.ac. You may be better off reporting this to the acl folks, to
see if it is a bug in their files.

Meanwhile, here's the documentation of m4_pattern_allow [1]. However,
adding m4_pattern_allow([AC_CONFIG_MACRO]) to configure.ac is probably not
the right thing to do; but without seeing the context that caused the
warning, I'm not sure what the best fix is.

[1] http://www.gnu.org/software/autoconf/manual/autoconf.html#Forbidden-Patterns

repeat operators allowed to start regular expression

Repeat operator characters (*, & and +) are allowed at the beginning of the regular expression. According to the documentation they should follow an atom. Also, all the other regex engines I tested report this as an error in the regular expression.

Example: "*_abc" is considered by regcomp a valid regular expression, and matches "_abc".

I believe regcomp should report a compile error, just as it does for other malformed expressions such as "(_abc".