robert-bor / aho-corasick
Java implementation of the Aho-Corasick algorithm for efficient string matching
License: Apache License 2.0
For example:
In this case the match must not be made, or it must be removed afterwards.
@Test
public void shefgTest() {
Trie trie = new Trie(false);
trie.addKeyword("shefg");
trie.addKeyword("hefg");
trie.addKeyword("efg");
trie.addKeyword("fg");
Collection<Emit> emits = trie.parseText("shefge");
System.out.println(emits);
}
When I run the test case above, it outputs nothing.
Robert
Thanks for the code. I noticed that for case insensitive matching, Trie.parseText() reduces the input text to lower case but the keywords are left in their original case (as added). So for the situation where
trie.addKeyword("Alpha");
and
Collection<Emit> emits = trie.parseText("Alpha");
no matches are returned when caseInsensitive is true.
Adding
if (trieConfig.isCaseInsensitive()) {
character = Character.toLowerCase(character);
}
to addKeyword would make addKeyword and parseText symmetric and match all combinations of case.
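The suggested symmetry can be illustrated with a minimal, self-contained sketch (this is not the library's code; the class and method names are illustrative): normalize case once, in one place, and apply it on both the insert path and the query path.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: case normalization applied identically at insertion and lookup time,
// mirroring the proposed change to addKeyword.
public class CaseSymmetrySketch {
    private final Set<String> keywords = new HashSet<>();
    private final boolean caseInsensitive;

    CaseSymmetrySketch(boolean caseInsensitive) {
        this.caseInsensitive = caseInsensitive;
    }

    private String normalize(String s) {
        return caseInsensitive ? s.toLowerCase() : s;
    }

    void addKeyword(String keyword) {
        keywords.add(normalize(keyword)); // normalize on insert, as proposed
    }

    boolean matches(String text) {
        return keywords.contains(normalize(text)); // normalize on query, as parseText does
    }

    public static void main(String[] args) {
        CaseSymmetrySketch sketch = new CaseSymmetrySketch(true);
        sketch.addKeyword("Alpha");
        System.out.println(sketch.matches("alpha"));
        System.out.println(sketch.matches("ALPHA"));
    }
}
```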
thanks
The Trie class should be able to take an array and/or collection of keywords. Rather than:
.addKeyword("hers")
.addKeyword("his")
.addKeyword("she")
.addKeyword("he")
It should be possible to add them all at once:
.addKeywords(new String[]{"hers", "his", "she", "he"})
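A self-contained sketch of what such an overload could look like (the class and method names here are hypothetical, not the library's actual API); a Java varargs parameter accepts both an explicit array and comma-separated arguments:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical builder sketch showing varargs and collection overloads.
public class KeywordBuilderSketch {
    private final List<String> keywords = new ArrayList<>();

    KeywordBuilderSketch addKeyword(String keyword) {
        keywords.add(keyword);
        return this;
    }

    // Varargs: callable as addKeywords("a", "b") or addKeywords(new String[]{"a", "b"}).
    KeywordBuilderSketch addKeywords(String... more) {
        keywords.addAll(Arrays.asList(more));
        return this;
    }

    KeywordBuilderSketch addKeywords(Collection<String> more) {
        keywords.addAll(more);
        return this;
    }

    public static void main(String[] args) {
        KeywordBuilderSketch b = new KeywordBuilderSketch()
                .addKeywords("hers", "his", "she", "he");
        System.out.println(b.keywords.size());
    }
}
```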
java.lang.StringIndexOutOfBoundsException: String index out of range: 311
at java.lang.String.charAt(String.java:658)
at org.ahocorasick.trie.Trie.removePartialMatches(Trie.java:119)
at org.ahocorasick.trie.Trie.parseText(Trie.java:103)
Loading "/usr/share/dict/words" as the dictionary and searching for
String text = "aardvark aardvarks aardwolf aardwolves Aaren Aargau aargh Aarhus Aarika Aaron aaron Aaronic aaronic Aaronical Aaronite Aaronitic Aaron's-beard Aaronsburg Aaronson babelet Babelic babelike Babelisation Babelise Babelised Babelish Babelising Babelism Babelization Babelize Babelized Babelizing babels Baber babery"
If I add an additional space at the end of the text, the exception does not happen.
/usr/share/dict/words is from words-3.0-22.fc20.noarch
Forgot to mention that the trie is being constructed as new Trie().onlyWholeWords()
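The likely hazard here is a whole-word check that peeks one character past the end of a match. A boundary-safe version of such a check might look like this (a standalone sketch, not the library's actual implementation):

```java
// Sketch of a boundary-safe whole-word test: a match ending at the very last
// character of the text must not read one character past the end.
public class WholeWordCheckSketch {
    static boolean isWholeWord(String text, int start, int end) { // end is inclusive
        boolean leftOk = start == 0
                || !Character.isLetterOrDigit(text.charAt(start - 1));
        boolean rightOk = end + 1 >= text.length()            // guard before charAt(end + 1)
                || !Character.isLetterOrDigit(text.charAt(end + 1));
        return leftOk && rightOk;
    }

    public static void main(String[] args) {
        String text = "aardvark babery"; // "babery" ends exactly at text.length() - 1
        System.out.println(isWholeWord(text, 9, 14)); // match at end of text, no trailing space needed
        System.out.println(isWholeWord(text, 0, 7));  // "aardvark"
        System.out.println(isWholeWord(text, 0, 3));  // "aard" is not a whole word here
    }
}
```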
java.lang.StackOverflowError
at java.util.ArrayList.iterator(ArrayList.java:814)
at org.ahocorasick.interval.IntervalNode.determineMedian(IntervalNode.java:43)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:17)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
(rest of the stack trace omitted for brevity)
Here is what I am doing.
In test setup I am loading a dictionary into a trie and a set of sentences. Then I am passing each one of the sentences in a loop to the trie and after processing two sentences I get the above error.
for (String sentence : sentences) {
Collection<Token> tokens = trie.tokenize(sentence);
for (Token token : tokens) {
System.out.println("token: " + token);
}
System.out.println("\n");
}
Does the trie need to be reset after processing of a sentence? If so, what is the call?
Thanks.
Hi, I saw the following exception reported from the parseText method, but haven't been able to reproduce it in my development environment.
java.lang.NullPointerException
at org.ahocorasick.trie.Trie.getState(Trie.java:138) ~[ahocorasick-0.2.4.jar:?]
at org.ahocorasick.trie.Trie.parseText(Trie.java:99) ~[ahocorasick-0.2.4.jar:?]
at myCode...
I'm wondering if it's possibly being caused by using a single Trie from multiple threads simultaneously (though hammering it from a number of threads on my machine doesn't exhibit a problem).
Should I expect calling parseText on a single Trie from multiple threads to be safe?
For instance, in the case of caching to some third-party provider like memcached (and since building a Trie is not a cheap computation), Trie should implement the Serializable interface. Could you provide this feature?
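Until the Trie itself implements Serializable, one possible workaround (a sketch, under the assumption that rebuild time on deserialization is acceptable) is to serialize only the keyword list and rebuild the trie after reading it back:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: persist the keywords (which are cheap to serialize), not the trie.
public class KeywordListSerialization {
    public static void main(String[] args) throws Exception {
        List<String> keywords = Arrays.asList("hers", "his", "she", "he");

        // Serialize the keyword list to bytes (a stand-in for memcached).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(keywords));
        }

        // Deserialize; the trie would then be rebuilt from the restored list.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            List<String> restored = (List<String>) in.readObject();
            System.out.println(restored.equals(keywords));
            // Here one would rebuild the trie from `restored`.
        }
    }
}
```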
Using version 0.3.0. I tried this code on a test file in Eclipse:
for(String line; (line = br.readLine()) != null; ) {
Emit firstMatch = trie.firstMatch(line);
System.out.println(firstMatch.toString());
}
and received a java.lang.NullPointerException. I changed it to the following and got the expected behavior (no more NullPointerException):
for(String line; (line = br.readLine()) != null; ) {
Emit firstMatch = trie.firstMatch(line);
System.out.println(firstMatch);
}
I think you are missing a null dereference check somewhere in your overridden toString() implementations. Minor issue, but also easy to fix.
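For reference, both of the following are null-safe ways to print a possibly-null match (a standalone sketch; firstMatch here is a plain null reference standing in for the value firstMatch(line) returns when the line contains no keyword):

```java
import java.util.Objects;

public class NullSafePrintSketch {
    public static void main(String[] args) {
        Object firstMatch = null; // stand-in for a no-match result

        // firstMatch.toString() would throw NullPointerException here.
        // Either of these is null-safe:
        System.out.println(firstMatch);                            // println handles null itself
        System.out.println(Objects.toString(firstMatch, "no match")); // explicit default
    }
}
```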
I can see that 0.6.0 is the latest version, but I can only find 0.4.0 in the Maven repository. Did I miss something?
Can this algorithm be modified to accomplish URL pattern matching, returning the exact (best) match instead of the first match? For example, modify State.java to hold a String instead of a Character and compare strings based on a "/" tokenizer, in a way that doesn't impact performance?
It'd be great if each keyword could have an associated object that is used as a payload. That way the Trie can store associated data with each keyword, and Emits can retrieve that data in addition to just the keyword and its index position.
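Until such a feature exists, a side map keyed by the emitted keyword can approximate a payload (a standalone sketch; in real usage the lookup key would come from something like emit.getKeyword(), which is assumed here):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: associated data kept in a map alongside the trie, looked up by the
// keyword that a match emits.
public class KeywordPayloadSketch {
    public static void main(String[] args) {
        Map<String, Integer> payloads = new HashMap<>();
        payloads.put("she", 1);
        payloads.put("he", 2);

        String emittedKeyword = "she"; // stand-in for emit.getKeyword()
        System.out.println(payloads.get(emittedKeyword));
    }
}
```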
It is listed, but cannot be downloaded from here:
http://repo1.maven.org/maven2/org/ahocorasick/ahocorasick/0.3.1/
Hi,
Thanks for this excellent library! We've just started using it in all our clients for our backup service https://degoo.com and it has helped us speed up our progress calculations a lot.
When profiling I noticed that (at least in our use case) it spends quite a lot of time calling this.success.get(character) in State. In our case >50% of all execution time is spent there. My hunch is that this could be improved quite a lot by using Trove's TCharObjectHashMap. It would add an external dependency but it might be worth it if the performance enhancement is big enough. If adding an external dependency is not an option perhaps implementing something similar to TCharObjectHashMap would be a better solution. Let me know if we can be of any assistance with this.
Regards
Carl Hasselskog
Hi Mr. Robert,
I'm trying to implement your code. I put the following dependency -
<dependency>
<groupId>org.ahocorasick</groupId>
<artifactId>ahocorasick</artifactId>
<version>0.2.4</version>
</dependency>
I'm getting the following error on mvn install-
priom@ASUS-S551LA:~/Public/6461/aho$ mvn -e install
+ Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] The projects in the reactor contain a cyclic reference: Edge between 'Vertex{label='org.ahocorasick:ahocorasick'}' and 'Vertex{label='org.ahocorasick:ahocorasick'}' introduces to cycle in the graph org.ahocorasick:ahocorasick --> org.ahocorasick:ahocorasick
[INFO] ------------------------------------------------------------------------
[INFO] Trace
org.apache.maven.BuildFailureException: The projects in the reactor contain a cyclic reference: Edge between 'Vertex{label='org.ahocorasick:ahocorasick'}' and 'Vertex{label='org.ahocorasick:ahocorasick'}' introduces to cycle in the graph org.ahocorasick:ahocorasick --> org.ahocorasick:ahocorasick
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:295)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: hidden.org.codehaus.plexus.util.dag.CycleDetectedException: Edge between 'Vertex{label='org.ahocorasick:ahocorasick'}' and 'Vertex{label='org.ahocorasick:ahocorasick'}' introduces to cycle in the graph org.ahocorasick:ahocorasick --> org.ahocorasick:ahocorasick
at hidden.org.codehaus.plexus.util.dag.DAG.addEdge(DAG.java:143)
at hidden.org.codehaus.plexus.util.dag.DAG.addEdge(DAG.java:123)
at org.apache.maven.project.ProjectSorter.<init>(ProjectSorter.java:118)
at org.apache.maven.execution.ReactorManager.<init>(ReactorManager.java:99)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:288)
... 11 more
[INFO] ------------------------------------------------------------------------
[INFO] Total time: Less than 1 second
[INFO] Finished at: Thu Sep 25 01:00:02 EDT 2014
[INFO] Final Memory: 2M/61M
[INFO] ------------------------------------------------------------------------
Can you help? Can it be a version issue? I am using-
Apache Maven 2.2.1 (rdebian-14)
Java version: 1.7.0_65
Thanks, really appreciate your work!
Hi,
I was taking a look at how storeEmit works (https://github.com/robert-bor/aho-corasick/blob/master/src/main/java/org/ahocorasick/trie/Trie.java#L277) and thought it could be nice if the emitted boolean were based on the value returned by emitHandler.emit. This would allow us to do additional checks on values before acknowledging an emit, and thus take advantage of isStopOnHit.
In my case, I have a post-condition based on each keyword. I would like to stop as soon as I find a match that also satisfies the post-condition. Currently, even if I override the emit handler, I have no way of forcing a stop.
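The proposed change could be sketched like this (a self-contained illustration; the EmitHandler interface below is hypothetical and not the library's current signature):

```java
// Sketch of the proposal: an emit handler whose boolean return value decides
// whether the emit counts as a hit, so stopOnHit only fires for accepted matches.
public class BooleanEmitHandlerSketch {
    interface EmitHandler {
        boolean emit(String keyword); // return true to accept (and allow stopOnHit)
    }

    public static void main(String[] args) {
        // Post-condition: only accept keywords longer than 2 characters.
        EmitHandler handler = keyword -> keyword.length() > 2;

        String[] found = {"he", "she"}; // stand-in for matches found by the trie
        for (String keyword : found) {
            boolean emitted = handler.emit(keyword);
            System.out.println(keyword + " accepted=" + emitted);
            if (emitted) break; // stop-on-hit honours the handler's decision
        }
    }
}
```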
README tutorial uses deprecated .caseInsensitive() instead of .ignoreCase()
When the tokenize method is used, a single token must report itself as whiteSpace if it consists completely of whitespace characters.
Hello,
I was using the ahocorasick library to perform some text processing on an English Wikipedia dump. As it contains several non-English words, it led me to discover an issue in how the library handles cases.
I wrote a test case to explain it; here it is:
@Test
public void caseInsensitiveTrieWithSomeUnicodeCharactersCreatesEmitsWithWrongStart(){
// when this is lower cased, it becomes a string of length 2!
String upperLengthOne = "İ";
String normalI = "I";
Trie trie = Trie.builder()
.caseInsensitive()
//.onlyWholeWords()
.addKeyword(upperLengthOne)
.build();
// because when lower cased it becomes 2 characters, the emit gets confused and creates a string that starts at -1
// this can cause further problems if we index into the original string with this emit start and end.
// This happens if we make the trie builder.onlyWholeWords() in which case we get exceptions
assertEquals(-1, trie.parseText(upperLengthOne).stream().findAny().get().getStart());
}
The whole problem happens when lower-casing the keyword yields a string of a different length than the original. I can't think of a quick, easy fix for this issue, but I guess the least that can be done is for the library to throw an error when the lower-cased keyword's length differs from the original keyword's length, warning the library's user about this issue.
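The length change is easy to verify with the JDK alone (the locale is pinned here to make the result deterministic; note the library itself calls toLowerCase() with the default locale):

```java
import java.util.Locale;

public class DottedCapitalISketch {
    public static void main(String[] args) {
        String upper = "\u0130"; // 'İ', LATIN CAPITAL LETTER I WITH DOT ABOVE
        // In non-Turkish locales, lower-casing U+0130 yields "i" followed by
        // U+0307 COMBINING DOT ABOVE, so the string grows from 1 char to 2
        // and every subsequent offset shifts by one.
        String lower = upper.toLowerCase(Locale.ENGLISH);
        System.out.println(upper.length());
        System.out.println(lower.length());
    }
}
```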
Hi,
First of all thanks for providing an implementation of the algorithm.
Wanted to check -
Is there a way to maintain additional information along with each keyword/node within the trie data structure? Basically, whenever there is a match, I would like to retrieve the corresponding match information.
For example,
Assume we have 10 documents, each having distinct words. I can create the trie by adding the words as keywords to the trie builder. However, whenever there is a match, I would like to know which document the word came from. So I need to maintain some extra information, like a document id, in each node of the trie along with the words.
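One way to approximate this today, without changing the library, is to keep an inverted index next to the trie that maps each keyword to its source document ids (a sketch; the matched keyword would come from something like emit.getKeyword() in real usage):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: while adding keywords to the trie builder, also record which
// document each keyword came from, then look that up on a match.
public class KeywordDocumentIndexSketch {
    public static void main(String[] args) {
        Map<String, List<Integer>> docsByKeyword = new HashMap<>();
        docsByKeyword.computeIfAbsent("aardvark", k -> new ArrayList<>()).add(1);
        docsByKeyword.computeIfAbsent("babel", k -> new ArrayList<>()).add(7);

        String matchedKeyword = "babel"; // stand-in for emit.getKeyword()
        System.out.println(docsByKeyword.get(matchedKeyword));
    }
}
```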
Let me know if you need more information.
Thanks,
Rajan Jana
I was looking at the tokenize function, and need things like EmitHandler.
Then I checked the code of tokenize, and realized that it's actually a stage after parseText.
So why not use a pattern like tokenize(parseText(...)), so we can enjoy all the new features offered by parseText!
Right now we already have a few parseText variants, and there may be more in the future!
Are there any issues if I do this repeatedly?
boolean checkList(Collection<String> list, String text) {
    Trie trie = Trie.builder().stopOnHit().ignoreCase().addKeywords(list).build();
    return trie.containsMatch(text);
}
Hello, I'm wondering when the 0.3.1 version will be published to the Maven repository?
The current implementation has grown bloated under the various change requests. Also, the check for whole words and non-overlapping matches is done by post-processing the found emits. Because of this, the interactive algorithm run does not support checking for whole words and non-overlaps, only the full run does.
Goal:
Some ideas:
This release will be turned into a v0.4.0.
If any of you have more or better ideas, please say so here.
Hello,
First of all, I would like to thank you for this project, which has really helped me in some of my projects.
Could you advise me whether it is possible to use regexes in keywords?
If yes, how can that be done?
Thanks for the reply,
Hi,
As far as I can gather without going too deeply into the implementation, it looks safe to build the Trie once and then search it in parallel (i.e. call trie.parseText() on the same Trie instance from multiple threads).
This is because all the "state" of the search appears to be on local variables on the stack.
This is of course ignoring the issue of memory barriers that can be dealt with by making the Trie instance volatile.
Is this assessment correct?
Thanks,
Uri
When a text is entered, it may contain combinations of upper and lower case characters. If the casing is not important to the matching, it is possible to convert the entire search text into lower case to allow for more matches to be found.
Hi,
We have been using this library for some time now.
Recently there was an OOM due to the increased size of the dictionary.
A simple test shows that memory usage is linear. (There is no attach-screenshot option, sorry!)
Currently, our dictionary size is 0.75 million entries (a 33 MB CSV file, one term per line).
A Trie constructed using this dictionary takes about 1.9 GB (heap entries below):
"Class","Objects","Shallow Size","Retained Size"
"java.util.TreeMap","16034426","769652448","2060340624"
"java.util.TreeMap$Entry","16036279","641451160","2060340624"
"org.ahocorasick.trie.State","15292852","489371264","2060340624"
My question is
Thanks in advance,
Phaneendra
Based on the original search text, the Trie must be able to return tokens. A token can either be plaintext or a match.
When it is a match, it should hold the original emit besides the original text.
Hello,
I would like to ask whether this library is considered production-ready, and whether it is being used anywhere?
Thanks!
I encountered a performance problem, documented in a Unit-Test https://github.com/almondtools/stringbench/blob/master/src/test/java/com/almondtools/stringbench/incubation/ACAhoCorasickIncubationTest.java.
I tracked down the reason: the removeOverlaps method removes overlaps in an inefficient way. The experiments show that the following section is the critical one:
for (Intervalable removeInterval : removeIntervals) {
intervals.remove(removeInterval);
}
Doing this on an ArrayList costs O(n^2), and an ArrayList containing millions of elements will take a long time.
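The single-pass alternative is easy to demonstrate with the JDK collections (a standalone sketch, unrelated to the library's internal interval types):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: instead of calling intervals.remove(x) per element (each call is a
// linear scan plus a shift, O(n^2) in total), removeIf compacts the list in a
// single O(n) pass, with O(1) membership tests against a HashSet.
public class BulkRemovalSketch {
    public static void main(String[] args) {
        List<Integer> intervals = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            intervals.add(i);
        }

        Set<Integer> toRemove = new HashSet<>(Arrays.asList(2, 4, 6));
        intervals.removeIf(toRemove::contains);

        System.out.println(intervals);
    }
}
```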
It must be possible to call parseText with a callback handler. In this situation, emits will not be stored in the list, but instead for every emit, a call will be made to the callback handler.
To guarantee backwards compatibility, the system must use a default handler which stores the emits as it does now and returns the list in the end.
Hi
My profiler seems to show that a bunch of time is being spent in:
java.util.concurrent.LinkedBlockingDeque#add
inside of org.ahocorasick.trie.Trie.constructFailureStates().
LinkedList is a faster implementation of Queue than LinkedBlockingDeque; I am measuring it to be about 3-4 times faster.
Could:
private void constructFailureStates() {
final Queue<State> queue = new LinkedBlockingDeque<>();
be changed to:
private void constructFailureStates() {
final Queue<State> queue = new LinkedList<>();
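This seems right for a single-threaded traversal: LinkedBlockingDeque pays locking overhead that constructFailureStates never needs. A minimal illustration of the non-blocking alternatives (ArrayDeque is often faster still than LinkedList for a plain FIFO queue, though either avoids the locking cost):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueChoiceSketch {
    public static void main(String[] args) {
        // LinkedBlockingDeque is a thread-safe blocking queue; its locking is
        // pure overhead when only one thread does the breadth-first traversal.
        Queue<String> queue = new ArrayDeque<>(); // new LinkedList<>() would also work
        queue.add("root");
        queue.add("child");
        while (!queue.isEmpty()) {
            System.out.println(queue.remove()); // FIFO order, as BFS requires
        }
    }
}
```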
The basic Aho-Corasick algorithm returns all matches, whether they overlap or not. It should be possible to eliminate overlaps, so that the series of matches is interspersed in non-matching text, never in or between matches.
Example:
Conflict resolution:
Hi
org.ahocorasick.trie.Trie.removePartialMatches(CharSequence, List<Emit>) is slow when the given list is large. The problem is the O(n^2) running time: a list of Emits to delete is built, and then each one is removed (not by index) from the list.
If you are happy to move to Java 8, the change you need to make to speed this up is to change the body of removePartialMatches to:
collectedEmits.removeIf(e -> this.isPartialMatch(searchText, e));
This resulted in a speed-up from 1941053 ms to 142 ms for a test that used this method heavily; I think the document was 8 MB and had lots of emits.
I have a simple microbenchmark project to measure the string matching performance with different algorithms.
It looks to me like Aho-Corasick's performance doesn't compare very well with the others:
test scenario | String | KMP | KMP - precompiled | Aho-Corasick
---|---|---|---|---
short text | *78,763.767/ms | 19,118.220/ms | 29,489.667/ms | 7,824.821/ms
long text | *4,093.381/ms | 329.124/ms | 426.696/ms | 197.690/ms
Is it because the design of my tests doesn't fit Aho-Corasick's typical usage scenario?
I added the following test case to TrieTest.java and it fails:
@Test
public void unicodeIssue() {
String target = "LİKE THIS";
Trie trie = new Trie().caseInsensitive().onlyWholeWords();
trie.addKeyword("this");
Collection<Emit> emits = trie.parseText(target);
System.out.println(emits);
assertEquals(1, emits.size());
Iterator<Emit> it = emits.iterator();
checkEmit(it.next(), 5, 8, "this");
}
Notice the capital I with a dot. The code thinks the offsets are 6 and 9, and that actually causes trie.tokenize(target)
to crash because 9 is past the end of the string. Lower-casing that character produces a two-character sequence, so it pushes the offset calculation off.
Hi,
I am getting frequent NullPointerException inside Trie.java
Caused by: java.lang.NullPointerException
at org.ahocorasick.trie.Trie.getState(Trie.java:138)
at org.ahocorasick.trie.Trie.parseText(Trie.java:99)
Looks like currentState can be null because failure() can return null?
private State getState(State currentState, Character character) {
State newCurrentState = currentState.nextState(character);
while (newCurrentState == null) {
currentState = currentState.failure();
newCurrentState = currentState.nextState(character);
}
return newCurrentState;
}
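One defensive shape for such a guard, shown on a deliberately tiny stand-in state machine (this is a sketch, not the library's actual fix; in the real trie the root state normally handles misses by transitioning back to itself):

```java
// Self-contained sketch of a null guard for the failure-link walk: if
// failure() can return null (e.g. for the root), fall back to the root state
// instead of dereferencing null.
public class GetStateGuardSketch {
    static class State {
        State failure;                        // null for the root in this sketch
        State next(char c) { return null; }   // no transitions: forces the fallback path
    }

    static State rootState = new State();

    static State getState(State currentState, char character) {
        State newCurrentState = currentState.next(character);
        while (newCurrentState == null) {
            State failure = currentState.failure;
            currentState = (failure != null) ? failure : rootState; // guard against null
            if (currentState == rootState && currentState.next(character) == null) {
                return rootState; // root has no transition either: stay at root
            }
            newCurrentState = currentState.next(character);
        }
        return newCurrentState;
    }

    public static void main(String[] args) {
        // Without the guard, this call would throw NullPointerException.
        System.out.println(getState(rootState, 'x') == rootState);
    }
}
```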
Hello,
I tried using the algorithm to scan for 6M values, but I got a Java out-of-heap-space error.
java.lang.OutOfMemoryError: Java heap space
What changes would be needed to support Aho-Corasick at scale?
Could augmenting the heap space of the JVM be sufficient, or is a bigger refactoring needed?
Are there any alternatives (either AC or another algorithm) that could scale well, maybe distributed?
Thanks for any pointers to a possible solution.
Hi,
I noticed that if you call the caseInsensitive() configuration method after the addKeyword(...) method, then trie.firstMatch() and anything else that returns an Emit will always return null for those keywords added before calling the case-insensitive configuration method. This can be avoided by calling caseInsensitive() before addKeyword(...), as in your examples, but I think you should make it clear somewhere in the documentation that the order matters for caseInsensitive(), or change the code to check for different possible orderings during the build() process. The other configuration methods, like onlyWholeWords(), work fine regardless of when they are invoked.
For example, the following returns null for emit:
Trie trie = Trie.builder()
.addKeyword(...)
.caseInsensitive()
.build();
Emit emit = trie.firstMatch(...);
But this works fine:
Trie trie = Trie.builder()
.caseInsensitive()
.addKeyword(...)
.build();
Emit emit = trie.firstMatch(...);
Apologies, maybe bad use of 'issues' for this, but is it possible to emit non-matches only?
Example: match list is : {"foo", "bar", "blah"}
Sentence is: "this is foo xx yyy bar aaa blah"
will emit
{this is xx yyy aaa}
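The complement can be computed from the emits alone, without changing the library (a standalone sketch; the match offsets below are hard-coded to stand in for real Emit start/end values):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: walk the sorted match intervals and collect the text between them.
public class NonMatchSketch {
    public static void main(String[] args) {
        String text = "this is foo xx yyy bar aaa blah";
        // Stand-ins for emits: {start, end-exclusive} of "foo", "bar", "blah".
        int[][] matches = {{8, 11}, {19, 22}, {27, 31}};

        List<String> nonMatches = new ArrayList<>();
        int pos = 0;
        for (int[] m : matches) {
            if (m[0] > pos) {
                nonMatches.add(text.substring(pos, m[0]).trim()); // gap before this match
            }
            pos = m[1];
        }
        if (pos < text.length()) {
            nonMatches.add(text.substring(pos).trim()); // tail after the last match
        }
        nonMatches.removeIf(String::isEmpty);

        System.out.println(nonMatches);
    }
}
```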
Hi robert,
Is it possible to add a feature or another method to stop searching the trie once any keyword is found?
I have a requirement where I only need to check whether a match exists or not.
It would be more efficient to have this function; I think it would be easy to add and would make the library stronger. :)
Thanks
It would be illuminating to compare various Trie implementations to see if there can be any significant performance gains to be made. This would require creating a common Trie interface that can be used, independent of the underlying implementation.
When the tokenize method is used, a single token must report whether it is a whole word or not.
The definition for whole word in Trie must be reused.
Hi Robert,
Thanks for your help. I fixed the duplicate problem by using a static initializer. The code is shown below:
public class AhoCorasickMethod {
private static Trie trie;
private static Boolean alreadyExecuted = false;
static {
trie = MssConst.METHODS.get(MssConst.METHOD_MATCHEST);
}
public static void init(String keyword) {
if (!alreadyExecuted) {
trie.addKeyword(keyword);
alreadyExecuted = true;
}
}
public AhoCorasickMethod() {
}
public Collection<Emit> ahoCorasickMethod() {
init("community");
return trie.parseText("anjuke, communityparts");
}
}
However, one user may add repeated keywords, and then ahocorasick's results will contain duplicates.
So I suggest that in State.java the type of emits be changed from List to Set:
private List<String> emits = null;
becomes
private Set<String> emits = null;
That's my suggestion; I hope it helps.
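The difference is easy to demonstrate with plain JDK collections (a standalone sketch; LinkedHashSet is used so insertion order is preserved while duplicates are dropped):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupEmitsSketch {
    public static void main(String[] args) {
        // With a List, adding the same keyword twice stores it twice...
        List<String> emitsAsList = new ArrayList<>();
        emitsAsList.add("community");
        emitsAsList.add("community");
        System.out.println(emitsAsList.size());

        // ...while a Set deduplicates automatically.
        Set<String> emitsAsSet = new LinkedHashSet<>();
        emitsAsSet.add("community");
        emitsAsSet.add("community");
        System.out.println(emitsAsSet.size());
    }
}
```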
Hi Robert,
Thanks for your help. I have constructed a Trie with keywords upfront, and when calling the parseText() API on the constructed Trie in a multithreaded environment I am getting a NullPointerException. Could you please help us by providing thread-safe code?
I tried using the code, but unfortunately versions newer than 0.4.0 do not seem to be available through the Maven Central repository anymore.
Would it be possible to push the artifacts to Maven Central for 0.6.0 and future releases again?
I used your algorithm to compare its results with my suite of algorithms, and differences occurred.
I cannot say whether they are related to wrong usage. You may reproduce the problem
by checking out https://github.com/almondtools/stringbench (just to have the right dependencies)
and using the following unit test:
@Test
public void testACAhoCorasick1() {
MultiPatternMatcherBenchmark benchmark = new ACAhoCorasickBenchmark();
benchmark.setup(createMultiPatternSample(2, 8, 8));
benchmark.benchmarkFind();
}
private static MultiPatternSample createMultiPatternSample(int alphabet, int pattern, int patternNumber) {
try {
MultiPatternSample sample = new MultiPatternSample();
sample.setPatternNumber(patternNumber);
sample.setAlphabetSize(alphabet);
sample.setPatternSize(pattern);
sample.setup();
return sample;
} catch (IOException e) {
throw new AssertionError(e);
}
}
The problem of wrong results also appears with other sample/pattern combinations (most are documented there; search for IllegalStateException).
If the problem is part of the benchmark (or of the other algorithms), it would be kind to give feedback.