Code Monkey home page Code Monkey logo

trie4j's Introduction

Trie4J - various trie implementation for Java.

Trie4J is the sort of collection of various trie implementation.

You can get the binary using Maven:

<dependency>
    <groupId>com.github.takawitter</groupId>
    <artifactId>trie4j</artifactId>
    <version>0.9.10</version>
</dependency>

or from Central Repository

  • Coming release: nothing planned.
  • 2023/10/9: 0.9.10 released.
  • 2023/10/2: 0.9.9 released.
  • 2018/1/24: 0.9.8 released.
  • 2017/10/22: 0.9.7 released.
  • 2017/7/20: 0.9.6 released.
  • 2016/10/28: 0.9.5 released.
  • 2016/10/22: 0.9.4 released.
  • 2016/5/28: 0.9.3 released.

Performance comparison:

with 1.27 million words and 10.04 million chars contained in jawiki-20120220-all-titles-in-ns0.gz .
on MacOS X(10.7), Core i7 2.5GHz, Java 7 Update 21. 2013-5-14.

classnotesbuild(ms)contains(ms)used heap(MB)
java.util.HashSet404*1363126.2
java.util.TreeSet416*1235125.9
PatriciaTrie(src) Simple PATRICIA Trie.371*124790.8
TailPatriciaTrie(src) SuffixTrieTailBuilder(src) PATRICIA Trie with tail string. 937*124876.8
ConcatTailBuilder(src) 440*122869.2
DoubleArray(src) Simple Double Array Trie. 362*29852.6
TailDoubleArray(src) SuffixTrieTailBuilder(src) Double Array Trie with tail string. 3,111*218128.9
ConcatTailBuilder(src) 2,532*215433.6
TailLOUDSTrie(src) SuffixTrieTailBuilder(src) LOUDS Succinct Trie with tail string. 620*255415.4
ConcatTailBuilder(src) 111*253720.2
ConcatTailBuilder(src) with sbvTI*3 145*271215.7
TailLOUDSPPTrie(src) SuffixTrieTailBuilder(src) LOUDS++ Succinct Trie with tail string. 654*257115.4
ConcatTailBuilder(src) 119*255220.1
ConcatTailBuilder(src) with sbvTI*3 163*274115.6
*1 - build from string array.
*2 - build from other trie(org.trie4j.patricia.simple.PatriciaTrie).
*3 - shrinked index array using succinct bit vector.

Sample codes:

import org.trie4j.doublearray.DoubleArray;
import org.trie4j.louds.TailLOUDSTrie;
import org.trie4j.patricia.PatriciaTrie;

public class Sample {
	public static void main(String[] args) throws Exception{
		PatriciaTrie pat = new PatriciaTrie();
		pat.insert("Hello");
		pat.insert("World");
		pat.insert("Wonder");
		pat.insert("Wonderful!");
		pat.contains("Hello"); // -> true
		pat.predictiveSearch("Wo"); // -> {"Wonder", "Wonderful!", "World"} as Iterable<String>
		
		DoubleArray da = new DoubleArray(pat); // construct DoubleArray from existing Trie
		da.contains("World"); // -> true
		
		TailLOUDSTrie lt = new TailLOUDSTrie(pat); // construct LOUDS succinct Trie with ConcatTailBuilder(default)
		lt.contains("Wonderful!"); // -> true
		lt.commonPrefixSearch("Wonderful!"); // -> {"Wonder", "Wonderful!"} as Iterable<String>
	}
}

additional notes.

These classes are experimental and not contained in trie4j-SNAPSHOT.jar.

  • MultilayerPatriciaTrie: PatriciaTrie which has the pointer to another trie which store reversed tail string. (src)
  • optimizes size using Multilayer Trie but no significant improvement.
  • OptimizedTailDoubleArray: DoubleArray with Tail Array and some optimization (src)
  • feature completed but can't support large data (over several 10 thousants).

additional notes(ja).

2012年2月、1冊の本が発売されました。

"日本語入力を支える技術" 変わり続けるコンピュータと言葉の世界 (WEB+DB PRESS plus) 徳永 拓之 (著)

日本語入力を支える技術

多くのエンジニアがこの本に触発され、各種アルゴリズムの理解を深めたり、一から勉強を始めたり、 また中にはこれを機に様々なライブラリを実装し公開する人も出てきました。trie4jもそういったライブラリの一つで、各種トライ構造にターゲットを絞り、本書やその分野のブログなどを参考に実装されています。

ほとんどのクラスはシンプルな実装になっていますが、一部独自の最適化が入っています。また、各トライが提供するメソッドは、 極力中間オブジェクトを作らないようになっており、オブジェクト生成/破棄によるパフォーマンス低下を起こさないよう実装されています。

下記クラスは実験的実装で、trie4j-SNAPSHOT.jarには含まれません(src.kitchensinkにあります)。

  • 多層パトリシアトライ(MultilayerPatriciaTrie(src))

  • 多層トライの実験結果 - やた@はてな日記 を参考に、接尾辞を格納するトライを内包しサイズを最適化した実装です。また、子を持たないノード、子を一つだけ持つノード、それぞれの終端/非終端版と、様々な種類のノードを用意して 使い分けることで、極力無駄なメモリを使わないようにしています。但しパトリシアトライのままなので、あまり効率が上がっていません。

  • TAIL配列付き最適化ダブルアレイ(OptimizedTailDoubleArray(src))

    • 未使用領域の開放やcheck配列をshortにした。実装は完了していますが、大規模なデータ(数万レコード超)には対応できません。

trie4j's People

Contributors

dependabot[bot] avatar eliaslevy avatar miurahr avatar takawitter avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

trie4j's Issues

MapPatriciaTrie.predictiveSearchEntries returns wrong value

MapPatriciaTrie.predictiveSearchEntries seem to returns wrong value.
Please check below example.

    public static void main(String[] args) {
        MapPatriciaTrie<String> trie = new MapPatriciaTrie<>();
        System.out.println(trie.insert("A", "valueA"));
        System.out.println(trie.insert("AB", "valueAB"));
        System.out.println(trie.insert("ABC", "valueABC"));
        trie.freeze();

        System.out.println("=====");
        for (Map.Entry<String, String> e: trie.predictiveSearchEntries("A")) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }

    }
null
null
null
=====
A   valueA
AB  valueA    # Wrong value
ABC valueAB   # Wrong value

org.trie4j.util.TrieMap#entrySet fails with class-cast-exception

TrieMap.entrySet() will fails as it tries to use a TreeSet with incomparable elements. i fail to successfully build the project since i'm missing test data, but this would fix it:

public class TrieMap<T> extends AbstractMap<String, T> {

...

    @Override
    public Set<Entry<String, T>> entrySet() {
        Set<StringEntry<T>> ret = new TreeSet<StringEntry<T>>();
        for (final String s : trie.predictiveSearch("")) {
            ret.add(new StringEntry<>(s, trie.get(s)));
        }
        return Collections.unmodifiableSet(ret);
    }

    private static final class StringEntry<T> implements Map.Entry<String, T>, Comparable<StringEntry<T>> {
        private final String k;
        private final T v;

        StringEntry(String k, T v) {
            this.k = k;
            this.v = v;
        }

        @Override
        public int compareTo(StringEntry<T> o) {
            return k.compareTo(o.getKey());
        }

        @Override
        public String getKey() {
            return k;
        }

        @Override
        public T getValue() {
            return v;
        }

        @Override
        public T setValue(T value) {
            throw new UnsupportedOperationException();
        }
    }

...

}

Unnecessary check of negative

Hi, I found unnecessary check code in o.t.doublearray.DoubleArray and o.t.doublearray.TailDoubleArray``#getChild method.

   public DoubleArrayNode getChild(char c) {
            int code = charToCode[c];
            if (code == -1) return null;
            int nid = base[nodeId] + code;

Here charToCode is defined char[] and char is ranging from 0x0000 to 0xffff in language specification and never be negative.

I think it can be like

        public DoubleArrayNode getChild(char c) {
            int nid = base[nodeId] + charToCode[c];

I found it when I check the project by SpotBugs, the above code produce an error.

question

trie must be loaded all in the memory? is not possible to load a part of the trie in memory and a way to flush the trie motification in the file without to write all the trie ?

Values are lost when using MapPatriciaTrie

After initializing a MapPatriciaTrie in the following way:

MapTrie<Integer> mapTrie = new MapPatriciaTrie<Integer>();

mapTrie.insert("Test", 1);
mapTrie.insert("Tes", 4);

the following code returns null:

mapTrie.get("Test")

I believe it's because you are not setting the value of newChild on lines 164-166 of MapPatriciaTrie.

Small size improvement for DoubleArray.

Currently, DoubleArray tries use base and check array as is created during build.
So arrays may contain unused area because that were extended by 1.5 each time that is extended.
Resizing arrays at the end of build process can reduce memory footprint.

Exception when looking up particular strings in MapDoubleArray

The following produces an ArrayIndexOutOfBoundsException:

MapPatriciaTrie<Object> mpt = new MapPatriciaTrie<>();
mpt.insert("FOO");
mpt.get("F");
mpt.get("f");

MapDoubleArray<Object> mda = new MapDoubleArray<>(mpt);
mda.get("F");
mda.get("f"); // exception here
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
    at org.trie4j.bv.BytesRank1OnlySuccinctBitVector.isOne(BytesRank1OnlySuccinctBitVector.java:90)
    at org.trie4j.bv.BytesRank1OnlySuccinctBitVector.get(BytesRank1OnlySuccinctBitVector.java:80)
    at org.trie4j.doublearray.DoubleArray.getTermId(DoubleArray.java:243)
    at org.trie4j.AbstractTermIdMapTrie.get(AbstractTermIdMapTrie.java:141)

I would expect this to simply return null rather than throw an exception. My use case is (natural-language) dictionary lookup with mixed-case keys.

Question: unclear procedure in method

I found a strange condition in bv.LongsRank0OnlySussinctBitVector.java#select0 Line 189

    @Override
    public int select0(int count) {
        for (int i = 0; i < longs.length; i++) {
            if (i * BITS_IN_BLOCK >= size) return -1;
            long v = longs[i];
            int c = BITS_IN_BLOCK - Long.bitCount(v);
            if (count <= c) {
                for (int j = 0; j < BITS_IN_BLOCK; j++) {
                    if (i * BITS_IN_BLOCK + j >= size) return -1;
                    if ((v & 0x8000000000000000L) != 1) {
                        count--;
                        if (count == 0) {
                            return i * BITS_IN_BLOCK + j;
                        }
                    }
                    v <<= 1;
                }
                return -1;
            }
            count -= c;
        }
        return -1;
    }

A if condition if ((v & 0x8000000000000000L) != 1) always true so a for loop will run until j * BIT_IN_BLOCK exceed size or exhausted count to 0. Is this intended?

I think it may be if ((v & 0x8000000000000000L) == 0).

Similar thing is also in org.trie4j.bv.LongsRank1OnlySuccinctBitVector.select0 line 189.

ArrayIndexOutOfBoundsException when traversing nodes of DoubleArray

When traversing nodes of DoubleArray which is built from PatriciaTrie that contains wikipedia japanese titles, the ArrayIndexOutOfBoundsException occurred at the line 73 of Rank1OnlySuccinctBitVector (inside "isOne" method).

        Algorithms.traverseByBreadth(trie.getRoot(), new NodeVisitor() {
            @Override
            public boolean visit(Node node, int nest) {
                return true;
            }
        });

feature request: Support of Java Platform Module System (JPMS)

Java 9 has new feature JPMS that was also known as Project jigsaw.
Because trie4j library has no dependency, it can be a leaf of software book of materials, it is valuable to support JPMS for me, at dependeents projects such as mdict4j, stardict4j.

see.
https://central.sonatype.com/artifact/com.github.takawitter/trie4j/0.9.8/dependents

You can support JPMS with Multi-Release jar, even when target Java is not 9.
see https://nipafx.dev/multi-release-jars-multiple-java-versions/

Refactor Test code

Test codes contain a lot of duplications and inefficient inheritance structure.

`o.t.io.TrieWriter#writeSuccinctBitVector` don't have enough `if` conditions

I found that there are not enough if conditions for o.t.io.TrieWriter#writeSuccinctBitVector method.

Argument SuccinctBitVector sbv is checked by instanceof

  • BytesSuccinctBitVector
  • BytesRank0OnlySuccinctBitVector
  • BytesRank0OnlySuccinctBitVector
  • BytesRank1OnlySuccinctBitVector
  • LongsSuccinctBitVector

but it also can be

  • LongsRank0OnlySuccinctBitVector
  • LongsRank1OnlySuccinctBitVector

that become IOException.

I observed it when running TrieWriterTest with a Japanese wikipedia title data at 20230901.

use bytebuffer abstraction instead of byte[]

Hey,
I would love it if trie4j could use ByteBuffer instead of byte[]. This could be useful for avoiding heap space for large static tries, by using DirectByteBuffer, MappedByteBuffer etc..

I only looked at the code briefly, and found that it could be quite simple to modify BytesSuccinctBitVector.java to use ByteBuffer. Wanted to know if there are more places I should be looking at?

WDYT?

Thanks!

ArrayIndexOutOfBoundsException on lookup in MapDoubleArray

Hi. I am using trie4j to store natural-language dictionary entries. With a particular dictionary index and a particular lookup word I am getting an ArrayIndexOutOfBoundsException on MapDoubleArray.get(). Sample code:

MapTrie<Object> mpt = new MapPatriciaTrie<>();
Files.lines(Paths.get("dicttest-Harrap_s_Business_fran_ais_ang.idx.txt"))
        .forEach(word -> mpt.insert(word, null));
MapDoubleArray<Object> da = new MapDoubleArray<>(mpt);
System.out.println(mpt.get("you")); // OK
System.out.println(da.get("you")); // Throws exception

The test file loaded above is available here: https://gist.github.com/amake/ba10166a7bf59ffac180defcecbec8d4

The lookup word "you" is not contained in the index, so I expect both get() calls to return null; however the second one results in this exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 128198
    at org.trie4j.doublearray.DoubleArray.getNodeId(DoubleArray.java:212)
    at org.trie4j.doublearray.DoubleArray.getTermId(DoubleArray.java:220)
    at org.trie4j.AbstractTermIdMapTrie.get(AbstractTermIdMapTrie.java:154)
    at test.main(Test.java:13)

This occurs with version 0.9.4.

Target Java version

Why not update target Java version to Java 8?
We know Java 21 is just released in last September, and Java 1.7 has been out of support.
Contributors can use Java 8 features such as NIO2 when update.

ArrayIndexOutOfBoundsException in DoubleArray

A DoubleArray trie with input strings of

php.a
php.e
php.o
e
php.elu
php.s
php.x

results in

java.lang.ArrayIndexOutOfBoundsException: -1
	at org.trie4j.doublearray.DoubleArray.commonPrefixSearchWithTermId(DoubleArray.java:313)

when you call commonPrefixSearchWithTermId with the value php.ele.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.