
sqlite-fts-python's People

Contributors

hideaki-t, renovate[bot]


sqlite-fts-python's Issues

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

github-actions
.github/workflows/package.yml
  • actions/checkout v4
  • actions/setup-python v5
pep621
pyproject.toml
  • cffi >=1.15.1
pyenv
.python-version
  • python 3.12


Changes to bm25

Hi, I noticed that you used the bm25 implementation from peewee. I've made some improvements that you might be interested in. The function now accepts weights for each column on the model, so if you've indexed rows with title, content, and tags columns, you can specify that matches in the title are worth more than matches in the other fields.

You can find the code here.

I've also implemented these in Cython as a SQLite C extension, which you can find here.

I've similarly updated the simpler rank() function to accept weight values for the columns.
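For readers skimming this issue, here is a minimal sketch in the spirit of that change (my own illustration with hypothetical names, not the linked peewee code): a rank function that takes one weight per column, parses the default 'pcx' matchinfo() blob, and is registered on a plain sqlite3 connection (assuming your Python's bundled SQLite has FTS4).

import sqlite3
import struct

def _parse_match_info(buf):
    # matchinfo() returns a blob of native 32-bit unsigned ints ('pcx' format)
    return [struct.unpack('@I', buf[i:i + 4])[0] for i in range(0, len(buf), 4)]

def weighted_rank(raw_match_info, *weights):
    # weights: one number per indexed column; unspecified columns default to 1.0
    info = _parse_match_info(raw_match_info)
    phrases, cols = info[0], info[1]
    score = 0.0
    for phrase in range(phrases):
        base = 2 + phrase * cols * 3
        for col in range(cols):
            hits_this_row = info[base + col * 3]
            hits_all_rows = info[base + col * 3 + 1]
            if hits_this_row:
                weight = weights[col] if col < len(weights) else 1.0
                score += weight * hits_this_row / hits_all_rows
    return -score  # negative so that ORDER BY puts the best match first

conn = sqlite3.connect(':memory:')
conn.create_function('weighted_rank', -1, weighted_rank)
conn.execute('CREATE VIRTUAL TABLE docs USING FTS4(title, content, tags)')
conn.execute('INSERT INTO docs VALUES (?, ?, ?)',
             ('python sqlite', 'full text search', 'fts'))
rows = conn.execute(
    "SELECT title, weighted_rank(matchinfo(docs), 2.0, 1.0, 0.5) AS score "
    "FROM docs WHERE docs MATCH 'sqlite' ORDER BY score").fetchall()

With weights 2.0, 1.0, 0.5, a hit in title counts twice as much as one in content and four times as much as one in tags.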

try DragonFFI

The binary module is big (about 30 MB), but no C compiler is needed.

cannot enable 2-arg fts3_tokenizer on SQLite 3.20.0

The internal flag value was changed in 3.20.0 (and it seems the value was later changed back to the previous value).

It is a private value defined in sqliteInt.h, so it is better to avoid relying on it.
The 2-argument fts3_tokenizer should be enabled through the public API (sqlite3_db_config), as this module did before.
The change may break compatibility with APSW built with its amalgamation method.
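For reference, a minimal cffi sketch of the public-API route (an illustration, not this module's actual code): SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER, value 1004 in sqlite3.h, turns the 2-argument fts3_tokenizer() on per connection without touching the private flag from sqliteInt.h.

import cffi

ffi = cffi.FFI()
ffi.cdef("""
typedef struct sqlite3 sqlite3;
int sqlite3_open(const char *filename, sqlite3 **ppDb);
int sqlite3_db_config(sqlite3 *db, int op, ...);
int sqlite3_close(sqlite3 *db);
""")
lib = ffi.dlopen("sqlite3")  # adjust the library name for your platform

SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER = 1004  # from sqlite3.h

pdb = ffi.new("sqlite3 **")
assert lib.sqlite3_open(b":memory:", pdb) == 0
db = pdb[0]

old = ffi.new("int *")
# enable the 2-argument fts3_tokenizer(); 'old' receives the previous setting
rc = lib.sqlite3_db_config(db, SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER,
                           ffi.cast("int", 1), old)
assert rc == 0  # non-zero would mean the linked SQLite does not know this opcode
lib.sqlite3_close(db)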

test_mecab.py::test_match failed

Not sure what happened, but it seems to be failing. Python 3.7 + MeCab worked before.

test_mecab.py::test_createtable PASSED
test_mecab.py::test_insert PASSED
test_mecab.py::test_match FAILED
test_mecab.py::test_tokenizer_output From cffi callback <function make_tokenizer_module.<locals>.xclose at 0x7f0163659730>:
Traceback (most recent call last):
  File "/home/hideaki/sqlite-fts-python/tests/jajp_common.py", line 70, in test_tokenizer_output
    assert e == a
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hideaki/sqlite-fts-python/.tox/py37-linux/lib/python3.7/site-packages/sqlitefts/fts3.py", line 132, in xclose
    tk = ffi.from_handle(pCursor.pTokenizer.t)
  File "/home/hideaki/sqlite-fts-python/.tox/py37-linux/lib/python3.7/site-packages/cffi/api.py", line 540, in from_handle
    return self._backend.from_handle(x)
SystemError: <built-in function from_handle> returned a result with an error set
ERROR: InvocationError for command /home/hideaki/sqlite-fts-python/.tox/py37-linux/bin/py.test -svrx (exited with code -11)

Pre-release improvement

There are some changes that I think should be addressed before the package's Cheese Shop (PyPI) release.

Result ranking

There's no full-text result-set ranking function available out of the box in SQLite. I think it makes sense to extend the scope of the package to address ranking, as it is squarely a topic of both "sqlite" and "fts".

All the code is already out there. There's the article; even though it's about the MIT-licensed package peewee, the code can be easily extracted. Here's a gist with a module and a test case for it.

Because BM25 is a general, language-independent ranking function, its presence in the package makes it more complete.

Minimum documentation

The README should be written to give an overview and cover the basics. I can assist with it.

Recipes for integrating tokenizers for major domains (CJK, Cyrillic, etc.) would also be a good idea.

Minor

An underscore is undesirable in a Python module name, so I suggest renaming sqlite_tokenizer.py. The "sqlite" part is the obvious context. tokenizer.py is better but still not great, since it is not informative: the module doesn't provide a real tokenizer per se, but rather a binding to register one. binding.py may be a better name, though you can try to coin a better one.

Make user-facing symbols available from __init__.py so that import sqlitefts is sufficient.

In setup.py, url points to another package, and "Operating System :: POSIX :: Linux" seems redundant alongside "Operating System :: OS Independent".

Using cffi seems to cause errors in a program with complex GC interaction

Hi, I use this project together with sqlite_fts4 to provide a custom tokenizer and ranking function for a map of SQLite engines.
The main interaction between SQLite and Python is register_functions (defined in sqlite_fts4) and your register_tokenizer, which plug in the code as your example shows.
I tried my Chinese tokenizer locally with one engine as follows:
import sqlite3

import jieba
import sqlitefts as fts


class JiebaTokenizer(fts.Tokenizer):
    def tokenize(self, text):
        for t, s, e in jieba.tokenize(text):
            # sqlitefts expects byte offsets into the UTF-8 encoded input
            l = len(t.encode("utf-8"))
            p = len(text[:s].encode("utf-8"))
            yield t, p, p + l


contents = [("これは日本語で書かれています",),
            (" これは 日本語の文章を 全文検索するテストです",),
            ("新兴铸管",)]

conn = sqlite3.connect(":memory:")
tkj = fts.make_tokenizer_module(JiebaTokenizer())
fts.register_tokenizer(conn, "jieba_tokenizer", tkj)

conn.execute("CREATE VIRTUAL TABLE fts USING FTS4(tokenize={})".format("jieba_tokenizer"))

c = conn
r = c.executemany("INSERT INTO fts VALUES(?)", contents)

r = c.execute("SELECT * FROM fts").fetchall()

r = c.execute("SELECT * FROM fts WHERE fts MATCH '新兴'").fetchall()
The last r produces the expected result.

My problem is that when I use it in a dictionary of engines (key: name, value: engine) with some more complex interaction (registration),
it yields the following error in gdb:

Program received signal SIGSEGV, Segmentation fault.
0x0000555555690253 in delete_garbage.isra.26 (
old=0x5555558c7540 <_PyRuntime+416>, collectable=0x7fffffffda30)
at /tmp/build/80754af9/python_1599203911753/work/Modules/gcmodule.c:948
948 /tmp/build/80754af9/python_1599203911753/work/Modules/gcmodule.c: No such file or directory.

This seems to be an error caused by cffi; related questions:
https://stackoverflow.com/questions/43079945/why-is-there-a-segmentation-fault-with-this-code
https://stackoverflow.com/questions/41577144/how-to-solve-a-sqlite-fts-segmentation-fault-in-python

Some say cffi has problems with nested objects and that replacing cffi with pybind11 solves this kind of problem. Could you give me some suggestions?
If needed, I will upload the whole code so the error can be reproduced.
Thank you.

no such function fts3_tokenizer

The following is the output I got when I executed my code. I compiled with the flag -DSQLITE_ENABLE_FTS3_TOKENIZER, but I still get this error.

 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
 * Restarting with stat
connection to cursor
Traceback (most recent call last):
  File "app.py", line 59, in <module>
    tokenize()
  File "app.py", line 23, in tokenize
    fts.register_tokenizer(c, 'oulatin', fts.make_tokenizer_module(OUWordTokenizer('latin')))
  File "/usr/lib/python3.6/site-packages/sqlitefts/tokenizer.py", line 191, in register_tokenizer
    r = c.execute('SELECT fts3_tokenizer(?, ?)', (name, address_blob))
  File "src/cursor.c", line 1019, in APSWCursor_execute.sqlite3_prepare
  File "src/statementcache.c", line 386, in sqlite3_prepare
apsw.SQLError: SQLError: no such function: fts3_tokenizer
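One thing worth checking in a case like this (a hedged suggestion, not a confirmed diagnosis): the SQLite that APSW actually links against is often not the copy you compiled yourself, so the compile flag may simply be absent at runtime. A quick probe:

import apsw  # the traceback above uses APSW; the stdlib sqlite3 module works the same way

conn = apsw.Connection(":memory:")
opts = [row[0] for row in conn.cursor().execute("PRAGMA compile_options")]
print("ENABLE_FTS3_TOKENIZER" in opts)

If this prints False, the error is expected regardless of how you compiled your own copy of SQLite.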

FTS5_TOKEN_COLOCATED

The FTS5 tokenizer API has the concept of "colocated" tokens, where multiple tokens can occupy the same position in a sentence. The main use of this functionality is to implement synonyms (see Section 7.1.1 of the FTS5 documentation).

Is there any way to mark a token as colocated through the Python API?

Query with double quotes and OR operator

Hello!
A query with the OR operator does not work if double quotes are used.
test_base.py:

r = c.execute(
    "SELECT * FROM docs WHERE docs MATCH '\"binding\" OR \"あいうえお\"'"
).fetchall()
assert len(r) == 2
r = c.execute(
    "SELECT * FROM docs WHERE docs MATCH '\"provides binding\" OR あいうえお'"
).fetchall()
assert len(r) == 2

Does not work with PyPy

Perhaps allocated objects are being collected by the GC.
Needs more investigation.
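One hedged hypothesis worth ruling out here (this is the standard cffi keep-alive rule, not a confirmed diagnosis of this issue): the cdata returned by ffi.new_handle() must stay referenced on the Python side for as long as the C side holds the corresponding void pointer, and PyPy's GC may reclaim it sooner than CPython's reference counting would.

import cffi

ffi = cffi.FFI()

class MyTokenizer:
    pass

tok = MyTokenizer()
handle = ffi.new_handle(tok)   # void* that C code (e.g. SQLite) will store

# As long as 'handle' is kept alive, the round trip is safe:
assert ffi.from_handle(handle) is tok

# If 'handle' becomes unreachable and is collected, ffi.from_handle() on the
# stored pointer is undefined behaviour, typically a segfault. Keeping an
# explicit reference (for example in a module-level list) avoids that.
_keepalive = [handle]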

Wrong xNext behaviour

First of all, the proof of concept is good, as it takes care of all the hard (for a Pythonista) ctypes stuff behind the scenes. Performance-wise it is yet to be tested, but it is helpful anyway. Playing with it, I've stumbled over a couple of bugs.

Bugs

Infinite recursion in MATCH

The quote from fts3_tokenizer.h about xNext:

The input text that generated the token is
identified by the byte offsets returned in *piStartOffset and
*piEndOffset. *piStartOffset should be set to the index of the first
byte of the token in the input buffer. *piEndOffset should be set
to the index of the first byte just past the end of the token in
the input buffer.

This says that the offsets refer to the input text, not the normalized token. What consequence does that have?

import sqlite3
import unittest

import sqlitefts as fts


class Tokenizer(fts.Tokenizer):

  def tokenize(self, text):
    # deliberately strip the last character of each word, so the normalized
    # token differs from the input text that produced it
    return (w[0:-1] for w in text.split(' '))

class TestCase(unittest.TestCase):

  def setUp(self):
    name = 'test'
    conn = sqlite3.connect(':memory:')

    fts.register_tokenizer(conn, name, fts.make_tokenizer_module(Tokenizer()))
    conn.execute('CREATE VIRTUAL TABLE fts USING FTS4(tokenize={})'.format(name))

    self.testee = conn

  def testInfiniteRecursion(self):
    contents = [('abc def',), ('abc xyz',)]
    result = self.testee.executemany('INSERT INTO fts VALUES(?)', contents)
    self.assertEqual(2, result.rowcount)

    result = self.testee.execute("SELECT * FROM fts WHERE fts MATCH 'abc'").fetchall()
    self.assertEqual(2, len(result))

The test case leads to infinite recursion when executing the SELECT query. It doesn't, however, on INSERT.

Empty normalized token

If the normalized token is an empty string, it should not be returned from xNext; rather, processing should advance to the next token. The following fails with Error: SQL logic error or missing database.

class TestCase(unittest.TestCase):

  def testZeroLengthToken(self):
    result = self.testee.executemany('INSERT INTO fts VALUES(?)', [('Make things I',)])
    self.assertEqual(1, result.rowcount)

Suggested changes

For the first bug, I suggest also returning the begin and end indices of the input (pre-normalized) token, i.e. (normalizedToken, inputBeginIndex, inputEndIndex). For the second bug, I suggest skipping empty tokens in xNext. Here's the gist with the patch and a test case.
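To make that contract concrete, here is a hedged sketch of a tokenizer that yields (normalizedToken, inputBeginIndex, inputEndIndex) with byte offsets into the original input and never emits an empty normalized token (my own illustration; the actual patch is in the gist):

import re

import sqlitefts as fts


class OffsetAwareTokenizer(fts.Tokenizer):
    pattern = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text):
        for m in self.pattern.finditer(text):
            token = m.group(0).lower()  # whatever normalization you choose
            if not token:
                continue  # never hand an empty token back to xNext
            # offsets are byte offsets into the UTF-8 input, not character offsets
            begin = len(text[:m.start()].encode('utf-8'))
            end = begin + len(m.group(0).encode('utf-8'))
            yield token, begin, end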

Unicode Error for single word queries on FTS3

All the advanced queries in my application work, except for single-word searches like SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'possumus'; and OR searches like SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'quam OR Galliae';

The application exits at the line cursor.fetchall() with the following error when queries similar to the ones above are run:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 105: invalid continuation byte
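A hedged observation, not a confirmed diagnosis: snippet() works with the byte offsets reported by the tokenizer, so a tokenizer that reports character offsets can make it cut inside a multi-byte UTF-8 sequence, which is exactly what an "invalid continuation byte" error looks like. While the offsets are being checked, a lenient text_factory keeps fetchall() from raising:

import sqlite3

conn = sqlite3.connect('books.db')  # hypothetical path for illustration
# decode with errors='replace' instead of raising UnicodeDecodeError
conn.text_factory = lambda data: data.decode('utf-8', errors='replace')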

Doesn't work with SQLite 3.37

I tried using it on macOS with SQLite compiled with the FTS5 extension built in:

sqlite --version
3.37.0 2021-12-09 01:34:53 9ff244ce0739f8ee52a3e9671adb4ee54c83c640b02e3f9d185fd2f9a179aapl

sqlite> PRAGMA compile_options;
┌──────────────────────────────────┐
│         compile_options          │
├──────────────────────────────────┤
│ ATOMIC_INTRINSICS=1              │
│ BUG_COMPATIBLE_20160819          │
│ COMPILER=clang-13.1.6            │
│ DEFAULT_AUTOVACUUM               │
│ DEFAULT_CACHE_SIZE=2000          │
│ DEFAULT_CKPTFULLFSYNC            │
│ DEFAULT_FILE_FORMAT=4            │
│ DEFAULT_JOURNAL_SIZE_LIMIT=32768 │
│ DEFAULT_LOOKASIDE=1200,102       │
│ DEFAULT_MEMSTATUS=0              │
│ DEFAULT_MMAP_SIZE=0              │
│ DEFAULT_PAGE_SIZE=4096           │
│ DEFAULT_PCACHE_INITSZ=20         │
│ DEFAULT_RECURSIVE_TRIGGERS       │
│ DEFAULT_SECTOR_SIZE=4096         │
│ DEFAULT_SYNCHRONOUS=2            │
│ DEFAULT_WAL_AUTOCHECKPOINT=1000  │
│ DEFAULT_WAL_SYNCHRONOUS=1        │
│ DEFAULT_WORKER_THREADS=0         │
│ ENABLE_API_ARMOR                 │
│ ENABLE_BYTECODE_VTAB             │
│ ENABLE_COLUMN_METADATA           │
│ ENABLE_DBPAGE_VTAB               │
│ ENABLE_DBSTAT_VTAB               │
│ ENABLE_EXPLAIN_COMMENTS          │
│ ENABLE_FTS3                      │
│ ENABLE_FTS3_PARENTHESIS          │
│ ENABLE_FTS3_TOKENIZER            │
│ ENABLE_FTS4                      │
│ ENABLE_FTS5                      │
│ ENABLE_JSON1                     │
│ ENABLE_LOCKING_STYLE=1           │
│ ENABLE_NORMALIZE                 │
│ ENABLE_PREUPDATE_HOOK            │
│ ENABLE_RTREE                     │
│ ENABLE_SESSION                   │
│ ENABLE_SNAPSHOT                  │
│ ENABLE_SQLLOG                    │
│ ENABLE_STMT_SCANSTATUS           │
│ ENABLE_UNKNOWN_SQL_FUNCTION      │
│ ENABLE_UPDATE_DELETE_LIMIT       │
│ HAVE_ISNAN                       │
│ MALLOC_SOFT_LIMIT=1024           │
│ MAX_ATTACHED=10                  │
│ MAX_COLUMN=2000                  │
│ MAX_COMPOUND_SELECT=500          │
│ MAX_DEFAULT_PAGE_SIZE=8192       │
│ MAX_EXPR_DEPTH=1000              │
│ MAX_FUNCTION_ARG=127             │
│ MAX_LENGTH=2147483645            │
│ MAX_LIKE_PATTERN_LENGTH=50000    │
│ MAX_MMAP_SIZE=1073741824         │
│ MAX_PAGE_COUNT=1073741823        │
│ MAX_PAGE_SIZE=65536              │
│ MAX_SQL_LENGTH=1000000000        │
│ MAX_TRIGGER_DEPTH=1000           │
│ MAX_VARIABLE_NUMBER=500000       │
│ MAX_VDBE_OP=250000000            │
│ MAX_WORKER_THREADS=8             │
│ MUTEX_UNFAIR                     │
│ OMIT_AUTORESET                   │
│ OMIT_LOAD_EXTENSION              │
│ STMTJRNL_SPILL=131072            │
│ SYSTEM_MALLOC                    │
│ TEMP_STORE=1                     │
│ THREADSAFE=2                     │
│ USE_URI                          │
└──────────────────────────────────┘

I am registering a custom tokenizer as follows:

import sqlite3
from contextlib import asynccontextmanager

import aiosqlite
from aiosqlite import Connection
from sqlitefts import fts5


@asynccontextmanager
async def make_conn() -> Connection:
    async with aiosqlite.connect('test.db',
                                 detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES,
                                 check_same_thread=False) as db:
        try:
            await db.enable_load_extension(True)
            pragmas = await db.execute_fetchall('pragma compile_options;')
            if ('ENABLE_FTS5',) not in pragmas:
                await db.load_extension('fts5')
            await db.load_extension('mod_spatialite')
            await db.execute('pragma journal_mode=WAL')
            # hacky... but shouldn't be used really...
            # we need to register it in async manner...
            fts5.register_tokenizer(db._connection, 'croatian_generic', custom_tokenizer_module)
            db.row_factory = aiosqlite.Row
            yield db
            await db.commit()
        except Exception as e:
            await db.execute('rollback')
            raise e

But I get the following error:

Traceback (most recent call last):
  File "/Users/.../Projects/.../.../sqlite-testing/src/db.py", line 32, in make_conn
    fts5.register_tokenizer(db._connection, 'croatian_generic', cro_tokenizer_mod)
  File "/Users/.../Projects/.../.../sqlite-testing/.venv/lib/python3.10/site-packages/sqlitefts/fts5.py", line 171, in register_tokenizer
    r = fts5api.xCreateTokenizer(fts5api,
AttributeError: cdata 'fts5_api * *' has no attribute 'xCreateTokenizer'

Reproduced the same issue in a full working example without the aiosqlite module:

import re
import sqlite3

from sqlitefts import fts5

class CroatianTokenizer(fts5.FTS5Tokenizer):
    pattern = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text, flags=None):
        for match in self.pattern.finditer(text):
            start, end = match.span()
            token = text[start:end]
            length = len(token.encode('utf-8'))
            position = len(text[:start].encode('utf-8'))
            print(f'token: {token} {position} {position + length}')
            yield token, position, position + length

cro_tokenizer_mod = fts5.make_fts5_tokenizer(CroatianTokenizer())

with sqlite3.connect('simple.db') as db:
    fts5.register_tokenizer(db, 'croatian_generic', cro_tokenizer_mod)
    db.execute("create virtual table test using fts5(a,b,c,tokenize='croatian_generic')")
    data = [
        ('a b c d e', 'f g h i j', 'k l m n o'),
        ('a b c d e', 'f g h i j', 'k l m n o'),
        ('a b c d e', 'f g h i j', 'k l m n o'),
        ('a b c d e', 'f g h i j', 'k l m n o')
    ]
    db.executemany('insert into test values(?, ?, ?)', data)
    cur = db.cursor()  # sqlite3.Cursor is not a context manager
    for row in cur.execute('select * from test where test match ?', ('a',)):
        print(row)

EDIT: I suspect it might have to do with breaking changes introduced in 3.20 (see point 3).
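On that suspicion, for reference, here is a hedged cffi sketch of the lookup protocol SQLite documents for 3.20 and later (an illustration of the documented C recipe, not the code sqlitefts actually runs): the fts5_api pointer is obtained by preparing SELECT fts5(?1) and binding the output location with sqlite3_bind_pointer() and the type string "fts5_api_ptr".

import cffi

ffi = cffi.FFI()
ffi.cdef("""
typedef struct sqlite3 sqlite3;
typedef struct sqlite3_stmt sqlite3_stmt;
typedef struct fts5_api fts5_api;
int sqlite3_open(const char *filename, sqlite3 **ppDb);
int sqlite3_prepare_v2(sqlite3 *db, const char *zSql, int nByte,
                       sqlite3_stmt **ppStmt, const char **pzTail);
int sqlite3_bind_pointer(sqlite3_stmt *stmt, int i, void *p,
                         const char *type, void (*destructor)(void *));
int sqlite3_step(sqlite3_stmt *stmt);
int sqlite3_finalize(sqlite3_stmt *stmt);
int sqlite3_close(sqlite3 *db);
""")
lib = ffi.dlopen("sqlite3")  # adjust for your platform / the SQLite you target

NO_DESTRUCTOR = ffi.cast("void(*)(void *)", 0)


def fts5_api_from_db(db):
    """Return an fts5_api* (NULL if FTS5 is unavailable), per the >=3.20 recipe."""
    papi = ffi.new("fts5_api **")
    pstmt = ffi.new("sqlite3_stmt **")
    if lib.sqlite3_prepare_v2(db, b"SELECT fts5(?1)", -1, pstmt, ffi.NULL) == 0:
        lib.sqlite3_bind_pointer(pstmt[0], 1, papi, b"fts5_api_ptr", NO_DESTRUCTOR)
        lib.sqlite3_step(pstmt[0])
    lib.sqlite3_finalize(pstmt[0])
    return papi[0]


pdb = ffi.new("sqlite3 **")
assert lib.sqlite3_open(b":memory:", pdb) == 0
api = fts5_api_from_db(pdb[0])
print("fts5_api found" if api != ffi.NULL else "no FTS5 in this SQLite")
lib.sqlite3_close(pdb[0])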
