simon987 / sist2

Lightning-fast file system indexer and search tool

License: GNU General Public License v3.0

CMake 0.50% C 43.04% Python 7.69% Shell 0.58% JavaScript 16.75% HTML 0.16% Dockerfile 0.38% Vue 30.70% Makefile 0.04% C++ 0.16%

elasticsearch c sqlite vuejs

sist2's Introduction

Demo: sist2.simon987.net

Community URL: Discord

sist2

sist2 (Simple incremental search tool)

Warning: sist2 is in early development

Features

Fast, low memory usage, multi-threaded
Manage & schedule scan jobs with simple web interface (Docker only)
Mobile-friendly Web interface
Extracts text and metadata from common file types *
Generates thumbnails *
Incremental scanning
Manual tagging from the UI and automatic tagging based on file attributes via user scripts
Recursive scan inside archive files **
OCR support with tesseract ***
Stats page & disk utilisation visualization
Named-entity recognition (client-side) ****

* See format support
** See Archive files
*** See OCR
**** See Named-Entity Recognition

Getting Started

Using Docker Compose (Windows/Linux/Mac)

version: "3"

services:
  elasticsearch:
    image: elasticsearch:7.17.9
    restart: unless-stopped
    volumes:
      # This directory must have 1000:1000 permissions (or update PUID & PGID below)
      - /data/sist2-es-data/:/usr/share/elasticsearch/data
    environment:
      - "discovery.type=single-node"
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - "PUID=1000"
      - "PGID=1000"
  sist2-admin:
    image: simon987/sist2:3.4.2-x64-linux
    restart: unless-stopped
    volumes:
      - /data/sist2-admin-data/:/sist2-admin/
      - /:/host
    ports:
      - 4090:4090
      # NOTE: Don't expose this port publicly!
      - 8080:8080
    working_dir: /root/sist2-admin/
    entrypoint: python3
    command:
      - /root/sist2-admin/sist2_admin/app.py

Navigate to http://localhost:8080/ to configure sist2-admin.

Using the executable file (Linux/WSL only)

Choose search backend (See comparison):
- Elasticsearch: have an Elasticsearch (version >= 6.8.X, ideally >=7.14.0) instance running
  1. Download from official website
  2. (or) Run using docker:
```
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.9
```
- SQLite: No installation required
Download the latest sist2 release. Select the file corresponding to your CPU architecture and mark the binary as executable with chmod +x.
See usage guide for command line usage.

Example usage:

Scan a directory: sist2 scan ~/Documents --output ./documents.sist2
Prepare search index:
- Elasticsearch: sist2 index --es-url http://localhost:9200 ./documents.sist2
- SQLite: sist2 sqlite-index --search-index ./search.sist2 ./documents.sist2
Start web interface:
- Elasticsearch: sist2 web ./documents.sist2
- SQLite: sist2 web --search-index ./search.sist2 ./documents.sist2
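The three steps above (scan, index, web) can be chained from a script. The following is a minimal sketch that only assembles the command lines shown in this section; the helper name `build_pipeline` and its parameters are hypothetical, and it assumes `sist2` is on your `PATH`.

```python
import subprocess

def build_pipeline(doc_dir, scan_out, backend="elasticsearch",
                   es_url="http://localhost:9200",
                   search_index="./search.sist2"):
    """Return the scan, index and web commands as argv lists."""
    scan = ["sist2", "scan", doc_dir, "--output", scan_out]
    if backend == "elasticsearch":
        index = ["sist2", "index", "--es-url", es_url, scan_out]
        web = ["sist2", "web", scan_out]
    elif backend == "sqlite":
        index = ["sist2", "sqlite-index", "--search-index", search_index, scan_out]
        web = ["sist2", "web", "--search-index", search_index, scan_out]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return [scan, index, web]

def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)

if __name__ == "__main__":
    # Only print the commands here; call run_pipeline() to execute them.
    for argv in build_pipeline("/home/user/Documents", "./documents.sist2",
                               backend="sqlite"):
        print(" ".join(argv))
```

Note that `subprocess` does not expand `~`, so pass an absolute path (or use `os.path.expanduser`) instead of `~/Documents`.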

Format support

File type	Library	Content	Thumbnail	Metadata
pdf,xps,fb2,epub	MuPDF	text+ocr	yes	author, title
cbz,cbr	libscan	-	yes	-
`audio/*`	ffmpeg	-	yes	ID3 tags
`video/*`	ffmpeg	-	yes	title, comment, artist
`image/*`	ffmpeg	ocr	yes	Common EXIF tags, GPS tags
raw, rw2, dng, cr2, crw, dcr, k25, kdc, mrw, pef, xf3, arw, sr2, srf, erf	LibRaw	no	yes	Common EXIF tags, GPS tags
ttf,ttc,cff,woff,fnt,otf	Freetype2	-	yes, `bmp`	Name & style
`text/plain`	libscan	yes	no	-
html, xml	libscan	yes	no	-
tar, zip, rar, 7z, ar ...	Libarchive	yes*	-	no
docx, xlsx, pptx	libscan	yes	if embedded	creator, modified_by, title
doc (MS Word 97-2003)	antiword	yes	no	author, title
mobi, azw, azw3	libmobi	yes	yes	author, title
wpd (WordPerfect)	libwpd	yes	no	planned
json, jsonl, ndjson	libscan	yes	-	-

* See Archive files

Archive files

sist2 will scan files stored inside archive files (zip, tar, 7z...) as if they were directly in the file system. Recursive scanning (archives inside archives) is also supported.

Limitations:

Support for parsing media files with formats that require seek (e.g. .gif, .mp4 w/ fragmented metadata etc.) is limited (see --mem-buffer option)
Archive files are scanned sequentially, by a single thread. On systems where sist2 is not I/O bound, scans might be faster when larger archives are split into smaller parts.
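If a single large zip archive is the bottleneck, one way to act on the last point is to split it into smaller archives before scanning. This is a rough sketch using only Python's standard `zipfile` module; the function name `split_zip`, the part naming scheme and the default of 1000 members per part are all assumptions, not something sist2 provides.

```python
import zipfile
from pathlib import Path

def split_zip(src, dest_dir, members_per_part=1000):
    """Copy the files in src into numbered, smaller zip files.

    Returns the list of part paths. Member names are kept, but
    timestamps and other per-member metadata are not preserved.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    parts = []
    with zipfile.ZipFile(src) as archive:
        infos = [i for i in archive.infolist() if not i.is_dir()]
        for start in range(0, len(infos), members_per_part):
            number = start // members_per_part
            part_path = dest / f"{Path(src).stem}.part{number:03d}.zip"
            with zipfile.ZipFile(part_path, "w", zipfile.ZIP_DEFLATED) as part:
                for info in infos[start:start + members_per_part]:
                    part.writestr(info.filename, archive.read(info))
            parts.append(part_path)
    return parts
```

Whether this actually speeds up a scan depends on whether the system is I/O bound, as noted above.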

OCR

You can enable OCR support for ebook (pdf,xps,fb2,epub) or image file types with the --ocr-lang <lang> option in combination with --ocr-images and/or --ocr-ebooks. Download the language data files with your package manager (apt install tesseract-ocr-eng) or directly from GitHub.

The simon987/sist2 image comes with common languages (hin, jpn, eng, fra, rus, spa, chi_sim, deu, pol) pre-installed.

You can use the + separator to specify multiple languages. The language name must be identical to the *.traineddata file installed on your system (use chi_sim rather than chi-sim).

Examples:

sist2 scan --ocr-ebooks --ocr-lang jpn ~/Books/Manga/
sist2 scan --ocr-images --ocr-lang eng ~/Images/Screenshots/
sist2 scan --ocr-ebooks --ocr-images --ocr-lang eng+chi_sim ~/Chinese-Bilingual/
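Because the language names must match the installed *.traineddata files exactly, it can help to check an `--ocr-lang` value before starting a long scan. A minimal sketch, assuming you know where your tessdata directory is; the helper name `missing_ocr_langs` is hypothetical and not part of sist2.

```python
from pathlib import Path

def missing_ocr_langs(lang_spec, tessdata_dir):
    """Return the languages in a '+'-separated spec with no .traineddata file."""
    installed = {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}
    return [lang for lang in lang_spec.split("+") if lang not in installed]

if __name__ == "__main__":
    # The directory below is an assumption; it differs between distributions.
    missing = missing_ocr_langs("eng+chi_sim", "/usr/share/tesseract-ocr/tessdata/")
    if missing:
        print("No traineddata file found for:", ", ".join(missing))
```

With only `eng.traineddata` and `chi_sim.traineddata` installed, a spec such as `eng+chi-sim` would report `chi-sim` as missing.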

Search backends

sist2 v3.0.7+ supports a SQLite search backend. The SQLite search backend has fewer features and generally comparable query performance for medium-size indices, but it uses much less memory and is easier to set up.

	SQLite	Elasticsearch
Requires separate search engine installation		✓
Memory footprint	~20MB	>500MB
Query syntax	fts5	query_string
Fuzzy search		✓
Media Types tree real-time updating		✓
Manual tagging	✓	✓
User scripts	✓	✓
Media Type breakdown for search results		✓
Embeddings search	✓ O(n)	✓ O(logn)

NER

sist2 v3.0.4+ supports named-entity recognition (NER). Simply add a supported repository URL to Configuration > Machine learning options > Model repositories to enable it.

The text processing is done in your browser, no data is sent to any third-party services. See simon987/sist2-ner-models for more details.

List of available repositories:

URL	Maintainer	Purpose
simon987/sist2-ner-models	simon987	General

Screenshot

Build from source

You can compile sist2 yourself if you don't want to use the pre-compiled binaries.

Using docker

git clone --recursive https://github.com/simon987/sist2/
cd sist2
docker build . -t my-sist2-image
# Copy sist2 executable from docker image
docker run --rm --entrypoint cat my-sist2-image /root/sist2 > sist2-x64-linux

Using a Linux computer

Install compile-time dependencies

apt install gcc g++ python3 yasm ragel automake autotools-dev wget libtool libssl-dev curl zip unzip tar xorg-dev libglu1-mesa-dev libxcursor-dev libxml2-dev libxinerama-dev gettext nasm git nodejs

Install vcpkg using my fork: https://github.com/simon987/vcpkg

Install vcpkg dependencies

vcpkg install openblas curl[core,openssl] sqlite3[core,fts5,json1] cpp-jwt pcre cjson brotli libarchive[core,bzip2,libxml2,lz4,lzma,lzo] pthread tesseract libxml2 libmupdf[ocr] gtest mongoose libmagic libraw gumbo ffmpeg[core,avcodec,avformat,swscale,swresample,webp,opus,mp3lame,vpx,zlib]

Build

git clone --recursive https://github.com/simon987/sist2/
(cd sist2-vue; npm install; npm run build)
(cd sist2-admin/frontend; npm install; npm run build)
cmake -DSIST_DEBUG=off -DCMAKE_TOOLCHAIN_FILE=<VCPKG_ROOT>/scripts/buildsystems/vcpkg.cmake .
make

sist2's Issues

Path filter picker does not show directories in order

sist2 version: 2.3.2

(To investigate, see #54 )

Verbose option

Add -v, --verbose option

https Elasticsearch client w/ mongoose (placeholder)

(placeholder)

Error decoding frame: Invalid data found when processing input

Does this error mean that it is trying to process a file that it cannot read like a media file?

I am running this on a folder with a multitude of various file types, but I am getting this error fairly consistently for the past 7% or so and I am pretty certain that media files are less than 7% of my files.

Is there an error log so I can see what files it has the Errors on?

Also - is the percent based on storage volume or file quantity?

Web: Filter Path error

sist2 version: 2.3.2

Platform (Linux or Docker): Docker on Windows

When I enter values in the "Filter Path", I get an "Elasticsearch Connection Error"

docker logs shows:

[7FC2978D0540] [2020-06-01 14:12:13] [WARNING serve.c] ElasticSearch error during query (400)
[7FC2978D0540] [2020-06-01 14:12:13] [WARNING serve.c] {
        "error":        {
                "root_cause":   [{
                                "type": "x_content_parse_exception",
                                "reason":       "[1:388] [bool] failed to parse field [filter]"
                        }],
                "type": "x_content_parse_exception",
                "reason":       "[1:388] [bool] failed to parse field [filter]",
                "caused_by":    {
                        "type": "illegal_state_exception",
                        "reason":       "expected value but got [START_ARRAY]"
                }
        },
        "status":       400
}

Illegal instruction

I am using the command-line version on Debian testing x64, and this seems to throw an error when indexing the whole system:

/opt/sist2/sist2 scan / -o /media/drive/sist2/out

38%[===============> ] TN: 74M IDX: 1GIllegal instruction

Invalid utf8 character in index

Hi,
During the "index" step, I see error messages like the below.
Kind regards,
lakemike

[2020-04-09 10:42:25] [ERROR elastic.c] {
	"index":	{
		"_index":	"sist2",
		"_type":	"_doc",
		"_id":	"9e74c6ed-9bd8-45fd-8c7a-86c5d35c7496",
		"status":	400,
		"error":	{
			"type":	"parse_exception",
			"reason":	"Failed to parse content to map",
			"caused_by":	{
				"type":	"json_parse_exception",
				"reason":	"Invalid UTF-8 start byte 0x81\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@234889b9; line: 1, column: 136]"
			}
		}
	}
}
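For context, 0x81 can never start a UTF-8 sequence, so any document containing that raw byte is rejected by a strict JSON parser. The sketch below is not sist2's code (the indexer is written in C); it only illustrates, in Python, one common way to make such text safe before serializing it, by replacing invalid bytes with U+FFFD. The sample file name is made up.

```python
import json

def sanitize_utf8(raw):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

# A hypothetical field value containing the byte from the error message.
doc = {"name": sanitize_utf8(b"report\x81.txt")}

# The serialized payload is now valid UTF-8 and can be parsed again.
payload = json.dumps(doc, ensure_ascii=False).encode("utf-8")
```

The trade-off is that the original byte is lost, which may or may not matter for file names that are later used to open the file.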

text_buffer_append_string0() does not always append the whole string

Feature Request: Web list view, show folder location of file

In the list view (perhaps to the right of the file size) show the folder path for the file.

Option to search in path

Would it be possible to bring results from partial folder path hits?

thanks

Skip thumbnails option

How do I run the web component on a non standard port?

I would like to try the Docker route, but your sample web server does not have a port and I already have other services running on the system. Can you please tell me how I can run the web component on a custom exposed service port?

Maybe you can tell us how to run the web server without network host mode and use exposed ports instead?

thanks

TESSDATA_PATH doesn't work on ubuntu/debian

Path for archlinux /usr/share/tessdata/ is different for debian /usr/share/tesseract-ocr/tessdata/...
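As a workaround on the user side, a wrapper script could probe the known locations and pick whichever exists. A small sketch, assuming only the two paths quoted above; the names `CANDIDATE_DIRS` and `find_tessdata` are hypothetical.

```python
from pathlib import Path

CANDIDATE_DIRS = [
    "/usr/share/tessdata/",                # Arch Linux, per this report
    "/usr/share/tesseract-ocr/tessdata/",  # Debian/Ubuntu, per this report
]

def find_tessdata(candidates=CANDIDATE_DIRS):
    """Return the first candidate directory that exists, or None."""
    for candidate in candidates:
        if Path(candidate).is_dir():
            return candidate
    return None

if __name__ == "__main__":
    print(find_tessdata() or "no tessdata directory found")
```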

FR: 3d object formats

It seems like sist2 supports a couple of those 3D formats (like xpov, I think), but it would be even nicer if it supported more common formats like obj, fbx, dae, etc., as long as they carry additional information, I guess.

MuPDF scan errors

Hi,
I have a large body of office documents (pdf, docx, pptx) and I noticed that the scanner throws a lot of error messages. Typically, it will also mean that full-text search does not work for the respective document. (Some error messages shown below .. there are a lot more). How could I best help debugging the scanner?
Kind regards,
lakemike

scan output:

..
FZ: cannot recognize xref format
FZ: trying to repair broken xref
FZ: repairing PDF document
..
FZ: invalid page object
..
Could not read archive: Unrecognized archive format
..
FZ: type3 glyph doesn't specify masked or colored
FZ: ... repeated 34 times...
..
[ERROR doc.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
..

path suggestion lookup is case-sensitive

Self reminder: change mappings to allow case insensitive lookup

Required positional argument: PATH.

sist2 version: 2.3.2 (simon987/sist2:latest as of posting date)

Platform (please indicate if you're using Docker): Docker Engine 19.03.8

Command with arguments:

docker run -it --name s2scan-docs-0529 -v docs:/files -v $PWD/out/:/out simon987/sist2 scan --very-verbose -t 2 --content-size=65536 --rewrite-url='\\IP\docs\my path\' /files -o /out/docs_idx_20.05.29

I tried --rewrite-url '\\IP\docs\my path'
and tried --rewrite-url "\\IP\docs\my path"

Both gave me a "Required positional argument: PATH." error

Removing that option and keeping the remaining command exactly the same and it runs. The other options I used are --very-verbose, -t, --content-size, and -o.

Questions:

Is there a limit on the total length of the command?
Is there a limit on the length of the string for that option?
"\IP\docs\my path" is 67 characters long.
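One way to narrow this down is to check how a POSIX shell would split the command. Python's `shlex` module approximates POSIX splitting rules, so the sketch below shows what argument list the program would receive; it does not reproduce sist2's own argument parser, and other shells (cmd.exe, PowerShell) split differently.

```python
import shlex

# The sist2 part of the command from this report, single-quoted variant.
command = (r"scan --very-verbose -t 2 --content-size=65536 "
           r"--rewrite-url='\\IP\docs\my path\' /files -o /out/docs_idx_20.05.29")

argv = shlex.split(command)

if __name__ == "__main__":
    for position, argument in enumerate(argv):
        print(position, repr(argument))
```

Under these rules the backslashes inside single quotes stay literal and `/files` arrives as its own argument, which suggests the shell quoting itself is not what drops PATH.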

Indexer should handle HTTP429 from elasticsearch

Indexer should wait a few seconds when receiving 429 response
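The intended behaviour can be sketched as a simple retry loop. This is only an illustration in Python, not the indexer's C implementation; the function name `send_with_retry`, the retry limit and the fixed delay are assumptions.

```python
import time

def send_with_retry(send, max_retries=5, delay_seconds=2.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    send is any callable that performs the request and returns the HTTP
    status code. Gives up after max_retries retries and returns the
    last status seen.
    """
    status = send()
    retries = 0
    while status == 429 and retries < max_retries:
        sleep(delay_seconds)  # back off before trying the same batch again
        retries += 1
        status = send()
    return status
```

A fixed delay is the simplest choice; an exponential backoff would be a reasonable alternative if the cluster stays overloaded for long periods.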

"Duplicate field content" error during index step

sist2 version: 2.1.0
Platform: Docker
Command with arguments:

docker run --net=bridge --rm --name sist2 \
   -v "$SRVDIR":"/docs" \
   -v $IDXBASE/idx/:"/idx" \
   -t simon987/sist2:$VERSION \
   index --very-verbose --force-reset --batch-size=100 --es-url=http://192.168.86.11:9200 ./idx/$IDXDIR/ 2>&1 | tee ./index.out

I see very few of these errors during indexing:

[7F0672D8A540] [2020-05-03 09:45:08] [ERROR elastic.c] {
        "index":        {
                "_index":       "sist2",
                "_type":        "_doc",
                "_id":  "df30616f-de8d-4eb1-93b6-2d823bff9eef",
                "status":       400,
                "error":        {
                        "type": "parse_exception",
                        "reason":       "Failed to parse content to map",
                        "caused_by":    {
                                "type": "json_parse_exception",
                                "reason":       "Duplicate field 'content'\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@72fea466; line: 1, column: 629]"
                        }
                }
        }
}
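To find which documents trigger this, one could check the generated JSON for repeated keys before it is sent. Python's `json` module silently keeps the last value for a duplicate key, but `object_pairs_hook` exposes every key/value pair, so duplicates can be detected. A minimal sketch; the function name `duplicate_keys` is hypothetical and this is not part of sist2.

```python
import json

def duplicate_keys(json_text):
    """Return keys that appear more than once within a single JSON object."""
    duplicates = []

    def check(pairs):
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=check)
    return duplicates
```

For example, `duplicate_keys('{"name": "a", "content": "x", "content": "y"}')` reports `content`, while the same key appearing in two different nested objects is not flagged.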

Option to only index file names

Hi,

Here is the docker line I use

docker run  --name sist2  -ti \
    -v /media/:/files \
    -v /media/TEMP/sist2/out/:/out \
    simon987/sist2 scan --verbose -t 6 --archive=list --content-size=-1 /files -o /out/media

As you can see, I set the content size to a negative value because I just want to index file names and nothing else. I have been scanning my whole /media (I have many devices there and about 10 TB) for more than a day; the index file is already at 17 GB and still not done.

I see these lines in the verbose log. Does it mean it is trying to scan the contents as well?


-1094995529] Invalid data found when processing input
[E2F0F700] [2020-02-15 16:31:56] [ERROR /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/aptitude/aptitude-defaults.it] (media.c) avformat_open_input() returned [-1094995529] Invalid data found when processing input
[E3710700] [2020-02-15 16:32:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/emacs/25.2/etc/tutorials/TUTORIAL.it] (media.c) avformat_open_input() returned [-1094995529] Invalid data found when processing input
[E3F11700] [2020-02-15 16:33:01] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/pandoc/data/templates/default.epub] FZ: cannot recognize zip archive
[E3710700] [2020-02-15 16:33:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/readme-txt.dir/README.IT] (media.c) avformat_open_input() returned [-1094995529] Invalid data found when processing input
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: cannot recognize version marker
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: trying to repair broken xref
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: repairing PDF document
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: no objects found
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: Failed to open doc from stream
[E3F11700] [2020-02-15 16:37:16] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E3F11700] [2020-02-15 16:37:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E3F11700] [2020-02-15 16:37:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E3F11700] [2020-02-15 16:37:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E270E700] [2020-02-15 16:51:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans400.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E3710700] [2020-02-15 16:51:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans600.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E2F0F700] [2020-02-15 16:51:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans400italic.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E270E700] [2020-02-15 16:51:15] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans600italic.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E3710700] [2020-02-15 16:51:15] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans700.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature

Incremental scan issues / questions

I pulled the latest image (1.2.17)

I ran this: docker run -it --name sist-scan-03.02 -v general:/files -v sist2_out:/out simon987/sist2 scan --verbose --incremental=/out/general_idx2 -t 2 --content-size=65536 --rewrite-url='\\[server-IP]\general\' /files -o /out/general_idx_2020.03.02

The log starts with:

[FDB950C0] [2020-03-02 23:50:58] [INFO main.c] sist2 v1.2.17

[FDB950C0] [2020-03-02 23:59:37] [INFO tpool.c] Starting thread pool with 2 threads
Loaded 2050660 items in to mtime table.

It ended without comment.

I ran it again with --very-verbose. The log starts with:

[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg quality=5.000000
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg size=500
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg content_size=65536
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg threads=2
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg incremental=/out/general_idx2
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg output=/out/general_idx_2020.03.02/
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg rewrite_url=\\[server-IP]\general\
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg name=general_idx_2020.03.02
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg depth=2147483647
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg path=/files/
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg archive=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg tesseract_lang=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg tesseract_path=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg exclude=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg fast=0

and ends with the error: Error while opening store: No such file or directory (/out/general_idx2thumbs)

I saw that there was no slash between idx and the thumbs folder. So I tried running again and changing the incremental line to include a slash: incremental=/out/general_idx2/
Similar starting point, but ended up failing early with the error double free or corruption (out)
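The "general_idx2thumbs" in the error message looks like what plain string concatenation produces when the directory has no trailing slash. That is only a guess from the message, not a reading of the sist2 source. The difference between concatenating and joining paths can be sketched as follows (function names are hypothetical):

```python
import posixpath

def store_path_concat(index_dir, name="thumbs"):
    """Naive concatenation: only correct when index_dir ends with a slash."""
    return index_dir + name

def store_path_join(index_dir, name="thumbs"):
    """Join that inserts the separator only when it is missing."""
    return posixpath.join(index_dir, name)

if __name__ == "__main__":
    print(store_path_concat("/out/general_idx2"))
    print(store_path_join("/out/general_idx2"))
    print(store_path_join("/out/general_idx2/"))
```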

Additional question(s):

Do you have to run incremental with the same settings as the original scan? To wit, did it fail because the first time I did not make thumbnails, but this time, with incremental, I asked it to make thumbnails for changed/new files?

memory leak in index module

(Self reminder to fix this)

Direct leak of 348482 byte(s) in 795 object(s) allocated from:
    #0 0x7fc56617ed28 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc1d28)
    #1 0x555740bb020d in index_json /home/hex/tmp/sist2/src/index/elastic.c:41
    #2 0x555740b86dae in read_index_bin /home/hex/tmp/sist2/src/io/serialize.c:310
    #3 0x555740b87540 in read_index /home/hex/tmp/sist2/src/io/serialize.c:394
    #4 0x555740b3963c in sist2_index /home/hex/tmp/sist2/src/main.c:172
    #5 0x555740b3a633 in main /home/hex/tmp/sist2/src/main.c:304
    #6 0x7fc5639512e0 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e0)

SUMMARY: AddressSanitizer: 348482 byte(s) leaked in 795 allocation(s).

FR: Ignore option for folders recursively

It would be great if the scan had options for ignoring folders/files. I started scanning /media and it works fine, but I also have Btrfs subvolumes inside the drives, so it scans those whole drives multiple times since they are also mounted as actual drives. I would like to be able to ignore folders with glob patterns or full paths.

List display option

List display option in search interface

FR: Windows version

Hi,
I think that a Windows release would be nice. I realize you recommended the Linux subsystem on Windows, but that requires the latest Windows versions and, to be honest, I never got that working either.

thanks

Feature request web: preview source document

When doing a search and getting results, it'd be nice to preview the entire source document inline, or at least the entirety of the raw text.

From poking around, it looks like this would need to make a request to ES to get back the entire source document based on ID, and then we'll need the UI elements to display the result.
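For reference, fetching one document by ID from Elasticsearch is a single GET on `/<index>/_doc/<id>`. The sketch below uses only Python's standard library; the index name `sist2` is taken from the log output elsewhere on this page, and the helper names and the default timeout are assumptions. It is not how the sist2 web UI is implemented.

```python
import json
import urllib.parse
import urllib.request

def doc_url(es_url, index, doc_id):
    """Build the Elasticsearch 'get document' URL for one document ID."""
    base = es_url.rstrip("/")
    index_part = urllib.parse.quote(index, safe="")
    id_part = urllib.parse.quote(doc_id, safe="")
    return f"{base}/{index_part}/_doc/{id_part}"

def fetch_source(es_url, index, doc_id, timeout=10):
    """Return the stored _source of a document (requires a reachable ES)."""
    with urllib.request.urlopen(doc_url(es_url, index, doc_id),
                                timeout=timeout) as response:
        return json.load(response)["_source"]
```

Keep in mind that only the part of the content that was indexed (see --content-size) would be available this way, not necessarily the whole file.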

QA: Workflow advice

I am looking for some advice on the proper workflow.

As of now I have set up a cron job that removes the existing index and then reindexes terabytes of data every 2 days. This does not sound right to me because, first, the scanning and indexing takes many hours, and second, the database is 1.5 GB even though I am using the `--fast` method. I can see some companies or people needing to make regular indexes and keep them as dated versions of search snapshots, but that does not sound practical for daily use.

I use Everything on Windows and these two apps are almost the same performance wise. However with Everything I do not have to erase the database, it does it transparently I guess.

I am looking for a way to come up with an efficient workflow that is easy on my system resources CPU, IO and storage wise. Is there a way to keep the existing database and be more efficient with scan times and indexing?

thanks

Issue loading web component

So, trying docker run --rm --network host --name sist2 -v $PWD/out/my_idx1:/idx -v my_vol:/files simon987/sist2 web --bind 0.0.0.0 /idx
just sits there - there is no feedback on the screen and if I go to localhost:4090 it will not load.

If I try docker run --rm --network host --name sist2 -v $PWD/out/my_idx1:/idx -v my_vol:/files simon987/sist2 web --bind 0.0.0.0:80 /idx (putting the port on the end of the IP so that it breaks), I get this response:

[12A70A00] [2019-12-18 23:54:51] [ERROR listen_point.c:200] Error getting local address and port: Invalid argument.
[12A70A00] [2019-12-18 23:54:51] [ERROR onion.c:470] There are no available listen points
Loaded index: my_idx1
Starting web server @ http://0.0.0.0:80:4090

Then it ends the container and closes.

And lastly, if I try:
docker run --rm --name sist2 -a STDOUT -p 8585:8585 -v $PWD/out/my_idx1:/idx -v my_vol:/files simon987/sist2 web --bind 0.0.0.0 --port 8585 /idx

The screen will not provide any feedback, but going to localhost:8585 will load a webpage, but the page will not search.

Add .mobi file format

Would be nice to support mobi, epub. I'm trying to add full-text search to my ebook/mag collection.

Issues with searching docx files

With Mime Types: vnd.openxmlformats-officedocument.wordprocessingml.document selected as the only type and nothing in the search, all the *.docx documents are shown.

For the first five documents, I opened the docx, copied a random word from the document, and pasted it in the search bar and received 0 results.

I have also tested with various other words that I know, due to the nature of the files, would appear numerous times, but I always return with 0 results.

Thumbnail meta field

thumbnail width & height as meta, don't rely on file type for frontend

Basic Auth option

Add option to enable basic auth

read/open: "No such file or directory" for files with single quotation mark or en dash in file path/name

I received a few "open: No such file or directory" and "read:no such file or directory" errors.

The commonality between all the files was a question mark in the error file name. Looking at the source files, the ? was substituted in place of either single quotation marks or en dashes.

I identified the character by copying it and pasting it in https://www.mclean.net.nz/ucf/
which returned U+2019 right single quotation mark, U+2018 left single quotation mark, and en dash U+2013.
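The substitution described here is what a lossy conversion to a narrower encoding produces. This is a small illustration in Python, not sist2's code: encoding a name to ASCII with replacement turns each of these characters into `?`, and `unicodedata` can identify the original characters without an external website. The sample file name is made up.

```python
import unicodedata

def describe_non_ascii(name):
    """List (character, code point, Unicode name) for each non-ASCII character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in name if ord(ch) > 127]

# Hypothetical file name with a right single quotation mark and an en dash.
filename = "John\u2019s notes \u2013 draft.txt"

# A lossy conversion replaces every character ASCII cannot represent.
lossy = filename.encode("ascii", errors="replace").decode("ascii")

if __name__ == "__main__":
    print(lossy)
    for entry in describe_non_ascii(filename):
        print(entry)
```

Once the name has been converted this way, the path no longer matches the file on disk, which would be consistent with the "No such file or directory" errors.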

MacOS Build

Fails while scanning

sist2: ../../../mce/helper.c:67: mceQNameLevelCleanup: Assertion '0==level || qname_level_set->max_level<level' failed.
This occurred after a couple hundred warning lines similar to:

Advanced search options

It would be very helpful to have advanced options for the search. Like being able to only show results with 2 or more search terms in them, or searching for specific strings.

Feature Request - Add the links to the thumbnails.

Just a small thing, maybe I'm spoiled or lazy, but my brain tells me to click on the thumbnails to open a link on browsers.

Just something to consider.

FR: Show the file path

I think it would be very helpful if the path were also written below the found item, maybe in a smaller font. As a user I like the way things are at the moment; however, trying to figure out where things are coming from is a bit tedious.

Another idea could be another filter, like the mime types one, that only shows the folder listings; that way one could click on a folder and list just the results found in that folder.

thanks

Does "--content-size 0" skip scanning content of files?

I'm looking for ways to speed up the scans. Indexing of content of files is not super important to me. Can I skip scanning of file content by doing "--content-size 0"? Or is there something else I missed?

segfault on http413

(self reminder to fix this)

Intermittent Multi-threading bug with small folders

I have the two following directories created

Files I want indexed:
/mnt/user/projects/test/

Location of indexes:
/mnt/user/temp/sist2_indexes/

The sist2_indexes/ directory is initially empty.
Here's an ls of the test/ directory with the files to be indexed:

> ls -l /mnt/user/projects/test/

total 20
-rw-rw-rw- 1 me users    7 Nov 15 15:31 File.txt
-rw-rw-rw- 1 me users    9 Nov 15 15:31 File2.txt
-rw-rw-rw- 1 me users    9 Nov 15 15:31 File3.txt
-rw-rw-rw- 1 me users 3290 Nov 15 15:31 image1.png
-rw-rw-rw- 1 me users 3290 Nov 15 15:32 image2.png

I run the following Docker commands

Scan

> docker run -it -v /mnt/user/projects/test/:/files -v /mnt/user/temp/sist2_indexes/:/indexes simon987/sist2 scan -t 16 /files -o /indexes/test_index

sist2 V1.1.5
---------------------
threads         16
tn_qscale       5.0/31.0
tn_size         500px
output          /indexes/test_index/

Index

> docker run -it --network host -v /mnt/user/temp/sist2_indexes/:/indexes simon987/sist2 index /indexes/test_index

Delete index <0>
Create index <0>
Close index <0>
Update settings <0>
Update mappings <0>
Open index <0>
Indexed   0 documents (0kB) <0>

Already here it seems something has gone wrong, since it says "Indexed 0 documents".

Web

> docker run --rm --network host -d --name sist2 -v /mnt/user/temp/sist2_indexes/:/indexes -v /mnt/user/projects/test/:/files simon987/sist2 web --bind 0.0.0.0 --port 8888 /indexes/test_index

f275f598e9b39564cd8e4ac06bcb1915a066a6bf3b566ea9cd1ff64c321c13f1

The web interface comes up as expected on port 8888, and the index "test_index" shows up in the "Search in indices" list, but no files show up and searching doesn't do anything.

What am I doing wrong here?

index module does not respect --batch-size argument

sist2 version: 2.3.3

Sometimes creates a bigger batch:

[7F58DC11A540] [2020-06-02 01:41:48] [INFO elastic.c] Indexed 1000 documents (13282kB) <200>
[7F58DC11A540] [2020-06-02 01:41:56] [INFO elastic.c] Indexed 1000 documents (11059kB) <200>
[7F58DC11A540] [2020-06-02 01:41:59] [INFO elastic.c] Indexed 1000 documents (4486kB) <200>
[7F9D6B2EE540] [2020-06-02 01:42:00] [INFO elastic.c] Indexed 5000 documents (56914kB) <200> <---
[7F58DC11A540] [2020-06-02 01:42:03] [INFO elastic.c] Indexed 1000 documents (5347kB) <200>
[7F58DC11A540] [2020-06-02 01:42:43] [INFO elastic.c] Indexed 1000 documents (24819kB) <200>
[7F58DC11A540] [2020-06-02 01:43:00] [INFO elastic.c] Indexed 1000 documents (16371kB) <200>
[7F58DC11A540] [2020-06-02 01:43:06] [INFO elastic.c] Indexed 1000 documents (9546kB) <200>

FR: Disable highlighting in the search listing

It would be nice if there were a way to disable highlighting in the search listing. I personally think that it is more of a distraction than a help; at least one should be able to toggle it on and off via a shortcut, maybe?

Please see the image for the chessboard look it gives:

https://i.imgur.com/GeHja5p.png

404 when clicked on a file

I scanned and indexed my full system using the lines below. Everything seems to have gone well. I can run the web interface and see search results, but when I click on an image, for instance, I get a 404.

I see that there is a /m missing from the file info; I wonder if that is the issue.

Here is the path line of a file when I click on info to see the details

edia/DRIVE/XXX/YYY/ZZZ/dinazor_1

I am on Debian 5.4.8-1 using Sist2 1.2.15

/opt/sist2/sist2 scan / --exclude "/media/DRIVE/.SUBVOLUMES/." --fast --verbose -o /DRIVE/sist2/out.system

/opt/sist2/sist2 index --force-reset /media/DRIVE/sist2/out.system

[BF4EA700] [2020-02-23 10:44:55] [INFO response.c:195] [192.168.2.11] "POST /es" 200 20776 (Keep-Alive)
[BF4EA700] [2020-02-23 10:44:58] [INFO response.c:195] [192.168.2.11] "GET /f/2db8637a-edcd-4306-a9d0-882e50d456bd" 404 24 (Keep-Alive)
[C0CEF7C0] [2020-02-23 10:44:58] [INFO response.c:195] [192.168.2.11] "GET /favicon.ico" 404 24 (Keep-Alive)
[C0CEF7C0] [2020-02-23 10:45:02] [INFO response.c:195] [192.168.2.11] "GET /f/6280ad6e-b0e9-4104-8f5d-456bf598902e" 404 24 (Keep-Alive)
[BF4EA700] [2020-02-23 10:45:11] [INFO response.c:195] [192.168.2.11] "GET /f/1ca9a1cf-11f2-4810-8b07-f768df95a84a" 404 24 (Keep-Alive)

chrome / reverse proxy issue

sist2 version: v2.1.0
Platform: Docker
Command: docker run -d --net=bridge --rm --name sist2 -p 4090:4090 -v "$SRVDIR":"/docs" -v $IDXBASE/idx/:"/idx" -t simon987/sist2:2.1.0 web --very-verbose --bind 0.0.0.0:4090 --es-url http://192.168.86.11:9200 ./idx/$IDXDIR/

Hi,
With v1.3.3 using Chrome, I could access sist2 either directly (URL with IP address) or through Traefik as a reverse proxy (e.g. https://sist2.subdomain.domain.org/). On v2.1.0, at least with the PC and macOS versions of Chrome, only the direct-access URL works. The proxied URL starts to load the interface but then stops, displaying a blue bar below the input field. It seems to be related to Chrome, because it works with Safari.

Kind regards,
lakemike

FR: Web: Filter / Sort by date

On the web page, it would be helpful to filter and/or sort results by date.
The two relevant dates may be the created date and the modified date.

I think the modified date would be more useful, at least for me, and it is already saved for use with --incremental.

(I am not a UX person, so what follows are some random thoughts. I really like the simplicity of your search that just works. So, these aren't necessarily "suggestions" but more "random thoughts".)

Filters may be: before, after, equal, between. Or something like what Gmail has.

Possibly add new "Date" filter tab next to "Tags". Dropdown of filter options that gives the correct number of date boxes.
Example from Gmail:

Sorts may be: Newest to oldest, oldest to newest.

Maybe a toggle under the "Date" filter tab with three options: New-Old, Old-New, Off. Default: Off, to keep the current behavior unchanged.
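
As a rough illustration of the filter and sort options described above, here is how they could map onto an Elasticsearch search body. This is a sketch under assumptions, not how sist2 builds its queries: the field name `mtime` and its storage as epoch seconds are assumed, and `to_epoch` and `date_query` are hypothetical helpers.

```python
from datetime import datetime, timezone


def to_epoch(date_string):
    """Convert 'YYYY-MM-DD' (interpreted as UTC midnight) to epoch seconds."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def date_query(text, after=None, before=None, order=None, field="mtime"):
    """Build an Elasticsearch search body with an optional date range and sort.

    `field` is an ASSUMED field name holding the modified time as epoch seconds.
    `order` is "desc" (New-Old), "asc" (Old-New), or None (Off).
    """
    bounds = {}
    if after is not None:
        bounds["gte"] = to_epoch(after)
    if before is not None:
        bounds["lte"] = to_epoch(before)

    bool_query = {"must": [{"simple_query_string": {"query": text}}]}
    if bounds:
        # A filter clause restricts results without affecting relevance scores.
        bool_query["filter"] = [{"range": {field: bounds}}]

    body = {"query": {"bool": bool_query}}
    if order is not None:
        body["sort"] = [{field: {"order": order}}]
    return body
```

With `order=None` and no bounds, the body contains only the text query, which matches the proposed "Off" default. "Between" is expressed by passing both `after` and `before`.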

Descriptive error message in web module on ES connection failure

There should be a descriptive message instead of a blank page.

See #12

Feature request - Scan depth option

A --scan-depth option could be useful if you want to scan all files in a folder but not the files in its subdirectories.
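
The requested semantics can be sketched in a few lines. This only illustrates what a depth limit would mean, not how sist2 walks the file system; the name `walk_with_depth` and the convention that a depth of 0 means "top-level files only" are assumptions.

```python
import os


def walk_with_depth(root, max_depth):
    """Yield file paths under root, descending at most max_depth directory levels.

    max_depth=0 yields only the files directly inside root.
    """
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for current, dirs, files in os.walk(root):
        depth = current.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirs[:] = []  # prune in place so os.walk does not descend further
        for name in sorted(files):
            yield os.path.join(current, name)
```

For example, `list(walk_with_depth("/some/folder", 0))` would return the files in that folder and skip everything in its subdirectories.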

Syscall param points to uninitialised byte(s)

(Self reminder to take care of this)

Disk usage analyser?

Wow - this is pretty awesome =). Can't wait to try this on my NAS box.

What do you think about adding some kind of visualisation for disk usage to Sist2?

This would allow you to drill down into directories and see where space is going. From the screenshots, it seems like Sist2 already stores the size, right? Is there enough data in the schema to create something?
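
If each indexed document carries a path and a size, the drill-down reduces to summing sizes into every ancestor directory. A minimal sketch, assuming the records are available as (path, size) pairs with '/'-separated paths; `directory_totals` is a hypothetical helper, not part of sist2.

```python
import posixpath
from collections import defaultdict


def directory_totals(records):
    """Sum file sizes into every ancestor directory.

    records: iterable of (path, size_in_bytes) pairs using '/' separators.
    Files with no directory component are counted under ".".
    """
    totals = defaultdict(int)
    for path, size in records:
        parent = posixpath.dirname(path)
        while True:
            totals[parent or "."] += size
            next_parent = posixpath.dirname(parent)
            # Stop at the top of a relative path ("") or at the root ("/").
            if not parent or next_parent == parent:
                break
            parent = next_parent
    return dict(totals)


records = [("photos/2020/a.jpg", 100), ("photos/b.jpg", 50), ("notes.txt", 5)]
print(directory_totals(records))
# prints {'photos/2020': 100, 'photos': 150, '.': 155}
```

A visualisation such as a treemap or sunburst chart could then be drawn from these per-directory totals.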

The Diskover project has some images that could show what I mean:

https://github.com/shirosaidev/diskover
https://github.com/shirosaidev/diskover-web

I was also thinking of Gnome Baobab:

https://wiki.gnome.org/Apps/DiskUsageAnalyzer
https://github.com/GNOME/baobab

or Windirstat:

https://windirstat.net/
