sist2's Introduction

Demo: sist2.simon987.net

Community URL: Discord

sist2 (Simple incremental search tool)

Warning: sist2 is in early development

Features

  • Fast, low memory usage, multi-threaded
  • Manage & schedule scan jobs with simple web interface (Docker only)
  • Mobile-friendly Web interface
  • Extracts text and metadata from common file types *
  • Generates thumbnails *
  • Incremental scanning
  • Manual tagging from the UI and automatic tagging based on file attributes via user scripts
  • Recursive scan inside archive files **
  • OCR support with tesseract ***
  • Stats page & disk utilisation visualization
  • Named-entity recognition (client-side) ****

* See format support
** See Archive files
*** See OCR
**** See Named-Entity Recognition

Getting Started

Using Docker Compose (Windows/Linux/Mac)

version: "3"

services:
  elasticsearch:
    image: elasticsearch:7.17.9
    restart: unless-stopped
    volumes:
      # This directory must have 1000:1000 permissions (or update PUID & PGID below)
      - /data/sist2-es-data/:/usr/share/elasticsearch/data
    environment:
      - "discovery.type=single-node"
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - "PUID=1000"
      - "PGID=1000"
  sist2-admin:
    image: simon987/sist2:3.4.2-x64-linux
    restart: unless-stopped
    volumes:
      - /data/sist2-admin-data/:/sist2-admin/
      - /:/host
    ports:
      - 4090:4090
      # NOTE: Don't expose this port publicly!
      - 8080:8080
    working_dir: /root/sist2-admin/
    entrypoint: python3
    command:
      - /root/sist2-admin/sist2_admin/app.py

Navigate to http://localhost:8080/ to configure sist2-admin.

Using the executable file (Linux/WSL only)

  1. Choose search backend (See comparison):

    • Elasticsearch: have an Elasticsearch (version >= 6.8.X, ideally >=7.14.0) instance running
      1. Download from official website
      2. (or) Run using docker:
        docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.9
    • SQLite: No installation required
  2. Download the latest sist2 release. Select the file corresponding to your CPU architecture and mark the binary as executable with chmod +x.

  3. See usage guide for command line usage.

Example usage:

  1. Scan a directory: sist2 scan ~/Documents --output ./documents.sist2
  2. Prepare search index:
    • Elasticsearch: sist2 index --es-url http://localhost:9200 ./documents.sist2
    • SQLite: sist2 sqlite-index --search-index ./search.sist2 ./documents.sist2
  3. Start web interface:
    • Elasticsearch: sist2 web ./documents.sist2
    • SQLite: sist2 web --search-index ./search.sist2 ./documents.sist2
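
The steps above can be chained from a script. The following is a minimal sketch that only assembles the command lines shown above and does not run them. The function name `sist2_commands` and its parameters are hypothetical and not part of sist2; it uses no flags beyond the ones in the examples.

```python
from typing import List


def sist2_commands(
    scan_dir: str,
    index_file: str,
    backend: str = "sqlite",
    es_url: str = "http://localhost:9200",
    search_index: str = "./search.sist2",
) -> List[List[str]]:
    """Return the scan, index and web commands as argv lists.

    Hypothetical helper: it mirrors the example commands from this README.
    """
    scan = ["sist2", "scan", scan_dir, "--output", index_file]
    if backend == "elasticsearch":
        index = ["sist2", "index", "--es-url", es_url, index_file]
        web = ["sist2", "web", index_file]
    elif backend == "sqlite":
        index = ["sist2", "sqlite-index", "--search-index", search_index, index_file]
        web = ["sist2", "web", "--search-index", search_index, index_file]
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return [scan, index, web]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Note that `~` is not expanded without a shell, so a path such as `~/Documents` would need `os.path.expanduser` first.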

Format support

| File type | Library | Content | Thumbnail | Metadata |
|---|---|---|---|---|
| pdf,xps,fb2,epub | MuPDF | text+ocr | yes | author, title |
| cbz,cbr | libscan | - | yes | - |
| audio/* | ffmpeg | - | yes | ID3 tags |
| video/* | ffmpeg | - | yes | title, comment, artist |
| image/* | ffmpeg | ocr | yes | Common EXIF tags, GPS tags |
| raw, rw2, dng, cr2, crw, dcr, k25, kdc, mrw, pef, xf3, arw, sr2, srf, erf | LibRaw | no | yes | Common EXIF tags, GPS tags |
| ttf,ttc,cff,woff,fnt,otf | Freetype2 | - | yes, bmp | Name & style |
| text/plain | libscan | yes | no | - |
| html, xml | libscan | yes | no | - |
| tar, zip, rar, 7z, ar ... | Libarchive | yes* | - | no |
| docx, xlsx, pptx | libscan | yes | if embedded | creator, modified_by, title |
| doc (MS Word 97-2003) | antiword | yes | no | author, title |
| mobi, azw, azw3 | libmobi | yes | yes | author, title |
| wpd (WordPerfect) | libwpd | yes | no | planned |
| json, jsonl, ndjson | libscan | yes | - | - |

* See Archive files

Archive files

sist2 will scan files stored in archive files (zip, tar, 7z...) as if they were directly in the file system. Recursive scanning (archives inside archives) is also supported.

Limitations:

  • Support for parsing media files with formats that require seek (e.g. .gif, .mp4 w/ fragmented metadata etc.) is limited (see --mem-buffer option)
  • Archive files are scanned sequentially, by a single thread. On systems where sist2 is not I/O bound, scans might be faster when larger archives are split into smaller parts.

OCR

You can enable OCR support for ebook (pdf,xps,fb2,epub) or image file types with the --ocr-lang <lang> option in combination with --ocr-images and/or --ocr-ebooks. Download the language data files with your package manager (apt install tesseract-ocr-eng) or directly from GitHub.

The simon987/sist2 image comes with common languages (hin, jpn, eng, fra, rus, spa, chi_sim, deu, pol) pre-installed.

You can use the + separator to specify multiple languages. The language name must be identical to the *.traineddata file installed on your system (use chi_sim rather than chi-sim).

Examples:

sist2 scan --ocr-ebooks --ocr-lang jpn ~/Books/Manga/
sist2 scan --ocr-images --ocr-lang eng ~/Images/Screenshots/
sist2 scan --ocr-ebooks --ocr-images --ocr-lang eng+chi_sim ~/Chinese-Bilingual/
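
Because the language name has to match an installed *.traineddata file exactly, it can help to check the names before starting a long scan. The following is a minimal sketch, assuming you know where your tessdata directory is (its location varies between systems). The helper names `installed_languages` and `ocr_lang_value` are hypothetical and not part of sist2 or tesseract.

```python
from pathlib import Path
from typing import Iterable, Set


def installed_languages(tessdata_dir: str) -> Set[str]:
    """Language names are the stems of the *.traineddata files."""
    return {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}


def ocr_lang_value(languages: Iterable[str], tessdata_dir: str) -> str:
    """Join languages with '+' after checking that each one is installed."""
    available = installed_languages(tessdata_dir)
    wanted = list(languages)
    missing = [lang for lang in wanted if lang not in available]
    if missing:
        raise ValueError(f"no .traineddata file for: {', '.join(missing)}")
    return "+".join(wanted)
```

The returned string is what would be passed to --ocr-lang; a misspelling such as chi-sim raises an error instead of reaching the scan.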

Search backends

sist2 v3.0.7+ supports a SQLite search backend. The SQLite search backend has fewer features and generally comparable query performance for medium-size indices, but it uses much less memory and is easier to set up.

SQLite Elasticsearch
Requires separate search engine installation
Memory footprint ~20MB >500MB
Query syntax fts5 query_string
Fuzzy search
Media Types tree real-time updating
Manual tagging
User scripts
Media Type breakdown for search results
Embeddings search O(n) O(logn)

NER

sist2 v3.0.4+ supports named-entity recognition (NER). Simply add a supported repository URL to Configuration > Machine learning options > Model repositories to enable it.

The text processing is done in your browser; no data is sent to any third-party services. See simon987/sist2-ner-models for more details.

List of available repositories:

| URL | Maintainer | Purpose |
|---|---|---|
| simon987/sist2-ner-models | simon987 | General |

Build from source

You can compile sist2 yourself if you don't want to use the pre-compiled binaries.

Using docker

git clone --recursive https://github.com/simon987/sist2/
cd sist2
docker build . -t my-sist2-image
# Copy sist2 executable from docker image
docker run --rm --entrypoint cat my-sist2-image /root/sist2 > sist2-x64-linux

Using a Linux computer

  1. Install compile-time dependencies

    apt install gcc g++ python3 yasm ragel automake autotools-dev wget libtool libssl-dev curl zip unzip tar xorg-dev libglu1-mesa-dev libxcursor-dev libxml2-dev libxinerama-dev gettext nasm git nodejs
  2. Install vcpkg using my fork: https://github.com/simon987/vcpkg

  3. Install vcpkg dependencies

    vcpkg install openblas curl[core,openssl] sqlite3[core,fts5,json1] cpp-jwt pcre cjson brotli libarchive[core,bzip2,libxml2,lz4,lzma,lzo] pthread tesseract libxml2 libmupdf[ocr] gtest mongoose libmagic libraw gumbo ffmpeg[core,avcodec,avformat,swscale,swresample,webp,opus,mp3lame,vpx,zlib]
  4. Build

    git clone --recursive https://github.com/simon987/sist2/
    cd sist2
    (cd sist2-vue; npm install; npm run build)
    (cd sist2-admin/frontend; npm install; npm run build)
    cmake -DSIST_DEBUG=off -DCMAKE_TOOLCHAIN_FILE=<VCPKG_ROOT>/scripts/buildsystems/vcpkg.cmake .
    make

sist2's People

Contributors

dependabot[bot], dpieski, einfachtobi, jeaneric, kiskadee-dev, simon987, systemz, v-yadli

sist2's Issues

Error decoding frame: Invalid data found when processing input

Does this error mean that it is trying to process a file that it cannot read like a media file?

I am running this on a folder with a multitude of various file types, but I am getting this error fairly consistently for the past 7% or so and I am pretty certain that media files are less than 7% of my files.

Is there an error log so I can see what files it has the Errors on?

Also - is the percent based on storage volume or file quantity?

Web: Filter Path error

sist2 version: 2.3.2

Platform (Linux or Docker): Docker on Windows

When I enter values in the "Filter Path", I get an "Elasticsearch Connection Error"

docker logs shows:

[7FC2978D0540] [2020-06-01 14:12:13] [WARNING serve.c] ElasticSearch error during query (400)
[7FC2978D0540] [2020-06-01 14:12:13] [WARNING serve.c] {
        "error":        {
                "root_cause":   [{
                                "type": "x_content_parse_exception",
                                "reason":       "[1:388] [bool] failed to parse field [filter]"
                        }],
                "type": "x_content_parse_exception",
                "reason":       "[1:388] [bool] failed to parse field [filter]",
                "caused_by":    {
                        "type": "illegal_state_exception",
                        "reason":       "expected value but got [START_ARRAY]"
                }
        },
        "status":       400
}

Illegal instruction

Hi

I am using the command-line version on Debian testing x64, and it seems to throw an error when indexing the whole system:

/opt/sist2/sist2 scan / -o /media/drive/sist2/out

38%[===============> ] TN: 74M IDX: 1GIllegal instruction

Invalid utf8 character in index

Hi,
During the "index" step, I see error messages like the below.
Kind regards,
lakemike

[2020-04-09 10:42:25] [ERROR elastic.c] {
	"index":	{
		"_index":	"sist2",
		"_type":	"_doc",
		"_id":	"9e74c6ed-9bd8-45fd-8c7a-86c5d35c7496",
		"status":	400,
		"error":	{
			"type":	"parse_exception",
			"reason":	"Failed to parse content to map",
			"caused_by":	{
				"type":	"json_parse_exception",
				"reason":	"Invalid UTF-8 start byte 0x81\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@234889b9; line: 1, column: 136]"
			}
		}
	}
}
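
The error says the document sent to Elasticsearch contained a byte sequence that is not valid UTF-8 (0x81 can only appear as a continuation byte, never as a start byte). The following is a minimal sketch of how such bytes can be detected and replaced before the text is serialised to JSON. It illustrates the general technique only and is not sist2's actual code; the function names are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def sanitize_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

Replacing invalid bytes loses the original character, but it keeps the rest of the document indexable.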

How do I run the web component on a non standard port?

Hi

I would like to try the Docker route, but your sample web server does not have a port and I have other services running on the system already. Can you please tell me how I can run the web component on a custom exposed service port?

Maybe you can tell us how to run the web server without network host mode and use exposed ports instead?

thanks

FR: 3d object formats

Hi

It seems like Sist2 supports a couple of 3D formats (like xpov, I think), but it would be even nicer if it supported more common formats like obj, fbx, dae, etc., as long as they carry additional information, I guess.

MuPDF scan errors

Hi,
I have a large body of office documents (pdf, docx, pptx) and I noticed that the scanner throws a lot of error messages. Typically, it will also mean that full-text search does not work for the respective document. (Some error messages shown below .. there are a lot more). How could I best help debugging the scanner?
Kind regards,
lakemike

scan output:

..
FZ: cannot recognize xref format
FZ: trying to repair broken xref
FZ: repairing PDF document
..
FZ: invalid page object
..
Could not read archive: Unrecognized archive format
..
FZ: type3 glyph doesn't specify masked or colored
FZ: ... repeated 34 times...
..
[ERROR doc.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
..

Required positional argument: PATH.

sist2 version: 2.3.2 (simon987/sist2:latest as of posting date)

Platform (please indicate if you're using Docker): Docker Engine 19.03.8

Command with arguments:

docker run -it --name s2scan-docs-0529 -v docs:/files -v $PWD/out/:/out simon987/sist2 scan --very-verbose -t 2 --content-size=65536 --rewrite-url='\\IP\docs\my path\' /files -o /out/docs_idx_20.05.29

I tried --rewrite-url '\\IP\docs\my path'
and tried --rewrite-url "\\IP\docs\my path"

Both gave me a "Required positional argument: PATH." error

Removing that option and keeping the remaining command exactly the same and it runs. The other options I used are --very-verbose, -t, --content-size, and -o.

Questions:

  • Is there a limit on the total length of the command?
  • Is there a limit on the length of the string for that option?
    "\IP\docs\my path" is 67 characters long.

"Duplicate field content" error during index step

sist2 version: 2.1.0
Platform: Docker
Command with arguments:

docker run --net=bridge --rm --name sist2 \
   -v "$SRVDIR":"/docs" \
   -v $IDXBASE/idx/:"/idx" \
   -t simon987/sist2:$VERSION \
   index --very-verbose --force-reset --batch-size=100 --es-url=http://192.168.86.11:9200 ./idx/$IDXDIR/ 2>&1 | tee ./index.out

I see very few of these errors during indexing:

[7F0672D8A540] [2020-05-03 09:45:08] [ERROR elastic.c] {
        "index":        {
                "_index":       "sist2",
                "_type":        "_doc",
                "_id":  "df30616f-de8d-4eb1-93b6-2d823bff9eef",
                "status":       400,
                "error":        {
                        "type": "parse_exception",
                        "reason":       "Failed to parse content to map",
                        "caused_by":    {
                                "type": "json_parse_exception",
                                "reason":       "Duplicate field 'content'\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@72fea466; line: 1, column: 629]"
                        }
                }
        }
}
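
Elasticsearch rejects a JSON object that contains the same key twice, as the log above shows, whereas many JSON parsers silently keep the last value. The following is a minimal sketch for spotting duplicated keys in a serialised document, which could help when debugging such a payload. The function name `duplicate_keys` is hypothetical and this is not part of sist2.

```python
import json
from collections import Counter
from typing import List


def duplicate_keys(document: str) -> List[str]:
    """Return keys that occur more than once within any JSON object."""
    found: List[str] = []

    def check(pairs):
        # pairs holds every (key, value) of one object, duplicates included
        counts = Counter(key for key, _ in pairs)
        found.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(document, object_pairs_hook=check)
    return found
```

For a document in which "content" is emitted twice, the function returns a list containing "content".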

Option to only index file names

Hi,

Here is the docker line I use

docker run  --name sist2  -ti \
    -v /media/:/files \
    -v /media/TEMP/sist2/out/:/out \
    simon987/sist2 scan --verbose -t 6 --archive=list --content-size=-1 /files -o /out/media

As you can see, I set a negative value for the content size because I just want to index file names and nothing else. I have been scanning my whole /media (I have many devices there, about 10 TB) for more than a day; the index file is already at 17 GB and the scan is still not done.

I see these lines in the verbose log. Does it mean it is trying to scan the contents as well?


-1094995529] Invalid data found when processing input
[E2F0F700] [2020-02-15 16:31:56] [ERROR /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/aptitude/aptitude-defaults.it] (media.c) avformat_open_input() returned [-1094995529] Invalid data found when processing input
[E3710700] [2020-02-15 16:32:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/emacs/25.2/etc/tutorials/TUTORIAL.it] (media.c) avformat_open_input() returned [-1094995529] Invalid data found when processing input
[E3F11700] [2020-02-15 16:33:01] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/pandoc/data/templates/default.epub] FZ: cannot recognize zip archive
[E3710700] [2020-02-15 16:33:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/readme-txt.dir/README.IT] (media.c) avformat_open_input() returned [-1094995529] Invalid data found when processing input
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: cannot recognize version marker
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: trying to repair broken xref
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: repairing PDF document
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: no objects found
[E2F0F700] [2020-02-15 16:33:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/dvips/tetex/config.pdf] FZ: Failed to open doc from stream
[E3F11700] [2020-02-15 16:37:16] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E3F11700] [2020-02-15 16:37:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E3F11700] [2020-02-15 16:37:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E3F11700] [2020-02-15 16:37:18] [WARNING /files/DRIVE/var/lib/docker/overlay2/e52b5720313385b0db4e58c472734a00297cfd8c84e49612cd07d94c2389549c/merged/usr/share/texlive/texmf-dist/fonts/truetype/public/belleek/rblmi.ttf] (font.c) FT_Load_Char() returned error code [6] invalid argument
[E270E700] [2020-02-15 16:51:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans400.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E3710700] [2020-02-15 16:51:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans600.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E2F0F700] [2020-02-15 16:51:14] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans400italic.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E270E700] [2020-02-15 16:51:15] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans600italic.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature
[E3710700] [2020-02-15 16:51:15] [ERROR /files/DRIVE/var/lib/docker/overlay2/399ed6db8e08bb604728f2b95c0b16b61733433424065bec46beb1816d706b7f/diff/usr/local/tomcat/webapps/docs/images/fonts/OpenSans700.woff] (font.c) FT_New_Memory_Face() returned error code [7] unimplemented feature


Incremental scan issues / questions

I pulled the latest image (1.2.17)

I ran this: docker run -it --name sist-scan-03.02 -v general:/files -v sist2_out:/out simon987/sist2 scan --verbose --incremental=/out/general_idx2 -t 2 --content-size=65536 --rewrite-url='\\[server-IP]\general\' /files -o /out/general_idx_2020.03.02

The log starts with:

[FDB950C0] [2020-03-02 23:50:58] [INFO main.c] sist2 v1.2.17
[FDB950C0] [2020-03-02 23:59:37] [INFO tpool.c] Starting thread pool with 2 threads
Loaded 2050660 items in to mtime table.

It ended without comment.

I ran it again with --very-verbose. The log starts with:

[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg quality=5.000000
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg size=500
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg content_size=65536
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg threads=2
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg incremental=/out/general_idx2
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg output=/out/general_idx_2020.03.02/
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg rewrite_url=\\[server-IP]\general\
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg name=general_idx_2020.03.02
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg depth=2147483647
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg path=/files/
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg archive=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg tesseract_lang=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg tesseract_path=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg exclude=(null)
[BC6950C0] [2020-03-03 00:52:25] [DEBUG cli.c] arg fast=0

and ends with the error: Error while opening store: No such file or directory (/out/general_idx2thumbs)

I saw that there was no slash between idx and the thumbs folder. So I tried running again and changing the incremental line to include a slash: incremental=/out/general_idx2/
Similar starting point, but ended up failing early with the error double free or corruption (out)

Additional question(s):

  • Do you have to run incremental with the same settings as the original scan? To wit, did it fail because the first time I did not make thumbnails, but this time, with incremental, I asked it to make thumbnails for changed/new files?
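
The "mtime table" line in the log above suggests that an incremental scan compares each file's modification time with the value recorded by the previous scan. The following is a minimal sketch of that general idea, not sist2's actual implementation. The function name `changed_files` and the shape of the table (path to integer mtime) are assumptions made for illustration.

```python
import os
from typing import Dict, List


def changed_files(root: str, previous_mtimes: Dict[str, int]) -> List[str]:
    """Return paths under root that are new or whose mtime differs from the table."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = int(os.stat(path).st_mtime)
            # A missing entry (new file) or a different mtime means "rescan"
            if previous_mtimes.get(path) != mtime:
                changed.append(path)
    return sorted(changed)
```

Only the returned paths would need to be parsed again; everything else could be carried over from the previous index.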

memory leak in index module

(Self reminder to fix this)

Direct leak of 348482 byte(s) in 795 object(s) allocated from:
    #0 0x7fc56617ed28 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc1d28)
    #1 0x555740bb020d in index_json /home/hex/tmp/sist2/src/index/elastic.c:41
    #2 0x555740b86dae in read_index_bin /home/hex/tmp/sist2/src/io/serialize.c:310
    #3 0x555740b87540 in read_index /home/hex/tmp/sist2/src/io/serialize.c:394
    #4 0x555740b3963c in sist2_index /home/hex/tmp/sist2/src/main.c:172
    #5 0x555740b3a633 in main /home/hex/tmp/sist2/src/main.c:304
    #6 0x7fc5639512e0 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e0)

SUMMARY: AddressSanitizer: 348482 byte(s) leaked in 795 allocation(s).

FR: Ignore option for folders recursively

Hi

It would be great if the scan had options for ignoring folders/files. I started scanning /media, which is fine, but I also have Btrfs subvolumes inside the drives, so it scans those whole drives multiple times since they are also mounted as actual drives. I would like to be able to ignore folders with glob patterns or full paths.
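
To make the request concrete, here is a minimal sketch of what glob-based exclusion could look like. It is purely illustrative: the function name `filter_excluded` is hypothetical, the paths in the usage note are made up, and nothing here describes how sist2 handles exclusions.

```python
import fnmatch
from typing import Iterable, List


def filter_excluded(paths: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep only the paths that match none of the glob patterns.

    Note: in fnmatch, '*' also matches '/', so one pattern can cover
    a whole subtree.
    """
    patterns = list(patterns)
    return [
        path for path in paths
        if not any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

With the pattern `/media/*/.SUBVOLUMES/*`, a path such as `/media/DRIVE/.SUBVOLUMES/snap1/docs/a.txt` is dropped while `/media/DRIVE/docs/a.txt` is kept.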

FR: Windows version

Hi
I think that a Windows release would be nice. I realize you recommended the Linux subsystem on Windows, but that requires the latest Windows versions and, to be honest, I never got that working either.

thanks

Feature request web: preview source document

When doing a search and getting results, it'd be nice to preview the entire source document inline, or at least the entirety of the raw text.

From poking around, it looks like this would need to make a request to ES to get back the entire source document based on ID, and then we'll need the UI elements to display the result.

QA: Workflow advice

Hi

I am looking for some advice on the proper workflow.

As of now I have set up a cron job that removes the existing index and then reindexes terabytes of data every 2 days. This does not sound right to me: first, the scanning and indexing take many hours; second, the database is 1.5 GB even though I am using the --fast option. I can see some companies or people needing to make regular indexes and keep them as dated search snapshots, but that does not sound practical for daily use.

I use Everything on Windows and these two apps are almost the same performance wise. However with Everything I do not have to erase the database, it does it transparently I guess.

I am looking for a way to come up with an efficient workflow that is easy on my system resources CPU, IO and storage wise. Is there a way to keep the existing database and be more efficient with scan times and indexing?

thanks

Issue loading web component

So, trying docker run --rm --network host --name sist2 -v $PWD/out/my_idx1:/idx -v my_vol:/files simon987/sist2 web --bind 0.0.0.0 /idx
just sits there - there is no feedback on the screen and if I go to localhost:4090 it will not load.

If I try docker run --rm --network host --name sist2 -v $PWD/out/my_idx1:/idx -v my_vol:/files simon987/sist2 web --bind 0.0.0.0:80 /idx (putting the port on the end of the IP so that it breaks), I get this response:

[12A70A00] [2019-12-18 23:54:51] [ERROR listen_point.c:200] Error getting local address and port: Invalid argument.
[12A70A00] [2019-12-18 23:54:51] [ERROR onion.c:470] There are no available listen points
Loaded index: my_idx1
Starting web server @ http://0.0.0.0:80:4090

Then it ends the container and closes.

And lastly, if I try:
docker run --rm --name sist2 -a STDOUT -p 8585:8585 -v $PWD/out/my_idx1:/idx -v my_vol:/files simon987/sist2 web --bind 0.0.0.0 --port 8585 /idx

The screen will not provide any feedback, but going to localhost:8585 will load a webpage, but the page will not search.

Add .mobi file format

Would be nice to support mobi, epub. I'm trying to add full-text search to my ebook/mag collection.

Issues with searching docx files

With Mime Types: vnd.openxmlformats-officedocument.wordprocessingml.document selected as the only type and nothing in the search, all the *.docx documents are shown.

For the first five documents, I opened the docx, copied a random word from the document, and pasted it in the search bar and received 0 results.

I have also tested with various other words that I know, due to the nature of the files, would appear numerous times, but I always return with 0 results.

read/open: "No such file or directory" for files with single quotation mark or en dash in file path/name

I received a few "open: No such file or directory" and "read:no such file or directory" errors.

The commonality between all the files was a question mark in the error file name. Looking at the source files, the ? was substituted in place of either single quotation marks or en dashes.

I identified the character by copying it and pasting it in https://www.mclean.net.nz/ucf/
which returned U+2019 right single quotation mark, U+2018 left single quotation mark, and en dash U+2013.
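
The same identification can be done locally without a website. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `non_ascii_characters` and the file name in the usage note are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def non_ascii_characters(name: str) -> List[Tuple[str, str, str]]:
    """List (character, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in name
        if ord(ch) > 127
    ]
```

Calling it on a name like "Bob\u2019s notes \u2013 draft.txt" reports U+2019 (RIGHT SINGLE QUOTATION MARK) and U+2013 (EN DASH), the characters described above.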

Fails while scanning

sist2: ../../../mce/helper.c:67: mceQNameLevelCleanup: Assertion '0==level || qname_level_set->max_level<level' failed.
This occurred after a couple hundred warning lines similar to:
image

Advanced search options

It would be very helpful to have advanced options for the search. Like being able to only show results with 2 or more search terms in them, or searching for specific strings.

FR: Show the file path

Hi

I think it would be very helpful if the path were also written below the found item, maybe in a smaller font. As a user I like the way things are at the moment; however, trying to figure out where things are coming from is a bit tedious.

Another idea could be another filter, like the mime types one, that shows the folder listing instead; that way one can just click on a folder and list the results found in that folder.

thanks

Intermittent Multi-threading bug with small folders

I have the two following directories created

Files I want indexed:
/mnt/user/projects/test/

Location of indexes:
/mnt/user/temp/sist2_indexes/

The sist2_indexes/ directory is initially empty.
Here's an ls of the test/ directory with the files to be indexed:

> ls -l /mnt/user/projects/test/

total 20
-rw-rw-rw- 1 me users    7 Nov 15 15:31 File.txt
-rw-rw-rw- 1 me users    9 Nov 15 15:31 File2.txt
-rw-rw-rw- 1 me users    9 Nov 15 15:31 File3.txt
-rw-rw-rw- 1 me users 3290 Nov 15 15:31 image1.png
-rw-rw-rw- 1 me users 3290 Nov 15 15:32 image2.png

I run the following Docker commands

Scan

> docker run -it -v /mnt/user/projects/test/:/files -v /mnt/user/temp/sist2_indexes/:/indexes simon987/sist2 scan -t 16 /files -o /indexes/test_index

sist2 V1.1.5
---------------------
threads         16
tn_qscale       5.0/31.0
tn_size         500px
output          /indexes/test_index/

Index

> docker run -it --network host -v /mnt/user/temp/sist2_indexes/:/indexes simon987/sist2 index /indexes/test_index

Delete index <0>
Create index <0>
Close index <0>
Update settings <0>
Update mappings <0>
Open index <0>
Indexed   0 documents (0kB) <0>

Already here it seems something has gone wrong, since it says "Indexed 0 documents".

Web

> docker run --rm --network host -d --name sist2 -v /mnt/user/temp/sist2_indexes/:/indexes -v /mnt/user/projects/test/:/files simon987/sist2 web --bind 0.0.0.0 --port 8888 /indexes/test_index

f275f598e9b39564cd8e4ac06bcb1915a066a6bf3b566ea9cd1ff64c321c13f1

The web interface comes up as expected on port 8888, and the index "test_index" shows up in the "Search in indices" list, but no files show up and searching doesn't do anything.

What am I doing wrong here?

index module does not respect --batch-size argument

sist2 version: 2.3.3

Sometimes creates a bigger batch:

[7F58DC11A540] [2020-06-02 01:41:48] [INFO elastic.c] Indexed 1000 documents (13282kB) <200>
[7F58DC11A540] [2020-06-02 01:41:56] [INFO elastic.c] Indexed 1000 documents (11059kB) <200>
[7F58DC11A540] [2020-06-02 01:41:59] [INFO elastic.c] Indexed 1000 documents (4486kB) <200>
[7F9D6B2EE540] [2020-06-02 01:42:00] [INFO elastic.c] Indexed 5000 documents (56914kB) <200> <---
[7F58DC11A540] [2020-06-02 01:42:03] [INFO elastic.c] Indexed 1000 documents (5347kB) <200>
[7F58DC11A540] [2020-06-02 01:42:43] [INFO elastic.c] Indexed 1000 documents (24819kB) <200>
[7F58DC11A540] [2020-06-02 01:43:00] [INFO elastic.c] Indexed 1000 documents (16371kB) <200>
[7F58DC11A540] [2020-06-02 01:43:06] [INFO elastic.c] Indexed 1000 documents (9546kB) <200>

404 when clicked on a file

Hi

I scanned and indexed my full system using the lines below. Everything seemed to go well. I can run the web interface and see search results, but when I click on an image, for instance, I get a 404.

I see that there is a /m missing from the file info; I wonder if that is the issue.

Here is the path line of a file when I click on info to see the details

edia/DRIVE/XXX/YYY/ZZZ/dinazor_1

I am on Debian 5.4.8-1 using Sist2 1.2.15

/opt/sist2/sist2 scan / --exclude "/media/DRIVE/.SUBVOLUMES/." --fast --verbose -o /DRIVE/sist2/out.system

/opt/sist2/sist2 index --force-reset /media/DRIVE/sist2/out.system

[BF4EA700] [2020-02-23 10:44:55] [INFO response.c:195] [192.168.2.11] "POST /es" 200 20776 (Keep-Alive)
[BF4EA700] [2020-02-23 10:44:58] [INFO response.c:195] [192.168.2.11] "GET /f/2db8637a-edcd-4306-a9d0-882e50d456bd" 404 24 (Keep-Alive)
[C0CEF7C0] [2020-02-23 10:44:58] [INFO response.c:195] [192.168.2.11] "GET /favicon.ico" 404 24 (Keep-Alive)
[C0CEF7C0] [2020-02-23 10:45:02] [INFO response.c:195] [192.168.2.11] "GET /f/6280ad6e-b0e9-4104-8f5d-456bf598902e" 404 24 (Keep-Alive)
[BF4EA700] [2020-02-23 10:45:11] [INFO response.c:195] [192.168.2.11] "GET /f/1ca9a1cf-11f2-4810-8b07-f768df95a84a" 404 24 (Keep-Alive)

chrome / reverse proxy issue

sist2 version: v2.1.0
Platform: Docker
Command: docker run -d --net=bridge --rm --name sist2 -p 4090:4090 -v "$SRVDIR":"/docs" -v $IDXBASE/idx/:"/idx" -t simon987/sist2:2.1.0 web --very-verbose --bind 0.0.0.0:4090 --es-url http://192.168.86.11:9200 ./idx/$IDXDIR/

Hi,
With v1.3.3 using Chrome, I could access sist2 directly (URL with IP address) or using traefik as a reverse proxy (e.g. https://sist2.subdomain.domain.org/). On v2.1.0, at least for the PC and macOS versions of Chrome, only the direct access URL works. The proxied URL starts to load the interface but then stops, displaying a blue bar below the input mask. It seems to be related to Chrome because it works with Safari.

Kind regards,
lakemike

FR: Web: Filter / Sort by date

On the web page, it would be helpful to filter and/or sort results by date.
The two relevant dates may be the created date and the modified date.

I think the modified date would be more useful, at least for me, and the modified date is already saved for use with --incremental.

(I am not a UX person, so what follows are some random thoughts. I really like the simplicity of your search that just works. So, these aren't necessarily "suggestions" but more "random thoughts".)

Filters may be: before, after, equal, between. Or something like gmail has.

  • Possibly add new "Date" filter tab next to "Tags". Dropdown of filter options that gives the correct number of date boxes.
  • Example from gmail:
    image

Sorts may be: Newest to oldest, oldest to newest.

  • Maybe a toggle under the "Date" filter tab, three options, New-Old, Old-New, Off. Default: Off to keep function the same as the current functioning.

Disk usage analyser?

Wow - this is pretty awesome =). Can't wait to try this on my NAS box.

What do you think about adding some kind of visualisation for disk usage to Sist2?

This would allow you to drill down to directories, and see where space is going. From the screenshots, it seems like Sist2 already stores the size, right? Is there enough data in the schema to create something?

The Diskover project has some images that could show what I mean:

https://github.com/shirosaidev/diskover
https://github.com/shirosaidev/diskover-web

I was also thinking of Gnome Baobab:

https://wiki.gnome.org/Apps/DiskUsageAnalyzer
https://github.com/GNOME/baobab

or Windirstat:

https://windirstat.net/
