steffenfritz / filetrove

FileTrove indexes files and creates metadata from them.

Home Page: https://filetrove.fritz.wtf

License: GNU Affero General Public License v3.0

Go 97.14% Makefile 2.86%

digipres digital-preservation forensics forensics-investigations

filetrove's Issues

[CHANGE] Timestamps should not depend on the User's Regional Settings

Currently the timestamps are dependent on the regional settings of the user executing ftrove.

The identical nsrl.db appears, for example, with filectime "2024-02-12 11:25:11.280858018 -0400 AST" (time zone Bermuda) or "2024-02-12 16:25:11.280858018 +0100 CET" (time zone Berlin) - at least in an export (did not look into the DB directly).

For a correct numerical evaluation, all timestamps in all sessions should have a uniform basis.

If necessary, a time zone option can be introduced to give the user a choice (perhaps including the value "auto" or "local" to restore the current behavior).
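
A minimal sketch of the proposed behavior, assuming timestamps are available as RFC 3339 strings. The function name `normalizeTimestamp` and the option values "utc" and "local" are hypothetical and not part of ftrove; any other value is treated as an IANA zone name.

```go
package main

import (
	"fmt"
	"time"
)

// normalizeTimestamp parses an RFC 3339 timestamp and re-expresses it in the
// requested zone. "utc" (the default) gives every session a uniform basis;
// "local" restores the regional-settings behavior. Both option names are
// assumptions made for this sketch.
func normalizeTimestamp(ts, zone string) (string, error) {
	t, err := time.Parse(time.RFC3339Nano, ts)
	if err != nil {
		return "", err
	}
	switch zone {
	case "", "utc":
		t = t.UTC()
	case "local":
		t = t.Local()
	default:
		loc, err := time.LoadLocation(zone)
		if err != nil {
			return "", err
		}
		t = t.In(loc)
	}
	return t.Format(time.RFC3339Nano), nil
}

func main() {
	// The two renderings of the same instant quoted in the report above.
	for _, ts := range []string{
		"2024-02-12T11:25:11.280858018-04:00",
		"2024-02-12T16:25:11.280858018+01:00",
	} {
		out, err := normalizeTimestamp(ts, "utc")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out) // both inputs yield the same UTC string
	}
}
```

Storing UTC and converting only on export would keep values comparable across sessions regardless of where ftrove was run.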

[CHANGE] NSRL percentage after run

via e-mail:

"Idea/Feature request: NSRL-Score
If 'easily' possible: one more INFO-message at the end of the FileTrove run that shows how many of the indexed files were found in the NSRL. That might help the archivists to evaluate the data/workload on a quick view. It could be a percentage value (30% of the files were found in the NSRL database) or a score (NSRL-Score = 3)."
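
The percentage variant could be computed as in the following sketch. The function name, the counter values, and the wording of the INFO message are assumptions for illustration only.

```go
package main

import "fmt"

// nsrlPercentage returns the share of indexed files that were found in the
// NSRL, as a value between 0 and 100. An empty run yields 0.
func nsrlPercentage(indexed, found int) float64 {
	if indexed <= 0 {
		return 0
	}
	return float64(found) * 100 / float64(indexed)
}

func main() {
	// Hypothetical counters collected during an indexing run.
	indexed, found := 1200, 360
	fmt.Printf("INFO: %d of %d indexed files (%.1f%%) were found in the NSRL\n",
		found, indexed, nsrlPercentage(indexed, found))
}
```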

[CHANGE] Add progress bar for NSRL download

Describe the solution you'd like
A progress bar should be shown when ftrove downloads the NSRL database.
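
One possible approach, sketched with the standard library only: wrap the download stream in an `io.TeeReader` that feeds a byte-counting writer. The types and names below are hypothetical, and an in-memory reader stands in for the HTTP response body.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// progressLine renders one status line; the leading \r redraws it in place.
func progressLine(done, total int64) string {
	if total <= 0 {
		return fmt.Sprintf("\r%d bytes", done)
	}
	return fmt.Sprintf("\r%3d%% (%d/%d bytes)", done*100/total, done, total)
}

// progressWriter counts the bytes written through it and reports progress.
type progressWriter struct {
	done, total int64
	lastPct     int64
	out         io.Writer
}

func (p *progressWriter) Write(b []byte) (int, error) {
	p.done += int64(len(b))
	// Redraw only when the whole-number percentage changes, which keeps
	// the number of terminal writes small.
	if p.total > 0 {
		if pct := p.done * 100 / p.total; pct != p.lastPct {
			p.lastPct = pct
			fmt.Fprint(p.out, progressLine(p.done, p.total))
		}
	}
	return len(b), nil
}

func main() {
	// Stand-in for a download; a real one would read resp.Body and take the
	// total from resp.ContentLength.
	src := strings.NewReader(strings.Repeat("x", 1<<20))
	pw := &progressWriter{total: int64(src.Len()), lastPct: -1, out: os.Stderr}
	if _, err := io.Copy(io.Discard, io.TeeReader(src, pw)); err != nil {
		fmt.Fprintln(os.Stderr, "copy failed:", err)
	}
	fmt.Fprintln(os.Stderr)
}
```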

[CHANGE] Make entropy calculation optional

via e-mail:

"I had a folder with some larger files (third run 'titzmann'). First time, I thought ftrove freezed because progress bar and time (right, bottom) didn't move anymore
Might it be an option to skip 'Calculating entropy' or to not even try to calculate the entropy (already knowing it is more than 'max allowed: 1073741824')"
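
A sketch of how the decision could be made before any bytes are read. The limit 1073741824 is the value quoted in the e-mail; the `skip` parameter stands for a hypothetical command line option, and the byte-frequency Shannon entropy shown here is an assumption about the kind of entropy being computed.

```go
package main

import (
	"fmt"
	"math"
)

// maxEntropySize is the limit quoted in the report (1 GiB).
const maxEntropySize = 1073741824

// shouldComputeEntropy decides up front whether entropy is worth attempting,
// so oversized files are skipped without reading them.
func shouldComputeEntropy(size int64, skip bool) bool {
	return !skip && size <= maxEntropySize
}

// shannonEntropy returns the byte-level Shannon entropy in bits per byte.
func shannonEntropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	n := float64(len(data))
	entropy := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	data := []byte("FileTrove indexes files and creates metadata from them.")
	if shouldComputeEntropy(int64(len(data)), false) {
		fmt.Printf("entropy: %.4f bits/byte\n", shannonEntropy(data))
	}
	fmt.Println("2 GiB file:", shouldComputeEntropy(2<<30, false))
}
```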

[CHANGE] Align the Data Model with Wikidata Items like "file system object"

Wikidata is playing an increasingly important role in digital preservation.

FileTrove should align its data model with this de facto standard (see https://www.wikidata.org/wiki/Q37787110 and related pages):

file system object (Q37787110)
    part of:
        file system (Q174989)
    has characteristic:
        path (Q817765)
            has part(s):
                path separator (Q64826685)
                filename (Q1144928)
                    has part(s):
                        filename extension (Q186157)

This could be implemented with the following changes:

Session table

New Column: filesystem
- WD description: "concrete format or program for storing files and directories on a data storage device" (https://www.wikidata.org/wiki/Q174989)
- Example value(s): "ntfs", "ext4", "fat32"
- Remarks: The Wikidata description may not be the best possible... Need a controlled vocabulary for FS values.
New Column: pathseparator
- WD description: "character used to delimit components of a file path" (https://www.wikidata.org/wiki/Q64826685)
- Example value(s): "/", "\"

Files and directories tables

New Column: filepath | dirpath
- WD description: "general form of the name of a file or directory; resources can be represented by either absolute or relative paths" (https://www.wikidata.org/wiki/Q817765)
- Example value(s): "logs/filetrove.log"
- Remarks: This should be relative to the session's mountpoint. This is identical to the existing columns filename | dirname
New Column: filename | dirname
- WD description: "text string used to uniquely identify a computer file" (https://www.wikidata.org/wiki/Q1144928)
- Example value(s): "filetrove.log"
- Remarks: We should add to the description: "Without the leading path". This is different from the existing columns filename | dirname.
New Column: filenameextension | dirnameextension
- WD description: "suffix to the name of a computer file" (https://www.wikidata.org/wiki/Q186157)
- Example value(s): "log"
- Remarks: Of course, the extension is only a weak indicator of the file format, but it does play a role. Less common for directories, but still may be useful.

This information could also be obtained by parsing the existing "filename" column afterwards, but ftrove has it all to hand on run time and can easily record it. The filename without the path may be particularly useful for tracking files with the same name across several sessions.
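
To illustrate how cheaply the three values can be derived at run time, here is a minimal sketch. The struct and function names are hypothetical; the field names merely mirror the proposed columns.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// pathParts mirrors the proposed columns filepath, filename and
// filenameextension.
type pathParts struct {
	FilePath  string // path relative to the session's mountpoint
	FileName  string // last path element, without the leading path
	Extension string // suffix without the dot, empty if there is none
}

// splitPath derives the parts from a relative path. Separators are converted
// to "/" for splitting only; the recorded FilePath is left untouched.
// Note: a dotfile such as ".bashrc" is reported with extension "bashrc".
func splitPath(rel string) pathParts {
	name := path.Base(filepath.ToSlash(rel))
	return pathParts{
		FilePath:  rel,
		FileName:  name,
		Extension: strings.TrimPrefix(path.Ext(name), "."),
	}
}

func main() {
	p := splitPath("logs/filetrove.log")
	fmt.Printf("filepath=%q filename=%q filenameextension=%q\n",
		p.FilePath, p.FileName, p.Extension)
}
```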

[CHANGE] Add a Column for File/Directory Hierarchy Level

A column for the (relative) hierarchy level of each file/directory recorded seems useful.

The level can also later be determined from the number of path separators. But an explicit, numeric level would be much easier to filter or to evaluate when you are just interested in a certain level of detail.

If, for example, an author always organizes their work like

MyDocuments\Presentations\Berlin\2018-02-On-File-Formats
MyDocuments\Presentations\Online\2021-11-Pixelvetica

it may be sufficient to only look at directories on level 4 to get a first impression of works, without the need to deal with any auxiliary files further down the hierarchy.
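
A minimal sketch of the level calculation, assuming the path is relative to the session root and the separator is known. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hierarchyLevel returns the 1-based depth of a relative path: the number of
// path components separated by sep. An empty path or separator yields 0.
func hierarchyLevel(rel, sep string) int {
	if sep == "" {
		return 0
	}
	rel = strings.Trim(rel, sep)
	if rel == "" {
		return 0
	}
	return strings.Count(rel, sep) + 1
}

func main() {
	for _, p := range []string{
		`MyDocuments\Presentations\Berlin\2018-02-On-File-Formats`,
		`MyDocuments\Presentations`,
	} {
		fmt.Printf("level %d: %s\n", hierarchyLevel(p, `\`), p)
	}
}
```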

[CHANGE] Add more basic technical Metadata to Files and Directories

I would like to suggest identifying further basic technical metadata on files and directories, namely

Owner
Group
Access permissions
ACLs, if applicable.

Of course, you have to consider what occurs in all (or at least most) file systems.

Owners and groups probably only make sense in numerical form, and their mapping to readable names must be done externally.

It may make sense to create separate tables for certain file system types and fill them with their specific information.
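
On Unix-like systems, the numerical owner, group and permission bits can be read as in the sketch below. It is Unix-only (`syscall.Stat_t` does not exist on Windows), it does not cover ACLs, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// ownership holds the numerical owner and group plus the permission bits.
type ownership struct {
	UID, GID uint32
	Perm     string // e.g. "-rwxr-xr-x"
}

// statOwnership reads owner, group and permissions without following
// symlinks. Mapping the IDs to readable names is left to external tooling.
func statOwnership(name string) (ownership, error) {
	info, err := os.Lstat(name)
	if err != nil {
		return ownership{}, err
	}
	o := ownership{Perm: info.Mode().Perm().String()}
	// The concrete type behind Sys() is platform specific; this assertion
	// succeeds on Linux and macOS.
	if st, ok := info.Sys().(*syscall.Stat_t); ok {
		o.UID, o.GID = st.Uid, st.Gid
	}
	return o, nil
}

func main() {
	o, err := statOwnership(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, "stat failed:", err)
		return
	}
	fmt.Printf("uid=%d gid=%d perm=%s\n", o.UID, o.GID, o.Perm)
}
```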

[CHANGE] Create thumbnails from video stills

Describe the solution you'd like
FileTrove should create five random stills from video files. The images should be stored in the database.

[BUG] Progress bar is flickering on Windows

Describe the bug
The progress bar flickers in cmd.exe on Windows 10 (and probably other versions), both when downloading the nsrl.db file with ftrove and while indexing.

[CHANGE] Add Parent UUID to Files and Directories

Apart from the name, there is currently no connection between the files and their directories.

Files and directories could carry the UUID of their (parent) directories in order to formally map the hierarchy.
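
The linking step could look like the following sketch. It assumes slash-separated relative paths and generates random version 4 UUIDs with the standard library; all names are hypothetical and unrelated to ftrove's actual UUID handling.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"path"
)

// entry carries an item's own UUID and the UUID of its parent directory.
type entry struct {
	UUID       string
	ParentUUID string // empty for items directly below the session root
}

// newUUID returns a random (version 4) UUID.
func newUUID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// linkParents assigns a UUID to every path first and links each item to its
// parent directory afterwards, so the order of the input does not matter.
func linkParents(paths []string, newID func() string) map[string]entry {
	entries := make(map[string]entry, len(paths))
	for _, p := range paths {
		entries[p] = entry{UUID: newID()}
	}
	for _, p := range paths {
		dir := path.Dir(p)
		if parent, ok := entries[dir]; ok && dir != p {
			e := entries[p]
			e.ParentUUID = parent.UUID
			entries[p] = e
		}
	}
	return entries
}

func main() {
	paths := []string{"docs", "docs/report.txt", "docs/img", "docs/img/a.png"}
	entries := linkParents(paths, newUUID)
	for _, p := range paths {
		fmt.Printf("%-16s uuid=%s parent=%s\n", p, entries[p].UUID, entries[p].ParentUUID)
	}
}
```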

[CHANGE] Add Version Infos to Session Metadata

The session metadata should record version information of all tools ("agents") and signature files, databases, ... involved, so that theoretically PREMIS "Events" may be constructed from FileTrove data.

[CHANGE] Add the possibility to add entries to Boltdb

admftrove should have an option to add entries easily from a text file, one hash per line, to an existing database.
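
The input side of such an option might be sketched as follows, assuming the file contains hex-encoded SHA-1 hashes (40 characters each). The function names are hypothetical, and writing the validated hashes into the Bolt database is deliberately left out.

```go
package main

import (
	"bufio"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// readHashes reads one hash per line, ignores blank lines and counts lines
// that are not valid hex-encoded SHA-1 values.
func readHashes(r io.Reader) (hashes []string, skipped int, err error) {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		raw, decErr := hex.DecodeString(line)
		if decErr != nil || len(raw) != 20 {
			skipped++
			continue
		}
		hashes = append(hashes, line)
	}
	return hashes, skipped, sc.Err()
}

// readHashesFromString is a convenience wrapper for in-memory input.
func readHashesFromString(s string) ([]string, int, error) {
	return readHashes(strings.NewReader(s))
}

func main() {
	input := "da39a3ee5e6b4b0d3255bfef95601890afd80709\n\nnot-a-hash\n"
	hashes, skipped, err := readHashesFromString(input)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	// Each accepted hash would then be added to the existing database.
	fmt.Printf("accepted %d, skipped %d\n", len(hashes), skipped)
}
```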

[CHANGE] Create a graphical user interface

Describe the solution you'd like
FileTrove should have a graphical user interface that gives the same controls and status views as the command line interface.

There are three options at the moment:

i) provide a REST API and provide a web user interface
ii) create a fat client using the fyne library (https://fyne.io/)
iii) create a fat client using another library

[CHANGE] Add resume option

It should be possible to resume an indexing run by passing the session uuid.

As file lists are deterministic, FileTrove gets the last file entry from the session and then continues with the next file in the input file list.
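
A sketch of the lookup, assuming the input file list is produced in the same order on every run. The function name is hypothetical, and fetching the last entry from the session database is not shown.

```go
package main

import "fmt"

// remainingFiles returns the files that follow the last processed entry.
// The boolean is false when the entry is not in the list, in which case a
// resume is not safe and the caller should stop.
func remainingFiles(files []string, last string) ([]string, bool) {
	if last == "" {
		return files, true // nothing processed yet
	}
	for i, f := range files {
		if f == last {
			return files[i+1:], true
		}
	}
	return nil, false
}

func main() {
	files := []string{"a.txt", "b.txt", "c.txt", "d.txt"}
	rest, ok := remainingFiles(files, "b.txt")
	fmt.Println(rest, ok)
}
```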

[CHANGE] Add libmagic for cross check with siegfried

Is your feature request related to a problem? Please describe.
Using siegfried as the only source for file identification is not a good idea, as there might be wrong identifications.

Describe the solution you'd like
libmagic can be easily added and its identification can be cross-checked with sf's results.

[BUG] Streamline the "--install" option

The "--install" option is documented as "Install FileTrove into the given directory." However, this is somewhat misleading because, for example, "~/bin/ftrove --install /home/myusername/filetrove" does not lead to a functioning installation in the specified directory, but to the error message 'level=ERROR msg="Could not create db directory." error="mkdir /home/myusername/filetrove/db: no such file or directory"'.

"~/bin/ftrove --install ." in an existing directory on the other hand works.

I see two possibilities here:

Make it work
Change the help text to "Install FileTrove helper files into the current directory.", remove the string parameter and default to the current directory.
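
For the first possibility, a sketch under the assumption that the error comes from creating the db directory while its parent does not yet exist. `os.MkdirAll` creates missing parents and does not fail on existing directories. The function name and the demo location are hypothetical; only the "db" subdirectory is taken from the error message.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installDirs creates the installation root and its db subdirectory,
// including any missing parent directories.
func installDirs(root string) error {
	return os.MkdirAll(filepath.Join(root, "db"), 0o755)
}

func main() {
	// Demo location below the system temp directory.
	root := filepath.Join(os.TempDir(), "filetrove-demo", "nested", "filetrove")
	if err := installDirs(root); err != nil {
		fmt.Fprintln(os.Stderr, "could not create db directory:", err)
		os.Exit(1)
	}
	fmt.Println("created", filepath.Join(root, "db"))
}
```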

[CHANGE] Add Timestamp Info to Directories, too

I think directories should have all the timestamp information the file items already have. (And also those of #27 of course.)

[CHANGE] Output/Export PREMIS V3 Metadata as well

It would be nice if FileTrove or a sibling tool could also output the database contents of a session as PREMIS 3.0 XML or RDF, similar to what Data Accessioner does in PREMIS V2 (see https://cprerc.files.wordpress.com/2016/12/ercm006_file-transfers-using-dataaccessioner.pdf, p. 12, https://www.loc.gov/standards/premis/v3/index.html).

This is of course a big feature request with currently low priority...

[BUG] filesfidentnote and filesfidentproof always seem to be identical

I never saw a case where the columns filesfidentnote and filesfidentproof held different information, so maybe either the Siegfried results are mapped incorrectly or one of these columns may be dropped.

v1.0.0-DEV-11 on Arch Linux.

[CHANGE] Create session specific and overall report

Describe the solution you'd like
FileTrove should be able to export a report listing

all sessions
all sums of files and directories over all sessions
the same for single sessions

To avoid this, an upgrade path for every release should be defined and easily executable.

Therefore admftrove should have

i) all paths
ii) sql updates
iii) and file updates

as update function.

[CHANGE] Faster download for NSRL database

Is your feature request related to a problem? Please describe.
Download from archive.org is pretty slow. The NSRL SHA1 BoltDB should be hosted on a faster server.