steffenfritz / filetrove
FileTrove indexes files and creates metadata from them.
Home Page: https://filetrove.fritz.wtf
License: GNU Affero General Public License v3.0
Currently the timestamps are dependent on the regional settings of the user executing ftrove.
The identical nsrl.db appears, for example, with filectime "2024-02-12 11:25:11.280858018 -0400 AST" (time zone Bermuda) or "2024-02-12 16:25:11.280858018 +0100 CET" (time zone Berlin), at least in an export (I did not look into the DB directly).
For a correct numerical evaluation, all timestamps in all sessions should have a uniform basis.
If necessary, a time zone option can be introduced to give the user a choice (perhaps including the value "auto" or "local" to restore the current behavior).
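A minimal sketch of the uniform basis suggested above, assuming UTC is chosen as that basis and RFC 3339 as the textual form. The function name `NormalizeTimestamp` and the helper `exampleTimestamps` are hypothetical and not part of FileTrove.

```go
package main

import (
	"fmt"
	"time"
)

// NormalizeTimestamp converts a timestamp to UTC and renders it as RFC 3339
// with nanoseconds, so values compare equal regardless of the local time
// zone of the user running the tool. (Hypothetical helper, not FileTrove API.)
func NormalizeTimestamp(t time.Time) string {
	return t.UTC().Format(time.RFC3339Nano)
}

// exampleTimestamps rebuilds the two values quoted in the report with fixed
// offsets and returns their normalized forms.
func exampleTimestamps() (string, string) {
	ast := time.FixedZone("AST", -4*60*60)
	cet := time.FixedZone("CET", 1*60*60)
	bermuda := time.Date(2024, 2, 12, 11, 25, 11, 280858018, ast)
	berlin := time.Date(2024, 2, 12, 16, 25, 11, 280858018, cet)
	return NormalizeTimestamp(bermuda), NormalizeTimestamp(berlin)
}

func main() {
	a, b := exampleTimestamps()
	fmt.Println(a)
	fmt.Println(b)
	fmt.Println("identical:", a == b)
}
```

Both example values denote the same instant, so after normalization they are written identically.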
via e-mail:
"Idea/Feature request: NSRL-Score
If 'easily' possible: one more INFO-message at the end of the FileTrove run that shows how many of the indexed files were found in the NSRL. That might help the archivists to evaluate the data/workload on a quick view. It could be a percentage value (30% of the files were found in the NSRL database) or a score (NSRL-Score = 3)."
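The percentage variant of this request could be computed as in the following sketch. It assumes the run already counts indexed files and NSRL hits; the function name `NSRLPercentage` and the output wording are hypothetical.

```go
package main

import "fmt"

// NSRLPercentage returns the share of indexed files that were found in the
// NSRL, as a percentage. An empty run yields 0 instead of dividing by zero.
func NSRLPercentage(hits, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(hits) * 100 / float64(total)
}

func main() {
	hits, total := 30, 100 // assumed counters collected during indexing
	fmt.Printf("%d of %d indexed files (%.1f%%) were found in the NSRL\n",
		hits, total, NSRLPercentage(hits, total))
}
```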
Describe the solution you'd like
A progress bar should be shown when ftrove downloads the NSRL database.
via e-mail:
"I had a folder with some larger files (third run 'titzmann'). First time, I thought ftrove freezed because progress bar and time (right, bottom) didn't move anymore
Might it be an option to skip 'Calculating entropy' or to not even try to calculate the entropy (already knowing it is more than 'max allowed: 1073741824')"
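One way to read this request is to check the file size before any data is read, so that oversized files are skipped immediately. The sketch below assumes the limit quoted in the message (1073741824 bytes) and a Shannon entropy over byte frequencies; `EntropyIfAllowed`, `ErrTooLarge` and the `read` callback are hypothetical names, and ftrove's actual entropy code may differ.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// maxEntropySize is the limit quoted in the report (1 GiB).
const maxEntropySize = 1073741824

// ErrTooLarge signals that entropy calculation was skipped because of size.
var ErrTooLarge = errors.New("file exceeds entropy size limit")

// Entropy returns the Shannon entropy of data in bits per byte.
func Entropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	n := float64(len(data))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// EntropyIfAllowed checks the size first and only calls read (which loads
// the file content) when the size is within the limit.
func EntropyIfAllowed(size int64, read func() ([]byte, error)) (float64, error) {
	if size > maxEntropySize {
		return 0, ErrTooLarge
	}
	data, err := read()
	if err != nil {
		return 0, err
	}
	return Entropy(data), nil
}

func main() {
	_, err := EntropyIfAllowed(2*maxEntropySize, func() ([]byte, error) {
		return nil, errors.New("should not be called")
	})
	fmt.Println("large file:", err)

	h, _ := EntropyIfAllowed(2, func() ([]byte, error) { return []byte{0, 1}, nil })
	fmt.Println("small file entropy:", h)
}
```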
Wikidata is playing an increasingly important role in digital preservation.
FileTrove should align its data model with this de facto standard (see https://www.wikidata.org/wiki/Q37787110 and related pages):
file system object (Q37787110)
    part of:
        file system (Q174989)
    has characteristic:
        path (Q817765)
            has part(s):
                path separator (Q64826685)
                filename (Q1144928)
                    has part(s):
                        filename extension (Q186157)
This could be implemented with the following changes:
Session table
    filesystem
    pathseparator
Files and directories tables
    filepath | dirpath
    mountpoint. This is identical to the existing columns filename | dirname
    filename | dirname
    filename | dirname .filenameextension | dirnameextension
This information could also be obtained by parsing the existing "filename" column afterwards, but ftrove has it all to hand on run time and can easily record it. The filename without the path may be particularly useful for tracking files with the same name across several sessions.
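The split into path, name and extension could look roughly like this. The sketch takes the path separator as an explicit argument (matching the proposed pathseparator session column) instead of relying on the host operating system; `PathParts` and `SplitPath` are hypothetical names, and a non-empty separator is assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// PathParts holds the components proposed as separate columns.
type PathParts struct {
	Dir  string // everything before the last separator
	Name string // file or directory name without the path
	Ext  string // extension without the leading dot, may be empty
}

// SplitPath splits p at the last occurrence of sep and derives the extension
// from the name. A leading dot (as in ".bashrc") is not treated as the start
// of an extension.
func SplitPath(p, sep string) PathParts {
	dir, name := "", p
	if i := strings.LastIndex(p, sep); i >= 0 {
		dir, name = p[:i], p[i+len(sep):]
	}
	ext := ""
	if i := strings.LastIndex(name, "."); i > 0 {
		ext = name[i+1:]
	}
	return PathParts{Dir: dir, Name: name, Ext: ext}
}

func main() {
	fmt.Printf("%+v\n", SplitPath(`MyDocuments\Presentations\slides.pdf`, `\`))
	fmt.Printf("%+v\n", SplitPath("/home/user/.bashrc", "/"))
}
```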
A column for the (relative) hierarchy level of each file/directory recorded seems useful.
The level can also later be determined from the number of path separators. But an explicit, numeric level would be much easier to filter or to evaluate when you are just interested in a certain level of detail.
If, for example, an author always organizes their work like
MyDocuments\Presentations\Berlin\2018-02-On-File-Formats
MyDocuments\Presentations\Online\2021-11-Pixelvetica
it may be sufficient to only look at directories on level 4 to get a first impression of works, without the need to deal with any auxiliary files further down the hierarchy.
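A sketch of how such a level could be derived at indexing time, assuming the level is counted relative to the root of the indexed tree and that the path separator is known. The function name `Level` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Level returns the number of non-empty path components in p, so a
// top-level entry has level 1. Leading, trailing or doubled separators do
// not add levels.
func Level(p, sep string) int {
	level := 0
	for _, part := range strings.Split(p, sep) {
		if part != "" {
			level++
		}
	}
	return level
}

func main() {
	fmt.Println(Level(`MyDocuments\Presentations\Berlin\2018-02-On-File-Formats`, `\`))
	fmt.Println(Level(`MyDocuments\Presentations`, `\`))
}
```

With this counting, the two example directories above are on level 4, as described.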
I would like to suggest identifying further basic technical metadata on files and directories, namely
Of course, you have to consider what occurs in all (or at least most) file systems.
Owners and groups probably only make sense in numerical form, and their mapping to readable names must be done externally.
It may make sense to create separate tables for certain file system types and fill them with their specific information.
Describe the solution you'd like
FileTrove should create five random stills from video files. The images should be stored in the database.
Describe the bug
When downloading the nsrl.db file with ftrove, or while indexing, the progress bar flickers in cmd.exe on Windows 10 (and probably other versions).
Apart from the name, there is currently no connection between the files and their directories.
Files and directories could carry the UUID of their (parent) directories in order to formally map the hierarchy.
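A sketch of how the parent reference could be assigned while walking a tree. It assumes that parents are listed before their children (as in a top-down walk) and that IDs are random version 4 UUIDs; `Entry`, `LinkParents` and `randomUUID` are hypothetical names, not FileTrove's data model.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"strings"
)

// Entry is a file or directory with its own ID and the ID of its parent
// directory. ParentID is empty for entries without a recorded parent.
type Entry struct {
	ID       string
	ParentID string
	Path     string
}

// LinkParents assigns an ID to every path and looks up the ID of the parent
// directory by cutting the path at the last separator.
func LinkParents(paths []string, sep string, newID func() string) []Entry {
	ids := make(map[string]string, len(paths))
	entries := make([]Entry, 0, len(paths))
	for _, p := range paths {
		id := newID()
		ids[p] = id
		parent := ""
		if i := strings.LastIndex(p, sep); i >= 0 {
			parent = ids[p[:i]] // empty if the parent was not indexed
		}
		entries = append(entries, Entry{ID: id, ParentID: parent, Path: p})
	}
	return entries
}

// randomUUID returns a random (version 4) UUID string.
func randomUUID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	paths := []string{"docs", "docs/talks", "docs/talks/slides.pdf"}
	for _, e := range LinkParents(paths, "/", randomUUID) {
		fmt.Printf("%s parent=%q %s\n", e.ID, e.ParentID, e.Path)
	}
}
```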
The session metadata should record version information of all tools ("agents") and signature files, databases, ... involved, so that theoretically PREMIS "Events" may be constructed from FileTrove data.
admftrove should have an option to add entries easily from a text file, one hash per line, to an existing database.
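The parsing side of such an import could be sketched as follows. It assumes the hashes are hex-encoded SHA-1 values, as the NSRL database elsewhere in these issues is described as a SHA1 BoltDB; writing to the database is left out because its layout is not described here. `ReadHashes` and `ParseHashes` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// ReadHashes reads one hex-encoded SHA-1 hash per line from r. Blank lines
// are skipped, hashes are lowercased, and a malformed line aborts the import
// with its line number.
func ReadHashes(r io.Reader) ([]string, error) {
	var hashes []string
	sc := bufio.NewScanner(r)
	for n := 1; sc.Scan(); n++ {
		line := strings.ToLower(strings.TrimSpace(sc.Text()))
		if line == "" {
			continue
		}
		raw, err := hex.DecodeString(line)
		if err != nil || len(raw) != 20 {
			return nil, fmt.Errorf("line %d: not a SHA-1 hash: %q", n, line)
		}
		hashes = append(hashes, line)
	}
	return hashes, sc.Err()
}

// ParseHashes is a convenience wrapper for in-memory text.
func ParseHashes(text string) ([]string, error) {
	return ReadHashes(strings.NewReader(text))
}

func main() {
	input := "DA39A3EE5E6B4B0D3255BFEF95601890AFD80709\n\n" +
		"a9993e364706816aba3e25717850c26c9cd0d89d\n"
	hashes, err := ParseHashes(input)
	if err != nil {
		fmt.Println("import failed:", err)
		return
	}
	fmt.Println(len(hashes), "hashes ready to be added")
}
```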
Describe the solution you'd like
FileTrove should have a graphical user interface that gives the same controls and status views as the command line interface.
There are three options at the moment:
i) provide a REST API and provide a web user interface
ii) create a fat client using the fyne library (https://fyne.io/)
iii) create a fat client using another library
It should be possible to resume an indexing run by passing the session uuid.
As file lists are deterministic, FileTrove gets the last file entry from the session and then continues with the next file in the input file list.
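The resume step could be sketched like this, assuming the file list is rebuilt in the same order as in the interrupted run and the last recorded path is read from the session. The function name `Remaining` is hypothetical.

```go
package main

import "fmt"

// Remaining returns the files that still need to be indexed, given the last
// file recorded in the session. If last is not found (for example because
// the session is empty), the whole list is returned.
func Remaining(files []string, last string) []string {
	for i, f := range files {
		if f == last {
			return files[i+1:]
		}
	}
	return files
}

func main() {
	files := []string{"a.txt", "b.txt", "c.txt", "d.txt"}
	fmt.Println(Remaining(files, "b.txt"))
}
```

Falling back to the full list when the last entry is missing is a design choice of this sketch; a real implementation might prefer to report an error in that case, since it suggests the input changed between runs.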
Is your feature request related to a problem? Please describe.
Using Siegfried as the only source for file identification is not a good idea, as it may produce wrong identifications.
Describe the solution you'd like
libmagic can easily be added, and its identification can be cross-checked against Siegfried's (sf's) results.
The "--install" option is documented as "Install FileTrove into the given directory." However, this is somewhat misleading because, for example, "~/bin/ftrove --install /home/myusername/filetrove" does not lead to a functioning installation in the specified directory, but to the error message 'level=ERROR msg="Could not create db directory." error="mkdir /home/myusername/filetrove/db: no such file or directory"'.
"~/bin/ftrove --install ." in an existing directory on the other hand works.
I see two possibilities here:
I think directories should have all the timestamp information the file items already have. (And also those of #27 of course.)
It would be nice if FileTrove or a sibling tool could also output the database contents of a session as PREMIS 3.0 XML or RDF, similar to what Data Accessioner does in PREMIS V2 (see https://cprerc.files.wordpress.com/2016/12/ercm006_file-transfers-using-dataaccessioner.pdf, p. 12, https://www.loc.gov/standards/premis/v3/index.html).
This is of course a big feature request with currently low priority...
I never saw a case where the columns filesfidentnote and filesfidentproof held different information, so maybe either the Siegfried results are mapped incorrectly or one of these columns may be dropped.
v1.0.0-DEV-11 on Arch Linux.
Describe the solution you'd like
FileTrove should be able to export a report listing
Describe the solution you'd like
For images FileTrove should create thumbnails that are stored in the database and can be viewed.
Describe the solution you'd like
FileTrove should index text documents, remove stop words, identify relevant tokens and create a (small) list of words that gives an idea of what the text document is about.
Using Xapian might be an option.
With every version update, even a minor one, older FileTrove SQLite databases are not compatible with new ftrove versions, even though not every update changes the database schema. This gives a lot of freedom in development.
On the other hand, a new database must be created with every new release.
To avoid this, an upgrade path should be defined for every release and be easy to execute.
Therefore admftrove should have
i) all paths
ii) sql updates
iii) and file updates
as update function.
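One possible shape for such an upgrade path is a chain of steps keyed by the schema version they start from. The sketch below only plans the chain and does not touch a database; the integer schema versions, the `Migration` type, the `Plan` function and the step descriptions are all assumptions made for illustration.

```go
package main

import "fmt"

// Migration describes one upgrade step from a schema version to a newer one.
type Migration struct {
	To          int    // schema version reached after this step
	Description string // what the step does (SQL updates, file updates, ...)
}

// Plan returns the ordered steps needed to get from current to target.
// steps maps a starting schema version to the migration that leaves it.
func Plan(steps map[int]Migration, current, target int) ([]Migration, error) {
	var plan []Migration
	for v := current; v < target; {
		m, ok := steps[v]
		if !ok {
			return nil, fmt.Errorf("no upgrade path from schema version %d", v)
		}
		if m.To <= v {
			return nil, fmt.Errorf("migration from version %d does not advance", v)
		}
		plan = append(plan, m)
		v = m.To
	}
	return plan, nil
}

func main() {
	// Hypothetical steps, not taken from any FileTrove release.
	steps := map[int]Migration{
		1: {To: 2, Description: "sql update: add a column"},
		2: {To: 3, Description: "file update: replace a signature file"},
	}
	plan, err := Plan(steps, 1, 3)
	if err != nil {
		fmt.Println("upgrade not possible:", err)
		return
	}
	for _, m := range plan {
		fmt.Printf("-> version %d: %s\n", m.To, m.Description)
	}
}
```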
Is your feature request related to a problem? Please describe.
Download from archive.org is pretty slow. The NSRL SHA1 BoltDB should be hosted on a faster server.