filetrove's People

Contributors

dependabot[bot], steffenfritz

Forkers

avary riffus

filetrove's Issues

[CHANGE] Timestamps should not depend on the User's Regional Settings

Currently the timestamps are dependent on the regional settings of the user executing ftrove.

The identical nsrl.db appears, for example, with filectime "2024-02-12 11:25:11.280858018 -0400 AST" (time zone Bermuda) or "2024-02-12 16:25:11.280858018 +0100 CET" (time zone Berlin) - at least in an export (did not look into the DB directly).

For a correct numerical evaluation, all timestamps in all sessions should have a uniform basis.

If necessary, a time zone option can be introduced to give the user a choice (perhaps including the value "auto" or "local" to restore the current behavior).
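
A minimal sketch of the uniform basis, assuming timestamps are normalized to UTC and written in RFC 3339 form before they are stored. The function names normalizeTimestamp and parseAndNormalize are hypothetical and not part of ftrove; the two inputs are the Bermuda and Berlin values from above rewritten in RFC 3339 notation.

```go
package main

import (
	"fmt"
	"time"
)

// normalizeTimestamp converts a timestamp to UTC and renders it in
// RFC 3339 with nanosecond precision, so the stored string no longer
// depends on the time zone of the machine running the scan.
// (Hypothetical helper, not part of ftrove.)
func normalizeTimestamp(t time.Time) string {
	return t.UTC().Format(time.RFC3339Nano)
}

// parseAndNormalize parses an RFC 3339 timestamp with any UTC offset
// and returns its UTC representation.
func parseAndNormalize(s string) (string, error) {
	t, err := time.Parse(time.RFC3339Nano, s)
	if err != nil {
		return "", err
	}
	return normalizeTimestamp(t), nil
}

func main() {
	// The same instant as recorded in Bermuda (-04:00) and Berlin (+01:00).
	inputs := []string{
		"2024-02-12T11:25:11.280858018-04:00",
		"2024-02-12T16:25:11.280858018+01:00",
	}
	for _, s := range inputs {
		out, err := parseAndNormalize(s)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(out) // both lines print 2024-02-12T15:25:11.280858018Z
	}
}
```

A "local" setting could keep the current behavior by skipping the UTC() call.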

[CHANGE] NSRL percentage after run

via e-mail:

"Idea/Feature request: NSRL-Score
If 'easily' possible: one more INFO-message at the end of the FileTrove run that shows how many of the indexed files were found in the NSRL. That might help the archivists to evaluate the data/workload on a quick view. It could be a percentage value (30% of the files were found in the NSRL database) or a score (NSRL-Score = 3)."
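
A minimal sketch of the percentage variant, assuming ftrove already knows how many files were indexed and how many of them matched the NSRL. The function names and the message wording are hypothetical.

```go
package main

import "fmt"

// nsrlShare returns the percentage of indexed files that were found in
// the NSRL. It returns 0 when nothing was indexed to avoid dividing by zero.
func nsrlShare(indexed, found int) float64 {
	if indexed <= 0 {
		return 0
	}
	return float64(found) * 100 / float64(indexed)
}

// nsrlSummary builds the closing INFO message (wording is an assumption).
func nsrlSummary(indexed, found int) string {
	return fmt.Sprintf("%d of %d indexed files (%.1f%%) were found in the NSRL",
		found, indexed, nsrlShare(indexed, found))
}

func main() {
	// Example numbers only; they are not taken from a real run.
	fmt.Println(nsrlSummary(1200, 360))
}
```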

[CHANGE] Make entropy calculation optional

via e-mail:

"I had a folder with some larger files (third run 'titzmann'). First time, I thought ftrove freezed because progress bar and time (right, bottom) didn't move anymore
Might it be an option to skip 'Calculating entropy' or to not even try to calculate the entropy (already knowing it is more than 'max allowed: 1073741824')"
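
A minimal sketch of how the check could run before any data is read, assuming a hypothetical skipEntropy switch and the 1073741824-byte limit quoted above. Whether the real limit is inclusive is an assumption, and shannonEntropy is a generic byte-frequency implementation, not necessarily the one ftrove uses.

```go
package main

import (
	"fmt"
	"math"
)

// maxEntropySize mirrors the limit quoted in the report (1 GiB).
const maxEntropySize int64 = 1073741824

// shannonEntropy returns the Shannon entropy of data in bits per byte,
// a value between 0 (one repeated byte) and 8 (uniformly random bytes).
func shannonEntropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	n := float64(len(data))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// shouldComputeEntropy decides from the file size alone, so an oversized
// file is never opened for the entropy pass. skipEntropy stands for a
// hypothetical command line switch.
func shouldComputeEntropy(size int64, skipEntropy bool) bool {
	return !skipEntropy && size <= maxEntropySize
}

func main() {
	data := []byte("aabb")
	if shouldComputeEntropy(int64(len(data)), false) {
		fmt.Printf("entropy: %.2f bits/byte\n", shannonEntropy(data))
	}
	// A file larger than the limit is skipped without reading it.
	fmt.Println(shouldComputeEntropy(2*maxEntropySize, false))
}
```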

[CHANGE] Align the Data Model with Wikidata Items like "file system object"

Wikidata is playing an increasingly important role in digital preservation.

FileTrove should align its data model with this de facto standard (see https://www.wikidata.org/wiki/Q37787110 and related pages):

file system object (Q37787110)
    part of:
        file system (Q174989)
    has characteristic:
        path (Q817765)
            has part(s):
                path separator (Q64826685)
                filename (Q1144928)
                    has part(s):
                        filename extension (Q186157)

This could be implemented with the following changes:

Session table

  • New Column: filesystem
    • WD description: "concrete format or program for storing files and directories on a data storage device" (https://www.wikidata.org/wiki/Q174989)
    • Example value(s): "ntfs", "ext4", "fat32"
    • Remarks: The Wikidata description may not be the best possible... Need a controlled vocabulary for FS values.
  • New Column: pathseparator

Files and directories tables

  • New Column: filepath | dirpath
    • WD description: "general form of the name of a file or directory; resources can be represented by either absolute or relative paths" (https://www.wikidata.org/wiki/Q817765)
    • Example value(s): "logs/filetrove.log"
    • Remarks: This should be relative to the session's mountpoint. This is identical to the existing columns filename | dirname
  • New Column: filename | dirname
    • WD description: "text string used to uniquely identify a computer file" (https://www.wikidata.org/wiki/Q1144928)
    • Example value(s): "filetrove.log"
    • Remarks: We should add to the description: "Without the leading path". This is different from the existing columns filename | dirname.
  • New Column: filenameextension | dirnameextension
    • WD description: "suffix to the name of a computer file" (https://www.wikidata.org/wiki/Q186157)
    • Example value(s): "log"
    • Remarks: Of course, the extension is only a weak indicator of the file format, but it does play a role. Less common for directories, but still may be useful.

This information could also be obtained by parsing the existing "filename" column afterwards, but ftrove has it all to hand at run time and can easily record it. The filename without the path may be particularly useful for tracking files with the same name across several sessions.
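
A minimal sketch of deriving the proposed values at run time with Go's path/filepath package. The struct and function names are hypothetical, and the fields only mirror the column proposal above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pathParts holds the values proposed for the new columns.
// (Hypothetical type, not part of ftrove.)
type pathParts struct {
	Path      string // path relative to the session's mountpoint
	Name      string // last path element, without the leading path
	Extension string // suffix after the final dot, without the dot
}

// splitPath derives filename and extension from a relative path.
func splitPath(rel string) pathParts {
	name := filepath.Base(rel)
	ext := strings.TrimPrefix(filepath.Ext(name), ".")
	return pathParts{Path: rel, Name: name, Extension: ext}
}

func main() {
	p := splitPath(filepath.Join("logs", "filetrove.log"))
	fmt.Printf("%+v\n", p)
	// The separator of the scanning platform could fill the session's
	// pathseparator column.
	fmt.Printf("path separator: %q\n", string(filepath.Separator))
}
```

Note that filepath.Ext treats a name with only a leading dot, such as ".bashrc", as consisting entirely of an extension, so dotfiles may need special handling.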

[CHANGE] Add a Column for File/Directory Hierarchy Level

A column for the (relative) hierarchy level of each file/directory recorded seems useful.

The level can also later be determined from the number of path separators. But an explicit, numeric level would be much easier to filter or to evaluate when you are just interested in a certain level of detail.

If, for example, an author always organizes their work like

MyDocuments\Presentations\Berlin\2018-02-On-File-Formats
MyDocuments\Presentations\Online\2021-11-Pixelvetica

it may be sufficient to only look at directories on level 4 to get a first impression of works, without the need to deal with any auxiliary files further down the hierarchy.
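
A minimal sketch of the level calculation, assuming the level is the 1-based count of non-empty path components relative to the scan root. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hierarchyLevel returns the 1-based depth of a relative path for the
// given separator. Empty components and "." are ignored, so leading or
// trailing separators do not change the result.
func hierarchyLevel(rel, sep string) int {
	level := 0
	for _, part := range strings.Split(rel, sep) {
		if part != "" && part != "." {
			level++
		}
	}
	return level
}

func main() {
	p := `MyDocuments\Presentations\Berlin\2018-02-On-File-Formats`
	fmt.Println(hierarchyLevel(p, `\`)) // prints 4
}
```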

[CHANGE] Add more basic technical Metadata to Files and Directories

I would like to suggest identifying further basic technical metadata on files and directories, namely

  • Owner
  • Group
  • Access permissions
  • ACLs, if applicable.

Of course, you have to consider what occurs in all (or at least most) file systems.

Owners and groups probably only make sense in numerical form, and their mapping to readable names must be done externally.

It may make sense to create separate tables for certain file system types and fill them with their specific information.
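
A minimal sketch of reading numeric owner, group and permission bits in Go. It assumes a Unix-like system: the syscall.Stat_t assertion only compiles and succeeds there, ACLs are not covered, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// ownership holds the numeric owner and group plus the permission string.
// (Hypothetical type, not part of ftrove.)
type ownership struct {
	UID  uint32
	GID  uint32
	Mode string // for example "-rw-r--r--" or "drwxr-xr-x"
}

// readOwnership collects the metadata without following symlinks.
func readOwnership(path string) (ownership, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return ownership{}, err
	}
	o := ownership{Mode: fi.Mode().String()}
	// fi.Sys() is a *syscall.Stat_t on Unix-like systems only; on other
	// platforms a separate implementation would be needed.
	if st, ok := fi.Sys().(*syscall.Stat_t); ok {
		o.UID, o.GID = st.Uid, st.Gid
	}
	return o, nil
}

func main() {
	o, err := readOwnership(".")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("uid=%d gid=%d mode=%s\n", o.UID, o.GID, o.Mode)
}
```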

[CHANGE] Add Parent UUID to Files and Directories

Apart from the name, there is currently no connection between the files and their directories.

Files and directories could carry the UUID of their (parent) directories in order to formally map the hierarchy.
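
A minimal sketch of the linkage, assuming entries are visited in walk order so that every directory is seen before its contents. The entry type and both functions are hypothetical, paths are slash-separated for simplicity, and the random UUID generator is a stand-in for whatever UUID source ftrove uses.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"path"
)

// entry is a simplified file or directory record. (Hypothetical type.)
type entry struct {
	UUID       string
	ParentUUID string // empty for entries directly below the scan root
	Path       string
}

// newUUID returns a random (version 4) UUID string.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// linkParents assigns a UUID to every path and looks up the UUID of its
// parent directory. Paths must be relative and ordered so that a
// directory precedes everything inside it.
func linkParents(paths []string) ([]entry, error) {
	byPath := make(map[string]string, len(paths))
	entries := make([]entry, 0, len(paths))
	for _, p := range paths {
		id, err := newUUID()
		if err != nil {
			return nil, err
		}
		byPath[p] = id
		entries = append(entries, entry{UUID: id, ParentUUID: byPath[path.Dir(p)], Path: p})
	}
	return entries, nil
}

func main() {
	entries, err := linkParents([]string{"docs", "docs/a.txt", "docs/sub", "docs/sub/b.txt"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("%-15s uuid=%s parent=%s\n", e.Path, e.UUID, e.ParentUUID)
	}
}
```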

[CHANGE] Add Version Infos to Session Metadata

The session metadata should record version information of all tools ("agents") and signature files, databases, ... involved, so that theoretically PREMIS "Events" may be constructed from FileTrove data.

[CHANGE] Create a graphical user interface

Describe the solution you'd like
FileTrove should have a graphical user interface that gives the same controls and status views as the command line interface.

There are three options at the moment:

i) provide a REST API and provide a web user interface
ii) create a fat client using the fyne library (https://fyne.io/)
iii) create a fat client using another library

[CHANGE] Add resume option

It should be possible to resume an indexing run by passing the session uuid.

As file lists are deterministic, FileTrove gets the last file entry from the session and then continues with the next file in the input file list.
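
A minimal sketch of the resume step, assuming the input list is rebuilt in the same order as in the interrupted run and the last recorded path has already been read from the session. The function name is hypothetical.

```go
package main

import "fmt"

// remainingFiles returns the part of the input list that follows the
// last indexed file. An empty last value means nothing was indexed yet.
// An unknown last value is reported as an error, because it suggests
// the file list has changed since the interrupted run.
func remainingFiles(files []string, last string) ([]string, error) {
	if last == "" {
		return files, nil
	}
	for i, f := range files {
		if f == last {
			return files[i+1:], nil
		}
	}
	return nil, fmt.Errorf("last indexed file %q not found in the input list", last)
}

func main() {
	files := []string{"a.txt", "b.txt", "c.txt", "d.txt"}
	rest, err := remainingFiles(files, "b.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rest) // prints [c.txt d.txt]
}
```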

[CHANGE] Add libmagic for cross check with siegfried

Is your feature request related to a problem? Please describe.
Using siegfried as the only source for file identification is not a good idea, as there might be wrong identifications.

Describe the solution you'd like
libmagic can be easily added and its identification can be cross-checked with sf's results.

[BUG] Streamline the "--install" option

The "--install" option is documented as "Install FileTrove into the given directory." However, this is somewhat misleading because, for example, "~/bin/ftrove --install /home/myusername/filetrove" does not lead to a functioning installation in the specified directory, but to the error message 'level=ERROR msg="Could not create db directory." error="mkdir /home/myusername/filetrove/db: no such file or directory"'.

"~/bin/ftrove --install ." in an existing directory on the other hand works.

I see two possibilities here:

  1. Make it work
  2. Change the help text to "Install FileTrove helper files into the current directory.", remove the string parameter and default to the current directory.
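
A minimal sketch of the first possibility. The error message above is consistent with a plain os.Mkdir call on a path whose parent does not exist yet, but that is an assumption about the implementation. os.MkdirAll creates the missing parents as well. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installDirs creates the target directory and its db subdirectory,
// including any missing parent directories, and returns the db path.
func installDirs(target string) (string, error) {
	dbDir := filepath.Join(target, "db")
	// os.Mkdir fails with "no such file or directory" when target does
	// not exist; os.MkdirAll creates the whole chain and is a no-op for
	// directories that already exist.
	if err := os.MkdirAll(dbDir, 0o755); err != nil {
		return "", fmt.Errorf("could not create db directory: %w", err)
	}
	return dbDir, nil
}

// installIntoTemp runs installDirs below a fresh temporary directory and
// returns the name of the created db directory. It only serves the demo.
func installIntoTemp() (string, error) {
	base, err := os.MkdirTemp("", "ftrove-install-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(base)

	// "filetrove" does not exist below base yet.
	dbDir, err := installDirs(filepath.Join(base, "filetrove"))
	if err != nil {
		return "", err
	}
	if fi, err := os.Stat(dbDir); err != nil || !fi.IsDir() {
		return "", fmt.Errorf("%s was not created", dbDir)
	}
	return filepath.Base(dbDir), nil
}

func main() {
	name, err := installIntoTemp()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("created directory:", name)
}
```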

[CHANGE] Text document indexing

Describe the solution you'd like
FileTrove should index text documents, remove stop words, identify relevant tokens and create a (small) list of words that gives an idea of what the text document is about.

Using Xapian might be an option.
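
A minimal sketch of the pipeline (tokenize, drop stop words, rank by frequency) using only the Go standard library. It does not use Xapian; the stop word list is a tiny illustrative sample and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// stopWords is a deliberately small sample list; a real list would be
// language specific and much longer.
var stopWords = map[string]bool{
	"the": true, "and": true, "are": true, "for": true,
	"that": true, "this": true, "with": true,
}

// topKeywords returns up to k tokens ranked by frequency. Tokens shorter
// than three bytes and stop words are dropped; ties are broken
// alphabetically so the result is deterministic.
func topKeywords(text string, k int) []string {
	if k < 0 {
		k = 0
	}
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	counts := make(map[string]int)
	for _, t := range tokens {
		if len(t) < 3 || stopWords[t] {
			continue
		}
		counts[t]++
	}
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]]
		}
		return words[i] < words[j]
	})
	if len(words) > k {
		words = words[:k]
	}
	return words
}

func main() {
	text := "The archive stores files. Files in the archive are indexed, and the index describes files."
	fmt.Println(topKeywords(text, 2)) // prints [files archive]
}
```

Plain term frequency is the simplest possible ranking; a weighting scheme that compares against other documents would likely pick more distinctive words.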

[CHANGE] Add Update functionality to admftrove

With every version update, even a minor one, older FileTrove SQLite databases become incompatible with the new ftrove version. Not every update changes the database schema, but this policy gives a lot of freedom in development.

On the other hand, a new database must be created with every new release.

To avoid this, an upgrade path should be defined for every release and be easy to execute.

Therefore admftrove should have

i) all paths
ii) sql updates
iii) and file updates

as update function.
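
A minimal sketch of how upgrade paths could be chained, assuming each release ships one migration from its predecessor. The version numbers, table names and SQL statements are invented for illustration, and running the statements against SQLite is left out.

```go
package main

import "fmt"

// migration describes one upgrade step between two releases.
// (Hypothetical type, not part of admftrove.)
type migration struct {
	From, To string
	SQL      []string
}

// migrations lists the known steps. All values are made-up examples.
var migrations = []migration{
	{From: "1.0.0", To: "1.1.0", SQL: []string{"ALTER TABLE files ADD COLUMN hierarchylevel INTEGER"}},
	{From: "1.1.0", To: "1.2.0", SQL: []string{"ALTER TABLE sessions ADD COLUMN filesystem TEXT"}},
}

// planUpgrade chains single steps until the target version is reached.
// It fails when a step is missing or when the chain loops.
func planUpgrade(from, to string) ([]migration, error) {
	var plan []migration
	current := from
	for current != to {
		if len(plan) >= len(migrations) {
			return nil, fmt.Errorf("upgrade path from %s to %s does not terminate", from, to)
		}
		found := false
		for _, m := range migrations {
			if m.From == current {
				plan = append(plan, m)
				current = m.To
				found = true
				break
			}
		}
		if !found {
			return nil, fmt.Errorf("no upgrade path from %s to %s", current, to)
		}
	}
	return plan, nil
}

func main() {
	plan, err := planUpgrade("1.0.0", "1.2.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range plan {
		fmt.Printf("%s -> %s: %v\n", m.From, m.To, m.SQL)
	}
}
```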
