rail5 / polonius

A text editor for very large files

License: GNU General Public License v3.0

Makefile 1.93% C++ 97.26% C 0.81%

editor large-files low-ram text-editor

polonius's Introduction

polonius

A uniquely memory-efficient and modular text editor

Therefore, since brevity is the soul of wit

And tediousness the limbs and outward flourishes,

I will be brief.

– Hamlet, Act II, Scene II

About

Polonius can be used to edit files of any size (up to just over 8 million terabytes) on systems with as little as a few kilobytes of available memory.

In order to achieve this, it never loads any more data into RAM than is currently being used. All that we have to keep in memory is the part of the file that's currently being displayed, plus a list of the changes the user wants to make.
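As a rough illustration of the idea (not Polonius's actual code), reading only one block of a file might look like the sketch below. The function name read_block and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: read at most `block_size` bytes starting at `offset`.
// Only this block is ever held in memory, regardless of the file's size.
std::string read_block(const std::string& path, std::uint64_t offset, std::size_t block_size) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return {};  // could not open the file
    }
    file.seekg(static_cast<std::streamoff>(offset));
    std::string block(block_size, '\0');
    file.read(&block[0], static_cast<std::streamsize>(block_size));
    // Shrink to the number of bytes actually read (fewer near end of file).
    block.resize(static_cast<std::size_t>(file.gcount()));
    return block;
}
```

Memory use is bounded by block_size rather than by the size of the file.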

Most text editors function by:

(1) loading the entire contents of a file into RAM,
(2) making changes to that portion of RAM,
and then (3) writing that portion of RAM back to the disk.

There's nothing wrong with this method -- but it does limit you to how much you can load into RAM at any one time. Try editing a 100GB file in a normal text editor!

Polonius is made up of separate binary modules:

The "reader", which outputs a selected portion of the contents of a file
The "editor", which interprets editing instructions (replace, insert, and remove) and makes the requested changes to the file
And the interactive UI, which ties together the functionality of the other modules.

See the wiki for full documentation. Manual pages are also available in the Debian packages.

Development Progress

Development is done on Debian GNU/Linux. Builds are also tested on OpenBSD. Releases will be provided for amd64, i386, and arm64 architectures.

polonius-reader:

polonius-editor:

CLI (polonius):

GUI (polonius-gui):

polonius's Issues

polonius-editor: No support for Unicode or Multiple Byte Character Sets

At the moment, Polonius treats files as binary files and treats each character as being 1 byte. Some character sets use multiple bytes per character. The Chinese character "字", for example, is 3 bytes in UTF-8.

If the user runs the following instruction:
REPLACE 0 字

On a file which initially contains:
hello world

The output will be:
字lo world

Because "字" is 3 bytes long, it replaces "h", "e" and "l" rather than just the "h", as the user might expect.
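A minimal sketch of how byte length and character count differ, assuming the text is UTF-8 encoded. The helper names are hypothetical and nothing here is taken from the Polonius source.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that begins with `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count characters (code points) instead of bytes by skipping
// continuation bytes, which always look like 10xxxxxx.
std::size_t utf8_character_count(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

With helpers like these, a REPLACE could work out how many bytes of the original file correspond to one character of input.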

polonius-editor: Improve sanity of instruction sanity checks

Currently instructions are initially verified to be sane by the following process:

(1) "Instruction" object is instantiated
(2) Basic sanity checks (Has a proper instruction type, values are properly set etc)
(3) "Instruction" object is passed to the "File" object
(4) The "File" object does one final check to make sure the end position is within range of the file

This is not at all a good procedure; the sanity checks should all happen before the Instruction object is passed to the file

Possible solutions:

(1) Could measure the file size before instantiating the File object
(2) Could add a File object pointer as a private member of the Instruction class, so that the Instruction object can check end positions against the file size (in this case we would have to instantiate the File object before parsing instructions)

Solution 2 seems preferable

polonius-editor: Potential race condition while editing files

Because Polonius edits files sequentially rather than all-at-once as most text editors do, there is potential for race conditions while Polonius is in the process of editing a file (especially when editing large files with a small block size). Another program, or another instance of Polonius, could try to make changes to the file before Polonius is finished.

polonius-editor: Should check for write permissions before beginning

Attempting to create a file in a directory where the user doesn't have write permissions, or attempting to edit a file which the user doesn't have write permissions for, results in:

terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: cannot get file size: No such file or directory
Aborted

Write permissions should be checked up front, and the failure reported cleanly, rather than letting the exception terminate the program.
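One possible shape for such a check, sketched with the POSIX access() call. The function name can_write_to is hypothetical, and the sketch assumes a POSIX system with C++17.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <unistd.h>

// Hypothetical pre-flight check: true if `path` can be written to,
// or, when it does not exist yet, if it can be created in its directory.
bool can_write_to(const std::string& path) {
    namespace fs = std::filesystem;
    std::error_code ec;  // use the non-throwing overload of exists()
    const fs::path target(path);
    if (fs::exists(target, ec)) {
        return access(target.c_str(), W_OK) == 0;
    }
    // Creating a file needs write and search permission on the parent directory.
    fs::path parent = target.parent_path();
    if (parent.empty()) {
        parent = ".";
    }
    return access(parent.c_str(), W_OK | X_OK) == 0;
}
```

A check like this is only advisory, since permissions can change between the check and the edit, so the later file operations would still need error handling.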

polonius-editor: Multi-character 'end' REPLACEs

An instruction like REPLACE end ab will always fail. Running a REPLACE instruction with the END keyword currently handles only 1-character input, as in:

File contains abc123
User runs REPLACE END z
File now contains abc12z

Multi-character input ought to be possible, resulting in:

File contains abc123
User runs REPLACE END xyz
File now contains abcxyz
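The desired behaviour amounts to computing the start position as the file length minus the length of the replacement text. Here is a small in-memory sketch of that arithmetic. The function replace_at_end is hypothetical and works on a string rather than on a file.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Model of "REPLACE END <text>": overwrite the last text.size() bytes.
// Returns std::nullopt if the text is longer than the contents.
std::optional<std::string> replace_at_end(std::string contents, const std::string& text) {
    if (text.size() > contents.size()) {
        return std::nullopt;
    }
    const std::size_t start = contents.size() - text.size();
    contents.replace(start, text.size(), text);
    return contents;
}
```

For abc123 and xyz, the start position works out to 6 - 3 = 3, so the last three bytes are overwritten.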

polonius-editor: Inserting '\x00' does not insert a NUL byte into the file

For example, "INSERT 0 hello\x00world" results in a file with "helloworld" and no NUL byte in between.
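If NUL insertion were to be supported, the escape sequence would need to be decoded into a container that tolerates embedded NUL bytes, such as std::string with an explicit length. This is only a sketch of that idea with a hypothetical function name. It does not describe how polonius-editor currently parses its -c special characters.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Decode "\xHH" escapes into raw bytes. std::string tracks its own
// length, so a decoded 0x00 byte is kept rather than ending the string.
std::string unescape_hex(const std::string& input) {
    std::string out;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const bool is_escape =
            input[i] == '\\' && i + 3 < input.size() && input[i + 1] == 'x' &&
            std::isxdigit(static_cast<unsigned char>(input[i + 2])) &&
            std::isxdigit(static_cast<unsigned char>(input[i + 3]));
        if (is_escape) {
            const int value = std::stoi(input.substr(i + 2, 2), nullptr, 16);
            out.push_back(static_cast<char>(value));
            i += 3;  // skip the 'x' and the two hex digits
        } else {
            out.push_back(input[i]);
        }
    }
    return out;
}
```

The decoded bytes would then have to be written with a length-aware call (for example write() with out.size()), since anything that treats the buffer as a C string stops at the first NUL.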

Uncertain whether this should be considered a bug, or whether it should be possible for polonius to insert a NUL byte.

File_exists() fails on files >2GB on 32-bit systems

On 32-bit systems, file_exists() as it's currently written always returns false on files >2GB

This is apparently due to the syscall stat() failing with the following error (reported through errno):

EOVERFLOW
pathname or fd refers to a file whose size, inode number,
or number of blocks cannot be represented in,
respectively, the types off_t, ino_t, or blkcnt_t. This
error can occur when, for example, an application compiled
on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls
stat() on a file whose size exceeds (1<<31)-1 bytes.

See https://man7.org/linux/man-pages/man2/lstat.2.html

Options:

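(1) Could switch to using the access() syscall
(2) Could change the following line of code:
return (stat (name.c_str(), &buffer) == 0);
so that a failure with errno set to EOVERFLOW also counts as the file existing (stat() returns -1 on error and reports EOVERFLOW through errno, not through its return value)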

Very much prefer switching to access() for code readability's sake
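A minimal sketch of the access()-based variant, assuming a POSIX system. The name file_exists matches the function mentioned above, but the body is illustrative rather than the project's code.

```cpp
#include <string>
#include <unistd.h>

// F_OK asks only whether the path exists. No struct stat is filled in,
// so there is no size field for a large file to overflow.
bool file_exists(const std::string& name) {
    return access(name.c_str(), F_OK) == 0;
}
```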

polonius-reader: 'Position' output should be pre-formatted to be accepted by polonius-editor

Currently, when polonius-reader outputs positions (-p / --output-pos), it uses the comma-delimited format start-position,end-position, for example 0,5

However, polonius-editor only accepts space-delimited input.

Either:

(1) polonius-editor should be made to also accept comma-delimited input
(2) polonius-reader should output position data in a space-delimited format

I'm in favor of option 2. Adding options for different delimiters seems like more headache than it's worth, for users and for me

polonius-reader: --output-pos "End Position" off by 1

The End Position reported by the -p option is always 1 character beyond the actual end position

I.e., in a file which contains:

abc123

The command polonius-reader ./file -f "a" -p should return: 0 0 (start position 0, end position 0)

In fact, it returns: 0 1

If the output of this function were to be piped into polonius-editor, as in the following example:

#!/bin/bash

POSITION_OF_THE_LETTER_A="$(polonius-reader ./file -f "a" -p)"

polonius-editor ./file -a "REMOVE $POSITION_OF_THE_LETTER_A"

This would evaluate to: polonius-editor ./file -a "REMOVE 0 1" (where it should say 0 0 instead)

Polonius-editor would, in this case, delete the characters 'a' and 'b' from our example file (Character 0 and character 1)
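The intended output can be sketched as an inclusive end position computed from the match start and length, printed space-delimited so that polonius-editor can consume it. The function format_position is hypothetical.

```cpp
#include <string>

// End position is inclusive: a 1-byte match at position 0 spans "0 0".
// Assumes match_length >= 1.
std::string format_position(long long match_start, long long match_length) {
    const long long match_end = match_start + match_length - 1;
    return std::to_string(match_start) + " " + std::to_string(match_end);
}
```

With this convention, the REMOVE in the script above would receive 0 0 and delete only the 'a'.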

polonius-editor: Add "end" keyword

Polonius-editor instructions should be able to parse an "end" keyword which would point to the end of the file

Example: INSERT END hello world or REMOVE 10 END

polonius-editor: INSERT deletes last character from file IF the last edit to the file was made by polonius-editor

This is very mysterious

polonius-editor's "insert" function will remove the last character from the input file, IF and only if the most recent edit to that file was made by polonius-editor

Example:
Original file:
123456
Instruction:
$ polonius-editor -i ./testfile -a "INSERT 3 abc"
SHOULD produce:
123abc456
But, thanks to issue #1, it in fact produces:
123abc450

THEN, if we run the SAME COMMAND a second time, we would EXPECT to see:
123abcabc450
BUT in fact we see:
123abcabc40
It deleted the ending '0' (and thanks to issue #1 replaced the now-ending '5' with a new '0')

HOWEVER, if after we run the first edit (which gets us "123abc450"),
we now open the file in Gedit (or some other text editor), and re-save the file without making any other changes,
And then we run the same polonius-editor command a second time:
$ polonius-editor -i ./testfile -a "INSERT 3 abc"
This now produces:
123abcabc450
As we would expect.

What?

The exact same problem happens in the following sequence:

Original contents of file:
123456

Command:
$ polonius-editor -i ./testfile -a "INSERT 3 abc"
New file contents:
123abc450

Second command:
$ polonius-editor -i ./testfile -a "REPLACE 8 6"
New file contents:
123abc456

Third command:
$ polonius-editor -i ./testfile -a "INSERT 3 abc"
New file contents:
123abcabc40

So even if we interpose a "replace" instruction between the inserts, the same problem occurs.

But again, if we alter that sequence so that the edit immediately previous to the final "insert" was made by some other program:

Original contents of file:
123456

Command:
$ polonius-editor -i ./testfile -a "INSERT 3 abc"
New file contents:
123abc450

Second command:
$ polonius-editor -i ./testfile -a "REPLACE 8 6"
New file contents:
123abc456

Interjection: Open up ./testfile in some other text editor (e.g. Gedit), make no changes, and re-save the document

Third command:
$ polonius-editor -i ./testfile -a "INSERT 3 abc"
New file contents:
123abcabc450

And the last character is not deleted. (But, of course, issue #1 is still at play.)

This is giving me a headache. I need to re-write the insert function from scratch

polonius-editor: Strange bug with removes

Encountered randomly when editing a 500,000-line file. REMOVE (pos) end did not work correctly. Will examine more closely & define more precisely later when I have time.

polonius-editor: wish-list: Add units interpretation to --block-size / -b

-b should be able to take arguments such as: -b 16M or -b 10K
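A sketch of how such a parser could look, including guards for single-character input and for overflow. The function name parse_block_size is hypothetical, and treating K, M and G as powers of 1024 is an assumption since the issue does not say which is intended.

```cpp
#include <cctype>
#include <limits>
#include <optional>
#include <string>

// Parse "4096", "10K", "16M" or "2G" into a byte count.
// Returns std::nullopt on malformed input or on overflow.
std::optional<long long> parse_block_size(const std::string& input) {
    if (input.empty()) return std::nullopt;
    const long long max = std::numeric_limits<long long>::max();
    long long multiplier = 1;
    std::string digits = input;
    const unsigned char last = static_cast<unsigned char>(input.back());
    if (!std::isdigit(last)) {
        switch (std::toupper(last)) {
            case 'K': multiplier = 1024LL; break;                    // assumed 1024-based
            case 'M': multiplier = 1024LL * 1024; break;
            case 'G': multiplier = 1024LL * 1024 * 1024; break;
            default: return std::nullopt;                            // unknown unit
        }
        digits.pop_back();
    }
    if (digits.empty()) return std::nullopt;  // input was only a unit, e.g. "K"
    long long value = 0;
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return std::nullopt;
        const int digit = c - '0';
        if (value > (max - digit) / 10) return std::nullopt;  // would overflow
        value = value * 10 + digit;
    }
    if (value > max / multiplier) return std::nullopt;  // would overflow
    return value * multiplier;
}
```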

polonius-reader: Fails to output contents if size of contents in bytes > maximum value of a signed 32-bit integer

terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create
Aborted

parse_block_units fails with single-char input

polonius-editor: "Instruction Optimization"

REPLACE instructions are always preferred to INSERTs and REMOVEs, seeing as they're much faster

If a user types, for example:

-a "REMOVE 0 0"
-a "INSERT 0 a"

Polonius should be able to detect that this could be optimized into a single REPLACE instruction:

-a "REPLACE 0 a"

This would be major, especially for very large files
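A sketch of one such optimization pass, under the assumption that a REMOVE followed directly by an INSERT at the same start position, with inserted text exactly as long as the removed range, is equivalent to a REPLACE. The Instruction struct and function name here are hypothetical and simplified.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class InstructionType { Insert, Remove, Replace };

// Simplified, hypothetical instruction record.
struct Instruction {
    InstructionType type;
    long long start;
    long long end;     // inclusive; only meaningful for Remove
    std::string text;  // only meaningful for Insert and Replace
};

// Merge "REMOVE a b" + "INSERT a text" into "REPLACE a text" when the
// inserted text has the same length as the removed range.
std::vector<Instruction> merge_remove_insert(const std::vector<Instruction>& input) {
    std::vector<Instruction> output;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const Instruction& current = input[i];
        if (current.type == InstructionType::Remove && i + 1 < input.size()) {
            const Instruction& next = input[i + 1];
            const long long removed = current.end - current.start + 1;
            if (next.type == InstructionType::Insert && next.start == current.start &&
                static_cast<long long>(next.text.size()) == removed) {
                output.push_back({InstructionType::Replace, current.start, current.end, next.text});
                ++i;  // the Insert has been consumed as well
                continue;
            }
        }
        output.push_back(current);
    }
    return output;
}
```

Pairs that change the file length are left untouched, since they cannot be expressed as a single REPLACE.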

polonius-reader: Add regex search option

polonius-editor: INSERTs before EOF fail if file size in bytes > max value of signed 32-bit integer

In the event that:

We are editing a file whose size in bytes is larger than the maximum value of a signed 32-bit integer,
We are running an "INSERT" instruction which is inserting to any position prior to EOF (ie, not simply appending to the end of the file),

polonius-editor hangs for a minute like it's doing something, and then the insert is not made

Replaces & inserts to the end of the file still work fine.

Previously, file sizes, start positions, etc. were stored as ints. They are now stored as long long ints, which allows them to exceed the maximum value of a signed 32-bit integer.

polonius-editor: fopen() segfault on files >2GB on 32-bit systems

polonius::file::set_path calls fopen() to obtain a file descriptor for file locking. On 32-bit systems this crashes if the file in question is larger than 2GB. It should use fopen64() instead, so that the file is opened with the O_LARGEFILE flag set.

Ensure that block size input does not exceed integer maximum values

Prevent overflows

make update-version uses non-portable GNU grep -P option

Makefile target 'update-version' gets the version number from the latest debian/changelog with the following command:

grep -P -o -e "([0-9\.]*)" debian/changelog | head -n 1

The grep -P option is GNU-specific and doesn't run on non-GNU systems such as OpenBSD

This should be swapped out for something more portable

polonius-reader: Passing a path to a directory (rather than a file) leads to sadness

polonius-editor: INSERT replaces last character in file with a '0'

polonius-reader: Add special chars support to search function

See polonius-editor's -c option

polonius-editor: 'fileno_hack()' relies on non-portable GNUisms

The newly imported 'fileno_hack()' function (used to obtain file descriptors from std::fstreams, imported in response to bug #5 to implement file locks during editing) does not compile outside of a GNU/Glibc environment

Options:

Rewrite the editor::file class to use C-type FILE* instead of C++ std::fstream
Write a new 'fileno_hack()' function to be more portable
Give up on file locks altogether. There's a case for this (although in my opinion not a great one), since other processes can choose to ignore file locks at their leisure

Would really prefer to use C++ types and methods (such as fstream) where possible

polonius-editor: file_length not updating on multi-instruction sets

Acting on a file containing:
0123456789

With the instruction set:

INSERT 5 abc
INSERT 11 def

The second instruction fails with Invalid start position because '11' is out-of-range of the ORIGINAL file_length
However, after the first instruction, the file_length is increased, and so the second one should work.
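The expected behaviour can be sketched by validating each instruction against a running file length that is updated after every step. The Step type and function name are hypothetical, and the range rules are assumptions based on the examples in these issues.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class StepType { Insert, Remove, Replace };

// Simplified, hypothetical instruction record.
struct Step {
    StepType type;
    long long start;
    long long end;     // inclusive; only meaningful for Remove
    std::string text;  // only meaningful for Insert and Replace
};

// Returns the index of the first out-of-range step, or -1 if all are valid.
// file_length is updated after each step so later steps see the new size.
long long first_invalid_step(long long file_length, const std::vector<Step>& steps) {
    for (std::size_t i = 0; i < steps.size(); ++i) {
        const Step& step = steps[i];
        const long long text_length = static_cast<long long>(step.text.size());
        bool valid = false;
        switch (step.type) {
            case StepType::Insert:
                valid = step.start >= 0 && step.start <= file_length;
                if (valid) file_length += text_length;
                break;
            case StepType::Remove:
                valid = step.start >= 0 && step.start <= step.end && step.end < file_length;
                if (valid) file_length -= (step.end - step.start + 1);
                break;
            case StepType::Replace:
                valid = step.start >= 0 && step.start + text_length <= file_length;
                break;
        }
        if (!valid) return static_cast<long long>(i);
    }
    return -1;
}
```

For the 10-byte file above, INSERT 5 abc raises the running length to 13, so INSERT 11 def is then within range.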

polonius-reader: Regex searches with {#} after terms may fail when used with certain block sizes

Because Polonius reads files in chunks rather than all-at-once (permitting it to work with extremely large files), Polonius performs regex searches in the following way:

Example expression: [A-Z]{3}123

(1) Create a list of "sub-expressions" from the initial regular expression (Removing terms from the end one by one. I.e., [A-Z]{3}12, [A-Z]{3}1, [A-Z]{3}, [A-Z])
(2) Search for a FULL match of the regular expression in the currently loaded chunk. If found, skip to step 4
(3) Scan the currently loaded chunk to see if it ENDS WITH a match for one of the sub-expressions. If so, load a new block from the start position of that match & rewind to step 2
(4) Report the match

The above example uses a {#} after the [A-Z] term, which specifies that there should be precisely (in this case) 3 occurrences of the preceding term.

Let's suppose a file with the following contents:

abcd123efg

And let's suppose the user runs the following command:

polonius-reader ./the-file --find "[0-9]{3}" --regex --block-size 3

Polonius scans the first block (with block size '3') and finds abc. No matches or partial matches for the pattern [0-9]{3}.

It then scans the second block and finds d12. There is no full match, but the block ends with a match for the generated sub-expression [0-9] (without the {3}): the character 2. Polonius then loads a new block beginning from that position and finds 23e. It has jumped right over the match (123).

The obvious solution would be to create a few more intermediary sub-expressions when generating them, decrementing the numbers inside curly braces before removing them entirely. In this case, Polonius would have found that the second block ended with a match for [0-9]{2} and carried on successfully from there.

Possibly a much better solution would be to create a single extra sub-expression which replaces the term inside the curly-braces along the lines of converting {3} to {1,3}, converting {3,} to {1,}, and converting {2,3} to {1,3}
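A sketch of that second idea using std::regex. Both function names are hypothetical. The rewrite is deliberately naive: it does not skip escaped braces such as \{3\}, and it turns a lower bound of 0 into 1.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Rewrite {n} -> {1,n}, {n,} -> {1,} and {n,m} -> {1,m}.
std::string relax_quantifiers(const std::string& expr) {
    static const std::regex quantifier(R"(\{([0-9]+)(,([0-9]*))?\})");
    std::string result;
    std::size_t last = 0;
    for (std::sregex_iterator it(expr.begin(), expr.end(), quantifier), end; it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position(0));
        result.append(expr, last, pos - last);  // copy text before the quantifier
        // Without a comma the single number is the upper bound; with a comma,
        // group 3 holds the upper bound (empty for the open-ended {n,}).
        const std::string upper = m[2].matched ? m[3].str() : m[1].str();
        result += "{1," + upper + "}";
        last = pos + static_cast<std::size_t>(m.length(0));
    }
    result.append(expr, last, std::string::npos);
    return result;
}

// Start position of a match for `relaxed_expr` that runs to the end of
// `block`, or -1 if the block does not end with such a match.
std::ptrdiff_t trailing_partial_match(const std::string& block, const std::string& relaxed_expr) {
    const std::regex anchored("(?:" + relaxed_expr + ")$");
    std::smatch match;
    if (std::regex_search(block, match, anchored)) {
        return match.position(0);
    }
    return -1;
}
```

For the block d12 and the relaxed pattern [0-9]{1,3}, the trailing match starts at the 1, so the next block would begin there and contain the full 123.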

Wish-list: Take advantage of Linux-specific fallocate syscalls

See: https://man7.org/linux/man-pages/man2/fallocate.2.html

If we are using the Linux Kernel and either an ext4 or XFS filesystem, there are two potentially helpful fallocate flags which aren't available on other platforms: FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE

All other systems only allow for: replaces, inserts to the end of the file, and removals from the end of the file. As such, currently, when Polonius does (for example) an "insert" to some position of a file prior to EOF, it begins by inserting to the end of the file, and then doing a series of "replaces" to achieve the desired result.

These system calls would allow us to perform "inserts" and "removes" DIRECTLY to earlier portions of files.

However, to my knowledge, these could not be done with "byte-level" granularity. From the fallocate manpage:

A filesystem may place limitations on the granularity of the
operation, in order to ensure efficient implementation.
Typically, offset and len must be a multiple of the filesystem
logical block size, which varies according to the filesystem type
and configuration. If a filesystem has such a requirement,
fallocate() fails with the error EINVAL if this requirement is
violated.

It would be nice to be able to take advantage of these syscalls in the event that the user is on the Linux kernel & is using a compatible filesystem such as ext4
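A Linux-only sketch of what an opportunistic use of FALLOC_FL_INSERT_RANGE could look like, falling back to the existing method whenever the call is not applicable. The function names are hypothetical, and using st_blksize as the filesystem's required granularity is an assumption that may not hold everywhere.

```cpp
#include <fcntl.h>         // fallocate() (glibc; needs _GNU_SOURCE, which g++ defines by default)
#include <linux/falloc.h>  // FALLOC_FL_INSERT_RANGE
#include <sys/stat.h>
#include <sys/types.h>

// True if both offset and length are multiples of block_size.
bool is_block_aligned(off_t offset, off_t length, off_t block_size) {
    return block_size > 0 && offset % block_size == 0 && length % block_size == 0;
}

// Try to open a `length`-byte hole at `offset` using the kernel.
// Returns false if the caller should fall back to the portable method.
bool try_insert_range(int fd, off_t offset, off_t length) {
    struct stat info;
    if (fstat(fd, &info) != 0) {
        return false;
    }
    // Assumption: st_blksize matches the granularity the filesystem requires.
    if (!is_block_aligned(offset, length, info.st_blksize)) {
        return false;
    }
    // The range may not reach or pass the end of the file.
    if (offset >= info.st_size) {
        return false;
    }
    // Fails (for example with EOPNOTSUPP or EINVAL) on unsupported filesystems.
    return fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, length) == 0;
}
```

Because of the alignment requirement, a real implementation would most likely combine this with ordinary writes to handle the unaligned remainder of an insert.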

polonius-reader: Regex search segfault with large block size

To investigate later

Command: polonius-reader ./file.doc -f "\xFF{500,}" -b 32K -e -p

-b 32K segfaults, -b 31K is fine.