compr's Introduction

PyCompress

Pycompressor is a tool for compressing text files into smaller ones, as well as extracting compressed files back into the original content.

It can be used as a standalone program, or imported as a package to call the functions it defines.

For example, in order to compress one file:

$ pycompress -c /usr/share/dict/words -d /tmp/compressed.zf

The original file in this example has a size of ~4.8M, and the tool wrote the resulting file to /tmp/compressed.zf with a size of ~2.7M.

In order to extract it:

$ pycompress -x /tmp/compressed.zf -d /tmp/original

You can specify the name of the resulting file with the -d flag. If you don't indicate a name for the resulting file, the default will be <original-file>.comp.

For the full options, run:

$ pycompress -h

Installation

pip install trenzalore

This will install the package and provide an application named pycompress for using the command-line utility.

Development

To install the package in development mode, run:

make testdeps

And run the tests with:

make test

Before submitting a pull request, run the checklist to make sure all checks pass (code style/linting, tests, etc.). This is automated with:

make checklist

This will run the checks for the code style (make lint), as well as the tests (make test).

compr's People

Contributors

  • rmariano

compr's Issues

Package project

Create a setup.py that allows the project to be installed as a package for development and installation.

  • Publish on PyPI

Add a `--verbose` option

This optional parameter, when selected, should gather information while the main command is being performed, and display the results just before the program finishes.
For example, it can collect the elapsed time, the sizes of both files (before and after the program ran), the compression/extraction ratio (in %), etc.
This information is rendered on stdout.
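As a sketch, the metrics could be gathered with a small helper. The name gather_stats is hypothetical, not part of the current code:

```python
import os
import time


def gather_stats(source: str, target: str, started_at: float) -> dict:
    """Collect the metrics a --verbose flag could report at the end of a run.

    `started_at` is a time.monotonic() timestamp taken before processing.
    """
    source_size = os.path.getsize(source)
    target_size = os.path.getsize(target)
    return {
        "elapsed_seconds": time.monotonic() - started_at,
        "source_size": source_size,
        "target_size": target_size,
        # Compression/extraction ratio in %, relative to the source file.
        "ratio_percent": round(100 * target_size / source_size, 2),
    }
```

The main command would record the start timestamp, run as usual, and print the returned dictionary to stdout when the flag is set.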

Setup performance benchmark

Automatically run performance checks on the platform, which can be used to measure differences across changes, detect regressions, etc. It is recommended to run them as part of the CI along with the unit tests. It should be possible to compare performance across different branches and revisions.

Instrument the code, to support performance testability.

Have a separate target in Makefile.

The benchmark has to include the following relevant metrics (to be reviewed):

  • Running time (latency) for one "mark" file
  • Running time for N files (traceability to determine how it scales as more files are added).
  • CPU load (%, load average, etc.)
  • Memory usage.
  • I/O

Warn if target file already exists

In case the target file already exists (regardless of whether it was user-specified or the default one), warn the user about it, and ask for confirmation before continuing with the processing.

This has to be done before any actual processing of the file takes place.

If -f | --force is indicated, assume the output file will be overwritten and do not prompt.
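A minimal sketch of the check, assuming a hypothetical confirm_overwrite helper wired into the CLI before any processing starts:

```python
from pathlib import Path


def confirm_overwrite(target: str, force: bool = False, ask=input) -> bool:
    """Return True when it is safe to write `target`.

    Hypothetical helper: prompts before overwriting an existing file,
    unless -f/--force was given. `ask` is injectable for testing.
    """
    if force or not Path(target).exists():
        return True
    answer = ask(f"{target} already exists. Overwrite? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

The CLI entry point would abort early when this returns False, before opening any files.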

Setup tox

Run tests against the following Python versions:

  • 3.5
  • 3.6

Update Travis CI

Create helper for streaming file

Helper object that will yield the contents of the file, read in chunks of a given buffer size.

Conditions:

  • A file-like object
  • Context manager, making sure the file is closed upon completion.
  • Iterable

pseudocode:

with IterableFile('/tmp/foo/bar') as streamed_file:
    for chunk in streamed_file.stream(buffer_size=1024):
        print(chunk)
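A minimal implementation sketch matching the pseudocode above (binary mode assumed; the class name comes from the pseudocode, not from existing code):

```python
class IterableFile:
    """Context manager that streams a file's contents in fixed-size chunks."""

    def __init__(self, path):
        self._path = path
        self._file = None

    def __enter__(self):
        self._file = open(self._path, "rb")
        return self

    def __exit__(self, *exc_info):
        # Make sure the file is closed upon completion, even on error.
        self._file.close()

    def stream(self, buffer_size=1024):
        """Yield chunks of at most `buffer_size` bytes until EOF."""
        while True:
            chunk = self._file.read(buffer_size)
            if not chunk:
                break
            yield chunk
```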

Use mmap in files

depends on [blocked by]: #29
Change the underlying implementation for mmap, and compare performance results.
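A sketch of what the mmap-backed streaming could look like (the function name is hypothetical; the real change would replace the internals of the streaming helper from #29):

```python
import mmap


def stream_mmap(path, buffer_size=1024):
    """Yield the file's contents in chunks, backed by a memory map.

    Slicing the map lets the OS page data in lazily instead of issuing
    explicit read() calls, which is what we want to benchmark against.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for offset in range(0, len(mapped), buffer_size):
                yield mapped[offset:offset + buffer_size]
```

Note that mmap.mmap raises ValueError for empty files, so the helper would need a guard (or a fallback to plain reads) before being a drop-in replacement.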

Support multiple files

Ability to compress multiple files, packaging the compressed output into a single file.

This changes the CLI interface: the user now has to specify the name of the output file first (the default one will probably not work anymore), and then the list of files to compress (similar to tar, etc.), like:

pycompress --output <ofilename> [files...]

The compression can be done sequentially; there is no need for parallelism. Any sort of optimisation will be done later on.
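The new interface could be sketched with argparse (flag names as proposed above; this is an illustration, not the current parser):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the tar-like interface: one output file, many inputs."""
    parser = argparse.ArgumentParser(prog="pycompress")
    parser.add_argument(
        "--output",
        required=True,
        help="name of the single compressed output file",
    )
    parser.add_argument(
        "files",
        nargs="+",
        help="one or more files to compress into the output",
    )
    return parser
```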

Setup mypy

Add a new target in Makefile that checks type hinting. If the mypy validation has some issues, the target should fail.
This new target will be part of the checklist, so make checklist should run mypy among other things.

Default file should be placed in local directory

At the moment, if no name is provided for the file being worked on (extraction/compression), it uses <original-file>.comp as the default. If an absolute path is provided, it will still use that absolute path with the .comp suffix.

A user might have read permissions for the file being worked on (that's all it should take for compression), but not write permissions (for the output file).

The proposal is to change the default for:

`pwd`/`basename <original-file>`.comp

Leaving the resulting file in the current directory, where write permissions are assumed.

Use pathlib: https://docs.python.org/3.5/library/pathlib.html
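With pathlib, the proposed default can be computed in one line (helper name hypothetical):

```python
from pathlib import Path


def default_output(original: str) -> Path:
    """Proposed default: `pwd`/`basename <original-file>`.comp

    Drops the original file's directory so the output lands in the
    current working directory, where write permissions are assumed.
    """
    return Path.cwd() / (Path(original).name + ".comp")
```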

change tests layout

For each compressor/<X>.py there should be a corresponding test file tests/unit/test_<X>.py

  • Group tests by scenarios.
  • Separate unit vs. functional tests, and allow running them separately.

Document file formats

Generate documentation for the low-level internals of the project.

  • Format of the file that is compressed, or extracted.
  • Structure of bytes in the resulting table
  • Algorithms & Data structures being used in the program
  • Algorithm complexity

Refactor tests

  • Move to pytest-style tests (functions with asserts, etc.)
  • Remove nose dependency
  • Test in smaller units, each function separately
  • Add tox (For Python 3.5 and Python 3.6)
  • make checklist: check for style issues (pylint), syntax, & run tests
  • Remove randomization in tests
  • Remove cli-specific tools (subprocess calls to sha256sum for instance).

Move code to new cli module

All code related to the command-line interface should be moved from __init__.py to a new module called (for example) cli.py.

Setup lint checking

make lint should be part of the checklist, and should run linting checks automatically (pycodestyle, pylint, etc.).

If some issues are found on any of the files, it should fail with exit code 1.

Document Project

Generate the documentation for the project, describing the main functions, their parameters, etc.

  • High-level project information

  • API documentation: generated from docstrings + adding custom information about each function on the project, modules, how to use, etc.

  • Python annotations

  • Low-level file documentation:

    • Binary file format, structure, bytes, parsing, etc.
    • Input and output
  • Make documentation available online (RTD)

  • Update Readme

Document:

  • cli:

    • Parameters for compression & decompression
    • Examples of invocations
  • Programmatic API

[optimization] Use actual bit array on processing

The library currently encodes the binary representation as byte characters '1' and '0', rather than as actual bits in an array.

It is not strictly required to port everything to C at this point; doing the optimisation in Python will suffice.

Some alternatives might be:

Compare memory utilisation before and after the change.
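As an illustration of the optimisation, a '0'/'1' character string can be packed eight bits per byte (sketch only; the real change would touch the encoder's internals):

```python
def pack_bits(bits: str) -> bytearray:
    """Pack a string of '0'/'1' characters into actual bits, 8 per byte.

    The current representation spends one byte per bit; this packs
    eight bits into each byte. The final byte is zero-padded on the
    right, so the bit length must be stored alongside the data.
    """
    packed = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8].ljust(8, "0")  # pad the last partial byte
        packed.append(int(chunk, 2))         # parse 8 chars as binary
    return packed
```

This alone cuts the in-memory size of the bit stream roughly eightfold, which is what the before/after memory comparison should confirm.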

Release version 0.1.0

  • Tag and sign version with current master at 2017-04-15
  • Create necessary Makefile targets for building
  • Build wheel for project
  • Publish on PyPI
  • Update README

run mypy as part of the CI

Include this as one of the items of the checklist. The build must fail ⚠️ if the type hinting checks do not pass.

Improve CI

  • Code linting for all code

    • pylint, flake8, with the most strict controls
    • Set max column=79
    • Break the build if the linting does not pass
    • Set up automated code review for common patterns. Maybe https://github.com/integrations/sideci can help
  • Tests should ignore the dataset on the run

  • Check that coverage level did not decrease. Fail if it did.

  • Automate coverage level report per branch, and PR. Link directly in the project main page and documentation.

  • Create a checklist target in Makefile, and separate tests from checklist.

  • Check for security issues and updates automatically. Maybe https://github.com/integrations/src-clr can help

New cli option: output directory

Enable the user to indicate an output directory for the file(s) that are going to be written.

Parameter must be called --output-dir or -O.

If this parameter is provided, all files will be written inside this directory with the default naming convention.

Update documentation with examples of this use.
