compr's Introduction

PyCompress

Pycompressor is a tool for compressing text files into smaller ones, as well as extracting compressed files back into the original content.

It can be used as a standalone program, or imported as a package to call the functions it defines.

For example, in order to compress one file:

$ pycompress -c /usr/share/dict/words -d /tmp/compressed.zf

The original file in this example has a size of ~4.8M, and the tool wrote the resulting file to /tmp/compressed.zf with a size of ~2.7M.

In order to extract it:

$ pycompress -x /tmp/compressed.zf -d /tmp/original

You can specify the name of the resulting file with the -d flag. If you don't indicate a name for the resulting file, the default will be <original-file>.comp.

For the full options, run:

$ pycompress -h

Installation

pip install trenzalore

This will install the package and provide an application named pycompress for using the command-line utility.

Development

To install the package in development mode, run:

make testdeps

And run the tests with:

make test

Before submitting a pull request, run the checklist to make sure all checks pass (code style/linting, tests, etc.). This is automated with:

make checklist

This will run the checks for the code style (make lint), as well as the tests (make test).

compr's People

Contributors

  • rmariano

compr's Issues

Package project

Create a setup.py that allows the project to be installed as a package for development and installation.

  • Publish on PyPI

Add a `--verbose` option

This optional parameter, when selected, should gather information while the main command is being performed, and display the results just before the program finishes.
For example, it can collect the elapsed time, the sizes of both files (before and after the program ran), the compression/extraction ratio (in %), etc.
This information is rendered on stdout.
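As a sketch, the metrics could be gathered with a small helper. The name gather_stats is hypothetical, not part of the current code:

```python
import os
import time


def gather_stats(source: str, target: str, started_at: float) -> dict:
    """Collect the metrics a --verbose flag could report at the end of a run.

    `started_at` is a time.monotonic() timestamp taken before processing.
    """
    source_size = os.path.getsize(source)
    target_size = os.path.getsize(target)
    return {
        "elapsed_seconds": time.monotonic() - started_at,
        "source_size": source_size,
        "target_size": target_size,
        # Compression/extraction ratio in %, relative to the source file.
        "ratio_percent": round(100 * target_size / source_size, 2),
    }
```

The main command would record the start timestamp, run as usual, and print the returned dictionary to stdout when the flag is set.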

Setup performance benchmark

Automatically run performance checks on the platform, which can be used to measure differences across changes, detect regressions, etc. It is recommended to run them as part of the CI along with the unit tests. It should be possible to compare performance across different branches and revisions.

Instrument the code, to support performance testability.

Have a separate target in Makefile.

The benchmark has to include the following relevant metrics (to be reviewed):

  • Running time (latency) for one "mark" file
  • Running time for N files (traceability to determine how it scales as more files are added).
  • CPU load (%, load average, etc.)
  • Memory usage.
  • I/O

Warn if target file already exists

In case the target file already exists (regardless of whether it was user-specified or the default one), warn the user about it, and ask for confirmation before continuing with the processing.

This has to be done before any actual processing of the file takes place.

If -f | --force is indicated, assume the output file will be overwritten and do not prompt.
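A minimal sketch of the check, assuming a hypothetical confirm_overwrite helper wired into the CLI before any processing starts:

```python
from pathlib import Path


def confirm_overwrite(target: str, force: bool = False, ask=input) -> bool:
    """Return True when it is safe to write `target`.

    Hypothetical helper: prompts before overwriting an existing file,
    unless -f/--force was given. `ask` is injectable for testing.
    """
    if force or not Path(target).exists():
        return True
    answer = ask(f"{target} already exists. Overwrite? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

The CLI entry point would abort early when this returns False, before opening any files.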

Setup tox

Run tests against the following Python versions:

  • 3.5
  • 3.6

Update Travis CI

Create helper for streaming file

Helper object that will yield the contents of the file, read in chunks of a given buffer size.

Conditions:

  • A file-like object
  • Context manager, making sure the file is closed upon completion.
  • Iterable

pseudocode:

with IterableFile('/tmp/foo/bar') as streamed_file:
    for chunk in streamed_file.stream(buffer_size=1024):
        print(chunk)
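A minimal implementation sketch matching the pseudocode above (binary mode assumed; the class name comes from the pseudocode, not from existing code):

```python
class IterableFile:
    """Context manager that streams a file's contents in fixed-size chunks."""

    def __init__(self, path):
        self._path = path
        self._file = None

    def __enter__(self):
        self._file = open(self._path, "rb")
        return self

    def __exit__(self, *exc_info):
        # Make sure the file is closed upon completion, even on error.
        self._file.close()

    def stream(self, buffer_size=1024):
        """Yield chunks of at most `buffer_size` bytes until EOF."""
        while True:
            chunk = self._file.read(buffer_size)
            if not chunk:
                break
            yield chunk
```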

Use mmap in files

depends on [blocked by]: #29
Change the underlying implementation for mmap, and compare performance results.
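A sketch of what the mmap-backed streaming could look like (the function name is hypothetical; the real change would replace the internals of the streaming helper from #29):

```python
import mmap


def stream_mmap(path, buffer_size=1024):
    """Yield the file's contents in chunks, backed by a memory map.

    Slicing the map lets the OS page data in lazily instead of issuing
    explicit read() calls, which is what we want to benchmark against.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for offset in range(0, len(mapped), buffer_size):
                yield mapped[offset:offset + buffer_size]
```

Note that mmap.mmap raises ValueError for empty files, so the helper would need a guard (or a fallback to plain reads) before being a drop-in replacement.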

Support multiple files

Ability to compress multiple files, packaging the compressed output into a single file.

This changes the CLI interface: the user now has to specify the name of the output file first (the default one will probably not work anymore), and then the list of files to compress (similar to tar, etc.), like:

pycompress --output <ofilename> [files...]

The compression can be done sequentially; there is no need for parallelism. Any sort of optimisation will be done later on.
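The new interface could be sketched with argparse (flag names as proposed above; this is an illustration, not the current parser):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the tar-like interface: one output file, many inputs."""
    parser = argparse.ArgumentParser(prog="pycompress")
    parser.add_argument(
        "--output",
        required=True,
        help="name of the single compressed output file",
    )
    parser.add_argument(
        "files",
        nargs="+",
        help="one or more files to compress into the output",
    )
    return parser
```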

Setup mypy

Add a new target in Makefile that checks type hinting. If the mypy validation has some issues, the target should fail.
This new target will be part of the checklist, so make checklist should run mypy among other things.

Default file should be placed in local directory

At the moment, if no name is provided for the file being worked on (extraction/compression), it uses <original-file>.comp as the default. If an absolute path is provided, it will still use that absolute path with the .comp suffix.

A user might have read permissions for the file being worked on (that's all it should take for compression), but not write permissions (for the output file).

The proposal is to change the default for:

`pwd`/`basename <original-file>`.comp

Leaving the resulting file in the current directory, where write permissions are assumed.

Use pathlib: https://docs.python.org/3.5/library/pathlib.html
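With pathlib, the proposed default can be computed in one line (helper name hypothetical):

```python
from pathlib import Path


def default_output(original: str) -> Path:
    """Proposed default: `pwd`/`basename <original-file>`.comp

    Drops the original file's directory so the output lands in the
    current working directory, where write permissions are assumed.
    """
    return Path.cwd() / (Path(original).name + ".comp")
```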

change tests layout

For each compressor/<X>.py there should be a corresponding test file tests/unit/test_<X>.py

  • Group tests by scenarios.
  • Separate unit vs. functional tests, and allow running them separately.

Document file formats

Generate documentation for the low-level internals of the project.

  • Format of the file that is compressed, or extracted.
  • Structure of bytes in the resulting table
  • Algorithms & Data structures being used in the program
  • Algorithm complexity

Refactor tests

  • Move to pytest-style tests (functions with asserts, etc.)
  • Remove nose dependency
  • Test in smaller units, each function separately
  • Add tox (For Python 3.5 and Python 3.6)
  • make checklist: check for style issues (pylint), syntax, & run tests
  • Remove randomization in tests
  • Remove cli-specific tools (subprocess calls to sha256sum for instance).

Move code to new cli module

All code related to the command-line interface should be moved from __init__.py to a new module called (for example) cli.py.

Setup lint checking

make lint should be part of the checklist, and should run linting checks automatically (pycodestyle, pylint, etc.).

If some issues are found on any of the files, it should fail with exit code 1.

Document Project

Generate the documentation for the project, describing the main functions, their parameters, etc.

  • High-level project information

  • API documentation: generated from docstrings + adding custom information about each function on the project, modules, how to use, etc.

  • Python annotations

  • Low-level file documentation:

    • Binary file format, structure, bytes, parsing, etc.
    • Input and output
  • Make documentation available online (RTD)

  • Update Readme

Document:

  • cli:

    • Parameters for compression & decompression
    • Examples of invocations
  • Programmatic API

[optimization] Use actual bit array on processing

The library currently encodes the binary representation as byte characters '1' and '0', rather than as actual bits in an array.

It is not strictly required to port everything to C at this point; doing the optimisation in Python will suffice.

Some alternatives might be:

Compare memory utilisation before and after the change.
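As an illustration of the optimisation, a '0'/'1' character string can be packed eight bits per byte (sketch only; the real change would touch the encoder's internals):

```python
def pack_bits(bits: str) -> bytearray:
    """Pack a string of '0'/'1' characters into actual bits, 8 per byte.

    The current representation spends one byte per bit; this packs
    eight bits into each byte. The final byte is zero-padded on the
    right, so the bit length must be stored alongside the data.
    """
    packed = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8].ljust(8, "0")  # pad the last partial byte
        packed.append(int(chunk, 2))         # parse 8 chars as binary
    return packed
```

This alone cuts the in-memory size of the bit stream roughly eightfold, which is what the before/after memory comparison should confirm.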

Release version 0.1.0

  • Tag and sign version with current master at 2017-04-15
  • Create necessary Makefile targets for building
  • Build wheel for project
  • Publish on PyPI
  • Update README

run mypy as part of the CI

Include this as one of the items of the checklist. The build must fail ⚠️ if the type hinting checks do not pass.

Improve CI

  • Code linting for all code

    • pylint, flake8, with the most strict controls
    • Set max column=79
    • Break the build if the linting does not pass
    • Set up automated code review for common patterns. Maybe https://github.com/integrations/sideci can help
  • Tests should ignore the dataset on the run

  • Check that coverage level did not decrease. Fail if it did.

  • Automate coverage level report per branch, and PR. Link directly in the project main page and documentation.

  • Create a checklist target in Makefile, and separate tests from checklist.

  • Check for security issues and updates automatically. Maybe https://github.com/integrations/src-clr can help

New cli option: output directory

Enable the user to indicate an output directory for the file(s) that are going to be written.

Parameter must be called --output-dir or -O.

If this parameter is provided, all files will be written inside this directory with the default naming convention.

Update documentation with examples of this use.
