iUnzip

Just some code written for a PoC and benchmarking that others might also find useful.

Background

No, this is not about a better way to unzip a file, as the name might suggest. Rather, it is about a better way to process the content of an archive file over the Web, which is useful in domains such as cybersecurity, where files, including archive files, are often scanned for potential threats.

When a large file is processed locally, the most expensive part of examining its content is often the computation itself, along with the I/O operations needed to retrieve the data. However, when the file must be processed at a location remote from where the content resides, the time needed to transfer the file across the network before processing can begin may dominate the overall turnaround time, even when computational resources are abundant.

Goal for Proof of Concept

The code published in this repo was written as part of a PoC to confirm, using real-world data, that for certain applications where the files are of a certain variety and mix, overall processing time can be reduced by decompressing much larger (albeit relatively few) archive files into much smaller individual files and transferring them concurrently to a remote cloud-based service that offers high-performance, scalable file processing.

Design Requirements

Decompression of an archive (and subsequent processing) should be throttled based on available host resources so as not to starve the user's applications. The ceiling can be determined by available storage, CPU capacity, available memory, etc.
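As a rough illustration of such a ceiling, the sketch below gates new jobs on two limits: a job count and free scratch storage. The function names, the one-job-per-CPU rule, and the "half of free disk" budget are assumptions made for this example and are not taken from the published code, which only looks at the number of running jobs.

```python
import os
import shutil


def can_start_job(active_jobs, max_jobs, needed_bytes, free_bytes):
    """Return True if one more job fits under both ceilings."""
    if active_jobs >= max_jobs:
        return False
    return needed_bytes <= free_bytes


def default_ceilings(scratch_dir="."):
    """Derive illustrative ceilings from the host (assumed policy):
    one job per CPU and half of the free space in scratch_dir."""
    max_jobs = os.cpu_count() or 1
    free_bytes = shutil.disk_usage(scratch_dir).free // 2
    return max_jobs, free_bytes
```

A caller would check `can_start_job(...)` before extracting the next member and wait (or back off) when it returns False.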

Decompressing a large archive in one shot may block the client for an unnecessarily long time. For smoother scheduling, it might be better to decompress the archive file in phases or in small steps, at the cost of some additional "disk" I/O operations. Given that more computers have moved off traditional spinning-media storage devices, the additional I/O overhead should be negligible compared to network I/O.
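One way to picture the phased approach is a generator that extracts a handful of members per step, so the caller can do other work (or check resource limits) between steps. This is only a sketch using Python's standard `zipfile` module; the function name and the `batch_size` default are made up for illustration and do not describe the PoC's actual code.

```python
import zipfile
from itertools import islice


def extract_in_batches(zip_path, dest_dir, batch_size=16):
    """Yield lists of extracted paths, at most batch_size members per step."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; only regular members count toward a batch.
        members = (info for info in archive.infolist() if not info.is_dir())
        while True:
            batch = list(islice(members, batch_size))
            if not batch:
                return
            # ZipFile.extract returns the path of the file it created.
            yield [archive.extract(info, dest_dir) for info in batch]
```

Each `next()` on the generator performs one small step of decompression, which keeps any single pause short.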

In the future, for certain computing environments, it might also be possible to fine-tune performance using nice values, cgroups, CPU usage caps, etc.
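For the nice-value idea, a minimal sketch on Unix-like systems could look like the following. It relies on Python's `os.nice`, which is not available on every platform, so the helper (a hypothetical name, not part of the published code) returns None when priority cannot be adjusted.

```python
import os


def lower_priority(increment=10):
    """Raise this process's nice value by increment.

    Returns the new nice value, or None if the platform does not
    support os.nice or the call is rejected.
    """
    if not hasattr(os, "nice"):
        return None  # e.g. Windows
    try:
        return os.nice(increment)
    except OSError:
        return None
```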

PoC Implementation

The program provided does the following:

  • Detect whether the target file to scan is a ZIP file. If not, just exit.
  • If the target file is a ZIP file, iterate through all of its member files.
  • If a member file is itself a ZIP archive, recursively decompress it as well.
  • Otherwise, for every (non-archive) member file, start a new worker thread to process the file; work is simulated by the thread sleeping for a random period of time.
  • Allow at most a certain number of jobs (and worker threads) to execute concurrently. In this published version, the number of executing worker jobs/threads is the only factor that determines whether a new job can be started.
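The steps above can be sketched roughly as follows. This is an illustrative Python version, not the repo's actual code: the names `scan`, `scan_archive`, `process_member`, and `MAX_JOBS` are assumptions, and nested archives are read into memory here rather than unpacked to disk, purely to keep the example short.

```python
import io
import random
import threading
import time
import zipfile

MAX_JOBS = 4  # assumed cap on concurrent worker threads
_slots = threading.BoundedSemaphore(MAX_JOBS)


def process_member(name, data, results):
    """Simulate work by sleeping for a random time, then free the slot."""
    try:
        time.sleep(random.uniform(0.01, 0.05))
        results.append(name)  # list.append is atomic in CPython
    finally:
        _slots.release()


def scan_archive(source, results, threads, prefix=""):
    """Walk a ZIP (path or file-like object), recursing into nested ZIPs."""
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            name = prefix + info.filename
            if zipfile.is_zipfile(io.BytesIO(data)):
                scan_archive(io.BytesIO(data), results, threads, name + "/")
                continue
            _slots.acquire()  # blocks while MAX_JOBS workers are running
            worker = threading.Thread(
                target=process_member, args=(name, data, results))
            worker.start()
            threads.append(worker)


def scan(path):
    """Return the processed member names, or None if path is not a ZIP."""
    if not zipfile.is_zipfile(path):
        return None
    results, threads = [], []
    scan_archive(path, results, threads)
    for worker in threads:
        worker.join()
    return sorted(results)
```

The semaphore is the only admission control here, which mirrors the last bullet: a new job starts as soon as a running one finishes, regardless of other host resources.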

License

The source code for this app is available under the MIT License.

iunzip's People

Contributors

chchench
