Code Monkey home page Code Monkey logo

excel-streaming-reader's Introduction

shippable

Excel Streaming Reader

If you've used Apache POI in the past to read in Excel files, you probably noticed that it's not very memory efficient. Reading in an entire workbook will cause a severe memory usage spike, which can wreak havoc on a server.

There are plenty of good reasons for why Apache has to read in the whole workbook, but most of them have to do with the fact that the library allows you to read and write with random addresses. If (and only if) you just want to read the contents of an Excel file in a fast and memory effecient way, you probably don't need this ability. Unfortunately, the only thing in the POI library for reading a streaming workbook requires your code to use a SAX-like parser. All of the friendly classes like Row and Cell are missing from that API.

This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API. Read on to see if it's right for you.

Include

This library is available from from Maven Central, and you can optionally install it yourself. The Maven installation instructions can be found on the release page.

To use it, add this to your POM:

<dependencies>
  <dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>0.2.12</version>
  </dependency>
</dependencies>  

Usage

This library is very specific in how it is meant to be used. You should initialize it like so:

import com.monitorjbl.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .sheetIndex(0)        // index of sheet to use (defaults to 0)
        .sheetName("sheet1")  // name of sheet to use (overrides sheetIndex)
        .read(is);            // InputStream or File for XLSX file (required)

Once you've done this, you can then iterate through the rows and cells like so:

for (Row r : reader) {
  for (Cell c : r) {
    System.out.println(c.getStringCellValue());
  }
}

The StreamingReader is an autoclosable resource, and it's important that you close it to free the filesystem resource it consumed. With Java 7, you can do this:

try (
  InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
  StreamingReader reader = StreamingReader.builder()
          .rowCacheSize(100)
          .bufferSize(4096)
          .sheetIndex(0)
          .read(is);
) {
  for (Row r : reader) {
    for (Cell c : r) {
      System.out.println(c.getStringCellValue());
    }
  }
}

You may access cells randomly within a row, as the entire row is cached. However, there is no way to randomly access rows. As this is a streaming implementation, only a small number of rows are kept in memory at any given time.

Supported Methods

Not all POI Cell and Row functions are supported. The most basic ones are (Cell.getStringCellValue(), Cell.getColumnIndex(), etc.), but don't be surprised if you get a NotSupportedException on the more advanced ones.

I'll try to add more support as time goes on, but some items simply can't be read in a streaming fashion. Methods that require dependent values will not have said dependencies available at the point in the stream in which they are read.

This is a brief and very generalized list of things that are not supported for reads:

  • Functions
  • Macros
  • Styled cells (the styles are kept at the end of the ZIP file)

Logging

This library uses SLF4j logging. This is a rare use case, but you can plug in your logging provider and get some potentially useful output. Below is an example of doing this with log4j:

pom.xml dependencies

<dependencies>
  <dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>0.2.12</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.6</version>
  </dependency>
  <dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
  </dependency>
</dependencies>

log4j.properties

log4j.rootLogger=DEBUG, A1
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d{ISO8601} [%c] %p: %m%n

log4j.category.com.monitorjbl=DEBUG

Implementation Details

This library will take a provided InputStream and output it to the file system. The stream is piped safely through a configurable-sized buffer to prevent large usage of memory. Once the file is created, it is then streamed into memory from the file system.

The reason for needing the stream being outputted in this manner has to do with how ZIP files work. Because the XLSX file format is basically a ZIP file, it's not possible to find all of the entries without reading the entire InputStream.

This is a problem that can't really be gotten around for POI, as it needs a complete list of ZIP entries. The default implementation of reading from an InputStream in POI is to read the entire stream directly into memory. This library works by reading out the stream into a temporary file. As part of the auto-close action, the temporary file is deleted.

If you need more control over how the file is created/disposed of, there is an option to initialize the library with a java.io.File. This file will not be written to or removed:

File f = new File("/path/to/workbook.xlsx");
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    
        .bufferSize(4096)     
        .sheetIndex(0)        
        .read(f);            

This library will ONLY work with XLSX files. The older XLS format is not capable of being streamed.

excel-streaming-reader's People

Contributors

lilyliuce avatar lvsant avatar matthiasblaesing avatar monitorjbl avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.