asperagmbh / xlsx-reader

xlsx-reader is a PHP library for fast and efficient reading of XLSX spreadsheet files. Its focus is on reading the data contained within XLSX files, disregarding all document styling beyond that which is strictly necessary for data type recognition. It is built to be usable for very big XLSX files in the magnitude of multiple GBs.

License: Other

PHP 100.00%

xlsx xlsx-files xlsxreader xlsx-lib xlsx-spreadsheet xlsx-parser excel excelreader excel-import excelparser

xlsx-reader's Introduction

xlsx-reader

xlsx-reader is an extension of the XLSX-targeted spreadsheet reader that is part of spreadsheet-reader.

It delivers functionality to efficiently read in data contained within the given XLSX file.

The focus of this library is on delivering data contained in XLSX spreadsheet cells, not the document's styling. As such, the library offers no support for XLSX capabilities that aren't strictly necessary to achieve this goal. Only basic cell value formatting and shared string functionalities are supported.

Requirements

PHP 7.1.0 or newer, with at least the following optional features enabled:
- Zip (see http://php.net/manual/en/zip.installation.php)
- XMLReader (see http://php.net/manual/en/xmlreader.installation.php)

Installation using Composer

The package is available on Packagist. You can install it using Composer.

composer require aspera/xlsx-reader

Usage

All data is read from the file sequentially, with each row being returned as an array of columns.

<?php
use Aspera\Spreadsheet\XLSX\Reader;

$reader = new Reader();
$reader->open('example.xlsx');

foreach ($reader as $row) {
    print_r($row);
}

$reader->close();

XLSX files with multiple worksheets are also supported. The method getSheets() returns an array with sheet indexes as keys and Worksheet objects as values. The method changeSheet($index) is used to switch between sheets to read.

<?php
use Aspera\Spreadsheet\XLSX\Reader;

$reader = new Reader();
$reader->open('example.xlsx');

$sheets = $reader->getSheets();
foreach ($sheets as $index => $sheet_data) {
    $reader->changeSheet($index);
    echo 'Sheet #' . $index . ': ' . $sheet_data->getName();

    // Note: Any call to changeSheet() resets the current read position to the beginning of the selected sheet.
    foreach ($reader as $row_number => $row) {
        echo 'Row #' . $row_number . ': ' . print_r($row, true);
    }
}

$reader->close();

Options to tune the reader's behavior and output can be specified via a ReaderConfiguration instance.

For a full list of supported options and their effects, consult the in-code documentation of ReaderConfiguration.

<?php
use Aspera\Spreadsheet\XLSX\Reader;
use Aspera\Spreadsheet\XLSX\ReaderConfiguration;
use Aspera\Spreadsheet\XLSX\ReaderSkipConfiguration;

$reader_configuration = (new ReaderConfiguration())
  ->setTempDir('C:/Temp/')
  ->setSkipEmptyCells(ReaderSkipConfiguration::SKIP_EMPTY)
  ->setReturnDateTimeObjects(true)
  ->setCustomFormats(array(20 => 'hh:mm'));

$spreadsheet = new Reader($reader_configuration);

Notes about library performance

XLSX files use so-called "shared strings" to optimize file sizes for cases where the same string is repeated multiple times. For larger documents, this list of shared strings can become quite large, causing either performance bottlenecks or high memory consumption when parsing the document.

To deal with this, the reader selects sensible defaults for maximum RAM consumption. Once this memory limit has been exhausted, the file system is used for further optimization strategies.

To configure this behavior in detail, e.g. to increase the amount of memory available to the reader, a SharedStringsConfiguration instance can be attached to the ReaderConfiguration instance supplied to the reader's constructor.

For a full list of supported options and their effects, consult the in-code documentation of SharedStringsConfiguration.

<?php
use Aspera\Spreadsheet\XLSX\Reader;
use Aspera\Spreadsheet\XLSX\ReaderConfiguration;
use Aspera\Spreadsheet\XLSX\SharedStringsConfiguration;

$shared_strings_configuration = (new SharedStringsConfiguration())
    ->setCacheSizeKilobyte(16 * 1024)
    ->setUseOptimizedFiles(false);

$reader_configuration = (new ReaderConfiguration())
  ->setSharedStringsConfiguration($shared_strings_configuration);

$spreadsheet = new Reader($reader_configuration);

Notes about unsupported features

This reader's purpose is to allow reading of basic data (text, numbers, dates...) from XLSX documents. As such, there are no plans to extend support to include all features available for XLSX files. Only a minimal subset of XLSX capabilities is supported.

In particular, the following should be noted in regard to unsupported features:

- Display cell width is disregarded. As a result, in cases in which popular xlsx editors would shorten values using scientific notation or "#####"-placeholders, the reader will return un-shortened values instead.
- Files with multiple internal shared strings files are not supported.
- Files with multiple internal styles definition files are not supported.
- Fractions are only partially supported. The results delivered by the reader might be slightly off from the original input.

Licensing

All the code in this library is licensed under the MIT license as included in the LICENSE.md file.


xlsx-reader's Issues

Is it possible to include an option to read a password protected file?

Is there any way to get the "value" from a cell (unformatted)?

e.g.
-12345.67 for a number formatted as (£#,##0);(£#,##0),
for which I currently get "(12,345.67)"?

Enhancement to allow access to meta data read?

I need to read meta data from an xlsx to help when I re-output a new adjusted file and so was thinking that you could make the data you've read (like $cell_type & the format string) available...
I was thinking something along the lines of an "include_meta_data" configuration option which could populate a separate "current_row_meta_data" array (for backward compatibility) which could look like this:

current_row_meta_data[0]=[
  'type'=>'d',          // 'd'=datetime | 's'=shared string/'inlineStr'/'str' | 'b'=boolean | 'n'=numeric | 'e'=error
]

Maybe also include the "format" string for the cell, as in:
  'format'=>'dd/mm/yy',

Reader incorrectly handles empty <row> elements

We encountered an XLSX file with sheet data in the following form:

<row r="1" spans="1:9" ht="15.75" thickBot="1" x14ac:dyDescent="0.3"/>
<row r="2" spans="1:9" x14ac:dyDescent="0.25">
	<c r="A2" s="3" t="s">
		<v>0</v>
	</c>
	...
</row>

Note how the first <row> element merely consists of a self-closing tag.

This is currently not handled correctly by the Reader, which only checks $this->worksheet_reader->isClosingTag() to detect the end of a row. The worksheet_reader would also have to consider the state of $this->isEmptyElement.

The result is that the Reader keeps reading forward until it reaches the </row> tag of row 2, incorrectly reading and outputting the cells of row 2 as the content of row 1. Then it returns empty values for row 2, since there are no further cells to be read.
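
The behaviour described above can be illustrated with PHP's XMLReader on its own. The following is a minimal sketch rather than the library's actual code: the function name readRowValues() and the simplified sheet XML it expects are assumptions made for illustration, and shared string lookup is left out. It treats a self-closing <row/> as a complete, empty row by checking isEmptyElement when the opening tag is encountered.

```php
<?php
// Hypothetical helper: collects the raw <v> values of each <row>, keyed by row number.
function readRowValues(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $rows = [];
    $current_row = null; // Row number of the row currently being read, if any.

    while ($reader->read()) {
        $is_row = ($reader->name === 'row');

        if ($is_row && $reader->nodeType === XMLReader::ELEMENT) {
            $row_number = (int) $reader->getAttribute('r');
            $rows[$row_number] = [];
            // A self-closing <row/> never produces an END_ELEMENT node,
            // so it must be considered finished right away.
            $current_row = $reader->isEmptyElement ? null : $row_number;
        } elseif ($is_row && $reader->nodeType === XMLReader::END_ELEMENT) {
            $current_row = null;
        } elseif ($reader->name === 'v'
            && $reader->nodeType === XMLReader::ELEMENT
            && $current_row !== null
        ) {
            $rows[$current_row][] = $reader->readString();
        }
    }

    $reader->close();
    return $rows;
}
```

With this approach, the cells of row 2 stay attached to row 2, and row 1 is reported as an empty row.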

Error "P-1D" when reading an incorrect DateTime

Hi,

I have a document where the date seems to be -1, probably because of an export.

When the Excel file is opened in LibreOffice, the date appears as "29/12/1899".

But the reader throws an exception when it encounters this value:
Exception with message 'DateInterval::__construct(): Unknown or bad format (P-1D)'.
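
The exception message suggests that DateInterval's constructor does not accept a signed duration such as "P-1D". One way around this is to avoid building an interval string at all. The following is a minimal sketch under stated assumptions, not the library's implementation: the function name excelSerialToDateTime() is hypothetical, the base date 1899-12-30 is an assumption chosen because it is consistent with LibreOffice showing -1 as 29/12/1899, and Excel's 1900 leap-year quirk for very small serial numbers is ignored.

```php
<?php
// Hypothetical helper: converts a spreadsheet serial date (days since the base date,
// with the time of day as the fractional part) into a DateTimeImmutable.
function excelSerialToDateTime(float $serial): DateTimeImmutable
{
    // Assumed base date; see the note above.
    $base = new DateTimeImmutable('1899-12-30 00:00:00', new DateTimeZone('UTC'));

    $days = (int) floor($serial);
    $seconds = (int) round(($serial - $days) * 86400);

    // Relative formats such as "-1 days" accept a sign, unlike "P-1D".
    return $base
        ->modify(sprintf('%+d days', $days))
        ->modify(sprintf('%+d seconds', $seconds));
}
```

Called with -1, the sketch yields 29 December 1899 instead of throwing.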

How to disable the removal of repeated cell values?

My Excel data is

A,B,C,D,E
1,london,london,DEC,12345

When I read it with the reader

        $reader = new Reader($options);
        $reader->open($tempFile);

        $data = [];
        foreach ($reader as $row) {
            var_dump($row);
            if (empty(implode('', $row))) {
                continue;
            } else {
                $data[] = $row;
            }
        }

the dump is

['A', 'B', 'C', 'D', 'E']
['1', 'london', 'DEC', '', 12345]

How can I get the full data?

['1', 'london', 'london', 'DEC', 12345]

Thanks

Please support number formats [$-F400] and [$-F800]

Hi, I have received multiple XLSX files that contain the number format code [$-F400]h:mm:ss\ AM/PM. According to the Ecma Office Open XML specification the F400 code is used for "System time format". There is also an F800 code for "System long date format".

Trying to load the file with this number format fails in NumberFormatTokenizer.php:753. The regex in line 751 doesn't match this format so the next call that tries to access $matches[1] results in Undefined array key 1.
Extending the regex to include an optional F (\$([^-]*)-[fF]?\d+) would fix it and the unit tests would still pass even though you might want to handle it differently to be able to apply the correct format.

It would be nice if you could add support for those format codes.

Thank you for providing this helpful library!
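
The suggested pattern can be tried out in isolation. The sketch below is hedged: the function name matchLocaleToken() is hypothetical, and it only demonstrates the regex proposed in this issue together with a guard against the "Undefined array key 1" failure. It does not show how NumberFormatTokenizer.php is actually structured, nor how a system time or long date format should be rendered afterwards.

```php
<?php
// Hypothetical helper: returns the currency string captured from a "[$...-NNN]" token,
// an empty string if the token has no currency part, or null if nothing matched.
function matchLocaleToken(string $format_code): ?string
{
    // Pattern taken from the suggestion above: an optional F before the digits.
    $pattern = '/\$([^-]*)-[fF]?\d+/';

    if (preg_match($pattern, $format_code, $matches) !== 1) {
        // Guard: never touch $matches[1] when the pattern did not match.
        return null;
    }

    return $matches[1];
}
```

For example, matchLocaleToken('[$-F400]h:mm:ss\ AM/PM') now matches with an empty currency part, while a code without any such token returns null instead of raising a warning.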

Is it possible to force the library to ignore all the data types that XLSX ..?

Hi, very good library, it works very well.
My question is: is it possible to force the library to ignore all the data types that XLSX has, so that all the cells are taken as a literal string, regardless of whether it was saved as a date, currency, decimal, etc.?
I did not find a way to do it.
Thanks

No option to skip empty rows

Excel sadly often saves xlsx files which end in a large number of explicit empty row elements (for example if the cells were formatted but then never used), which may take a form such as the following:

<row r="659" spans="1:10" x14ac:dyDescent="0.2">
	<c r="D659" s="3"/>
	<c r="E659" s="3"/>
</row>

Note that not a single <v> element is contained.

It would be nice to have a native option to ignore empty rows, ideally only at the end of the file. However, since a sequential reader cannot look ahead to know whether there will be further non-empty rows, this would probably be difficult to implement.

I would once again suggest to create feature parity with akeneo-labs/spreadsheet-parser, which skips any empty rows by default, but still updates the return value of Iterator::key() to reflect the actual excel row and therefore expose to the caller that an empty row was skipped.
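
Until such an option exists, trailing empty rows can be dropped on the caller's side without look-ahead by holding empty rows back until a later non-empty row proves that they are not trailing. The following is a minimal sketch of that idea; the function names isEmptyRow() and skipTrailingEmptyRows() are hypothetical, and the definition of "empty" (every cell is null or an empty string) is an assumption.

```php
<?php
// Hypothetical helper: a row counts as empty if every cell is null or ''.
function isEmptyRow(array $row): bool
{
    foreach ($row as $cell) {
        if ($cell !== null && $cell !== '') {
            return false;
        }
    }
    return true;
}

// Hypothetical wrapper: yields all rows except the empty ones at the very end,
// preserving the original keys (row numbers).
function skipTrailingEmptyRows(iterable $rows): Generator
{
    $pending = []; // Empty rows that may or may not turn out to be trailing.

    foreach ($rows as $row_number => $row) {
        if (isEmptyRow($row)) {
            $pending[] = [$row_number, $row];
            continue;
        }

        // A non-empty row follows, so the buffered empty rows are not trailing.
        foreach ($pending as [$pending_number, $pending_row]) {
            yield $pending_number => $pending_row;
        }
        $pending = [];

        yield $row_number => $row;
    }
    // Whatever is still buffered here is trailing and is silently dropped.
}
```

Since the reader is iterable, it could be wrapped as foreach (skipTrailingEmptyRows($reader) as $row_number => $row). The buffer grows with the longest run of empty rows, which is the price of not being able to look ahead.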

Handling invalid row spans values?

Hi there,

I am using xlsx-reader to parse a customer-provided spreadsheet, and it has handled their horror of a document with little drama. However, when trying to parse a particular sheet, it reaches a standard line and then throws an exception.

A non well formed numeric value encountered 
  at vendor/aspera/xlsx-reader/lib/Reader.php:296

After tracking this down, the spans value was 1:7167 7169:14335 14337:16379.

I'm unable to modify the spreadsheet, and visually inspecting it in Excel doesn't show any reason for this.

Is there anything that can be done via the code to gracefully handle this and move on to the next row?

Thanks
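
The reported spans value looks like a space-separated list of "first:last" ranges rather than a single range. A tolerant parser for such a value could look like the sketch below. This is only an illustration: the function names parseSpans() and lastSpannedColumn() are hypothetical, and what Reader.php does with the value on line 296 is not shown in this report.

```php
<?php
// Hypothetical helper: parses "1:7167 7169:14335" into [[1, 7167], [7169, 14335]].
// Parts that do not look like "first:last" are skipped instead of raising an error.
function parseSpans(string $spans): array
{
    $ranges = [];
    foreach (preg_split('/\s+/', trim($spans), -1, PREG_SPLIT_NO_EMPTY) as $part) {
        if (preg_match('/^(\d+):(\d+)$/', $part, $matches) === 1) {
            $ranges[] = [(int) $matches[1], (int) $matches[2]];
        }
    }
    return $ranges;
}

// Hypothetical helper: returns the highest column number covered by any range,
// or null if the value contained no usable range.
function lastSpannedColumn(string $spans): ?int
{
    $ranges = parseSpans($spans);
    if ($ranges === []) {
        return null;
    }
    return max(array_column($ranges, 1));
}
```

A caller that receives null can fall back to counting the cells it actually reads for that row and move on.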

Problem handling Currency with Symbol £ English (United Kingdom)

The value "170.00" formatted as Currency / Symbol £
comes out fine as "170.00",

but the value "" formatted as Currency / Symbol £ English (United Kingdom)
is coming out as "-Â£170.00"

No option to return percentage values as unformatted fractional number

XLSX files internally represent values such as "20%" as a floating point string like "0.2".
The NumberFormat class is currently hardcoded to multiply such values by 100, implicitly casting them to floats.
The "ReturnUnformatted" option does not affect this behavior as it is only checked later in the formatValue() method.

Applications may find it useful to read percentage values in their more semantically native fractional representation.
This is also particularly relevant when migrating from the akeneo-labs/spreadsheet-parser library, which returns percentages this way. This represents one of the few breaking changes between the akeneo library and this one that cannot be overcome by configuration.

Add XMLReader as a requirement

Under 'Requirements' in the README.md add XMLReader.

Bug; multi-sectioned formats are exploded incorrectly...

your code:
$sections = explode(';', $format['Code']);

doesn't handle if the ";" is quoted - I suggest:
$sections = preg_split('/(;)(?=(?:[^"]|"[^"]*")*$)/u', $format['Code']); // up to four sections, separated by an (unquoted) semicolon
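
As an alternative to a lookahead-based regex, the sections can be split with a small character scanner, which also makes it straightforward to respect backslash-escaped semicolons. This is a minimal sketch, not the library's code: the function name splitFormatSections() is hypothetical, and the escaping rules it assumes (double quotes delimit literal text, a backslash outside quotes escapes the next character) are a simplification.

```php
<?php
// Hypothetical helper: splits a number format code into its sections at every
// semicolon that is neither inside double quotes nor escaped with a backslash.
function splitFormatSections(string $code): array
{
    $sections = [''];
    $in_quotes = false;
    $length = strlen($code);

    for ($i = 0; $i < $length; $i++) {
        $char = $code[$i];
        $last = count($sections) - 1;

        // Outside quotes, a backslash makes the next character a literal.
        if ($char === '\\' && !$in_quotes && $i + 1 < $length) {
            $sections[$last] .= $char . $code[++$i];
            continue;
        }

        if ($char === '"') {
            $in_quotes = !$in_quotes;
        }

        if ($char === ';' && !$in_quotes) {
            $sections[] = ''; // Start the next section; the separator is dropped.
            continue;
        }

        $sections[$last] .= $char;
    }

    return $sections;
}
```

The scanner works byte by byte, which is safe for UTF-8 input because the three special characters are plain ASCII.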

Incompatible return type (Deprecated)

With PHP 8.1.7 I'm getting the following deprecation messages:

Deprecated:  Return type of Aspera\Spreadsheet\XLSX\Reader::current() should either be compatible with Iterator::current(): mixed, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in dist\vendor\aspera\xlsx-reader\lib\Reader.php on line 244

Deprecated:  Return type of Aspera\Spreadsheet\XLSX\Reader::next() should either be compatible with Iterator::next(): void, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in dist\vendor\aspera\xlsx-reader\lib\Reader.php on line 263

Deprecated:  Return type of Aspera\Spreadsheet\XLSX\Reader::key() should either be compatible with Iterator::key(): mixed, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in dist\vendor\aspera\xlsx-reader\lib\Reader.php on line 485

Deprecated:  Return type of Aspera\Spreadsheet\XLSX\Reader::valid() should either be compatible with Iterator::valid(): bool, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in dist\vendor\aspera\xlsx-reader\lib\Reader.php on line 496

Deprecated:  Return type of Aspera\Spreadsheet\XLSX\Reader::rewind() should either be compatible with Iterator::rewind(): void, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in dist\vendor\aspera\xlsx-reader\lib\Reader.php on line 219

Deprecated:  Return type of Aspera\Spreadsheet\XLSX\Reader::count() should either be compatible with Countable::count(): int, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in dist\vendor\aspera\xlsx-reader\lib\Reader.php on line 507

I'm loading xlsx-reader via:

use Aspera\Spreadsheet\XLSX\Reader;
use Aspera\Spreadsheet\XLSX\ReaderConfiguration;
use Aspera\Spreadsheet\XLSX\ReaderSkipConfiguration;

$options = (new ReaderConfiguration())
	->setSkipEmptyCells(ReaderSkipConfiguration::SKIP_EMPTY)
	->setReturnDateTimeObjects(true);
$reader = new Reader($options);
$reader->open($sheet);

Using the most current package:
aspera/xlsx-reader v0.10.1 Spreadsheet reader library for Microsoft Excel XLSX files

Is there something I'm missing?

Thanks, oNdsen
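
The notices above name two remedies: compatible return types, or the #[\ReturnTypeWillChange] attribute. The sketch below shows how both can be combined in a class that still has to run on PHP 7.1, where the mixed type is not available. The class RowIterator is a hypothetical stand-in written for illustration and is not the library's Reader.

```php
<?php
// Hypothetical stand-in for an iterator class that must support PHP 7.1 and 8.1+.
class RowIterator implements Iterator, Countable
{
    private $rows;
    private $position = 0;

    public function __construct(array $rows)
    {
        $this->rows = array_values($rows);
    }

    // "mixed" only exists from PHP 8.0 on, so the attribute is used instead.
    // On PHP 7, a line starting with "#" is parsed as a comment.
    #[\ReturnTypeWillChange]
    public function current()
    {
        return $this->rows[$this->position];
    }

    #[\ReturnTypeWillChange]
    public function key()
    {
        return $this->position + 1; // 1-based row number
    }

    // void, bool and int return types are available in PHP 7.1 and can be declared directly.
    public function next(): void
    {
        $this->position++;
    }

    public function rewind(): void
    {
        $this->position = 0;
    }

    public function valid(): bool
    {
        return array_key_exists($this->position, $this->rows);
    }

    public function count(): int
    {
        return count($this->rows);
    }
}
```

Iterating such a class with foreach on PHP 8.1 should no longer emit the deprecation notices, since every method either declares a compatible return type or carries the attribute.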

Error opening XLS

I am trying to open this XLS file from the URL location "Año 2021, Trimestre IV", but it throws an error every time. How can I fix or debug the problem?

Tests throwing errors

Hi there,

I'm currently writing a PR to add a feature but when running tests without any modification to the master code, it's throwing errors:

25) Aspera\Spreadsheet\XLSX\Tests\CustomNumberFormatTest::testFormat with data set "scientific notation - exponent larger than 1 digit" ('0.00000000005', '0.00E+0', '5.00E-11')
Undefined array key 0

/xlsx-reader/lib/NumberFormat.php:499
/xlsx-reader/lib/NumberFormat.php:273
/xlsx-reader/lib/NumberFormat.php:189
/xlsx-reader/tests/CustomNumberFormatTest.php:71

When debugging the error, it is passing 123.00 as a value to the test, but is then converted to 123. due to the following code:

// Remove insignificant zeroes for now, we will (re-)add them based on format_info next.
if (strpos($number, '.') !== false) {
    $number = rtrim($number, '0');
}

This results in this failing, as it returns an empty array:

$right_side_chars = str_split($number_parts[1]);
if ($right_side_chars[0] === '') { // Side-effect of str_split('')
    $right_side_chars = array();
}
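
One relevant detail: since PHP 8.2, str_split('') returns an empty array instead of an array containing one empty string, so the check on $right_side_chars[0] fails on an input such as "123." when running on newer PHP versions. A version-independent way to obtain the fractional digits could look like the following sketch. The function name fractionalDigits() is hypothetical, and the snippet is not a drop-in patch for NumberFormat.php.

```php
<?php
// Hypothetical helper: returns the significant fractional digits of a numeric
// string as a list of characters, e.g. "0.050" => ['0', '5'].
function fractionalDigits(string $number): array
{
    if (strpos($number, '.') !== false) {
        // Remove insignificant zeroes, then a dangling decimal point ("123." => "123").
        $number = rtrim(rtrim($number, '0'), '.');
    }

    $parts = explode('.', $number, 2);
    $fraction = $parts[1] ?? '';

    // Avoid calling str_split() on an empty string; its result differs between PHP versions.
    return $fraction === '' ? [] : str_split($fraction);
}
```

Checking the string before splitting it avoids relying on either behaviour of str_split().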

LibreOffice xlsx document overload

When I try to read an xlsx file created by LibreOffice, I see a large number of null arrays. Also, how can I get the column name as the array key, for example [A => 'text data']?
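
For the second part of the question, rows can be re-keyed by column letter on the caller's side. The following is a minimal sketch; the function names columnLetter() and keyRowByColumnLetter() are hypothetical, and it assumes that each row is a zero-based list of cell values in column order.

```php
<?php
// Hypothetical helper: converts a zero-based column index into spreadsheet
// column letters (0 => A, 25 => Z, 26 => AA, ...).
function columnLetter(int $index): string
{
    $letter = '';
    for ($n = $index + 1; $n > 0; $n = intdiv($n - 1, 26)) {
        $letter = chr(65 + ($n - 1) % 26) . $letter;
    }
    return $letter;
}

// Hypothetical helper: re-keys a row so that each value is addressed by its column letter.
function keyRowByColumnLetter(array $row): array
{
    $keyed = [];
    foreach (array_values($row) as $index => $value) {
        $keyed[columnLetter($index)] = $value;
    }
    return $keyed;
}
```

Inside the read loop, keyRowByColumnLetter($row) then gives an array of the form ['A' => 'text data', ...].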