
ext-tika's People

Contributors

3l73, andreaswolf, dkd-friedrich, dkd-kaehm, dkd-private-packagist, dkd-schmidt, doan2013, eliashaeussler, georgringer, helhum, ichhabrecht, irnnr, megamisan, neufeind, peterkraume, rostyslavmatviiv, thomashohn, timohund


ext-tika's Issues

Test files are recognized as viruses

Hi,

we recently had a virus report from our hoster.

The test documents testWORD.doc and test-documents.tbz2 appear to have an infection with Win.Exploit.CVE_2016_3316-1.

Not sure if this is a false positive but I wanted to let you know.

Cheers,
Alex

[FEATURE] Make the Java VM MaxHeapSize configurable, to avoid the Java VM error "Could not reserve enough space for object heap"

I use the tika app for heavy extraction work on a website full of media files.

Here, I had to add another memory-expanding argument to the tika app command:

-Xmx512M

(verbose version: -XX:MaxHeapSize=512m)

This prevents the Java VM error "Could not reserve enough space for object heap".

The tika EXTCONF could offer options like 256m, 512m etc., which would then be applied to the tika app java shell calls.

Thanks for your continued work!
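
A minimal sketch of how such a setting could be applied when the command line is assembled. The function name, its parameters, and the idea of reading the heap size from the extension configuration are assumptions for illustration, not the extension's actual code; `-Xmx`, `-jar`, and Tika app's `-m` (metadata) switch are the only real flags used.

```php
<?php
// Hypothetical helper: builds the shell command for one Tika app call.
// $maxHeapSize stands for an assumed extension setting such as "256m" or "512m".
function buildTikaAppCommand(
    string $javaPath,
    string $tikaJar,
    string $file,
    string $maxHeapSize = ''
): string {
    $command = escapeshellarg($javaPath);

    // Only accept values like "512m" or "1g"; anything else is ignored
    // so that a typo in the configuration cannot inject shell arguments.
    if (preg_match('/^\d+[kmg]$/i', $maxHeapSize) === 1) {
        $command .= ' -Xmx' . $maxHeapSize;
    }

    return $command
        . ' -jar ' . escapeshellarg($tikaJar)
        . ' -m ' . escapeshellarg($file);
}

// Example call with placeholder paths.
echo buildTikaAppCommand('java', '/opt/tika-app.jar', 'song.mp3', '512m'), PHP_EOL;
```

Leaving the setting empty keeps the JVM default, so existing installations would behave as before.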

Need for protocol and ignore of port number

We are about to go into production with the ext-tika extension, but we have two major problems.

  1. It should be possible to provide the protocol for the Tika server, since we run https on all our systems.
  2. If the port number is empty, the extension should not add a : to the connect string, since this won't work.

It would be nice to have these 2 fixed ASAP - alternatively we could provide a pull request for it.
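
Both requests could be covered by one small URL builder, sketched below. The function name and parameter order are assumptions for illustration and do not reflect the extension's current code.

```php
<?php
// Hypothetical helper: assembles the Tika server URL from its configured parts.
function buildTikaServerUrl(
    string $host,
    string $scheme = 'http',
    string $port = '',
    string $path = '/'
): string {
    $url = $scheme . '://' . $host;

    // Only append ":port" when a port is actually configured.
    if ($port !== '') {
        $url .= ':' . $port;
    }

    // Normalize the path so that it always starts with exactly one slash.
    return $url . '/' . ltrim($path, '/');
}

echo buildTikaServerUrl('tika.example.org', 'https'), PHP_EOL;
echo buildTikaServerUrl('localhost', 'http', '9998', 'tika'), PHP_EOL;
```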

Add normalization of xmpDM:duration to MetaDataExtractor::normalizeMetaData()

For extracting mp3 metadata, I had to add to \ApacheSolrForTypo3\Tika\Service\Extractor\MetaDataExtractor::normalizeMetaData() a mapping of xmpDM:duration to (int)($value / 1000).

Maybe this could be configurable?
EXT:extractor has a nicely configurable metadata mapping (normalization), so no code change would be necessary there. However, EXT:extractor does not support SolrCell, only a local Tika app or a Tika server, while EXT:tika handles this very nicely.
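
The mapping described above amounts to the following sketch. The standalone function and the target key `duration` are assumptions; the conversion itself, `(int)($value / 1000)`, is the one from this report.

```php
<?php
// Hypothetical standalone version of the extra normalization step.
function normalizeDuration(array $metaData): array
{
    // Per this report, the value is divided by 1000 and truncated to an integer.
    if (isset($metaData['xmpDM:duration']) && is_numeric($metaData['xmpDM:duration'])) {
        // 'duration' as the target key is an assumption for illustration.
        $metaData['duration'] = (int)($metaData['xmpDM:duration'] / 1000);
    }

    return $metaData;
}

var_dump(normalizeDuration(['xmpDM:duration' => '215040']));
```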

ping function missing in SolrCellService

Fatal error: Call to undefined method ApacheSolrForTypo3\Tika\Service\Tika\SolrCellService::ping() in /..../typo3conf/ext/tika/Classes/Backend/SolrModule/TikaControlPanelModuleController.php on line 242

How to reproduce:

  • latest git master from solr and tika
  • Backend Module "search" => click on "Tika"

Compare for Solr Version > 3.1 throws error

In SolrCellService.php, a version comparison for Solr > 3.1 leads to an error because the variable $solrVersion also contains the patch version, which leads to a value like 3.1.21.
That triggers a call to the Solr method extractByQuery, which is only available in Solr 4.

The variable $solrVersion should be stripped down to the major and minor version number so that the condition evaluates correctly.
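
The effect can be reproduced with PHP's `version_compare()`. This is only a sketch: the helper name `toMajorMinor()` is hypothetical, and the exact comparison used in SolrCellService.php is not shown in this report.

```php
<?php
// Hypothetical helper: reduces "3.1.21" to "3.1".
function toMajorMinor(string $version): string
{
    $parts = explode('.', $version);

    // Fall back to ".0" if no minor version is present.
    return $parts[0] . '.' . ($parts[1] ?? '0');
}

// With the patch level included, 3.1.21 counts as greater than 3.1.
var_dump(version_compare('3.1.21', '3.1', '>'));

// After stripping the patch level, 3.1 is no longer greater than 3.1.
var_dump(version_compare(toMajorMinor('3.1.21'), '3.1', '>'));
```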

Tika extracts and saves wrong width/height metadata of jpegs (EXIF)

Hi,

as we can now reproduce, ext-tika seems to save wrong height and width metadata of jpeg (and maybe other) files.

Main problem/symptom - for example:
when a file is cropped or scaled down before uploading (e.g. photoshop) there will still be the initial width and height values in the metadata of this file along with the new and correct values (correlates with the EXIF data of the file).
Now after the upload process (with the tika extension installed) the initial values will be saved in the TYPO3 database (sys_file_metadata) instead of the new and correct values.
If you now want to crop the image with the TYPO3 cropping tool out of the core, it will save wrong cropping areas because of these values in sys_file_metadata.

Steps to reproduce:

  1. Edit a .jpeg file with an image manipulation program like Photoshop and scale it down to a reduced resolution.
  2. Upload the edited file via the normal TYPO3 upload in the BE while the tika extension is installed and active.
  3. Check the values of "height" and "width" for this file in the sys_file_metadata table. Those should now be the wrong values.

Counter check:

  • If you upload the previously edited file via FTP/SFTP into your fileadmin and get it indexed by TYPO3 via opening the file module (so not uploading it via the BE) the values are stored correctly.
  • If you deactivate the tika extension and proceed exactly like the reproduction steps above the values are stored correctly too.

If you need any further information, don't hesitate to hit me up.

Add mp3 mime type 'audio/mpeg' to SolrCellService::getSupportedMimeTypes()

As a workaround I had to add $GLOBALS['TYPO3_CONF_VARS']['SYS']['FileInfo']['fileExtensionToMimeType']['mp3'] = 'audio/mpeg3'; for \ApacheSolrForTypo3\Tika\Service\Tika\SolrCellService::getSupportedMimeTypes() to match.

However solr then returns the 'audio/mpeg' mimetype for the extracted mp3 file - so rather getSupportedMimeTypes() should be extended by adding 'audio/mpeg'.

'audio/mpeg' is RFC-defined: https://tools.ietf.org/html/rfc3003
It is also the first mime type mentioned at Wikipedia "MP3": https://en.wikipedia.org/wiki/MP3

[DISCUSSION] Drop Tika app and Solr Cell support?

There might be some reasons to drop support for Tika app and Solr Cell:

  • Tika app is slow as it needs to boot the JVM for each invocation
  • Tika server, by contrast, is much faster as it stays running and waits for requests
  • Solr Cell does not support all the features as provided by Tika app/server

If we were to decide to do that, it would also result in a new major version as it is a breaking change. Nothing is set in stone or even decided yet. We're just looking for opinions for now.

[TASK] Require java only for app mode

Since the tika server can also run on another node, it's not required that java is installed.

We should raise the following warnings / errors when java is not installed:

App mode: Error
Server mode: Warning
Solr mode: nothing
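
The proposed behavior could be expressed as a simple lookup, sketched here. The mode identifiers `app`, `server`, and `solr` and the returned severity strings are assumptions for illustration.

```php
<?php
// Hypothetical mapping from extractor mode to the severity that should be
// reported when no java binary is found on the TYPO3 host.
function severityWhenJavaIsMissing(string $mode): string
{
    $severities = [
        'app' => 'error',      // Tika app cannot run without a local JVM
        'server' => 'warning', // the server may run on another node
        'solr' => 'none',      // extraction is delegated to Solr
    ];

    // Unknown modes are treated like "nothing to report" in this sketch.
    return $severities[$mode] ?? 'none';
}

echo severityWhenJavaIsMissing('server'), PHP_EOL;
```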

[TASK] Simplify classname usage by using ::class when possible

Since php 5.5 calls like:

    \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\Imaging\IconRegistry');

can be changed to:

    use TYPO3\CMS\Core\Imaging\IconRegistry;
    \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(IconRegistry::class);

We should use this where possible to make the code more readable and aligned with ext-solr.

[BUG] file_get_contents() Warning under CMS 8 LTS

Under CMS 8 LTS we get a warning in Backend:

Core: Error handler (BE): PHP Warning: file_get_contents(http://localhost:9999/tika): failed to open stream: Connection refused in /var/www/8.7.local.typo3.org/typo3conf/ext/tika/Classes/Service/Tika/ServerService.php line 242
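
One way to avoid the warning is to treat a failed request as "server not available" instead of letting PHP report it. The sketch below is an assumption about how such a check could look, not the code in ServerService.php; the function name and the timeout value are made up for illustration.

```php
<?php
// Hypothetical availability check that does not raise a PHP warning
// when the Tika server refuses the connection.
function isTikaServerReachable(string $url, float $timeoutSeconds = 2.0): bool
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // The @ operator suppresses the "failed to open stream" warning;
    // file_get_contents() still returns false on failure.
    $response = @file_get_contents($url, false, $context);

    return $response !== false;
}

var_dump(isTikaServerReachable('http://localhost:9999/tika'));
```

Note that in this sketch an HTTP error status from a running server also counts as not reachable, which may or may not be the desired behavior.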

Check for valid configuration

EXT:tika 2.0 won't use the old service APIs anymore. Those checked whether a service is available. Because each check would have been costly for Tika (it starts the JVM every time), we stored the results of a check in the TYPO3 registry.

We won't store the availability check results anymore and will instead assume the configuration to be valid.

However, we should add checks for valid configuration to the Status Report.

Add Solr configuration selector

In the Extension Manager configuration view:

If EXT:solr is installed offer a custom field type to select the Solr server connection instead of having to enter the host, port, and path.

[BUG] ContextMenu Provider throws Error

If I open a ContextMenu from a page or item, the Provider from Tika (ApacheSolrForTypo3\Tika\ContextMenu\Preview) tries to check if it can handle the context menu or not.
If there is no file with the same uid as the page or content element, the Provider will throw an error, for example:
#1317178604: No file found for given UID: 120

So Tika is blocking you from using the ContextMenu at Pages and Content Elements (maybe more).
Checked on TYPO3 8.7.17/18

[BUG] TIKA preview breaks in page-menu

TIKA preview breaks in the page-menu - the problem seems to be the following condition

        if (!$this->table === 'sys_file') {
            return false;
        }

in the canHandle method - instead it should be:

        if ($this->table !== 'sys_file') {
            return false;
        }
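
The reason is operator precedence: `!` binds tighter than `===`, so the original condition compares a boolean with a string and can never be true. A small sketch (the function names are made up for the demonstration):

```php
<?php
// Original condition: !$table is evaluated first and yields a boolean,
// which is never identical to the string 'sys_file'.
function buggyIsOtherTable(string $table): bool
{
    return !$table === 'sys_file';
}

// Corrected condition: true for every table except sys_file.
function fixedIsOtherTable(string $table): bool
{
    return $table !== 'sys_file';
}

foreach (['pages', 'tt_content', 'sys_file'] as $table) {
    var_dump([$table, buggyIsOtherTable($table), fixedIsOtherTable($table)]);
}
```

With the original condition the early return is never taken, so the provider also tries to handle pages and content elements.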

Make supportedFileTypes in extractors extension configuration

It would be very nice if the supportedFileTypes were not hardcoded in the extractors but kept as a list in the extension configuration, since you might have sites where you would like to be able to configure this. I can provide a pull request for this, since I currently have to XClass the extractors in order to control it.

[BUG] Unsupported exception leads to error in uploads

From https://forge.typo3.org/issues/77659

Using Tika with SOLR for metadata extraction.

The Unsupported exception in SolrCellService::detectLanguageFromFile leads to failure reporting the success of a file upload.

Steps to reproduce:

  • Have TIKA configured to use SOLR as extraction tool;
  • Go to the Filelist module;
  • Go to some folder;
  • Click on the upload button;
  • Select a PDF file;
  • See the upload progress showing 100% but not that the upload is finished;
  • Reload the page using the refresh button;
  • See the error message "Uploaded file could not be moved! Write-permission problem in "%s"?"

This error message comes from ExtendedFileUtility::func_upload(1157). When a breakpoint is set here the thrown exception is from TIKA with the message "The Tika Solr service does not support language detection". Being unable to extract metadata should not prevent getting an upload finished message.
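
A defensive wrapper along the following lines would keep the upload going. It is only a sketch: the exception class and the function are stand-ins defined here for illustration, not the classes used by EXT:tika.

```php
<?php
// Stand-in for the exception thrown when a service lacks a feature
// (hypothetical name, defined here so that the example is self-contained).
class UnsupportedFeatureException extends \Exception
{
}

// Calls the given detector and returns null instead of failing
// when language detection is not supported.
function detectLanguageOrNull(callable $detector, string $file): ?string
{
    try {
        return $detector($file);
    } catch (UnsupportedFeatureException $e) {
        // Missing metadata should not abort the surrounding upload.
        return null;
    }
}

// Simulates the Solr Cell behavior described in this report.
$solrCellDetector = function (string $file): string {
    throw new UnsupportedFeatureException(
        'The Tika Solr service does not support language detection'
    );
};

var_dump(detectLanguageOrNull($solrCellDetector, 'document.pdf'));
```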

Get supported file types from Tika

At the moment the supported file types are hard coded within the extension.
Tika can provide a complete list of file types it supports. The extension should query Tika for that list and use that when TYPO3 asks what file types we can handle.

Check compatibility with EXT:solr dev-master / 4.0.0

We should:

  • Adapt the code (SolrService::extract => SolrService::extractByQuery) (PHP 7.0 compatibility)
  • Adapt icon registration
  • Use core ViewHelpers instead of solr:backend.button.ActionButton

to make the extension compatible with EXT:solr dev-master for the upcoming 4.0.0 release.
