Comments (16)

timohund commented on May 26, 2024

@thomashohn Thanks, I think this is a good idea. It would be nice to allow a configuration that limits the file extensions that are sent to Tika. If you can provide a patch, that would be nice!

from ext-tika.

thomashohn commented on May 26, 2024

@timohund There seem to be some TODOs about fetching this from the Tika server - but I think I would prefer to have the extension configure what to extract and what not?

irnnr commented on May 26, 2024

I can't quite follow here what you want to achieve. With the last release we're querying Tika for supported file types already instead of having a hard-coded list. Am I missing something? Can you point to the concrete code you're referring to?

thomashohn commented on May 26, 2024

The files in https://github.com/TYPO3-Solr/ext-tika/tree/master/Classes/Service/Extractor seem pretty hardcoded to me - or am I wrong?

irnnr commented on May 26, 2024

I still don't know what you mean

thomashohn commented on May 26, 2024

I see the MetaDataExtractor changed to fetch data from Tika :-) So if I would like to exclude files - I need to configure that on the Tika server? For instance, what if I don't want it to extract metadata for images?

irnnr commented on May 26, 2024

It seems it's a really slow Saturday morning for me since I really can't follow you.

  • Can you point to concrete lines of code?
  • Explain your issue?
  • Explain what you want to achieve/change?
  • Why?

Of course the EXT:tika fetches meta data from Tika, what else would you expect? It's been like that since forever.

Why wouldn't you want image meta data? That's data such as width, height, exposure, camera, geo location, description...

Please describe it to me in easy language^^ :)

thomashohn commented on May 26, 2024

Before the new release:

    public function canProcess(File $file)
    {
        // TODO use MIME type instead of extension
        // tika.jar --list-supported-types -> cache supported types
        // compare to file's MIME type

        return in_array(
            $file->getProperty('extension'),
            $this->supportedFileTypes
        );
    }

$this->supportedFileTypes was a "hardcoded" array.
Now it's:

    public function canProcess(File $file)
    {
        $tikaService = $this->getExtractor();
        $mimeTypes = $tikaService->getSupportedMimeTypes();

        return in_array($file->getMimeType(), $mimeTypes);
    }

If I don't want to process, say, GIF files - would I need to configure that on the Tika server? Before, I had to take gif out of the array $this->supportedFileTypes.
So with the new version I have to be sure my Tika server is configured to only send back the supported MIME types I want to process?
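
For illustration, the exclusion asked about here could also live on the extension side rather than on the Tika server. The following is a minimal sketch under stated assumptions, not EXT:tika's actual code: the function name canProcessMimeType(), the $excludedMimeTypes list, and the "image/*" wildcard convention are all hypothetical; only the final in_array() check mirrors the new canProcess() quoted above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of EXT:tika: decides whether a file should
 * be sent to Tika, given the MIME types Tika reports as supported and a
 * locally configured exclusion list (an assumption for this sketch).
 *
 * @param string   $mimeType           MIME type of the file, e.g. "image/gif"
 * @param string[] $supportedMimeTypes MIME types reported by Tika
 * @param string[] $excludedMimeTypes  exclusions; "image/*" excludes a whole family
 */
function canProcessMimeType(
    string $mimeType,
    array $supportedMimeTypes,
    array $excludedMimeTypes = []
): bool {
    foreach ($excludedMimeTypes as $excluded) {
        if (substr($excluded, -2) === '/*') {
            // A trailing "/*" excludes every subtype of that top-level type.
            $prefix = substr($excluded, 0, -1); // keeps the slash, e.g. "image/"
            if (strncmp($mimeType, $prefix, strlen($prefix)) === 0) {
                return false;
            }
        } elseif ($mimeType === $excluded) {
            return false;
        }
    }

    // Same idea as the new canProcess(): Tika still has to support the type.
    return in_array($mimeType, $supportedMimeTypes, true);
}

$supported = ['application/pdf', 'image/gif', 'image/jpeg'];

var_dump(canProcessMimeType('image/gif', $supported, ['image/gif']));  // bool(false)
var_dump(canProcessMimeType('image/jpeg', $supported, ['image/gif'])); // bool(true)
var_dump(canProcessMimeType('image/jpeg', $supported, ['image/*']));   // bool(false)
```

With an exclusion list such as ['image/*'], image files would be skipped even though Tika reports them as supported. Where such a list would be configured is left open here.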

irnnr commented on May 26, 2024

Ok, clear now, thanks! :)

However, it's still not clear why anyone would want to do that. Also, as you noticed, it had a TODO comment before :) - it was a missing feature. As you mentioned, you modified the extension before. It was never something we supported so far. I'm not even sure Tika supports selectively enabling meta data extraction. If it does though, that's where I'd look.

I don't think this should be, or needs to be, something EXT:tika does (for 95% of use cases).

thomashohn commented on May 26, 2024

If you buy images from iStock and other companies, the images contain a lot of additional meta information you don't want to extract because it will confuse your users when searching. I'll make a PR anyway since I'm fixing it in my own code - then you can decide if it should be merged into EXT:tika or not ;-)

irnnr commented on May 26, 2024

Hmm, IMO that's usually pretty valuable meta data. Maybe you can provide an example?

thomashohn commented on May 26, 2024

Hi - yes I can.

  1. You have a lot of files with existing meta data and start to use Solr and Tika - your "old" valuable meta data will be overwritten - which is kind of annoying.
  2. The meta data in the files does not match the kind of meta data you want. For an iStock photo, for instance, that could be the title. You might want another title or want to add info to the title - this is not possible.

I find the PR quite realistic and it comes from a real-world scenario :-)

dkd-dobberkau commented on May 26, 2024

A short side note from me. I see the use case, but I'd rather discuss this with you in the new year. TYPO3 is missing a meta data manager, and therefore curation of meta data could be something that an add-on could offer.

thomashohn commented on May 26, 2024

Fine with me - as I said earlier in the thread - I need to make a "fix" in my own code no matter what - since we can't retrieve meta-data from image files currently :-)

irnnr commented on May 26, 2024

Ok, I can see your use case (and your pain stemming from it), too now.

Now here's how I see the situation: IMO EXT:tika is a pure utility to extract meta data from files, a utility that is called/used by the TYPO3 core. The tika extension does not know about any existing meta data for a file that you might want to keep. Neither does the extension offer any custom mapping.

The mapping issue can be seen as a missing feature; I believe EXT:extractor offers something like that.

However, the extension's job is to simply provide meta data to the core. On that end I agree with Olivier, that what you describe is rather an issue that falls into the responsibility of the TYPO3 core.

So my suggestion would be: Feel free to open another issue for meta data property mapping, that would actually be useful to have. However, knowing about when to overwrite data in what cases is not (currently) in the domain of EXT:tika.

Advice for filing future issues:
I had to ask multiple times to understand your issue. The easier you can make it for us to understand your situation, the easier it will be for us to help you and/or agree with your issue. You should always provide as much information as possible. Read through this whole convo again and I hope you will see it was not easy to understand why/what issue you had. That saves us both a lot of time.

dkd-kaehm commented on May 26, 2024

Fixed in #48
