New Module: Apache Tika & `RAW_DATA` events about bbot HOT 11 OPEN

domwhewell-sage commented on July 19, 2024

New Module: Apache Tika & `RAW_DATA` events

from bbot.

Comments (11)

TheTechromancer commented on July 19, 2024

you need to have Java 7+ installed on your system


TheTechromancer commented on July 19, 2024

A possible alternative: https://github.com/Unstructured-IO/unstructured.

Probably would also need to run in a container, but still worth testing.


domwhewell-sage commented on July 19, 2024

If it's OK to have metadata extraction and text extraction as separate modules, I think unstructured might be the best choice for text extraction. I have created the module in my fork and will be testing it some more.

https://github.com/domwhewell-sage/bbot/blob/unstructured/bbot/modules/unstructured.py


domwhewell-sage commented on July 19, 2024

Something like...

import requests

# Both functions are sketched as methods of the module class.
async def setup(self):
    await self.run_process("systemctl", "start", "docker", sudo=True)
    await self.run_process("docker", "pull", "apache/tika:latest", sudo=True)
    # Assumes the container is then started with host port 8889 mapped to the Tika server port
    self.tika_url = "http://127.0.0.1:8889"
    return True

def extract_text(self, file_path):
    """
    Extracts plaintext from a document path using Tika.

    :param file_path: The path of the file to extract text from.
    :return: ASCII-encoded plaintext extracted from the document, or None on failure.
    """
    with open(file_path, "rb") as f:
        # PUT the raw file to the Tika server's /tika endpoint and ask for plain text
        resp = requests.put(f"{self.tika_url}/tika", data=f, headers={"Accept": "text/plain"})
    if resp.status_code == 200:
        return resp.text.strip().encode("ascii", "ignore").decode()
    return None


domwhewell-sage commented on July 19, 2024

Or instead of using docker there is a pip package https://pypi.org/project/tika/

from tika import parser

def extract_text(file_path):
    """
    Extracts plaintext from a document path using Tika.

    :param file_path: The path of the file to extract text from.
    :return: Plaintext extracted from the document.
    """
    # from_file returns a dict with "metadata" and "content" keys
    parsed = parser.from_file(file_path)
    return parsed["content"]

It appears this is a wrapper around the (Java) REST server, so this method would probably involve managing a Java install 🤢

To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background.


domwhewell-sage commented on July 19, 2024

Whilst I investigate unstructured

https://github.com/domwhewell-sage/bbot/blob/fileparser/bbot/modules/fileparser.py


domwhewell-sage commented on July 19, 2024

unstructured

Install is either via a container or via a pip package plus a few apt dependencies. The pip install failed for me a few times with the error `error: command 'x86_64-linux-gnu-gcc' failed with exit status 1`, so I opted to use the container instead. I'm not sure how you would add a file to the container without a REST endpoint available, but these are the results I got with a PDF generated by Python reportlab inside the container:

>>> from unstructured.partition.auto import partition
>>> elements = partition(filename="simple_pdf.pdf")
>>> for element in elements:
...  print(element)
...
Hello, I am a PDF document created with Python!
>>> for element in elements:
...  print(element.metadata.to_dict())
...
{'coordinates': {'points': ((100.0, 82.37380000000007), (100.0, 94.37380000000007), (362.77600000000007, 94.37380000000007), (362.77600000000007, 82.37380000000007)), 'system': 'PixelSpace', 'layout_width': 595.2756, 'layout_height': 841.8898}, 'filename': 'simple_pdf.pdf', 'languages': ['eng'], 'last_modified': '2024-06-04T11:16:09', 'page_number': 1, 'filetype': 'application/pdf'}

Apache tika

Install is either via a container or a Java file. Managing a Java install could be painful, so I opted for the container. The container exposes a REST endpoint where you can upload files, and the response is returned in JSON format. Below is the response I got back from a similar PDF generated using reportlab.

{
        "pdf:unmappedUnicodeCharsPerPage": "0",
        "pdf:PDFVersion": "1.3",
        "pdf:docinfo:title": "untitled",
        "xmp:CreatorTool": "ReportLab PDF Library - www.reportlab.com",
        "pdf:hasXFA": "false",
        "access_permission:modify_annotations": "true",
        "access_permission:can_print_degraded": "true",
        "X-TIKA:Parsed-By-Full-Set": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "dc:creator": "anonymous",
        "pdf:num3DAnnotations": "0",
        "dcterms:created": "2024-06-03T18:58:16Z",
        "dcterms:modified": "2024-06-03T18:58:16Z",
        "dc:format": "application/pdf; version=1.3",
        "pdf:docinfo:creator_tool": "ReportLab PDF Library - www.reportlab.com",
        "pdf:overallPercentageUnmappedUnicodeChars": "0.0",
        "access_permission:fill_in_form": "true",
        "pdf:docinfo:modified": "2024-06-03T18:58:16Z",
        "pdf:hasCollection": "false",
        "pdf:encrypted": "false",
        "dc:title": "untitled",
        "pdf:containsNonEmbeddedFont": "true",
        "Content-Length": "1426",
        "pdf:docinfo:subject": "unspecified",
        "pdf:hasMarkedContent": "false",
        "Content-Type": "application/pdf",
        "pdf:docinfo:creator": "anonymous",
        "pdf:producer": "ReportLab PDF Library - www.reportlab.com",
        "dc:subject": "unspecified",
        "pdf:totalUnmappedUnicodeChars": "0",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:assemble_document": "true",
        "xmpTPg:NPages": "1",
        "pdf:hasXMP": "false",
        "pdf:charsPerPage": "13",
        "access_permission:extract_content": "true",
        "access_permission:can_print": "true",
        "pdf:docinfo:trapped": "False",
        "X-TIKA:Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "X-TIKA:content": '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="pdf:PDFVersion" content="1.3" />\n<meta name="pdf:docinfo:title" content="untitled" />\n<meta name="xmp:CreatorTool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:hasXFA" content="false" />\n<meta name="access_permission:modify_annotations" content="true" />\n<meta name="access_permission:can_print_degraded" content="true" />\n<meta name="dc:creator" content="anonymous" />\n<meta name="dcterms:created" content="2024-06-03T18:58:16Z" />\n<meta name="dcterms:modified" content="2024-06-03T18:58:16Z" />\n<meta name="dc:format" content="application/pdf; version=1.3" />\n<meta name="pdf:docinfo:creator_tool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="access_permission:fill_in_form" content="true" />\n<meta name="pdf:docinfo:modified" content="2024-06-03T18:58:16Z" />\n<meta name="pdf:hasCollection" content="false" />\n<meta name="pdf:encrypted" content="false" />\n<meta name="dc:title" content="untitled" />\n<meta name="Content-Length" content="1426" />\n<meta name="pdf:docinfo:subject" content="unspecified" />\n<meta name="pdf:hasMarkedContent" content="false" />\n<meta name="Content-Type" content="application/pdf" />\n<meta name="pdf:docinfo:creator" content="anonymous" />\n<meta name="pdf:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="dc:subject" content="unspecified" />\n<meta name="access_permission:extract_for_accessibility" content="true" />\n<meta name="access_permission:assemble_document" content="true" />\n<meta name="xmpTPg:NPages" content="1" />\n<meta name="pdf:hasXMP" content="false" />\n<meta name="access_permission:extract_content" content="true" />\n<meta name="access_permission:can_print" content="true" />\n<meta name="pdf:docinfo:trapped" content="False" />\n<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />\n<meta name="X-TIKA:Parsed-By" 
content="org.apache.tika.parser.pdf.PDFParser" />\n<meta name="access_permission:can_modify" content="true" />\n<meta name="pdf:docinfo:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:docinfo:created" content="2024-06-03T18:58:16Z" />\n<title>untitled</title>\n</head>\n<body><div class="page"><p />\n<p>Hello, World!</p>\n<p />\n</div>\n</body></html>',
        "access_permission:can_modify": "true",
        "pdf:docinfo:producer": "ReportLab PDF Library - www.reportlab.com",
        "pdf:docinfo:created": "2024-06-03T18:58:16Z",
        "pdf:containsDamagedFont": "false",
    }
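
As a rough sketch of how a response like the one above could be consumed, the snippet below separates the metadata keys from `X-TIKA:content` and reduces the XHTML body to plain text. The function name `split_tika_response`, the helper class `_BodyText`, and the trimmed sample dict are illustrative assumptions, not part of Tika or BBOT.

```python
from html.parser import HTMLParser


class _BodyText(HTMLParser):
    """Collects text found inside <body>, ignoring <head> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the body
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def split_tika_response(response):
    """Return (metadata, text) from a parsed Tika response dict."""
    metadata = {k: v for k, v in response.items() if k != "X-TIKA:content"}
    extractor = _BodyText()
    extractor.feed(response.get("X-TIKA:content", ""))
    extractor.close()
    return metadata, "\n".join(extractor.chunks)


# Trimmed sample based on the response shown above
sample = {
    "dc:creator": "anonymous",
    "xmpTPg:NPages": "1",
    "X-TIKA:content": '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<title>untitled</title>\n</head>\n'
    '<body><div class="page"><p />\n<p>Hello, World!</p>\n<p />\n</div>\n</body></html>',
}
metadata, text = split_tika_response(sample)
print(metadata["dc:creator"], "|", text)
```

Keeping the two parts separate would fit the idea of treating metadata extraction and text extraction as distinct tasks.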

Apache Tika seemed to obtain more metadata, such as the links in the PDF's author info. It also seems to support more file types than unstructured. (Although I'm thinking of implementing an extensions list of files that will be parsed, like https://github.com/blacklanternsecurity/bbot/blob/stable/bbot/modules/filedownload.py#L24, just to avoid potentially incorrect strings being extracted from source code files.)
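
A minimal sketch of the extensions list idea, assuming a simple case-insensitive allowlist. The set `PARSEABLE_EXTENSIONS` and the function `should_parse` are hypothetical names, and the extensions listed are placeholders rather than a proposed final list.

```python
from pathlib import Path

# Hypothetical allowlist; the real list would be chosen by the module author
PARSEABLE_EXTENSIONS = {"pdf", "docx", "xlsx", "pptx", "odt", "rtf"}


def should_parse(file_path, allowed=PARSEABLE_EXTENSIONS):
    """Return True if the file's extension is on the allowlist (case-insensitive)."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    return extension in allowed


for name in ["report.PDF", "notes.docx", "main.py", "README"]:
    print(name, should_parse(name))
```

Source code files such as `main.py` are skipped simply because their extension is not listed, which avoids sending them to the parser at all.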

Installing either library via a container seems to be the way forward to avoid installation issues. I'm leaning towards Apache Tika, as the REST endpoint is very easy to use and it seems ready to go out of the box. It's a shame it requires a separate docker container, but that seems unavoidable. If bbot itself is running in a docker container, would it be able to spawn this Apache Tika docker container?


TheTechromancer commented on July 19, 2024

Okay I see the appeal of Apache Tika. But you make a good point about the docker container. I hadn't thought about the scenario of BBOT itself being in a container, which would make spawning another container unfeasible.
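
For the container-in-container scenario, one option is a best-effort check before trying to spawn anything. The sketch below relies on two common heuristics (the `/.dockerenv` marker file and the string `docker` in `/proc/1/cgroup`); neither is guaranteed to be present in every container runtime, and the function name `running_in_docker` is an assumption for illustration.

```python
import os


def running_in_docker(marker="/.dockerenv", cgroup_path="/proc/1/cgroup"):
    """Best-effort check for a Docker environment; a heuristic, not a guarantee."""
    if os.path.exists(marker):
        return True
    try:
        with open(cgroup_path, "r", encoding="utf-8") as f:
            return "docker" in f.read()
    except OSError:
        # File missing or unreadable (e.g. non-Linux hosts)
        return False


if running_in_docker():
    print("Inside a container: skip spawning Tika and require a user-supplied URL")
else:
    print("On a host: a local Tika container could be started")
```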

I guess dastardly is already affected by this, although I'm less concerned about that one since text extraction should be a core feature of BBOT. It's really important we get this right.

In this case, we need to support all possible architectures and installation methods, so I'm afraid docker is out of the picture. I think you might also agree that adding a java dependency to BBOT is not ideal.

What I'm wondering is if we can find a middle ground, maybe a golang or rust binary, that we can call similar to what we're currently doing with httpx and gowitness.


domwhewell-sage commented on July 19, 2024

I have managed to get the unstructured Python package working in a fresh environment; the earlier failures were probably caused by conflicting packages.

From the results, though, it didn't find as much metadata as Tika. From the unstructured documentation:

Unstructured metadata tracks general document information, like filename and file type, and more detailed document-specific information, such as element type.

They both obtained the contents of the file, which is probably most valuable for us. I don't have any objections to using unstructured instead, as long as we are OK with potentially missing out on some document metadata.


TheTechromancer commented on July 19, 2024

Okay, I think when it comes to metadata vs text extraction, it might be best to treat these as two separate tasks.

I'm not opposed to having an Apache Tika module. This would be pretty convenient and provide high-tier metadata and text extraction, at the cost of complexity. If we do that, I think it would make sense to have the docker setup, but allow the user to set their own URL if needed.
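
The "docker setup, but allow the user to set their own URL" idea could look roughly like this. The option name `tika_url` and the function `resolve_tika_url` are hypothetical; the default URL reuses the port from the earlier sketch in this thread.

```python
DEFAULT_TIKA_URL = "http://127.0.0.1:8889"  # port taken from the earlier sketch in this thread


def resolve_tika_url(config):
    """Return (url, manage_container) from a module config dict.

    `tika_url` is a hypothetical option name. When it is set, the module uses
    that server and leaves docker alone; otherwise it falls back to a local container.
    """
    user_url = (config.get("tika_url") or "").strip()
    if user_url:
        return user_url.rstrip("/"), False
    return DEFAULT_TIKA_URL, True


print(resolve_tika_url({}))
print(resolve_tika_url({"tika_url": "http://tika.internal:9998/"}))
```

With this shape, a user running BBOT inside a container can point the module at an existing Tika server, and the docker steps are skipped entirely.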

Eventually of course I would like BBOT to have a high-quality text extraction module, which doesn't require docker or Java. Since this is a CPU-intensive task, it would make sense to offload it into its own script. Whether that be a rust/golang/c++ binary, or a python script written by us (we could easily cover 95% of cases just by handling PDF + MS Office), this is the approach we should be using for most modules going forward.

Since BBOT has so many modules and CPU is so scarce in the main process, to get the max performance, it makes sense to use a simple binary or python script with parseable (i.e. JSON) output.
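
To illustrate the "simple binary or python script with parseable output" approach, here is a minimal sketch that runs an extractor in a child process and parses one JSON object per line. The extractor is a stand-in (`FAKE_EXTRACTOR`, an inline script that returns fixed text), and `run_extractor` is a hypothetical name; a real module would presumably use BBOT's own async process helpers rather than a blocking `subprocess.run`.

```python
import json
import subprocess
import sys

# Stand-in for an external extractor; a real one could be a Go/Rust binary or a
# separate Python script. It prints one JSON object per line.
FAKE_EXTRACTOR = (
    "import json, sys\n"
    "for path in sys.argv[1:]:\n"
    "    print(json.dumps({'path': path, 'text': 'Hello, World!'}))\n"
)


def run_extractor(paths):
    """Run the extractor in a child process and parse its JSON-lines output."""
    result = subprocess.run(
        [sys.executable, "-c", FAKE_EXTRACTOR, *paths],
        capture_output=True,
        text=True,
        check=True,
    )
    return [json.loads(line) for line in result.stdout.splitlines() if line.strip()]


for record in run_extractor(["a.pdf", "b.docx"]):
    print(record["path"], "->", record["text"])
```

The CPU-heavy work happens in the child process, and the main process only has to parse small JSON records.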

So yeah to summarize the ultimate goal is to have native functionality for metadata and text extraction, probably in separate modules. But since Tika is easier to implement, I'm open to using it in the meantime.

@domwhewell-sage which way are you leaning?


TheTechromancer commented on July 19, 2024

Sounds good, let's not forget to set SCARF_NO_ANALYTICS=true before we publish.
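
One way this could be done from Python, assuming the variable is the Scarf analytics opt-out read by the unstructured package and that it needs to be set before the package is imported:

```python
import os

# Opt out of Scarf analytics; set before the library is imported (assumption)
os.environ["SCARF_NO_ANALYTICS"] = "true"

# The import would follow only after the variable is set, e.g.:
# from unstructured.partition.auto import partition
print(os.environ["SCARF_NO_ANALYTICS"])
```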
