Comments (11)
you need to have Java 7+ installed on your system
from bbot.
A possible alternative: https://github.com/Unstructured-IO/unstructured.
Probably would also need to run in a container, but still worth testing.
from bbot.
If it's OK to have metadata extraction and text extraction as separate modules, I think unstructured might be the best module to add for text extraction. I have created the module in my fork and will be testing it some more.
https://github.com/domwhewell-sage/bbot/blob/unstructured/bbot/modules/unstructured.py
from bbot.
Something like...
import requests

async def setup(self):
    await self.run_process("systemctl", "start", "docker", sudo=True)
    await self.run_process("docker", "pull", "apache/tika:latest", sudo=True)
    # Start the server; the apache/tika image listens on port 9998 inside the container
    await self.run_process("docker", "run", "-d", "-p", "8889:9998", "apache/tika:latest", sudo=True)
    self.tika_url = "http://127.0.0.1:8889"
    return True

def extract_text(self, file_path):
    """
    extract_text Extracts plaintext from a document path using Tika.
    :param file_path: The path of the file to extract text from.
    :return: ASCII-encoded plaintext extracted from the document, or None on failure.
    """
    with open(file_path, "rb") as f:
        # PUT the raw file to the /tika endpoint and ask for plain text back
        resp = requests.put(f"{self.tika_url}/tika", data=f, headers={"Accept": "text/plain"})
    if resp.status_code == 200:
        return resp.text.strip().encode("ascii", "ignore").decode()
    return None
from bbot.
Or instead of using Docker, there is a pip package: https://pypi.org/project/tika/
from tika import parser

def extract_text(file_path):
    """
    extract_text Extracts plaintext from a document path using Tika.
    :param file_path: The path of the file to extract text from.
    :return: Plaintext extracted from the document (None if Tika found no text).
    """
    parsed = parser.from_file(file_path)
    # from_file returns a dict; the extracted text is under the "content" key
    return parsed["content"]
It appears this is a wrapper around the (Java) REST server, so this method would probably involve managing a Java install 🤢
To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background.
from bbot.
Whilst I investigate unstructured
https://github.com/domwhewell-sage/bbot/blob/fileparser/bbot/modules/fileparser.py
from bbot.
unstructured
Install is either via a container or a pip package plus a few apt dependencies. The pip install failed for me a few times with "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1",
so I opted to use the container instead. I'm not sure how you would add a file to the container without a REST endpoint available. These are the results I got with a PDF generated by Python reportlab inside the container:
>>> from unstructured.partition.auto import partition
>>> elements = partition(filename="simple_pdf.pdf")
>>> for element in elements:
... print(element)
...
Hello, I am a PDF document created with Python!
>>> for element in elements:
... print(element.metadata.to_dict())
...
{'coordinates': {'points': ((100.0, 82.37380000000007), (100.0, 94.37380000000007), (362.77600000000007, 94.37380000000007), (362.77600000000007, 82.37380000000007)), 'system': 'PixelSpace', 'layout_width': 595.2756, 'layout_height': 841.8898}, 'filename': 'simple_pdf.pdf', 'languages': ['eng'], 'last_modified': '2024-06-04T11:16:09', 'page_number': 1, 'filetype': 'application/pdf'}
Apache Tika
Install is either via a container or a Java (JAR) file. Managing a Java install could be painful, so I opted for the container. The container exposes a REST endpoint where you can upload files, and the response is returned in JSON format. Below is the response I got back from a similar PDF generated using reportlab.
{
"pdf:unmappedUnicodeCharsPerPage": "0",
"pdf:PDFVersion": "1.3",
"pdf:docinfo:title": "untitled",
"xmp:CreatorTool": "ReportLab PDF Library - www.reportlab.com",
"pdf:hasXFA": "false",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"X-TIKA:Parsed-By-Full-Set": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
"dc:creator": "anonymous",
"pdf:num3DAnnotations": "0",
"dcterms:created": "2024-06-03T18:58:16Z",
"dcterms:modified": "2024-06-03T18:58:16Z",
"dc:format": "application/pdf; version=1.3",
"pdf:docinfo:creator_tool": "ReportLab PDF Library - www.reportlab.com",
"pdf:overallPercentageUnmappedUnicodeChars": "0.0",
"access_permission:fill_in_form": "true",
"pdf:docinfo:modified": "2024-06-03T18:58:16Z",
"pdf:hasCollection": "false",
"pdf:encrypted": "false",
"dc:title": "untitled",
"pdf:containsNonEmbeddedFont": "true",
"Content-Length": "1426",
"pdf:docinfo:subject": "unspecified",
"pdf:hasMarkedContent": "false",
"Content-Type": "application/pdf",
"pdf:docinfo:creator": "anonymous",
"pdf:producer": "ReportLab PDF Library - www.reportlab.com",
"dc:subject": "unspecified",
"pdf:totalUnmappedUnicodeChars": "0",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"pdf:hasXMP": "false",
"pdf:charsPerPage": "13",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:trapped": "False",
"X-TIKA:Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
"X-TIKA:content": '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="pdf:PDFVersion" content="1.3" />\n<meta name="pdf:docinfo:title" content="untitled" />\n<meta name="xmp:CreatorTool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:hasXFA" content="false" />\n<meta name="access_permission:modify_annotations" content="true" />\n<meta name="access_permission:can_print_degraded" content="true" />\n<meta name="dc:creator" content="anonymous" />\n<meta name="dcterms:created" content="2024-06-03T18:58:16Z" />\n<meta name="dcterms:modified" content="2024-06-03T18:58:16Z" />\n<meta name="dc:format" content="application/pdf; version=1.3" />\n<meta name="pdf:docinfo:creator_tool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="access_permission:fill_in_form" content="true" />\n<meta name="pdf:docinfo:modified" content="2024-06-03T18:58:16Z" />\n<meta name="pdf:hasCollection" content="false" />\n<meta name="pdf:encrypted" content="false" />\n<meta name="dc:title" content="untitled" />\n<meta name="Content-Length" content="1426" />\n<meta name="pdf:docinfo:subject" content="unspecified" />\n<meta name="pdf:hasMarkedContent" content="false" />\n<meta name="Content-Type" content="application/pdf" />\n<meta name="pdf:docinfo:creator" content="anonymous" />\n<meta name="pdf:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="dc:subject" content="unspecified" />\n<meta name="access_permission:extract_for_accessibility" content="true" />\n<meta name="access_permission:assemble_document" content="true" />\n<meta name="xmpTPg:NPages" content="1" />\n<meta name="pdf:hasXMP" content="false" />\n<meta name="access_permission:extract_content" content="true" />\n<meta name="access_permission:can_print" content="true" />\n<meta name="pdf:docinfo:trapped" content="False" />\n<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />\n<meta name="X-TIKA:Parsed-By" 
content="org.apache.tika.parser.pdf.PDFParser" />\n<meta name="access_permission:can_modify" content="true" />\n<meta name="pdf:docinfo:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:docinfo:created" content="2024-06-03T18:58:16Z" />\n<title>untitled</title>\n</head>\n<body><div class="page"><p />\n<p>Hello, World!</p>\n<p />\n</div>\n</body></html>',
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "ReportLab PDF Library - www.reportlab.com",
"pdf:docinfo:created": "2024-06-03T18:58:16Z",
"pdf:containsDamagedFont": "false",
}
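A response like the one above mixes metadata keys with the extracted body under "X-TIKA:content". As a rough sketch only (the helper names are made up for illustration, and the sample dict is a trimmed-down stand-in for the real response), the two could be separated with the standard library like this:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes found inside <body>, ignoring tags and <head>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the body
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def split_tika_response(response):
    """Return (metadata, text) from a Tika-style response dict."""
    metadata = {k: v for k, v in response.items() if k != "X-TIKA:content"}
    collector = _TextCollector()
    collector.feed(response.get("X-TIKA:content", ""))
    collector.close()
    return metadata, "\n".join(collector.chunks)


# Trimmed-down sample modeled on the response above (not the full output)
sample = {
    "dc:creator": "anonymous",
    "Content-Type": "application/pdf",
    "X-TIKA:content": "<html><head><title>untitled</title></head>"
    '<body><div class="page"><p />\n<p>Hello, World!</p>\n</div></body></html>',
}
metadata, text = split_tika_response(sample)
print(text)
```

Keeping the two halves apart would fit the idea of treating metadata and text extraction as separate concerns.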
Apache Tika seemed to obtain more metadata, such as the links in the PDF's author info. It also seems to support more file types than unstructured. (Although I'm thinking of implementing an extensions list of files that will be parsed, like https://github.com/blacklanternsecurity/bbot/blob/stable/bbot/modules/filedownload.py#L24, just to avoid potentially incorrect strings being extracted from source code files.)
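A minimal sketch of that extensions-list idea, assuming a simple allowlist checked against the file suffix. The set contents and the function name are hypothetical, not taken from filedownload.py:

```python
from pathlib import Path

# Hypothetical allowlist; a real module would likely read this from its config
PARSEABLE_EXTENSIONS = {"pdf", "docx", "xlsx", "pptx", "odt"}


def should_parse(file_path, extensions=PARSEABLE_EXTENSIONS):
    """Return True if the file's extension (case-insensitive) is in the allowlist."""
    suffix = Path(file_path).suffix.lower().lstrip(".")
    return suffix in extensions


for name in ("report.PDF", "main.py", "README"):
    print(name, should_parse(name))
```

Files without an extension are skipped here, which may or may not be the desired behaviour.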
Installing either library via a container seems to be the way forward to avoid installation issues. I'm leaning towards Apache Tika, as the REST endpoint is very easy to use and it seems ready to go out of the box. It's a shame it requires a separate Docker container, but that seems unavoidable. If bbot itself is running in a Docker container, would it be able to spawn this Apache Tika container?
from bbot.
Okay I see the appeal of Apache Tika. But you make a good point about the docker container. I hadn't thought about the scenario of BBOT itself being in a container, which would make spawning another container unfeasible.
I guess dastardly is already affected by this, although I'm less concerned about that one since text extraction should be a core feature of BBOT. It's really important we get this right.
In this case, we need to support all possible architectures and installation methods, so I'm afraid docker is out of the picture. I think you might also agree that adding a java dependency to BBOT is not ideal.
What I'm wondering is if we can find a middle ground, maybe a golang or rust binary, that we can call similar to what we're currently doing with httpx and gowitness.
from bbot.
I have managed to get the unstructured Python package working in a fresh environment; the earlier failures were probably caused by conflicting packages.
From the results, though, it didn't find as much metadata as Tika. From the unstructured documentation:
Unstructured metadata tracks general document information, like filename and file type, and more detailed document-specific information, such as element type.
They both obtained the contents of the file, which is probably most valuable for us. I don't have any objections to using unstructured instead, as long as we are OK with potentially missing out on some document metadata.
from bbot.
Okay, I think when it comes to metadata vs text extraction, it might be best to treat these as two separate tasks.
I'm not opposed to having an Apache Tika module. This would be pretty convenient and provide high-tier metadata and text extraction, at the cost of complexity. If we do that, I think it would make sense to have the docker setup, but allow the user to set their own URL if needed.
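To make the "docker setup, but allow the user to set their own URL" idea concrete, here is a small sketch. The option name `tika_url` and the helper are hypothetical, and the default port is simply the one used in the earlier sketch in this thread:

```python
# Assumed default, matching the port used earlier in this thread
DEFAULT_TIKA_URL = "http://127.0.0.1:8889"


def resolve_tika(config):
    """Return (base_url, manage_docker).

    If the user supplied a URL, use it and leave Docker alone;
    otherwise fall back to the local default and manage the container.
    """
    url = (config.get("tika_url") or "").strip().rstrip("/")
    if url:
        return url, False
    return DEFAULT_TIKA_URL, True


print(resolve_tika({}))
print(resolve_tika({"tika_url": "http://tika.internal:9998/"}))
```

With something like this, a BBOT instance that is itself running in a container could point at an existing Tika server instead of trying to spawn one.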
Eventually of course I would like BBOT to have a high-quality text extraction module, which doesn't require docker or Java. Since this is a CPU-intensive task, it would make sense to offload it into its own script. Whether that be a rust/golang/c++ binary, or a python script written by us (we could easily cover 95% of cases just by handling PDF + MS Office), this is the approach we should be using for most modules going forward.
Since BBOT has so many modules and CPU is so scarce in the main process, to get the max performance, it makes sense to use a simple binary or python script with parseable (i.e. JSON) output.
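As an illustration of that pattern (not how BBOT actually does it), the sketch below runs an extractor in a child process and parses its JSON output. The child script is a stand-in for a real binary or script, and all names here are hypothetical:

```python
import json
import os
import subprocess
import sys
import tempfile

# Stand-in for an external extractor (hypothetical); a real one could be a
# Go/Rust binary or a standalone Python script that prints one JSON object.
CHILD_SCRIPT = r"""
import json, sys
path = sys.argv[1]
with open(path, "rb") as f:
    text = f.read().decode("utf-8", "ignore")
print(json.dumps({"path": path, "text": text}))
"""


def run_extractor(command, file_path, timeout=60):
    """Run an extractor in a child process and parse its JSON stdout."""
    result = subprocess.run(
        [*command, file_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


# Usage: write a small file, then extract it out-of-process
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Hello, World!")
try:
    extracted = run_extractor([sys.executable, "-c", CHILD_SCRIPT], tmp.name)
finally:
    os.unlink(tmp.name)
print(extracted["text"])
```

The parsing work happens in the child, so the main process only pays for launching it and decoding a small JSON payload.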
So yeah, to summarize: the ultimate goal is to have native functionality for metadata and text extraction, probably in separate modules. But since Tika is easier to implement, I'm open to using it in the meantime.
@domwhewell-sage which way are you leaning?
from bbot.
Sounds good, let's not forget to set SCARF_NO_ANALYTICS=true before we publish.
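One possible way to do that from Python, assuming the variable is read from the process environment when the library is imported (so it has to be set first):

```python
import os

# Opt out of analytics before the library is imported
os.environ["SCARF_NO_ANALYTICS"] = "true"

# Import only after the variable is set, e.g.:
# from unstructured.partition.auto import partition

print(os.environ["SCARF_NO_ANALYTICS"])
```

Setting it in the shell or container environment instead would avoid relying on import order.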
from bbot.