simsong / bulk_extractor
This is the development tree. Production downloads are at:
Home Page: https://github.com/simsong/bulk_extractor/releases
License: Other
Thanks for integrating my suggestions (issue #53).
However, the second part of the patch, concerning line 218 of file bulk_extractor-1.4.4/src/be13_api/plugin.cpp, has not yet been included.
The output from that line should go to a log file (if anywhere) rather than to cout. Otherwise BEViewer is no longer able to show any image data, because the underlying call "bulk_extractor -p -http ..." is polluted by this logging information and is therefore no longer clean HTTP.
Scanners should be able to register magic numbers that they can handle. Then other scanners like scan_xor
could look for the magic numbers and only xor when they find them... Useful?
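A minimal sketch of what such a registry might look like (all names here are invented for illustration; nothing like this exists in bulk_extractor today):

```cpp
// Hypothetical sketch: scanners declare the magic numbers they handle,
// so a scanner like scan_xor could probe decoded bytes for a registered
// magic before deciding whether to recurse.
#include <cstring>
#include <string>
#include <vector>

struct magic_entry {
    std::string scanner;   // name of the scanner that registered it
    std::string magic;     // magic bytes, e.g. "\x89PNG"
};

static std::vector<magic_entry> magic_registry;

void register_magic(const std::string &scanner, const std::string &magic) {
    magic_registry.push_back({scanner, magic});
}

// Return the name of a scanner whose magic appears at the start of
// buf[0..len), or "" if no registered magic matches.
std::string match_magic(const unsigned char *buf, size_t len) {
    for (const auto &e : magic_registry) {
        if (e.magic.size() <= len &&
            std::memcmp(buf, e.magic.data(), e.magic.size()) == 0)
            return e.scanner;
    }
    return "";
}
```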
Whitelist stats are reported to stdout but not to report.xml.
Specifically: when bulk_extractor initializes in main.cpp, it reads any alert list(s) and stop list(s) using the function word_and_context_list::readfile in file word_and_context_list.cpp. Unfortunately, bulk_extractor does this before opening report.xml as the pointer variable dfxml_writer *xreport, so it is not yet ready to write to report.xml.
To fix this, either open xreport up near the top of main.cpp, or pass the xreport pointer as a new parameter to word_and_context_list::readfile() so that readfile can write the stats directly into report.xml (avoiding the need to pass xreport to every function).
bulk_extractor scan_pcap should support a stoplist of packet artifacts.
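As a sketch of the whitelist-stats fix (passing xreport into readfile), the signature change could look like this; dfxml_writer is stubbed out here, and the real readfile takes different arguments:

```cpp
// Sketch of the proposed fix (hypothetical names): pass the report
// writer into readfile() so stoplist stats land in report.xml instead
// of stdout. The real dfxml_writer lives in be13_api.
#include <iostream>
#include <sstream>
#include <string>

struct dfxml_writer {                       // stand-in for the real class
    std::ostringstream out;
    void xmlout(const std::string &tag, const std::string &val) {
        out << "<" << tag << ">" << val << "</" << tag << ">\n";
    }
};

size_t readfile(std::istream &in, dfxml_writer *xreport) {
    std::string line;
    size_t n = 0;
    while (std::getline(in, line)) if (!line.empty()) ++n;
    if (xreport)                            // report.xml already open
        xreport->xmlout("stoplist_entries", std::to_string(n));
    else                                    // old behavior: stdout only
        std::cout << "stoplist entries: " << n << "\n";
    return n;
}
```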
Since there are other tools for analyzing PE headers, it may make more sense to store them in individual files than to break them out into XML
Features encoded in UTF-16 encoding show up with escape codes such as "\x00". They need to be displayed as normal characters.
When processing nps-2010-emails, bulk_extractor misses two email addresses that are in PDF files (and were generated by Microsoft Word). Perhaps this is because the PDF text extractor is now missing them. It should be fixed. See http://digitalcorpora.org/archives/173/comment-page-1#comment-124731
The scanner API should allow a scanner to add an annotation to the banner list.
It appears that process_aff::get_sbuf() ignores the pagesize. I don't think that this can all be rewritten to use pread because process_dir needs to be able to return an sbuf for an iterator.
From the mailing list:
I ran bulk_extractor against an image and then re-ran it against the
same image again giving it -w exif.txt from the first run. This
should have resulted in all exif features being stopped, but I get a
non-empty exif file on the second run:
This is the entire exif feature file from the second run.
# UTF-8 Byte Order Marker; see http://unicode.org/faq/utf_bom.html
# BULK_EXTRACTOR-Version: 1.3.1 ($Rev: 10844 $)
# Feature-Recorder: exif
# Filename: win7.vmdk
# Feature-File-Version: 1.1
292220928 288a8ed63c00c1b39343dbe82a090cd0 <exif><ifd0.tiff.Software>Adobe ImageReady</ifd0.tiff.Software></exif>
4899467264 6d5f317239f1b039bc534660ac2abae4 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4900933632 5aea5473d3bd76a86cf4dbe46385545f <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4890324992 4146c4da38363f4e2862d10c1f84f80d <exif><ifd0.tiff.Copyright>Will Austin</ifd0.tiff.Copyright></exif>
4895125504 92fc7a14c551dae96c1960074865aa59 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4896358400 72ee2842f3d7872a92964734322cac2b <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4897701888 9a48c674f92171fc20eb1f8a5b8c2e9b <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4918804480 72ee2842f3d7872a92964734322cac2b <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4920516608 5b57a8c6cd9393c567f89f0f4cc89522 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4922580992 4faf65eb81de15c1a371f53e5a3a38e0 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4924956672 698fcb66721525f86140188781bdb33e <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
(there is a tab after the offset on the first line, I don't know why
the mail client doesn't show it).
All of these features are in the exif.txt feature file from the first
run that I used as a stoplist.
Ubuntu 10.04.4 LTS
As of version 1.4.4, a user-defined plugin can be loaded only by giving a plugin directory via the command-line option '-P'. I would appreciate an environment variable a la PATH (something like BE_PATH), in order to keep the command line short.
Further, BEViewer 1.4.4 can't show the content of a path containing a component belonging to a user-defined recursive plugin, because the -P option is not given in the underlying call to bulk_extractor. An environment variable would solve this issue, too.
The following patch against the source code of bulk_extractor 1.4.4 helps me as a temporary solution. It would be nice if this were fixed in the next release:
diff -r bulk_extractor-1.4.4/src/main.cpp source/bulk_extractor-patched/src/main.cpp
809a810,820
> // >>> Patch
> // add to plugin_path: /usr/local/lib/bulk_extractor:/usr/lib/bulk_extractor:.
> {
> const char* p;
> struct stat s;
> p="/usr/local/lib/bulk_extractor"; if(stat(p, &s)==0) scanner_dirs.push_back(p);
> p="/usr/lib/bulk_extractor"; if(stat(p, &s)==0) scanner_dirs.push_back(p);
> p="."; scanner_dirs.push_back(p);
> }
> // <<< Patch
>
diff -r bulk_extractor-1.4.4/src/be13_api/plugin.cpp source/bulk_extractor-patched/src/be13_api/plugin.cpp
218c218,219
< std::cout << "Loading: " << fn << " (" << func_name << ")\n";
---
> // >>> Patch: The following output would confuse BEViewer.
> // std::cout << "Loading: " << fn << " (" << func_name << ")\n";
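Beyond the hard-coded directories in the patch above, the requested BE_PATH environment variable could be handled with a small helper like this. BE_PATH and add_plugin_dirs_from_env are proposed names from this issue, not existing bulk_extractor code:

```cpp
// Sketch: split a PATH-style list on ':' and append each existing
// directory to scanner_dirs before the -P options are processed.
#include <cstdlib>
#include <string>
#include <sys/stat.h>
#include <vector>

void add_plugin_dirs_from_env(std::vector<std::string> &scanner_dirs,
                              const char *env_value) {
    if (!env_value) return;                 // BE_PATH not set
    std::string paths(env_value);
    size_t start = 0;
    while (start <= paths.size()) {
        size_t colon = paths.find(':', start);
        if (colon == std::string::npos) colon = paths.size();
        std::string dir = paths.substr(start, colon - start);
        struct stat st;
        if (!dir.empty() && stat(dir.c_str(), &st) == 0)
            scanner_dirs.push_back(dir);    // only keep dirs that exist
        start = colon + 1;
    }
}
// caller: add_plugin_dirs_from_env(scanner_dirs, getenv("BE_PATH"));
```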
Need a simple shared library API and demo program that shows bulk_extractor analyzing a block of data and performing callbacks to record features that are found.
A discussion in bulk_extractor-users group of octal vs. hex escape codes resolved that hex is preferred. Functionally, it doesn't matter, but people visually prefer hex.
Results incorrectly include trailing '"' when parsing URLs.
url.txt output:
199452984 http://www.icra.org/ratingsv02.html" (pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for
199453047 http://www.msn.com" true for "http://www.msn.com" r (cz 1 lz 1 n
199453120 http://msn.com" true for "http://msn.com" r (cz 1 lz 1 n
199453189 http://stb.msn.com" true for "http://stb.msn.com" r (cz 1 lz 1 n
199453396 http://www.rsac.org/ratingsv01.html" z 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for
199453645 http://stc.msn.com" true for "http://stc.msn.com" r (n 0 s 0 v 0
199453709 http://stj.msn.com" true for "http://stj.msn.com" r (n 0 s 0 v 0
should be:
199452984 http://www.icra.org/ratingsv02.html (pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for
199453047 http://www.msn.com true for "http://www.msn.com" r (cz 1 lz 1 n
199453120 http://msn.com true for "http://msn.com" r (cz 1 lz 1 n
199453189 http://stb.msn.com true for "http://stb.msn.com" r (cz 1 lz 1 n
199453396 http://www.rsac.org/ratingsv01.html z 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for
199453645 http://stc.msn.com true for "http://stc.msn.com" r (n 0 s 0 v 0
199453709 http://stj.msn.com true for "http://stj.msn.com" r (n 0 s 0 v 0
Version information:
# BULK_EXTRACTOR-Version: 1.5.5 ($Rev: 10844 $)
# Feature-Recorder: url
# Feature-File-Version: 1.1
Please let me know if I can provide you with any better information.
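A minimal sketch of the suggested fix, post-processing each extracted URL before it is recorded. trim_url is a hypothetical helper; the real extraction happens in the url scanner's flex rules:

```cpp
// Strip a trailing '"' (or similar delimiter carried over from the
// source data) from an extracted URL before writing the feature.
#include <string>

std::string trim_url(std::string url) {
    while (!url.empty() && (url.back() == '"' || url.back() == '\''))
        url.pop_back();                     // quote delimited the URL in situ
    return url;
}
```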
When trying to run bulk_extractor (1.5.5) in the plugins directory, it throws an error:
"bulk_extractor: symbol lookup error: ./scan_flexdemo.so: undefined symbol: _ZN7beregexC1Esi"
Line 150 "for fn in fnames:" should read "for fn in fns:" to match lines 147 and 149; otherwise this script fails.
Hi,
In process_ewf::open, fname is being freed immediately before being used.
The patch below seems to fix the problem:
--- ./src/image_process.h.orig 2014-01-15 15:00:06.000000000 +0000
+++ ./src/image_process.h 2014-06-09 14:15:54.000000000 +0000
@@ -128,7 +128,7 @@
virtual int open()=0; /* open; return 0 if successful */
virtual int pread(uint8_t *,size_t bytes,int64_t offset) const =0; /* read */
virtual int64_t image_size() const=0;
- virtual std::string image_fname() const { return image_fname_;}
+ virtual const std::string &image_fname() const { return image_fname_;}
/* iterator support; these virtual functions are called by iterator through (*myimage) */
virtual image_process::iterator begin() const =0;
The idea is to tack on these fields to the forensic path as URL
query string parameters, e.g., ?re=foo&enc=UTF-8. We'd obviously need
to work out the details about escaping, etc., but there are a few
things to like about this. First, URLs are cool and one can easily
imagine some future web service for exposing bulk_extractor output,
and that's not a bad way to integrate disparate enterprise systems.
Second, the scheme is idempotent, so if you ran a slightly different
set of patterns at a later time, the patterns that remained the same
would generate the same forensic paths. Third, the query parameters
act as annotations to the location of the data.
The main cons are that it reads kind of ugly, and will be a bit harder
to deal with in quick-and-dirty scripts.
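The proposal above can be sketched as a small helper that appends annotations to a forensic path. annotate_path is hypothetical, and escaping is deliberately ignored, since the issue notes those details remain to be worked out:

```cpp
// Tack query-string parameters (e.g. the regex and encoding that
// produced a hit) onto a forensic path: "4096-GZIP-1024?enc=UTF-8&re=foo".
#include <map>
#include <string>

std::string annotate_path(const std::string &path,
                          const std::map<std::string, std::string> &params) {
    std::string out = path;
    char sep = '?';                         // first param gets '?', rest '&'
    for (const auto &kv : params) {
        out += sep; out += kv.first; out += '='; out += kv.second;
        sep = '&';
    }
    return out;
}
```

Because std::map iterates in key order, the same parameter set always yields the same string, which preserves the idempotence property described above.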
Either encoder_report or the bulk_extractor_reader should allow setting a maximum number of features per feature file to process, to make debugging faster.
I downloaded both bulk_extractor and tcpflow via git. I built and installed tcpflow and am trying to build bulk_extractor but run into the above error. I could hardwire a fix, but I'd like to get this solved properly.
If I copy tcpflow/src/tcpflow.h to /usr/local/include the compile throws this error:
In file included from be13_api/pcap_fake.cpp:2:
/usr/local/include/tcpflow.h:206: error: conflicting declaration ‘typedef size_t socklen_t’
-David
https://github.com/simsong/bulk_extractor/wiki/BEViewer
There is also no MSI package or JAR file, and the current exe fails to start (some JVM error).
Looking at line 48 of src/scan_email_lg.cpp, it looks like the ABBREV constant has a value of 'UT' instead of 'UTC'.
Was this a typo or a deliberate choice?
One of the recorders writing to gps.txt is not putting in the MD5 as the feature in the feature file. This is evident when processing the NPS 2TB drive.
The following warning should be fixed:
: void yyFlexLexer::LexerError( yyconst char msg[] )
:1662:6: warning: function might be candidate for attribute ‘noreturn’ [-Wsuggest-attribute=noreturn]
I started fixing:
--- ./configure.ac.orig 2013-07-12 01:19:20.000000000 +0000
+++ ./configure.ac 2013-07-13 07:43:24.000000000 +0000
@@ -518,8 +518,8 @@
fi
fi
if test x"$exiv2" == x"yes" ; then
- AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
AC_LANG_PUSH(C++)
+ AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
AC_TRY_COMPILE([#include <exiv2/image.hpp>
#include <exiv2/exif.hpp>
#include <exiv2/error.hpp>],
--- ./src/scan_exiv2.cpp.orig 2013-05-29 01:03:05.000000000 +0000
+++ ./src/scan_exiv2.cpp 2013-07-13 07:45:01.000000000 +0000
@@ -7,6 +7,7 @@
#include "config.h"
#include "bulk_extractor_i.h"
+#include "be13_api/utils.h"
#include <stdlib.h>
#include <string.h>
@@ -101,7 +102,7 @@
void scan_exiv2(const class scanner_params &sp,const recursion_control_block &rcb)
{
assert(sp.sp_version==scanner_params::CURRENT_SP_VERSION);
- if(sp.phase==scanner_params::startup){
+ if(sp.phase==scanner_params::PHASE_STARTUP){
assert(sp.info->si_version==scanner_info::CURRENT_SI_VERSION);
sp.info->name = "exiv2";
sp.info->author = "Simson L. Garfinkel";
@@ -112,8 +113,8 @@
sp.info->flags = scanner_info::SCANNER_DISABLED; // disabled because we have be_exif
return;
}
- if(sp.phase==scanner_params::shutdown) return;
- if(sp.phase==scanner_params::scan){
+ if(sp.phase==scanner_params::PHASE_SHUTDOWN) return;
+ if(sp.phase==scanner_params::PHASE_SCAN){
const sbuf_t &sbuf = sp.sbuf;
feature_recorder *exif_recorder = sp.fs.get_name("exif");
But now I have other issues:
scan_exiv2.cpp: In function 'void scan_exiv2(const scanner_params&, const recursion_control_block&)':
scan_exiv2.cpp:155: error: 'be_hash' was not declared in this scope
scan_exiv2.cpp:186: error: 'xml' is not a class or namespace
I am trying to ./configure with LEX=/usr/local/bin/flex (this is needed because /usr/bin/flex doesn't support -R but /usr/local/bin/flex does).
But it is not possible because of those 3 lines in configure.ac:
if test "$LEX" != flex; then
AC_MSG_ERROR([flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....])
fi
So I get the following error:
configure: error: flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....
When running bulk_extractor I am getting the following error message:
input buffer overflow, can't enlarge buffer because scanner uses REJECT
Segmentation fault
Does anyone know what is causing this error message? I am running this under Linux Mint on an 8-core machine with 12 GB of RAM.
This appears to be a typo in the script; should it be getpwuid? It is listed as getwpuid and was missing when the ./configure command was run for bulk_extractor.
If there aren't enough fields, it should log an error with a line number but keep going.
Currently the iterator only works with report directories and zip files of report directories. It should be modified so that it can handle top-level directories or zip files containing multiple reports, returning an iterator over all of the reports, and for each report an iterator over all of the enclosed feature files.
I noticed that some files use the #!/usr/bin/env python approach while others are stuck with hard-coded paths.
The files I found are:
bulk_diff.py, bulk_extractor_reader.py, identify_filenames.py, post_process_exif.py.
Thanks!
I'm getting "error while loading shared libraries: liblightgrep.so.0: cannot open shared object file: No such file or directory" when trying to run bulk_extractor. I have lightgrep installed, and was hoping to run bulk_extractor with it. This is from a pull made today.
Thanks
The first line reads
#!/usr/bin/env python3.2
I suggest to change it to
#!/usr/bin/env python3
so it works with current Python 3.x, too.
Or is there a reason it will not work with 3.3?
When providing bulk_extractor (v 1.4.2) with a directory to process (either a single directory or recursively), bulk_extractor seems to hang indefinitely when it encounters a FIFO pipe (see http://en.wikipedia.org/wiki/Named_pipe).
I've added a patch to correct this behaviour here: https://gist.github.com/ajengle/7998683
In this patch, we simply check the files as you stat them to determine if they are a FIFO pipe. If they are, we skip.
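The linked gist is the actual patch; the core idea can be sketched as follows (should_skip is an illustrative name):

```cpp
// stat each candidate file and skip anything that is a FIFO, so the
// recursive directory walk never open()s a named pipe and blocks.
#include <string>
#include <sys/stat.h>

bool should_skip(const std::string &path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return true;  // unreadable: skip
    return S_ISFIFO(st.st_mode);                    // named pipe: skip
}
```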
If name="" and type="", then the segment is not valid. Stop there.
A discussion on the bulk_extractor-users group resolved that it is good to compile BEViewer with the latest compiler. The next BEViewer will be compiled on OpenJDK and will require a Java 7 JRE.
Currently BE uses MD5 as a universal hash. There should be a flag allowing other hash algorithms to be used and reported, and the hash in use should be evident in the feature files. Also support SHA-3/128, which would be the first 128 bits of SHA-3?
The README references bootstrap.sh for OS X but that file isn't present in the distribution.
configure/make builds a working version, so this isn't a major issue.
-David
bulk_extractor's wordlist currently checks whether a byte satisfies isprint(ch) && ch != ' ' && ch < 128.
An improvement would be to support encodings such as UTF-8, UTF-16, and UTF-32, possibly as options specified by the user. The words should then be converted to a single encoding (UTF-8?) and then split/deduped, for possible conversion and use by the target application.
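One possible extension, sketched under the assumption that any non-ASCII byte may be part of a UTF-8 sequence; full sequence validation and normalization to a single encoding are omitted here:

```cpp
// Current byte test vs. a UTF-8-tolerant variant for wordlist building.
#include <cctype>

bool wordchar_ascii(unsigned char ch) {     // current behavior
    return isprint(ch) && ch != ' ' && ch < 128;
}

bool wordchar_utf8(unsigned char ch) {      // proposed: also accept
    return wordchar_ascii(ch) || ch >= 0x80; // UTF-8 lead/continuation bytes
}
```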
When running bulk_extractor in recursive directory scan mode (-R), bulk_extractor drops features and files:
This behavior limits completeness of scans using recursive mode.
We need a decoder for macho (Apple) object files.
Fix hexdigest in scan_exiv2.cpp
Currently scan_kml doesn't clean the tags at the end of a KML scan properly.
ImageReaderManager.java is missing in 3df29f7, which is HEAD at present:
[uckelman@scylla java_gui]$ make
make: *** No rule to make target `src/image/ImageReaderManager.java', needed by `BEViewer.jar'. Stop.