simsong / bulk_extractor
This is the development tree. Production downloads are at:
Home Page: https://github.com/simsong/bulk_extractor/releases
License: Other
Thanks for integrating my suggestions (issue #53).
However, the second part of the patch, concerning line 218 of file bulk_extractor-1.4.4/src/be13_api/plugin.cpp, has not yet been included.
The output from that line should go to a log file (if anywhere) rather than to cout. Otherwise BEViewer is no longer able to show any image data, because the underlying call "bulk_extractor -p -http ..." is polluted by this logging information and is therefore no longer clean HTTP.
Scanners should be able to register magic numbers that they can handle. Then other scanners like scan_xor
could look for the magic numbers and only xor when they find them... Useful?
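A minimal sketch of what such a registry might look like (all names here are invented for illustration; nothing like this exists in bulk_extractor today):

```cpp
// Hypothetical sketch: scanners declare the magic numbers they handle,
// so a scanner like scan_xor could probe decoded bytes for a registered
// magic before deciding whether to recurse.
#include <cstring>
#include <string>
#include <vector>

struct magic_entry {
    std::string scanner;   // name of the scanner that registered it
    std::string magic;     // magic bytes, e.g. "\x89PNG"
};

static std::vector<magic_entry> magic_registry;

void register_magic(const std::string &scanner, const std::string &magic) {
    magic_registry.push_back({scanner, magic});
}

// Return the name of a scanner whose magic appears at the start of
// buf[0..len), or "" if no registered magic matches.
std::string match_magic(const unsigned char *buf, size_t len) {
    for (const auto &e : magic_registry) {
        if (e.magic.size() <= len &&
            std::memcmp(buf, e.magic.data(), e.magic.size()) == 0)
            return e.scanner;
    }
    return "";
}
```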
Whitelist stats are reported to stdout but not to report.xml.
Specifically: when bulk_extractor initializes in main.cpp, it reads any alert list(s) and stop list(s) using the function word_and_context_list::readfile in file word_and_context_list.cpp. Unfortunately, bulk_extractor does this before opening report.xml as the pointer variable dfxml_writer *xreport, so it is not yet ready to write to report.xml.
To fix this, either open xreport up near the top of main.cpp, or pass the xreport pointer as a new parameter to word_and_context_list::readfile() so that readfile can write the stats directly into report.xml (avoiding the need to pass xreport to every function).
bulk_extractor scan_pcap should support a stoplist of packet artifacts.
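As a sketch of the whitelist-stats fix (passing xreport into readfile), the signature change could look like this; dfxml_writer is stubbed out here, and the real readfile takes different arguments:

```cpp
// Sketch of the proposed fix (hypothetical names): pass the report
// writer into readfile() so stoplist stats land in report.xml instead
// of stdout. The real dfxml_writer lives in be13_api.
#include <iostream>
#include <sstream>
#include <string>

struct dfxml_writer {                       // stand-in for the real class
    std::ostringstream out;
    void xmlout(const std::string &tag, const std::string &val) {
        out << "<" << tag << ">" << val << "</" << tag << ">\n";
    }
};

size_t readfile(std::istream &in, dfxml_writer *xreport) {
    std::string line;
    size_t n = 0;
    while (std::getline(in, line)) if (!line.empty()) ++n;
    if (xreport)                            // report.xml already open
        xreport->xmlout("stoplist_entries", std::to_string(n));
    else                                    // old behavior: stdout only
        std::cout << "stoplist entries: " << n << "\n";
    return n;
}
```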
Since there are other tools for analyzing PE headers, it may make more sense to store them in individual files than to break them out into XML
Features encoded in UTF-16 encoding show up with escape codes such as "\x00". They need to be displayed as normal characters.
When processing nps-2010-emails, bulk_extractor misses two email addresses that are in PDF files (and were generated by Microsoft Word). Perhaps this is because the PDF text extractor is now missing them. It should be fixed. See http://digitalcorpora.org/archives/173/comment-page-1#comment-124731
The scanner API should allow a scanner to add an annotation to the banner list.
It appears that process_aff::get_sbuf() ignores the pagesize. I don't think that this can all be rewritten to use pread because process_dir needs to be able to return an sbuf for an iterator.
From the mailing list:
I ran bulk_extractor against an image and then re-ran it against the
same image again giving it -w exif.txt from the first run. This
should have resulted in all exif features being stopped, but I get a
non-empty exif file on the second run:
This is the entire exif feature file from the second run.
# UTF-8 Byte Order Marker; see http://unicode.org/faq/utf_bom.html
# BULK_EXTRACTOR-Version: 1.3.1 ($Rev: 10844 $)
# Feature-Recorder: exif
# Filename: win7.vmdk
# Feature-File-Version: 1.1
292220928 288a8ed63c00c1b39343dbe82a090cd0 <exif><ifd0.tiff.Software>Adobe ImageReady</ifd0.tiff.Software></exif>
4899467264 6d5f317239f1b039bc534660ac2abae4 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4900933632 5aea5473d3bd76a86cf4dbe46385545f <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4890324992 4146c4da38363f4e2862d10c1f84f80d <exif><ifd0.tiff.Copyright>Will Austin</ifd0.tiff.Copyright></exif>
4895125504 92fc7a14c551dae96c1960074865aa59 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4896358400 72ee2842f3d7872a92964734322cac2b <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4897701888 9a48c674f92171fc20eb1f8a5b8c2e9b <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4918804480 72ee2842f3d7872a92964734322cac2b <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4920516608 5b57a8c6cd9393c567f89f0f4cc89522 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4922580992 4faf65eb81de15c1a371f53e5a3a38e0 <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4924956672 698fcb66721525f86140188781bdb33e <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
(there is a tab after the offset on the first line, I don't know why
the mail client doesn't show it).
All of these features are in the exif.txt feature file from the first
run that I used as a stoplist.
Ubuntu 10.04.4 LTS
As of version 1.4.4, a user-defined plugin can be loaded only by giving a plugin directory via the command-line option '-P'. I would appreciate an environment variable a la PATH (something like BE_PATH), in order to keep the command line short.
Further, BEViewer 1.4.4 can't show the content of a path containing a component belonging to a user-defined recursive plugin, because the -P option is not given in the underlying call to bulk_extractor. An environment variable would solve this issue, too.
The following patch against the source code of bulk_extractor 1.4.4 helps me as a temporary solution. It would be nice if this were fixed in the next release:
diff -r bulk_extractor-1.4.4/src/main.cpp source/bulk_extractor-patched/src/main.cpp
809a810,820
> // >>> Patch
> // add to plugin_path: /usr/local/lib/bulk_extractor:/usr/lib/bulk_extractor:.
> {
> const char* p;
> struct stat s;
> p="/usr/local/lib/bulk_extractor"; if(stat(p, &s)==0) scanner_dirs.push_back(p);
> p="/usr/lib/bulk_extractor"; if(stat(p, &s)==0) scanner_dirs.push_back(p);
> p="."; scanner_dirs.push_back(p);
> }
> // <<< Patch
>
diff -r bulk_extractor-1.4.4/src/be13_api/plugin.cpp source/bulk_extractor-patched/src/be13_api/plugin.cpp
218c218,219
< std::cout << "Loading: " << fn << " (" << func_name << ")\n";
---
> // >>> Patch: The following output would confuse BEViewer.
> // std::cout << "Loading: " << fn << " (" << func_name << ")\n";
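Beyond the hard-coded directories in the patch above, the requested BE_PATH environment variable could be handled with a small helper like this. BE_PATH and add_plugin_dirs_from_env are proposed names from this issue, not existing bulk_extractor code:

```cpp
// Sketch: split a PATH-style list on ':' and append each existing
// directory to scanner_dirs before the -P options are processed.
#include <cstdlib>
#include <string>
#include <sys/stat.h>
#include <vector>

void add_plugin_dirs_from_env(std::vector<std::string> &scanner_dirs,
                              const char *env_value) {
    if (!env_value) return;                 // BE_PATH not set
    std::string paths(env_value);
    size_t start = 0;
    while (start <= paths.size()) {
        size_t colon = paths.find(':', start);
        if (colon == std::string::npos) colon = paths.size();
        std::string dir = paths.substr(start, colon - start);
        struct stat st;
        if (!dir.empty() && stat(dir.c_str(), &st) == 0)
            scanner_dirs.push_back(dir);    // only keep dirs that exist
        start = colon + 1;
    }
}
// caller: add_plugin_dirs_from_env(scanner_dirs, getenv("BE_PATH"));
```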
Need a simple shared library API and demo program that shows bulk_extractor analyzing a block of data and performing callbacks to record features that are found.
A discussion in bulk_extractor-users group of octal vs. hex escape codes resolved that hex is preferred. Functionally, it doesn't matter, but people visually prefer hex.
Results incorrectly include trailing '"' when parsing URLs.
url.txt output:
199452984 http://www.icra.org/ratingsv02.html" (pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for
199453047 http://www.msn.com" true for "http://www.msn.com" r (cz 1 lz 1 n
199453120 http://msn.com" true for "http://msn.com" r (cz 1 lz 1 n
199453189 http://stb.msn.com" true for "http://stb.msn.com" r (cz 1 lz 1 n
199453396 http://www.rsac.org/ratingsv01.html" z 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for
199453645 http://stc.msn.com" true for "http://stc.msn.com" r (n 0 s 0 v 0
199453709 http://stj.msn.com" true for "http://stj.msn.com" r (n 0 s 0 v 0
should be:
199452984 http://www.icra.org/ratingsv02.html (pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for
199453047 http://www.msn.com true for "http://www.msn.com" r (cz 1 lz 1 n
199453120 http://msn.com true for "http://msn.com" r (cz 1 lz 1 n
199453189 http://stb.msn.com true for "http://stb.msn.com" r (cz 1 lz 1 n
199453396 http://www.rsac.org/ratingsv01.html z 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for
199453645 http://stc.msn.com true for "http://stc.msn.com" r (n 0 s 0 v 0
199453709 http://stj.msn.com true for "http://stj.msn.com" r (n 0 s 0 v 0
Version information:
# BULK_EXTRACTOR-Version: 1.5.5 ($Rev: 10844 $)
# Feature-Recorder: url
# Feature-File-Version: 1.1
Please let me know if I can provide you with any better information.
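A minimal sketch of the suggested fix, post-processing each extracted URL before it is recorded. trim_url is a hypothetical helper; the real extraction happens in the url scanner's flex rules:

```cpp
// Strip a trailing '"' (or similar delimiter carried over from the
// source data) from an extracted URL before writing the feature.
#include <string>

std::string trim_url(std::string url) {
    while (!url.empty() && (url.back() == '"' || url.back() == '\''))
        url.pop_back();                     // quote delimited the URL in situ
    return url;
}
```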
When trying to run bulk_extractor (1.5.5) in the plugins directory, it throws an error:
"bulk_extractor: symbol lookup error: ./scan_flexdemo.so: undefined symbol: _ZN7beregexC1Esi"
Line 150 "for fn in fnames:" should read "for fn in fns:" to match lines 147 and 149; otherwise this script fails.
Hi,
In process_ewf::open, fname is being freed immediately before being used.
The patch below seems to fix the problem:
--- ./src/image_process.h.orig 2014-01-15 15:00:06.000000000 +0000
+++ ./src/image_process.h 2014-06-09 14:15:54.000000000 +0000
@@ -128,7 +128,7 @@
virtual int open()=0; /* open; return 0 if successful */
virtual int pread(uint8_t *,size_t bytes,int64_t offset) const =0; /* read */
virtual int64_t image_size() const=0;
- virtual std::string image_fname() const { return image_fname_;}
+ virtual const std::string &image_fname() const { return image_fname_;}
/* iterator support; these virtual functions are called by iterator through (*myimage) */
virtual image_process::iterator begin() const =0;
The idea is to tack on these fields to the forensic path as URL
query string parameters, e.g., ?re=foo&enc=UTF-8. We'd obviously need
to work out the details about escaping, etc., but there are a few
things to like about this. First, URLs are cool and one can easily
imagine some future web service for exposing bulk_extractor output,
and that's not a bad way to integrate disparate enterprise systems.
Second, the scheme is idempotent, so if you ran a slightly different
set of patterns at a later time, the patterns that remained the same
would generate the same forensic paths. Third, the query parameters
act as annotations to the location of the data.
The main cons are that it reads kind of ugly, and will be a bit harder
to deal with in quick-and-dirty scripts.
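The proposal above can be sketched as a small helper that appends annotations to a forensic path. annotate_path is hypothetical, and escaping is deliberately ignored, since the issue notes those details remain to be worked out:

```cpp
// Tack query-string parameters (e.g. the regex and encoding that
// produced a hit) onto a forensic path: "4096-GZIP-1024?enc=UTF-8&re=foo".
#include <map>
#include <string>

std::string annotate_path(const std::string &path,
                          const std::map<std::string, std::string> &params) {
    std::string out = path;
    char sep = '?';                         // first param gets '?', rest '&'
    for (const auto &kv : params) {
        out += sep; out += kv.first; out += '='; out += kv.second;
        sep = '&';
    }
    return out;
}
```

Because std::map iterates in key order, the same parameter set always yields the same string, which preserves the idempotence property described above.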
Either encoder_report or the bulk_extractor_reader should allow setting a maximum number of features per feature file to process, to make debugging faster.
I downloaded both bulk_extractor and tcpflow via git. I built and installed tcpflow and am trying to build bulk_extractor but run into the above error. I could hardwire a fix, but I'd like to get this solved properly.
If I copy tcpflow/src/tcpflow.h to /usr/local/include the compile throws this error:
In file included from be13_api/pcap_fake.cpp:2:
/usr/local/include/tcpflow.h:206: error: conflicting declaration ‘typedef size_t socklen_t’
-David
https://github.com/simsong/bulk_extractor/wiki/BEViewer
There is also no MSI package or JAR file, and the current exe fails to start (some JVM error).
Looking at line 48 of src/scan_email_lg.cpp, it looks like the ABBREV constant has a value of 'UT' instead of 'UTC'.
Was this a typo or a deliberate choice?
One of the recorders writing to gps.txt is not putting in the MD5 as the feature in the feature file. This is evident when processing the NPS 2TB drive.
The following warning should be fixed:
: void yyFlexLexer::LexerError( yyconst char msg[] )
:1662:6: warning: function might be candidate for attribute ‘noreturn’ [-Wsuggest-attribute=noreturn]
I started fixing:
--- ./configure.ac.orig 2013-07-12 01:19:20.000000000 +0000
+++ ./configure.ac 2013-07-13 07:43:24.000000000 +0000
@@ -518,8 +518,8 @@
fi
fi
if test x"$exiv2" == x"yes" ; then
- AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
AC_LANG_PUSH(C++)
+ AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
AC_TRY_COMPILE([#include <exiv2/image.hpp>
#include <exiv2/exif.hpp>
#include <exiv2/error.hpp>],
--- ./src/scan_exiv2.cpp.orig 2013-05-29 01:03:05.000000000 +0000
+++ ./src/scan_exiv2.cpp 2013-07-13 07:45:01.000000000 +0000
@@ -7,6 +7,7 @@
#include "config.h"
#include "bulk_extractor_i.h"
+#include "be13_api/utils.h"
#include <stdlib.h>
#include <string.h>
@@ -101,7 +102,7 @@
void scan_exiv2(const class scanner_params &sp,const recursion_control_block &rcb)
{
assert(sp.sp_version==scanner_params::CURRENT_SP_VERSION);
- if(sp.phase==scanner_params::startup){
+ if(sp.phase==scanner_params::PHASE_STARTUP){
assert(sp.info->si_version==scanner_info::CURRENT_SI_VERSION);
sp.info->name = "exiv2";
sp.info->author = "Simson L. Garfinkel";
@@ -112,8 +113,8 @@
sp.info->flags = scanner_info::SCANNER_DISABLED; // disabled because we have be_exif
return;
}
- if(sp.phase==scanner_params::shutdown) return;
- if(sp.phase==scanner_params::scan){
+ if(sp.phase==scanner_params::PHASE_SHUTDOWN) return;
+ if(sp.phase==scanner_params::PHASE_SCAN){
const sbuf_t &sbuf = sp.sbuf;
feature_recorder *exif_recorder = sp.fs.get_name("exif");
But now I have other issues:
scan_exiv2.cpp: In function 'void scan_exiv2(const scanner_params&, const recursion_control_block&)':
scan_exiv2.cpp:155: error: 'be_hash' was not declared in this scope
scan_exiv2.cpp:186: error: 'xml' is not a class or namespace
I am trying to ./configure with LEX=/usr/local/bin/flex (this is needed because /usr/bin/flex doesn't support -R but /usr/local/bin/flex does).
But it is not possible because of those 3 lines in configure.ac:
if test "$LEX" != flex; then
AC_MSG_ERROR([flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....])
fi
So I get the following error:
configure: error: flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....
When running bulk_extractor I am getting the following error message:
input buffer overflow, can't enlarge buffer because scanner uses REJECT
Segmentation fault
Does anyone know what is causing this error message? I am running this under Linux Mint on an 8-core machine with 12 GB of RAM.
This appears to be a typo in the script; should it be getpwuid? It is listed as getwpuid and was missing when the ./configure command was run for bulk_extractor.
If there aren't enough fields, it should log an error with a line number but keep going.
Currently the iterator only works with report directories and zip files of report directories. It should be modified so that it can handle top-level directories or zip files containing multiple reports, returning an iterator over all of the reports, and for each report an iterator over all of the enclosed feature files.
I noticed that some files use the #!/usr/bin/env python approach while others are stuck with hard-coded paths.
The files I found are:
bulk_diff.py, bulk_extractor_reader.py, identify_filenames.py, post_process_exif.py.
Thanks!
I'm getting "error while loading shared libraries: liblightgrep.so.0: cannot open shared object file: No such file or directory" when trying to run bulk_extractor. I have lightgrep installed, and was hoping to run bulk_extractor with it. This is from a pull made today.
Thanks
The first line reads
#!/usr/bin/env python3.2
I suggest to change it to
#!/usr/bin/env python3
so it works with current Python 3.x, too.
Or is there a reason it will not work with 3.3?
When providing bulk_extractor (v 1.4.2) with a directory to process (either a single directory or recursively), bulk_extractor seems to hang indefinitely when it encounters a FIFO pipe (see http://en.wikipedia.org/wiki/Named_pipe).
I've added a patch to correct this behaviour here: https://gist.github.com/ajengle/7998683
In this patch, we simply check the files as you stat them to determine if they are a FIFO pipe. If they are, we skip.
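The linked gist is the actual patch; the core idea can be sketched as follows (should_skip is an illustrative name):

```cpp
// stat each candidate file and skip anything that is a FIFO, so the
// recursive directory walk never open()s a named pipe and blocks.
#include <string>
#include <sys/stat.h>

bool should_skip(const std::string &path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return true;  // unreadable: skip
    return S_ISFIFO(st.st_mode);                    // named pipe: skip
}
```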
If name="" and type="", then the segment is not valid. Stop there.
A discussion on the bulk_extractor-users group resolved that it is good to compile BEViewer with the latest compiler. The next BEViewer will be compiled on OpenJDK and will require a Java 7 JRE.
Currently BE uses MD5 as a universal hash. There should be a flag allowing other hash algorithms to be used and reported, and the hash in use should be evident in the feature files. Also support SHA-3/128, which would be the first 128 bits of SHA-3?
The README references bootstrap.sh for OS X but that file isn't present in the distribution.
configure/make builds a working version, so this isn't a major issue.
-David
bulk_extractor's wordlist currently checks whether a byte satisfies isprint(ch) && ch != ' ' && ch < 128.
An improvement would be to support encodings such as UTF-8, UTF-16, and UTF-32, possibly as options specified by the user. The words should then be converted to a single encoding (UTF-8?) and then split/deduped, for possible conversion and use by the target application.
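One possible extension, sketched under the assumption that any non-ASCII byte may be part of a UTF-8 sequence; full sequence validation and normalization to a single encoding are omitted here:

```cpp
// Current byte test vs. a UTF-8-tolerant variant for wordlist building.
#include <cctype>

bool wordchar_ascii(unsigned char ch) {     // current behavior
    return isprint(ch) && ch != ' ' && ch < 128;
}

bool wordchar_utf8(unsigned char ch) {      // proposed: also accept
    return wordchar_ascii(ch) || ch >= 0x80; // UTF-8 lead/continuation bytes
}
```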
When running bulk_extractor in recursive directory scan mode (-R), bulk_extractor drops features and files:
This behavior limits completeness of scans using recursive mode.
We need a decoder for macho (Apple) object files.
Fix hexdigest in scan_exiv2.cpp
Currently scan_kml doesn't clean the tags at the end of a KML scan properly.
ImageReaderManager.java is missing in 3df29f7, which is HEAD at present:
[uckelman@scylla java_gui]$ make
make: *** No rule to make target `src/image/ImageReaderManager.java', needed by `BEViewer.jar'. Stop.