perdisci / amico
AMICO - Accurate Behavior-Based Detection of Malware Downloads
License: GNU General Public License v2.0
In get_feature_vector.py, we currently have several DB queries that need to be optimized. Generating the features for a single download event may take up to several seconds, depending on the size of the DB and the number of related historic download events.
We temporarily mitigated the problem by limiting the amount of historic data used for feature measurement via MAX_PAST_DUMPS = 30000. However, we noticed during deployment that increasing this parameter tends to improve classification accuracy (e.g., by reducing false positives). Optimizing the DB queries for feature computation will allow us to do just that.
Note that the MAX_PAST_DUMPS parameter will still be useful even after DB query optimization, but it should become a tunable parameter read from config.py.
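A minimal sketch of what the tunable parameter could look like. The lowercase config name, the fallback logic, and the query/table names below are illustrative assumptions, not AMICO's actual code:

```python
# Hypothetical sketch: read MAX_PAST_DUMPS from config.py instead of
# hardcoding it in get_feature_vector.py. The config variable name,
# query, and table name are illustrative assumptions only.
try:
    from config import max_past_dumps  # assumed tunable in config.py
except ImportError:
    max_past_dumps = 30000  # fall back to the current hardcoded default

def fetch_past_dumps(cursor, host, limit=None):
    """Fetch at most `limit` historic download events for `host`."""
    limit = max_past_dumps if limit is None else limit
    cursor.execute(
        "SELECT dump_id, timestamp FROM pe_dumps "
        "WHERE host = %s ORDER BY timestamp DESC LIMIT %s",
        (host, limit))
    return cursor.fetchall()
```

Keeping the cap in config.py lets deployments with faster DBs raise it for better accuracy without touching the feature-extraction code.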
REALLOC_SC_PAYLOAD should be set equal to a static M_MMAP_THRESHOLD (see
http://man7.org/linux/man-pages/man3/mallopt.3.html).
This should allow memory blocks allocated during file reconstruction to be
reclaimed by the OS.
Original issue reported on code.google.com by [email protected]
on 2 Sep 2014 at 10:42
After running one pe_dump process for several hours at a relatively small
network (several hundred hosts):
==13678== HEAP SUMMARY:
==13678== in use at exit: 69,796 bytes in 11 blocks
==13678== total heap usage: 2,645,796 allocs, 2,645,785 frees, 2,417,435,812 bytes allocated
==13678==
==13678== LEAK SUMMARY:
==13678== definitely lost: 0 bytes in 0 blocks
==13678== indirectly lost: 0 bytes in 0 blocks
==13678== possibly lost: 0 bytes in 0 blocks
==13678== still reachable: 69,796 bytes in 11 blocks
==13678== suppressed: 0 bytes in 0 blocks
Original issue reported on code.google.com by [email protected]
on 13 Sep 2014 at 5:07
Greetings!
I am consistently getting a score of 0.0 and was wondering if this could be due to model size and/or database content. Are there any tips or guidance you can supply on maintaining the database? One thing worth mentioning is that we are collecting and classifying various file types but not submitting all of them to VT.
As I understand it, the hashes would still be checked but the files not uploaded with the following settings, i.e.:
capture_file_types = ["JAR", "APK", "MSDOC", "PDF", "ZIP", "EXE", "SWF", "DMG"]
while
vt_submissions_ext = ['exe','apk','dmg','jar','swf']
We are not submitting the document file types to VT because some may contain sensitive information that should not be in the public domain. Would this cause issues with the model? Is there a possibility that a large model would be too slow or cause issues with evaluation?
Additionally, the database serves multiple instances at different sites (on different networks); however, all sensors run the same config.
Thanks in advance for any guidance you can provide.
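For what it's worth, the hash-check vs. upload split described in the message above can be sketched as a simple check against vt_submissions_ext. The should_upload helper is hypothetical; only the two config lists come from the message:

```python
# Hypothetical sketch of the split described above: hashes are always
# queried against VT, but the file body is uploaded only for types
# whitelisted in vt_submissions_ext. Only the two lists below come
# from the poster's config; the helper itself is illustrative.
capture_file_types = ["JAR", "APK", "MSDOC", "PDF", "ZIP", "EXE", "SWF", "DMG"]
vt_submissions_ext = ['exe', 'apk', 'dmg', 'jar', 'swf']

def should_upload(file_type):
    """Upload the file body to VT only for whitelisted extensions;
    the hash lookup happens regardless of this result."""
    return file_type.lower() in vt_submissions_ext

print(should_upload("PDF"))  # False: captured and hash-checked, not uploaded
print(should_upload("EXE"))  # True
```

With this gating, document types (PDF, MSDOC, ZIP) still contribute hash-based VT results to the database; only unknown-hash documents would lack VT labels.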
Greetings! I love AMICO; thank you and your team.
Regarding the recent change from parsed/pe_files to parsed/captured_files: doing a recursive grep for pe_files returns:
amico_scripts/vt_api.py: file_to_send = open("parsed/pe_files/%s.exe" % (md5,), "rb").read()
Other lines matched of course but were not related to the directory. I have changed the path to parsed/captured_files and have not found any ill effects.
I think having the extension specified is a good idea in order to avoid submitting non-public information to VT.
Currently, vt_api.py only allows submitting .exe files, because the file extension is hardcoded (see code below). We should allow the extension to be specified in a config file, to make this more flexible (e.g., one may also want to automatically submit other file types, such as APKs and DMGs).
if vt_submissions == "manual":
file_name = "%s/%s.EXE" % (MAN_DOWNLOAD_DIR, md5)
print "File submission to VT (manual):", file_name
file_to_send = open(file_name, "rb").read()
elif vt_submissions == "live":
file_name = "parsed/captured_files/%s.exe" % (md5,)
print "File submission to VT:", file_name
file_to_send = open(file_name, "rb").read()
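The proposed change could look something like the sketch below. The helper name and the idea of deriving the extension from the recorded file type are assumptions; only the parsed/captured_files path comes from the code above:

```python
# Hypothetical sketch: build the capture path from the file's recorded
# type instead of hardcoding ".exe". Names here are illustrative, not
# AMICO's actual vt_api.py code.
import os

def captured_file_path(md5, file_type, base_dir="parsed/captured_files"):
    """Return the path to a captured file, using its recorded type
    (lowercased) as the extension rather than assuming .exe."""
    return os.path.join(base_dir, "%s.%s" % (md5, file_type.lower()))
```

For example, an APK capture would then be read from parsed/captured_files/&lt;md5&gt;.apk instead of &lt;md5&gt;.exe, and the set of submittable types could come from the same config list used elsewhere.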
Every time a PE file is found, dump_pe() calls pthread_create() to handle the
actual dump to file. It turns out that pthread_create() allocates data
structures for thread handling that are not freed, because we do not call
pthread_join() (we don't need to).
To avoid this problem, we should add a call to pthread_detach(), so that thread
handling resources can be freed as soon as the dump thread finishes.
This problem was found via analysis with Valgrind.
While the problem is minor, in that only a few hundred bytes are leaked every
time a new PE file is dumped, it should still be corrected.
Original issue reported on code.google.com by [email protected]
on 31 Aug 2014 at 10:03
With the experimental branch I was getting broken symlinks created in parsed/pe_files which I corrected by doing the following in start_amico.py
os.symlink("%s.%s" % (sha1,file_type), md5_path)
Also with the experimental branch I was getting filenames like %hash%..pdf or %hash%..exe
I think I resolved this by removing the . from the file_extension declaration in amico_scripts/extract_file.py
138c138
> file_extension = ".exe"
142c142
< file_extension = "jar"
> file_extension = ".jar"
146c146
< file_extension = "apk"
> file_extension = ".apk"
150c150
< file_extension = "elf"
> file_extension = ".elf"
154c154
< file_extension = "dmg"
> file_extension = ".dmg"
158c158
< file_extensions = "msdoc" # generic for DOC(X), PPT(X), XLS(X), etc.
> file_extensions = ".msdoc" # generic for DOC(X), PPT(X), XLS(X), etc.
162c162
< file_extension = "rar"
> file_extension = ".rar"
166c166
< file_extension = "swf"
> file_extension = ".swf"
170c170
< file_extension = "pdf"
> file_extension = ".pdf"
177c177
< file_extenstion = "zip"
> file_extenstion = ".zip"
still testing this though.
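An alternative to patching each branch individually would be a single mapping that owns the leading dot in one place. This is a hypothetical refactoring sketch, not code from extract_file.py; the type strings simply mirror the branches in the diff above:

```python
# Hypothetical refactoring of the per-type branches shown in the diff:
# keep the leading dot in one mapping so "%hash%..pdf"-style double
# dots cannot reappear in any single branch.
FILE_EXTENSIONS = {
    "EXE": ".exe", "JAR": ".jar", "APK": ".apk", "ELF": ".elf",
    "DMG": ".dmg", "MSDOC": ".msdoc", "RAR": ".rar", "SWF": ".swf",
    "PDF": ".pdf", "ZIP": ".zip",
}

def file_name_for(sha1, file_type):
    """Return e.g. '<sha1>.pdf'; unknown types get no extension."""
    return sha1 + FILE_EXTENSIONS.get(file_type, "")
```

A mapping like this would also have caught the misspelled file_extenstion in the zip branch, since there is only one assignment site to review.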
All pe_* or *_pe table and attribute names in the download history DB should be renamed to account for the fact that we can now capture and classify more than just PE files. However, this change would have several effects on most scripts under amico_scripts, and therefore needs to be applied with caution.
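To illustrate the mechanics (and the caution needed), here is a sketch of such a rename against an in-memory SQLite database; the table and column names below are assumptions for demonstration, and the real deployment's database engine and schema may differ, so every query in amico_scripts would have to be updated in lockstep:

```python
# Illustrative rename sketch using an in-memory SQLite DB. Table and
# column names are assumptions; any script still querying the old
# pe_* names would break the moment this runs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pe_dumps (dump_id INTEGER, corrupt_pe INTEGER)")
conn.execute("ALTER TABLE pe_dumps RENAME TO captured_dumps")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['captured_dumps']
```

In practice the rename and the script updates would need to ship together, ideally behind a schema-version check.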
Greetings!
I have deployed AMICO in a number of production networks and would like to share my observations and ideas about enhancing its functionality. I have put some thought into how to accomplish some things, but honestly I am only a novice developer at this point.
Thank you for all the hard work and thought you and your students / peers have put into developing this great idea into an awesome realization.
We have solved several issues related to the new missing data detection
algorithm implemented in is_missing_flow_data(). The main challenges we tackled
are:
1) Added a notion of what "type" of problem caused a PE file to be (likely)
corrupted
2) Revised the LRU cache and Sequence Numbers List implementations to correct
possible memory leakage problems
3) Corrected several problems that would cause the missing data detection
algorithm to segfault in very rare cases of "strange" Sequence Numbers List
content
4) Added a "kill switch" to the sequence number gap detection algorithm to
avoid possible infinite loops in case there is a still undiscovered bug in the
algorithm
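The "kill switch" in point 4 is a general pattern: bound the loop so that an undiscovered bug degrades to a missed detection rather than a hang. A purely illustrative sketch in Python (pe_dump itself is written in C, and its sequence-number handling is more involved than this):

```python
# Purely illustrative sketch of the "kill switch" idea in point 4:
# cap the number of loop iterations so that a still-undiscovered bug
# in gap detection cannot cause an infinite loop. The cap value and
# the simplified +1 gap model are assumptions.
MAX_GAP_ITERATIONS = 100000  # assumed safety cap

def find_first_gap(seq_numbers):
    """Return the index of the first gap in a sorted sequence-number
    list, or None; bail out once MAX_GAP_ITERATIONS is reached."""
    for i, (a, b) in enumerate(zip(seq_numbers, seq_numbers[1:])):
        if i >= MAX_GAP_ITERATIONS:
            return None  # kill switch: give up rather than loop forever
        if b != a + 1:
            return i + 1
    return None
```

The trade-off is deliberate: hitting the cap means one file may be mislabeled as complete, which is far cheaper than stalling the whole capture process.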
Original issue reported on code.google.com by [email protected]
on 12 Sep 2014 at 1:12
The USER_AGENT used for "manual" downloads (see manual_download.py) should change depending on the expected download file_type. For example, for APKs we might want to use an Android-related user agent. Similarly, for DMGs we might want to use a macOS-related user agent string.
This probably will not make much difference in most cases, but there may be corner cases in which the "manual" download fails because the server expects a specific user agent string.
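A minimal sketch of per-type user agents. The mapping, the fallback, and the UA strings are assumptions for illustration, not values from manual_download.py:

```python
# Hypothetical sketch: choose the USER_AGENT for "manual" downloads
# based on the expected file_type. The UA strings below are
# illustrative examples, not manual_download.py's actual values.
DEFAULT_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
USER_AGENTS = {
    "apk": "Dalvik/2.1.0 (Linux; U; Android 10)",            # Android-like
    "dmg": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",  # macOS-like
}

def user_agent_for(file_type):
    """Fall back to a generic desktop UA for unmapped types."""
    return USER_AGENTS.get(file_type.lower(), DEFAULT_UA)
```

Keeping the mapping in a config file would let operators add entries (e.g., for ELF downloads) without code changes.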
The variable tflow->corrupt_pe is currently set to TRUE in a variety of cases
in which the TCP flow reconstruction for a given HTTP request-response pair
seems to be prematurely or incorrectly terminated.
However, we should let is_missing_flow_data() decide if the file has really
been corrupted. If the checking algorithm is correct, this should give us
confidence that the file was correctly reconstructed, even if some apparent
problem may have occurred during TCP flow reconstruction.
Original issue reported on code.google.com by [email protected]
on 2 Sep 2014 at 3:06