amico's People

Contributors

perdisci, phani-vadrevu

amico's Issues

Improve performance for statistical feature measurement

In get_feature_vector.py, we have several DB queries that need to be optimized. Currently, generating the features for a single download event may take up to several seconds, depending on how large the DB is and on the number of related historic download events.

We temporarily solved the problem by limiting the amount of historic data to be used for feature measurement, using MAX_PAST_DUMPS = 30000. However, we noticed during deployment that increasing the value of this parameter tends to improve classification accuracy (e.g., reducing false positives). Optimizing the DB queries for feature computation will allow us to do just that.

Note that the MAX_PAST_DUMPS parameter will still be useful even after the DB queries are optimized, but it should become a tunable parameter read from config.py.
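As a rough illustration of that second point, MAX_PAST_DUMPS could be read from config.py and used to bound the historic data pulled for feature computation. This is a minimal sketch only, assuming a PostgreSQL back end accessed through psycopg2; the table and column names are placeholders, not necessarily amico's actual schema:

    # Sketch only: bound the amount of historic data used for feature measurement.
    # MAX_PAST_DUMPS is assumed to live in config.py; table/column names are
    # hypothetical placeholders.
    from config import MAX_PAST_DUMPS

    def fetch_recent_dumps(cursor, host):
        cursor.execute(
            "SELECT dump_id, timestamp FROM pe_dumps "
            "WHERE host = %s ORDER BY timestamp DESC LIMIT %s",
            (host, MAX_PAST_DUMPS))
        return cursor.fetchall()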

Results of memory leakage checks using Valgrind

After running one pe_dump process for several hours on a relatively small network (several hundred hosts):

==13678== HEAP SUMMARY:
==13678==     in use at exit: 69,796 bytes in 11 blocks
==13678==   total heap usage: 2,645,796 allocs, 2,645,785 frees, 2,417,435,812 bytes allocated
==13678==
==13678== LEAK SUMMARY:
==13678==    definitely lost: 0 bytes in 0 blocks
==13678==    indirectly lost: 0 bytes in 0 blocks
==13678==      possibly lost: 0 bytes in 0 blocks
==13678==    still reachable: 69,796 bytes in 11 blocks
==13678==         suppressed: 0 bytes in 0 blocks



Original issue reported on code.google.com by [email protected] on 13 Sep 2014 at 5:07

DB maintenance

Greetings!

I am consistently getting a score of 0.0 and was wondering if this could be due to model size and/or database content. Are there any tips or guidance you can supply on maintaining the database? One thing worth mentioning is that we are collecting and classifying various file types but not submitting all of them to VT.

As I understand it, the hashes would still be checked but the files not uploaded with the following settings:

i.e.:
capture_file_types = ["JAR", "APK", "MSDOC", "PDF", "ZIP", "EXE", "SWF", "DMG"]
while
vt_submissions_ext = ['exe','apk','dmg','jar','swf']

We are not submitting the document file types to VT because some may contain sensitive information that should not be in the public domain. Would this cause issues with the model? Is there a possibility that a large model would be too slow or cause issues with evaluation?

Additionally, the database serves multiple instances at different sites (on different networks); however, all sensors run the same config.

Thanks in advance for any guidance you can provide.

vt_api.py path correction

Greetings! I love amico and thank you and your team.

Regarding the recent change from parsed/pe_files to parsed/captured_files: a recursive grep for pe_files returns:

amico_scripts/vt_api.py: file_to_send = open("parsed/pe_files/%s.exe" % (md5,), "rb").read()

Other lines matched, of course, but were not related to the directory. I have changed the path to parsed/captured_files and have not found any ill effects.
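For reference, the corrected line would then presumably mirror the live-submission path used elsewhere in vt_api.py:

    file_to_send = open("parsed/captured_files/%s.exe" % (md5,), "rb").read()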

I think having the extension specified is a good idea in order to avoid submitting non-public information to VT.

Improve vt_api.py to allow for VT submissions other than for .exe files

Currently, vt_api.py only allows submitting .exe files, because the file extension is hardcoded (see code below). We should allow the extension to be specified in a config file to make this more flexible (e.g., one may also want to automatically submit other file types, such as APKs, DMGs, etc.).

    if vt_submissions == "manual":
        file_name = "%s/%s.EXE" % (MAN_DOWNLOAD_DIR, md5)
        print "File submission to VT (manual):", file_name
        file_to_send = open(file_name, "rb").read()
    elif vt_submissions == "live":
        file_name = "parsed/captured_files/%s.exe" % (md5,)
        print "File submission to VT:", file_name
        file_to_send = open(file_name, "rb").read()
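One possible shape for this, as a minimal sketch rather than a finished patch: derive the extension from the captured file's type and consult a whitelist in config.py (the vt_submissions_ext list mentioned in another issue). The function and variable names below are illustrative assumptions.

    # Hypothetical sketch: choose the submission path from the file type instead
    # of hardcoding ".exe". Assumes file_type is known for the download and that
    # vt_submissions_ext is defined in config.py, e.g. ['exe','apk','dmg','jar','swf'].
    from config import vt_submissions_ext

    def submission_path(md5, file_type):
        ext = file_type.lower()
        if ext not in vt_submissions_ext:
            return None  # skip file types the operator chose not to submit
        return "parsed/captured_files/%s.%s" % (md5, ext)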

Tiny memory leak related to the creation of dump_pe_thread

Every time a PE file is found, dump_pe() calls pthread_create() to handle the 
actual dump to file. It turns out that pthread_create() allocates data 
structures for thread handling that are not freed, because we do not call 
pthread_join() (we don't need to).

To avoid this problem, we should add a call to pthread_detach(), so that thread 
handling resources can be freed as soon as the dump thread finishes.

This problem was found via analysis with Valgrind.

While the problem is minor, in that only a few hundred bytes are leaked every 
time a new PE file is dumped, it should still be corrected.

Original issue reported on code.google.com by [email protected] on 31 Aug 2014 at 10:03

file_type vs file_extension and doubledot

With the experimental branch I was getting broken symlinks created in parsed/pe_files, which I corrected by making the following change in start_amico.py:

< os.symlink("%s.%s" % (sha1,file_extension), md5_path)
---
> os.symlink("%s.%s" % (sha1,file_type), md5_path)

Also with the experimental branch I was getting filenames like %hash%..pdf or %hash%..exe. I think I resolved this by removing the "." from the file_extension declarations in amico_scripts/extract_file.py:

138c138
< file_extension = "exe"
---
> file_extension = ".exe"
142c142
< file_extension = "jar"
---
> file_extension = ".jar"
146c146
< file_extension = "apk"
---
> file_extension = ".apk"
150c150
< file_extension = "elf"
---
> file_extension = ".elf"
154c154
< file_extension = "dmg"
---
> file_extension = ".dmg"
158c158
< file_extensions = "msdoc" # generic for DOC(X), PPT(X), XLS(X), etc.
---
> file_extensions = ".msdoc" # generic for DOC(X), PPT(X), XLS(X), etc.
162c162
< file_extension = "rar"
---
> file_extension = ".rar"
166c166
< file_extension = "swf"
---
> file_extension = ".swf"
170c170
< file_extension = "pdf"
---
> file_extension = ".pdf"
177c177
< file_extenstion = "zip"
---
> file_extenstion = ".zip"

Still testing this, though.
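As a small illustrative helper (not code from the repo), the symlink name could also be built by normalizing the extension first, so that both "exe" and ".exe" declarations produce a single dot:

    import os

    def link_capture(sha1, md5_path, file_extension):
        ext = file_extension.lstrip(".")  # "exe" and ".exe" both become "exe"
        os.symlink("%s.%s" % (sha1, ext), md5_path)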

Rename DB tables and attributes for experimental branch

All pe_* or *_pe table and attribute names in the download history DB should be renamed to account for the fact that we can now capture and classify more than just PE files. However, this change would affect most scripts under amico_scripts, and therefore needs to be applied with caution.
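One cautious way to stage such a rename, sketched here with placeholder names (amico's actual tables may differ), is to rename the table and keep a compatibility view so existing scripts under amico_scripts keep working during the transition:

    # Hypothetical migration sketch; "pe_dumps" / "captured_files" and the
    # connection string are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=amico_db")
    cur = conn.cursor()
    cur.execute("ALTER TABLE pe_dumps RENAME TO captured_files")
    cur.execute("CREATE VIEW pe_dumps AS SELECT * FROM captured_files")
    conn.commit()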

PDF analysis and other ideas

Greetings!

I have deployed amico in a number of production networks and would like to share my observations and ideas about enhancing its functionality. I have put some thought into how to accomplish some of these things, but honestly I am only a novice developer at this point.

  1. PDF feature extraction could be performed using peepdf (https://github.com/jesparza/peepdf). Peepdf is written in Python, so it should be fairly easy to integrate, possibly via a module import.
  2. It would be great to get a syslog message in some post-detection cases, e.g., when a VirusTotal verdict changes or when a malware verdict is given for a file the model classified as benign. For example, we feed the syslog messages into a SIEM and give benign verdicts lower severity than malware verdicts. When I retrain the classifier, the log output tells me which dump.id's have mismatched verdicts; it would be nice to get this information sooner than at training time (a rough sketch of such a syslog hook is shown after this list).
  3. I know this is likely a tall order, but support for SMTP file extraction and appropriate feature extraction (e.g., the Received: header) would be great. In enterprise environments this is usually a major infection vector. Some ideas for scoring/features can be seen at https://raw.githubusercontent.com/CyberTaoFlow/email-parser/master/www/images/flowchart.png. The project that flowchart comes from did not take the Received header into account, which is very important for enterprise deployments; these usually use upstream mail gateways, making the source IP less useful for classification.
  4. Running a distributed AMICO setup (multiple collection points) causes the system to frequently generate errors as it tries to upload samples to VT that are not on the local system but are in the database. Perhaps a sensor.id column in the database would help with this (similar to the Snort schema).
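For idea 2, a rough sketch of what such a syslog hook could look like (the function and field names here are hypothetical, not part of amico):

    import syslog

    def log_verdict_mismatch(dump_id, model_verdict, vt_verdict):
        # Give malware verdicts higher severity so a SIEM can prioritize them.
        severity = syslog.LOG_WARNING if vt_verdict == "malware" else syslog.LOG_INFO
        syslog.openlog("amico", syslog.LOG_PID, syslog.LOG_LOCAL0)
        syslog.syslog(severity, "verdict mismatch: dump_id=%s model=%s vt=%s"
                      % (dump_id, model_verdict, vt_verdict))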

Thank you for all the hard work and thought you and your students / peers have put into developing this great idea into an awesome realization.

Possible issues with new missing data detection algorithm

We have solved several issues related to the new missing data detection algorithm implemented in is_missing_flow_data(). The main challenges we tackled are:

1) Added a notion of what "type" of problem caused a PE file to be (likely) corrupted.
2) Revised the LRU cache and Sequence Numbers List implementations to correct possible memory leakage problems.
3) Corrected several problems that would cause the missing data detection algorithm to segfault in very rare cases of "strange" Sequence Numbers List content.
4) Added a "kill switch" to the sequence number gap detection algorithm to avoid possible infinite loops in case there is a still-undiscovered bug in the algorithm.


Original issue reported on code.google.com by [email protected] on 12 Sep 2014 at 1:12

USER_AGENT in manual_download.py should change

The USER_AGENT used for "manual" downloads (see manual_download.py) should change depending on the expected download file_type. For example, for APKs we might want to use an Android-related user agent. Similarly, for DMGs we might want to use a macOS-related user agent string.

This probably will not make much difference in most cases, but there may be some corner cases in which the "manual" download fails because the server expects a specific user agent string.
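A minimal sketch of how this could look in manual_download.py (the mapping and the user-agent strings below are illustrative assumptions, not values from the repo):

    # Hypothetical mapping from expected file type to User-Agent string.
    USER_AGENTS = {
        "APK": "Dalvik/2.1.0 (Linux; U; Android 10; Pixel 3)",
        "DMG": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    }
    DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

    def user_agent_for(file_type):
        return USER_AGENTS.get(file_type, DEFAULT_USER_AGENT)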

Current "corrupt_pe" checks are probably too conservative

The variable tflow->corrupt_pe is currently set to TRUE in a variety of cases 
in which the TCP flow reconstruction for a given HTTP request-response pair 
seems to be prematurely or incorrectly terminated. 

However, we should let is_missing_flow_data() decide if the file has really 
been corrupted. If the checking algorithm is correct, this should give us 
confidence that the file was correctly reconstructed, even if some apparent 
problem may have occurred during TCP flow reconstruction.




Original issue reported on code.google.com by [email protected] on 2 Sep 2014 at 3:06
