Comments (13)
Thanks for reporting this, @cricard. I'll take a closer look.
from cti-python-stix2.
Certain that the bottleneck is in Bundle()?
from cti-python-stix2.
Pretty sure.
If I build the bundle manually (without calling Bundle()), it completes in a few seconds.
The function below completes in a couple of seconds for a data set of around 3000 STIX2 objects:
import uuid

def build_bundle(objects):
    # Assemble the bundle JSON by hand, stringifying each object individually.
    bundle_header = '{\n\t"type": "bundle",\n\t"id": "bundle--%s",\n\t"spec_version": "2.0",\n\t"objects": [\n' % uuid.uuid4()
    bundle_footer = '\n\t]\n}'
    bundle = bundle_header
    for obj in objects:
        bundle += str(obj) + ',\n'
    # Drop the trailing separator left after the last object.
    bundle = bundle.rstrip(',\n')
    bundle += bundle_footer
    return bundle
But using the Bundle() function, it takes about 10-15 minutes with the same dataset.
from stix2 import Bundle

def build_bundle(objects):
    # Let the library build the bundle object.
    bundle = Bundle(objects=objects)
    return bundle
I only started experiencing the issue when I upgraded stix2 to version 0.3. It worked fine in 0.2. Two other people experienced the same results (one on Windows, one on Ubuntu), using stix2 v0.3. All three of us were running Python 2.7.x.
If I decrease the dataset to 100 STIX objects, the Bundle() function works fine. Somewhere around 1000 STIX objects is where I experienced significant delays.
I provided gtback a copy of the script and dataset I was using out of band, so he can validate.
from cti-python-stix2.
I just tried creating a bundle with > 5000 objects in it. That completed in about 0.01 sec. Then I tried generating the JSON string via code like str(my_bundle). I'm still waiting for that to complete...
Those two functions aren't actually doing the same things. The first build_bundle() is creating a string. The second is creating an object. It isn't exactly an apples-to-apples comparison. (Perhaps you subsequently stringified the return value of the latter function?)
As for my stringification experiment, I tried commenting out the sorting here, and it sped up significantly. I gave up waiting on my experiment above; after removing the sort, it completed in about 2 sec.
from cti-python-stix2.
In relation to @chisholm's comment on sorting: Are we canonicalizing the bundle content? Are we canonicalizing STIX package contents in general?
from cti-python-stix2.
@chisholm oops - I think you're right. The issue does appear to be in the stringification of the STIX objects, rather than Bundle(). Thanks for catching that!
from cti-python-stix2.
Yes, the bottleneck is the way we are sorting property names when stringifying objects (to match the order in the spec, rather than just alphabetic order). This was introduced in 0.3. There are probably lots of ways to fix this; I've been meaning to profile the code and see if there are any quick wins, but I'm open to anything. @chisholm, @mbastian1135, or anyone else, let me know if you want to take a look at this.
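For anyone who wants to try the profiling mentioned above, here is a rough sketch of one way to do it with cProfile. It does not use stix2 at all: spec_order_dumps is a made-up stand-in for an ordered serializer, and the objects are plain dicts, so treat the numbers as illustrative only. It assumes Python 3.7+ (dicts keep insertion order).

```python
import cProfile
import io
import json
import pstats


def spec_order_dumps(obj, spec_order):
    # Hypothetical stand-in for an ordered serializer (NOT the stix2 code):
    # keys named in spec_order come first, all other keys follow alphabetically.
    def sort_key(item):
        key = item[0]
        if key in spec_order:
            # list.index is a linear scan, repeated for every key.
            return (0, spec_order.index(key), key)
        return (1, 0, key)

    return json.dumps(dict(sorted(obj.items(), key=sort_key)))


def profile_stringify(objects, spec_order):
    # Profile only the stringification loop and return the report as text.
    profiler = cProfile.Profile()
    profiler.enable()
    for obj in objects:
        spec_order_dumps(obj, spec_order)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return stream.getvalue()


# Illustrative data: plain dicts standing in for STIX objects.
objects = [
    {"x_custom": i, "id": "indicator--%d" % i, "type": "indicator"}
    for i in range(1000)
]
report = profile_stringify(objects, ["type", "id", "created", "modified"])
print(report)
```

Sorting the report by cumulative time puts the outermost slow call at the top, which is usually enough to tell whether the sort key or the JSON encoding dominates.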
from cti-python-stix2.
We can introduce a new method to serialize objects that skips the property sorting and any other formatting, if all you want is machine-to-machine operation. That may help a lot for performance and in cases where the "pretty" SDO/SROs are not needed.
from cti-python-stix2.
I can look at this if what I proposed seems like a good solution.
from cti-python-stix2.
Maybe _STIXBase could have an option controlling whether its __str__ method uses the "pretty-print" version or the unsorted version. Not sure which should be the default.
from cti-python-stix2.
@emmanvg and I talked about moving the actual serialization of _STIXBase objects to a separate serialize function that has a pretty option, and having __str__ call that with pretty=True (we could make False the default for serialize, though).
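A minimal sketch of what that split might look like, under stated assumptions: StixLikeObject is a hypothetical stand-in for _STIXBase, and json.dumps(sort_keys=True) stands in for whatever ordering the library would actually apply in the pretty path. This is not the library's code.

```python
import json


class StixLikeObject:
    # Hypothetical stand-in for _STIXBase; names and behavior are assumptions.

    def __init__(self, **properties):
        self._inner = dict(properties)

    def serialize(self, pretty=False):
        if pretty:
            # Human-readable path: ordered keys plus indentation.
            # sort_keys is only a placeholder for the real property ordering.
            return json.dumps(self._inner, sort_keys=True, indent=4)
        # Machine-to-machine path: no sorting, compact separators.
        return json.dumps(self._inner, separators=(",", ":"))

    def __str__(self):
        # str() keeps producing the pretty form.
        return self.serialize(pretty=True)


obj = StixLikeObject(type="indicator", id="indicator--1", name="example")
print(obj.serialize())
print(obj)
```

With False as the default, code that ships large bundles between systems calls serialize() and skips the ordering cost, while str() and print() behave as before.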
from cti-python-stix2.
Assertion: It seems like we're going to have to solve canonical representations if we're ever going to get to encrypting/signing STIX content (?). Research indicates that canonicalizing JSON increases stringification time by a factor of 4 to 5. If you concur with the assertion, any thoughts on how we can tackle this looming issue? This library/effort is "where the rubber meets the road" in terms of practical reference implementations based on the CTI TC standards, so I'm raising it here.
from cti-python-stix2.
We already had a "canonical" (or at least, standardized) representation in python-stix2 v0.2: keys in alphabetic order. This was necessary to ensure tests would repeatably pass. In v0.3, we changed the order from alphabetic to (roughly) the order in the spec, with custom properties at the end (but still repeatable). The implementation of this ordering is what's causing the performance impact, but we should be able to speed up this implementation significantly, to roughly what we had before.
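To make the two orderings concrete, here is a small sketch. SPEC_ORDER is an illustrative subset I made up for the example, not the real per-object property lists, and the helper names are hypothetical. The rank lookup is precomputed in a dict so each key costs one hash lookup rather than a list scan, which is one plausible way to keep spec ordering cheap.

```python
# Illustrative subset only; the real order depends on the object type.
SPEC_ORDER = ["type", "id", "created", "modified", "name"]
_RANK = {name: position for position, name in enumerate(SPEC_ORDER)}


def alphabetical_keys(properties):
    # v0.2-style ordering: plain alphabetic sort of the property names.
    return sorted(properties)


def spec_keys(properties):
    # v0.3-style ordering (sketch): known properties in spec order,
    # then custom properties alphabetically at the end.
    return sorted(properties, key=lambda k: (k not in _RANK, _RANK.get(k, 0), k))


props = {"name": "example", "x_foo": 1, "id": "indicator--1",
         "type": "indicator", "created": "2017-01-01T00:00:00Z"}
print(alphabetical_keys(props))
print(spec_keys(props))
```

Both functions are deterministic for a given set of property names, so either one gives repeatable output for tests.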
I've always asserted that the actual JSON (de-)serialization speed is negligible in comparison to whatever process creates the content being STIX-ified, or acting on parsed STIX content. The exception should be if you're just shoveling STIX content around (as a sharing hub, for instance), in which case you should be parsing only enough to examine, but then sending the original, unmodified content.
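As a rough illustration of the sharing-hub case, the sketch below parses incoming content only to decide what to do with it, then passes along the original text untouched. The relay and should_forward names and the send callback are hypothetical, and plain json is used for the inspection step rather than any stix2 API.

```python
import json


def should_forward(raw_bundle, wanted_types):
    # Parse only to inspect; the parsed form is thrown away afterwards.
    parsed = json.loads(raw_bundle)
    return any(
        obj.get("type") in wanted_types for obj in parsed.get("objects", [])
    )


def relay(raw_bundle, wanted_types, send):
    # Forward the original, unmodified text rather than re-serializing it.
    if should_forward(raw_bundle, wanted_types):
        send(raw_bundle)
        return True
    return False


raw = '{"type": "bundle",  "objects": [{"type": "indicator"}]}'
outbox = []
relay(raw, {"indicator"}, outbox.append)
print(outbox)
```

Because the raw string is what gets sent, whitespace and key order from the producer survive, and no serialization cost is paid on the way out.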
Signing STIX content has been discussed, but is not in the current scope for STIX 2.1.
from cti-python-stix2.