dmwm / das
Data Aggregation System
DAS currently uses .ini style configuration files (das.cfg). For CMSWEB deployment we would strongly prefer python configuration files as they provide much greater ability to make the configuration location and user independent. This should also make development easier since the same configuration can be used unchanged.
For an example of the location independence python enables, please see the DQM GUI 'devtest' configuration, which works out of the box for any user on any computer system - P5, CERN GPN (lxplus, lxbuild), desktops, laptops, and outside CERN.
https://twiki.cern.ch/twiki/bin/view/CMS/DQMTest#Specific_details
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/DQM/Integration/config/
See specifically the use of BASEDIR and CONFIGDIR to achieve relocation. You can also see other, more complex host-specific adaptation for online in:
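A minimal sketch of the BASEDIR/CONFIGDIR relocation pattern, assuming (hypothetically) that all deployment paths can be derived from the config file's own location so the same file works unchanged for any user on any host; the directory names below are illustrative, not the actual DQM GUI layout:

```python
import os

def das_dirs(config_file):
    """Derive all deployment paths from the config file's location,
    making the configuration location- and user-independent."""
    configdir = os.path.dirname(os.path.abspath(config_file))
    basedir = os.path.normpath(os.path.join(configdir, '..'))
    return {
        'basedir':  basedir,
        'logdir':   os.path.join(basedir, 'logs'),
        'statedir': os.path.join(basedir, 'state'),
    }
```

In a real config one would pass `__file__`, so checking the file out under /data/projects/das or under a user's home directory produces consistent, relocated paths with no edits.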
Add a new method to abstract_service to respect the DASJSON header. The new tier0 service is already DAS compliant: it ships data with a DASJSON header which contains the results as well as an expire timestamp. I need to parse this info correctly.
Return HTTP 503 error when MongoDB is down. DAS server should stay alive.
Currently, queries are a raw python dictionary. I propose that this should be replaced by a wrapper class, with the following rationale:
Create a framework of unit test per data-service to test only data-service specific queries.
Weight queryspammer distributions in some quasi-real manner so that if it is used to hammer the cache it should trigger analytics appropriately.
To avoid creation of parsertab.py in the DAS install area I need to make its location a configurable parameter. This will ease the issue on cmsweb and allow it to live in the /data/projects/das area instead of the DAS source code area.
All APIs use a 3600 sec expiration timestamp (fine for testing) which needs to be adjusted to real-case scenarios. I think DBS/phedex should have 10-15 minutes, SiteDB around 1 hour, etc.
Follow up from #290.
Regarding open connections, they are connected sockets, i.e. sockets between DAS and MongoDB. Just ssh to cmsweb@… and run netstat -tanlp | grep ESTABLISHED | grep 27017 to see them. We currently have: {{{
$ netstat -tanlp | grep ESTABLISHED | grep 27017 | awk '{print $NF}' | sort | uniq -c
212 4500/mongod
138 4860/python
74 4875/python
}}}
Why there are so many I can't answer. Maybe every DAS thread creates some number of connections? Note that half of the sockets are on the python side and the other half on the mongod side, as shown above.
Instead of using the YUI hosted by Yahoo, use a local YUI installation.
Review note on DAS: general comment, would find the "if not isinstance(x, dict)" style more readable than the "if type(x) is not types.DictType" style.
Better test the existing analytics tasks and add some new ones.
Fix the analytics web so that there is a
The YML file which defines the DAS schema should reside in SITECONF/T1_CH_CERN/DAS. This will simplify maintenance of DAS on the cmsweb cluster.
DAS must support aggregation of information. Since the DAS cache server utilizes a REST model this can be done as a series of steps:
Contact Dirk/Stephen and request permanent tier0 data-service URL for DAS.
Currently I only report stats on the init, sub-system call, and merge steps. I want to divide the sub-system stat into URL fetch time and actual DAS sub-system processing time. This can be accomplished by making a singleton DASTimer class instance and using it everywhere to collect various stats.
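A minimal sketch of such a singleton timer, assuming a simple accumulate-by-step-name API (the class name DASTimer comes from the ticket; the methods and fields are hypothetical):

```python
import time

class DASTimer(object):
    """Singleton timer collecting named stats (e.g. 'url_fetch',
    'subsystem_processing') from anywhere in the DAS code base."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(DASTimer, cls).__new__(cls)
            cls._instance.stats = {}  # step name -> accumulated seconds
        return cls._instance

    def record(self, name, start, end=None):
        """Accumulate elapsed wall-clock time under the given step name."""
        end = time.time() if end is None else end
        self.stats[name] = self.stats.get(name, 0.0) + (end - start)
```

Because every call site constructs `DASTimer()` and gets the same instance, the URL fetch code and the sub-system processing code can record into one shared stats dict without passing the timer around.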
It is possible that DAS will receive a doc whose size exceeds the MongoDB limit (4MB by default). In that case the bulk insert will fail for all docs in the insert sequence (due to generators). To avoid that I need a new generator routine whose purpose is to scan each doc and pass it through if its size is < 4MB, or put it into GridFS otherwise.
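A sketch of such a filtering generator, using JSON length as a rough stand-in for BSON size and taking the GridFS writer as a callable (both the function name and the callback interface are illustrative assumptions):

```python
import json

MAX_DOC_SIZE = 4 * 1024 * 1024  # MongoDB default document size limit

def size_filter(docs, gridfs_put, limit=MAX_DOC_SIZE):
    """Yield docs small enough for a bulk insert; divert oversized
    docs to GridFS via the supplied gridfs_put callable, so one big
    doc cannot fail the whole generator-driven bulk insert."""
    for doc in docs:
        if len(json.dumps(doc)) < limit:
            yield doc
        else:
            gridfs_put(doc)  # store the oversized doc out of band
```

The bulk insert then consumes `size_filter(docs, fs.put)` instead of `docs`, so the insert sequence only ever sees documents under the limit.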
Review note on DAS: overview plotfairy version, session arguments are unnecessary and can be omitted.
Comment 23 follow-up: I guess it wasn't clear enough, but "session arguments" meant "session" and "version". All you need is the actual data arguments. Also would prefer they were deleted, not just commented out.
Experts' DNs need to be added to MongoDB in order to allow access to the Expert page. Need to create a doc for the CERN operator to perform this action.
Explore mongo replication. I can have two nodes: one used as a raw cache for user on-demand queries, while the other can be used by the populator to replicate data from data-services.
Explore mongo sharding, where we define a sharding key, e.g. block.
Write some proceedings for DAS @ CHEP 2010
Migrate DAS web server to WMCore.WebTools based.
Thank you for adding checkargs to verify parameters. It has a few flaws I'd like to see fixed:
You don't use what you verify. Some arguments are cast to strings (str(x)) before checking. You should instead verify what you will actually use.
You should type check all arguments for reasons above. A keyword argument can be None (not given), a string (given once), or a list (if given several times).
Contents of many, but not all arguments are checked. I didn't see any additional checking added for remaining arguments elsewhere so it looks like several vulnerabilities remain. You should always sanitise all arguments. Even if the argument is free form input, you can often make sure it only consists of certain legitimate characters (e.g. letters only).
Failure to verify arguments should raise an exception.
Failure to check an argument should not return the argument value back to caller. This is unsafe; you don't know what the value contains, and you just determined it's not valid. Returning the value to caller can be used to create XSS and other attacks. My general preference is to never return anything to the caller - you simply return suitable HTTP status code.
It does not sanitise the HTTP method; note that the 'method' keyword argument is not the same as the request method!
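The review points above can be sketched as a whitelist-based validator: every argument is type-checked, matched against a per-argument pattern, and failures raise an exception rather than echoing the tainted value back. The argument names and patterns below are hypothetical, not the real DAS checkargs:

```python
import re

# Hypothetical whitelist: each supported argument gets a validator regex.
ARG_PATTERNS = {
    'query': re.compile(r'^[a-zA-Z0-9_.=/* -]+$'),
    'idx':   re.compile(r'^[0-9]+$'),
    'limit': re.compile(r'^[0-9]+$'),
}

def checkargs(kwargs):
    """Validate every keyword argument; raise on anything unexpected.

    Nothing is ever returned to the caller on failure, only an
    exception the web layer should map to an HTTP status code."""
    for key, val in kwargs.items():
        if key not in ARG_PATTERNS:
            raise ValueError('unsupported argument: %s' % key)
        if not isinstance(val, str):
            # None (not given) or a list (given several times) are rejected
            raise TypeError('argument %s must be given exactly once' % key)
        if not ARG_PATTERNS[key].match(val):
            raise ValueError('invalid content in argument: %s' % key)
```

The key design point is that the validator inspects the value as it will actually be used (no str() casting first), and the raised exception carries only the argument name, never its content.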
Upon Lassi's/L2's suggestion, will remove the doc part of the DAS web server.
Code clean-up.
Review note on DAS bin directory: start-up scripts should be folded into manage. We very much prefer to see everything inlined directly into the manage script without several layers of indirection, for simplicity, comprehension and transparency.
Read existing monitoring.ini files in SITECONF/T1_CH_CERN/DAS and improve them as necessary
Oli wants to have custom views in DAS to get his data:
''Essentially the sum of data for each T1 site for each combination of
acq era, tier, custodial/non-custodial.
''
I think it can be accomplished as a 2-step procedure in DAS.
DAS can be configured using either configparser or wmcore.configuration. The current config code has a few problems:
Provide a single layer performing validation/casting/defaults, which doesn't care whether it reads from an underlying configparser or wmcore config.
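Such a single layer could look like the sketch below: one option schema carrying defaults and casts, applied to a flat dict that either backend (configparser or WMCore config) has been reduced to. The option names and the `(section, name)` key shape are illustrative assumptions:

```python
# Hypothetical option schema: (section, name, default, cast).
DAS_OPTIONS = [
    ('web_server', 'port', 8212, int),
    ('mongodb', 'dburi', 'mongodb://localhost:27017', str),
    ('das', 'verbose', 0, int),
]

def validate_config(raw):
    """Build a validated config from raw options.

    raw maps (section, name) -> value, read beforehand from either a
    configparser or a WMCore configuration object; this layer does not
    care which. Missing options get defaults, present ones are cast."""
    cfg = {}
    for section, name, default, cast in DAS_OPTIONS:
        value = raw.get((section, name), default)
        cfg.setdefault(section, {})[name] = cast(value)
    return cfg
```

With this shape, each backend only needs a tiny adapter that flattens its native object into the `raw` dict; all validation, casting and defaulting lives in one place.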
Restore analyticsDB to using unique qhash records, with an array of hit times. Provide a workaround for the inability to $pull with conditions in mongodb<=1.6. Determine the interplay of capped collections and updates to existing objects instead of new objects.
Related, consider making sure all related documents for a given query are removed from analytics concurrently.
Change this block to use an external dir parameter:
{{{
+if [ `hostname -d` == "cern.ch" ]
}}}
When using certain aggregators, e.g. max, min, I should be able to show the record itself rather than just the min/max value of the asked field. For instance, if a user types
find block | max(block.size)
I should show not only the max block.size, but also a link to the record holding this value.
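A minimal sketch of resolving an aggregator back to its full record, assuming records are nested dicts addressed by a dotted DAS key (the function name is hypothetical):

```python
def max_record(records, key_path):
    """Return the whole record holding the maximum value of key_path
    (a dotted DAS key such as 'block.size'), not just the value."""
    def extract(rec):
        # walk the dotted path into the nested record
        for part in key_path.split('.'):
            rec = rec[part]
        return rec
    return max(records, key=extract)
```

The web layer can then render a link to `max_record(...)` itself while still displaying `extract(max_record(...))` as the aggregated value.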
RE-based PLY parsing is easier than writing our own ad-hoc parser but is quite expensive. Add a (capped?) mongodb collection to store the parsed versions of string queries, and intercept new queries appropriately.
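The interception step could be sketched as below: hash the raw query string, look it up in a store, and only fall back to the expensive PLY parse on a miss. The dict store here is a stand-in for the proposed capped MongoDB collection (which one would create with `db.create_collection(name, capped=True, size=...)`); the class and its interface are illustrative:

```python
import hashlib

class QueryCache(object):
    """Parse cache keyed by the md5 of the raw query string.

    Backed here by a plain dict for illustration; the intended backend
    is a capped MongoDB collection so old entries age out on their own."""
    def __init__(self, parser, store=None):
        self.parser = parser            # callable: raw string -> parsed query
        self.store = store if store is not None else {}

    def parse(self, query):
        qhash = hashlib.md5(query.encode('utf-8')).hexdigest()
        if qhash not in self.store:     # cache miss: run the PLY parser once
            self.store[qhash] = self.parser(query)
        return self.store[qhash]
```

Repeated identical query strings then cost one dict (or indexed collection) lookup instead of a full PLY parse.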
Done.
The %post section has been reviewed and cleaned up.
From #290
I didn't understand the addition of urllib quoting in, for example, das_table.tmpl. Shouldn't you use encodeURIComponent in javascript code / arguments, and urllib when quoting something originating from DAS server itself? To me it seems you are now sometimes quoting javascript itself, not the javascript variable value.
Also I note here that the quoting wasn't added universally everywhere - not in all templates, and not even systematically in the one example I happened to quote, das_table.tmpl. As I wrote before, it looks like every template needs to be sanitised. I can't easily tell which values are safe.
utils/das_config.py calls
{{{
from DAS.utils.das_cms_config import read_wmcore
}}}
while utils/das_cms_config.py calls
{{{
from DAS.utils.das_config import DAS_OPTIONS
}}}
This is a circular import; the remedy is to merge the two modules together.
Test DAS with DBS2/Phedex/RunRegistry/Tier0 to allow PVT tester to have a look at the service.
Eventually we will need to add DAS analytics to the DAS manage init script. I need to know how to start/stop the DAS analytics web server, how to check its status, etc. A basic skeleton of an init script would be useful.
We need a help section for the DAS web analytics server. It should describe the meaning of the sections, e.g. Main, Control. It should provide examples (a description plus a png image) of how to submit certain tasks, examples (png images) of what we should see when tasks are running, etc.
This will help train DAS operators.
Add the ability to learn and add new maps, or reload existing ones, from the output of a data-provider. For example, by learning about the keys in the output of some query I can record in DAS what this data-service is capable of providing. For instance, a user types
run=123
DAS queries RunSummary and gets output which contains L1Trigger. So DAS can learn from the output that RunSummary provides information about L1Trigger for the query run=123. If this info is captured, I can improve DAS input fields. For example, I can store associative keys, together with their data-service, in a separate collection. Those keys can be used as "helpers" for DAS input queries, so a user can type
l1 trigger
and DAS can reply: I know a data-service which provides this, and in order to get the l1 trigger you must supply your run number.
We can apply some word processing to allow different linguistic combinations.
This way DAS will gain knowledge of what each data-service can provide, which can improve search and enable suggestions.
genkey() does not necessarily produce identical output for functionally identical input, which given our reliance on qhash for finding records is a problem.
From python reference:
"CPython implementation detail: Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions."
Example:
genkey({'fields': None, 'spec': [{'key': u'dataset.name', 'value': u'"/TTbar_1jet_Et30-alpgen/Winter09_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO"'}]})
'b255596fb3728afe13c5c078ad6f9105'
genkey({'fields': None, 'spec': [{'value': u'"/TTbar_1jet_Et30-alpgen/Winter09_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO"', 'key': u'dataset.name'}]})
'2c7d1cfc1244e5367eefe70dfeeeb321'
Here we have only trivially transposed the order of the "key" and "value" arguments, but the result is a different hash value. This problem shows up in analytics where running QueryMaintainer from the command line works but spawning it through the server doesn't, as far as I can tell just because the dictionary construction order differs. This is not because of unicode-ness of strings (tested, using json.dumps deals with this).
I will try and modify the genkey function to produce consistent output, but this is probably performance sensitive.
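A minimal fix along the lines suggested, serialising with sorted keys before hashing so insertion history cannot change the digest (the function name matches the ticket; the exact implementation is illustrative, and as noted it should be benchmarked since genkey is performance sensitive):

```python
import hashlib
import json

def genkey(query):
    """Order-independent hash of a query dict.

    json.dumps with sort_keys=True produces one canonical string for
    functionally identical dicts, regardless of the order in which
    their keys were inserted."""
    rep = json.dumps(query, sort_keys=True)
    return hashlib.md5(rep.encode('utf-8')).hexdigest()
```

With this, the two transposed examples above would hash identically, and qhash lookups from the command line and from the server would agree.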
Right now, to get the total number of results I invoke count. Since I added empty records to protect access to services which do not return results, I should exclude them from the count of results for a given query. Should be trivial, e.g.
db.merge.find(spec).count()
where spec contains the query and the non-existence of 'das.empty_record'.
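The spec extension could be a small pure helper like this (the field name 'das.empty_record' comes from the ticket; the helper itself is hypothetical):

```python
def count_spec(spec):
    """Extend a query spec so placeholder empty records are excluded
    from the result count, e.g. db.merge.find(count_spec(spec)).count()."""
    spec = dict(spec)  # copy, don't mutate the caller's spec
    spec['das.empty_record'] = {'$exists': False}
    return spec
```

Keeping this in one helper means every count call site excludes the placeholders the same way.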
Clarify with the HTTP group whether I need to set a login/password for mongodb.
Investigate ways of optimising the transfer of large chunks of JSON, e.g. from a "dataset" query, whether by socket configuration or by streaming the decoding.
PLY is a better parser
This would add DAS to the browser, which would be sweet.
Some salient links:
http://au.alpha.yahoo.com/faqs/add-opensearch/index.html
https://developer.mozilla.org/en/Creating_OpenSearch_plugins_for_Firefox
http://www.opensearch.org/Specifications/OpenSearch/1.1
Right now the DAS services page shows all services which are registered in DAS. I need to show only those which are active (as defined by the DAS configuration file).