asyncstageout's People
Forkers
cinquo perilousapricot mmascher hassenriahi bbockelm spigad dciangot lstorchi snlab belforte
asyncstageout's Issues
Change couchDB views in async. module
The CouchDB views should be changed so that transfer jobs come out sorted and ready to be written to a copy job file.
Store the last query time of database source
The last query time of the database source should be stored somewhere. When the Async.StageOut server starts again after a crash or shutdown, it can then resume querying the DB source from that time.
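One way this could look, as a minimal sketch: the timestamp is kept in a small JSON file and written atomically. The file location, the `last_query_time` key and both function names are assumptions for illustration, not existing AsyncStageOut code.

```python
import json
import os


def save_last_query_time(path, timestamp):
    """Record the time of the last successful query of the database source."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"last_query_time": timestamp}, handle)
    # Rename over the old file so a crash never leaves a half-written record.
    os.replace(tmp_path, path)


def load_last_query_time(path, default=0):
    """Return the stored time, or `default` if nothing usable was stored."""
    try:
        with open(path) as handle:
            return json.load(handle)["last_query_time"]
    except (IOError, ValueError, KeyError):
        return default
```

On restart the server would call `load_last_query_time` first and query the source only for documents newer than the returned value.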
change import references
While building the RPM we realized that all the Python code must change so that imports go through the AsyncTransfer package, e.g.
from TransferWorker import TransferWorker
to
from AsyncTransfer.TransferWorker import TransferWorker
View(s) needed for monitoring state of transfer per request to propagate to the user
statistics DB provides top level information (async stageout for task is 90% complete)
files DB provides fine grained information (lfnXYZ is stuck)
Calculate long term averages in a view
Currently the statistics daemon calculates min/max/average times for transfers to provide a running average (thanks for the clarification Hassen).
There should be a view that takes the min/max values reported and calculates long term averages, e.g. the min/max/avg for a day/week/month. This should be something like
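{{{
function(doc) {
  if (doc.timing) {
    // emit(key, value): the keys used here are illustrative labels only
    emit("min_transfer_duration", doc.timing.min_transfer_duration);
    emit("max_transfer_duration", doc.timing.max_transfer_duration);
  }
}
}}}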
with a {{{_stats}}} reduce.
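For reference, CouchDB's built-in `_stats` reduce returns sum, count, min, max and sumsqr for the emitted numbers. The sketch below mimics that summary in plain Python and shows how an average and a standard deviation could be derived from it; the function names are made up for this illustration.

```python
import math


def stats_reduce(values):
    """Mimic the summary that CouchDB's built-in _stats reduce returns."""
    return {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sumsqr": sum(v * v for v in values),
    }


def mean_and_stddev(stats):
    """Derive the average and population standard deviation from a _stats row."""
    mean = stats["sum"] / stats["count"]
    variance = stats["sumsqr"] / stats["count"] - mean * mean
    # Guard against tiny negative values caused by floating point rounding.
    return mean, math.sqrt(max(variance, 0.0))
```

Grouping by day/week/month would then be a matter of emitting a date-based key in the map function and querying with a suitable group level.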
Update LoadDummyData/TransferDummyData test scripts
Documents added by LoadDummyData/TransferDummyData test scripts into files_db need to be aligned with the AsyncStageOut code.
Multiuser support in the AsyncStageOut
Currently the transfers are done using the proxy of the AsyncStageOut server operator. Transfers of user files should be done using each user's own proxy.
Fix the titles of monitor plots
Update the job FWJR when the output is transferred to the final SE
The job FWJR in the fwjr DB should be updated (by creating a new document) with the new path and location of the output after it has been transferred by the AsyncStageOut.
Monitoring couchapp: the filesCount* views should be refactored
As per comment on #521
Fix the initialization of expdays and exptime attributes in StatisticDaemon
New parameters in the AsyncStageOut db
We need to add the following parameters to the AsyncStageOut db:
- Size of file
- FTS server used to transfer this file.
- The time when the transfer was done/failed (by adding a start_time and end_time).
Fix doc removal from Async. DB
Sometimes a doc is not removed from Async. DB even if it was marked as good.
Commands should log to own log files
Commands (ftscp, srmcp, lcg-ls etc.) should all log to their own log files. The component log file should show:
{{{
logger.info("Transfer completed with return code %s, detailed logs in %s and %s"
            % (rc, stdout_log, stderr_log))
}}}
or similar. Log files should be in an appropriate directory, e.g.:
{{{
$AGENT_LOGS/$USER/$DESTINATION/$TIMESTAMP-$COMMAND.std{err,out}
}}}
Remove the username from the dbSource_url attribute of files_database documents
Removing the '/' from the doc_id in files_database
Simon pointed out that, as described here https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/501.html and here https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/705.html, having a '/' character in the doc_id may cause problems.
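One possible approach, sketched here as an assumption rather than the agreed fix, is to derive the doc_id from a hash of the LFN, since a hex digest never contains '/'. The function name and the choice of SHA-224 are illustrative.

```python
import hashlib


def doc_id_from_lfn(lfn):
    """Return a hex digest of the LFN, usable as a '/'-free document id."""
    return hashlib.sha224(lfn.encode("utf-8")).hexdigest()
```

The original LFN would still need to be stored as a field of the document, because the hash cannot be reversed.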
Add a CouchDB package to WMCore.Agent.Database
Using the WMCore daemonising code...
Request time is not updated in WorkQueue Elements
Hi Stuart,
Could you take a look at this? I don't see anything wrong in the code below.
If you are busy, I will look at it tomorrow.
2010-09-23 16:30:43,297:ERROR:WorkQueueManagerReqMgrPoller:Error saving reqMgr status update to db, (OperationalError) (1241, 'Operand should contain 1 column(s)') 'UPDATE wq_element SET reqmgr_time = %s\n WHERE id = %s' (1285277443, [2L, 3L, 15L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 4L, 1L])
Traceback (most recent call last):
File "/storage/local/data1/cmsdataops/wmagent/prod/install/WMCORE/src/python/WMComponent/WorkQueueManager/WorkQueueManagerReqMgrPoller.py", line 154, in reportToReqMgr
self.wq.setReqMgrUpdate(now, updated)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/WMCORE/src/python/WMCore/WorkQueue/WorkQueue.py", line 237, in setReqMgrUpdate
transaction = self.existingTransaction())
File "/storage/local/data1/cmsdataops/wmagent/prod/install/WMCORE/src/python/WMCore/WorkQueue/Database/MySQL/WorkQueueElement/UpdateReqMgr.py", line 22, in execute
transaction = transaction)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/WMCORE/src/python/WMCore/Database/DBCore.py", line 179, in processData
returnCursor = returnCursor)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/WMCORE/src/python/WMCore/Database/MySQLCore.py", line 127, in executebinds
return DBInterface.executebinds(self, s, b, connection, returnCursor)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/WMCORE/src/python/WMCore/Database/DBCore.py", line 65, in executebinds
resultProxy = connection.execute(s, b)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/slc5_amd64_gcc434/external/py2-sqlalchemy/0.5.2-cmp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute
return Connection.executors[c](self, object, multiparams, params)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/slc5_amd64_gcc434/external/py2-sqlalchemy/0.5.2-cmp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text
return self.__execute_context(context)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/slc5_amd64_gcc434/external/py2-sqlalchemy/0.5.2-cmp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 896, in __execute_context
self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/slc5_amd64_gcc434/external/py2-sqlalchemy/0.5.2-cmp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 950, in _cursor_execute
self._handle_dbapi_exception(e, statement, parameters, cursor, context)
File "/storage/local/data1/cmsdataops/wmagent/prod/install/slc5_amd64_gcc434/external/py2-sqlalchemy/0.5.2-cmp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
OperationalError: (OperationalError) (1241, 'Operand should contain 1 column(s)') 'UPDATE wq_element SET reqmgr_time = %s\n WHERE id = %s' (1285277443, [2L, 3L, 15L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 4L, 1L])
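The logged parameters pair one timestamp with a whole list of ids for the single `id = %s` placeholder, which would be consistent with MySQL error 1241. A possible fix, assuming the DAO accepts a list of bind dictionaries (the helper name and bind keys below are hypothetical), is to build one bind set per element id:

```python
def expand_binds(reqmgr_time, element_ids):
    """Build one bind dictionary per element id instead of binding the whole list."""
    return [{"reqmgr_time": reqmgr_time, "id": int(element_id)}
            for element_id in element_ids]
```

Each dictionary would then be executed against the same UPDATE statement, so every placeholder receives a scalar value.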
Implement checksum validation of async transferred files
Reuse the checksum validation code, if it is already implemented somewhere, to deal with the different checksum types.
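If nothing reusable turns up, a minimal sketch could look like the following. It assumes only adler32 and md5 need to be handled; the function names are illustrative.

```python
import hashlib
import zlib


def file_checksum(path, kind, chunk_size=1024 * 1024):
    """Compute an 'adler32' or 'md5' checksum of a file, reading it in chunks."""
    if kind == "adler32":
        value = 1  # Adler-32 starting value
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xFFFFFFFF)
    if kind == "md5":
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()
    raise ValueError("unsupported checksum type: %s" % kind)


def validate(path, kind, expected):
    """Return True when the file's checksum matches the expected hex string."""
    return file_checksum(path, kind) == expected.lower()
```

Reading in chunks keeps memory use flat for large output files.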
Record statistics and tasks to separate DB
record statistics document per iteration like StageManager with record per user:task:source
AsyncStageout statistics documents should have a results field keyed by user:task:source, this information should also be in the results
Each task should be recorded to a separate DB for monitoring. This record should be updated if new files are found for a task. The document should hold the task id and the number of files associated with the task (and possibly total size).
AsyncStageOut workers hang silently
Use a dictionary rather than matchPFN method to get the lfn from a pfn
It can be addressed as follows:
- we already do final_lfn -> dest_pfn, and know what jobs are in flight
- we make a dictionary of dest_pfn -> final_lfn in the agent, and use this to parse the ftscp output
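The two steps above could be sketched as below. The assumption that destination PFNs appear as whitespace-separated tokens in the ftscp output is a guess about the format, and both function names are hypothetical.

```python
def build_pfn_to_lfn(lfn_to_pfn):
    """Invert the final_lfn -> dest_pfn mapping for the files in flight."""
    return {pfn: lfn for lfn, pfn in lfn_to_pfn.items()}


def lfns_from_output(lines, pfn_to_lfn):
    """Return the LFN for every known destination PFN found in the output lines."""
    found = []
    for line in lines:
        for token in line.split():
            if token in pfn_to_lfn:
                found.append(pfn_to_lfn[token])
    return found
```

A dictionary lookup avoids calling matchPFN for every line of output.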
Align the AsyncStageOut with #2109
Report transfers errors in files_database in couch
Currently, the reason for an FTS transfer failure can be seen only in the transfer log files. Files_database reports only the status of failed transfers as "failed" and gives no further details. It would be useful to have the failure reason in couch, so that the operator does not need to open log files each time a transfer fails. The information in the log files would then be needed only for deeper debugging. To address this, a new attribute, such as FailureReason or something similar, should be added to the files_database documents, holding the reason for the transfer failure.
Are there any comments?
Call phedex api passing right parameter to use https in AsyncStageout
Don't update the server document in stats daemon
Create a new doc per iteration and aggregate in a view.
implement statistic database
Summary from ticket #521
The monitoring of the runtimeDB (files_db) gives a picture of how
current transfers are going, thus allowing the detection of current
problems. It also provides current stats to predict short-term problems,
e.g. by showing that the duration of transfers using a given FTS server
has been increasing continuously during the last 3 days.
So the idea is to keep the docs of files in the runtimeDB for a
configurable period (N days) before removing them and updating the statsDB
(long term stats) with the needed information.
The cleaning of the runtimeDB after N days, as described above, may need
the development of a new component in the AsyncStageOut. This component
will poll the runtimeDB to remove files done/failed more than N days ago
and then update the statsDB.
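The selection step of such a cleaning component might look like this sketch. The `state` and `end_time` field names are assumptions about the document layout, with `end_time` taken to be in seconds since the epoch.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def expired_docs(docs, max_age_days, now=None):
    """Select done/failed documents whose end_time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return [doc for doc in docs
            if doc.get("state") in ("done", "failed")
            and doc.get("end_time", now) < cutoff]
```

The component would summarise the returned documents into the statsDB first and delete them from the runtimeDB only after that update succeeds.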
Change JSM plugin to get only docs created since the last polling and convert site source to CMS name
The JSM plugin needs to be changed to fetch from the JSM db only docs created after the last poll. The source site should also be converted to its CMS name (I think the JSM plugin is the best place to do that, since we don't want to add the CMS name to the FWJR).
Add a Source baseclass for plugins to inherit from
As per comments in #389 there should be a Source base class which plugins inherit from. This should have a {{{call}}} method which the LFNSourceDuplicator uses to get the data.
The patch should also include a minor change to LFNSourceDuplicator such that it can use the new interface.
Code Review: Fix typos in diagnostics handling
Load things instead of pretending they're loaded.
Appropriate remove command
The PFN destinations should be removed before beginning the async. stage-out transfers (and maybe also when the async. transfers fail).
Add rotating database for statistics database
To avoid the stats database becoming enormous we should rotate it (e.g. change it monthly, and record only transfers started in that month to the database). There are a couple of ways to do this:
- maintain a history database that records stats per month/week and empty the stats database for a month once it is 2 months old (say)
- have history documents in the stats database that contain a month summary (generated ~2 months after the month ends) and delete the documents from the stats db when the history doc is made.
I think I prefer the first option (deleting the database is cheaper than deleting a load of documents) but it makes the stats daemon more complicated - it needs to know which month it is.
This is a general problem in how we use CouchDB, ideally this should be something we can reuse elsewhere in WMCore.
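For the first option, the daemon only needs a rule for naming the database of a given month and for working out which month is old enough to drop. A small sketch, with the naming scheme and function names as assumptions:

```python
import datetime


def monthly_db_name(prefix, day):
    """Name of the statistics database holding transfers started in day's month."""
    return "%s_%04d_%02d" % (prefix, day.year, day.month)


def months_before(day, count):
    """Return the first day of the month `count` months before `day`."""
    index = day.year * 12 + (day.month - 1) - count
    return datetime.date(index // 12, index % 12 + 1, 1)
```

For example, on 2011-01-15 the database to archive and delete under a two-month rule would be `monthly_db_name("stats", months_before(datetime.date(2011, 1, 15), 2))`.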
LFN destination is different from the LFN source
Currently the Async. StageOut module takes the LFN of the output at the source site (the site of the WN) and uses the same path (removing /temp/ if it is in the path) to store the output at the destination site.
Let's say that a user needs to store its outputs in /store/user/username/userDir1 in the storage of the final destination site.
We have 2 options:
1- the LFN source (LFN in the site of WN) will be /store(/temp)/user/username/userDir1. If we choose this approach, we need to allow that in WMCore (AFAIK it can't be done currently)
2- Allow Async. StageOut to handle LFNs when the path at the source site is different from the path at the destination site. An AsyncStageOut fix is needed to implement this solution.
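To make the current behaviour concrete, the path rewrite described at the top of this ticket amounts to something like the sketch below. The function name is made up, and the real module may do this differently.

```python
def destination_lfn(source_lfn):
    """Drop the first 'temp' path component, if any, from the source LFN."""
    parts = source_lfn.split("/")
    if "temp" in parts:
        parts.remove("temp")  # list.remove drops only the first occurrence
    return "/".join(parts)
```

Option 2 would replace this fixed rule with a destination LFN carried explicitly alongside the source LFN.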
Add protection against None lfn
The PhEDEx API returns None as the lfn when it doesn't match any expression in the TFC file.
Add timestamp in the ftslog file name
Currently, the file names of FTS logs follow the USER-LINK.ftslog format. The TIMESTAMP also needs to be added so that each file name is unique.
Advanced stat views and plots
Statistic Daemon needs unittests
agent name in the doc
Discussing a bit with Simon, we agreed that having the agent name in the transfer doc is a good idea and that it probably provides some future-proofing.
Propagate myproxyserver/role/group to the asyncStageOut
Follow-up to ticket #1015
Monitoring couchapp: the files{Acquired, Done,New,Failed} views should be refactored
As per comment on #521
unit_tests for changes done in 1402
Write unit test for the AsyncStageout
Fix Load/Transfer DummyData scripts
Please review.
Support execution of WMCore/bin commands in manage script
Add support for an execute option in the manage script, to call a WMCore/bin command, pass arguments to it and run it in the WMAgent environment.
Since it loads the secrets file, it checks that the command being executed actually exists in the WMCore/bin directory
Fix files{Acquired, Done,New,Failed} reduce functions
Mark file as failed after max_retry
The transfer should be marked as failed after max_retry. The max_retry parameter is defined in the config. file.
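A minimal sketch of the retry bookkeeping, assuming a per-file retry counter and the state names "retry" and "failed" (both hypothetical), and reading "after max_retry" as "once the number of failed attempts reaches max_retry":

```python
def next_state(retry_count, max_retry):
    """Return the new state and retry count after one more failed attempt."""
    retry_count += 1
    if retry_count >= max_retry:
        return "failed", retry_count
    return "retry", retry_count
```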
Update documents in couch without loading them first
Add multiple source instances support
Update the AsyncStageOut to use the update functions API in the database module
Depending on #1879
Update JSM plugin
Update the design source to 'FWJRDump' in the JSM plugin.