Graple Web Service
The consolidation script was not renamed when it was copied. It must be renamed to ConsolidateResults.py. See https://github.com/GRAPLE/GWS/blob/master/gws.py#L99
This currently works only because a single post-processing script supports consolidation.
When the condor pool holds a large number of jobs (> 50000), condor_history does not report the status of jobs that completed earlier. The history log is truncated at around 75000 entries, and the log may also become corrupted, in which case condor_history returns an error saying the history file is not found.
EMS needs a better approach: once a job is known to be completed, it should not be queried again. This also saves much of the CPU time spent in condor_history.
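One way to avoid re-querying finished jobs is to keep a set of completed job IDs and filter it out before each status query. The sketch below is only an illustration: the function names and the job ID strings are hypothetical and not taken from ems.py. It assumes that status 4 means "completed" in HTCondor's JobStatus numbering.

```python
COMPLETED = 4  # HTCondor JobStatus value for a completed job


def pending_job_ids(all_job_ids, completed_cache):
    """Return only the job IDs that still need a status query."""
    return [job_id for job_id in all_job_ids if job_id not in completed_cache]


def record_completed(statuses, completed_cache):
    """Remember every job reported as completed so it is never queried again."""
    for job_id, status in statuses.items():
        if status == COMPLETED:
            completed_cache.add(job_id)


# Hypothetical example: three jobs, two of which have already finished.
completed = set()
jobs = ["101.0", "101.1", "101.2"]
record_completed({"101.0": 4, "101.1": 2, "101.2": 4}, completed)
remaining = pending_job_ids(jobs, completed)
```

After this run, only job "101.1" would be passed to the next status query. In a real service the cache would probably need to be persisted (for example in the DB) so that it survives a restart.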
Error log:
● condor.service - LSB: Manage condor daemons
Loaded: loaded (/etc/init.d/condor; bad; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2017-05-24 07:02:34 PDT; 10min ago
Docs: man:systemd-sysv-generator(8)
Process: 1185 ExecStart=/etc/init.d/condor start (code=exited, status=1/FAILURE)
May 24 07:02:32 comet-W2 systemd[1]: Starting LSB: Manage condor daemons...
May 24 07:02:34 comet-W2 condor[1185]: mkdir: cannot create directory ‘FATAL: Unable to locate LOG in /etc/condor/condor_config’: No such file
May 24 07:02:34 comet-W2 condor[1185]: chown: cannot access 'FATAL: Unable to locate LOG in /etc/condor/condor_config': No such file or directo
May 24 07:02:34 comet-W2 condor[1185]: FATAL: Required directory FATAL: Unable to locate LOG in /etc/condor/condor_config does not exist, or is
May 24 07:02:34 comet-W2 systemd[1]: condor.service: Control process exited, code=exited status=1
May 24 07:02:34 comet-W2 systemd[1]: Failed to start LSB: Manage condor daemons.
May 24 07:02:34 comet-W2 systemd[1]: condor.service: Unit entered failed state.
May 24 07:02:34 comet-W2 systemd[1]: condor.service: Failed with result 'exit-code'.
Jobs that condor holds because of errors need to be reported to the client in the status message/email. Two options are available: let the user cancel the remaining jobs and return the logs (for a power user or when debug is enabled), or cancel the held jobs in condor, add a file to the results that describes them, and proceed with the other jobs.
Also, jobs that are managed directly through condor (held, stopped, etc.) do not propagate their status to the DB, so EMS keeps querying them over and over, which causes a performance problem. Unless the DB is cleared periodically, this leads to condor_history using a lot of CPU. The fix for the issue above should be designed so that it also avoids this one. This particular issue could be fixed by modifying the process_once function (https://github.com/GRAPLE/GWS/blob/master/ems.py#L140) to also account for held jobs, for example by introducing a new experiment status such as 'held' or 'error'.
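A minimal sketch of how an experiment status could account for held jobs follows. It is not the code in ems.py: the function name and the status strings other than 'held' and 'error' are assumptions made for illustration. The numeric codes are HTCondor's JobStatus values (3 removed, 4 completed, 5 held).

```python
REMOVED = 3    # HTCondor JobStatus: removed
COMPLETED = 4  # HTCondor JobStatus: completed
HELD = 5       # HTCondor JobStatus: held


def experiment_status(job_statuses):
    """Derive one experiment-level status from a list of per-job status codes."""
    if not job_statuses:
        # No job information at all is treated as an error here (assumption).
        return "error"
    if any(status == HELD for status in job_statuses):
        return "held"
    if any(status == REMOVED for status in job_statuses):
        return "error"
    if all(status == COMPLETED for status in job_statuses):
        return "completed"
    return "running"


status_with_hold = experiment_status([4, 4, 5, 2])
status_done = experiment_status([4, 4, 4])
```

Once an experiment reaches 'held' or 'error', EMS could stop polling it, which addresses the repeated condor_history queries described above.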
The EMS service needs a restart after a failure on the worker nodes:
Jun 26 17:45:43 graple-Submit python[252255]: Traceback (most recent call last):
Jun 26 17:45:43 graple-Submit python[252255]: File "ems.py", line 169, in <module>
Jun 26 17:45:43 graple-Submit python[252255]: process_once()
Jun 26 17:45:43 graple-Submit python[252255]: File "ems.py", line 144, in process_once
Jun 26 17:45:43 graple-Submit python[252255]: prog = round(float(cout.count('4'))/len(dbdoc['payload'])*100, 2)
Jun 26 17:45:43 graple-Submit python[252255]: ZeroDivisionError: float division by zero
Jun 26 17:45:43 graple-Submit systemd[1]: ems.service: Main process exited, code=exited, status=1/FAILURE
Jun 26 17:45:43 graple-Submit systemd[1]: ems.service: Unit entered failed state.
Jun 26 17:45:43 graple-Submit systemd[1]: ems.service: Failed with result 'exit-code'.
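The traceback shows the progress calculation dividing by len(dbdoc['payload']), which is zero when the payload is empty. A guarded version of that calculation might look like the sketch below. The helper name is hypothetical, and returning 0.0 for an empty payload is an assumption about the desired behaviour.

```python
def progress_percent(completed_count, total_jobs):
    """Return the completion percentage, or 0.0 when there are no jobs."""
    if total_jobs <= 0:
        # Avoid ZeroDivisionError for an empty payload.
        return 0.0
    return round(float(completed_count) / total_jobs * 100, 2)


empty_progress = progress_percent(0, 0)
partial_progress = progress_percent(3, 4)
```

With this guard, an experiment with an empty payload would report zero progress instead of crashing the service.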
The worker currently uses a hard-coded Rscript filename to run post-processing scripts.
Need to make sure that the user does not submit an Rscript file with the same name.
Fix: remove the Rscript file, if found, in the else branch at https://github.com/GRAPLE/GWS/blob/master/gws.py#L128
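Removing a conflicting file if it exists can be sketched as follows, assuming Python 3. The helper name and the filename PostProcess.R are placeholders chosen for the example, not the names used in gws.py.

```python
import os
import tempfile


def remove_if_present(directory, filename):
    """Delete directory/filename if it exists; return True when a file was removed."""
    path = os.path.join(directory, filename)
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


# Demonstration in a throwaway directory with a placeholder filename.
with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "PostProcess.R")
    with open(script, "w") as handle:
        handle.write("# user-supplied script\n")
    removed_first = remove_if_present(workdir, "PostProcess.R")
    removed_second = remove_if_present(workdir, "PostProcess.R")
```

Catching FileNotFoundError rather than checking os.path.exists first avoids a race between the check and the removal.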
The API manager does not set the 'debug' key by default, but EMS checks the debug status every time.
EMS fails if the key is not found.
Fix: always add 'debug':False in the API manager.
EMS should also handle a missing key safely.
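Both halves of the fix can be sketched with plain dictionaries. The function names and the extra 'status' field are hypothetical. Only the 'debug' key and its False default come from the issue.

```python
def with_defaults(doc):
    """API manager side: make sure 'debug' is always present, defaulting to False."""
    merged = {"debug": False}
    merged.update(doc)
    return merged


def is_debug(dbdoc):
    """EMS side: read the flag without failing when the key is missing."""
    return bool(dbdoc.get("debug", False))


stored = with_defaults({"status": "queued"})
legacy_flag = is_debug({"status": "queued"})   # document created before the fix
explicit_flag = is_debug({"debug": True})
```

Using dict.get on the EMS side keeps older documents, which were stored without the key, from raising KeyError.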
GRAPLEr does not report an error if GLM3 is not installed on the worker node. The output is simply empty.