
gws's People

Contributors

itshari, jaikrishnats, kcratie, saumitraaditya, smahesul, vahid-dan

gws's Issues

condor_history loses info on jobs

When the condor pool holds a large number of jobs (> 50000), condor_history stops reporting the status of jobs that completed earlier. The history log is truncated at around 75000, or the log may be corrupted, in which case condor_history returns an error saying the history file is not found.

EMS needs a better approach: once a job is found to be completed, it should not be queried again. This also saves a lot of the CPU time spent in condor_history.
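The "don't query completed jobs again" idea could look roughly like the sketch below. It is only an illustration, not the EMS code: `CompletedJobTracker` and the `query` callable are hypothetical names, and `query` stands in for whatever EMS uses to ask condor for job statuses. The only HTCondor fact assumed is that a JobStatus of 4 means completed, which matches the `cout.count('4')` call in ems.py.

```python
from typing import Callable, Dict, Iterable, List, Set

COMPLETED = 4  # HTCondor JobStatus value for a completed job


class CompletedJobTracker:
    """Hypothetical helper: remembers completed job IDs so they are never queried twice."""

    def __init__(self) -> None:
        self.completed: Set[str] = set()

    def refresh(
        self,
        job_ids: Iterable[str],
        query: Callable[[List[str]], Dict[str, int]],
    ) -> int:
        """Query only jobs not yet known to be completed; return the completed count."""
        ids = list(job_ids)
        pending = [j for j in ids if j not in self.completed]
        if pending:
            # `query` is an assumed stand-in for the real status lookup.
            for job_id, status in query(pending).items():
                if status == COMPLETED:
                    self.completed.add(job_id)
        return sum(1 for j in ids if j in self.completed)
```

With this shape, each pass asks condor only about jobs that are still outstanding, so the size of the query shrinks as the experiment progresses. The set would have to be persisted (for example in the DB) to survive an EMS restart.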

condor.service Fails to Start Automatically on COMET Nodes

Error log:

● condor.service - LSB: Manage condor daemons
   Loaded: loaded (/etc/init.d/condor; bad; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2017-05-24 07:02:34 PDT; 10min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1185 ExecStart=/etc/init.d/condor start (code=exited, status=1/FAILURE)

May 24 07:02:32 comet-W2 systemd[1]: Starting LSB: Manage condor daemons...
May 24 07:02:34 comet-W2 condor[1185]: mkdir: cannot create directory ‘FATAL: Unable to locate LOG in /etc/condor/condor_config’: No such file 
May 24 07:02:34 comet-W2 condor[1185]: chown: cannot access 'FATAL: Unable to locate LOG in /etc/condor/condor_config': No such file or directo
May 24 07:02:34 comet-W2 condor[1185]: FATAL: Required directory FATAL: Unable to locate LOG in /etc/condor/condor_config does not exist, or is
May 24 07:02:34 comet-W2 systemd[1]: condor.service: Control process exited, code=exited status=1
May 24 07:02:34 comet-W2 systemd[1]: Failed to start LSB: Manage condor daemons.
May 24 07:02:34 comet-W2 systemd[1]: condor.service: Unit entered failed state.
May 24 07:02:34 comet-W2 systemd[1]: condor.service: Failed with result 'exit-code'.

Communicate held jobs to the user

Jobs that condor holds because of issues need to be communicated to the client through a status message or email. There are two options: let the user cancel the other jobs and return with the logs (for a power user or when debug is enabled), or cancel the held jobs (from condor), include a descriptive file about them in the result, and proceed with the other jobs.

Also, jobs that are managed directly through condor (held, stopped, etc.) don't propagate their status to the DB, so EMS keeps querying them over and over, which causes a performance issue. Unless the DB is cleared during maintenance, this leads to condor_history using a lot of CPU. The fix for the issue above needs to avoid this one as well. This particular issue could be fixed by modifying the process_once function (https://github.com/GRAPLE/GWS/blob/master/ems.py#L140) to also account for held jobs, for example by introducing a new experiment status such as 'held' or 'error'.
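One possible way to derive an experiment status that accounts for held jobs is sketched below. This is not the actual process_once logic: `experiment_status` is a hypothetical helper, the 'completed' and 'running' labels are assumptions, and treating an empty job list as 'error' is a design guess. The numeric codes are the standard HTCondor JobStatus values (4 for completed, 5 for held).

```python
from typing import Iterable

COMPLETED = 4  # HTCondor JobStatus: completed
HELD = 5       # HTCondor JobStatus: held


def experiment_status(job_statuses: Iterable[int]) -> str:
    """Hypothetical mapping from per-job statuses to one experiment status."""
    statuses = list(job_statuses)
    if not statuses:
        # Assumption: an experiment with no jobs is reported as an error.
        return "error"
    if any(s == HELD for s in statuses):
        # A single held job marks the whole experiment as held,
        # so EMS can stop polling and notify the user.
        return "held"
    if all(s == COMPLETED for s in statuses):
        return "completed"
    return "running"
```

Once an experiment reaches 'held' or 'error', EMS could write that status to the DB and skip the experiment on later passes, which addresses the repeated-query problem described above.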

EMS Error on Worker Fails

The EMS service needs a restart after a failure on worker nodes:

Jun 26 17:45:43 graple-Submit python[252255]: Traceback (most recent call last):
Jun 26 17:45:43 graple-Submit python[252255]:   File "ems.py", line 169, in <module>
Jun 26 17:45:43 graple-Submit python[252255]:     process_once()
Jun 26 17:45:43 graple-Submit python[252255]:   File "ems.py", line 144, in process_once
Jun 26 17:45:43 graple-Submit python[252255]:     prog = round(float(cout.count('4'))/len(dbdoc['payload'])*100, 2)
Jun 26 17:45:43 graple-Submit python[252255]: ZeroDivisionError: float division by zero
Jun 26 17:45:43 graple-Submit systemd[1]: ems.service: Main process exited, code=exited, status=1/FAILURE
Jun 26 17:45:43 graple-Submit systemd[1]: ems.service: Unit entered failed state.
Jun 26 17:45:43 graple-Submit systemd[1]: ems.service: Failed with result 'exit-code'.
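The traceback shows the progress calculation dividing by `len(dbdoc['payload'])`, which is zero here. A minimal guard, assuming Python 3 and that reporting 0% is acceptable for an empty payload, might look like this; `progress_percent` is a hypothetical helper, not a function in ems.py.

```python
def progress_percent(completed: int, total: int) -> float:
    """Return completion as a percentage, or 0.0 when there are no jobs."""
    if total <= 0:
        # Avoids the ZeroDivisionError seen when the payload is empty.
        return 0.0
    return round(float(completed) / total * 100, 2)
```

In process_once, the call would then take the form `progress_percent(cout.count('4'), len(dbdoc['payload']))`. Whether an empty payload should instead mark the experiment as failed is a separate decision.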
