cigri's People

Contributors

augu5te, bzizou

Forkers

guilloteauq

cigri's Issues

Macro variables in the JDL

Some variables should be available in the JDL file, perhaps as macros (%HOME%, %CAMPAIGN_ID%, ...). The special case of the $HOME variable should be examined: is it the HOME of the grid frontend or of the cluster? Using a macro could be easier for users than escaping it with several backslashes.
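
To make the proposal concrete, here is a minimal sketch of how such macros could be expanded before the JDL is interpreted. The helper name `expand_macros`, the macro table and its values are assumptions made for illustration; none of this exists in CiGri.

```ruby
# Hypothetical helper: replace %NAME% macros using a lookup table.
# Unknown macros are left untouched so that the user can spot them.
def expand_macros(text, macros)
  text.gsub(/%([A-Z_]+)%/) do
    name = Regexp.last_match(1)
    macros.fetch(name) { "%#{name}%" }
  end
end

# Illustrative values only; a real table would be filled per cluster/campaign.
macros = { 'HOME' => '/home/alice', 'CAMPAIGN_ID' => '42' }

expanded = expand_macros('%HOME%/results/%CAMPAIGN_ID%/%UNKNOWN%', macros)
puts expanded
```

Expanding per cluster would let %HOME% resolve to the home directory of the cluster rather than that of the grid frontend, which is the ambiguity raised above.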

The user should be able to configure automatic fixes

For example, in the JDL, we could have:

"action_on": {
"timeout": "ignore|resubmit|blacklist",
"walltime": "ignore|resubmit|blacklist"
},

with ignore=fix the event, resubmit=fix the event and resubmit, blacklist=disable the cluster until manual fix
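
The semantics above could be sketched as follows. The function `steps_for`, the step symbols and the default behaviour when nothing is configured are assumptions for illustration, not existing CiGri code.

```ruby
require 'json'

VALID_ACTIONS = %w[ignore resubmit blacklist].freeze

# Hypothetical mapping from the proposed "action_on" setting to the steps
# CiGri would take when an event of the given kind occurs.
def steps_for(jdl, event_kind)
  action = jdl.fetch('action_on', {})[event_kind]
  # Assumption: with no configured action, the event stays open for a manual fix.
  return [:leave_event_open] if action.nil?
  raise ArgumentError, "unknown action #{action.inspect}" unless VALID_ACTIONS.include?(action)

  case action
  when 'ignore'    then [:fix_event]
  when 'resubmit'  then [:fix_event, :resubmit_job]
  when 'blacklist' then [:blacklist_cluster]
  end
end

jdl = JSON.parse('{"action_on": {"timeout": "resubmit", "walltime": "blacklist"}}')
timeout_steps  = steps_for(jdl, 'timeout')
walltime_steps = steps_for(jdl, 'walltime')
other_steps    = steps_for(jdl, 'exit_error')
puts timeout_steps.inspect
```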

gridsub -j problem

A Grid'5000 user reported that the following command failed:

gridsub -j '{"name":"Campaign", "nb_jobs":2, "resources":"nodes=1", "walltime":"00:30:00", "clusters":{"rennes":{"exec_file":"sleep 100"}}}'

with the following error message:

Error: {"status":400,"title":"Error","message":"Error submitting campaign: JDL badly defined: #<JSON::ParserError: 757: unexpected token at '{\"name\":\"Campaign\",'>"}
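
As a sanity check, the sketch below parses the JDL string from the command above with Ruby's standard json library, and then the fragment quoted in the error message. The full string is valid JSON, whereas the fragment is not, which suggests (this is an assumption, not something confirmed in the report) that the string was cut at the first space before it reached the parser.

```ruby
require 'json'

jdl = '{"name":"Campaign", "nb_jobs":2, "resources":"nodes=1", ' \
      '"walltime":"00:30:00", "clusters":{"rennes":{"exec_file":"sleep 100"}}}'

# The complete JDL parses without error.
campaign = JSON.parse(jdl)
puts campaign['clusters'].keys.inspect

# The fragment shown in the error message is rejected by the parser.
fragment = '{"name":"Campaign",'
fragment_rejected =
  begin
    JSON.parse(fragment)
    false
  rescue JSON::ParserError
    true
  end
puts fragment_rejected
```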

"Job key generation failed" event can only be fixed by root

I'm not sure how I triggered this, but I got the following problem:

fnancy:~/cigri/tests$ gridevents -e 22105
------------------------------------------------------------------------------
22105: (open) SUBMIT_JOB at 2015-07-03 17:56:56+02 on 
Cigri::Error: 400 error in POST for jobs:
 {
   "title" : "Bad query",
   "message" : "Oarsub command exited with status 8: [ADMISSION RULE] Automatically redirect in the besteffort queue\n[ADMISSION RULE] Automatically add the besteffort constraint on the resources\n[ADMISSION RULE] Modify resource description with type constraints\nGenerate a job key...\nError: Job key generation failed (256).\nOAR_JOB_ID=-14\nOarsub failed: please verify your request syntax or ask for support to your admin.\n\nCmd:\noarsub  --name=\\\"cigri.9\\\" --directory=\\\"$HOME\\\" --resource=\\\"core=1,walltime=01:00:00\\\" --type=\\\"besteffort\\\" \\\"export CIGRI_CAMPAIGN_ID=9;~/cigri/tests/test.sh\\\" --array-param-file=/tmp/oarapi.paramfile.ekXIh",
   "code" : 400
}

fnancy:~/cigri/tests$ gridevents -e 22105 -f
Failed to fix event(s): Event 22105 is not specific to a campaign belonging to you: Not authorized to close event 22105.

This event was specific to the campaign.

"Nikita kill problem" errors should only be warnings

At the very least, these events should have "low" severity so that, with a suitable notification configuration (low by Jabber, high by e-mail), we are not flooded with e-mails.
We should perhaps also investigate why these events occur so frequently (on CIMENT at least).

Cigri fails to start with ruby 2.1: code is deadlockable

Ruby 2 checks for code that can deadlock, and the way we use signal handlers, with logging inside the handlers, falls into this case:

log writing failed. can't be called from trap context
/usr/lib/ruby/2.1.0/monitor.rb:185:in `lock': can't be called from trap context (ThreadError)
        from /usr/lib/ruby/2.1.0/monitor.rb:185:in `mon_enter'
        from /usr/lib/ruby/2.1.0/monitor.rb:209:in `mon_synchronize'
        from /usr/lib/ruby/vendor_ruby/dbi.rb:231:in `load_driver'
        from /usr/lib/ruby/vendor_ruby/dbi.rb:149:in `_get_full_driver'
        from /usr/lib/ruby/vendor_ruby/dbi.rb:134:in `connect'
        from /usr/local/share/cigri/lib/cigri-iolib.rb:44:in `db_connect'
        from /usr/local/share/cigri/lib/cigri-iolib.rb:1606:in `initialize'
        from /usr/local/share/cigri/lib/cigri-eventlib.rb:78:in `initialize'
        from /usr/local/share/cigri/modules/judas.rb:90:in `new'
        from /usr/local/share/cigri/modules/judas.rb:90:in `notify'
        from /usr/local/share/cigri/modules/judas.rb:107:in `block in <main>'
        from /usr/local/share/cigri/modules/judas.rb:113:in `call'
        from /usr/local/share/cigri/modules/judas.rb:113:in `sleep'
        from /usr/local/share/cigri/modules/judas.rb:113:in `<main>'

A simple workaround would be to wrap every logger call made inside a trap in a Thread, but I'm not sure that it's clean:

    Thread.new do
      logger.debug("Spawned #{mod} process #{modpid}")
    end

An interesting discussion about that: http://www.mikeperham.com/2013/02/23/signal-handling-with-ruby/
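
An alternative to spawning a Thread is the "self-pipe" pattern: the trap handler only writes a byte to a pipe, and the main loop does the logging outside the trap context. The sketch below is a generic illustration, not CiGri code; the USR1 signal, the StringIO log target and the message text are arbitrary choices made for the example.

```ruby
require 'logger'
require 'stringio'

# Log to memory so the example is self-contained.
log_output = StringIO.new
logger = Logger.new(log_output)

reader, writer = IO.pipe

# Inside the trap, do nothing that takes a lock: just notify the main loop.
Signal.trap('USR1') do
  writer.write_nonblock('1')
end

# Simulate an incoming signal.
Process.kill('USR1', Process.pid)

# Main loop side: wait (up to 5 seconds) for a notification, then log normally.
handled = false
if IO.select([reader], nil, nil, 5)
  reader.read_nonblock(1)
  logger.debug('Received USR1, logged outside the trap context')
  handled = true
end

puts log_output.string
```

The design choice here is to keep the handler trivial, so that neither the logger nor the database connection is ever touched from the trap context.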

Shipping CiGri with an application (ie. as a library)

CiGri does not need any specific permission to run on top of OAR clusters. It could be turned into a client-side library to ease the portability of multi-parametric applications outside of CIMENT and Grid5000. From an application standpoint, it makes more sense to invest in CiGri if the development effort can be reused elsewhere.

Renaming CiGri modules

The module names are funny (almighty, nikita, judas, columbo...), but their meaning is not obvious to new developers.

Cannot delete the exec_directory in the epilogue

The exec_directory cannot be deleted in the epilogue. I get the following error:

15420: (open) EPILOG_EXIT_ERROR of job 849 at 2016-03-11T10:50:34+01:00 on lyon
The job exited with exit status 256;

I believe this is because the output file of the epilogue job is stored in this directory.
Deleting the exec_directory after a campaign is rather useful when we would like to clean all output/temporary files of a campaign.

This issue has connections with #5 but even with a gridclean command, this issue should be fixed.

Example:

{
  "name": "Povray Landscape",  

  "resources": "nodes=1", 
  "walltime": "00:30:00", 
  "exec_file": "~/cigri/pov/pov.sh",
  "exec_directory": "$HOME/cigri/pov-tmp/",

  "nb_jobs": 4,

  "clusters": {   
    "lyon": {}
  },

  "epilogue": [
    "rm -rf ~/cigri/pov-tmp/"
  ]

}

Cigri-docker

Create a CiGri development environment using oar-docker.

Create exec_dir if it does not exist

I got the following error trying to create the directory using the JDL prologue:
The job exited because of a working directory error. NFS problem on the cluster?;

Using ~ (tilde) for referring to the home directory does not work for JDL parameter "exec_directory"

For JDL parameter "exec_directory", $HOME works fine but ~ (tilde) does not:

------------------------------------------------------------------------------
15392: (open) WORKING_DIRECTORY_ERROR of job 829 at 2016-03-11T10:36:58+01:00 on lyon
The job exited because of a working directory error. NFS problem on the cluster?;
------------------------------------------------------------------------------

Full example:

{
  "name": "Povray Landscape",  

  "resources": "nodes=1", 
  "walltime": "00:30:00", 
  "exec_file": "~/cigri/pov/pov.sh",
  "exec_directory": "~/cigri/pov-tmp/",

  "nb_jobs": 4,

  "clusters": {   
    "lyon": {}
  }

}

It is confusing because ~ (tilde) does work for exec_file.
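
One possible way to make both parameters behave the same would be to rewrite a leading ~ into $HOME when the JDL is read, so that the cluster-side shell resolves it. This is only a sketch; `tilde_to_home` is a hypothetical helper, not part of CiGri.

```ruby
# Hypothetical helper: turn a leading "~" (or "~/...") into "$HOME".
# Forms such as "~user/..." and tildes in the middle of a path are left alone.
def tilde_to_home(path)
  path.sub(%r{\A~(?=/|\z)}, '$HOME')
end

puts tilde_to_home('~/cigri/pov-tmp/')
puts tilde_to_home('~bob/cigri/')
puts tilde_to_home('/scratch/~/data')
```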

Epilogue problem

Sometimes, the epilogue fails but the campaign is considered as "terminated".

Example:

{
  "name": "Povray Landscape",  

  "resources": "nodes=1", 
  "walltime": "00:30:00", 
  "exec_file": "~/cigri/pov/pov.sh",
  "exec_directory": "$HOME/cigri/pov-tmp/",

  "nb_jobs": 5,

  "prologue": [
    "mkdir -p ~/cigri/pov/",
    "rsync -avz lyon:~/cigri/pov/ ~/cigri/pov/",
    "mkdir -p ~/cigri/pov-tmp/"
  ],

  "epilogue": [
    "rsync -avz --update --exclude='OAR.cigri.*' ~/cigri/pov-tmp/ lyon:~/cigri/pov-results/",
    "rm ~/cigri/pov-tmp/*.png"
  ],

  "clusters": {   
    "lyon": {
      "prologue": [
        "mkdir -p ~/cigri/pov-tmp/"
      ]
    },
    "nancy": {}
  }

}

jgaidamour@flyon:~$ gridevents 29
No events!
jgaidamour@flyon:~$ gridstat 29
Campaign: 29
  Name: Povray Landscape
  User: jgaidamour
  Date: 2016-03-11 14-34-06
  State: terminated  
  Progress: 5/5 (100%)
  Stats: 
    average_jobs_duration: 13.6
    stddev_jobs_duration: 1.51657508881031
    jobs_throughput: 2250.0 jobs/h
    remaining_time: 0.0 hours
    failures_rate: 0.0 %
    resubmit_rate: 0.0 %
  Clusters: 
    lyon:
      active_jobs: 0
      queued_jobs: 0
      prologue_ok: true
      epilogue_ok: false
    nancy:
      active_jobs: 0
      queued_jobs: 0
      prologue_ok: true
      epilogue_ok: false

I cannot find the log of the epilogue. In addition, the logs of the prologue are located in $HOME (instead of $HOME/cigri/pov-tmp/ ?).

It should be possible to re-run the epilogue and the campaign should have open events.

Test the job grouping capability of CiGri

CiGri provides two options for grouping tasks into OAR jobs:

  • dimensional_grouping: allows several jobs to be executed in parallel
    in a single submission, if possible
  • temporal_grouping: allows several jobs to be executed one after the
    other in a single submission. The number of jobs is computed
    automatically by CiGri

Before advising users to use this feature, we should check if these options work as expected.

Add support for a "site" concept (group of clusters)

Can we consider a site to be a group of clusters that may be referred to as a single cluster in user interactions such as the JDL?
We would then simply have to add a "site" field to the clusters table and handle it in the parts of the code that need it (especially in the JDL library, for example to fix #18). It could also be used to implement a per-user/per-site limit on the number of jobs (#23).

Prologue/epilogue per clusters

On Grid5000, it might be useful to run prologue/epilogue per clusters rather than by site (ex: for compiling code with optimized options according to CPU type).

Logs of a campaign

It would be nice to have a log on the user side of what Cigri did during a campaign.
gridstats -c -f gives the state at a given moment but there is not much information about how we reach this state.

I'm thinking of something like this:

[timestamp] Prologue on cluster xxx
[...]
[timestamp] Submit job 1 on cluster xxx
[timestamp] Submit job 2 on cluster yyy
[timestamp] Event 15463: The job 1 on cluster xxx exited because of a working directory error. 
[timestamp] User fixed all events and jobs 1 is resubmitted (-r -f)
[timestamp] Submit job 1 on cluster zzz
[...]
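
A minimal sketch of how such a log could be produced is given below. The `CampaignLog` class, its methods and the timestamp format are assumptions made for illustration and do not correspond to existing CiGri code.

```ruby
# Hypothetical per-campaign log: collects timestamped messages and
# renders them in the "[timestamp] message" layout proposed above.
class CampaignLog
  def initialize
    @entries = []
  end

  # The time can be passed explicitly, which keeps the example deterministic.
  def add(message, time = Time.now)
    @entries << [time, message]
    self
  end

  def lines
    @entries.map { |time, message| "[#{time.strftime('%Y-%m-%d %H:%M:%S')}] #{message}" }
  end
end

log = CampaignLog.new
log.add('Prologue on cluster xxx', Time.utc(2016, 3, 11, 10, 36, 58))
log.add('Submit job 1 on cluster xxx', Time.utc(2016, 3, 11, 10, 37, 5))
puts log.lines
```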

Allow a simplified submission given an OAR script

CiGri could directly exploit an OAR script, especially if it is an array job. The JDL file could be optional in this case. CiGri could send the script to every cluster of the grid, with some default behaviors and/or by exploiting #OAR directives plus #CIGRI directives.

Merge, update and centralize documentation

Currently, the documentation is spread over the CIMENT wiki, the G5K wiki, the OAR web site, the CiGri website and the source repository. Outdated documents should be deleted.

Sometimes, the epilogue fails with no message/event, resulting in a hanging campaign

Example on Grid5000:

jgaidamour@flyon:~/cigri/pov$ gridstat
Campaign id Name                User             Submission time     S  Progress
----------- ------------------- ---------------- ------------------- -- --------
5           Povray Landscape    sdelamare        2016-01-27 17-42-58 Re 100/100 (100%)
jgaidamour@flyon:~/cigri/pov$ gridstat -f 5
Campaign: 5
  Name: Povray Landscape
  User: sdelamare
  Date: 2016-01-27 17-42-58
  State: in_treatment (events)
  Progress: 100/100 (100%)
  Stats: 
    average_jobs_duration: 14.1979166666667
    stddev_jobs_duration: 5.05468993617583
    jobs_throughput: 2081.9 jobs/h
    remaining_time: 0.0 hours
    failures_rate: 1.0 %
    resubmit_rate: 0.0 %
  Clusters: 
    lyon:
      active_jobs: 0
      queued_jobs: 0
      prologue_ok: true
      epilogue_ok: true
    nancy:
      active_jobs: 0
      queued_jobs: 0
      prologue_ok: true
      epilogue_ok: false
    rennes:
      active_jobs: 0
      queued_jobs: 0
      prologue_ok: true
      epilogue_ok: false

 Jobs:
  143: 55,802006,terminated,lyon,0,0
  [...]
  every job is in the state 'terminated'

Database:

cigri=# SELECT jobs.id FROM jobs,events WHERE jobs.id=events.job_id AND jobs.state='event' AND events.state='open' AND jobs.campaign_id=5;
 id  
-----
 156
(1 row)

  id  |  class   |       code        | state | job_id | cluster_id | campaign_id | parent | checked | notified |       date_open        | date_closed |      date_update       |                                      message                                       
------+----------+-------------------+-------+--------+------------+-------------+--------+---------+----------+------------------------+-------------+------------------------+------------------------------------------------------------------------------------
 8077 | campaign | BLACKLIST         | open  |        |          5 |           5 |   8076 | no      | t        | 2016-01-27 17:46:54+01 |             | 2016-01-27 17:46:54+01 | 
 8076 | job      | EPILOG_EXIT_ERROR | open  |    156 |          5 |           5 |        | yes     | t        | 2016-01-27 17:46:54+01 |             | 2016-01-27 17:46:54+01 | The job exited with exit status 256;                                              +
      |          |                   |       |        |            |             |        |         |          |                        |             |                        | Last 5 lines of stderr_file:                                                      +
      |          |                   |       |        |            |             |        |         |          |                        |             |                        | rm: cannot remove ‘/home/sdelamare/cigri/pov-tmp/*.png’: No such file or directory+
      |          |                   |       |        |            |             |        |         |          |                        |             |                        | 
(2 rows)

Apache error log:

/usr/share/phusion-passenger/helper-scripts/passenger-spawn-server:99:in `<main>'
/usr/lib/ruby/vendor_ruby/sinatra/base.rb:813:in `block in process_route': warning: URI.unescape is obsolete
Encoding::UndefinedConversionError - "\xE2" from ASCII-8BIT to UTF-8:
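
The "\xE2" byte in this error is the first byte of the UTF-8 encoding of the typographic quote (‘) that appears in the rm message stored with the event. Assuming the stderr content is read as binary (ASCII-8BIT) and then converted to UTF-8 (an assumption; the exact code path is not shown in this report), the failure can be reproduced and avoided as in the sketch below.

```ruby
# Bytes as they might be read from the stderr file: binary (ASCII-8BIT) data
# containing UTF-8 encoded typographic quotes.
raw = "rm: cannot remove \xE2\x80\x98/home/sdelamare/cigri/pov-tmp/*.png\xE2\x80\x99".b

# Converting binary data to UTF-8 fails on the first non-ASCII byte.
conversion_failed =
  begin
    raw.encode('UTF-8')
    false
  rescue Encoding::UndefinedConversionError
    true
  end

# Reinterpreting the bytes as UTF-8 and scrubbing invalid sequences is safe.
safe = raw.dup.force_encoding('UTF-8').scrub('?')

puts conversion_failed
puts safe
```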

Gridclean

CiGri should provide a gridclean command that cleans all output files of a campaign (oar.stderr and stdout on every cluster, but we should also think of a solution for cleaning custom directories that may have been created by the user).

Adding the "site" keyword to JDL

To make it less confusing for G5K users, we might want to accept the keyword "site" in place of "clusters" in JDL:

    "clusters": {
        "nancy": {},
        "grenoble": {},
        "luxembourg": {},
        "lyon": {},
        "nantes": {},
        "rennes": {},
        "sophia": {}
    },

    "sites": {
        "nancy": {},
        "grenoble": {},
        "luxembourg": {},
        "lyon": {},
        "nantes": {},
        "rennes": {},
        "sophia": {}
    },

This does not have any impact for CIMENT.
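
Accepting the alias could be as simple as normalizing the parsed JDL before it is used. The sketch below assumes the key is spelled "sites", as in the example above; `normalize_sites` is a hypothetical helper, not existing CiGri code.

```ruby
require 'json'

# Hypothetical helper: treat a top-level "sites" key as an alias of "clusters".
# Specifying both is rejected, since it would be ambiguous.
def normalize_sites(jdl)
  return jdl unless jdl.key?('sites')
  raise ArgumentError, 'use either "sites" or "clusters", not both' if jdl.key?('clusters')

  normalized = jdl.dup
  normalized['clusters'] = normalized.delete('sites')
  normalized
end

jdl = JSON.parse('{"name": "test", "sites": {"nancy": {}, "lyon": {}}}')
normalized = normalize_sites(jdl)
puts normalized.inspect
```

A JDL that already uses "clusters" passes through unchanged, which is consistent with the remark that CIMENT would not be affected.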

griddel is not always deleting oar jobs

I'm not sure why. Maybe it is just because it takes a while to get them killed by CiGri.

When I kill them directly via oardel, CiGri gets a little bit confused:

[2015-07-06 17:08:42.865515][I][31652][RUNNER grenoble] Job 73279 is in Error state.
 {
   "title" : "Oardel error",
   "code" : 400
}

" state="open" checked="no" notified="false" date_open="2015-07-06 17:08:42 +0200" date_update="2015-07-06 17:08:42 +0200" 
[2015-07-06 17:08:43.104806][I][31726][COLOMBOLIB] Resubmitting job 73312
[2015-07-06 17:08:43.107256][I][31652][COLOMBOLIB] Resubmitting job 73279
[2015-07-06 17:08:43.144276][I][31726][JOBLIB] Not resubmiting job 73312 of non-running campaign
[2015-07-06 17:08:43.165924][I][31652][JOBLIB] Not resubmiting job 73279 of non-running campaign
[2015-07-06 17:08:43.205060][W][12010][NIKITA] Could not kill job 73351
 {
   "title" : "Oardel error",
   "code" : 400
}
