pod's People

Contributors

anarmanafov

pod's Issues

PBS and POD

Hi Anar,

How are you ?

I'm trying to set up a test PoD cluster on an existing PBS farm, and I'm running into a couple of issues.

First, PBS does not define TMPDIR, so I have to define it myself (in my .profile, because it does not work if I put it only in ~/.PoD/user_worker_env.sh). I mention it only for completeness; it's not a blocker.
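
A minimal sketch of such a TMPDIR fallback, written for a login script such as .profile as described in the report. The helper name ensure_tmpdir and the /tmp base directory are assumptions for illustration, not part of PoD or PBS.

```shell
# Hypothetical helper: give TMPDIR a value when the batch system leaves it
# unset, and make sure the directory exists.
# $1 (optional): base directory for the fallback; /tmp is an assumed default.
ensure_tmpdir() {
    if [ -z "${TMPDIR:-}" ]; then
        TMPDIR="${1:-/tmp}/pod_tmp_$(id -u)"
    fi
    mkdir -p "$TMPDIR" && export TMPDIR
}

ensure_tmpdir
```

An already defined TMPDIR is left unchanged, so the same snippet can be used on nodes where the variable is set.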

Second, and more serious: the worker jobs now start but exit almost immediately. The beginning of the log says:

working directory: /scratch/aliced/aphecete/tmp/PoD_Kbh1SSHE4V

Writing files in node's directory /scratch/aliced/aphecete/tmp/PoD_Kbh1SSHE4V
cp: cannot stat `/users/aliced/aphecete/.PoD/wrk/pod-worker': No such file or directory
gzip: stdin has flags 0x8d -- not supported
tar: Child died with signal 13
tar: Error is not recoverable: exiting now
tail: write error: Broken pipe
/scratch/aliced/aphecete/tmp/PoD_Kbh1SSHE4V/PoDWorker.sh: line 320: lockfile: command not found

Any idea what I'm doing wrong? ;-)

Thanks,
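
The last log line shows that PoDWorker.sh calls lockfile, which is not installed on the worker node (lockfile is normally shipped with the procmail package). A minimal sketch that reports which of the commands appearing in the log are absent from PATH; missing_commands is a hypothetical helper name, and the command list is taken only from the log above, so it may not be complete.

```shell
# Hypothetical helper: print each of the given commands that cannot be
# found on PATH, one per line. Prints nothing when all are available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Commands that show up in the worker log above.
missing_commands cp tail gzip tar lockfile
```

Running this inside a short test job on the PBS farm, rather than on the login node, shows what the worker nodes actually provide.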

Torque-4.2.7 and pod-submit failing

With Torque-4.2.7, qsub checks whether \r (or ^M) exists in the submission file.

As a result, with PoD,

desilva@melui1:~$ pod-submit -r pbs -n 2 -q mel_short
PoDWorker.sh
~
qsub: script is written in DOS/Windows text format
Error submitting job.

To reproduce:

setupATLAS
localSetupPoD PoD-3.16p1-python2.7-x86_64-slc6-gcc47-boost1.55
pod-server start
qstat -q
pod-submit -r pbs -n 2 -q mel_short

Regards,
Asoka
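
A quick way to see whether the generated worker script contains carriage-return bytes, which is what qsub objects to. This is a minimal sketch: count_cr is a hypothetical helper name, and the path in the usage comment is taken from another report on this page and may differ on your system. If the script carries an embedded binary archive (the tail/gzip/tar errors in the PBS report above hint at this, but it is an assumption), the 0x0D bytes could come from that payload rather than from DOS line endings.

```shell
# Hypothetical helper: print the number of carriage-return (0x0D) bytes
# in the file given as $1. LC_ALL=C keeps tr byte-oriented on binary data.
count_cr() {
    LC_ALL=C tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# Self-contained demonstration on a temporary file with two CR bytes.
demo=$(mktemp)
printf 'line one\r\nline two\r\nline three\n' > "$demo"
count_cr "$demo"
rm -f "$demo"

# Possible use on a real installation (path is an assumption):
#   count_cr "$HOME/.PoD/wrk/PoDWorker.sh"
```

A non-zero count only confirms that the bytes are present; it does not say whether they sit in the shell part of the script or in a payload.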

Cannot setup PROOF via ssh

Hi,

I am trying to set up a PROOF environment on our group's local work server. I followed the instructions, but I still get a connection error and have no idea why. At the moment I have one master and one worker node for testing purposes: grid201 (master) and grid202 (worker).
After starting the pod-server I get this output:

pod-server start

Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
selecting pre-compiled bins to be added to worker package...

PoD worker package: /home/proof_user/.PoD/wrk/PoDWorker.sh

XPROOFD [23486] port: 21002
PoD agent [23508] port: 22001

PROOF connection string: [email protected]:21002

  • It sees no workers.
  • According to the log:

2015-06-26 16:33:04.553 INF 0 [LOG singleton:thread-23885] LOG singleton has been initialized.
2015-06-26 16:33:04.553 INF 0 [PROOFAgent:thread-23885] pod-agent v.3.16
2015-06-26 16:33:04.553 INF 0 [CORE:thread-23885] Bringing >>> AgentServer <<< to life...
2015-06-26 16:33:04.553 INF 0 [CORE:thread-23885] Bringing >>> ThreadPool <<< to life...
2015-06-26 16:33:04.553 INF 0 [ThreadPool:thread-23887] starting a thread worker.
2015-06-26 16:33:04.553 INF 0 [ThreadPool:thread-23888] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [ThreadPool:thread-23889] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [ThreadPool:thread-23890] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [ThreadPool:thread-23891] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [AgentServer:thread-23885] Detected xpd [23863] on port 21002
2015-06-26 16:33:04.554 INF 0 [AgentServer:thread-23885] starting a monitor
2015-06-26 16:33:04.557 INF 0 [AgentServer:thread-23885] Entering into the main 'select' loop...
2015-06-26 16:34:30.740 INF 0 [AgentServer:thread-23885] Accepting the connetion from PoD UI: grid201.kfki.hu:43627
2015-06-26 16:34:30.740 INF 0 [AgentServer:thread-23885] Client requests a list of available workers.
2015-06-26 16:34:30.740 INF 0 [AgentServer:thread-23885] Client grid201.kfki.hu:43627 has just dropped the connection
2015-06-26 16:34:35.193 INF 0 [AgentServer:thread-23885] Accepting the connetion from PoD UI: grid201.kfki.hu:43628
2015-06-26 16:34:35.193 INF 0 [AgentServer:thread-23885] Client requests a list of available workers.
2015-06-26 16:34:35.193 INF 0 [AgentServer:thread-23885] Client grid201.kfki.hu:43628 has just dropped the connection

Do you have any idea what could be the issue here?

Best regards,
Andras Hazi
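
The pod-server start output above lists the two ports that a worker has to reach on the master. A minimal sketch that extracts them from that output, for example to feed a connectivity check run from grid202; pod_port is a hypothetical helper name, and the line format is assumed to match the output shown in this report.

```shell
# Hypothetical helper: read `pod-server start` output on stdin and print
# the port reported for the service named in $1 ("XPROOFD" or "PoD agent").
pod_port() {
    sed -n "s/^$1 \[[0-9]*\] port: \([0-9][0-9]*\).*/\1/p"
}

# Sample lines copied from the report above.
sample='XPROOFD [23486] port: 21002
PoD agent [23508] port: 22001'

printf '%s\n' "$sample" | pod_port "XPROOFD"
printf '%s\n' "$sample" | pod_port "PoD agent"
```

With the sample lines this prints 21002 and then 22001. Whether a firewall between the two nodes blocks these ports is not something the log above can answer.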

Crash when connecting to pod-remote workers

Hi all--

Not sure if this is the appropriate place to write, but I thought I'd try.

I'm trying to set up PoD to replace our monolithic PROOF daemon at SLAC in order to improve the stability and configurability of our cluster. We are using ROOT 5.34/05 (I know, it's old, but I'm stuck with it for now) and PoD 3.14. I'm using pod-ssh to set up connections to 35 worker nodes, with between 7 and 11 workers each. If I use the interactive machine to start the pod-agent and then run the pod-ssh commands, everything works very nicely, and I can connect to all the workers with my PROOF jobs with no problems.

I'm now trying to use pod-remote, so that the PROOF master server doesn't have to sit on the interactive machine but can instead live on a node we have reserved for this purpose. I've followed the very nice instructions in the manual and can successfully start up the server (with the pod-remote --start and pod-remote --command "pod-ssh -c /u/at/swiatlow/myPoD/PoD-ssh.cfg submit" commands). pod-info returns reasonable output, such as:

myPoD$ pod-info -cnsd
PoD Server Type: remote (managed by pod-remote)
XPROOFD [17562] port: 21001
PoD agent [17585] port: 22001
PoD agent port is forwarded via local port: 22001
XPROOFD port is forwarded via local port: 21003
swiatlow@localhost:21003
325

However, when I try to connect to the PROOF daemon in root, I am getting crashes:

root [0] TProof pod("swiatlow@localhost:21003")
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: 27 out of 325 (8 %)
| session: swiatlow.default.30751.status terminated by peer
Info in TXSlave::HandleError: 0x12773f0:localhost.slac.stanford.edu:0 got called ... fProof: 0x11a06f0, fSocket: 0x127ec00 (valid: 1)
Info in TXSlave::HandleError: 0x12773f0: proof: 0x11a06f0
TXSlave::HandleError: 0x12773f0: DONE ...

I get the same error if I use a connection string directly to the proof daemon on the server (swiatlow@atlprf01:21001)-- the connection looks good to start and then dies at exactly worker 27.

I've dug through many different logs (both of workers and the master) without particular success in identifying the issue. It'd be fantastic if I could get some insight on where to look further, or how to fix this.

Thank you very much!

Best,
Max

CERN lsf and PoD submissions

PoD submissions fail on lxbatch because it uploads the interactive user environment and, in addition, inserts /usr/lib64 in the LD_LIBRARY_PATH. As a result, if a ROOT / PoD version compiled with an external compiler (e.g. gcc 4.7 or 4.8) is used, it fails because the wrong libstdc++ is picked up first in the path, from /usr/lib64.

LSF allows suppressing the upload of the interactive user environment to the batch job with a -L option. Can this be configured in .PoD/PoD.cfg (I don't see it in the documentation or the cfg file), or can another mechanism be devised to overcome this?

Thanks !

regards,
Asoka
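
One possible workaround on the worker side is to drop /usr/lib64 from LD_LIBRARY_PATH before ROOT starts, so that the libstdc++ of the external compiler is found first. This is a minimal sketch: strip_path_entry is a hypothetical helper name and the sample path is invented for the demonstration. Whether a file such as ~/.PoD/user_worker_env.sh (mentioned in another report on this page) is sourced late enough for this to take effect on lxbatch is an assumption that would need checking.

```shell
# Hypothetical helper: print the colon-separated list $1 with every entry
# equal to $2 removed. Entries are assumed not to contain glob characters.
strip_path_entry() {
    result=""
    old_ifs=$IFS
    IFS=:
    for entry in $1; do
        [ "$entry" = "$2" ] && continue
        result="${result:+$result:}$entry"
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Demonstration on an invented sample value.
strip_path_entry "/usr/lib64:/opt/gcc47/lib64:/usr/lib64" /usr/lib64
```

In a worker environment script the result would be assigned back, for example with LD_LIBRARY_PATH=$(strip_path_entry "${LD_LIBRARY_PATH:-}" /usr/lib64) followed by export LD_LIBRARY_PATH.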
