
future-kubernetes's Issues

The localhost socket connection that failed to connect to the R worker used port 11562

Thanks for your Helm chart and documentation. I was able to follow the instructions to install the RStudio server and workers without any problem. However, when I ran plan(cluster, manual = TRUE, quiet = TRUE), I got the following error after a 120-second timeout:

Error in socketConnection(localhostHostname, port = port, server = TRUE,  : 
  Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘future-scheduler-76594f99f4-db96z’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11562 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpBQzoyE/worker.rank=2.parallelly.parent=337.15179347101.pid")), silent = TRUE)' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential.
 * Failed to kill local

I did get the correct number of workers when I tried nbrOfWorkers(). I am pretty new to RStudio so maybe I am missing something obvious?
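For what it's worth, with manual = TRUE the scheduler only listens; the workers have to be launched by hand before the connection timeout expires. A hedged sketch of simply raising that limit, assuming plan() passes the argument through to parallelly::makeClusterPSOCK() as the error message suggests:

```r
# Sketch (not a confirmed fix): allow more than the default 120 s to start
# the workers by hand. 'connectTimeout' is the argument named in the error
# message above; whether plan() forwards it is an assumption here.
library(future)
plan(cluster, manual = TRUE, quiet = TRUE,
     connectTimeout = 10 * 60)  # wait up to 10 minutes for workers
```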

Recovering from OOMkill pod failures/evictions

One critical piece that I think makes this challenging to use at larger scale is that R is a garbage-collected language.

There are a number of odd situations, especially when reading or writing files, where memory that ought to be garbage collected keeps growing and is never reclaimed. We discussed this a little in the future repository. Henrik suggested the callr plan, which works extremely well when you're working on a single computer, but it is incompatible with the setup command specified in the future-kubernetes Helm chart.

I've been thinking about a number of alternative approaches:

  • Find a way to restart the R process when it finishes on a pod, before the next iteration.
  • Instead of setting up the cluster via the Helm chart, use an SSH-based cluster, distributing tasks over SSH from within your primary parallel loop.

Do you have any thoughts on how one might approach this?
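One pattern along the lines of the first bullet, sketched under the assumption that re-running plan() re-launches the worker R processes. get_kube_workers() is from this repo; the batching itself is made up:

```r
# Sketch: tear down and rebuild the cluster between batches so each batch
# starts with fresh worker R processes, releasing any memory they leaked.
library(future)
library(future.apply)

batches <- split(seq_len(1000), rep(1:10, each = 100))  # made-up work units
results <- vector("list", length(batches))
for (i in seq_along(batches)) {
  plan(cluster, workers = get_kube_workers())  # launch fresh workers
  results[[i]] <- future_lapply(batches[[i]], function(x) x^2)
  plan(sequential)                             # shut the workers down
}
```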

Known error with kubectl port-forward

Hello, and thank you for this easy-to-follow guide. I want to leave a quick note for passersby:

I was experiencing a problem when trying to port forward the RStudio session:

E0617 13:06:27.613454  492930 portforward.go:340] error creating error stream for port 8787 -> 8787: Timeout occurred
E0617 13:06:28.010093  492930 portforward.go:362] error creating forwarding stream for port 8787 -> 8787: Timeout occurred
E0617 13:06:45.389702  492930 portforward.go:340] error creating error stream for port 8787 -> 8787: Timeout occurred

This appears to be a known issue: kubernetes/kubernetes#74551

Solution:

You can start the port-forward without specifying the host port, allowing kubectl to decide:

kubectl port-forward --namespace default svc/future-scheduler :8787 

kubectl will set the port for you:

Forwarding from 127.0.0.1:38491 -> 8787
Forwarding from [::1]:38491 -> 8787
Handling connection for 38491
Handling connection for 38491
Handling connection for 38491
Handling connection for 38491
Handling connection for 38491

Then navigate to that address in your local browser (in this case, localhost:38491 / 127.0.0.1:38491).

NB: the issue linked above mentions that restarting port-forward with an explicit port, after first running it without one, works as expected. I have not tested this.

Additional notes about using AWS ECR and extending with secrets/private code

Hello @paciorek,

Would you welcome a PR that adds sections on:

a) Using AWS ECR
b) Extending the Dockerfile to deal with secrets and private GitHub repositories

As this repository is largely an "article" of sorts, perhaps you would prefer these sections written as standalone pieces in an issue, so that you can integrate them as you please? Some other elements of the article might be better presented in a question-and-answer format (like an FAQ); perhaps these would fit that mold as well.

R, parallel, and kubernetes on PRP-Nautilus

Chris,

Your code and documentation for using R and Kubernetes are useful in a variety of contexts.

If you want to experiment with more of a research/education cluster, rather than a commercial cloud provider, you might want to investigate the PRP-Nautilus cluster. UC Berkeley most certainly has access to that (academics-only) cluster. Ryan in your Dept./College or someone else on campus should be able to set you up with a Nautilus namespace. After that, you can create k8s pods, etc. If you can't get access, I can create a test namespace for you too (I have admin rights on Nautilus).

The more we test and document our efforts with Kubernetes and R, the better off everyone will be. Also, since there are no chargebacks for anything, you can run any number of tests (including deep-learning, Bayesian, big-data, and GPU/CUDA tests and timings) without worrying about any charges.

Again, thanks for your good work with parallel computing and R. Computation and storage demands are only growing.

Best,

Wayne Smith ([email protected])
CSU Northridge

paciorek/future: Argument 'rshpostopts'

I'm looking at your fork https://github.com/paciorek/future, but it's not clear to me which branch I'm meant to look at. For instance, I cannot find get_kube_workers(). Anyway, looking at HenrikBengtsson/future@develop...paciorek:develop, I see that you added a new argument rshpostopts to future::makeNodePSOCK() such that you could append it as:

rsh_call <- paste(paste(shQuote(rshcmd), collapse = " "), rshopts, worker, rshpostopts)

For instance, with an SSH connection that would allow you to launch the remote worker as:

ssh remote.server.org <additional options> "'/usr/lib/R/bin/Rscript' ..."

However, you might be able to achieve the same via the rscript argument. For example, from ?makeClusterPSOCK I use the following to launch Rscript inside a Singularity container:

cl <- makeClusterPSOCK(
  rep("localhost", times = 2L),
  ## Launch Rscript inside a Singularity container (from a Docker image)
  rscript = c(
    "singularity", "exec", "docker://rocker/r-parallel",
    "Rscript"
  ),
  dryrun = TRUE
)

Could you do the same for your needs? I guess I need to see the output of your:

makeClusterPSOCK(..., manual = TRUE)

to understand what the underlying system call looks like.
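A possibly equivalent route for the Kubernetes setup, in the same spirit as the Singularity example, would be to wrap the launch in kubectl exec via the rscript argument. Purely a sketch: the pod name is invented, and dryrun = TRUE means nothing is actually launched:

```r
# Hypothetical: have makeClusterPSOCK() launch Rscript inside a worker pod
# via 'kubectl exec' instead of ssh; dryrun = TRUE only prints the call.
library(parallelly)
cl <- makeClusterPSOCK(
  "future-worker-0",  # invented pod name
  rscript = c("kubectl", "exec", "future-worker-0", "--", "Rscript"),
  dryrun = TRUE
)
```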

ssh missing in scheduler

Hello,

Thank you for posting this example using the future package in combination with Kubernetes. However, I get this error when using your Helm chart right out of the box in minikube:

 plan(cluster, workers = get_kube_workers_ip(), manual = TRUE, quiet = TRUE)
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  : 
  Failed to launch and connect to R worker on remote machine ‘172.17.0.5’ from local machine ‘future-scheduler-fc8f4d5d4-vj6j2’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11562 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: 'ssh' -R 11562:localhost:11562 172.17.0.5 "'/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE".
 * Troubleshooting suggestions:
   - Suggestion #1: Set 'verbose=TRUE' to see more details.
   - Suggestion #2: Set 'outfile=NULL' to see output from worker
In addition: Warning messages:
1: In find_rshcmd(which = which, must_work = !localMachine && !manual &&  :
  Failed to locate a default SSH client (checked: ‘ssh’). Please specify one via argument 'rshcmd'. Will still try with ‘ssh’.
2: In find_rshcmd(which = which, must_work = !localMachine && !manual &&  :
  Failed to locate a default SSH client (checked: ‘ssh’). Please specify one via argument 'rshcmd'. Will still try with ‘ssh’.
3: In find_rshcmd(which = which, must_work = !localMachine && !manual &&  :
  Failed to locate a default SSH client (checked: ‘ssh’). Please specify one via argument 'rshcmd'. Will still try with ‘ssh’.

As get_kube_workers() gave me errors about not having access to kubectl, I wrote a new function, get_kube_workers_ip(), that returns the workers' IPs; that is just a workaround on my side.
Nevertheless, I got an error pointing out that ssh seems to be missing. After installing ssh on the scheduler, at least your example works:


 plan(cluster, workers = get_kube_workers_ip(), manual = TRUE, quiet = TRUE, verbose = TRUE)
[local output] Workers: [n = 4] ‘172.17.0.4’, ‘172.17.0.5’, ‘172.17.0.7’, ‘172.17.0.8’
[local output] Base port: 11562
[local output] Creating node 1 of 4 ...
[local output] - setting up node
[local output] - attempt #1 of 3
[local output] Will search for all 'rshcmd' available

[local output] Found the following available 'rshcmd':
[local output]  1. ‘/usr/bin/ssh’ [type=‘ssh’, version=‘OpenSSH_8.2p1 Ubuntu-4ubuntu0.2, OpenSSL 1.1.1f  31 Mar 2020’]
[local output] Using 'rshcmd': ‘/usr/bin/ssh’ [type=‘ssh’, version=‘OpenSSH_8.2p1 Ubuntu-4ubuntu0.2, OpenSSL 1.1.1f  31 Mar 2020’]
[local output] Waiting for worker #1 on ‘172.17.0.4’ to connect back
[local output] Connection with worker #1 on ‘172.17.0.4’ established
[local output] - collecting session information
[local output] Creating node 1 of 4 ... done
[local output] Creating node 2 of 4 ...
[local output] - setting up node
[local output] - attempt #1 of 3
[local output] Will search for all 'rshcmd' available

[local output] Found the following available 'rshcmd':
[local output]  1. ‘/usr/bin/ssh’ [type=‘ssh’, version=‘OpenSSH_8.2p1 Ubuntu-4ubuntu0.2, OpenSSL 1.1.1f  31 Mar 2020’]
[local output] Using 'rshcmd': ‘/usr/bin/ssh’ [type=‘ssh’, version=‘OpenSSH_8.2p1 Ubuntu-4ubuntu0.2, OpenSSL 1.1.1f  31 Mar 2020’]
[local output] Waiting for worker #2 on ‘172.17.0.5’ to connect back
[local output] Connection with worker #2 on ‘172.17.0.5’ established
[local output] - collecting session information
[local output] Creating node 2 of 4 ... done
[local output] Creating node 3 of 4 ...
[local output] - setting up node
[local output] - attempt #1 of 3
[local output] Waiting for worker #3 on ‘localhost’ to connect back
[local output] Connection with worker #3 on ‘localhost’ established
[local output] - collecting session information
[local output] Creating node 3 of 4 ... done
[local output] Creating node 4 of 4 ...
[local output] - setting up node
[local output] - attempt #1 of 3
[local output] Will search for all 'rshcmd' available

[local output] Found the following available 'rshcmd':
[local output]  1. ‘/usr/bin/ssh’ [type=‘ssh’, version=‘OpenSSH_8.2p1 Ubuntu-4ubuntu0.2, OpenSSL 1.1.1f  31 Mar 2020’]
[local output] Using 'rshcmd': ‘/usr/bin/ssh’ [type=‘ssh’, version=‘OpenSSH_8.2p1 Ubuntu-4ubuntu0.2, OpenSSL 1.1.1f  31 Mar 2020’]
[local output] Waiting for worker #4 on ‘172.17.0.8’ to connect back
[local output] Connection with worker #4 on ‘172.17.0.8’ established
[local output] - collecting session information
[local output] Creating node 4 of 4 ... done


Great example, thank you.
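For reference, the missing client can be baked into the scheduler image rather than installed by hand after the fact. A hypothetical fix, assuming a Debian/Ubuntu-based image (as the rocker images are); this would typically go in a RUN line of the scheduler's Dockerfile:

```shell
# Hypothetical: install the ssh client the scheduler uses to launch remote
# workers; assumes a Debian/Ubuntu base image with apt available.
apt-get update && apt-get install -y --no-install-recommends openssh-client
```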

Issues with newer versions of R/Rstudio

Problem 1: Non-existent function parallel:::.slaveRSOCK()

Using the latest rocker/tidyverse, I received an error:

Error: object '.slaveRSOCK' not found. 

That's because the internal function parallel:::.slaveRSOCK() is no longer in the parallel package, and there is no way to downgrade parallel on its own, since it is part of base R.

Solution: downgrade R, or copy/paste the function into .Rprofile?

Problem 2: Rprofile.site can't bind functions

ERROR:Cannot bind value of "setup_kube" into base environment

This one appears to be related to using Rprofile.site. I used /home/rstudio/.Rprofile instead, and I believe that resolved the issue.
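For Problem 1, newer versions of future/parallelly already guard against the rename, as the worker launch calls quoted earlier in this thread show. The same version-agnostic lookup can be used directly:

```r
# parallel:::.slaveRSOCK was renamed parallel:::.workRSOCK in newer R;
# pick whichever internal worker entry point this R version provides.
workRSOCK <- tryCatch(parallel:::.workRSOCK,
                      error = function(e) parallel:::.slaveRSOCK)
stopifnot(is.function(workRSOCK))
```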
