Comments (19)
I added a screenshot of one of the students here. It is stuck at waiting.
hmm - this happens at the login page before they actually log in? Or after they log in?
I just tried myself and it worked fine, maybe this is something only affecting some users?
It only affects some users, but I have quite a few that are complaining now. I can also log in fine, so maybe it is linked to certain nodes.
it does look like there was some kind of activity bump here in the logs:
Yup, at 15.50 the students all start at the same time, because it is right after the lecture. On Monday, Wednesday, and Thursday we will have this spike; on Tuesdays it is split in two.
that's good to know - so they'll all log-on at the same time on a regular basis?
for the ones whose sessions are frozen, they can't reload the page or anything? It just stays stuck there?
Yup, we have scheduled slots Monday, Wednesday, Thursday 15.50-17.20. On Tuesday half of the people are scheduled from 14.00-15.30 and the other half from 15.50-17.20. At the moment certain students have already been stuck for more than 20 minutes, unable to log in.
so when they go back to the log-in page (e.g. from an incognito window, or via force-refreshing the browser), it just goes back to the blank page?
also how many users have this problem? I see 5 failed user pods, so that could be the ones that are reporting the issue, is it more than this?
One of the students rebooted his computer and still has the same problem. I will provide the user account over email.
it's probably not related to the student computers themselves, but something getting gunked up at the interface of logging in -> being moved to their session.
Is there something we can do on our side to make this better?
no I think it is just a bug to iron out given that these are young cloud deployments...it seems like in this case, the cluster is hitting memory limits, probably some users are using more memory than expected. But this will be easier to solve by bumping the memory so let me try this now.
so a breakdown of what I think is going on:
- Currently for the WUR hub, we set memory guarantees of 512MB, and limits of 2GB.
- This makes for more efficient use of most nodes, and gives us a little wiggle-room since most users aren't using all of their memory allotment at the same time.
- In this case, however, I think all of the students used up their memory all at once, and as a result the node memory got saturated before a "scale-up" event occurred.
- So in the short term I have set the memory "guarantee" to be the same as the "limit": https://github.com/2i2c-org/pilot-hubs/pull/88/files. This will ensure that we don't run into the same issue where a bunch of users using up RAM all at once overloads the system.
- Longer term we should build some safety guards to make sure that we never hit the RAM limit of a node under situations like this.
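
To make the breakdown above concrete, here is a rough back-of-the-envelope sketch in Python. Only the 512MB guarantee and 2GB limit come from this thread; the 16GB node size and the helper names are hypothetical, and the model ignores memory reserved for the system and the hub itself. It assumes the scheduler packs users onto a node by their guarantee, which is how Kubernetes treats memory requests.

```python
# Illustrative only: NODE_MEMORY_MB is an assumed node size, not a value
# from this thread. The guarantee and limit are the ones quoted above.
NODE_MEMORY_MB = 16 * 1024
GUARANTEE_MB = 512
LIMIT_MB = 2 * 1024


def users_scheduled_per_node(node_mb: int, guarantee_mb: int) -> int:
    """How many users fit on a node when packing by guaranteed memory."""
    return node_mb // guarantee_mb


def worst_case_demand_mb(n_users: int, limit_mb: int) -> int:
    """Total memory demanded if every user reaches their limit at once."""
    return n_users * limit_mb


def is_overcommitted(node_mb: int, guarantee_mb: int, limit_mb: int) -> bool:
    """True if a fully packed node can be asked for more memory than it has."""
    n_users = users_scheduled_per_node(node_mb, guarantee_mb)
    return worst_case_demand_mb(n_users, limit_mb) > node_mb


if __name__ == "__main__":
    # Original settings: guarantee well below limit.
    print(is_overcommitted(NODE_MEMORY_MB, GUARANTEE_MB, LIMIT_MB))
    # Short-term fix: guarantee set equal to limit.
    print(is_overcommitted(NODE_MEMORY_MB, LIMIT_MB, LIMIT_MB))
```

Under these assumed numbers, the original settings allow far more users per node than the node could serve if they all hit their limit together, while setting the guarantee equal to the limit removes that gap at the cost of fewer users per node.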
Alright, that seems true indeed. Students do "restart and run all" all the time, in order to get back to the end of their assignment and continue working.
yeah - if students hit their limit then their kernel will restart (that's typically how Jupyter handles this)...the issue here is that enough students got close to their limit all at once that they out-paced the hub's ability to scale up
btw @Chiil this should now be resolved, can you confirm that students have it working?
(I know this is annoying to you but congratulations on being 2i2c's first incident report! 🎉 these kinds of events will become more and more rare as we find these edge cases and work out the bugs in the infrastructure)
That is a remarkable honour 👍 Thanks for helping out so quickly, the students are happy again.