Comments (14)
May I know why you need the heartbeat callback? I thought it was only used for connection keep-alive, so there is no need to expose this detail to the end user.
from pymesos.
I see, but the connection is not kept alive in my case (the Mesos master puts the framework in a disconnected state), so it is up to the scheduler to reconnect to the master. If pymesos did this transparently, that would be great; otherwise the callback should be exposed to the scheduler.
from pymesos.
Connection should be kept alive transparently by pymesos. Can you give us some more information about this case?
from pymesos.
Sure, I got the following log from a disconnection.
`WARNING:kazoo.client:Connection dropped: socket connection broken
WARNING:kazoo.client:Transition to CONNECTING
WARNING:kazoo.client:Session has expired
WARNING:main:disconnected
ERROR:pymesos.process:Thread abort:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/pymesos/process.py", line 308, in _run
if not conn.read():
File "/usr/local/lib/python3.5/site-packages/pymesos/process.py", line 106, in read
logger.warnig('Master is not available, retry.')
AttributeError: 'Logger' object has no attribute 'warnig'
WARNING:main:disconnected
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/websocket/_socket.py", line 80, in recv
bytes_ = sock.recv(bufsize)
socket.timeout: timed out`
Is the `warnig` typo causing any issue?
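For context, a minimal sketch of why a misspelled logger method matters: the call raises `AttributeError` before any retry logic can run. The `worker` function and `outcome` dict are hypothetical names for illustration, not pymesos code. In the traceback above nothing appears to catch the exception, which is presumably why the thread aborts.

```python
import logging
import threading

logger = logging.getLogger("demo")
outcome = {}


def worker():
    try:
        # Misspelled on purpose, mirroring the call shown in the traceback.
        logger.warnig("Master is not available, retry.")
        outcome["status"] = "retrying"  # never reached
    except AttributeError as exc:
        # Caught here only to record what happened; an uncaught
        # exception would end the thread instead.
        outcome["status"] = "aborted"
        outcome["error"] = str(exc)


t = threading.Thread(target=worker)
t.start()
t.join()
print(outcome["status"])
```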
from pymesos.
WARNING:kazoo.client:Connection dropped: socket connection broken
WARNING:kazoo.client:Transition to CONNECTING
WARNING:kazoo.client:Session has expired
WARNING:main:disconnected
It seems you failed to connect to the ZooKeeper cluster. Are the ZooKeeper addresses right? Have you tried connecting directly to the Mesos master without ZooKeeper?
from pymesos.
And the typo has been fixed in the master branch.
from pymesos.
The ZooKeeper address is correct, and it works fine for 20-30 minutes before this issue occurs. Connecting directly to the Mesos master will probably work, but we need ZooKeeper anyway for a highly available Mesos master. Even if ZooKeeper is not the cause, a disconnection could happen because of a network partition, and the framework should be able to detect such scenarios and reregister with the Mesos master. I feel that exposing the heartbeat callback can help the framework deal with such cases.
Network partitions
In the case of a network partition, the subscription connection between the scheduler and master might not necessarily break. To be able to detect this scenario, master periodically (e.g., 15s) sends HEARTBEAT events (similar to Twitter’s Streaming API). If a scheduler doesn’t receive a bunch (e.g., 5) of these heartbeats within a time window, it should immediately disconnect and try to resubscribe. It is highly recommended for schedulers to use an exponential backoff strategy (e.g., up to a maximum of 15s) to avoid overwhelming the master while reconnecting. Schedulers can use a similar timeout (e.g., 75s) for receiving responses to any HTTP requests.
source: http://mesos.apache.org/documentation/latest/scheduler-http-api/
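The recommendation quoted above could be sketched roughly as follows. This is not pymesos API: `HeartbeatWatchdog` and `backoff_delays` are hypothetical names, and the defaults (15s interval, 5 missed heartbeats, 15s backoff cap) are simply the example values from the quoted documentation. Random jitter on the delay is an extra assumption, not something the quote asks for.

```python
import random
import time


class HeartbeatWatchdog:
    """Tracks when the last HEARTBEAT arrived (hypothetical helper)."""

    def __init__(self, interval=15.0, max_missed=5, clock=time.monotonic):
        self.interval = interval      # expected seconds between heartbeats
        self.max_missed = max_missed  # heartbeats that may be missed
        self._clock = clock
        self._last_seen = clock()

    def beat(self):
        """Call this whenever a HEARTBEAT event is received."""
        self._last_seen = self._clock()

    def expired(self):
        """True once more than interval * max_missed seconds passed silently."""
        return self._clock() - self._last_seen > self.interval * self.max_missed


def backoff_delays(base=1.0, cap=15.0, jitter=random.random):
    """Yield exponentially growing resubscribe delays, capped at `cap` seconds."""
    attempt = 0
    while True:
        ceiling = min(cap, base * (2 ** attempt))
        # Scale by a random factor in [0, 1) so many schedulers
        # do not all retry at the same moment.
        yield ceiling * jitter()
        # Stop growing the exponent once it is far beyond the cap.
        attempt = min(attempt + 1, 32)
```

A scheduler loop would call `beat()` on each HEARTBEAT, check `expired()` periodically, and on expiry drop the connection and sleep for `next(delays)` seconds before each resubscribe attempt.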
from pymesos.
WARNING:kazoo.client:Connection dropped: socket connection broken
According to the log, the ZooKeeper connection broke first, which led to master re-selection and a reconnect.
The bug should be fixed in master, so if you are using the master version, the reconnection should succeed.
from pymesos.
Can you please release the current master as a new version, so that it can be installed with pip install?
from pymesos.
cf #65
from pymesos.
0.2.12 released
from pymesos.
The fix does not seem to solve the issue. I have some Mesos logs for it.
6acebb474b1-0000 (framework_ABC)
mesos_1 | I0411 00:07:46.511173 12 master.cpp:6517] Sending 1 offers to framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)
mesos_1 | I0411 00:07:46.526757 6 master.cpp:1297] Framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC) disconnected
mesos_1 | I0411 00:07:46.526831 6 master.cpp:2902] Disconnecting framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)
mesos_1 | I0411 00:07:46.526860 6 master.cpp:2926] Deactivating framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)
mesos_1 | W0411 00:07:46.527034 6 master.hpp:2266] Master attempted to send message to disconnected framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)
mesos_1 | I0411 00:07:46.527235 11 hierarchical.cpp:386] Deactivated framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000
mesos_1 | W0411 00:07:46.527693 6 master.hpp:2272] Unable to send event to framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC): connection closed
mesos_1 | I0411 00:07:46.528533 6 master.cpp:1310] Giving framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC) 1weeks to failover
mesos_1 | 2017-04-11 00:08:27,140:1(0x7feb8329c700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 53ms
The connection to the framework is lost. PyMesos should try to reestablish it by reregistering, but that does not seem to happen. Can you point me to the code where reregistration happens when the connection is lost, or for any other reason?
from pymesos.
These are logs from mesos. What about logs of the framework?
from pymesos.
Sorry, I could not get back to you earlier.
The framework has no unusual logs.
Framework log (last few lines):
mlengine_1 | INFO:__main__:resourceOffers offers=[{'executor_ids': [{'value': 'Executor'}], 'url': {'path': '/slave(1)', 'address': {'hostname': '10.1.1.200', 'port': 5051, 'ip': '10.1.1.200'}, .....,{'executor_ids': [{'value': 'Executor'}], 'url': {'path': '/slave(1)', 'address': {'hostname': '10.1.1.201' ...... 'ranges': {'range': [{'end': 32000, 'begin': 31000}]}, 'role': '*'}]}]
mlengine_1 | INFO:__main__:Declined offer 792d236e-1d4a-493f-aba2-92ecd6b20b16-O8
mlengine_1 | INFO:__main__:Declined offer 792d236e-1d4a-493f-aba2-92ecd6b20b16-O9
Corresponding Mesos Log:
mesos_1 | I0419 18:10:40.164337 12 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:53796
mesos_1 | I0419 18:10:40.165478 12 master.cpp:4505] Processing DECLINE call for offers: [ cad12a2d-9dbd-4330-b856-7ed233ac0c05-O222 ] for framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC)
mesos_1 | I0419 18:10:40.171816 12 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:53796
mesos_1 | I0419 18:10:40.172338 12 master.cpp:4505] Processing DECLINE call for offers: [ cad12a2d-9dbd-4330-b856-7ed233ac0c05-O223 ] for framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC)
mesos_1 | I0419 18:10:41.407927 7 http.cpp:391] HTTP GET for /master/state from 172.18.0.1:53840 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
mesos_1 | I0419 18:10:45.138394 8 master.cpp:1297] Framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) disconnected
mesos_1 | I0419 18:10:45.139780 8 master.cpp:2902] Disconnecting framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC)
mesos_1 | I0419 18:10:45.139955 8 master.cpp:2926] Deactivating framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC)
mesos_1 | I0419 18:10:45.140123 8 master.cpp:1310] Giving framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) 1weeks to failover
from pymesos.