Comments (14)

windreamer avatar windreamer commented on July 20, 2024

May I know why you need the heartbeat callback? I thought it was only used for connection keep-alive, so there is no need to expose this detail to end users.

from pymesos.

saitejar avatar saitejar commented on July 20, 2024

I see, but in my case the connection is not kept alive (the Mesos master puts the framework in a disconnected state), so it is up to the scheduler to reconnect to the master. If pymesos did this transparently, that would be great; otherwise the callback should be exposed to the scheduler.

windreamer avatar windreamer commented on July 20, 2024

Connection should be kept alive transparently by pymesos. Can you give us some more information about this case?

saitejar avatar saitejar commented on July 20, 2024

Sure, I got the following log from a disconnection.

`WARNING:kazoo.client:Connection dropped: socket connection broken
WARNING:kazoo.client:Transition to CONNECTING
WARNING:kazoo.client:Session has expired
WARNING:main:disconnected
ERROR:pymesos.process:Thread abort:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/pymesos/process.py", line 308, in _run
    if not conn.read():
  File "/usr/local/lib/python3.5/site-packages/pymesos/process.py", line 106, in read
    logger.warnig('Master is not available, retry.')
AttributeError: 'Logger' object has no attribute 'warnig'
WARNING:main:disconnected

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/websocket/_socket.py", line 80, in recv
    bytes_ = sock.recv(bufsize)
socket.timeout: timed out`

Is the `warnig` typo causing any issue?
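For context, here is a minimal sketch of why a misspelled logger method can matter. It is not pymesos code: `read_with_typo`, `read_fixed`, and `run_loop` are hypothetical names, and the loop only imitates a worker thread that stops once an exception escapes the read call.

```python
import logging
import threading

logger = logging.getLogger("demo")
logger.addHandler(logging.NullHandler())  # keep the demo quiet


def read_with_typo():
    # Misspelled method: logging.Logger has no attribute 'warnig',
    # so this line raises AttributeError instead of logging.
    logger.warnig("Master is not available, retry.")


def read_fixed():
    # Correct spelling: the message is logged and execution continues.
    logger.warning("Master is not available, retry.")


def run_loop(read, results):
    # Hypothetical worker body: record whether the read step survived.
    try:
        read()
        results.append("ok")
    except AttributeError as exc:
        results.append("aborted: %s" % exc)


results = []
for fn in (read_with_typo, read_fixed):
    worker = threading.Thread(target=run_loop, args=(fn, results))
    worker.start()
    worker.join()

print(results)
```

In this sketch the first worker ends in the error path and the second completes, which is consistent with the `Thread abort` line in the log above: the retry message is never logged and the exception propagates instead.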

windreamer avatar windreamer commented on July 20, 2024
WARNING:kazoo.client:Connection dropped: socket connection broken
WARNING:kazoo.client:Transition to CONNECTING
WARNING:kazoo.client:Session has expired
WARNING:main:disconnected

It seems you failed to connect to the ZooKeeper cluster. Are the ZooKeeper addresses correct? Have you tried connecting directly to the Mesos master without ZooKeeper?

windreamer avatar windreamer commented on July 20, 2024

And the typo has been fixed in the master branch.

saitejar avatar saitejar commented on July 20, 2024

The ZooKeeper address is correct, and everything works fine for 20-30 minutes before this issue occurs. Connecting directly to the Mesos master would probably work, but we need ZooKeeper anyway for a highly available Mesos master. Even if ZooKeeper is not the cause, a disconnection could happen because of a network partition, and the framework should be able to detect such scenarios and reregister with the Mesos master. I think exposing the heartbeat callback would help frameworks deal with such cases. The Mesos documentation says:

Network partitions: In the case of a network partition, the subscription connection between the scheduler and master might not necessarily break. To be able to detect this scenario, master periodically (e.g., 15s) sends HEARTBEAT events (similar to Twitter’s Streaming API). If a scheduler doesn’t receive a bunch (e.g., 5) of these heartbeats within a time window, it should immediately disconnect and try to resubscribe. It is highly recommended for schedulers to use an exponential backoff strategy (e.g., up to a maximum of 15s) to avoid overwhelming the master while reconnecting. Schedulers can use a similar timeout (e.g., 75s) for receiving responses to any HTTP requests.

source: http://mesos.apache.org/documentation/latest/scheduler-http-api/
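To make the quoted rule concrete, here is a minimal sketch of a heartbeat watchdog. It assumes the example values from the quote (15s interval, 5 missed heartbeats, so a 75s window). `HeartbeatWatchdog`, `beat`, and `expired` are hypothetical names, not part of pymesos or Mesos, and the clock is injectable only so the behavior can be checked without waiting.

```python
import time


class HeartbeatWatchdog:
    """Tracks HEARTBEAT events and reports when too many were missed."""

    def __init__(self, interval=15.0, max_missed=5, clock=time.monotonic):
        self.interval = interval      # seconds between expected heartbeats
        self.max_missed = max_missed  # heartbeats we tolerate losing
        self._clock = clock
        self._last = clock()          # treat creation time as the first beat

    def beat(self):
        # Call this whenever a HEARTBEAT (or any other) event arrives.
        self._last = self._clock()

    def expired(self):
        # True once the silence exceeds interval * max_missed seconds.
        return self._clock() - self._last > self.interval * self.max_missed


# Usage with a fake clock so the example runs instantly.
now = [0.0]
watchdog = HeartbeatWatchdog(clock=lambda: now[0])
now[0] = 74.0
print(watchdog.expired())  # still inside the 75s window
now[0] = 76.0
print(watchdog.expired())  # window exceeded: disconnect and resubscribe
```

A scheduler using something like this would call `beat()` from its event handler and poll `expired()` from a timer, which is why a heartbeat callback (or an equivalent internal hook) is needed somewhere.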

windreamer avatar windreamer commented on July 20, 2024

WARNING:kazoo.client:Connection dropped: socket connection broken

According to the log, the ZooKeeper connection broke first, which led to master re-selection and a reconnect.

The bug should be fixed in master, so if you are using the master version, the reconnection should succeed.

saitejar avatar saitejar commented on July 20, 2024

Could you please release the current master as a new version, so that `pip install` works?

windreamer avatar windreamer commented on July 20, 2024

cf #65

windreamer avatar windreamer commented on July 20, 2024

0.2.12 released

saitejar avatar saitejar commented on July 20, 2024

The fix does not seem to solve the issue. Here are some Mesos logs from the failure.

6acebb474b1-0000 (framework_ABC)  
mesos_1     | I0411 00:07:46.511173    12 master.cpp:6517] Sending 1 offers to framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)  

mesos_1     | I0411 00:07:46.526757     6 master.cpp:1297] Framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC) disconnected  

mesos_1     | I0411 00:07:46.526831     6 master.cpp:2902] Disconnecting framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)  

mesos_1     | I0411 00:07:46.526860     6 master.cpp:2926] Deactivating framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)  

mesos_1     | W0411 00:07:46.527034     6 master.hpp:2266] Master attempted to send message to disconnected framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC)  

mesos_1     | I0411 00:07:46.527235    11 hierarchical.cpp:386] Deactivated framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000  

mesos_1     | W0411 00:07:46.527693     6 master.hpp:2272] Unable to send event to framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC): connection closed  

mesos_1     | I0411 00:07:46.528533     6 master.cpp:1310] Giving framework 0b582846-83e0-4a49-9e48-d6acebb474b1-0000 (framework_ABC) 1weeks to failover  

mesos_1     | 2017-04-11 00:08:27,140:1(0x7feb8329c700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 53ms

The connection to the framework is lost. PyMesos should try to reestablish it by reregistering, but that does not seem to happen. Can you point me to the code where reregistration happens when the connection is lost, or for any other reason?
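To illustrate the behavior being asked for, here is a minimal sketch of a resubscribe loop with capped exponential backoff. It is not how pymesos implements reconnection: `subscribe` is a hypothetical callable that raises `ConnectionError` on failure, and the 1s base delay and attempt limit are assumptions. Only the 15s cap comes from the Mesos documentation quoted earlier in this thread.

```python
import time


def backoff_delays(base=1.0, cap=15.0, factor=2.0):
    """Yield delays that grow geometrically and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def resubscribe(subscribe, max_attempts=10, sleep=time.sleep):
    """Call `subscribe` until it succeeds or the attempts run out."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return subscribe()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(next(delays))


# Usage: a fake master that rejects the first three attempts.
attempts = {"count": 0}
waits = []


def fake_subscribe():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("master unavailable")
    return "subscribed"


# `waits.append` stands in for time.sleep so the demo does not block.
print(resubscribe(fake_subscribe, sleep=waits.append), waits)
```

Capping the delay keeps a long outage from pushing the retry interval far beyond the heartbeat window, while the growth in the first few attempts avoids hammering a master that is still recovering.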

windreamer avatar windreamer commented on July 20, 2024

These are logs from mesos. What about logs of the framework?

saitejar avatar saitejar commented on July 20, 2024

Sorry, I could not get back to you earlier.
The framework shows no unusual logs.

Framework log (last few lines):

mlengine_1  | INFO:__main__:resourceOffers offers=[{'executor_ids': [{'value': 'Executor'}], 'url': {'path': '/slave(1)', 'address': {'hostname': '10.1.1.200', 'port': 5051, 'ip': '10.1.1.200'}, .....,{'executor_ids': [{'value': 'Executor'}], 'url': {'path': '/slave(1)', 'address': {'hostname': '10.1.1.201' ...... 'ranges': {'range': [{'end': 32000, 'begin': 31000}]}, 'role': '*'}]}]

mlengine_1  | INFO:__main__:Declined offer 792d236e-1d4a-493f-aba2-92ecd6b20b16-O8

mlengine_1  | INFO:__main__:Declined offer 792d236e-1d4a-493f-aba2-92ecd6b20b16-O9

Corresponding Mesos log:

mesos_1     | I0419 18:10:40.164337    12 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:53796

mesos_1     | I0419 18:10:40.165478    12 master.cpp:4505] Processing DECLINE call for offers: [ cad12a2d-9dbd-4330-b856-7ed233ac0c05-O222 ] for framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) 

mesos_1     | I0419 18:10:40.171816    12 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:53796 

mesos_1     | I0419 18:10:40.172338    12 master.cpp:4505] Processing DECLINE call for offers: [ cad12a2d-9dbd-4330-b856-7ed233ac0c05-O223 ] for framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) 

mesos_1     | I0419 18:10:41.407927     7 http.cpp:391] HTTP GET for /master/state from 172.18.0.1:53840 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36' 

mesos_1     | I0419 18:10:45.138394     8 master.cpp:1297] Framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) disconnected 

mesos_1     | I0419 18:10:45.139780     8 master.cpp:2902] Disconnecting framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) 

mesos_1     | I0419 18:10:45.139955     8 master.cpp:2926] Deactivating framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) 

mesos_1     | I0419 18:10:45.140123     8 master.cpp:1310] Giving framework cad12a2d-9dbd-4330-b856-7ed233ac0c05-0000 (ABC) 1weeks to failover
