douban / pymesos
A pure python implementation of Mesos scheduler and executor
License: BSD 3-Clause "New" or "Revised" License
Hi, is there currently a way for the MesosSchedulerDriver to raise an error when the Mesos master is not available? It seems that it currently gets stuck in an endless loop of "Connection refused" errors once I call driver.run():
framework_1 | 2018-06-06 15:35:26.558|ERROR|Failed to send to ('mesos-master', 5050)
framework_1 | Traceback (most recent call last):
framework_1 | File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 536, in run
framework_1 | result = self._run(*self.args, **self.kwargs)
framework_1 | File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
framework_1 | self.__bootstrap_inner()
framework_1 | File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
framework_1 | self.run()
framework_1 | File "/usr/lib/python2.7/threading.py", line 754, in run
framework_1 | self.__target(*self.__args, **self.__kwargs)
framework_1 | File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 318, in _run
framework_1 | if not conn.write():
framework_1 | File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 76, in write
framework_1 | logger.exception('Failed to send to %s', self._addr)
framework_1 | File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 69, in write
framework_1 | sent = self._sock.send(self._request)
framework_1 | File "/usr/local/lib/python2.7/dist-packages/gevent/_socket2.py", line 323, in send
framework_1 | return sock.send(data, flags)
framework_1 | error: [Errno 111] Connection refused
I would like a way to catch the exception and handle it accordingly. Otherwise, something like a timeout would be good as well.
Hi @windreamer,
In commit 04501d1#diff-ab924b88fb1411cfeab6024b822161a8R95, a change was made to suppress heartbeat events to the framework. Was there a reason behind it?
At the moment there is no way to enable authentication for the executor using pymesos.
When executor authentication is enabled, the agent injects MESOS_EXECUTOR_AUTHENTICATION_TOKEN into the executor's environment. This token needs to be added to each request's headers in order to authenticate, like the following:
Authorization: Bearer MESOS_EXECUTOR_AUTHENTICATION_TOKEN
See http://mesos.apache.org/documentation/latest/authentication/
If you think this is an acceptable approach, I would be happy to provide a PR.
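For illustration, a minimal sketch of how an executor could pick up the injected token and attach it to request headers. The helper name and structure here are my own, not existing pymesos API:

```python
import os

def executor_auth_headers(extra=None):
    """Build request headers, adding a Bearer token when the agent has
    injected MESOS_EXECUTOR_AUTHENTICATION_TOKEN (hypothetical helper)."""
    headers = dict(extra or {})
    token = os.environ.get('MESOS_EXECUTOR_AUTHENTICATION_TOKEN')
    if token:
        headers['Authorization'] = 'Bearer ' + token
    return headers
```

The executor would then pass these headers on every request it sends to the agent.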
Running: docker run --rm -ti python:3.11 bash -c 'pip install pymesos'
gives the following error:
building 'http_parser.parser' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/http_parser
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Iparser -I/usr/local/include/python3.11 -c http_parser/http_parser.c -o build/temp.linux-x86_64-cpython-311/http_parser/http_parser.o
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Iparser -I/usr/local/include/python3.11 -c http_parser/parser.c -o build/temp.linux-x86_64-cpython-311/http_parser/parser.o
http_parser/parser.c:196:12: fatal error: longintrepr.h: No such file or directory
196 | #include "longintrepr.h"
| ^~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for http-parser
I think this is because of:
benoitc/http-parser#94
We need this fixed for DataBiosphere/toil#4318.
Any help would be appreciated, thanks.
Hi,
I might have stumbled on a small bug concerning serialization. It seems that when running Python 3.5, the encode_data method does not always work as expected, since it returns bytes while json expects string objects. For example, if the executor does not specify a uuid, the default assignment leads to JSON serialization errors.
Changing it to encode_data(uuid.uuid4().bytes).decode() solves this, and adding the decode does not seem to break the code for Python 2.
The same problem is also present in the examples folder.
This can be reproduced by:
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from binascii import b2a_base64, a2b_base64
>>> import json
>>> json.dumps(b2a_base64("a").decode())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
>>> json.dumps(b2a_base64(b"a").decode())
'"YQ==\\n"'
>>> json.dumps(b2a_base64(b"a"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.5/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/opt/anaconda3/lib/python3.5/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/anaconda3/lib/python3.5/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/opt/anaconda3/lib/python3.5/json/encoder.py", line 180, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'YQ==\n' is not JSON serializable
>>> json.dumps(b2a_base64(b"a").decode())
'"YQ==\\n"'
I am not sure if this is a bug or I am just misunderstanding something. If this is indeed a bug, adding a decode call to the encode method seems to fix the issue.
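For reference, a minimal sketch of the proposed fix (illustrative, not pymesos's actual code): have encode_data return a str so the result is JSON-serializable on both Python 2 and 3:

```python
from binascii import a2b_base64, b2a_base64

def encode_data(data):
    # Base64-encode, strip the trailing newline, and decode to str
    # so json.dumps accepts the result on Python 3.
    return b2a_base64(data).strip().decode('ascii')

def decode_data(data):
    # Accepts either str or bytes input.
    return a2b_base64(data)
```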
I'm not entirely sure if this is related to pymesos, but I'm seeing a case where, if I schedule lots of jobs that only need 0.2 cpu (0.1 for the task, 0.1 for the executor) on a 2-core agent, the resource offers supplied to the framework gradually contain fewer and fewer cpus until the jobs cannot be scheduled.
(As a note: I'm running Mesos on Debian, and launching jobs inside docker containers.)
For example, here is the first offer I receive (with hostname misc removed):
{'framework_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-0044'},
 'allocation_info': {'role': '*'},
 'id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-O185'},
 'resources': [
   {'role': '*', 'type': 'SCALAR', 'name': 'cpus', 'scalar': {'value': 2.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'mem', 'scalar': {'value': 12031.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'disk', 'scalar': {'value': 24987.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'RANGES', 'name': 'ports', 'ranges': {'range': [{'begin': 31000, 'end': 32000}]}, 'allocation_info': {'role': '*'}}],
 'agent_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-S0'}}
and then, after scheduling jobs, eventually my final resource offer looks like:
{'framework_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-0044'},
 'allocation_info': {'role': '*'},
 'id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-O192'},
 'resources': [
   {'role': '*', 'type': 'SCALAR', 'name': 'cpus', 'scalar': {'value': 0.1}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'mem', 'scalar': {'value': 32.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'disk', 'scalar': {'value': 32.0}, 'allocation_info': {'role': '*'}}],
 'agent_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-S0'}}
At this point, I checked the mesos master, which reports 2 cpus offered, and there is nothing else running. Notably, the mem and disk offers also seem to decrease. Is it possible that pymesos may not be relinquishing resources?
Hi, according to mesos.interface (0.20.1), the scheduler's frameworkMessage method should be called with (driver, executorId, slaveId, message), while in pymesos it is called with (driver, slaveId, executorId, message).
At the moment the scheduler, executor and operations only support using http connections. This should be extended to add support for https.
One way is to add a use_https flag (default False) to each of the respective __init__ methods and set self.scheme to 'http' or 'https', like the following:
if use_https:
self.scheme = 'https'
else:
self.scheme = 'http'
And then reference this scheme whenever creating a connection to the master/agent:
if self.scheme == "https":
conn = HTTPSConnection(...)
else:
conn = HTTPConnection(...)
A change is also needed to resolve the ip to a fqdn for the master/agent in order for the certificates to be validated.
I have a patch that does exactly this, which I can turn into a PR if this is an acceptable approach.
Hi,
When I use pymesos to run 10, 100, or 1000 tasks at the same time, it runs perfectly.
However, with 10000 tasks at the same time, some tasks end up in TASK_LOST.
I'm not sure whether the problem is pymesos or my settings.
Mesos Version: 1.9.0
Pymesos: git clone the latest (2020/6/9)
Total CPU 412, MEM 5.2TB, Disk 983.9
Each task needs 0.01 cpu and 1M mem.
For a task that ends up TASK_LOST, the Mesos master shows:
Sending status update TASK_LOST for task task-xx of framework xxx 'Task launched with invalid offers: Offer xxx is no longer valid'
I guess the cause is that two or more tasks use the same offer id: once one of these tasks is launched, the offer is released, and the other tasks using the same offer id cannot use it anymore.
PyMesos currently reconnects to the master every 2 seconds (cf: https://github.com/douban/pymesos/blob/master/pymesos/process.py#L246).
This may put pressure on the master; we should add randomized exponential backoff in the future.
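A sketch of what randomized exponential ("full jitter") backoff could look like for reconnect delays. The function name and defaults are illustrative, not pymesos code:

```python
import random

def backoff_intervals(base=2, cap=60, factor=2):
    """Yield reconnect delays using full-jitter exponential backoff:
    each delay is uniform in [0, min(cap, base * factor**n)]."""
    delay = base
    while True:
        yield random.uniform(0, delay)
        delay = min(cap, delay * factor)
```

Each reconnect attempt would sleep for the next yielded value, so simultaneous frameworks no longer hammer the master in lockstep.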
I'm using mesos 0.23.0, so I want to know if pymesos supports mesos 0.23.0. Thanks.
I'm building an example scheduler that is a bit more complicated: I've added HA for the scheduler by persisting the framework ID in redis.
When trying to subscribe a second time, I receive the following message:
RuntimeError: Failed with HTTP 400: Failed to convert JSON into Call protobuf: Not expecting a JSON string for field 'framework_id'
The reason, I suspect, is this code included in the SUBSCRIBE call:
if 'id' in self._framework:
request['framework_id'] = self._framework['id']
The id is already in the framework_info dict as a plain string, and it's not until you receive the response from the master that you get:
{
"type" : "SUBSCRIBED",
"subscribed" : {
"framework_id" : {"value":"12220-3440-12532-2345"},
"heartbeat_interval_seconds" : 15
}
}
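In other words, the master expects framework_id as a JSON object ({'value': ...}), not a bare string. A sketch of building the SUBSCRIBE body when restoring a persisted ID; the helper and field layout are my assumptions, not pymesos code:

```python
def build_subscribe(framework_info, persisted_id=None):
    # framework_id must be a {'value': ...} object, not a bare string,
    # or the master fails to convert the JSON into a Call protobuf.
    request = {
        'type': 'SUBSCRIBE',
        'subscribe': {'framework_info': dict(framework_info)},
    }
    if persisted_id:
        fid = {'value': persisted_id}
        request['framework_id'] = fid
        request['subscribe']['framework_info']['id'] = fid
    return request
```

When re-subscribing after a failover, the persisted ID is wrapped the same way in both the Call and the FrameworkInfo.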
We are using pymesos in https://github.com/Yelp/task_processing. I ran into an issue where we accepted an offer and launched a task. The elected master was having some networking issues: pymesos sent the HTTP request but timed out while waiting for the response from the master. The master never received the task, so it was never launched. A socket.timeout error was raised, then caught and ignored by pymesos. Our framework assumed the task had been launched because pymesos did not raise any error, but it had not actually been launched.
Stack trace from our application:
https://gist.github.com/sagar8192/da8f39e04366372815583677dd48c9c5
For POST requests, pymesos should not swallow timeout exceptions. It should raise an exception or return False so that applications know that their request failed.
At present (0.2.7) MesosSchedulerDriver.on_update always calls acknowledgeStatusUpdate, and unlike the old official bindings, there is no option for the scheduler to decide when to acknowledge status updates. This is necessary if, for example, status updates are posted to an asyncio event loop and will only be fully processed after statusUpdate returns. At the moment I'm working around this by subclassing MesosSchedulerDriver and reimplementing on_update, but that's fragile.
Please provide a constructor argument to allow implicit acknowledgements to be turned off, similar to this.
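A rough sketch of what such a flag could look like, using a stand-in class rather than the real driver (all names here are hypothetical):

```python
class DriverBase:
    """Stand-in for MesosSchedulerDriver, illustrating the proposed flag."""

    def __init__(self, implicit_acknowledgements=True):
        self.implicit_acknowledgements = implicit_acknowledgements
        self.acked = []

    def acknowledgeStatusUpdate(self, status):
        self.acked.append(status['uuid'])

    def on_update(self, event):
        status = event['status']
        self.statusUpdate(status)
        # Only ack automatically when the flag is set; otherwise the
        # scheduler calls acknowledgeStatusUpdate itself once its own
        # (possibly asynchronous) processing has finished.
        if self.implicit_acknowledgements and 'uuid' in status:
            self.acknowledgeStatusUpdate(status)

    def statusUpdate(self, status):
        pass  # scheduler callback
```

With implicit_acknowledgements=False, an asyncio-based scheduler could acknowledge only after the event loop has fully processed the update.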
pymesos 0.3.12 does not install from pypi on python 3.7. This is due to the http_parser package C code not being compatible with py3.7.
http_parser appears to be no longer maintained, with the last pypi release in 2013. Suggest dropping this package.
Line 20 in b7b5ddf
>>> parse_duration('5mins')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 10, in parse_duration
ValueError: could not convert string to float: '5mi'
At the moment _timeout is being used as a timeout for HTTP requests as well as a connect deadline. This should be split into two separate timeouts: an http_timeout for HTTP connections, while _timeout should remain for the necessary connect_deadline logic.
My approach would be to add an optional http_timeout to all the necessary __init__ functions and pass it whenever we create an HTTPConnection. I would default it to DAY to ensure the behaviour does not change by default.
Happy to provide a PR for this if you agree with the approach.
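A sketch of the idea using the stdlib http.client directly; the http_timeout parameter name is the proposal's, not existing pymesos API:

```python
from http.client import HTTPConnection

DAY = 24 * 60 * 60

def make_connection(host, port, http_timeout=DAY):
    # Separate per-request HTTP timeout, distinct from the connect
    # deadline; defaulting to DAY keeps current behaviour effectively
    # unchanged (sketch of the proposal, not pymesos code).
    return HTTPConnection(host, port, timeout=http_timeout)
```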
There is no way to set the _failover flag in the scheduler driver. The current behavior means that if the scheduler throws an unexpected exception, all the tasks running on that framework are immediately terminated. There should be a supported way to set this value to True.
def killTask(self, task_id):
    framework_id = self.framework_id
    assert framework_id
    body = dict(
        type='KILL',
        framework_id=dict(
            value=framework_id,
        ),
        kill=dict(
            task_id=task_id,
        ),
    )
    self._send(body)
Scheduler HTTP api spec for Kill task:
KILL Request (JSON):
POST /api/v1/scheduler HTTP/1.1
Host: masterhost:5050
Content-Type: application/json
Mesos-Stream-Id: 130ae4e3-6b13-4ef4-baa9-9f2e85c3e9af
{
"framework_id" : {"value" : "12220-3440-12532-2345"},
"type" : "KILL",
"kill" : {
"task_id" : {"value" : "12220-3440-12532-my-task"},
"agent_id" : {"value" : "12220-3440-12532-S1233"}
}
}
KILL Response:
HTTP/1.1 202 Accepted
Using the latest version of PyMesos, 0.3.4, I cannot receive any events using the operator API. From my initial investigation, it looks like the SUBSCRIBE request is not sent.
On PyMesos 0.3.3, after calling driver.start()
, I get a log message similar to the following:
Operator client subscribed with cluster state: ...
However, on 0.3.4, no such log is present, and any task updates or agent updates will not notify the operator.
I believe this can be reproduced with the basic examples provided. Falling back to PyMesos 0.3.3 fixes the issue, so it seems #98 introduced a regression.
This line is missing a comma: https://github.com/douban/pymesos/blob/master/pymesos/interface.py#L32
Hope no one minds, I made a PR: #107.
There is no subscribe API endpoint in the scheduler driver. Though registration happens implicitly on driver.start(), there is currently no API call for re-registering when we are disconnected from the Mesos master.
The mesos master is configured with CRAM-MD5 framework authentication, but the pymesos example can still register a framework without a Credential.
Is there any deployment documentation to reference for running dpark on Mesos using Docker?
For example, in the code below, I add a new key to the update dictionary, i.e. update.new_key = 1:
class MinimalExecutor(Executor):
    def launchTask(self, driver, task):
        def run_task(task):
            update = Dict()
            update.task_id.value = task.task_id.value
            update.state = 'TASK_RUNNING'
            update.timestamp = time.time()
            update.new_key = 1
            driver.sendStatusUpdate(update)

            print(decode_data(task.data), file=sys.stderr)
            time.sleep(30)

            update = Dict()
            update.task_id.value = task.task_id.value
            update.state = 'TASK_FINISHED'
            update.timestamp = time.time()
            update.new_key = 1
            driver.sendStatusUpdate(update)

        thread = Thread(target=run_task, args=(task,))
        thread.start()
and in scheduler.py
def statusUpdate(self, driver, update):
    logging.debug('Status update TID %s %s %s',
                  update.task_id.value,
                  update.state,
                  update.new_key)
update.new_key is shown as an empty dictionary, {}, instead of 1 in the log.
How do I add a new key to the update dictionary destined for statusUpdate()?
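One likely explanation: TaskStatus is a fixed protobuf message, so unknown fields like new_key are silently dropped during serialization. Custom information has to travel in the status's data field instead. A hedged sketch of packing extra fields into data (the helper names are mine, not pymesos API):

```python
import json
from base64 import b64decode, b64encode

def pack_status_data(payload):
    # TaskStatus.data is bytes (base64-encoded over the HTTP API), so
    # serialize custom fields into it rather than adding protobuf-unknown
    # keys that get dropped.
    return b64encode(json.dumps(payload).encode()).decode()

def unpack_status_data(data):
    return json.loads(b64decode(data))
```

In the executor you would set update.data = pack_status_data({'new_key': 1}), and in statusUpdate read unpack_status_data(update.data)['new_key'].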
1: In slave.py there is "from mesos.interace.mesos_pb2 import *"; shouldn't 'interace' be 'interface'?
2: Also, the original mesos package shouldn't contain an 'interface' package, should it?
Currently, if a maintenance schedule is posted to the Mesos master, the current assumption in process_event leads to a KeyError when trying to extract non-existent offers from an OFFERS message. Here is an example of an event which causes this issue:
{
"offers": {
"inverse_offers": [
{
"url": {
"path": "/slave(1)",
"scheme": "http",
"address": {
"ip": "10.112.46.15",
"hostname": "mesos-dev2.hostname.io",
"port": 5051
}
},
"unavailability": {
"start": {
"nanoseconds": 1488897610000000000
}
},
"framework_id": {
"value": "142e3612-a19d-4be2-9f06-35a987fd36fb-0024"
},
"agent_id": {
"value": "7f05b67b-6a40-4c90-aeef-782616a5ae9d-S3"
},
"id": {
"value": "142e3612-a19d-4be2-9f06-35a987fd36fb-O230842"
}
}
]
},
"type": "OFFERS"
}
I'm working to support inverse offers at https://github.com/efritz/pymesos.
In pymesos' current implementation, callbacks block the io_thread, which is a big problem in some cases. For example:

def reregistered(self, driver, masterInfo):
    self.reconcile_all_running_tasks()
    with self.lock:
        self._connected = True

reconcile_all_running_tasks will wait until it has received statusUpdate for these tasks, but because it blocks the io_thread, statusUpdate never gets a chance to run, so this ends in a deadlock.
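One way to avoid this is to return from the callback immediately and do the blocking work on a separate thread, e.g. (a sketch; class and method names are illustrative):

```python
from threading import Thread

class NonBlockingScheduler:
    """Sketch: run slow reconciliation off the I/O thread so that
    statusUpdate callbacks can still be delivered."""

    def reregistered(self, driver, masterInfo):
        # Return immediately; do the waiting in a background thread.
        Thread(target=self.reconcile_all_running_tasks,
               args=(driver,), daemon=True).start()

    def reconcile_all_running_tasks(self, driver):
        pass  # placeholder: would send RECONCILE and wait for updates
```

Because reregistered returns at once, the io_thread keeps dispatching events and the reconciliation thread eventually sees its status updates.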
According to Scheduler HTTP API v1(link on pymesos README), there should be a heartbeat callback for the scheduler.
It seems the callback is suppressed in PyMesos and not exposed to pymesos clients, maybe to be compliant with the v0 API.
I was pointed to these messages in a Mesos slack discussion by Anand Mazumdar when I asked about this issue (v0/v1):
Anand Mazumdar [6:06 PM]
@saitejar The supported python bindings still use the old v0 API (driver based). You might want to look at pymesos instead, which uses the new v1 Scheduler API. This has come up in the past: https://mesos.slack.com/archives/C1KSH30JF/p1488943124000842
Anand Mazumdar
@windreamer : The official interface (v0) won’t ever support the v1 API, so you are free to pick a method name of your choice.
Posted in #http-api, March 7th at 10:18 PM
Something that they might already be fixing. @windreamer maintains it and would know more about the plans.
Can we expect heartbeat callback to be supported anytime soon?
I just want to stop the scheduler, so I call driver.stop(), but pymesos throws an exception. I'm a newbie to mesos, so I really can't figure out what's wrong; the following is the traceback:
2015-09-07 09:52:57,059 - [process.py:72 - run_jobs() ]: ERROR error while call <function handle at 0x7f22dfdefd70> (tried 0 times)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 69, in run_jobs
func(*args, **kw)
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 138, in handle
f(*args)
File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 108, in onStatusUpdateMessage
self.sched.statusUpdate(self, update.status)
File "/home/yangyu/Documents/my-code/geetest/gt-mesos-framework/gmesos/scheduler.py", line 130, in statusUpdate
driver.stop()
File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 155, in stop
Process.stop(self)
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 146, in stop
self.join()
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 152, in join
self.delay_t.join()
File "/usr/lib/python2.7/threading.py", line 940, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
A maximum interval of 300 seconds in the exponential backoff we use when reconnecting to agents all but ensures that an executor will not be able to reregister quickly enough once an agent comes back online.
Since the default executor_reregistration_timeout is 2 seconds (and mesos doesn't allow us to increase it beyond 15 seconds), we probably need the maximum interval to be 1 second for executor reconnect attempts.
cf: douban/dpark#66
As title
This issue causes the Process thread to abort on receiving a 'rescind_inverse_offer' event.
See the definition in mesos.proto: https://github.com/apache/mesos/blob/master/include/mesos/v1/scheduler/scheduler.proto#L110.
Traceback:
2018-07-06 10:03:31,005 [ERROR] pymesos.process: Failed to process event
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/pymesos/process.py", line 194, in read
self._callback.process_event(event)
File "/usr/local/lib/python2.7/site-packages/pymesos/process.py", line 274, in process_event
self.on_event(event)
File "/usr/local/lib/python2.7/site-packages/pymesos/scheduler.py", line 635, in on_event
func(event)
File "/usr/local/lib/python2.7/site-packages/pymesos/scheduler.py", line 577, in on_rescind_inverse_offer
offer_id = event['offer_id']
KeyError: 'offer_id'
Hey there, thanks for writing this library. I was wondering what the status is for python 3 support? I ran into some data encoding/decoding issues (that I was mostly able to work around) when I tried.
The start() method of the Process class does not call the _notify() method as other methods (i.e. stop()) do.
This prevents the executor from connecting back to the agent and processing events, because Process._run() never assigns the _master field and therefore never creates a Connection instance.
As a result, the launched task remains in the STAGING state until it fails a couple of minutes later.
A workaround is to extend MesosExecutorDriver and override the start() method, adding a call to _notify().
Is this the intended behavior?
I have no idea how older versions of mesos work, but it seems that in version 0.21, acknowledgements are sent to the master and then forwarded to the slave. When using pymesos with 0.21, tasks won't be moved to completed tasks because the master doesn't get the acknowledgements. I took a look at the native implementation, and it seems the official version goes through the master regardless of what the pid is.
pymesos/pymesos/process.py, line 106:

elif code == SERVICE_UNAVAILABLE:
    logger.warnig('Master is not available, retry.')
    return False

It should be logger.warning instead.
In pymesos/detector.py:
from kazoo.client import KazooClient as ZKClient
gives me
ImportError: No module named kazoo.client
because kazoo is not listed as a dependency in requirements.txt and pip doesn't grab that dependency for me.