douban / pymesos
A pure python implementation of Mesos scheduler and executor
License: BSD 3-Clause "New" or "Revised" License
Hi, is there currently a way for the MesosSchedulerDriver to raise an error when the Mesos master is not available? It seems that it currently gets stuck in an endless loop of "Connection refused" errors once I call driver.run():
framework_1 | 2018-06-06 15:35:26.558|ERROR|Failed to send to ('mesos-master', 5050)
framework_1 | Traceback (most recent call last):
framework_1 | File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 536, in run
framework_1 | result = self._run(*self.args, **self.kwargs)
framework_1 | File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
framework_1 | self.__bootstrap_inner()
framework_1 | File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
framework_1 | self.run()
framework_1 | File "/usr/lib/python2.7/threading.py", line 754, in run
framework_1 | self.__target(*self.__args, **self.__kwargs)
framework_1 | File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 318, in _run
framework_1 | if not conn.write():
framework_1 | File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 76, in write
framework_1 | logger.exception('Failed to send to %s', self._addr)
framework_1 | File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 69, in write
framework_1 | sent = self._sock.send(self._request)
framework_1 | File "/usr/local/lib/python2.7/dist-packages/gevent/_socket2.py", line 323, in send
framework_1 | return sock.send(data, flags)
framework_1 | error: [Errno 111] Connection refused
I would like a way to catch the exception and handle it accordingly. Otherwise, something like a timeout would be good as well.
Hi @windreamer,
In commit 04501d1#diff-ab924b88fb1411cfeab6024b822161a8R95, a change was made to suppress heartbeat events to the framework. Was there a reason behind it?
At the moment there is no way to enable authentication for the executor using pymesos.
When executor authentication is enabled, the agent injects MESOS_EXECUTOR_AUTHENTICATION_TOKEN into the executor's environment. This token needs to be added to each request's headers in order to authenticate, like the following:
Authorization: Bearer MESOS_EXECUTOR_AUTHENTICATION_TOKEN
See http://mesos.apache.org/documentation/latest/authentication/
If you think this is an acceptable approach, I would be happy to provide a PR.
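For illustration, a minimal sketch of how an executor could pick up the injected token and attach it to request headers. The helper name and structure here are my own, not existing pymesos API:

```python
import os

def executor_auth_headers(extra=None):
    """Build request headers, adding a Bearer token when the agent has
    injected MESOS_EXECUTOR_AUTHENTICATION_TOKEN (hypothetical helper)."""
    headers = dict(extra or {})
    token = os.environ.get('MESOS_EXECUTOR_AUTHENTICATION_TOKEN')
    if token:
        headers['Authorization'] = 'Bearer ' + token
    return headers
```

The executor would then pass these headers on every request it sends to the agent.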
Running: docker run --rm -ti python:3.11 bash -c 'pip install pymesos'
gives the following error:
building 'http_parser.parser' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/http_parser
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Iparser -I/usr/local/include/python3.11 -c http_parser/http_parser.c -o build/temp.linux-x86_64-cpython-311/http_parser/http_parser.o
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Iparser -I/usr/local/include/python3.11 -c http_parser/parser.c -o build/temp.linux-x86_64-cpython-311/http_parser/parser.o
http_parser/parser.c:196:12: fatal error: longintrepr.h: No such file or directory
196 | #include "longintrepr.h"
| ^~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for http-parser
I think this is because of:
benoitc/http-parser#94
We need this fixed for DataBiosphere/toil#4318.
Any help would be appreciated, thanks.
Hi,
I might have stumbled on a small bug concerning serialization. It seems that when running Python 3.5, the encode_data method does not always work as expected, since it returns bytes while json expects string objects. For example, if the executor does not specify a uuid, the default assignment leads to JSON serialization errors.
Changing it to encode_data(uuid.uuid4().bytes).decode() solves this, and adding the decode does not seem to break the code for Python 2.
The same problem is also present in the examples folder.
This can be reproduced by:
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from binascii import b2a_base64, a2b_base64
>>> import json
>>> json.dumps(b2a_base64("a").decode())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
>>> json.dumps(b2a_base64(b"a").decode())
'"YQ==\\n"'
>>> json.dumps(b2a_base64(b"a"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.5/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/opt/anaconda3/lib/python3.5/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/anaconda3/lib/python3.5/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/opt/anaconda3/lib/python3.5/json/encoder.py", line 180, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'YQ==\n' is not JSON serializable
>>> json.dumps(b2a_base64(b"a").decode())
'"YQ==\\n"'
I am not sure if this is a bug or I am just misunderstanding something. If this is indeed a bug, adding a decode call to the encode method seems to fix the issue.
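For reference, a minimal sketch of the proposed fix (illustrative, not pymesos's actual code): have encode_data return a str so the result is JSON-serializable on both Python 2 and 3:

```python
from binascii import a2b_base64, b2a_base64

def encode_data(data):
    # Base64-encode, strip the trailing newline, and decode to str
    # so json.dumps accepts the result on Python 3.
    return b2a_base64(data).strip().decode('ascii')

def decode_data(data):
    # Accepts either str or bytes input.
    return a2b_base64(data)
```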
I'm not entirely sure if this is related to pymesos, but I'm seeing a case where, if I schedule lots of jobs that only need 0.2 cpu (0.1 for the task, 0.1 for the executor) on a 2-core agent, the resource offers supplied to the framework gradually contain fewer and fewer cpus until the jobs cannot be scheduled.
(As a note: I'm running Mesos on Debian, and launching jobs inside docker containers.)
For example, here is the first offer I receive (with hostname misc removed):
{'framework_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-0044'},
 'allocation_info': {'role': '*'},
 'id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-O185'},
 'resources': [
   {'role': '*', 'type': 'SCALAR', 'name': 'cpus', 'scalar': {'value': 2.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'mem', 'scalar': {'value': 12031.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'disk', 'scalar': {'value': 24987.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'RANGES', 'name': 'ports', 'ranges': {'range': [{'begin': 31000, 'end': 32000}]}, 'allocation_info': {'role': '*'}}],
 'agent_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-S0'}}
and then, after scheduling jobs, eventually my final resource offer looks like:
{'framework_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-0044'},
 'allocation_info': {'role': '*'},
 'id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-O192'},
 'resources': [
   {'role': '*', 'type': 'SCALAR', 'name': 'cpus', 'scalar': {'value': 0.1}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'mem', 'scalar': {'value': 32.0}, 'allocation_info': {'role': '*'}},
   {'role': '*', 'type': 'SCALAR', 'name': 'disk', 'scalar': {'value': 32.0}, 'allocation_info': {'role': '*'}}],
 'agent_id': {'value': 'e06e2242-4f76-450d-9153-4616dc03e913-S0'}}
At this point, I checked the mesos master, which reports 2 cpus offered, and there is nothing else running. Notably, the mem and disk offers also seem to decrease. Is it possible that pymesos may not be relinquishing resources?
Hi, according to mesos.interface (0.20.1), the scheduler's frameworkMessage method should be called with (driver, executorId, slaveId, message), while in pymesos it is called with (driver, slaveId, executorId, message).
At the moment the scheduler, executor and operations only support using http connections. This should be extended to add support for https.
One way is to add a use_https flag (default False) to each of the respective __init__ methods and set self.scheme to 'http' or 'https', like the following:
if use_https:
self.scheme = 'https'
else:
self.scheme = 'http'
And then reference this scheme whenever creating a connection to the master/agent:
if self.scheme == "https":
conn = HTTPSConnection(...)
else:
conn = HTTPConnection(...)
A change is also needed to resolve the ip to a fqdn for the master/agent in order for the certificates to be validated.
I have a patch that does exactly this, which I can turn into a PR if this is an acceptable approach.
Hi,
When I use pymesos to run 10, 100, or 1000 tasks at the same time, it runs perfectly.
However, with 10000 tasks at the same time, some tasks end up in TASK_LOST.
I'm not sure whether the problem is pymesos or my settings.
Mesos Version: 1.9.0
Pymesos: git clone the latest (2020/6/9)
Total CPU 412, MEM 5.2TB, Disk 983.9
Each task needs 0.01 cpu and 1M mem.
For a task that ends up TASK_LOST, the Mesos master shows:
Sending status update TASK_LOST for task task-xx of framework xxx 'Task launched with invalid offers: Offer xxx is no longer valid'
I guess the cause is that two or more tasks use the same offer id: once one of these tasks is launched, the offer is released, and the other tasks using the same offer id cannot use it anymore.
PyMesos currently reconnects to the master every 2 seconds (cf: https://github.com/douban/pymesos/blob/master/pymesos/process.py#L246).
This may put pressure on the master; we should add randomized exponential backoff in the future.
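A sketch of what randomized exponential ("full jitter") backoff could look like for reconnect delays. The function name and defaults are illustrative, not pymesos code:

```python
import random

def backoff_intervals(base=2, cap=60, factor=2):
    """Yield reconnect delays using full-jitter exponential backoff:
    each delay is uniform in [0, min(cap, base * factor**n)]."""
    delay = base
    while True:
        yield random.uniform(0, delay)
        delay = min(cap, delay * factor)
```

Each reconnect attempt would sleep for the next yielded value, so simultaneous frameworks no longer hammer the master in lockstep.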
I'm using mesos 0.23.0, so I want to know if pymesos supports mesos 0.23.0. Thanks.
I'm building an example scheduler that is a bit more complicated: I've added HA for the scheduler by persisting the framework ID in redis.
When trying to subscribe a second time, I receive the following message:
RuntimeError: Failed with HTTP 400: Failed to convert JSON into Call protobuf: Not expecting a JSON string for field 'framework_id'
The reason, I suspect, is this code included in the SUBSCRIBE call:
if 'id' in self._framework:
request['framework_id'] = self._framework['id']
The id is already in the framework_info dict as a plain string, and it's not until you receive the response from the master that you get:
{
"type" : "SUBSCRIBED",
"subscribed" : {
"framework_id" : {"value":"12220-3440-12532-2345"},
"heartbeat_interval_seconds" : 15
}
}
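In other words, the master expects framework_id as a JSON object ({'value': ...}), not a bare string. A sketch of building the SUBSCRIBE body when restoring a persisted ID; the helper and field layout are my assumptions, not pymesos code:

```python
def build_subscribe(framework_info, persisted_id=None):
    # framework_id must be a {'value': ...} object, not a bare string,
    # or the master fails to convert the JSON into a Call protobuf.
    request = {
        'type': 'SUBSCRIBE',
        'subscribe': {'framework_info': dict(framework_info)},
    }
    if persisted_id:
        fid = {'value': persisted_id}
        request['framework_id'] = fid
        request['subscribe']['framework_info']['id'] = fid
    return request
```

When re-subscribing after a failover, the persisted ID is wrapped the same way in both the Call and the FrameworkInfo.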
We are using pymesos in https://github.com/Yelp/task_processing. I ran into an issue where we accepted an offer and launched a task. The elected master was having some networking issues: pymesos sent the HTTP request but timed out while waiting for the response from the master. The master never received the task, so it was never launched. A socket.timeout error was raised, then caught and ignored by pymesos. Our framework assumed the task had been launched because pymesos did not raise any error, but it had not actually been launched.
Stack trace from our application:
https://gist.github.com/sagar8192/da8f39e04366372815583677dd48c9c5
For POST requests, pymesos should not swallow timeout exceptions. It should raise an exception or return False so that applications know that their request failed.
At present (0.2.7) MesosSchedulerDriver.on_update always calls acknowledgeStatusUpdate, and unlike the old official bindings, there is no option for the scheduler to decide when to acknowledge status updates. This is necessary if, for example, status updates are posted to an asyncio event loop and will only be fully processed after statusUpdate returns. At the moment I'm working around this by subclassing MesosSchedulerDriver and reimplementing on_update, but that's fragile.
Please provide a constructor argument to allow implicit acknowledgements to be turned off, similar to this.
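A rough sketch of what such a flag could look like, using a stand-in class rather than the real driver (all names here are hypothetical):

```python
class DriverBase:
    """Stand-in for MesosSchedulerDriver, illustrating the proposed flag."""

    def __init__(self, implicit_acknowledgements=True):
        self.implicit_acknowledgements = implicit_acknowledgements
        self.acked = []

    def acknowledgeStatusUpdate(self, status):
        self.acked.append(status['uuid'])

    def on_update(self, event):
        status = event['status']
        self.statusUpdate(status)
        # Only ack automatically when the flag is set; otherwise the
        # scheduler calls acknowledgeStatusUpdate itself once its own
        # (possibly asynchronous) processing has finished.
        if self.implicit_acknowledgements and 'uuid' in status:
            self.acknowledgeStatusUpdate(status)

    def statusUpdate(self, status):
        pass  # scheduler callback
```

With implicit_acknowledgements=False, an asyncio-based scheduler could acknowledge only after the event loop has fully processed the update.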
pymesos 0.3.12 does not install from pypi on python 3.7. This is due to the http_parser package C code not being compatible with py3.7.
http_parser appears to be no longer maintained, with the last pypi release in 2013. Suggest dropping this package.
Line 20 in b7b5ddf
>>> parse_duration('5mins')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 10, in parse_duration
ValueError: could not convert string to float: '5mi'
At the moment _timeout is being used as a timeout for HTTP requests as well as a connect deadline. This should be split into two separate timeouts: an http_timeout for HTTP connections, while _timeout should remain for the necessary connect_deadline logic.
My approach would be to add an optional http_timeout to all the necessary __init__ functions and pass it whenever we create an HTTPConnection. I would default it to DAY to ensure the behaviour does not change by default.
Happy to provide a PR for this if you agree with the approach.
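A sketch of the idea using the stdlib http.client directly; the http_timeout parameter name is the proposal's, not existing pymesos API:

```python
from http.client import HTTPConnection

DAY = 24 * 60 * 60

def make_connection(host, port, http_timeout=DAY):
    # Separate per-request HTTP timeout, distinct from the connect
    # deadline; defaulting to DAY keeps current behaviour effectively
    # unchanged (sketch of the proposal, not pymesos code).
    return HTTPConnection(host, port, timeout=http_timeout)
```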
There is no way to set the _failover flag in the scheduler driver. The current behavior means that if the scheduler throws an unexpected exception, all the tasks running on that framework are immediately terminated. There should be a supported way to set this value to True.
def killTask(self, task_id):
    framework_id = self.framework_id
    assert framework_id
    body = dict(
        type='KILL',
        framework_id=dict(
            value=framework_id,
        ),
        kill=dict(
            task_id=task_id,
        ),
    )
    self._send(body)
Scheduler HTTP api spec for Kill task:
KILL Request (JSON):
POST /api/v1/scheduler HTTP/1.1
Host: masterhost:5050
Content-Type: application/json
Mesos-Stream-Id: 130ae4e3-6b13-4ef4-baa9-9f2e85c3e9af
{
"framework_id" : {"value" : "12220-3440-12532-2345"},
"type" : "KILL",
"kill" : {
"task_id" : {"value" : "12220-3440-12532-my-task"},
"agent_id" : {"value" : "12220-3440-12532-S1233"}
}
}
KILL Response:
HTTP/1.1 202 Accepted
Using the latest version of PyMesos, 0.3.4, I cannot receive any events using the operator API. From my initial investigation, it looks like the SUBSCRIBE request is not sent.
On PyMesos 0.3.3, after calling driver.start()
, I get a log message similar to the following:
Operator client subscribed with cluster state: ...
However, on 0.3.4, no such log is present, and any task updates or agent updates will not notify the operator.
I believe this can be reproduced with the basic examples provided. Falling back to PyMesos 0.3.3 fixes the issue, so it seems #98 introduced a regression.
This line is missing a comma: https://github.com/douban/pymesos/blob/master/pymesos/interface.py#L32
Hope no one minds, I made a PR: #107.
There is no subscribe API endpoint in the scheduler driver. Though registration happens implicitly on driver.start(), there is currently no API call for re-registering when we are disconnected from the Mesos master.
The mesos master is configured with CRAM-MD5 framework authentication, but the pymesos example can still register a framework without a Credential.
Is there any deployment documentation to reference for running dpark on Mesos using Docker?
For example, in the code below, I add a new key to the update dictionary, i.e. update.new_key = 1:
class MinimalExecutor(Executor):
    def launchTask(self, driver, task):
        def run_task(task):
            update = Dict()
            update.task_id.value = task.task_id.value
            update.state = 'TASK_RUNNING'
            update.timestamp = time.time()
            update.new_key = 1
            driver.sendStatusUpdate(update)

            print(decode_data(task.data), file=sys.stderr)
            time.sleep(30)

            update = Dict()
            update.task_id.value = task.task_id.value
            update.state = 'TASK_FINISHED'
            update.timestamp = time.time()
            update.new_key = 1
            driver.sendStatusUpdate(update)

        thread = Thread(target=run_task, args=(task,))
        thread.start()
and in scheduler.py
def statusUpdate(self, driver, update):
    logging.debug('Status update TID %s %s %s',
                  update.task_id.value,
                  update.state,
                  update.new_key)
update.new_key is shown as an empty dictionary, {}, instead of 1 in the log.
How do I add a new key to the update dictionary destined for statusUpdate()?
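One likely explanation: TaskStatus is a fixed protobuf message, so unknown fields like new_key are silently dropped during serialization. Custom information has to travel in the status's data field instead. A hedged sketch of packing extra fields into data (the helper names are mine, not pymesos API):

```python
import json
from base64 import b64decode, b64encode

def pack_status_data(payload):
    # TaskStatus.data is bytes (base64-encoded over the HTTP API), so
    # serialize custom fields into it rather than adding protobuf-unknown
    # keys that get dropped.
    return b64encode(json.dumps(payload).encode()).decode()

def unpack_status_data(data):
    return json.loads(b64decode(data))
```

In the executor you would set update.data = pack_status_data({'new_key': 1}), and in statusUpdate read unpack_status_data(update.data)['new_key'].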
1: In slave.py there is "from mesos.interace.mesos_pb2 import *"; shouldn't 'interace' be 'interface'?
2: Also, the original mesos package shouldn't contain an 'interface' package, should it?
Currently, if a maintenance schedule is posted to the Mesos master, the current assumption in process_event leads to a KeyError when trying to extract non-existent offers from an OFFERS message. Here is an example of an event which causes this issue:
{
"offers": {
"inverse_offers": [
{
"url": {
"path": "/slave(1)",
"scheme": "http",
"address": {
"ip": "10.112.46.15",
"hostname": "mesos-dev2.hostname.io",
"port": 5051
}
},
"unavailability": {
"start": {
"nanoseconds": 1488897610000000000
}
},
"framework_id": {
"value": "142e3612-a19d-4be2-9f06-35a987fd36fb-0024"
},
"agent_id": {
"value": "7f05b67b-6a40-4c90-aeef-782616a5ae9d-S3"
},
"id": {
"value": "142e3612-a19d-4be2-9f06-35a987fd36fb-O230842"
}
}
]
},
"type": "OFFERS"
}
I'm working to support inverse offers at https://github.com/efritz/pymesos.
In pymesos' current implementation, callbacks block the io_thread, which is a big problem in some cases. For example:

def reregistered(self, driver, masterInfo):
    self.reconcile_all_running_tasks()
    with self.lock:
        self._connected = True

reconcile_all_running_tasks will wait until it has received statusUpdate for these tasks, but because it blocks the io_thread, statusUpdate never gets a chance to run, so this ends in a deadlock.
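One way to avoid this is to return from the callback immediately and do the blocking work on a separate thread, e.g. (a sketch; class and method names are illustrative):

```python
from threading import Thread

class NonBlockingScheduler:
    """Sketch: run slow reconciliation off the I/O thread so that
    statusUpdate callbacks can still be delivered."""

    def reregistered(self, driver, masterInfo):
        # Return immediately; do the waiting in a background thread.
        Thread(target=self.reconcile_all_running_tasks,
               args=(driver,), daemon=True).start()

    def reconcile_all_running_tasks(self, driver):
        pass  # placeholder: would send RECONCILE and wait for updates
```

Because reregistered returns at once, the io_thread keeps dispatching events and the reconciliation thread eventually sees its status updates.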
According to Scheduler HTTP API v1(link on pymesos README), there should be a heartbeat callback for the scheduler.
It seems the callback is suppressed in PyMesos and not exposed to pymesos clients, maybe to be compliant with the v0 API.
I was pointed to these messages in a Mesos slack discussion by Anand Mazumdar when I asked about this issue (v0/v1):
Anand Mazumdar [6:06 PM]
@saitejar The supported python bindings still use the old v0 API (driver based). You might want to look at pymesos instead, which uses the new v1 Scheduler API. This has come up in the past: https://mesos.slack.com/archives/C1KSH30JF/p1488943124000842
Anand Mazumdar
@windreamer : The official interface (v0) won’t ever support the v1 API, so you are free to pick a method name of your choice.
Posted in #http-api, March 7th at 10:18 PM
Something that they might already be fixing. @windreamer maintains it and would know more about the plans.
Can we expect heartbeat callback to be supported anytime soon?
I just want to stop the scheduler, so I call driver.stop(), but pymesos throws an exception. I'm a newbie to mesos, so I really can't figure out what's wrong; the following is the traceback:
2015-09-07 09:52:57,059 - [process.py:72 - run_jobs() ]: ERROR error while call <function handle at 0x7f22dfdefd70> (tried 0 times)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 69, in run_jobs
func(*args, **kw)
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 138, in handle
f(*args)
File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 108, in onStatusUpdateMessage
self.sched.statusUpdate(self, update.status)
File "/home/yangyu/Documents/my-code/geetest/gt-mesos-framework/gmesos/scheduler.py", line 130, in statusUpdate
driver.stop()
File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 155, in stop
Process.stop(self)
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 146, in stop
self.join()
File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 152, in join
self.delay_t.join()
File "/usr/lib/python2.7/threading.py", line 940, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
A maximum interval of 300 seconds in the exponential backoff we use when reconnecting to agents all but ensures that an executor will not be able to reregister quickly enough once an agent comes back online.
Since the default executor_reregistration_timeout is 2 seconds (and mesos doesn't allow us to increase it beyond 15 seconds), we probably need the maximum interval to be 1 second for executor reconnect attempts.
cf: douban/dpark#66
As title
This issue causes the Process thread to abort on receiving a 'rescind_inverse_offer' event.
See the definition in mesos.proto: https://github.com/apache/mesos/blob/master/include/mesos/v1/scheduler/scheduler.proto#L110.
Traceback:
2018-07-06 10:03:31,005 [ERROR] pymesos.process: Failed to process event
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/pymesos/process.py", line 194, in read
self._callback.process_event(event)
File "/usr/local/lib/python2.7/site-packages/pymesos/process.py", line 274, in process_event
self.on_event(event)
File "/usr/local/lib/python2.7/site-packages/pymesos/scheduler.py", line 635, in on_event
func(event)
File "/usr/local/lib/python2.7/site-packages/pymesos/scheduler.py", line 577, in on_rescind_inverse_offer
offer_id = event['offer_id']
KeyError: 'offer_id'
Hey there, thanks for writing this library. I was wondering what the status is for python 3 support? I ran into some data encoding/decoding issues (that I was mostly able to work around) when I tried.
The start() method of the Process class does not call the _notify() method as other methods (i.e. stop()) do.
This prevents the executor from connecting back to the agent and processing events, because Process._run() never assigns the _master field and therefore never creates a Connection instance.
As a result, the launched task remains in the STAGING state until it fails a couple of minutes later.
A workaround is to extend MesosExecutorDriver and override the start() method, adding a call to _notify().
Is this the intended behavior?
I have no idea how older versions of mesos work, but it seems that in version 0.21, acknowledgements are sent to the master and then forwarded to the slave. When using pymesos with 0.21, tasks won't be moved to completed tasks because the master doesn't get the acknowledgements. I took a look at the native implementation, and it seems the official version goes through the master regardless of what the pid is.
pymesos/pymesos/process.py, line 106:

elif code == SERVICE_UNAVAILABLE:
    logger.warnig('Master is not available, retry.')
    return False

It should be logger.warning instead.
In pymesos/detector.py:
from kazoo.client import KazooClient as ZKClient
gives me
ImportError: No module named kazoo.client
because kazoo is not listed as a dependency in requirements.txt and pip doesn't grab that dependency for me.