airekans / monsit
Monsit is an application that helps you monitor the cluster.
License: MIT License
I got an error when running the agent on a machine.
return_code: 0
msg: "SUCCESS"
^CTraceback (most recent call last):
File "core.pyx", line 348, in gevent.core.loop.handle_error (gevent/gevent.core.c:6380)
File "/home/yahuang/programming/Monsit/build/client/out00-PYZ.pyz/gevent.hub", line 291, in handle_error
KeyboardInterrupt
return_code: 0
msg: "SUCCESS"
Please find the root cause and check whether this is a bug in the rpc module.
A candidate date picker is this one.
Right now the master is responsible for sending email once it detects that a certain condition is met.
The email-sending function should be separated from the master, so that if it stops working it can be restarted or rewritten independently.
My initial thought is to use a queue to store the requests and process them in another dedicated process.
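The queue idea could be sketched roughly as follows. This is only an illustration, not Monsit's code: `start_email_worker` and the `send_email` callable are hypothetical names, and a background thread stands in for the dedicated process to keep the example short (`multiprocessing.Queue` offers the same `put`/`get` shape).

```python
import queue
import threading

def start_email_worker(requests, send_email):
    """Drain `requests` in a background thread, calling send_email on each item."""
    def worker():
        while True:
            item = requests.get()
            if item is None:          # sentinel: stop the worker
                break
            try:
                send_email(item)
            except Exception as exc:
                # A failing sender must not take the worker down.
                print("email failed:", exc)
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread

# Demo with a fake sender that only records what it was asked to send.
sent = []
requests = queue.Queue()
worker = start_email_worker(requests, sent.append)
requests.put({"to": "admin@example.com", "subject": "CPU alarm"})
requests.put(None)
worker.join()
```

The master only ever calls `requests.put(...)`, so a broken sender does not block alarm detection.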
Right now, what is shown on the host information page is hard-coded in the template.
It would be better to let the user configure what to display on that page.
Sometimes email sending is not reliable, especially when the network is down.
I could look into third-party push services to see whether one is suitable.
For example, this one.
When a user sets an alarm to send emails to multiple people, the sending fails.
When another (possibly invalid) client tries to call the report method, there is currently no sanity check for it.
We should check for this and ignore such calls.
Right now, if a connection is broken, we have no way to tell.
We should add a heartbeat during Connection idle time so that broken connections are detected early.
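One possible shape for an idle-time heartbeat is sketched below. The class name `IdleHeartbeat`, its parameters, and the injected `clock` are all assumptions made for illustration; the real Connection class may look quite different.

```python
import time

class IdleHeartbeat:
    """Send a heartbeat when the connection has been idle; give up after max_missed."""

    def __init__(self, idle_seconds, max_missed=3, clock=time.monotonic):
        self.idle_seconds = idle_seconds
        self.max_missed = max_missed
        self.clock = clock
        self.last_activity = clock()
        self.missed = 0

    def on_data_received(self):
        # Any traffic from the peer proves the connection is alive.
        self.last_activity = self.clock()
        self.missed = 0

    def poll(self, send_heartbeat):
        """Call periodically. Returns False once the connection looks broken."""
        if self.clock() - self.last_activity >= self.idle_seconds:
            if self.missed >= self.max_missed:
                return False
            send_heartbeat()
            self.missed += 1
            self.last_activity = self.clock()
        return True

# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
heartbeats = []
hb = IdleHeartbeat(idle_seconds=5, max_missed=2, clock=lambda: now[0])
results = []
for t in (0.0, 5.0, 10.0, 15.0):
    now[0] = t
    results.append(hb.poll(lambda: heartbeats.append(t)))
```

Here the peer never answers, so two heartbeats go out and the fourth poll reports the connection as broken.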
Right now, when a connection fails, all unsent requests are simply marked as failed.
This can be improved by moving these requests to other connections in the same channel, if there are any.
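A minimal sketch of that re-routing, assuming each healthy connection exposes a simple list-like send queue. The function name `redistribute` and the round-robin policy are assumptions, not the rpc module's actual design.

```python
from itertools import cycle

def redistribute(unsent, healthy_queues):
    """Move unsent requests onto healthy connections' queues, round-robin.

    Returns the requests that could not be moved (to be marked as failed).
    """
    if not healthy_queues:
        return list(unsent)
    for request, conn_queue in zip(unsent, cycle(healthy_queues)):
        conn_queue.append(request)
    return []

# Demo: three pending requests, two surviving connections.
conn_a, conn_b = [], []
leftover = redistribute(["r1", "r2", "r3"], [conn_a, conn_b])
no_peers = redistribute(["r4"], [])
```

Only when no other connection exists in the channel do the requests fall back to being failed.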
Right now all stats are added by me.
If a user wants to add a stat, they have to modify the client source, which is not user-friendly.
The client should expose an API or a plugin mechanism to let users add stats.
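As one way such an API might look, here is a small decorator-based registry. The names `stat` and `collect_all` are hypothetical and the collector shown returns a placeholder value; none of this exists in the client today.

```python
_collectors = {}

def stat(name):
    """Decorator that registers a user-defined stat collector under `name`."""
    def register(func):
        _collectors[name] = func
        return func
    return register

def collect_all():
    """Run every registered collector; a failing plugin yields None."""
    results = {}
    for name, func in _collectors.items():
        try:
            results[name] = func()
        except Exception:
            results[name] = None
    return results

@stat("open_files")
def open_files():
    return 42  # placeholder; a real plugin would measure something

@stat("broken_plugin")
def broken_plugin():
    raise RuntimeError("plugin bug")

report = collect_all()
```

Catching exceptions per collector keeps one faulty user plugin from breaking the whole report.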
Right now, to add a new field, the user has to edit the DB directly, which is not easy.
It would be nice to show the top memory/CPU consumers, so that you know the killer app at a given moment.
How to store and show this information should be considered.
Right now only the predefined stats are shown on the stat page.
We should also show other information (user-defined stats/info) on the page.
Right now, the agent reports stats with a lot of redundant data, for example the CPU stats.
This makes the stat display code complicated and hard to scale.
I should change each stat to report a single number, which makes it easy to change and scale.
Right now the default load balancer uses the flow num to do routing.
But if the user wants to send requests to a certain host, there is no way to do it.
The best way is to add an option in RpcController so that users can control whether the request can be randomly routed to hosts or not.
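A rough sketch of how such an option could be honoured at routing time. The `RpcController` below is a stand-in, not the rpc module's class, and both the `target_host` attribute and the `choose_host` helper are hypothetical.

```python
class RpcController:
    """Stand-in controller; `target_host` is a hypothetical option."""

    def __init__(self, target_host=None):
        self.target_host = target_host

def choose_host(controller, hosts, balancer):
    """Use the pinned host when one is set, otherwise defer to the balancer."""
    if controller.target_host is None:
        return balancer(hosts)
    if controller.target_host not in hosts:
        raise LookupError("unknown host: %s" % controller.target_host)
    return controller.target_host

hosts = ["host-a", "host-b"]
first_host = lambda candidates: candidates[0]   # trivial balancer for the demo
balanced = choose_host(RpcController(), hosts, first_host)
pinned = choose_host(RpcController(target_host="host-b"), hosts, first_host)
```

Leaving the option unset keeps the existing load-balanced behaviour as the default.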
Right now every host is a single entity, while in reality a host may belong to a cluster.
To monitor a cluster, we should sum up the data from its hosts and show the totals to the user.
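The summing step might look like the sketch below. The stat names, host names, and the `host_to_cluster` mapping are invented for the example; how Monsit would actually store cluster membership is not decided here.

```python
from collections import defaultdict

def sum_by_cluster(host_stats, host_to_cluster):
    """Sum each numeric stat over the hosts of every cluster."""
    totals = defaultdict(lambda: defaultdict(float))
    for host, stats in host_stats.items():
        cluster = host_to_cluster.get(host)
        if cluster is None:
            continue  # host is not assigned to any cluster
        for name, value in stats.items():
            totals[cluster][name] += value
    return {cluster: dict(stats) for cluster, stats in totals.items()}

host_stats = {
    "web1": {"mem_used_mb": 512, "net_in_kb": 10},
    "web2": {"mem_used_mb": 256, "net_in_kb": 30},
    "db1": {"mem_used_mb": 2048, "net_in_kb": 5},
}
host_to_cluster = {"web1": "web", "web2": "web", "db1": "db"}
cluster_totals = sum_by_cluster(host_stats, host_to_cluster)
```

Plain sums suit absolute values such as memory or traffic; percentage stats would need an average or a weighted figure instead.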
We should check the timeout handling to find the root cause.
It causes a failed message to keep being sent forever.
Root cause found: clock inconsistency between hosts makes the timeout logic behave incorrectly.
This should be fixed.
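One common way to avoid depending on other hosts' clocks is to pass around a relative timeout and compute the deadline locally from a monotonic clock. The sketch below illustrates that idea; the `Deadline` class is hypothetical and this is not necessarily the fix that was applied.

```python
import time

class Deadline:
    """Deadline computed from the local monotonic clock only."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_seconds

    def expired(self):
        return self._clock() >= self._expires_at

# Demo with a fake clock so the result does not depend on real time.
fake_now = [100.0]
deadline = Deadline(5.0, clock=lambda: fake_now[0])
before = deadline.expired()
fake_now[0] = 105.0
after = deadline.expired()
```

Because no absolute timestamp crosses the wire, skew between the hosts' wall clocks cannot affect the comparison.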
Right now, there is no way to know whether a host is connected from the web interface.
A column should be added to show this, along with the time of the host's last connection.
When the user clicks auto-refresh on the machine state page, every chart tries to connect to the server to get data.
This can be improved by collecting all refresh requests and sending them together.
Show disk IO in the web interface.
Right now, when the TcpConnection fails, it just closes the connection and does nothing else.
It would be better to try reconnecting multiple times, so that when the downstream service goes down, the connection can still recover later.
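A retry loop with exponential backoff is one way to do this. In the sketch below, `connect_with_retry`, its parameters, and the `connect` callable are assumptions for illustration and not part of TcpConnection.

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Demo: a fake connect that fails twice and then succeeds.
calls = []
def flaky_connect():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionRefusedError("service down")
    return "connected"

delays = []
result = connect_with_retry(flaky_connect, sleep=delays.append)
```

Injecting `sleep` keeps the demo instant; in real use the default `time.sleep` would apply the waits.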
"Physical memory" would be easier to understand.
Timeout queue handling has a bug.
Right now the master and agent do not run as daemons.
To run one as a daemon, it has to be started with nohup python server.py &, which is not easy to use.
The master and agent should run in daemon mode by default.
Right now, the only way to detect a connection failure is when the heartbeat times out 3 times.
This is not the best way to detect a connection failure.
We can use request timeouts to help detect connection failures.
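A small sketch of counting consecutive request timeouts as an extra failure signal. The `TimeoutTracker` name and the threshold of 3 are assumptions chosen to mirror the heartbeat rule, not existing behaviour.

```python
class TimeoutTracker:
    """Flag a connection as suspect after several consecutive request timeouts."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive = 0

    def on_response(self):
        # A successful response resets the counter.
        self.consecutive = 0

    def on_timeout(self):
        """Record a timeout; returns True when the connection should be dropped."""
        self.consecutive += 1
        return self.consecutive >= self.threshold

tracker = TimeoutTracker(threshold=3)
signals = [tracker.on_timeout(), tracker.on_timeout()]
tracker.on_response()              # traffic resumed, start counting again
signals += [tracker.on_timeout(), tracker.on_timeout(), tracker.on_timeout()]
```

Busy connections would then notice a failure from their own traffic instead of waiting for idle-time heartbeats.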
When the stat data is updated in the DB, the web interface does not show the new data unless the user clicks the refresh button.
Auto refresh should be supported.
Right now the data series in the interface cannot be zoomed in or out.
It would be more convenient if zooming were supported.