
xcflushd's People

Contributors

chavdam, davidor, mayorova, pankajv82, sekharvajjhala, unleashed


xcflushd's Issues

Doc improvements to README for verification of docker images

Make the following changes to the README related to the verification of docker images:

  1. Add a section on how to verify on RHEL: gpg2 is already installed, so install skopeo and run "make verify".
  2. Organize the README into separate sections on verification and signing, so customers who only want to verify can read the verification section without bothering with signing.
  3. Recategorize the "easy" and "hard" approaches. The "hard" approach, which includes "make verify", can be simple on an OS where the tools are already installed (e.g. gpg2) or easily installed (like skopeo); on other OSes, installing the tools (gpg2 and skopeo) can be hard. "make verify-docker" is easy to use everywhere, so find a good way to capture and describe this.
  4. The image should be verified first and the docker pull done afterwards; verification does not pull the image.

Make failure for xcflushd to reach Service Management API visible

Any issue causing xcflushd to not be able to report data batches back to the Service Management API may over time result in data loss and/or impair the API gateway's ability to function correctly.

Such issues should be appropriately logged as a minimum, and perhaps additional steps taken to enable monitoring of the error(s) so that an Ops alarm can be generated quickly.

Note that, related to #9, this could involve more than one xcflushd instance on one or more physical machines.
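
As a first step, a minimal sketch of the kind of logging that could be added (the reporter call, its arguments, and the logger are hypothetical placeholders, not xcflushd's actual API):

    require 'logger'

    logger = Logger.new($stdout)

    begin
      # `reporter` and `batch` are hypothetical placeholders for xcflushd's
      # reporting code and the data batch being flushed.
      reporter.report(batch)
    rescue StandardError => e
      # Log at ERROR level so monitoring can pick the failure up and raise an
      # Ops alarm quickly; repeated failures here can eventually mean data loss.
      logger.error("Failed to report batch to the Service Management API: " \
                   "#{e.class}: #{e.message}")
      raise
    end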

Makefile: Add an option to always fetch keys from PGP keyserver

Currently, the verification process in the Makefile checks whether there is a .asc file locally. If present, the keys are imported from the .asc file; they are fetched from a PGP keyserver only if the .asc file is not present. Provide an option to always fetch the keys from the PGP keyserver into the PGP keyring.

Max num of threads in params is ignored

There are two parameters to configure the number of threads of the thread pools that we use: prio-threads (for the priority auth renewer) and threads (for the main thread pool). Both of them accept a min and a max. However, the max is ignored.

The reason is that we are using concurrent-ruby's ThreadPoolExecutor without specifying a max size for the queue, and that type of pool only spawns a new thread when the queue is full. For more details check: http://ruby-concurrency.github.io/concurrent-ruby/file.thread_pools.html

Specifying a max size for the queue creates new problems, like deciding what to do with new jobs when the queue is full and no new threads can be created. I think that for our use case, a FixedThreadPool is fine.

The workaround until we solve this is to specify max:max in the params instead of min:max, that is, to pass the max value for both the min and the max.
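
For illustration, a minimal sketch of the behavior described above using concurrent-ruby (pool sizes are arbitrary):

    require 'concurrent'

    # With the default unbounded queue, ThreadPoolExecutor only spawns threads
    # beyond min_threads once the queue is full -- which never happens, so the
    # pool is effectively capped at min_threads and max_threads is ignored.
    pool = Concurrent::ThreadPoolExecutor.new(min_threads: 2, max_threads: 10)

    # A FixedThreadPool uses the same number of threads for min and max, so
    # the configured size is honored.
    fixed = Concurrent::FixedThreadPool.new(10)

    100.times { fixed.post { sleep 0.1 } }
    fixed.shutdown
    fixed.wait_for_termination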

Close the gap of in-process requests by caching pubsub responses for a bit

When a request misses the cache, it asks the on-demand mechanism for a response. While that is being processed (i.e. contacting the configured backend), another request can arrive, miss the cache, and invoke the on-demand mechanism again. The responses could happen in an unfortunate sequence so that the same work would need to be done twice (or even more, depending on latencies, which could increase the window).

To reduce the impact of this, we can implement a small caching mechanism (memoizer) for pubsub in which combinations that have been requested very recently would be responded to ASAP, while those requested too long ago (realistically, only a few seconds) would still behave as if there were no cache.
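
A minimal sketch of the memoizer idea (class, method, and parameter names are illustrative, not xcflushd's actual code):

    require 'concurrent'

    class PubsubMemoizer
      def initialize(ttl = 5) # seconds; realistically only a few seconds
        @ttl = ttl
        @store = Concurrent::Map.new
      end

      # Returns the recently cached response for a combination, or yields to
      # the on-demand mechanism (i.e. contacting the configured backend).
      def fetch(combination)
        entry = @store[combination]
        return entry[:response] if entry && Time.now - entry[:cached_at] < @ttl

        response = yield
        @store[combination] = { response: response, cached_at: Time.now }
        response
      end
    end

Note that this narrows the window but does not fully close it: two concurrent misses on the same combination can still both reach the backend unless in-flight requests are also coalesced.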

Improve logging when getting process signals

Currently, if the process receives a signal, for example SIGTERM, it prints:

FATAL -- : Unhandled exception SignalException, shutting down:  - SIGTERM

It looks like there is something wrong, even though the behavior is correct. The log messages should not be alarming in this case.
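
One possible approach (a sketch, not xcflushd's actual implementation) is to trap the signals expected during a normal shutdown and log them at a non-alarming level:

    require 'logger'

    logger = Logger.new($stdout)
    shutdown = Queue.new

    # Trap the signals expected during a normal shutdown; the handler body is
    # kept minimal because most locking operations are forbidden in trap context.
    %w[TERM INT].each { |sig| Signal.trap(sig) { shutdown << sig } }

    sig = shutdown.pop # block until a signal arrives
    logger.info("Received SIG#{sig}, shutting down gracefully")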

Force log flushing

The standard Ruby logger buffers messages before printing them to STDOUT. For this reason, the logs are sometimes not visible in the container logs.

A quick fix for that is to include:

STDOUT.sync = true

which forces the logs to be printed immediately.

This is not optimal, though. We need to review the logging and probably do the sync periodically.
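
For reference, a minimal sketch of the quick fix:

    require 'logger'

    # Put STDOUT into sync mode before creating the logger so every message
    # is flushed immediately instead of sitting in the buffer.
    STDOUT.sync = true
    logger = Logger.new(STDOUT)
    logger.info('this line shows up in the container logs right away')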

Allow for multiple instances of xcflushd to run concurrently

Goal:
Avoid having a single point of failure in XC, which results from only allowing one instance of xcflushd to run and flush the data at a time (on a single machine, or on separate machines).

Also, allow for roll-over deployments in orchestration frameworks where a new copy of xcflushd is launched and the older one killed, avoiding downtime.

Proposal:
Change xcflushd to allow more than one instance to run concurrently, accessing the data and flushing it to the Service Management API without data corruption, loss, or double-counting.

Out of scope for this issue is whether we force or ensure that more than one instance is running, what tools are used for that, and how roll-over deployments are handled. All of those may vary depending on how XC is deployed; this issue is limited to the code changes needed to allow more than one instance to run.

Make error handling more robust

The way we retrieve cached reports from Redis can lead to losing or duplicating some reports if the Redis connection fails while running some specific commands. This should only happen rarely.

The problematic command is rename. When we rename keys, we give them a unique suffix so they are not overwritten in the next flushing cycle and we can retrieve their content later.

When we retrieve their content successfully, we delete them.

The problem is that the delete operation can fail. When trying to recover the contents of leftover renamed keys, we will not be able to distinguish these 2 cases:

  1. The key is there because we decided not to delete it, in order to retrieve its content later.
  2. The key is there because the delete operation failed.

We could take a look at the logs to figure out what happened, but of course that is not an ideal solution.
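
To make the sequence concrete, a sketch of the steps using the redis-rb client (key names and data types are illustrative, not xcflushd's actual schema):

    require 'redis'
    require 'securerandom'

    redis = Redis.new

    key     = 'xcflushd_reports'            # hypothetical cached-reports key
    renamed = "#{key}:#{SecureRandom.uuid}" # unique suffix, so the next
                                            # flushing cycle does not overwrite it

    redis.rename(key, renamed)        # 1) set the current batch aside
    contents = redis.hgetall(renamed) # 2) retrieve its content
    redis.del(renamed)                # 3) delete only after successful retrieval

Each of the three steps can fail independently; a leftover renamed key found after a crash or connection failure could mean either a pending retrieval or a failed delete.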
