3scale / apisonator Goto Github PK

View Code? Open in Web Editor NEW

35.0 35.0 27.0 4.6 MB

Red Hat 3scale API Management Apisonator backend

Home Page: https://3scale.net

License: Apache License 2.0

Ruby 98.52% Makefile 0.71% Shell 0.22% Dockerfile 0.54%

apisonator's People

Contributors

Stargazers

Watchers

apisonator's Issues

Async execution model memory leak

It looks like when enabling async Apisonator leaks memory. This might be fixed in upstream dependencies of the async reactor, but we are blocked on #303.

@3scale/operations can you provide more data on this? Grafana dashboard screen captures would be nice to have.

Undefined dependencies for running apisonator.

Hi,
I am attempting to test the apisonator project on Fedora 28.
The Running tests suggests using the script
$ script/test
But I am finding this script is expecting some tools/software in the /opt directory (amongst other locations).
I would have expected the documentation on testing to have listed the following as dependencies

Redis
PostgreSQL
Nutcracker

Do you agree ?

Rewrite or remove event_storage spec that keeps marking builds as failed

This spec failure:

Failures:

  1) ThreeScale::Backend::EventStorage.ping_if_not_empty with events in set with two calls in same moment (race condition) returns false the second time
     Failure/Error: values = threads.each(&:wakeup).map { |thread| thread.join.value }
     
     fatal:
       No live threads left. Deadlock?
     # ./spec/unit/event_storage_spec.rb:224:in `join'
     # ./spec/unit/event_storage_spec.rb:224:in `block (6 levels) in <module:Backend>'
     # ./spec/unit/event_storage_spec.rb:224:in `map'
     # ./spec/unit/event_storage_spec.rb:224:in `block (5 levels) in <module:Backend>'

Finished in 2.77 seconds (files took 2.09 seconds to load)
938 examples, 1 failure

is much more likely to be triggered by the Circle CI infrastructure. In fact, is the most likely reason tests fail, and it is becoming so annoying I am seriously considering removing the whole thing.

Apisonator data-store scalability issue.

Hi,
I've been performance testing 3Scale AMP v2.3 and come across a scalability issue. Nearly all the AMP system is scalable except for one component.
I've scaled up the pods on the system as suggested in the documentation. Increasing gateway, backend-listener (apisonator) and backend-worker pods.
The issue is with the Redis jvm instance. Redis running as a single threaded JVM process becomes the bottleneck in the system.

This should not be shocking news to apisonator project community.

To take the project beyond this scalability bottleneck something needs to change.

There will be effort involved to find a suitable data-store, migrate APIs and performance test scalability.

Is there any appetite for substituting the Redis instance with something that can scale out ?

Inconsistency in response when are limits are exceeded

Perhaps it is my misunderstanding but I am finding it confusing the way rejections due to breach of rate limits are presented to the caller.

For example in a standard response where I have exceeded the limits, I would get something like this:

<?xml version="1.0" encoding="UTF-8"?>
<status>
  <authorized>false</authorized>
  <reason>usage limits are exceeded</reason>
  <plan>Basic</plan>
  <usage_reports>
    <usage_report metric="hits" period="minute">
      <period_start>2018-09-01 14:44:00 +0000</period_start>
      <period_end>2018-09-01 14:45:00 +0000</period_end>
      <max_value>1</max_value>
      <current_value>1</current_value>
    </usage_report>
  </usage_reports>
</status>

So we see there we have a human readable reason but we have not gotten an error code tag.

Now if I look at the docs here https://github.com/3scale/apisonator/blob/master/docs/rfcs/error_responses.md#currently-known-error_codes-and-proposed-classification I can see that limits_exceeded is a known error code that can be mapped to a 409 response, so that is slightly conflicting with the actual response.

What then causes further confusion is if I use the rejection_reason_header header is see that limits_exceeded is embedded in the response headers.

Personally, what I would like to see is the limits_exceeded as part of the xml in the error_code tag for consistency. I don't want to have to enable an extension for the single case where I need to know that I've exceeded limits, as per the docs linked above.

Excluding Git metadata and project examples from gems.

Hi,
Taking a quick peek inside a generated apisonator container I see the following directories.

ruby/3scale_backend-2.89.0/vendor/bundle/ruby/2.3.0/bundler/gems/puma-9b17499eeb49/.git
ruby/3scale_backend-2.89.0/vendor/bundle/ruby/2.3.0/bundler/gems/puma-9b17499eeb49/examples
ruby/3scale_backend-2.89.0/vendor/bundle/ruby/2.3.0/bundler/gems/resque-88839e71756e/.git
ruby/3scale_backend-2.89.0/vendor/bundle/ruby/2.3.0/bundler/gems/resque-88839e71756e/examples

Git metadata can quickly add up in size. Are these directories good candidates for exclusion ?

It might be the case this suggestion needs raised upstream to the respective gem projects.

Add extension to list application keys

This extension would be useful for caches, much like hierarchy, to avoid contacting 3scale when they find an application key they don't know anything about.

Without this extension caches need to contact 3scale even if they know an app is within limits and otherwise it's been authorized before with a different app key, because this new application key might actually exist in the database or not and then the request should be rejected. If a user kept calling a cache with different app keys, a cache would be forced to keep contacting 3scale.

With this extension a cache can take the opportunity in which they learn about metric hierarchy to also list the set of accepted application keys. Like the hierarchy, this information would then be periodically retrieved to pick up any updates.

Security-wise there is no privilege boundary crossed, since a cache already has full access to a 3scale account via the Porta APIs.

Ensure that metric hierarchies of more than 2 levels are supported

Up until now, in Porta it was possible to define hierarchies of metrics with only 2 levels. This is going to change to support the "APIs as a product" feature.

For apisonator, we need to ensure that hierarchies of more than 2 levels are supported. This means that we need to check things like:

The methods to create metrics work correctly.
The hierarchy extension works correctly.
The limits extension works correctly.
Limits are applied checking all the levels of the hierarchy.
Reports take into account all the levels of the hierarchy.
The XML returns the info of all the relevant metrics.

Some of those things might already be working fine, but I think in the past, when implementing some features, we've assumed that it was fine to assume a max of 2 levels.

Guarantee that a previous report has been performed when calling authorization

This is a topic proposed by @unleashed . I'll quote what he said:

The problem arises when we want to report some usage and then perform an authorization. The and then part involves a guarantee: reporting should have been performed before authorization is evaluated.

We currently do not have a repauth endpoint (it is not clear whether that would work well without resorting to OOB jobs), so we could support this flow issuing two different calls as of now, but with a guarantee.

My idea is that apisonator could create a token when reporting, pass it to the caller, pass it to the job, and have the job create a key with a reasonable expiration time once it finishes reporting. This way, we could modify the authorization calls to receive an optional token that would be checked before proceeding with authorization. If the token did exist, the call would keep calm and carry on. Otherwise it would signal the problem and ask the client to try again soon-ish.

Redshift import crashes when there is no data for a specific hour

In S3, events exported via Kinesis are grouped by hour and it looks like the RedshiftAdapter crashes when it cannot find a directory (in this case '2017/02/18/11').

Here's an old stacktrace that shows the problem:

irb(main):006:0> ThreeScale::Backend::Stats::RedshiftAdapter.insert_pending_events
Loading events generated in hour: 2017-02-18 11:00:00 UTC
PG::InternalError: ERROR:  The specified S3 prefix '2017/02/18/11' does not exist
DETAIL:  
  -----------------------------------------------
  error:  The specified S3 prefix '2017/02/18/11' does not exist
  code:      8001
  context:   
  query:     634029
  location:  s3_utility.cpp:568
  process:   padbmaster [pid=29509]
  -----------------------------------------------


        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:304:in `exec'
        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:304:in `execute_command'
        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:349:in `import_s3_path'
        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:334:in `save_in_redshift'
        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:255:in `block in insert_pending_events'
        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:253:in `each'
        from /var/lib/gems/2.2.0/gems/3scale_backend-2.69.0/lib/3scale/backend/stats/redshift_adapter.rb:253:in `insert_pending_events'
        from (irb):6
        from /usr/bin/irb2.2:11:in `<main>'

Testing apisonator.

Hi,
I have attempted to start testing apisonator project. Using the instructions.

Seeing an error when running make test

$ docker run -ti --rm -h apisonator-test -v /thebounty/work/redhat/3scale/apisonator:$(docker run --rm apisonator-test /bin/bash -c 'cd && pwd')/apisonator:z -u $(docker run --rm apisonator-test /bin/bash -c 'id -u'):$(docker run --rm apisonator-test /bin/bash -c 'id -g') --name apisonator-test apisonator-test
/home/ruby/.bash_rbenv: eval: line 14: syntax error near unexpected token `)'
/home/ruby/.bash_rbenv: eval: line 14: `  )'
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]
/home/ruby/.bash_rbenv: eval: line 14: syntax error near unexpected token `)'
/home/ruby/.bash_rbenv: eval: line 14: `  )'
/home/ruby/.bash_rbenv: eval: line 14: syntax error near unexpected token `)'
/home/ruby/.bash_rbenv: eval: line 14: `  )'
/home/ruby/apisonator/script/lib/functions: line 5: $'\r': command not found
/home/ruby/apisonator/script/lib/functions: line 7: syntax error near unexpected token `$'{\r''
'home/ruby/apisonator/script/lib/functions: line 7: `function daemonize {
/home/ruby/apisonator/script/lib/rbenv/ruby_versions: line 2: $'\r': command not found
/home/ruby/apisonator/script/lib/rbenv/ruby_versions: line 8: syntax error near unexpected token `$'{\r''
'home/ruby/apisonator/script/lib/rbenv/ruby_versions: line 8: `regex_escape() {
script/test: line 46: start_services: command not found
script/test: line 22: bundle_exec: command not found
script/test: line 9: stop_services: command not found
Failed tests in default version
$

Improve performance of stats deletion background jobs

We needed to temporarily disable stats deletion background jobs because they are inefficient and take too much time to complete.

Here are some numbers that can help us find a more efficient solution:

Different periods of times that we can find in stats keys for a whole year: 1 (year) + 12 (months) + 52 (weeks) + 365 (days) + 365*24 (hours) = 9190.
Total number of stats keys in a year: 9190 * n_services * n_apps * n_metrics + 9190 * n_services * n_metrics. The first part corresponds to application stats and the second one to service stats.

I also measured the runtime for different operations that we need to perform in this kind of background job. Bear in mind that these are just approximations. There are many factors that can alter these numbers (redis latency, CPU, etc.):

Number of keys that can be generated in a second: 27k.
How long it takes to delete keys (calling redis.del(keys)): 10k keys in batches of 50, takes 0.15s.
Jobs can be enqueued at 750 jobs/s.

According to the numbers above:

A job that deletes all the stats generated for a whole year for a specific {service, app, metric} combination would take 9190/27000 = 340 ms to generate the keys plus another ~150ms to delete them. That's like half a second in total. Deleting the stats at the service level for a given metric would take the same.
A job that does the same for a month instead of a year would take around 40ms. That would be close to what other kinds of jobs take.

I choose to give the number for partitioning by year and month, but we could choose other granularity.

Regarding response codes, I think they can be treated as 8 extra metrics (202, 403, 404, 500, 504, 2xx, 4xx, 5xx).

A possible implementation would be as follows: the job that generates partitions takes into account all the factors (app, metrics, period of time) and generates small jobs that take a reasonable time to generate the subset of keys that they were assigned and delete them.

As an example, we have seen a job that generates around 10M keys. In order to delete all those keys, with the approach of splitting the work by 1service-1app-1metric-1year as mentioned above. We'd need to generate around 10M/9190 = 1088 jobs. Enqueuing those jobs would take 1088/750 = 1.45s. If we partitioned by month instead of by year, we'd need 10M/(9190/12)=13056 jobs. Which would take 17.4s to be enqueued.

This approach would be much more efficient than the current one. However, the time to enqueue all the smaller jobs could be an issue.

The enqueue time could be reduced if we enqueued jobs using pipelines, which as far as I know, is not something that the Resque client that we are using supports, but should be doable. Alternatively, we could make those calls in parallel.

We could also try to find different ways to generate keys more efficiently. We could analyze the code that does that with stackprof or similar and see if we can optimize something.

@eguzki and I discussed an alternative approach. It would be a recursive approach. Once a job is executed, it would delete some keys and create a new job that has fewer applications, fewer metrics, a shorter period of time, or some combination of all that. In the end, we'd have a job without apps, metrics, and an interval of 0s. Then, we'd know that we've finished deleting all the keys of the original job. This approach removes the cost of enqueuing a large number of jobs in a single job, because each one just enqueues another one. However, I see two problems with this approach. The first one is that deleting all the keys could take a long time, since it serializes the jobs. The second problem I see is that it might not be so easy to reduce the job into a smaller one. For example, if we wanted to send one application less to the next job, we'd need to delete everything for that app first.

Let me know what you think @unleashed , @eguzki .

Expose queue sizes via Prometheus metrics

In some environments, that's done using a Redis exporter. However, that's not available in some setups. For that reason, it'd be nice if the Apisonator workers exposed the queue sizes via Prometheus.

Update locale & twemproxy

On ppc64le, upstream make ci-build resulted in the following error(s) :

Step 18/63 : RUN sudo runuser -l postgres -c     "${POSTGRES_PREFIX}/bin/initdb --pgdata='${POSTGRES_DATA_PREFIX}/data' --auth='trust'" < /dev/null     && sudo runuser -l postgres -c     "${POSTGRES_PREFIX}/bin/postgres -D '${POSTGRES_DATA_PREFIX}/data'"       > /tmp/postgres.log 2>&1 & sleep 5     && sudo runuser -l postgres -c "${POSTGRES_PREFIX}/bin/createdb test"     && sudo runuser -l postgres -c "${POSTGRES_PREFIX}/bin/psql test"
 ---> Running in 9acba8a3c1ee
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

initdb: invalid locale settings; check LANG and LC_* environment variables
createdb: could not connect to database template1: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

& twemproxy version v0.4.1 cannot be built, as it does not recognize underlying architecture.
And is fixed in v0.5.0

config/config.guess: unable to guess system type

This script, last modified 2009-04-27, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
and
  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD

If the version you run (config/config.guess) is already up to date, please
send the following data and any information you think might be
pertinent to <[email protected]> in order to provide the needed
information to handle your system.

config.guess timestamp = 2009-04-27

uname -m = ppc64le
uname -r = 4.18.0-240.10.1.el8_3.ppc64le
uname -s = Linux
uname -v = #1 SMP Mon Jan 18 17:21:08 UTC 2021

/usr/bin/uname -p = ppc64le
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = ppc64le
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = ppc64le
UNAME_RELEASE = 4.18.0-240.10.1.el8_3.ppc64le
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP Mon Jan 18 17:21:08 UTC 2021
configure: error: cannot guess build type; you must specify one
configure: error: ./configure failed for contrib/yaml-0.1.4

Cannot find service token even though it's present in redis

Hi there! I was using a docker image of Apisonator to avoid reliance on SaaS for integration testing of WASM filters, being developed under the GSoC'21 program. I am using internal APIs to initialize service ids, tokens, and applications. Even though all calls are successful and registered by Apisonator and Redis, authorize endpoint is not able to resolve the service token.

Script to reproduce the error:

echo "Start Redis"
docker run -p 6379:6379 -d --name my-redis redis --databases 2

echo "Start Apisonator"
docker run -e CONFIG_QUEUES_MASTER_NAME=redis://redis:6379/0 \
        -e CONFIG_REDIS_PROXY=redis://redis:6379/1 -e CONFIG_INTERNAL_API_USER=root \
        -e CONFIG_INTERNAL_API_PASSWORD=root -p 3000:3000 -d --link my-redis:redis \
        --name apisonator quay.io/3scale/apisonator 3scale_backend start

echo "Wait for redis and apisontor to launch"
sleep 5

echo "Create a service"
curl -d '{"service":{"id":"my_service_id","state":"active"}}' http://root:[email protected]:3000/internal/services/ | jq '.'

echo "Create a service id and token pair"
curl -d '{"service_tokens":{"my_service_token":{"service_id":"my_service_id"}}}' http://root:[email protected]:3000/internal/service_tokens/ | jq '.'

echo "Add application"
curl -d '{"application":{"service_id":"my_service_id","id":"my_app_id","plan_id":"my_plan_id","state":"active"}}' http://root:[email protected]:3000/internal/services/my_service_id/applications/my_app_id | jq '.'

echo "Check if service exists or not (Should return back service in JSON format)"
curl http://root:[email protected]:3000/internal/services/my_service_id | jq '.'

echo "Check if pair exists or not (should return 200 OK)"
curl --head http://root:[email protected]:3000/internal/service_tokens/my_service_token/my_service_id/

echo "Check pair without head (returns 'not found')"
curl http://root:[email protected]:3000/internal/service_tokens/my_service_token/my_service_id/ | jq '.'

echo "Use Authorize endpoint (returns 'service_token_invalid'):"
curl "http://0.0.0.0:3000/transactions/authorize.xml?service_token=my_service_token&service_id=my_service_id&user_key=my_user_key"

sleep 2

echo "Clean up"
docker rm my-redis -f
docker rm apisonator -f

Apisonator logs:

172.17.0.1 - root [12/Jul/2021 11:50:59 UTC] "POST /internal/services/ HTTP/1.1" 201 169 0.030991 0 0 0 0 2 1 - -

172.17.0.1 - root [12/Jul/2021 11:50:59 UTC] "POST /internal/service_tokens/ HTTP/1.1" 201 20 0.0023496 0 0 0 0 2 1 - -

172.17.0.1 - root [12/Jul/2021 11:50:59 UTC] "POST /internal/services/my_service_id/applications/my_app_id HTTP/1.1" 201 180 0.0138289 0 0 0 1 3 1 - -

172.17.0.1 - root [12/Jul/2021 11:50:59 UTC] "GET /internal/services/my_service_id HTTP/1.1" 200 167 0.0052454 0 0 0 3 5 1 - -

172.17.0.1 - root [12/Jul/2021 11:51:00 UTC] "HEAD /internal/service_tokens/my_service_token/my_service_id/ HTTP/1.1" 200 - 0.0069157 0 0 0 4 6 1 - -

172.17.0.1 - root [12/Jul/2021 11:51:00 UTC] "GET /internal/service_tokens/my_service_token/my_service_id/ HTTP/1.1" 404 42 0.002605 0 0 0 4 6 1 - -

172.17.0.1 - - [12/Jul/2021 11:51:00 UTC] "GET /transactions/authorize.xml?service_token=my_service_token&service_id=my_service_id&user_key=my_user_key HTTP/1.1" 403 - 0.0026236 0 0 0 6 10 3 - -

Redis keys dump (using docker exec -it my-redis redis-cli; select 1; keys *):

1) "application/service_id:my_service_id/id:my_app_id/state"
 2) "service/id:my_service_id/state"
 3) "service/id:my_service_id/referrer_filters_required"
 4) "services_set"
 5) "application/service_id:my_service_id/id:my_app_id/plan_id"
 6) "service_id:my_service_id/applications"
 7) "service/provider_key:/ids"
 8) "service_token/token:my_service_token/service_id:my_service_id"
 9) "application/service_id:my_service_id/id:my_app_id/user_required"
10) "service/id:my_service_id/provider_key"
11) "provider_keys_set"

Checking for the service pair with '--head' makes sense because there is a path listed for it and not for GET. But @unleashed asked me to mention it in this Issue.

apisonator/app/api/internal/service_tokens.rb

Lines 6 to 8 in ea23574

    
           head '/:token/:service_id/' do |token, service_id| 
        
             ServiceToken.exists?(token, service_id) ? 200 : 404 
        
           end

I am not sure why authorize endpoint is not able to resolve the service token which is required for integration tests for the wasm-filters.

Please let me know if I missed anything, thanks!

Update: I mistakenly added the provider key into the JSON data sent for creating a service and error message change from "invalid service token" to "user key missing/not found" (which makes sense as I haven't initialized any user key). So, I think, a more relevant error should be used (pertaining to provider key); OR, @unleashed mentioned that the provider key is deprecated so maybe there shouldn't be any reliance on it?

Delete redis monkey-patch after upgrading to redis-rb >= 4.1.2

This is just a reminder to delete 83c822c after we upgrade the redis-rb gem to >= 4.1.2

Tracking issue for clearing up what to do about OIDC apps' client_secrets being stored as app_keys by Porta

We learnt in #280 that Porta is storing OIDC apps' client_secrets as app_key's, and that has caused confusion as to how to deal with OIDC in the 3scale Istio Adapter, as specifying the client_secret as an app_key while using the auth*.xml endpoints ends up in successfully authorizing requests.

This issue should be resolved when we know why this is being done and whether we should remove/not allow these keys to be stored for such apps, and consequently, whether a request for an OIDC service specifying an app_key parameter should be checked against the registered app_keys that we have in our data store.

/cc @davidor

Tracking issue for removing the Redirect validator

This validator might no longer be necessary. This issue should be closed whenever we find out whether it is or isn't needed as of the latest 3scale versions: it might be the case the admin UI (Porta) does no longer allow people to set these.

This is so because the only OAuth apps users should be creating now should be using OpenID Connect, which handles redirects in proxies way before 3scale gets to authorize/report.

Arising from #280 (comment).

/cc @davidor

Exception: undefined method `load_metric_names' for nil:NilClass

Got this exception while running v3.2.1:

lib/3scale/backend/stats/aggregator.rb:148:in `block in update_alerts': undefined method `load_metric_names' for nil:NilClass (NoMethodError)
    from lib/3scale/backend/stats/aggregator.rb:143:in `each'
    from lib/3scale/backend/stats/aggregator.rb:143:in `update_alerts'
    from lib/3scale/backend/stats/aggregator.rb:62:in `process'
    from lib/3scale/backend/transactor/process_job.rb:16:in `perform'
    from lib/3scale/backend/transactor/report_job.rb:15:in `perform_logged'
    from lib/3scale/backend/background_job.rb:33:in `perform_wrapper'
    from lib/3scale/backend/background_job.rb:12:in `perform'
    from bundler/gems/resque-88839e71756e/lib/resque/job.rb:168:in `perform'
    from lib/3scale/backend/worker.rb:69:in `perform'
    from lib/3scale/backend/worker_sync.rb:23:in `block in work'
    from lib/3scale/backend/worker_sync.rb:19:in `loop'
    from lib/3scale/backend/worker_sync.rb:19:in `work'
    from lib/3scale/backend/worker.rb:47:in `work'
    from bin/3scale_backend_worker:25:in `block in <top (required)>'
    from gems/daemons-1.2.4/lib/daemons/application.rb:266:in `block in start_proc'
    from gems/daemons-1.2.4/lib/daemons/daemonize.rb:84:in `call_as_daemon'
    from gems/daemons-1.2.4/lib/daemons/application.rb:270:in `start_proc'
    from gems/daemons-1.2.4/lib/daemons/application.rb:296:in `start'
    from gems/daemons-1.2.4/lib/daemons/controller.rb:56:in `run'
    from gems/daemons-1.2.4/lib/daemons.rb:197:in `block in run_proc'
    from gems/daemons-1.2.4/lib/daemons/cmdline.rb:92:in `catch_exceptions'
    from gems/daemons-1.2.4/lib/daemons.rb:196:in `run_proc'
    from bin/3scale_backend_worker:24:in `<top (required)>'
    from /usr/local/bin/3scale_backend_worker:22:in `load'
    from /usr/local/bin/3scale_backend_worker:22:in `<main>'

I've only seen this once and the job was successful after rescheduling it, so this is probably a corner case that triggers very infrequently.

Revamp README.md orienting it to deployment/usage and extract dev instructions

The current README file is heavily oriented towards developers / contributors, but it barely has information on how to deploy and use Apisonator. We can split both aspects and have a README that is more oriented towards new users installing and using the software. We also have broken links to documentation in the old 3scale website.

I think it would be good to have a look at it from the point of view of a user wanting to install and start using it (even if we suppose they are not acquainted with the rest of the 3scale platform or just want to test or push data to the Redis database).

timestamps in transactions can blindly take invalid input

The usage of Date._parse when taking the input of the timestamp field of transactions is insufficient to validate a date. In particular, it's been discovered that some strings with a specific number of digits are considered dates by the affected code.

Note that the documentation only talks about a specific format for dates in the timestamp field, so for example we might want to consider changing this so proper validation happens as specified in the docs.

Expose stats endpoints in the internal API

It'd be great to define stats endpoints in the internal API so Porta does not have to access the Redis DB directly.

Prometheus metrics for the internal API endpoints

It would be good to have Prometheus metrics for the Internal API endpoints.
The implementation would be similar to what we already have for the auth and report endpoints.

Delete SAAS env var

We should delete the SAAS env. There are no differences any more.

In the process, we should also unify all the dependencies under a single Gemfile and reorganize the Rake tasks to stop depending on the env.

Document "hierarchy" extension

Unfortunately, the list of documented extensions does not include the hierarchy extension.

This extension is useful for caching purposes, and could be extended or used in conjunction with existing or new extensions to make caching more effective.

Extension to report metrics without taking hierarchies into account

Caching agents essentially need to replicate some functionality in Apisonator. Because these agents need to take metric hierarchies into account for correctly applying limits, they need to keep track of the current counters for them. When reporting, these agents need to figure out what delta to report, that is, how much a metric was consumed.

However, this is problematic in the case of metrics that have a parent (or in newer releases even grandparents and so on), because they need to walk over the hierarchy carefully to take into account that whatever they report in a child metric will be added to in the parent, grandparent, etc.

They need to compute this because Apisonator will add to parents. This issue is a new feature request to help these agents avoid doing work that is only meant to be undone by Apisonator. That is, agents will keep a delta for each and all metrics involved in a given reporting period, but they will need to recompute the deltas with hierarchy data when "flushing" this information, that is, decreasing the delta values for parents by their children's deltas, only for Apisonator to take that and increase the parents' deltas by their childrens'.

So this feature would be adding an extension for reporting (authrep and report endpoints) in which hierarchy computation would be turned off. That is, the calling agent tells us to trust them to have correctly computed the deltas based on the current metric hierarchy for a given service.

This benefits both the agent and Apisonator: the agent won't need to touch the deltas, it will just report hits for all involved metrics, and Apisonator won't have to apply the reported values to parents (which is effectively undoing the work made by the agent).

This implies work in the aggregator so that if this extension is enabled for the reporting request we take the specified metrics values and just add one by one to our current values without any further processing.

Ideas for a new version of the auth and report APIs

This is just a list of things to consider when we decide to expose new APIs for authorizing and reporting. It applies to other APIs that we might want to expose for example to improve caching done by external systems. In no particular order, some things that we've discussed in the past:

Avoid including unnecessary information in the response XMLs.
Support JSON in all the endpoints.
Make sure the APIs are compatible with the swagger UI tooling.
Consider changing some response codes to the ones described in this RFC: https://github.com/3scale/apisonator/blob/master/docs/rfcs/error_responses.md

Total job process time logged is not correct

For each processed job, the workers log some information like the process time, the total time (process time + time in the queue), etc.

The second is not correct. The reason is that the enqueue time is stored in Redis by a listener when it enqueues a job:

apisonator/lib/3scale/backend/transactor.rb

Line 204 in 18edf5c

Resque.enqueue(ReportJob, service_id, data, Time.now.getutc.to_f, context_info)

Whereas the total time is calculated by the worker when it finished processing a job:

apisonator/lib/3scale/backend/background_job.rb

Line 43 in 18edf5c

" #{(end_time.to_f - enqueue_time).round(5)}"+

That's the reason why sometimes we get inconsistent numbers in the logs. For example, sometimes we see that the process time is greater than the process time plus the time in the queue. In many deployments, those two timestamps always come from different machines (listeners vs background workers) so comparing them is not reliable.

Drop support for Ruby 2.4 (when possible)

We can't do this yet, but when possible, we should drop support for Ruby 2.4.
After that, we'll be able to update some gems like the async ones.

Disable/Enable Services

When service is disabled, the applications under all these service should all stop being authed.
Service can be enabled back. Apps start authorizing again.

Only public API endpoints should be affected by service state. Internal API should work regardless service state.

Update license_finder including XML reporter

The XML reporter has been upstreamed so we no longer need to maintain an extra dependency and can finally also update.

See this PR for details.

Provide a way to create metrics that are disabled by default

We need to provide a way to create metrics that are disabled (limit of 0). This is needed to support the "APIs as a product" feature in Porta.

This might also require changes in the Pisoni gem.

Ideas to improve the processing of background jobs

These are some ideas that we have discussed in the past to improve the processing of background jobs:

Evaluate Sidekiq as a replacement for Resque.
Implement a way to process at least a small % of non-priority jobs even when there are priority jobs pending.
Break up big report jobs into smaller ones.
Improve the way we reschedule failed jobs to avoid growing queues under stress load. The problem with this is that when an incident happens, the queues start growing, and the priority one in particular grows very fast. Shoving more jobs on top of a growing queue is not a good idea, as the system clearly can't cope with the workload

Async mode not compatible with Redis logical databases

The async-redis client does not support Redis logical databases. That means that it does not work properly on environments that have for example, the main db in redis://redis-backend:6379/0 and the queues in redis://redis-backend:6379/1: https://github.com/3scale/3scale-operator/blob/master/pkg/3scale/amp/auto-generated-templates/amp/amp.yml

Ref: socketry/async-redis#15

Use Redis pipelining in NotifyBatcher.do_batch

We should group the jobs and use Redis pipelining here:

apisonator/lib/3scale/backend/transactor/notify_batcher.rb

Line 104 in ea23574

all.each do |_, v|

[RFE] Allow Severity Level for different apisonator component logs to be configured

Currently there is no way to define a severity level for logging apisonator events.

For customers with large numbers of requests, they can find that they are using a lot of disk storage for daily logs at the current log level.

For backend-listener
- request logs
- server logs
For backend-worker
- Can currently send logs to /dev/null by setting the CONFIG_WORKERS_LOG_FILE but this means that no logs are sent.
- INFO events are currently logged for every report job.
Other components?

It would be very useful to be able to set the log level with a variable, e.g rails_log_level

INFO, WARNING and ERROR severity levels should be enough for most customers, although we may want to think about adding a DEBUG level also depending on the component.

Evaluate Envoy as an alternative to Twemproxy

Twemproxy is no longer maintained and Envoy can act as a Redis proxy: https://www.envoyproxy.io/docs/envoy/v1.13.0/api-v2/config/filter/network/redis_proxy/v2/redis_proxy.proto

I tried to pass the test container by simply stopping Twemproxy and starting an Envoy with an equivalent config:

static_resources:
  listeners:
  - name: redis_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 22121
    filter_chains:
    - filters:
      - name: envoy.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.redis_proxy.v2.RedisProxy
          stat_prefix: egress_redis
          settings:
            op_timeout: 5s
            enable_hashtagging: true
          prefix_routes:
            catch_all_route:
              cluster: redis_cluster
  clusters:
  - name: redis_cluster
    connect_timeout: 1s
    type: strict_dns # static
    lb_policy: RING_HASH
    load_assignment:
      cluster_name: redis_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 7379
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 7380
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8001

All the tests pass except some in the BucketStorage and BucketReader classes. They fail because they use the SUNION command, which is not supported in Envoy. We could replace those SUNIONS with SMEMBERS and perform the union in Ruby. It would probably be less efficient.

Evaluate removing concept of default service

Recently we introduced the ability to remove a default service when it's the only one that a provider has (#51).

I think we should evaluate if it makes sense to remove the concept of a default service from the apisonator codebase. I believe the responsibility of deciding whether a service can be removed or not should be only in https://github.com/3scale/porta

Add Prometheus metrics for Falcon

Puma already has Prometheus metrics (added in #174), We need to do the same for Falcon.

Delete stats endpoint must not accept endusers

Currently, delete stats endpoint

DELETE /internal/services/#{service_id}/stats

Accepts as request body

{ 
  "deletejobdef": {
     "applications": ["1"],
     "metrics": ["5"],
     "from": 1483228800,
     "to": 1483228800, 
     "users": []
  }
}

The list of endusers users exists only in apisonator database. Porta client does not have list of end users. Moreover, Apisonator does not have internal endpoint te get list of endusers.

So, requested change is:

Remove users from request body
Get the list of endusers internally using method ThreeScale::Backend::User.service_users_set_key(service_id) when delete stats endpoint is called and pass the list to stats key generator.

Enhance development user experience

#353
#354
- Support only one ruby version
- Separated containers for tweemproxy and sentinels
- Separated contatiners for redis. New redis version. Maybe 6.2??
#355
#356
#357

Async: Avoid leaking exception messages in response body

When there's an exception, Falcon returns its description in the response body. Puma only does that in development mode. I opened an issue in Falcon to fix this socketry/falcon#98

Webhook: refactor and solve alerts ending up requiring additional requests

Issue reported by @unleashed:

backend requires further activity to "flush" alerts to system.

The whole alerts code (event_storage.rb particularly) is rather low quality. We should improve the design (avoiding the problem above) and fix the awful code that handles the webhook. A Webhook class should be created (nice if it also can handle custom Host headers), and used, hopefully without the need to re-create the client and being configurable for timeouts with proper error handling.

The "proper error handling" part is avoiding horrible stuff like rescue => e; notify(); end which caused NoMethodError, NameError and LoadError among others to inflict infinite pain to me for hours.

ppc64le PowerOs issue

Hello Team,

trying to validate the 3scale/apisonator repository getting issue while executing the following command :
Command

make DOCKER_OPTS="-e TEST_ALL_RUBIES=1" test

Issue

ruby@b650a9e48ae7 bin]$ bundle_install_rubies
Switching to 2.7.4
Latest version already installed. Done.
Bundling Gemfile on ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [powerpc64le-linux] with Bundler 2.2.26

[!] There was an error parsing `Gemfile`:
[!] There was an error while loading `apisonator.gemspec`: cannot load such file -- 3scale/backend/version. Bundler cannot continue.

Delete "user_required" attr from Application

I don't think it's used anywhere. We probably forgot to delete it when we removed the "end-users" feature.

Review update to Puma latest stable version

We are using our own fork (https://github.com/3scale/puma) that solves a performance issue. That problem has been solved in upstream, so we should try to update.

Provide a way to delete default services

Right now there's no way to delete the default service of a provider. It probably makes sense to allow deleting it when it's the only one left.

Incompatible with Redis Cluster

Apisonator is not compatible with the Redis cluster mode.

I did not try all the Apisonator functionality, I just tried to make a basic test work. The problems can be summarized in two:

Apisonator does not guarantee that all the keys inside a pipeline belong to the same shard, which is a requirement to use Redis Cluster. There are many examples of pipelines with keys that are not guaranteed to belong to the same shard (do not include the "{}" hashtag): Alerts.update_utilization, Application.save, EventStorage.pending_ping?, Metric.save, etc.
Apisonator does not guarantee that all the keys passed as params for a single command belong to the same shard, which is also a requirement to use Redis cluster. This affects all commands that accept multiple keys, for example, mget, blpop, hmget, etc. Examples: Application.load, Service.get_service, some methods in UsageLimit, etc. Also, the blpop used to fetch jobs from the queue.

I don't see a straightforward solution for those problems.

Modifying all the keys to ensure that in every operation they belong to the same shard might be complicated. Migrating a running system might be very challenging.

The solution of replacing all the mgets with multiple gets, remove the pipelines, etc. is not feasible. The performance hit would be very high.

Documented behaviour not consistent for rate limiting

In the error responses document, it suggest thats backend will return a 409 when rate limits have been exceeded which is behaviour I am observing. However it also suggests that the error_code tag in the xml response will be populated with the code limits_exceeded which I have not been able to observe.

This is the response I am seeing when an imposed limit has been breached:

<?xml version="1.0" encoding="UTF-8"?>
<status>
  <authorized>false</authorized>
  <reason>usage limits are exceeded</reason>
  <plan>app-plan</plan>
  <usage_reports>
    <usage_report metric="hits" period="minute">
      <period_start>2020-01-06 14:52:00 +0000</period_start>
      <period_end>2020-01-06 14:53:00 +0000</period_end>
      <max_value>2</max_value>
      <current_value>2</current_value>
    </usage_report>
  </usage_reports>
</status>

'3scale-Limit-Reset' Not useful when rate-limited

The limit headers return the remaining quota and the reset time based on the most constrained limit that applies.
When the request is rate-limited, these headers return incorrect information.

I tried defining a metric with 2 limits:

5 requests / minute.
100k requests / year.

After making 6 requests in a minute, I would expect to see a reset-time close to 60 seconds. However, it returns the remaining for the year.

	head '/:token/:service_id/' do \|token, service_id\|
	ServiceToken.exists?(token, service_id) ? 200 : 404
	end

3scale / apisonator Goto Github PK

apisonator's People

Contributors

Stargazers

Watchers

Forkers

apisonator's Issues

Recommend Projects

Recommend Topics

Recommend Org