
CF Networking Release

This repository is a BOSH release for deploying CF Networking and associated tasks. CF Networking provides policy-based container networking and service discovery for Cloud Foundry.

For information on getting started with Cloud Foundry, see the docs for CF Deployment.

Docs

Contributing

See the Contributing.md for more information on how to contribute.

Working Group Charter

This repository is maintained by the App Runtime Platform working group, under the Networking area.

Important

Content in this file is managed by the sync-readme CI task and is generated following a convention.

cf-networking-release's People

Contributors

ameowlia, angelachin, bruce-ricard, christianang, dennisdenuto, dependabot[bot], dsabeti, ebroberson, evanfarrar, ewrenn8, genevieve, geofffranks, joachimvaldez, jrussett, karampok, marcpaquette, mariash, markstgodard, mcwumbly, mike1808, moleske, nhsieh, reneighbor, rosenhouse, routing-ci, rowanjacobs, tas-runtime-bot, tcdowney, tylerschultz, winkingturtle-vmw


cf-networking-release's Issues

C2C DNS lookups are case-sensitive

Issue

It seems that hostnames on internal routes are case-sensitive when used in c2c networking.

Context

We recently made a small change to the v7 CLI to lowercase hostnames that are given to the map-route command before sending them to CAPI. A seemingly small, innocent change: cloudfoundry/cli@c831987

This caused a CATs test to fail, and I found some interesting behavior when investigating. In the failing test, we do the following:

  1. Push a backend app and map an internal route to it (which got lowercased, like cats-1-host.apps.internal)
  2. Push a frontend app that behaves like a proxy
  3. Add a network policy to allow the frontend app to talk to the backend app (cf add-network-policy … --protocol tcp --port 8080)
  4. Curl the frontend proxy with a request like: curl -k -H Expect: -s http://CATS-1-APP-FRONT-89d91a1f93c9077c.nopa.cli.fun/proxy/CATS-1-HOST-7c9c9cbaacf08282.apps.internal:8080
  5. Expect the proxy to work (hitting the backend app).

Instead, the proxying fails, and the backend returns an error like:

request failed: Get http://CATS-1-HOST-7c9c9cbaacf08282.apps.internal:8080: dial tcp: lookup CATS-1-HOST-7c9c9cbaacf08282.apps.internal on <ip address>:53: no such host

Note that in that proxy request, the capitals are preserved. We can get the test to pass either by reverting the lowercasing change on the CLI side, or by changing the proxy request to look like .../proxy/cats-1-host-... (lowercase hostname).
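DNS names are case-insensitive per RFC 4343, so the lookup side could normalize instead of requiring the client to. As a sketch only (the `lookupRoute` function and its route table are hypothetical stand-ins, not the actual service-discovery code), a resolver keyed by lowercased hostname would make all three casings resolve:

```go
package main

import (
	"fmt"
	"strings"
)

// lookupRoute is a hypothetical stand-in for the internal-route lookup.
// Normalizing to lower case (and dropping a trailing dot) makes the
// lookup case-insensitive, matching RFC 4343 semantics for DNS names.
func lookupRoute(table map[string][]string, hostname string) ([]string, bool) {
	key := strings.ToLower(strings.TrimSuffix(hostname, "."))
	ips, ok := table[key]
	return ips, ok
}

func main() {
	routes := map[string][]string{
		"wow.apps.internal": {"10.255.0.3"},
	}
	for _, q := range []string{"wow.apps.internal", "WOW.apps.internal", "Wow.apps.internal."} {
		ips, ok := lookupRoute(routes, q)
		fmt.Println(q, ips, ok)
	}
}
```

With normalization in place, all three queries above return the same overlay IP.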

Steps to Reproduce

Make sure you're targeting some space on a CF that has an internal domain (e.g. apps.internal).

  1. Use cf push frontend to push the frontend proxy app found here: https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/proxy

  2. Use cf push backend --no-route to push the simple app found here: https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/hello-world

  3. Use cf map-route backend apps.internal --hostname wow to create and map the internal route wow.apps.internal to the backend app

  4. Use cf add-network-policy frontend backend --protocol tcp --port 8080 to add a c2c networking policy from the frontend to the backend.

Expected result

We should now be able to successfully proxy to the backend app, and the casing of the hostname shouldn't matter:

→ curl frontend.nopa.cli.fun/proxy/wow.apps.internal:8080
Hello, world!

→ curl frontend.nopa.cli.fun/proxy/WOW.apps.internal:8080
Hello, world!

Current result

Passing a hostname that doesn't match in case doesn't work:

→ curl frontend.nopa.cli.fun/proxy/wow.apps.internal:8080
Hello, world!

→ curl frontend.nopa.cli.fun/proxy/WOW.apps.internal:8080
request failed: Get http://WOW.apps.internal:8080: dial tcp: lookup WOW.apps.internal on 169.254.0.2:53: no such host

→ curl frontend.nopa.cli.fun/proxy/Wow.apps.internal:8080
request failed: Get http://Wow.apps.internal:8080: dial tcp: lookup Wow.apps.internal on 169.254.0.2:53: no such host

Possible Fix

Not sure :(

Additional Context

Here's the exact test that was failing: https://github.com/cloudfoundry/cf-acceptance-tests/blob/cli-v7-cats-new-features/service_discovery/service_discovery.go

I was using the V7 CF CLI, but it shouldn't matter for this. The only relevant difference from V6 is that in V6, creating the network policy would look like: cf add-network-policy frontend --destination-app backend --protocol tcp --port 8080

cf-networking-release ignoring garden allow_networks configuration

Hi,
I deployed cf-networking-release today with cf-release 265 and diego-release 1.19. When cf-networking-release is successfully deployed, the configuration section of

garden:
    mtu: 1426
    allow_networks:
    - 172.16.100.0/24
    - 172.16.101.0/24

is completely ignored, and communication to the defined networks fails. Is there a trick in the deployment to avoid this behaviour?
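One possible explanation (an assumption, not a confirmed root cause): with cf-networking enabled, container egress is governed by Application Security Groups rather than garden's allow_networks. If that is the case here, the equivalent of the ignored configuration would be an ASG rules file like the following, created with cf create-security-group and bound with cf bind-running-security-group:

```json
[
  { "protocol": "all", "destination": "172.16.100.0/24" },
  { "protocol": "all", "destination": "172.16.101.0/24" }
]
```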

Deprecated `network-policy` CF Plugin binary URL 404s in CF-Community repo

Thanks for submitting an issue to cf-networking-release. We are always trying to improve! To help us, please fill out the following template.

Issue

While attempting to install the network-policy plugin from the public CF-Community repo, users get a 404. It appears to fail for both the linux and darwin binaries.

Context

Not sure if anyone still needs this plugin, or whether you all currently maintain the repo, but perhaps the plugin should be removed.

Steps to Reproduce

$ cf repo-plugins | grep network-policy
network-policy                  1.6.1     Allow the user to manage application network policies
$ cf install-plugin -r CF-Community "network-policy"
Searching CF-Community for plugin network-policy...
Plugin network-policy 1.6.1 found in: CF-Community
Attention: Plugins are binaries written by potentially untrusted authors.
Install and use plugins at your own risk.
Do you want to install the plugin network-policy? [yN]: y
Starting download of plugin binary from repository CF-Community...
Download attempt failed; server returned 404 Not Found
Unable to install; plugin is not available from the given URL.
FAILED

trace output

Starting download of plugin binary from repository CF-Community...
REQUEST: [2018-09-08T09:39:54-04:00]
GET /cloudfoundry-incubator/cf-networking-release/releases/download/v1.6.1/network-policy-plugin-linux64 HTTP/1.1
Host: github.com
Accept: application/json
Content-Type: application/json
User-Agent: cf/6.38.0+7ddf0aadd.2018-08-07 (go1.10.3; amd64 linux)

RESPONSE: [2018-09-08T09:39:55-04:00]
HTTP/1.1 404 Not Found
Cache-Control: no-cache
Content-Security-Policy: default-src 'none'; base-uri 'self'; connect-src 'self'; form-action 'self'; img-src 'self' data:; script-src 'self'; style-src 'unsafe-inline'
Content-Type: application/json; charset=utf-8
Date: Sat, 08 Sep 2018 13:39:54 GMT
Expect-Ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
Server: GitHub.com
Set-Cookie: has_recent_activity=1; path=/; expires=Sat, 08 Sep 2018 14:39:54 -0000
Set-Cookie: logged_in=no; domain=.github.com; path=/; expires=Wed, 08 Sep 2038 13:39:54 -0000; secure; HttpOnly
Set-Cookie: _gh_sess=…; path=/; secure; HttpOnly
Status: 404 Not Found
Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
Vary: X-PJAX
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-Github-Request-Id: 29C4:433A:A9E4A1:1152B13:5B93D12A
X-Request-Id: 6b448257-8be5-4ad7-97e3-a8f782333377
X-Runtime: 0.053444
X-Runtime-Rack: 0.062918
X-Xss-Protection: 1; mode=block
[NON-JSON BODY CONTENT HIDDEN]

Download attempt failed; server returned 404 Not Found
Unable to install; plugin is not available from the given URL.
FAILED

Expected result

Successful install

Current result

Failed install

Possible Fix

Update or remove repo url

When a Space is deleted, the network policies of the apps in this Space are not deleted immediately

I'm using:
CF CLI 6.31.0+b35df905d.2017-09-15
Ops Mgr 1.12.0, ERT 1.12.0 and cf-networking 1.6.0

Here are the steps to reproduce:

  1. create 2 Orgs, and 1 Space for each Org
  2. push 2 apps under both Spaces
  3. use 'cf add-network-policy' to add 2 policies in both Spaces
  4. 'cf network-policies' returns 2 policies for each Space, and the internal Network Policy API returns all 4 policies:
     diego_database/0e3f6506-1741-41c7-b5d8-598c922833d3:~$ curl -sk --cacert $BBS_CA_CERT_FILE --cert $BBS_CERT_FILE --key $BBS_KEY_FILE https://network-policy-server.service.cf.internal:4003/networking/v0/internal/policies
  5. delete one of the Spaces
  6. 'cf network-policies' returns 2 policies for the remaining Space; the Network Policy API still returns all 4 policies, but only 2 are expected.
  7. After several minutes (around 15 min or more), the Network Policy API returns 2 policies as expected.

Please help take a look. Thanks.
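The delay described above is consistent with stale policies being removed by a periodic cleanup pass rather than at space-deletion time. As a sketch of that idea (the Policy type and function are hypothetical, not the policy server's actual code), a cleanup job would periodically diff stored policies against the apps Cloud Controller still knows about:

```go
package main

import "fmt"

// Policy is a minimal, hypothetical view of a c2c network policy.
type Policy struct{ SourceGUID, DestGUID string }

// stalePolicies returns policies whose source or destination app no
// longer exists; a periodic job could delete these, which would explain
// the multi-minute lag observed in step 7 above.
func stalePolicies(policies []Policy, liveApps map[string]bool) []Policy {
	var stale []Policy
	for _, p := range policies {
		if !liveApps[p.SourceGUID] || !liveApps[p.DestGUID] {
			stale = append(stale, p)
		}
	}
	return stale
}

func main() {
	live := map[string]bool{"frontend-guid": true, "backend-guid": true}
	policies := []Policy{
		{"frontend-guid", "backend-guid"},     // both apps still exist
		{"frontend-guid", "deleted-app-guid"}, // destination's space was deleted
	}
	fmt.Println(len(stalePolicies(policies, live))) // 1
}
```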

Provide name based mechanism to lookup IP

Kudos on the great work on cf-networking.
I played around with it and found it very useful, but there is one small thing that could be improved from a user perspective.
When a policy is added, the source and target droplets are able to communicate with each other via the so-called overlay IP. It would be great if there were also a mechanism to communicate via a name (similar to the Docker container link feature).
That way, the setup of the droplets would be resilient to restarts of the droplets.

question

Hi, in the list of releases of cf-networking-release I see:

(screenshot of the release assets list)

Could you please tell me the difference between the content of:

cf-networking-2.3.0.tgz
and
Source code (zip) / Source code (tar.gz) ?

Thanks!

cf-release docs contain wrong configuration

Hi,
I deployed cf-networking-release today with cf-release 265 and diego-release 1.19. While doing so, I found an error in the documentation. For uaa, the configuration has the following section:

authorized-grant-types: client_credentials,refresh_token

But if authorized-grant-types contains refresh_token, you need to specify scope in that section as well.
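A corrected client entry would then look something like the sketch below. The scope and authority values here are illustrative assumptions, not the documented ones; consult the release docs for the exact values required:

```yaml
network-policy:
  authorized-grant-types: client_credentials,refresh_token
  scope: uaa.none            # illustrative; required once refresh_token is granted
  authorities: uaa.resource  # illustrative
  secret: REPLACE_WITH_SECRET
```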

space_developer_test intermittently failing in 2.7

Issue

Intermittent failure of space_developer_test during 2.7 deployments.

We are filing an issue to make your team aware of these failures.

Context

The PAS Releng team has seen a number of failed pnats runs with the same symptoms.

[2019-06-10 17:35:36.34 (UTC)]> cf curl -X PUT /v2/organizations/b5039dc7-ddac-402c-a5f8-06dfcaa31feb/users/space-client 

[2019-06-10 17:35:36.34 (UTC)]> cf curl -X PUT /v2/spaces/3478947b-e2bc-4db9-955b-aef70fca758d/developers/space-client 

[2019-06-10 17:35:36.35 (UTC)]> cf curl -X PUT /v2/spaces/56fd688d-18e9-4ad8-a8c8-cdc25acc386d/developers/space-client 
STEP: logging in and getting the space developer user token

[2019-06-10 17:35:36.35 (UTC)]> cf auth space-client password --client-credentials 
API endpoint: https://api.sys.vivid-fancier.gcp.releng.cf-app.com
Authenticating...
{
   "description": "space-client",
   "error_code": "CF-InvalidRelation",
   "code": 1002
}

Steps to Reproduce

Happens randomly while running pnats.

Expected result

test to pass successfully.

Current result

• Failure [73.906 seconds]
space developer policy configuration
/tmp/build/29505db7/acceptance-tests/src/test/acceptance/space_developer_test.go:23
  space developer with network.write scope
  /tmp/build/29505db7/acceptance-tests/src/test/acceptance/space_developer_test.go:125
    can create, list, and delete network policies in spaces they have access to
    /tmp/build/29505db7/acceptance-tests/src/github.com/onsi/ginkgo/extensions/table/table.go:92
      as a service account [It]
      /tmp/build/29505db7/acceptance-tests/src/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

      Expected
          <int>: 1
      to match exit code:
          <int>: 0

      /tmp/build/29505db7/acceptance-tests/src/test/acceptance/space_developer_test.go:143


Receiving 403 response when access token expired

Issue

Currently the policy server returns 403 FORBIDDEN when an expired access token is used. Per HTTP specification it should be 401 UNAUTHORIZED.

Context

This issue was initially reported to CF Java Client here: cloudfoundry/cf-java-client#1016. It turns out that the CFJC Networking Client relies on a 401 response to invalidate the token and retry, so it doesn't refresh the token correctly on a 403 response.

Steps to Reproduce

Reproduced in this repo: [email protected]:LittleBaiBai/cf-java-client-playground.git. If you are on the Pivotal network, you can run the tests directly and see the NetworkingClient test fail after the token expires. Otherwise, make sure the client you are using has its access token validity set to 10 seconds or less, then run the test TMPNetworkConfigurerTest.

Possible Fix

Return UNAUTHORIZED instead of FORBIDDEN here:

a.ErrorResponse.Forbidden(logger, w, err, "failed to verify token with uaa")

Acceptance tests are not compatible with cf cli v7

Issue

Tests within src/test/acceptance are not compatible with cf cli v7

Context

While running integration tests, we found that the cf_cli_adapter does not handle the cf cli v7 api for the bind-security-group command.

Steps to Reproduce

Run the acceptance set suite using cf cli v7

Expected result

All tests pass.

Current result

STEP: removing test-generated ASGs
• Failure [70.704 seconds]
external connectivity
/tmp/build/29505db7/acceptance-tests/src/test/acceptance/external_connectivity_test.go:20
  basic (legacy) network behavior for an app
  /tmp/build/29505db7/acceptance-tests/src/test/acceptance/external_connectivity_test.go:108
    allows outbound ICMP only if allowed [It]
    /tmp/build/29505db7/acceptance-tests/src/test/acceptance/external_connectivity_test.go:147

    Expected success, but got an error:
        <*cf_cli_adapter.CmdErr | 0xc0004d8870>: {
            Out: "FAILED\n\nNAME:\n   bind-security-group - Bind a security group to a particular space, or all existing spaces of an org\n\nUSAGE:\n   cf bind-security-group SECURITY_GROUP ORG [--lifecycle (running | staging)] [--space SPACE]\n\nTIP: Changes require an app restart (for running) or restage (for staging) to apply to existing applications.\n\nOPTIONS:\n   --lifecycle      Lifecycle phase the group applies to. (Default: running)\n   --space          Space to bind the security group to. (Default: all existing spaces in org)\n\nSEE ALSO:\n   apps, bind-running-security-group, bind-staging-security-group, restart, security-groups\n",
            Err: "Incorrect Usage: unexpected argument \"test-space\"\n",
            Message: "exit status 1",
        }
        exit status 1:
        
        Out:
        FAILED
        
        NAME:
           bind-security-group - Bind a security group to a particular space, or all existing spaces of an org
        
        USAGE:
           cf bind-security-group SECURITY_GROUP ORG [--lifecycle (running | staging)] [--space SPACE]
        
        TIP: Changes require an app restart (for running) or restage (for staging) to apply to existing applications.
        
        OPTIONS:
           --lifecycle      Lifecycle phase the group applies to. (Default: running)
           --space          Space to bind the security group to. (Default: all existing spaces in org)
        
        SEE ALSO:
           apps, bind-running-security-group, bind-staging-security-group, restart, security-groups
        
        
        Err:Incorrect Usage: unexpected argument "test-space"
        
        

    /tmp/build/29505db7/acceptance-tests/src/test/acceptance/external_connectivity_test.go:156

Possible Fix

Update the cf_cli_adapter method BindSecurityGroup (here) to handle cf cli v7.
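The shape of that fix can be sketched from the v7 usage shown in the failure output (v6 took the space as a positional argument; v7 takes a --space flag). The function below is an illustrative helper, not the adapter's actual code:

```go
package main

import "fmt"

// bindSecurityGroupArgs builds cf bind-security-group arguments for
// both CLI major versions: v6 took the space positionally, while v7
// uses the --space flag (per the USAGE text in the failure above).
func bindSecurityGroupArgs(cliMajor int, group, org, space string) []string {
	args := []string{"bind-security-group", group, org}
	if cliMajor >= 7 {
		return append(args, "--space", space)
	}
	return append(args, space)
}

func main() {
	fmt.Println(bindSecurityGroupArgs(6, "test-asg", "test-org", "test-space"))
	fmt.Println(bindSecurityGroupArgs(7, "test-asg", "test-org", "test-space"))
}
```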


vxlan-policy-agent won't start due to policy-server SQL error

Issue

When starting up "vxlan-policy-agent" on our diego cells, we are getting the following error:

{"timestamp":"2020-06-15T11:58:14.233874801Z","level":"error","source":"cfnetworking.vxlan-policy-agent","message":"cfnetworking.vxlan-policy-agent.policy-client.http-client","data":{"body":"{\"error\": \"database read failed\"}","code":500,"error":"database read failed","session":"1"}}
{"timestamp":"2020-06-15T11:58:14.233926043Z","level":"error","source":"cfnetworking.vxlan-policy-agent","message":"cfnetworking.vxlan-policy-agent.policy-client-get-policies","data":{"error":"http status 500: database read failed"}}

Tracing this back to the policy-server (and policy-server-internal) on the api instances, we see the following error:

{"timestamp":"2020-06-15T11:57:57.869324360Z","level":"error","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.metric-getter","data":{"error":"listing all: sql: Scan error on column index 2, name \"guid\": converting NULL to string is unsupported","source":"totalPolicies"}}

This set of errors is causing vxlan-policy-agent, and by extension, rep not to start on the cells.

Context

We are using cf-deployment v12.45.0 which includes cf-networking-release 2.28.0

Steps to Reproduce

I'm not sure how to reproduce this; I haven't found the table with the erroneous NULL value.

Expected result

vxlan-policy-agent should start without error. policy-server should not throw SQL errors.

Current result

Described under the issue.


CustomIPTablesCompatibilityTest should be skipped by default

After bumping to master of the latest cf-networking-release acceptance tests, we see the following failure:

Custom iptables compatibility
/tmp/build/29505db7/acceptance-tests/src/test/acceptance/custom_iptables_compatibility_test.go:15
  when a custom iptables rule is added and a new app is pushed
  /tmp/build/29505db7/acceptance-tests/src/test/acceptance/custom_iptables_compatibility_test.go:42
    still applies the iptable rule to the new app [It]
    /tmp/build/29505db7/acceptance-tests/src/test/acceptance/custom_iptables_compatibility_test.go:43

    No future change is possible.  Bailing out early after 0.000s.
    Expected
        <int>: 7
    to match exit code:
        <int>: 0

    /tmp/build/29505db7/acceptance-tests/src/test/acceptance/custom_iptables_compatibility_test.go:46

The underlying error was curl: (7) Failed to connect to 10.0.4.36 port 8898: Connection refused. It sounds like this test will only pass if a certain release is co-located on the cell listening on port 8898.

As this test will not pass against a standard CF deployment, could SkipCustomIPTablesCompatibilityTest be set to true by default?

Possible issue with bbr unlock scripts

The CAPI team was looking at your BBR scripts for inspiration, and this line in your unlock path jumped out. Should echo 1 be return 1 so that the exit code of the function is propagated back to the caller? We haven't noticed any issues running these scripts, but figured I would give a heads up.
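The distinction can be demonstrated in a couple of lines of bash (a standalone illustration, not the actual BBR script): echo 1 only prints "1" and leaves the function's exit status at 0, whereas return 1 makes the function itself fail.

```shell
# echo writes to stdout; the function still exits 0.
with_echo() { echo 1; }
# return sets the function's exit status, which callers can act on.
with_return() { return 1; }

with_echo;   echo "echo-version exit status: $?"    # 0
with_return; echo "return-version exit status: $?"  # 1
```

So a caller using `if ! some_unlock_step; then …` would never see the failure in the echo version.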

CF App SD docs lacks explanation on new Functionality/components

Issue

The CF App SD docs lack an explanation of the new functionality and components. We can only see the architecture diagram of the data flow here, but the docs do not explain what the components are or how they work.

Expected result

It would be better to have more detailed documentation explaining the architecture diagram: how it works, and what the new functionality and components add and do.

Setting up VPN connectivity for cf apps

In our standard CF-deployment with cf-networking (silk) we would like to offer
the following use case:

As a platform operator I want to be able to connect applications of a specific
space to a private customer network using a VPN solution (e.g. ipsec, openvpn).

Is there any discussion/proposal/design document on that direction?
What do you think? is it something that we can enable without changes in the upstream?

Off the top of my head, such a feature would require:

  1. creating a VPN gateway (either in each app, as a different app, or as a VM in the overlay)
  2. modifying the routing table within the apps
  3. extending the policy mechanism to ensure only apps within the space can speak
    to the gateway

We are looking for a solution that does not include sidecars/istio/envoy.

Any thoughts?

Thank you in advance
Konstantinos

CF Networking for multi Tenant platform

Issue

I have cf-networking enabled for my shared CF platform on AWS. While using this feature, I found that apps hosted on the same Diego cell can communicate with each other via the overlay network IP (the CF instance internal IP) without any network policy being set up. When using service discovery instead, I get a "no such host" error, which is what I expect since we have not yet run add-network-policy.

Can you advise how this can be handled in a multi-tenant environment where customers share Diego cells? They would not want their apps to be reachable by another org's apps hosted on the same cell.

Context

Using this example app for both with and without service discovery.
https://github.com/cloudfoundry/cf-networking-examples

Update instructions to restage app after setting env variables

For the example app cats-and-dogs, the user needs to restage the app after setting the env variables for the changes to apply to the app.

Update instructions to read as follows by adding cf restage backend:

cd cf-networking-release/src/example-apps/cats-and-dogs/backend
cf push backend --no-start
cf set-env backend CATS_PORTS "7007,7008"
cf set-env backend UDP_PORTS "9003,9004"
cf restage backend
cf start backend

Container to Container connection may break after apps are restarted


Issue

Container to Container connection may break after apps are restarted.

Context

We put C2C to a test and ran a script similar to below to make sure that C2C runs as expected.

while true; do
add policy app_cat to app_dog on port 8080
restart both app_cat & app_dog
while Connection fail; do
Test connection app_cat to app_dog on port 8080
done
remove policy app_cat to app_dog on port 8080
done

The script normally completes the loop over and over, but at some point it always gets stuck on the inner loop:

while Connection fail; do
Test connection app_cat to app_dog on port 8080
done

That means the policy is in place at that point, but the connection continues to fail.

Steps to Reproduce

  • We ran a script similar to below:

while true; do
add policy app_cat to app_dog on port 8080
restart both app_cat & app_dog
while Connection fail; do
Test connection app_cat to app_dog on port 8080
done
remove policy app_cat to app_dog on port 8080
done

  • 1 instance for both app_cat and app_dog
  • Issue can be observed in v1.9.0 and v1.13.0 + silk v0.30
  • We have 3 available diego-cell when this was tested

Expected result

C2C should not break after both apps are successfully restarted with the policy in place.

Current result

  • Observations:

First time stuck on loop:
Old
this is cat IP CF_INSTANCE_INTERNAL_IP=10.78.10.15
this is dog IP CF_INSTANCE_INTERNAL_IP=10.77.65.113
New
this is cat IP CF_INSTANCE_INTERNAL_IP=10.78.23.62
this is dog IP CF_INSTANCE_INTERNAL_IP=10.78.10.16

Second time stuck on loop:
Old
this is cat IP CF_INSTANCE_INTERNAL_IP=10.78.23.247
this is dog IP CF_INSTANCE_INTERNAL_IP=10.77.65.142
New
this is cat IP CF_INSTANCE_INTERNAL_IP=10.78.10.200
this is dog IP CF_INSTANCE_INTERNAL_IP=10.78.23.248

As observed above, it happens when the new destination app_dog instance lands on the source app_cat's previous host container (diego-cell), and their IPs differ by just 1,

e.g.
Old
this is cat IP CF_INSTANCE_INTERNAL_IP=10.78.10.15

New
this is dog IP CF_INSTANCE_INTERNAL_IP=10.78.10.16

Also, the iptables FORWARD chain ordering changes on the destination: the overlay chain comes before the vpa chain.


Support TLS database connections for silk & policy-server

As a platform operator
I want to be able to configure secure (TLS) database connections for the silk & policy-server
In order to be compliant to corporate security requirements.

Currently the silk and policy-server jobs do not expose properties that would allow configuring secure database connectivity.
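As a sketch of what such properties could look like in a BOSH job spec (these names are illustrative assumptions, not the release's actual spec), the database block might grow TLS-related fields:

```yaml
# Hypothetical sketch only: property names are illustrative.
database:
  type: mysql
  host: sql-db.service.cf.internal
  port: 3306
  name: policy_server
  require_ssl: true                       # refuse plaintext connections
  ca_cert: ((database_ca.certificate))    # CA used to verify the server
```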

Network Policy is getting removed after Restaging the Application

Hi Team,

Upon restaging the application, its network policies are being wiped out. I'm assuming this is because of the new droplet that is created.

We need to keep those network policies as long as the app is not deleted.

Can you please assist us with the issue or do suggest if I'm missing any step here?

Thanks and Regards,
Dilip Tadepalli

Applications crash during update in large deployments

Issue

In a large deployment with 1000 cells (it will likely also appear with fewer cells), we encountered apps crashing during an update because the policy-server-internal gets overloaded during container creation and requests from the vxlan-policy-agent time out.

Context

During an update with 1000 cells we saw multiple hundred, and in another scenario even thousands of, apps crash. When apps get drained to a different cell, they request the policies from the policy-server-internal on container creation. Since a lot of requests are issued at the same time, this overloads the policy-server-internal and leads to timeouts, so the apps crash.

Steps to Reproduce

We have a landscape with 1000 cells (max_in_flight set to 50) and 30k apps, and we created 4k policies. After deleting all policies the same issue did not occur, probably because the database queries are much faster.

We set the policy_poll_interval_seconds to 300, otherwise the policy-server-internal would be constantly overloaded. With this configuration the landscape seems stable (outside of an update).

Expected result

No app crashes during update

Current result

Request times out during container creation:

{"timestamp":"2018-12-04T16:16:43.381785739Z","level":"error","source":"guardian","message":"guardian.create.external-networker-result","data":{"action":"up","error":"exit status 1","handle":"27921ba0-afe9-4653-6dbb-d1ab","session":"6274","stderr":"cfnetworking: cni up failed: add network list failed: vpa response code: 500 with message: failed to force policy p
oll cycle: get-rules: failed to get policies: http client do: Get https://policy-server.service.cf.internal:4003/networking/v1/internal/policies?id=00ee43e2-cb6b-4bce-8b3d-34af44a4519b,e06649ce-40b7-40b3-8f83-339a832a44a7,83119d81-6269-410c-9c7e-6b0b5cfc2997,313261ae-4107-4f3a-8a99-5c63c2bc157c,165aebcb-cd59-4ab7-829f-a87b08642329,2207420e-9e14-4979-8c3d-eb1f0f8
4d25d,2237cfe8-cf0f-469c-b9cb-84fa8be75be2,61f88739-5ea3-40b7-8ba3-36100bf8665e,1d08fca4-5a11-44d4-85e4-90233fccd976,d5afd404-a250-4b59-8c38-106c765ebf06,2d105c1c-6aff-4f09-b83d-00550fac6245,b206bdb8-23fb-4d90-8a8c-24e2f6c6c4e9,fd886474-47ee-4709-8b06-1c88c954dfb5,75f1693d-7884-424a-9dba-ca09202a2eed,cb121d69-f5e9-4310-8d19-c636d528916f,5fa65bbe-ecea-4acd-8cb5-b
70ef56f87c5,6f3a12c9-b190-4ab8-b677-2786fff7abed,077a31a8-3925-4818-b42e-11adf7ad8772,d590b654-e32d-4f1b-985b-323daf349fb5,8045de6f-97fc-4df9-a964-12f06a4ffb2f,a229c94e-bad5-48e3-99c6-4ffa3df0eff6,03d19d9c-928f-48f6-b099-ce3cb36bc36c,a825d1e0-472e-425c-9bd6-4a344ef8b497,9d27ec24-084c-4b3a-b6dc-36c99807feab,9c31c071-6e28-4678-b8f3-653687e7ca0b,b1bf32c1-a9de-48d6-
8318-e2089ce15ce5,28d5b6a8-bd9b-4099-be30-a446133b0695,147a1af4-5a6c-48df-92b1-553c6585b21d,80ceb97e-5525-443b-8f39-dc785fd1a00d,fcba7369-eae4-4dac-9292-d3f706577472,48d694c2-dd3a-4c14-a09d-58ee14785d23,d31e5d50-5b65-4805-a45b-7e08c0f98e2c,370d95f2-29d0-4d0c-9f16-580be271d688,6aa1a0da-77ab-4669-a87d-27cb40e3dffa,fe326b2e-5b96-4b17-a98a-8e3f8faec489,d0fbf238-09a7
-41ae-8076-3ecd4d6cd4bd,c7c23786-6d46-4eb2-857c-5213564125c7,7ad92303-f5a4-42ba-b98e-0ebaa09daa2b,7b9dcad8-c6ce-485c-9a32-0e51af27f022,51cefe4b-ded6-4f36-9fc1-a6494fe45ae9,2ba95e1a-0b54-4dfe-862e-6be8cbaaccfd,1ec92530-84cd-4244-8414-073c63e9fd64,e1b1974c-ac2f-42f2-ba7d-1d128736ef28,2a1fbd35-6ec2-4cdf-99e3-65968c89a398,f40d6531-b263-43e9-b536-83005b8b5624,8354283
6-dc14-46c7-a6ce-1b485be51239,fe8d196c-2b2a-47cd-9c99-bd0d5851a733,09ae56ca-8a24-4f4a-a1d2-024de05e610c,983649b6-d1bd-4fff-b1f9-adfde60cfe2e,43758557-0568-4615-abbe-3b9926e1d46e: net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n","stdin":"{\"Pid\":3120232,\"Properties\":{\"app_id\":\"fcba7369-eae4-4dac-9292-d3f706577472\",\"org_id\":\
"12d0e6f0-ba0e-45f7-835e-d9304d92c64c\",\"policy_group_id\":\"fcba7369-eae4-4dac-9292-d3f706577472\",\"ports\":\"8080\",\"space_id\":\"2ba95e1a-0b54-4dfe-862e-6be8cbaaccfd\"},\"netout_rules\":[{\"networks\":[{\"start\":\"0.0.0.0\",\"end\":\"9.255.255.255\"}]},{\"networks\":[{\"start\":\"11.0.0.0\",\"end\":\"147.204.7.255\"}]},{\"networks\":[{\"start\":\"147.204.
9.0\",\"end\":\"155.56.54.155\"}]},{\"networks\":[{\"start\":\"155.56.54.159\",\"end\":\"155.56.68.227\"}]},{\"networks\":[{\"start\":\"155.56.68.230\",\"end\":\"155.56.68.235\"}]},{\"networks\":[{\"start\":\"155.56.68.238\",\"end\":\"169.253.255.255\"}]},{\"networks\":[{\"start\":\"169.255.0.0\",\"end\":\"172.15.255.255\"}]},{\"networks\":[{\"start\":\"172.32.0
.0\",\"end\":\"192.167.255.255\"}]},{\"networks\":[{\"start\":\"192.169.0.0\",\"end\":\"255.255.255.255\"}]},{\"protocol\":1,\"networks\":[{\"start\":\"10.0.0.0\",\"end\":\"10.255.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":2,\"networks\":[{\"start\":\"10.0.0.0\",\"end\":\"10.255.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol
\":1,\"networks\":[{\"start\":\"169.254.0.0\",\"end\":\"169.254.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":2,\"networks\":[{\"start\":\"169.254.0.0\",\"end\":\"169.254.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":1,\"networks\":[{\"start\":\"172.16.0.0\",\"end\":\"172.31.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]}
,{\"protocol\":2,\"networks\":[{\"start\":\"172.16.0.0\",\"end\":\"172.31.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":1,\"networks\":[{\"start\":\"192.168.0.0\",\"end\":\"192.168.255.255\"}],\"ports\":[{\"start\":53,\"end\":53}]},{\"protocol\":2,\"networks\":[{\"start\":\"192.168.0.0\",\"end\":\"192.168.255.255\"}],\"ports\":[{\"start\":53,\
"end\":53}]}],\"netin\":[{\"host_port\":0,\"container_port\":8080},{\"host_port\":0,\"container_port\":2222}]}","stdout":""}}

The HTTP timeout appears to be hardcoded to 5s:

https://github.com/cloudfoundry/silk-release/blob/5dc42dd12d4ecc46c826c5a966f1cebbba734fc8/src/silk-daemon-bootstrap/main.go#L23

Request's source IP is a cell's IP instead of a VXLAN one.

Issue

When two apps are located on the same cell, the source IP is the cell's IP.

Context

During cf-networking testing we realized that the source IPs of requests sometimes fall outside the VXLAN subnet. We found out that connections between application instances located on the same cell use the cell IP, not the container's VXLAN IP, whereas connections between application instances located on different cells use the VXLAN source IP.

Steps to Reproduce

$ cf network-policies
Listing network policies in org my-org / space my-space as [email protected]...

source     destination   protocol   ports
feeder     sd            tcp        1234

$ cf-instances.sh sd
Application: sd (1495e2c6-91dd-45f1-bc94-df608ae4d83f)
Index  State    Uptime  IP             Port   CPU  Mem    Mem_Quota  Disk   Disk_Quota
0      RUNNING  9531s   100.73.61.131  61001  2%   23.8M  64M        94.7M  1024M

$ cf-instances.sh feeder
Application: feeder (7bd85784-b7b4-4673-84d3-1b43c8fc0ba9)
Index  State    Uptime  IP             Port   CPU   Mem    Mem_Quota  Disk    Disk_Quota
0      RUNNING  6392s   100.73.61.131  61007  0.3%  27.5M  64M        105.9M  1024M
1      RUNNING  7798s   100.73.61.130  61001  0.2%  28.1M  64M        105.9M  1024M
2      RUNNING  7798s   100.73.61.129  61038  0%    23.3M  64M        105.9M  1024M



########## CONSOLE_1 ##########
$ cf ssh sd

# TIME 1.1
vcap@bdb549a1-2399-4847-5a3b-73bf:~$ nc -vl 1234
Listening on [0.0.0.0] (family 0, port 1234)
Connection from [100.73.61.131] port 1234 [tcp/*] accepted (family 2, sport 33758)

# TIME 2.1
vcap@bdb549a1-2399-4847-5a3b-73bf:~$ nc -vl 1234
Listening on [0.0.0.0] (family 0, port 1234)
Connection from [10.78.184.42] port 1234 [tcp/*] accepted (family 2, sport 38524)
###############################



########## CONSOLE_2 ##########
$ cf ssh feeder -i 0

# TIME 1.2
vcap@46c732fe-428a-4c68-6a42-7c47:~$ nc -zv 10.79.215.6 1234
Connection to 10.79.215.6 1234 port [tcp/*] succeeded!

vcap@46c732fe-428a-4c68-6a42-7c47:~$ exit

# TIME 2.2
$ cf ssh feeder -i 1
vcap@3fb455bf-1233-4256-576a-6142:~$ nc -zv 10.79.215.6 1234
Connection to 10.79.215.6 1234 port [tcp/*] succeeded!

vcap@3fb455bf-1233-4256-576a-6142:~$ exit
###############################

Expected result

We expect to have VXLAN IP as a source IP in all cases.

Increase `waitForAllInstancesToBeRunning` timeout in acceptance tests

Issue

Acceptance tests that use the `waitForAllInstancesToBeRunning` helper appear to be flaky.

Context

Occasionally the PCF Integration team has had to restart our "run nats tests" concourse job (which runs the NATs acceptance tests) due to timeouts in waiting for fixture apps to start in tests. From what we can tell, it's usually due to a single app instance taking longer to start up than the others.

Steps to Reproduce

Run the acceptance tests; occasionally they flake on `waitForAllInstancesToBeRunning`.

Expected result

For `waitForAllInstancesToBeRunning` to wait long enough for all of the application instances to start up.

Current result

This is a trimmed log from an example flake:

...
[2019-05-15 17:25:02.01 (UTC)]> cf restage test--proxy-169313025 
This action will cause app downtime.

Restaging app test--proxy-169313025 in org test-wide-open-interaction-org / space test-wide-open-interaction-space as admin...

Staging app and tracing logs...
   Downloading go_buildpack...
   Downloaded go_buildpack
   Cell c7289fd0-5ca3-4386-af80-bf88b31790c1 creating container for instance 6eac9f40-190c-42e4-968f-bb4480cf5999
   Cell c7289fd0-5ca3-4386-af80-bf88b31790c1 successfully created container for instance 6eac9f40-190c-42e4-968f-bb4480cf5999
   Downloading build artifacts cache...
   Downloading app package...
   Downloaded app package (12.6K)
   Downloaded build artifacts cache (4.9M)
   -----> Go Buildpack version 1.8.39
          **WARNING** [DEPRECATION WARNING]:
          **WARNING** Please use AppDynamics extension buildpack for Golang Application instrumentation
          **WARNING** for more details: https://docs.pivotal.io/partners/appdynamics/multibuildpack.html
   -----> Installing godep 80
          Copy [/tmp/buildpacks/1a03f8c3635c74551eaf353a517f4183/dependencies/52a892f00e80ca4fdcf27d9828c7aba1/godep-v80-linux-x64-cflinuxfs3-b60ac947.tgz]
   -----> Installing glide 0.13.2
          Copy [/tmp/buildpacks/1a03f8c3635c74551eaf353a517f4183/dependencies/fd4e8ddc56ffc364f5d99fa014efa3db/glide-v0.13.2-linux-x64-cflinuxfs3-b50997e2.tgz]
   -----> Installing dep 0.5.1
          Copy [/tmp/buildpacks/1a03f8c3635c74551eaf353a517f4183/dependencies/74ea62ec617ddab3cb253665898fe767/dep-v0.5.1-linux-x64-cflinuxfs3-c2056e50.tgz]
   -----> Installing go 1.10.8
          Copy [/tmp/buildpacks/1a03f8c3635c74551eaf353a517f4183/dependencies/3ea0b297a9a3419a2e61690b045e13c6/go1.10.8.linux-amd64-cflinuxfs3-c80a4f0d.tar.gz]
          **WARNING** go 1.10.x will no longer be available in new buildpacks released after 2019-02-25.
          See: https://golang.org/doc/devel/release.html
          **WARNING** Installing package '.' (default)
   -----> Running: go install -tags cloudfoundry -buildmode pie .
   Exit status 0
   Uploading droplet, build artifacts cache...
   Uploading droplet...
   Uploading build artifacts cache...
   Uploaded build artifacts cache (4.9M)
   Uploaded droplet (2.6M)
   Uploading complete
   Cell c7289fd0-5ca3-4386-af80-bf88b31790c1 stopping instance 6eac9f40-190c-42e4-968f-bb4480cf5999
   Cell c7289fd0-5ca3-4386-af80-bf88b31790c1 destroying container for instance 6eac9f40-190c-42e4-968f-bb4480cf5999

Waiting for app to start...

name:              test--proxy-169313025
requested state:   started
routes:            test--proxy-169313025.example.com
last uploaded:     Wed 15 May 17:25:13 UTC 2019
stack:             cflinuxfs3
buildpacks:        go

type:            web
instances:       4/5
memory usage:    32M
start command:   proxy
     state      since                  cpu    memory        disk          details
#0   running    2019-05-15T17:25:23Z   0.0%   6.6M of 32M   8.3M of 32M   
#1   running    2019-05-15T17:25:22Z   0.0%   0 of 32M      0 of 32M      
#2   starting   2019-05-15T17:25:19Z   0.0%   4.9M of 32M   8.3M of 32M   
#3   running    2019-05-15T17:25:22Z   0.0%   6.7M of 32M   8.3M of 32M   
#4   running    2019-05-15T17:25:23Z   0.0%   0 of 32M      0 of 32M      

[2019-05-15 17:25:24.33 (UTC)]> cf app test--proxy-169313025 --guid 
da2d9ad8-a28b-4f61-b32a-3520e2648664
...
[2019-05-15 17:25:29.45 (UTC)]> cf curl v2/apps/da2d9ad8-a28b-4f61-b32a-3520e2648664/instances 
{
   "0": {
      "state": "RUNNING",
      "uptime": 8,
      "since": 1557941122
   },
   "1": {
      "state": "RUNNING",
      "uptime": 8,
      "since": 1557941121
   },
   "2": {
      "state": "STARTING",
      "uptime": 12,
      "since": 1557941118
   },
   "3": {
      "state": "RUNNING",
      "uptime": 8,
      "since": 1557941121
   },
   "4": {
      "state": "RUNNING",
      "uptime": 8,
      "since": 1557941122
   }
}
STEP: deleting the security group
STEP: deleting the org

[2019-05-15 17:25:35.86 (UTC)]> cf delete-org test-wide-open-interaction-org -f 
Deleting org test-wide-open-interaction-org as admin...
OK

• Failure in Spec Setup (BeforeEach) [77.759 seconds]
ASGs and Overlay Policy interaction
/tmp/build/29505db7/acceptance-tests/src/test/acceptance/asg_overlay_interaction_test.go:16
  when a wide open ASG is configured
  /tmp/build/29505db7/acceptance-tests/src/test/acceptance/asg_overlay_interaction_test.go:31
    when a policy is added [BeforeEach]
    /tmp/build/29505db7/acceptance-tests/src/test/acceptance/asg_overlay_interaction_test.go:84
      does allow traffic on the overlay network
      /tmp/build/29505db7/acceptance-tests/src/test/acceptance/asg_overlay_interaction_test.go:94

      Timed out after 6.005s.
      not all instances running
      Expected
          <bool>: false
      to equal
          <bool>: true

      /tmp/build/29505db7/acceptance-tests/src/test/acceptance/init_test.go:173
------------------------------
SSSS

Summarizing 1 Failure:

[Fail] ASGs and Overlay Policy interaction when a wide open ASG is configured [BeforeEach] when a policy is added does allow traffic on the overlay network 
/tmp/build/29505db7/acceptance-tests/src/test/acceptance/init_test.go:173

Possible Fix

Increase the Eventually timeout from 5s to 30s. I believe that's a slightly more reasonable timeout when waiting for apps to start, especially if the target foundation has limited resources.

OR

Reduce the number of app instances needed for these tests. Seems like in a lot of our flakes, 4 are up waiting for the 5th.

PFR: Extend CF ASGs to app scoped too

Current Status quo:
At the moment, CF supports Platform-Wide and Space-Scoped ASGs.

  • Platform-wide: to provide granular control when securing a deployment, admins can assign platform-wide ASGs that apply to all app and task instances for the entire deployment.
  • Space-scoped: ASGs that apply only to apps and tasks in a particular space.

In CF environments, the ASGs for a particular space are combined with the platform-wide ASGs to determine the effective rules for that space.

Feature Request:
What would be even nicer is support for app-scoped ASGs that apply only to a specific app and its tasks in a particular space.
The ASGs for a particular app would then be combined with the platform-wide and space-scoped ASGs to determine the effective rules for that app; alternatively, app-scoped ASGs could take priority over space-scoped ASGs when combined with platform-wide ASGs, but of course that is an implementation detail.

Use-case:
We have services that use a publish–subscribe model, which means that when a user creates a service instance, a worker application is launched behind the scenes on a service provider space.

  • If any other end-user space service (for example, postgres) is protected by ASGs, an app running in the provider space (the same end user's worker app) cannot consume it unless the complete CIDR blocks of postgres are whitelisted or instance sharing is used. Instance sharing requires a space developer role on both sides to work, but is not a good solution for a multi-tenant environment.

  • If the end-user space service must be consumed by the worker app above (the same end user's worker instance), we could obtain the IP of the end-user space service and add it to an allowlist scoped only to the specific worker app running in the provider space. In this situation, app-scoped ASGs would aid in gaining more control while maintaining isolation.

  • There could be more benefits to controlling app-specific firewalls.

Is vtep port by default supposed to be 4789 or something else?

I noticed in the Known Issues doc I saw this warning:
When using VMware NSX for vSphere 6.x, the default VXLAN port of 8472 used by cf-networking is not allowed. To fix this issue, override the default cf_networking.vtep_port with another value.

But it looks like the default port for vtep is now actually 4789, not 8472.

Is this the same port we're referring to in the known-issues, or something else, if I'm using NSX?


policy cleanup is broken - cc_client space iteration doesn't work

Thanks for submitting an issue to cf-networking-release. We are always trying to improve! To help us, please fill out the following template.

Issue

One of our teams hit the MaxPolicyPerAppSource limit and was wondering why.

Then we found this in our logs:

policy-server.stdout.log:{"timestamp":"2018-11-05T22:48:36.637458246Z","level":"error","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.policy-cleaner-poller.poll-cycle","data":{"error":"get live space guids failed: json client do: http client do: Get https://cloud-controller-ng.service.cf.internal:9024https//api.sys.*/v3/spaces?page=2\u0026per_page=50: invalid URL port "9024https"","session":"4"}}

Pagination seems to be broken for spaces: `route = response.Pagination.Next.Href` returns a fully-qualified URL, which is then appended to the base URL by the JSON client.
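A possible fix, sketched under the assumption that the JSON client always joins the route onto its configured base URL: strip the scheme and host from the pagination link before reusing it. The function name `nextRoute` is illustrative, not the actual policy-server code.

```go
package main

import (
	"fmt"
	"net/url"
)

// nextRoute normalizes a Cloud Controller v3 pagination href for a JSON
// client that prepends its own base URL. CC returns fully-qualified links,
// so only the path and query should be kept before the next request.
func nextRoute(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// RequestURI yields "/path?query", dropping scheme and host.
	return u.RequestURI(), nil
}

func main() {
	route, err := nextRoute("https://api.sys.example.com/v3/spaces?page=2&per_page=50")
	if err != nil {
		panic(err)
	}
	fmt.Println(route) // /v3/spaces?page=2&per_page=50
}
```

With this, the client would request `base + "/v3/spaces?page=2&per_page=50"` instead of concatenating two full URLs into the invalid `...9024https//...` address seen in the log.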

As a workaround we will increase the limit for now to get our teams working again.

Please fix.

Cannot access host IP from container

Issue

Hello, I have an app running on a diego_cell machine, e.g. 10.0.6.10:45454, manually deployed.
Then I deploy a container app that needs to communicate with the host app from inside the container; the container address is 10.255.211.13 and the app is running on port 8080.

That means:
container 10.255.211.13 <-> host 10.0.6.10:45454

I am unable to achieve this; I am getting the message: Connection refused.

I changed cf-deployment to enable the garden property allow_host_access: true, then recreated all VMs, but the result is the same.
Using PAS 2.1.13, garden 1.13.3.

Expected result

Since allow_host_access is set to true, I expect to be able to reach the host's app.

Current result

Connection refused.

Can someone help? Thanks.

Setup VPN connectivity for cf apps

Issue

As a platform operator I want to be able to connect applications of a specific
space to a private customer network using a VPN solution (e.g. ipsec, openvpn).

Is there any discussion/proposal/design document on that direction?
What do you think?



Container networking stops too early on cell draining (CF update)

Issue

On CF update (cell draining), container networking stops working for drained app instances before they are stopped.

Context

The issue was first observed in a Java application that uses a Hazelcast cluster (a distributed hashmap).
On CF updates, I see cluster outages due to lost connectivity between cluster nodes (= CF app instances). I don't observe similar outages when updating my application using cf7 restart --strategy rolling.

Follow-up for discussion that started on #networking slack.
CF 12.39

Steps to Reproduce

See README.md in attached c2ctest.zip. The zip contains a test application but also a log file of one test run.

The test application starts a simple web server and probes connectivity to itself via internal (container networking) and external (gorouter) routes every 5s.
App instances are identified via CF_INSTANCE_INTERNAL_IP. Logs record the connectivity, i.e. which app instance requested and which app instance responded.

Connectivity behavior is observed during a rolling app restart (cf7 restart c2ctest --strategy rolling) and during a recreate of the Diego cell on which the app instance was running (simulating a CF update).

Expected result

No connectivity errors between the two app instances during updates, no matter whether the update is a CF update or a rolling application update.
Since BOSH DNS might be slow to reflect the app instances, I expect at least one successful communication between the old and new app instances using container networking (internal routes) while updating.

Current result

During CF update (bosh recreate diego-cell)

  • Connection failures on internal routes starting at the beginning of cell draining
  • No successful communication between the old and new app instances using internal routes

No problems observed during rolling application restart.

Possible Fix

no idea

Additional Context

c2ctest.zip with an example app that demonstrates the issue.

Post-start script in job "policy-server" and "policy-server-internal" failed when deploying CF

Issue

When deploying CF, the bosh director reported an error complaining the post-start script failed in job "policy-server" and "policy-server-internal":

Task 28 | 23:03:51 | Updating instance api: api/45b53d96-d781-47cd-984c-2d49d9718077 (0) (canary) (00:05:32)
L Error: Action Failed get_task: Task e66171d7-d403-4900-7eeb-03bb7e05167c result: 2 of 3 post-start scripts failed. Failed Jobs: policy-server, policy-server-internal. Successful Jobs: cloud_controller_ng.
Task 28 | 23:04:48 | Updating instance uaa: uaa/c3d7b74e-7be8-42e2-beb4-1f4031306ce7 (1) (00:02:45)
Task 28 | 23:04:48 | Error: Action Failed get_task: Task e66171d7-d403-4900-7eeb-03bb7e05167c result: 2 of 3 post-start scripts failed. Failed Jobs: policy-server, policy-server-internal. Successful Jobs: cloud_controller_ng.

Context

I'm deploying CF on Azure using the azure-quick-start-template; the CF manifest version is v2.2.0.

Steps to Reproduce

The issue seems to be occasional; I have deployed CF on Azure many times and have only hit this error twice.

Expected result

The CF should be deployed successfully.

Current result

The bosh director reported the error with following running log:

Task 28

Task 28 | 22:00:06 | Preparing deployment: Preparing deployment (00:00:11)
Task 28 | 22:00:28 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 28 | 22:00:29 | Creating missing vms: consul/a9cb4f16-d318-4876-940d-e7e52114209d (2)
Task 28 | 22:00:29 | Creating missing vms: api/07666b3c-74e2-4838-8b8d-82ecd8de7e02 (1)
Task 28 | 22:00:29 | Creating missing vms: consul/5a183dcf-0d63-44e6-a18c-5d319126c680 (0)
Task 28 | 22:00:29 | Creating missing vms: consul/f6f4bfea-2c77-4d17-8c6b-da85f518dbf7 (1)
Task 28 | 22:00:29 | Creating missing vms: adapter/e58e9f9d-176c-4184-a65c-dbd4a585547f (0)
Task 28 | 22:00:29 | Creating missing vms: diego-api/81a18f7f-63c8-4fa4-865c-6e003b7a4006 (1)
Task 28 | 22:00:29 | Creating missing vms: uaa/33fc6f70-27ba-4f65-b349-24751ebb4368 (0)
Task 28 | 22:00:29 | Creating missing vms: uaa/c3d7b74e-7be8-42e2-beb4-1f4031306ce7 (1)
Task 28 | 22:00:29 | Creating missing vms: router/b929e53b-fcb6-4d6e-8ca8-936f705ea572 (0)
Task 28 | 22:00:29 | Creating missing vms: diego-api/8cfbefe1-4b52-4293-99f5-808f7ec40c7c (0)
Task 28 | 22:00:29 | Creating missing vms: adapter/00bed6a8-acad-46ea-9f8a-2430e81fc128 (1)
Task 28 | 22:00:29 | Creating missing vms: diego-cell/a760ec90-a4e1-4b27-a686-ea615d15e4ad (0)
Task 28 | 22:00:29 | Creating missing vms: tcp-router/b59178ed-160f-4c0d-af0f-07acf0df9e32 (1)
Task 28 | 22:00:29 | Creating missing vms: doppler/435d8a37-817b-4adb-b1f7-f2c91c172a0f (0)
Task 28 | 22:00:29 | Creating missing vms: scheduler/0588f7ff-5131-4dec-a626-7919a800dfbe (1)
Task 28 | 22:00:29 | Creating missing vms: nats/709a90fa-9bfb-46dd-803f-631e76524835 (0)
Task 28 | 22:00:29 | Creating missing vms: scheduler/42a0f771-1ca9-4e3c-8e99-67b3c05bcc18 (0)
Task 28 | 22:00:29 | Creating missing vms: doppler/b54673df-3e68-4536-bacb-91f5c9fa32cb (1)
Task 28 | 22:00:29 | Creating missing vms: router/efb756e2-922e-4d74-97a3-d910cb03f9f1 (1)
Task 28 | 22:00:29 | Creating missing vms: cc-worker/5735938a-ffe7-4e03-a67e-ab8908794dab (1)
Task 28 | 22:00:29 | Creating missing vms: diego-cell/c27c2575-5f85-4760-b262-a4231deb29b8 (1)
Task 28 | 22:00:29 | Creating missing vms: database/531e17be-a0ed-4409-99f6-4ccb4e948b58 (0)
Task 28 | 22:00:29 | Creating missing vms: log-api/8e3f8ffc-df1c-464a-960c-275ef412718f (0)
Task 28 | 22:00:29 | Creating missing vms: cc-worker/32cd1d01-051a-4e34-91e7-fa840fb64c83 (0)
Task 28 | 22:00:29 | Creating missing vms: tcp-router/46415b07-5a38-45ab-84ab-3886471f7245 (0)
Task 28 | 22:00:29 | Creating missing vms: doppler/948b4f21-e382-4204-84a5-d797662d8da0 (2)
Task 28 | 22:00:29 | Creating missing vms: log-api/3f7eaf7f-9a99-4697-bd6d-fcf3612ca647 (1)
Task 28 | 22:00:29 | Creating missing vms: doppler/a54e1be3-e5c1-46c6-9041-f40e5a4dd792 (3)
Task 28 | 22:00:29 | Creating missing vms: nats/4efa1a74-1437-4651-bdf5-643031c0a879 (1)
Task 28 | 22:00:29 | Creating missing vms: api/45b53d96-d781-47cd-984c-2d49d9718077 (0)
Task 28 | 22:04:44 | Creating missing vms: uaa/c3d7b74e-7be8-42e2-beb4-1f4031306ce7 (1) (00:04:15)
Task 28 | 22:04:46 | Creating missing vms: api/45b53d96-d781-47cd-984c-2d49d9718077 (0) (00:04:17)
Task 28 | 22:05:08 | Creating missing vms: database/531e17be-a0ed-4409-99f6-4ccb4e948b58 (0) (00:04:39)
Task 28 | 22:05:09 | Creating missing vms: doppler/b54673df-3e68-4536-bacb-91f5c9fa32cb (1) (00:04:40)
Task 28 | 22:05:09 | Creating missing vms: uaa/33fc6f70-27ba-4f65-b349-24751ebb4368 (0) (00:04:40)
Task 28 | 22:05:16 | Creating missing vms: api/07666b3c-74e2-4838-8b8d-82ecd8de7e02 (1) (00:04:47)
Task 28 | 22:05:19 | Creating missing vms: adapter/e58e9f9d-176c-4184-a65c-dbd4a585547f (0) (00:04:50)
Task 28 | 22:05:22 | Creating missing vms: diego-api/81a18f7f-63c8-4fa4-865c-6e003b7a4006 (1) (00:04:53)
Task 28 | 22:05:23 | Creating missing vms: doppler/948b4f21-e382-4204-84a5-d797662d8da0 (2) (00:04:54)
Task 28 | 22:05:26 | Creating missing vms: doppler/a54e1be3-e5c1-46c6-9041-f40e5a4dd792 (3) (00:04:57)
Task 28 | 22:05:27 | Creating missing vms: consul/5a183dcf-0d63-44e6-a18c-5d319126c680 (0) (00:04:58)
Task 28 | 22:05:28 | Creating missing vms: consul/a9cb4f16-d318-4876-940d-e7e52114209d (2) (00:04:59)
Task 28 | 22:05:29 | Creating missing vms: scheduler/0588f7ff-5131-4dec-a626-7919a800dfbe (1) (00:05:00)
Task 28 | 22:05:30 | Creating missing vms: nats/709a90fa-9bfb-46dd-803f-631e76524835 (0) (00:05:01)
Task 28 | 22:05:30 | Creating missing vms: diego-cell/c27c2575-5f85-4760-b262-a4231deb29b8 (1) (00:05:01)
Task 28 | 22:05:30 | Creating missing vms: consul/f6f4bfea-2c77-4d17-8c6b-da85f518dbf7 (1) (00:05:01)
Task 28 | 22:05:31 | Creating missing vms: diego-cell/a760ec90-a4e1-4b27-a686-ea615d15e4ad (0) (00:05:02)
Task 28 | 22:05:32 | Creating missing vms: diego-api/8cfbefe1-4b52-4293-99f5-808f7ec40c7c (0) (00:05:03)
Task 28 | 22:05:33 | Creating missing vms: tcp-router/b59178ed-160f-4c0d-af0f-07acf0df9e32 (1) (00:05:04)
Task 28 | 22:05:33 | Creating missing vms: adapter/00bed6a8-acad-46ea-9f8a-2430e81fc128 (1) (00:05:04)
Task 28 | 22:05:33 | Creating missing vms: log-api/3f7eaf7f-9a99-4697-bd6d-fcf3612ca647 (1) (00:05:04)
Task 28 | 22:05:33 | Creating missing vms: nats/4efa1a74-1437-4651-bdf5-643031c0a879 (1) (00:05:04)
Task 28 | 22:05:33 | Creating missing vms: tcp-router/46415b07-5a38-45ab-84ab-3886471f7245 (0) (00:05:04)
Task 28 | 22:05:43 | Creating missing vms: doppler/435d8a37-817b-4adb-b1f7-f2c91c172a0f (0) (00:05:14)
Task 28 | 22:05:43 | Creating missing vms: log-api/8e3f8ffc-df1c-464a-960c-275ef412718f (0) (00:05:14)
Task 28 | 22:05:44 | Creating missing vms: cc-worker/5735938a-ffe7-4e03-a67e-ab8908794dab (1) (00:05:15)
Task 28 | 22:05:48 | Creating missing vms: cc-worker/32cd1d01-051a-4e34-91e7-fa840fb64c83 (0) (00:05:19)
Task 28 | 22:07:18 | Creating missing vms: router/b929e53b-fcb6-4d6e-8ca8-936f705ea572 (0) (00:06:49)
Task 28 | 22:08:07 | Creating missing vms: router/efb756e2-922e-4d74-97a3-d910cb03f9f1 (1) (00:07:38)
Task 28 | 22:33:51 | Creating missing vms: scheduler/42a0f771-1ca9-4e3c-8e99-67b3c05bcc18 (0) (00:33:22)
Task 28 | 22:33:53 | Updating instance consul: consul/5a183dcf-0d63-44e6-a18c-5d319126c680 (0) (canary) (00:05:45)
Task 28 | 22:39:38 | Updating instance consul: consul/f6f4bfea-2c77-4d17-8c6b-da85f518dbf7 (1) (00:04:35)
Task 28 | 22:44:13 | Updating instance consul: consul/a9cb4f16-d318-4876-940d-e7e52114209d (2) (00:05:20)
Task 28 | 22:49:33 | Updating instance adapter: adapter/e58e9f9d-176c-4184-a65c-dbd4a585547f (0) (canary)
Task 28 | 22:49:33 | Updating instance nats: nats/709a90fa-9bfb-46dd-803f-631e76524835 (0) (canary) (00:00:49)
Task 28 | 22:50:22 | Updating instance adapter: adapter/e58e9f9d-176c-4184-a65c-dbd4a585547f (0) (canary) (00:00:49)
Task 28 | 22:50:22 | Updating instance nats: nats/4efa1a74-1437-4651-bdf5-643031c0a879 (1)
Task 28 | 22:50:22 | Updating instance adapter: adapter/00bed6a8-acad-46ea-9f8a-2430e81fc128 (1) (00:00:26)
Task 28 | 22:50:48 | Updating instance nats: nats/4efa1a74-1437-4651-bdf5-643031c0a879 (1) (00:00:26)
Task 28 | 22:50:48 | Updating instance database: database/531e17be-a0ed-4409-99f6-4ccb4e948b58 (0) (canary) (00:06:14)
Task 28 | 22:57:02 | Updating instance diego-api: diego-api/8cfbefe1-4b52-4293-99f5-808f7ec40c7c (0) (canary) (00:00:51)
Task 28 | 22:57:53 | Updating instance diego-api: diego-api/81a18f7f-63c8-4fa4-865c-6e003b7a4006 (1) (00:00:26)
Task 28 | 22:58:19 | Updating instance cc-worker: cc-worker/32cd1d01-051a-4e34-91e7-fa840fb64c83 (0) (canary)
Task 28 | 22:58:19 | Updating instance uaa: uaa/33fc6f70-27ba-4f65-b349-24751ebb4368 (0) (canary)
Task 28 | 22:58:19 | Updating instance api: api/45b53d96-d781-47cd-984c-2d49d9718077 (0) (canary)
Task 28 | 23:02:03 | Updating instance uaa: uaa/33fc6f70-27ba-4f65-b349-24751ebb4368 (0) (canary) (00:03:44)
Task 28 | 23:02:03 | Updating instance uaa: uaa/c3d7b74e-7be8-42e2-beb4-1f4031306ce7 (1)
Task 28 | 23:02:09 | Updating instance cc-worker: cc-worker/32cd1d01-051a-4e34-91e7-fa840fb64c83 (0) (canary) (00:03:50)
Task 28 | 23:02:09 | Updating instance cc-worker: cc-worker/5735938a-ffe7-4e03-a67e-ab8908794dab (1) (00:00:34)
Task 28 | 23:03:51 | Updating instance api: api/45b53d96-d781-47cd-984c-2d49d9718077 (0) (canary) (00:05:32)
L Error: Action Failed get_task: Task e66171d7-d403-4900-7eeb-03bb7e05167c result: 2 of 3 post-start scripts failed. Failed Jobs: policy-server, policy-server-internal. Successful Jobs: cloud_controller_ng.
Task 28 | 23:04:48 | Updating instance uaa: uaa/c3d7b74e-7be8-42e2-beb4-1f4031306ce7 (1) (00:02:45)
Task 28 | 23:04:48 | Error: Action Failed get_task: Task e66171d7-d403-4900-7eeb-03bb7e05167c result: 2 of 3 post-start scripts failed. Failed Jobs: policy-server, policy-server-internal. Successful Jobs: cloud_controller_ng.

Task 28 Started Fri Jun 29 22:00:06 UTC 2018
Task 28 Finished Fri Jun 29 23:04:48 UTC 2018
Task 28 Duration 01:04:42
Task 28 error

Updating deployment:
Expected task '28' to succeed but state is 'error'

Exit code 1

I run monit summary in VM api/45b53d96-d781-47cd-984c-2d49d9718077 and get:

Process 'consul_agent' running
Process 'cloud_controller_ng' running
Process 'cloud_controller_worker_local_1' running
Process 'cloud_controller_worker_local_2' running
Process 'nginx_cc' running
Process 'route_registrar' running
Process 'statsd_injector' running
Process 'file_server' running
Process 'routing-api' running
Process 'policy-server' not monitored
Process 'policy-server-internal' Connection failed
Process 'cc_uploader' running
Process 'loggregator_agent' running
System 'system_localhost' running

Log /var/vcap/sys/log/policy-server/post-start.stderr.log in VM api/45b53d96-d781-47cd-984c-2d49d9718077:

+ source /var/vcap/packages/networking-ctl-utils/ctl_util.sh
+ export URL=127.0.0.1:4002
+ URL=127.0.0.1:4002
+ export TIMEOUT=25
+ TIMEOUT=25
++ wait_for_server_to_become_healthy 127.0.0.1:4002 25
++ local url=127.0.0.1:4002
++ local timeout=25
+++ seq 25
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4002
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4002
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
......
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4002
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4002
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
++ echo 1
+ exit 1

Log /var/vcap/sys/log/policy-server-internal/post-start.stderr.log in VM api/45b53d96-d781-47cd-984c-2d49d9718077:

+ source /var/vcap/packages/networking-ctl-utils/ctl_util.sh
+ export URL=127.0.0.1:4003
+ URL=127.0.0.1:4003
+ export TIMEOUT=25
+ TIMEOUT=25
++ wait_for_server_to_become_healthy 127.0.0.1:4003 25
++ local url=127.0.0.1:4003
++ local timeout=25
+++ seq 25
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4003
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4003
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
......
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4003
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
++ for _ in '$(seq "${timeout}")'
++ set +e
++ curl -f --connect-timeout 1 127.0.0.1:4003
++ '[' 7 -eq 0 ']'
++ set -e
++ sleep 1
++ echo 1
+ exit 1

Log /var/vcap/sys/log/policy-server/policy-server.stderr.log in VM api/45b53d96-d781-47cd-984c-2d49d9718077:

......
2018/07/02 03:21:34 cfnetworking.policy-server: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:22:44 cfnetworking.policy-server: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:23:54 cfnetworking.policy-server: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:25:04 cfnetworking.policy-server: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:26:14 cfnetworking.policy-server: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:27:24 cfnetworking.policy-server: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2

Log /var/vcap/sys/log/policy-server-internal/policy-server-internal.stderr.log in VM api/45b53d96-d781-47cd-984c-2d49d9718077:

......
2018/07/02 03:23:14 cfnetworking.policy-server-internal: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:24:25 cfnetworking.policy-server-internal: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:25:34 cfnetworking.policy-server-internal: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:26:45 cfnetworking.policy-server-internal: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:27:54 cfnetworking.policy-server-internal: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
2018/07/02 03:29:05 cfnetworking.policy-server-internal: failed to construct datastore: perform migrations: executing migration: Error 1060: Duplicate column name 'start_port' handling 2
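The repeated `Error 1060: Duplicate column name 'start_port'` above suggests the migration that adds `start_port` is not idempotent when re-run against a schema that already has the column. A guard of the following shape avoids the duplicate-column failure on MySQL, which has no `ADD COLUMN IF NOT EXISTS`. This is a hedged sketch only: the table name `destinations` is assumed for illustration, and the actual policy-server migration framework may handle this differently.

```go
package main

import "fmt"

// columnExistsQuery builds an information_schema check that a migration can
// run before issuing ALTER TABLE ... ADD COLUMN, so that re-running the
// migration does not fail with "Duplicate column name". Table and column
// names here come from trusted migration code, not user input.
func columnExistsQuery(table, column string) string {
	return fmt.Sprintf(
		"SELECT COUNT(*) FROM information_schema.COLUMNS"+
			" WHERE TABLE_SCHEMA = DATABASE()"+
			" AND TABLE_NAME = '%s' AND COLUMN_NAME = '%s'",
		table, column)
}

func main() {
	// "destinations" is an assumed table name for illustration only.
	fmt.Println(columnExistsQuery("destinations", "start_port"))
}
```

If the count returned is non-zero, the migration would skip the `ADD COLUMN` step instead of erroring, letting the server finish its migrations and start listening on ports 4002/4003.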

It seems these two errors are accompanied by curl (health check) failures, so I ran curl manually and got:

api/45b53d96-d781-47cd-984c-2d49d9718077:/var/vcap/data/sys/log/policy-server-internal# curl -f --connect-timeout 1 127.0.0.1:4002

curl: (7) Failed to connect to 127.0.0.1 port 4002: Connection refused

api/45b53d96-d781-47cd-984c-2d49d9718077:/var/vcap/data/sys/log/policy-server-internal# curl -f --connect-timeout 1 127.0.0.1:4003

curl: (7) Failed to connect to 127.0.0.1 port 4003: Connection refused

I think there should be two services listening on ports 4002 and 4003, but for some reason they are not there. Could you have a look at the issue? Besides, I have two questions:

  • There are two api instances in the cluster; why is only one reported as failing? I ran bosh -e azure -d cf vms and got:
Instance                                         Process State  AZ  IPs        VM CID                                                                                                                                               VM Type        Active
api/07666b3c-74e2-4838-8b8d-82ecd8de7e02         running        z2  10.0.0.19  agent_id:eaa20d34-a50e-436f-8747-2009016c1ebb;resource_group_name:myrg;storage_account_name:mysa                            small        true
api/45b53d96-d781-47cd-984c-2d49d9718077         failing        z1  10.0.0.18  agent_id:271b035c-2feb-4288-9fc0-2102aef9c44d;resource_group_name:myrg;storage_account_name:mysa                             small          true

I logged into the other api instance, which is reported as running, and the folder /var/vcap/sys/log is empty. What happened?

  • The uaa instance doesn't have the policy-server and policy-server-internal jobs, so why does the BOSH director report that the error also happens on the uaa instance? I ran monit summary in VM uaa/c3d7b74e-7be8-42e2-beb4-1f4031306ce7 and confirmed that those jobs are absent:

Process 'consul_agent' running
Process 'uaa' running
Process 'route_registrar' running
Process 'statsd_injector' running
Process 'loggregator_agent' running
System 'system_localhost' running

Could not access the app from the Diego cell host VM the app is running on

Thanks for submitting an issue to cf-networking-release. We are always trying to improve! To help us, please fill out the following template.

Issue

I'm trying to access apps on the Diego cell where the app containers are running.
Say I'm trying to access 10.0.16.5:61000, where 10.0.16.5 is the Diego cell's eth0 IP (the cell host IP), and 10.255.74.3 is the IP of the container.

From another node in the same subnet, say a VM with IP 10.0.16.4, I can wget 10.0.16.5:61000.
It seems the ip:port cannot be reached only from the very cell the containers are running on.

could you please help?

BTW, the 61000 is from the router config:
"game-2048.20.189.129.209.xip.io": [
  {
    "address": "10.0.16.5:61000",
    "tls": false,
    "ttl": 120,
    "tags": {
      "component": "route-emitter"
    },
    "private_instance_id": "192e60cd-4404-42d3-6e87-404d",
    "server_cert_domain_san": "192e60cd-4404-42d3-6e87-404d"
  }
],


Opening many ports between applications is cumbersome

Thanks to Container Networking, it's easy to run simple microservice architectures on Cloud Foundry, such as the "cats and dogs" example app. It's even possible to run more sophisticated distributed systems such as Spark. Doing so is a bit cumbersome today, but a desirable UX seems easily within reach with some minor tweaks to the API.

With Spark, the Master and Workers spin up and down many concurrent TCP listeners, on random ports (to deal with the large numbers and concurrency), and they need to be able to communicate with each other on those ports. It would be nice to open up all ports, or a range of ports, with a single policy definition. Right now, you can do this with the API but the request payload has tens of thousands of items in the array. Through the cf CLI plugin, it's even worse, because you can only define one policy at a time, so you'd have to issue tens of thousands of individual commands.

My current workaround is this ruby script which takes 3-4 minutes to run.
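The scale problem can be sketched concretely. A hedged Python sketch of the kind of payload generation the workaround has to do (the app GUIDs are placeholders, and the one-array-item-per-port body shape follows the situation described in this issue):

```python
def bulk_policy_payload(src_guid, dst_guid, start, end, protocol="tcp"):
    """Build one policy entry per port, since a single policy definition
    cannot express a range; the resulting array is enormous."""
    return {
        "policies": [
            {
                "source": {"id": src_guid},
                "destination": {"id": dst_guid, "protocol": protocol, "port": p},
            }
            for p in range(start, end + 1)
        ]
    }

# placeholder GUIDs; opening the full unprivileged port range
payload = bulk_policy_payload("src-app-guid", "dst-app-guid", 1024, 65535)
print(len(payload["policies"]))  # 64512
```

A single "ports": {"start": ..., "end": ...} field in the policy definition would collapse this to one array item.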

/cc @rusha19 @jaydunk @rosenhouse @jbayer

Looking for a feature to inspect container inbound and outbound traffic

Hi

I know that the CAP_NET_ADMIN capability (needed to run tcpdump) has been removed from the Warden container in order to drastically reduce the attack surface. This is understood, and it makes sense not to let container users execute privileged binaries like tcpdump.

We used to run tcpdump in a container to capture network traffic in the context of troubleshooting sessions. It is very convenient and non-intrusive, i.e. it does not require any configuration changes in applications. Mostly we are not interested in the payload itself; we are only interested in HTTP request and response headers.

From an exchange with the cf-container-networking folks, I learned that there is already a solution today that forwards packets outside of the container's network stack to a logging system. More specifically, the first packet of every connection going out of the container is sent to the logging system.

Now I wonder whether it would be possible to enrich this capability so that:

  • It is possible to define an IP address and port filter (source for inbound, destination for outbound)
  • Event triggers can be used, e.g. only start logging when a header arrives that has a certain value (regex)
  • Both inbound and outbound packets, or even only the HTTP headers (including the HTTP method and so on), are forwarded to the logging system

Your thoughts are appreciated very much
Regards
Ray

`GET /networking/v1/external/policies` missing policies from paginated spaces

Issue

As a user who has access to more spaces than the pagination limit (50)
When I request a list of network policies
Then I should see policies from all of the spaces I have access to

At the moment the API only looks at the first page of spaces, so any network policies that relate to subsequent spaces won't be listed in the results.

Context

Our tenants reported a "weird issue" whereby they'd run

cf add-network-policy ...

and then immediately run:

cf network-policies

But couldn't find the network policy they'd just added.

This only happened for some users - if you created a new user with access only to that space you could see the network-policies just fine.

Some investigation pointed us at src/policy-server/cc_client/client.go which looks like it doesn't attempt to walk through the pagination.
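The missing behavior can be sketched as follows; a hedged illustration of walking next_url-style paging, where get_page stands in for the real HTTP client (the actual fix belongs in the Go client above):

```python
def fetch_all_spaces(get_page):
    """Follow Cloud Controller pagination: keep requesting next_url
    until it is None, accumulating resources from every page."""
    resources = []
    url = "/v2/spaces?results-per-page=50"
    while url:
        page = get_page(url)
        resources.extend(page["resources"])
        url = page.get("next_url")
    return resources

# fake two-page response to demonstrate the walk (61 spaces total)
pages = {
    "/v2/spaces?results-per-page=50": {"resources": list(range(50)), "next_url": "/page2"},
    "/page2": {"resources": list(range(50, 61)), "next_url": None},
}
print(len(fetch_all_spaces(pages.__getitem__)))  # 61
```

Without the loop, only the first 50 spaces are seen, which is exactly the symptom described here.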

Steps to Reproduce

Prepare a user with access to more than 50 spaces:

cf create-org test-policy-bug
cf target -o test-policy-bug

cf create-user [email protected] S3cr3tS3cr3t

for i in $(seq 1 60); do
    cf create-space space-$i
    cf set-space-role [email protected] test-policy-bug space-$i SpaceDeveloper
done

Also one user with access to only space-60

cf create-user [email protected] S3cr3tS3cr3t
cf set-space-role [email protected] test-policy-bug space-60 SpaceDeveloper

Log in as the user with > 50 spaces, deploy a pair of apps, and set up the policy

cf login -u [email protected] -p S3cr3tS3cr3t
cf target -o test-policy-bug -s space-60

mkdir -p /tmp/test-app
touch /tmp/test-app/Staticfile /tmp/test-app/index.html
cf push app1 -p /tmp/test-app
cf push app2 -p /tmp/test-app

cf add-network-policy app1 --destination-app app2

The user cannot see the policy, even though the policy itself is working fine:

$ cf network-policies
Listing network policies in org test-policy-bug / space space-60 as [email protected]...

source   destination   protocol   ports
$

But the other user, with access to only one space, is able to see it, and the same goes for an admin user:

$ cf login -u [email protected] -p S3cr3tS3cr3t
....
$ cf target -o test-policy-bug -s space-60
....
$ cf network-policies
Listing network policies in org test-policy-bug / space space-60 as [email protected]...

source   destination   protocol   ports
$

Expected result

It should list all of the network policies from the spaces that the user has access to.

Current result

It only includes network policies that relate to spaces on the first page of results.

Possible Fix

See #60

Intent to Support Dynamic/Live ASG Updates

Overview

ASG (Application Security Group) rule application currently requires app restarts and downtime in order for rules to take effect. Cloud Foundry operators with large installations and distributed teams spend an enormous amount of time and money trying to wrangle developer teams to restart their apps when global rules change. It also seems at odds with expectations, as developers by and large expect these changes to take effect immediately with no extra action on their part.

Proposal

We have generated this proposal to enable the dynamic application of ASG rules to running containers using the Silk CNI.

We will be using this issue as our central coordination point and linking any associated PRs. Please let us know if you have requests or concerns.

Handle CORS calls

Issue

It should be possible to target the /networking endpoints via a browser. When doing so, a browser will first check CORS by issuing a preflight request. These requests fail with a 405 status code.

Steps to Reproduce

Make an http request to /networking/v1/external/policies using the OPTIONS method (this simulates the preflight request which is performed by browsers by default).

Expected result

Response with 200 status code.

Current result

Response with 405 status code (Method Not Allowed)
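The desired behavior can be sketched as a preflight-aware handler in front of the existing routes; this is an illustrative sketch, not the policy server's actual code (header names are the standard CORS response headers):

```python
def handle_preflight(method, headers):
    """Hypothetical preflight handling: answer OPTIONS with 200 and the
    CORS headers browsers expect, instead of the current 405."""
    if method != "OPTIONS":
        return 405, {}
    return 200, {
        "Access-Control-Allow-Origin": headers.get("Origin", "*"),
        "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
        "Access-Control-Allow-Headers": "Authorization, Content-Type",
    }

status, cors = handle_preflight("OPTIONS", {"Origin": "https://example.test"})
print(status)  # 200
```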

selecting all active subnets: pq: relation \"subnets\" does not exist

Issue

I deployed the cf-networking release as configured in cf-deployment, using an external PostgreSQL 9.4 as the data backend. While the silk-controller starts, the cells, which contain the silk-daemon, fail when starting. Looking into the silk-daemon logs, I see the following error, which is also displayed by the silk-controller:

{"timestamp":"1512762188.908506393","source":"cfnetworking.silk-daemon","message":"cfnetworking.silk-daemon.http-client","log_level":2,"data":{"body":"{\"error\": \"getting lease for underlay ip: pq: relation \"subnets\" does not exist\"}","code":500,"error":"{\"error\": \"getting lease for underlay ip: pq: relation \"subnets\" does not exist\"}"}}

Context

I have a Cloud Foundry to which I wanted to add container networking. I added Silk, deployed, and it failed. I have verified that:

  • credentials are correct
  • db url is correct
  • the db is reachable
  • I can log in with the URL and the credentials
  • tables are present in the database

Database on the PG instance:

containernetworking=> \dt
                   List of relations
 Schema |      Name       | Type  |        Owner        
--------+-----------------+-------+---------------------
 public | destinations    | table | containernetworking
 public | gorp_lock       | table | containernetworking
 public | gorp_migrations | table | containernetworking
 public | groups          | table | containernetworking
 public | policies        | table | containernetworking

Steps to Reproduce

Add these snippets (which, apart from the db config, are the same as in vanilla cf-deployment) to a cf-deployment that does not contain cf-networking.
Manifest snippet on the diego-api:

  - name: silk-controller
    release: cf-networking
    properties:
      cf_networking:
        silk_controller:
          ca_cert: ((silk_controller.ca))
          server_cert: ((silk_controller.certificate))
          server_key: ((silk_controller.private_key))
          database:
            type: postgres
            username: containernetworking
            password: ((/a9s_pg_container-networking_password))
            host: a9s-pg-psql-master-alias.node.dc1.consul
            port: 5432
            name: containernetworking
        silk_daemon:
          ca_cert: ((silk_daemon.ca))
          client_cert: ((silk_daemon.certificate))
          client_key: ((silk_daemon.private_key))

Cell manifest snippet:

  - name: garden-cni
    release: cf-networking
  - name: netmon
    release: cf-networking
  - name: vxlan-policy-agent
    release: cf-networking
    properties:
      cf_networking:
        vxlan_policy_agent:
          ca_cert: ((network_policy_client.ca))
          client_cert: ((network_policy_client.certificate))
          client_key: ((network_policy_client.private_key))
  - name: silk-daemon
    release: cf-networking
    properties:
      cf_networking:
        silk_daemon:
          ca_cert: ((silk_daemon.ca))
          client_cert: ((silk_daemon.certificate))
          client_key: ((silk_daemon.private_key))
  - name: cni

Deploy with PostgreSQL 9.4 as the data store

Expected result

I can use container networking and the deployment succeeds

Current result

The cell fails because the silk-daemon does not start; the silk-controller is running but throws errors because the relation 'subnets' does not exist:

{"timestamp":"1512755942.645250082","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.request.leases-acquire.getting lease for underlay ip: pq: relation \"subnets\" does not exist","log_level":2,"data":{"error":"getting lease for underlay ip: pq: relation \"subnets\" does not exist","method":"PUT","request":"/leases/acquire","session":"647.1"}}
{"timestamp":"1512755959.055695534","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.metric-getter","log_level":2,"data":{"error":"selecting all subnets: pq: relation \"subnets\" does not exist","source":"totalLeases"}}
{"timestamp":"1512755959.063234806","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.metric-getter","log_level":2,"data":{"error":"selecting all subnets: pq: relation \"subnets\" does not exist","source":"freeLeases"}}
{"timestamp":"1512755959.083369732","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.metric-getter","log_level":2,"data":{"error":"selecting all active subnets: pq: relation \"subnets\" does not exist","source":"staleLeases"}}
{"timestamp":"1512755982.123170137","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.request.leases-acquire.getting lease for underlay ip: pq: relation \"subnets\" does not exist","log_level":2,"data":{"error":"getting lease for underlay ip: pq: relation \"subnets\" does not exist","method":"PUT","request":"/leases/acquire","session":"661.1"}}
{"timestamp":"1512755989.093193769","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.metric-getter","log_level":2,"data":{"error":"selecting all subnets: pq: relation \"subnets\" does not exist","source":"totalLeases"}}
{"timestamp":"1512755989.101576567","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.metric-getter","log_level":2,"data":{"error":"selecting all subnets: pq: relation \"subnets\" does not exist","source":"freeLeases"}}
{"timestamp":"1512755989.113017321","source":"cfnetworking.silk-controller","message":"cfnetworking.silk-controller.metric-getter","log_level":2,"data":{"error":"selecting all active subnets: pq: relation \"subnets\" does not exist","source":"staleLeases"}}

Possible Fix

Maybe the migration step forgets to create the table? If so, creating it should fix it.

Add the cf-networking plugin to the CF-Community plugin repository

Now that the networking release is part of cf-deployment, a good next step would be to make the plugin a part of the official CF-Community Plugin Repository. That way people can easily find it without having to hunt down this git repo.

The big benefit is that all current CF CLI users can just type in cf install-plugin cf-networking (or whatever name you decide to go with in the public repository) and have the functionality installed.

Deploy fails sometimes due to race condition when starting netmon, vxlan-policy-agent and garden

Issue

When deploying cells in parallel (max-in-flight is 8) on a big deployment (~500 cells), rep sometimes fails to start. This fails the deployment process, which then needs to be started again. During the last deploy this happened ~10 times.

Task 30730 | 07:24:33 | Updating instance cell_stg2: cell_stg2/18d6f81b-6173-4824-947a-01f37f615168 (36) (00:15:27)
                     L Error: 'cell_stg2/18d6f81b-6173-4824-947a-01f37f615168 (36)' is not running after update. Review logs for failed jobs: rep
Task 30730 | 07:24:47 | Updating instance cell_stg2: cell_stg2/9f90ea46-d9e3-4633-9944-ae5474f96efc (18) (00:04:16)
Task 30730 | 07:25:51 | Updating instance cell_stg2: cell_stg2/a54fc6f0-43e1-4613-bcd1-b3ab958bc1ab (87) (00:05:04)
Task 30730 | 07:26:01 | Updating instance cell_stg2: cell_stg2/a8b88cb7-3b3f-4f20-b311-e57c35b9e369 (69) (00:05:04)
Task 30730 | 07:26:02 | Updating instance cell_stg2: cell_stg2/a611363f-20f6-45ef-af31-469b22c3f102 (24) (00:05:06)
Task 30730 | 07:26:08 | Updating instance cell_stg2: cell_stg2/a9d3b954-b06d-4331-aec2-563293ebb1d7 (66) (00:05:06)
Task 30730 | 07:26:10 | Updating instance cell_stg2: cell_stg2/9fcd8d5a-1ca8-4fc7-9e89-36adfc294156 (78) (00:05:23)
Task 30730 | 07:26:15 | Updating instance cell_stg2: cell_stg2/a993e227-75a5-4fd9-a0dd-3e328e5ed826 (57) (00:05:15)
Task 30730 | 07:26:15 | Error: 'cell_stg2/18d6f81b-6173-4824-947a-01f37f615168 (36)' is not running after update. Review logs for failed jobs: rep

Task 30730 Started  Fri Apr 19 03:47:47 UTC 2019
Task 30730 Finished Fri Apr 19 07:26:15 UTC 2019
Task 30730 Duration 03:38:28
Task 30730 error

Updating deployment:
  Expected task '30730' to succeed but state is 'error'

Exit code 1

Context

cf-deployment v4.4.0
cf-networking 2.15.0

Steps to Reproduce

  1. Stop all jobs:
     monit stop all
  2. Remove /var/vcap/data/garden-cni:
     rm -r /var/vcap/data/garden-cni
  3. Start garden, rep, consul_agent and loggregator_agent:
     monit start consul_agent
     monit start loggregator_agent
     monit start garden
     monit start rep
  4. Start the rest of the jobs:
     monit start all

Expected result

All jobs should start successfully

Current result

rep is failing to start

Note the difference in permissions on the /var/vcap/data/garden-cni directory: the problematic case is missing the execute (x) bit. The permissions depend on which job starts first and creates the folder (a race).

Good

# monit summary
The Monit daemon 5.2.5 uptime: 1h 49m 

Process 'consul_agent'              running
Process 'garden'                    running
Process 'rep'                       running
Process 'route_emitter'             running
Process 'netmon'                    running
Process 'vxlan-policy-agent'        running
Process 'silk-daemon'               running
Process 'loggregator_agent'         running
System 'system_localhost'           running

$ sudo ls -ld /var/vcap/data/garden-cni
drwx------ 3 vcap vcap 4096 Apr 18 04:37 /var/vcap/data/garden-cni

$ sudo su vcap -c "ls -l /var/vcap/data/garden-cni"
total 8
drw------- 2 root root 4096 Apr 18 04:37 container-netns
-rw------- 1 root root   22 Apr 18 04:37 external-networker-state.json
-rw------- 1 vcap vcap    0 Apr 18 04:37 iptables.lock

Bad

# monit summary
The Monit daemon 5.2.5 uptime: 1h 47m 

Process 'consul_agent'              running
Process 'garden'                    running
Process 'rep'                       Does not exist
Process 'route_emitter'             running
Process 'netmon'                    running
Process 'vxlan-policy-agent'        running
Process 'silk-daemon'               running
Process 'loggregator_agent'         running
System 'system_localhost'           running

$ sudo ls -ld /var/vcap/data/garden-cni
drw------- 3 vcap vcap 4096 Apr 18 04:27 /var/vcap/data/garden-cni

$ sudo su vcap -c "ls -l /var/vcap/data/garden-cni"
ls: cannot access /var/vcap/data/garden-cni/iptables.lock: Permission denied
ls: cannot access /var/vcap/data/garden-cni/external-networker-state.json: Permission denied
ls: cannot access /var/vcap/data/garden-cni/container-netns: Permission denied
total 0
d????????? ? ? ? ?            ? container-netns
-????????? ? ? ? ?            ? external-networker-state.json
-????????? ? ? ? ?            ? iptables.lock

rep.stdout.log:

...
{
  "timestamp": "2019-04-18T04:28:10.250161481Z",
  "level": "error",
  "source": "rep",
  "message": "rep.exited-with-failure",
  "data": {
    "error": "Exit trace for group:\ngarden_health_checker exited with error: external networker up: exit status 1\ncontainer-metrics-reporter exited with nil\nhub-closer exited with nil\nmetrics-reporter exited with nil\nvolman-driver-syncer exited with nil\ndebug-server exited with nil\n"
  }
}
...

vxlan-policy-agent.stdout.log:

...
{
  "timestamp": "2019-04-18T04:27:43.901602556Z",
  "level": "error",
  "source": "cfnetworking.vxlan-policy-agent",
  "message": "cfnetworking.vxlan-policy-agent.rules-enforcer.create-chain",
  "data": {
    "error": "lock: open lock file: open /var/vcap/data/garden-cni/iptables.lock: permission denied",
    "session": "4"
  }
}
...

Possible Fix

Quick fixes

  • Create the directory right after removing it and set proper permissions,
    or
  • Make garden-external-networker create the directory with permissions 700 instead of 600,
    or
  • BPM could ensure not only the directory's owner but also its permissions

More generic fix

  • Every directory should have an owner, and the owner should ensure the directory's existence and permissions (this is what package managers such as rpm, dpkg, and pacman normally do)
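The first quick fix can be sketched as follows; a hedged illustration of "create it and force the permissions" (not the actual BOSH job code; the demo uses a temp directory instead of /var/vcap/data/garden-cni):

```python
import os
import stat
import tempfile

def ensure_state_dir(path):
    """Create the state directory if missing and force 0700, so
    whichever job wins the race still leaves a traversable directory
    (the bad case observed here was 0600, i.e. no execute bit)."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o700)

root = tempfile.mkdtemp()
state = os.path.join(root, "garden-cni")
ensure_state_dir(state)
print(oct(stat.S_IMODE(os.stat(state).st_mode)))  # 0o700
```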

release components don't support Postgres with SSL required

Issue

policy-server does not support connecting to a PostgreSQL database that requires TLS.

Context

Any service using https://github.com/cloudfoundry/cf-networking-helpers for its db library is impacted by this. In this case, migratedb (run in the pre-start for policy-server) fails. This is due to sslmode=disable being hard-coded in the connection string for postgres (but not for mysql). Therefore, policy-server on postgres ignores the database TLS configuration options that can be specified in the manifest spec.

I tested this using an Azure Database for PostgreSQL with SSL required.

Steps to Reproduce

Make a Postgres server that requires ssl.
Specify database.require_ssl and database.ca_cert in your deployment manifest.
Check out /var/vcap/sys/log/policy-server/pre-start.stdout.log

Expected result

I expected the deployment to succeed by connecting to the database.

Current result

It fails with a complaint that the server requires SSL, but the client does not have any SSL settings configured.

Fails with:
db connect: unable to ping: pq: SSL connection is required. Please specify SSL options and retry.

Possible Fix

This line appears to be the smoking gun.

https://github.com/cloudfoundry/cf-networking-helpers/blob/671c7b0535ed7bde1fa9e4b7a3c38f3e9dbc8798/db/config.go#L28

The hardcoded sslmode=disable needs to be replaced with settings that set the sslmode and potentially specify the CA cert (the sslrootcert value).
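A hedged sketch of what the replacement could look like; the require_ssl/ca_cert option names mirror the manifest properties mentioned above, and verify-ca/require/disable are standard libpq sslmode values:

```python
def postgres_dsn(user, password, host, port, dbname,
                 require_ssl=False, ca_cert_path=None):
    """Build a libpq-style connection string, deriving sslmode from the
    TLS options instead of hard-coding sslmode=disable."""
    if not require_ssl:
        sslmode = "disable"
    elif ca_cert_path:
        sslmode = "verify-ca"  # verify the server cert against the CA
    else:
        sslmode = "require"    # encrypt, but skip CA verification
    dsn = (f"user={user} password={password} host={host} "
           f"port={port} dbname={dbname} sslmode={sslmode}")
    if ca_cert_path:
        dsn += f" sslrootcert={ca_cert_path}"
    return dsn

print(postgres_dsn("policy", "s3cret", "db.example.test", 5432, "policyserver",
                   require_ssl=True, ca_cert_path="/tmp/ca.crt"))
```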

Adding a CNI plugin in the chain

Greetings,
I am trying to create a PoC which requires adding another CNI plugin between the CF wrapper plugin and the Silk CNI plugin. In theory, this should be supported by the CNI spec since version 0.5.
I would like to have something like that, but I am not sure how it can be integrated.

My questions are:
Which version of the CNI spec is supported?
Any thoughts how I can add a binary CNI plugin by keeping everything else in place?
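For reference, chained plugins are declared in order in a CNI network configuration list (conflist). This is only a hypothetical sketch of the shape, not the actual cf-networking configuration: "my-extra-plugin" is a made-up name, and a real silk entry carries more fields.

```json
{
  "cniVersion": "0.3.1",
  "name": "cf-network",
  "plugins": [
    {
      "type": "silk"
    },
    {
      "type": "my-extra-plugin"
    }
  ]
}
```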

Thank you in advance
Konstantinos

Apps network usage metrics missing

Issue

Follow up of this slack discussion

Unlike for CPU, memory, and disk, there is no metric available with information about the network (how much data is being sent/received by the container).

Context

The silk release allows limiting the amount of network bandwidth an application can use on the host (Diego cell), but there is no way to monitor whether applications are reaching the limit and being throttled.

Regardless of the limit being in place or not, I think knowing how much bandwidth is used by each application is useful and should be part of the already existing container metrics.

Steps to Reproduce

Look at data available in ContainerMetric

Expected result / Possible Fix

I would expect to see information about the network. Either a counter of sent / received bytes or an average bandwidth usage for the last period.

There was an implementation of the first option in the Garden API, but it was abandoned due to the switch to cf-networking (see Slack).
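One way such counters could be sampled is from the kernel's per-interface statistics files; a hedged sketch only (the /sys path in the docstring is an assumption about where an agent would read, and the demo uses a fake file):

```python
import os
import tempfile

def read_counter(path):
    """Read a single kernel interface byte counter, e.g.
    /sys/class/net/eth0/statistics/rx_bytes; an agent could sample two
    readings and divide by the interval to get average bandwidth."""
    with open(path) as f:
        return int(f.read().strip())

# demonstrate with a fake counter file
d = tempfile.mkdtemp()
p = os.path.join(d, "rx_bytes")
with open(p, "w") as f:
    f.write("123456\n")
print(read_counter(p))  # 123456
```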

Default GC_Thresh params too low for large environments

Hi cf-networkers! Here is an issue.

Issue

On larger environments (~600+ cells seems to be the tipping point) we experience severe performance degradation once cf-networking is enabled. This can be resolved by increasing various kernel parameters, listed below. I'm opening an issue here rather than PRing a fix because I'm not super sure whether this is best done in the stemcell or in cf-networking and I'd like to get your opinion (I also have no idea how to test this, sorry :-( ).

Context

The default (stemcell) values for net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2 and net.ipv4.neigh.default.gc_thresh3 cause very high cpu load on environments with larger numbers of cells once container networking is enabled, leading to app crashes and system instability.

Steps to Reproduce

Scale to a large (~600ish+) number of nodes (sorry - not sure of an easier way to reproduce this unless you can expand the ARP cache size some other way)

Expected result

The system should not experience instability even with large numbers of cells.

Current result

Once enough cells join the overlay, CPU load becomes very high and apps start crashing (due to health check failures). Kernel logs show neighbour: arp_cache: neighbor table overflow.

Possible Fix

The following sysctl parameters fix the problem:

sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192; 
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096; 
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048;
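To persist the values above across reboots, they could also be written to a sysctl.d fragment (the file name here is arbitrary):

```
# /etc/sysctl.d/60-arp-cache.conf — raise ARP cache GC thresholds
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
```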

Picture of what William Shatner would look like debugging this


App SD docs are out of date

I'm trying to find some official docs on "App SD" / "Internal Routes" / "Polyglot Service Discovery" and can't find anything authoritative.

In this repo, the docs for application service discovery say things like:

  • "Application Developers who want to use container to container networking today are required to bring their own service discovery"
  • "we plan to build service discovery for c2c"

Maybe we can update the docs to reflect the fact that this stuff is built and is part of stock Cloud Foundry?

Also, the official docs.cloudfoundry.org pages are out of date and sparse. E.g.:

  • the docs for internal domains says "When you enable service discovery, the internal domain apps.internal becomes provided for your use." I believe this is no longer true.
  • nothing on that page mentions that internal routes are provided as DNS records for applications on the platform

deleting policies failed

➜ bosh-lite git:(develop) ✗ cf remove-access zuul-proxy backend --protocol tcp --port 8080

Denying traffic from zuul-proxy to backend as admin...
FAILED
deleting policies: failed to make request to policy server

api/b0c7ebc0-23c3-4f21-9119-e577af0fbaff:/var/vcap/sys/log/policy-server# tail -f policy-server.stdout.log
{"timestamp":"1500482810.026837587","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.authenticator: failed to verify token with uaa","log_level":2,"data":{"error":"bad uaa response: 400: {"error":"invalid_token","error_description":"Token has expired"}"}}
^C

CF CLI plugin coloration does not gracefully degrade when piped

Expected behavior:
When I have an app named node-1
And one access rule for node-1
And I execute cf list-access --app node-1
Then I should see a list of access rules for my app with all strings node-1 colorized
And I execute cf list-access --app node-1 | less (or other command that does not accept colors)
Then I should see a list of access rules for my app with no colorized strings
Ex:

Listing policies as [email protected]...
OK

Source              Destination     Protocol        Port
node-1              node-2         tcp             9000

Observed behavior:
...
When I execute cf list-access --app node-1 | less
Then I see all the control characters
Ex:

cf list-access --app node-1 | less
Listing policies as ESC[36;[email protected][0m...
ESC[32;1mOK
ESC[0m
ESC[;1mSource           Destination     Protocol        Port
ESC[0mESC[36;1mnode-1ESC[0m             ESC[36;1mnode-2ESC[0m           tcp             9000

Notes
For reference, it looks like cf apps degrades correctly. When I run cf apps or cf apps | less, the output is colorized and not colorized, respectively (with no extraneous characters). I suspect that implementation may use the colorization library differently.
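The usual fix is to gate colorization on whether stdout is a terminal; a hedged Python sketch of the idea (the CF CLI plugin itself is written in Go, so this only illustrates the pattern):

```python
import sys

def colorize(text, code="36;1", enabled=None):
    """Wrap text in ANSI escape codes only when stdout is a TTY, so
    piping to less (or any non-TTY) gets plain text. `enabled`
    overrides the TTY detection, e.g. for tests."""
    if enabled is None:
        enabled = sys.stdout.isatty()
    return f"\033[{code}m{text}\033[0m" if enabled else text

print(colorize("node-1", enabled=False))  # plain "node-1"
print(colorize("node-1", enabled=True))   # wrapped in ANSI escape codes
```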

Allowing networking tests to run in parallel with CF acceptance tests

Currently we have to run the nats suite after the cats suite due to nats messing with global settings like running ASGs: https://github.com/cloudfoundry/cf-networking-release/blob/develop/src/test/acceptance/external_connectivity_test.go. Could this test instead use ASGs scoped to a given org to avoid test pollution?

This has previously bitten our team: we started by running CATs and NATs in parallel, and only later discovered that this introduced flakiness due to the changing global ASGs.
