gacybercenter / kinetic

27.0 12.0 12.0 8.42 MB

(MIRROR) Deployment and maintenance tool for Cyber Ranges. Core components are salt, openstack, and ceph.

Home Page: https://gitlab.com/gacybercenter/open/kinetic/kinetic

License: Apache License 2.0

SaltStack 62.99% Shell 3.52% Python 32.73% HTML 0.05% Java 0.02% JavaScript 0.70%

openstack ceph operational-technology cloud cyberrange guacamole opensource range alternate-architecture

kinetic's People

Contributors

Stargazers

Watchers

Forkers

djivey ga-cyberworkforceacademy jasonadsit ppeereb1 jpward1981 bitskri3g gtherron gacwr srivignesh31 cvlabsio rohit-tambakhe reyesj2

kinetic's Issues

[FEATURE] Cloud-init supported Security Onion image

Is your feature request related to a problem? Please describe.
This image is needed to support the GTA training requirements.

Describe the solution you'd like
A cloud-init capable Security Onion image.

[BUG] orch.map is slow and unreliable

Describe the bug
orch.map doesn't work most of the time and does not take advantage of parallelism opportunities.

To Reproduce
run orch.map

Expected behavior
orch.map works and everything is brought up in parallel where applicable. This includes:

All minions going to the install state and then only running the configuration state in a dependent order
Built-in and configurable retry thresholds
Performance reporting after deployment is done

This will also include a change to the dependency model and service definitions (likely a 'needs' dictionary) like so:

  cache:
    count: 1
    ram: 8192000
    cpu: 2
    os: ubuntu2004
    disk: 512G
    needs:
      controller: configure
      some_other_thing: install
    networks:
      management:
        interfaces: [ens3]

That way, the orchestration runner can determine prerequisites on its own without needing centralized changes, which makes extending kinetic's functionality and resolving dependencies much easier.
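A minimal sketch of how a runner could evaluate a needs dictionary like the one above. It assumes the phases are ordered base, networking, install, configure (the order named in the next paragraph) and that each role's current phase is available as a plain dict; in kinetic that data would come from the mine, and the function names here are hypothetical.

```python
# Phases in deployment order (from this issue); reaching a later phase implies the earlier ones.
PHASES = ["base", "networking", "install", "configure"]


def phase_rank(phase):
    """Return the position of a phase in the deployment order."""
    return PHASES.index(phase)


def unmet_needs(needs, status):
    """Return the roles whose current phase is behind what `needs` requires.

    needs:  {"controller": "configure", ...}   from the service definition
    status: {"controller": "install", ...}     hypothetical: 'overall' phase per role
    A role that is missing from `status` is treated as not started.
    """
    unmet = {}
    for role, required in needs.items():
        current = status.get(role)
        if current is None or phase_rank(current) < phase_rank(required):
            unmet[role] = required
    return unmet


def ready(needs, status):
    """A service may proceed only when every dependency has reached its required phase."""
    return not unmet_needs(needs, status)


cache_needs = {"controller": "configure", "some_other_thing": "install"}
status = {"controller": "install", "some_other_thing": "configure"}
print(unmet_needs(cache_needs, status))  # {'controller': 'configure'}
```

Because each service only declares what it needs, adding a new service means adding one definition rather than editing a central ordering.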

Similar to the old-timey vta/saltstack, each host will have the last successfully completed phase of deployment tracked, except instead of preprov, prov, preprod and prod, it will be base, networking, install, and configure. These states will be tracked in the mine rather than as a grain so everyone can be aware of where everyone else is in their process. This will also likely be a dictionary, like so:

controller:
  overall: install
  individual:
    controller-foo: install
    controller-bar: configure

The overall key should report the status of the slowest group member, so services that have dependencies can get the required functionality regardless of who they request the service from.
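A small sketch of how the overall key could be derived from the individual entries, assuming the same phase ordering as above; "slowest" is taken to mean the member that is furthest from configure. The helper names are hypothetical.

```python
PHASES = ["base", "networking", "install", "configure"]


def overall_phase(individual):
    """Return the phase of the slowest member, i.e. the earliest phase in PHASES."""
    if not individual:
        return None  # no members have reported yet
    return min(individual.values(), key=PHASES.index)


def role_status(individual):
    """Build the per-role structure described above for one group of hosts."""
    return {"overall": overall_phase(individual), "individual": dict(individual)}


print(role_status({"controller-foo": "install", "controller-bar": "configure"}))
# {'overall': 'install', 'individual': {'controller-foo': 'install', 'controller-bar': 'configure'}}
```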

[FEATURE] Alternate Architecture support via pure emulation

Currently, integrating instances that use alternate architectures that differ from the host ISA means creating them on a compute node and then plugging them directly into the provider network.

While this 'works' in the broadest sense of the word, there is significant value in making this functionality fully integrated with nova, neutron, and glance. We got a grant that will fund this work, but this issue is just a tracker, since it is the 'killer feature' that necessitated the creation of kinetic vs. everything else anyway.

Getting this done (along with everything else in the 1.0 milestone) will trigger the release of v1 and the press releases and fanfare that we have staged for the GCC as a whole.

[FEATURE] Standardized shipped version of mariadb

The version of MariaDB should be the same regardless of OS.

By doing that, we can take advantage of things like Galera 4 in MariaDB 10.4: https://galeracluster.com/2019/02/galera-cluster-4-available-for-use-in-the-latest-mariadb-10-4-3-release-candidate/

We can then set pc.recovery to ON, so we don't need weird staggered reboots during provisioning to keep the cluster healthy.

[FEATURE] Windows Server 2016/2019 GUI

Is your feature request related to a problem? Please describe.
We are required to provide an environment to train on Windows Server 2016/2019 with a GUI front end. We need this capability to meet training objectives that directly involve performing server management tasks through the GUI.

Describe the solution you'd like
Images available for Windows Server 2016 and 2019 with the GUI front end ready to deploy on the range.

Describe alternatives you've considered
N/A

[FEATURE] Add initial integrated DANOS support

Is your feature request related to a problem? Please describe.
Currently, things like haproxy, share servers, and other opportunities for dynamic configuration (e.g. multi-tenancy at the gateway) are missed because there is no integration with any gateway devices. To start, DANOS (https://www.danosproject.org/) support should be added.

Describe the solution you'd like
haproxy, nfs-ganesha, and other relevant services should be able to automatically change NAT and filter settings at the gateway depending on cluster events. For example, if the haproxy address changes (or more endpoints are added), DNAT should be updated automatically and transparently.
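One way to frame "automatically and transparently" is as a reconcile step: compare the DNAT rules the gateway has now against the rules the cluster wants, and push only the difference. The sketch below is hypothetical throughout; it does not use any DANOS API, and the DnatRule shape is an assumption for illustration.

```python
from typing import NamedTuple, Set, Tuple


class DnatRule(NamedTuple):
    """Hypothetical representation of one destination-NAT rule at the gateway."""
    public_port: int
    internal_address: str
    internal_port: int


def reconcile(current: Set[DnatRule], desired: Set[DnatRule]) -> Tuple[Set[DnatRule], Set[DnatRule]]:
    """Return (rules_to_add, rules_to_remove) so the gateway ends up matching `desired`."""
    return desired - current, current - desired


current = {DnatRule(443, "10.0.0.10", 443)}
desired = {DnatRule(443, "10.0.0.20", 443)}  # the haproxy address changed
to_add, to_remove = reconcile(current, desired)
print(sorted(to_add), sorted(to_remove))
```

Working from a diff keeps unrelated gateway rules untouched and makes repeated runs a no-op once the gateway matches.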

Additional context

Initial support should be added for:

haproxy address updates
nfs-ganesha address updates

[FEATURE] Networking Re-Write

Is your feature request related to a problem? Please describe.
As written, the networking state at formulas/common/networking is awful. It's impossible to read, covers a small number of use cases, and doesn't function with haproxy networking requirements. Similarly, the host listing configuration in the pillar also needs to be re-imagined.

Describe the solution you'd like
The networking state should:

Function regardless of whether or not the host is using netplan, ifupdown, etc.
Support VLANs and bonding
Support 'oddball' network configurations such as haproxy out of the box

Additional context
This will almost certainly break all existing installations. Tag master and make a new branch before you start working on this and merge only when completed.

Upstream issue with mine.get in orchestrate runner

saltstack/salt#48020

Currently, the workaround described in the issue is implemented manually. This needs to be updated as soon as packages with the fix are published.

[FEATURE] Console expiry

Currently, horizon sessions last no more than an hour due to:

https://docs.openstack.org/keystone/ussuri/configuration/samples/keystone-conf.html

expiration = 3600

https://docs.openstack.org/horizon/latest/configuration/settings.html

SESSION_TIMEOUT = 3600

https://docs.openstack.org/nova/ussuri/configuration/config.html#consoleauth

token_ttl = 600

note: https://bugzilla.redhat.com/show_bug.cgi?id=1500136
Setting the token_ttl only affects the establishment of new sessions (or re-establishing sessions that got killed due to aggressive connection scavenging and whatnot).

These values should become tunables, and should probably also have increased defaults.
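A sketch of what the tunables could look like, with defaults that mirror the current hardcoded values. The class and method names are hypothetical. The validation rule is an assumption based on Horizon's SESSION_TIMEOUT being intended as a shorter override of the Keystone token lifetime, so a longer Horizon session would not help once the token expires.

```python
from dataclasses import dataclass


@dataclass
class ConsoleTunables:
    """Hypothetical pillar-driven tunables; defaults mirror the current values."""
    keystone_token_expiration: int = 3600  # keystone.conf [token] expiration
    horizon_session_timeout: int = 3600    # Horizon SESSION_TIMEOUT
    nova_console_token_ttl: int = 600      # nova.conf [consoleauth] token_ttl

    def validate(self):
        # Assumption: keep the Horizon session at or below the Keystone token lifetime.
        if self.horizon_session_timeout > self.keystone_token_expiration:
            raise ValueError("SESSION_TIMEOUT should not exceed the keystone token expiration")
        return self

    def settings(self):
        """Map each value to the (file, section, option) it would be rendered into."""
        return {
            ("keystone.conf", "token", "expiration"): self.keystone_token_expiration,
            ("local_settings.py", None, "SESSION_TIMEOUT"): self.horizon_session_timeout,
            ("nova.conf", "consoleauth", "token_ttl"): self.nova_console_token_ttl,
        }


# Example: raise both session-related values to eight hours.
tunables = ConsoleTunables(keystone_token_expiration=28800, horizon_session_timeout=28800).validate()
```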

[BUG] neutron endpoint deps doesn't differentiate between linuxbridge or ovn backend

Currently, you must specify your network backend AND change your neutron endpoint dependency when you move between the two, e.g.

needs:
  configure:
    networking: configure

needs:
  configure:
    ovsdb: configure

One of these must be set in the neutron deps, but not both. This is silly; the system should pick the correct dependency based on the backend.
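A minimal sketch of picking the dependency from the backend instead of hand-editing it. The mapping assumes the first example above belongs to linuxbridge and the second to ovn; BACKEND_DEPS and neutron_needs are hypothetical names.

```python
# Hypothetical mapping from network backend to what neutron needs before configure.
# Assumption: networking -> linuxbridge, ovsdb -> ovn (the two examples in this issue).
BACKEND_DEPS = {
    "linuxbridge": {"networking": "configure"},
    "ovn": {"ovsdb": "configure"},
}


def neutron_needs(backend):
    """Build neutron's needs block from the selected network backend."""
    try:
        deps = BACKEND_DEPS[backend]
    except KeyError:
        raise ValueError(f"unsupported network backend: {backend!r}") from None
    return {"configure": dict(deps)}


print(neutron_needs("ovn"))  # {'configure': {'ovsdb': 'configure'}}
```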

[FEATURE] Code Re-Use

There is a significant amount of duplicated code/workflows (particularly inside of the openstack services, which have very similar design patterns). These should be pulled out into generic states using macros and inheritance: https://docs.saltstack.com/en/latest/topics/jinja/index.html#macros

Zun-ui patches need to be removed once they are merged in upstream repos

/formulas/horizon/files/client.py is a patch for https://bugs.launchpad.net/zun-ui/+bug/1797285 which is applied in /formulas/horizon/install-zun-ui.sls.

/formulas/horizon/files/websocketclient.py is a patch for https://bugs.launchpad.net/zun/+bug/1762511 which is applied in /formulas/horizon/install-zun-ui.sls.

/formulas/horizon/files/requirements.txt is a patch that sets an upper constraint on which python-zunclient is installed on zun-ui for communication with zun-api. If I don't set this, it will install 3.2.1, which is for the Stein release and causes issues.

[BUG] apt-cacher-ng holds broken packages

See: https://askubuntu.com/questions/119298/apt-get-using-apt-cacher-ng-fails-to-fetch-packages-with-hash-sum-mismatch

acng runs a daily job that cleans this up, but there should also be an onfail condition that wipes broken packages from the cache, to avoid issues like this during orch runs.
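A rough sketch of the cleanup an onfail handler could run: hash a cached package and delete it if it doesn't match the expected SHA256, so the next fetch pulls a fresh copy. It assumes the expected hash is already known (for example from the repository's Packages index) and that the cached file's path is known; purge_if_corrupt is a hypothetical helper, not part of apt-cacher-ng.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large packages don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def purge_if_corrupt(path: Path, expected_sha256: str) -> bool:
    """Delete a cached package whose hash doesn't match; return True if it was removed."""
    if path.exists() and sha256_of(path) != expected_sha256.lower():
        path.unlink()
        return True
    return False
```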

[BUG] Interface assignment is not intuitive

Describe the bug
Currently, interfaces inside of kvm-based endpoints get assigned like this:

        {% for network in pillar['virtual'][type]['networks']|sort() %}
          <interface type='bridge'>
            <source bridge='{{ network }}_br'/>
            <target dev='vnet{{ loop.index0 }}'/>
            <model type='virtio'/>
            <alias name='net{{ loop.index0 }}'/>
            <mac address='{{ salt['generate.mac']('52:54:00') }}'/>
          </interface>
        {% endfor %}

This creates predictable interface assignments, but only if the user configures their pillar alphabetically, which is not ideal. The above function should be changed to match physical interface bridge names to the network that each interface is supposed to be attached to, regardless of the actual order in the pillar.
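The service definitions in this repo already carry an interfaces list per network (for example networks: management: interfaces: [ens3]), so one option is to order the generated NICs by those declared interface names instead of by network name. A sketch, assuming guest interfaces are enumerated in the same order as the NICs are defined (ens3, ens4, ...); interface_plan and natural_key are hypothetical helpers.

```python
import re


def natural_key(name):
    """Sort 'ens10' after 'ens9' by comparing digit runs as integers."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


def interface_plan(networks):
    """Return (interface, network, bridge) tuples ordered by each network's declared interface.

    networks: pillar-style dict, e.g. {"management": {"interfaces": ["ens3"]}, ...}
    The position in the returned list would become loop.index0 in the libvirt template.
    """
    plan = []
    for network, settings in networks.items():
        for interface in settings.get("interfaces", []):
            plan.append((interface, network, f"{network}_br"))
    # Sort by interface name, not network name, so the pillar decides the mapping.
    plan.sort(key=lambda item: natural_key(item[0]))
    return plan


# Alphabetical sorting would put 'management' first; the pillar says storage is ens3.
print(interface_plan({"storage": {"interfaces": ["ens3"]}, "management": {"interfaces": ["ens4"]}}))
```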

[FEATURE] Add support for letsencrypt test certificates

Currently the haproxy endpoint(s) assume that you will always be running in production and thus always request certificates from the production letsencrypt server. This leads to situations where heavy development will trigger issuance limits (https://letsencrypt.org/docs/rate-limits/).

The system should support a 'dev' flag that will do two things:

Only request certificates from the LE staging server
Import the happy hacker CA into the root store on all endpoints.

Note that the happy hacker CA private key is intentionally public, so any deployment created in dev mode should never be trusted and should always be re-spun once development is complete. The master branch and formal release tags should never have dev enabled.
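A sketch of what the dev flag could switch. The two URLs are Let's Encrypt's ACME v2 production and staging directory endpoints; the acme_settings function and its keys are hypothetical.

```python
# Let's Encrypt ACME v2 directory endpoints.
LE_PRODUCTION = "https://acme-v02.api.letsencrypt.org/directory"
LE_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"


def acme_settings(dev: bool) -> dict:
    """Return hypothetical certificate settings for a deployment.

    In dev mode, certificates come from the staging server and the staging
    ("happy hacker") root must be trusted on every endpoint; production never trusts it.
    """
    return {
        "acme_directory": LE_STAGING if dev else LE_PRODUCTION,
        "trust_staging_root": dev,
    }
```

Keeping both effects behind one flag means a deployment can't end up trusting the staging root while requesting production certificates, or the reverse.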

[FEATURE] True Multi-Master

Currently, we have to track a differentiator across most non-physical endpoints to do things like service creation since these activities are not idempotent in many cases. We solve this by having the arbitrary endpoint that is identified as 'spawn 0' perform these actions, and then notify the balance of endpoints of that type that it is safe to proceed.

While there isn't anything necessarily wrong with this, it means that we have to track an extra piece of state data that also makes creating additional individual nodes more difficult since they aren't necessarily aware of the spawning of all the other endpoints.

I don't have any great ideas on how to solve this currently, but the problem needs to be acknowledged.

[FEATURE] cloud-init supported Kali image

Is your feature request related to a problem? Please describe.
A cloud-init capable Kali image is needed.

Describe the solution you'd like
This image is needed to support the GTA training requirements.

[BUG] multiple nfs-ganesha servers

Multiple nfs-ganesha servers are currently not possible when using UCA, because that requires nfs-ganesha 2.7+.

This should be backported from somewhere.

Add centos8 support

Add support for centos8 in all functional areas.

base64 encoded passwords breaks nova-manage

If a base64-encoded password in the dynamic_pillar contains a '+' symbol, it will break nova-manage.

2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping [req-7d8197a5-f222-40c4-b230-1ff93f127894 - - - - -] Failed to parse [database]/connection to format cell mapping: ValueError: invalid literal for int() with base 10: 'SYYkKq4V7JhtqFDN+mrXfvhIBn1+'
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping Traceback (most recent call last):
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping   File "/usr/lib/python2.7/dist-packages/nova/objects/cell_mapping.py", line 137, in _format_db_url
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping     return CellMapping._format_url(url, CONF.database.connection)
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping   File "/usr/lib/python2.7/dist-packages/nova/objects/cell_mapping.py", line 107, in _format_url
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping     'port': default_url.port,
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping   File "/usr/lib/python2.7/urlparse.py", line 113, in port
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping     port = int(port, 10)
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping ValueError: invalid literal for int() with base 10: 'SYYkKq4V7JhtqFDN+mrXfvhIBn1+'
2018-12-31 15:49:35.595 16085 ERROR nova.objects.cell_mapping

The suggested fix is to have openssl keep generating passwords until it produces one without a '+', in all cases. There haven't been any issues with '=' yet.
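A sketch of the suggested regenerate-until-clean approach, in Python rather than openssl. One observation on the traceback: Python's URL parser reports the password as the port when a '/' in the password ends the host portion of the URL early, so the character right after the truncated '+' may have been a '/'. That is an inference from the log, so the sketch rejects '/' and '=' as well as '+'. The database URL below is only an example.

```python
import base64
import secrets
from urllib.parse import urlsplit

# Standard-base64 characters that are significant inside a database URL.
UNSAFE = set("+/=")


def generate_password(nbytes: int = 18) -> str:
    """Regenerate a base64 password until it contains none of UNSAFE.

    18 random bytes encode to 24 characters with no '=' padding.
    """
    while True:
        candidate = base64.b64encode(secrets.token_bytes(nbytes)).decode("ascii")
        if not UNSAFE & set(candidate):
            return candidate


password = generate_password()
url = f"mysql+pymysql://nova:{password}@db.example/nova"
assert urlsplit(url).port is None  # nothing in the password is misread as a port
```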

Add Ussuri support

Add Ussuri support in CentOS8 and Ubuntu 20.04.

[FEATURE] Orchestration does not retry hung physical endpoints

Currently, the orchestration system does not proceed from step to step until all endpoints that it expects complete the requisite task. So, when transient errors break a bare-metal installation (networking, installer issues, etc.), all endpoints will fail rather than just that one. Manually running an orch.generate fixes it, but the user shouldn't have to do this in an ideal world, particularly since running an out-of-band orch.generate won't make orch.init report success, even though the deployment will still technically be fully functional.

Things that have to be done:

Identify endpoint(s) that failed after a certain timeout - this will get tricky during installation routines since there are only a handful of places where affirmation checks can be done, but it should be doable.
Re-try the failed routine on these endpoints without halting the progress of others of the same type
Ensure that orch.init doesn't exit nonzero if the retry(ies) are successful

There are things like onfail, etc present in the state system, but I'm not sure if these are granular enough to handle retries in this manner.
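Independent of whether the state system can express it, the retry behaviour described above can be sketched as: run each endpoint's task in parallel, retry a failing endpoint on its own up to a threshold, and only treat the run as failed if some endpoint is still failing afterwards. This is plain Python, not a Salt runner; task is assumed to be any callable that raises on failure and enforces its own timeout.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(endpoints, task, max_attempts=3):
    """Run task(endpoint) for every endpoint in parallel, retrying failures individually.

    Returns {endpoint: True/False}. An endpoint only counts as failed after
    max_attempts unsuccessful tries, and its retries never block other endpoints.
    """
    def attempt(endpoint):
        for _ in range(max_attempts):
            try:
                task(endpoint)
                return True
            except Exception:
                continue  # treat as transient and try this endpoint again
        return False

    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        return dict(zip(endpoints, pool.map(attempt, endpoints)))
```

The caller would then exit with 0 if all(results.values()) else 1, so a run whose retries succeed still reports success.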

[FEATURE] Image-bakery integration

Integrate image-bakery into the control plane (for the purpose of creating images to run endpoints) and remove the CLI hacks.

[FEATURE] Add Focal support

Add support for Ubuntu 20.04 Focal, to include subiquity support.

[BUG] Expired repo key for kata on ubuntu

Ref: kata-containers/kata-containers#545

I disabled the gpg check on the kata repo to continue testing, but i-am-speed should not be merged before the gpg check is re-enabled.

[BUG] Nested Stack Issue

Describe the bug
Using nested stacks along with the get_file function will cause a stack creation error.

To Reproduce
Steps to reproduce the behavior:

Template: https://raw.githubusercontent.com/sarahdaniellerees/better-openstack/master/heat/Base_templates/scaled_training_env1.yaml

Environment file: https://raw.githubusercontent.com/sarahdaniellerees/better-openstack/master/heat/environments/Debian_env.yaml

Expected behavior

Stack should create successfully.

Client Configuration (please complete the following information):
Horizon Dashboard

[FEATURE] Add functional and integration testing to orchestration methods

Is your feature request related to a problem? Please describe.
Currently, the only 'testing' done during an orchestration run is built into the states and modules used (e.g., if a service won't start, the orch will fail, which means that particular function won't work). This is OK for most very basic scenarios, but doesn't capture poorly performing endpoints, service flapping, or other non-obvious failures.

Describe the solution you'd like
After each service orchestration runner completes, it should run a test battery (rally, etc. depending on the service) that checks various pieces of functionality. This test suite should be configurable with a baseline defined in the file pillar with extensions available to be defined in the gitfs pillar.
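A sketch of combining the baseline suite from the file pillar with site extensions from the gitfs pillar. The per-service list-of-test-names structure and the test names are assumptions for illustration; extensions append to the baseline and duplicates are dropped while keeping order.

```python
def merge_test_suites(baseline, extensions):
    """Combine baseline tests with site-specific extensions, per service.

    Both arguments are hypothetical pillar data shaped like
    {"keystone": ["token-issue", ...], ...}.
    """
    merged = {}
    services = list(baseline) + [s for s in extensions if s not in baseline]
    for service in services:
        tests = baseline.get(service, []) + extensions.get(service, [])
        merged[service] = list(dict.fromkeys(tests))  # de-duplicate, keep order
    return merged


baseline = {"keystone": ["token-issue"], "nova": ["boot-server"]}
site = {"nova": ["boot-server", "live-migrate"], "zun": ["run-container"]}
print(merge_test_suites(baseline, site))
```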

[FEATURE] Minimal Implementation Walk Through

@bitskri3g I like the goal of Kinetic which is to have a general baseline and a site specific overlay/pillar. I will be glad to help build a blog/tutorial/etc on a Minimal Implementation that can be posted as part of the readme or a link to include in the readme. I think that will help provide a quick start guide for easy adoption. Let me know your thoughts and how I can help.

Upstream issue with mine.send aliases

saltstack/salt#34581

Currently, aliases cannot be assigned to mine.send functions inside the reactor system. As a result, only a single file.read module can be handled at a time. Development can continue, but this will need to be fixed later; otherwise only one host can go through the pipeline at a time.

[BUG] frr+junos unnumbered bgp

Describe the bug
Can't get

Model: qfx5200-48y
Junos: 20.1R1.11

to peer with

FRRouting 7.3.1 on Linux compute-41b1734e-6df2-5761-8573-4bf5e9fcd75b 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

compute 1

# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log
# in /var/log/frr/frr.log
log file /var/log/frr/bgp.log debugging

!
service integrated-vtysh-config
!
int lo
 ip address 10.3.0.1/32
!
interface enp113s0f1
 ipv6 nd ra-interval 5
 no ipv6 nd suppress-ra
!
router bgp 65301
 bgp router-id 10.3.0.1
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric description Internal Fabric Network
 neighbor fabric capability extended-nexthop
 neighbor enp113s0f1 interface peer-group fabric
 !
 address-family ipv4 unicast
  network 10.3.0.1/32
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor fabric activate
  neighbor fabric prefix-list host-routes-out out
 exit-address-family
!
ip prefix-list host-routes-out seq 100 permit 10.3.0.1/32
!
line vty
!
end

compute 2

# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log
# in /var/log/frr/frr.log
log file /var/log/frr/bgp.log debugging

!
service integrated-vtysh-config
!
int lo
 ip address 10.4.0.1/32
!
interface enp113s0f1
 ipv6 nd ra-interval 5
 no ipv6 nd suppress-ra
!
router bgp 65302
 bgp router-id 10.4.0.1
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric description Internal Fabric Network
 neighbor fabric capability extended-nexthop
 neighbor enp113s0f1 interface peer-group fabric
 !
 address-family ipv4 unicast
  network 10.4.0.1/32
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor fabric activate
  neighbor fabric prefix-list host-routes-out out
 exit-address-family
!
ip prefix-list host-routes-out seq 100 permit 10.4.0.1/32
!
line vty
!
end

qfx

interfaces {
    et-0/0/24 {
        unit 0 {
            family inet {
                unnumbered-address lo0.0;
            }
            family inet6;
        }
    }
    et-0/0/25 {
        unit 0 {
            family inet {
                unnumbered-address lo0.0;
            }
            family inet6;
        }
    }
    lo0 {
        unit 0 {
            family inet {
                address 10.5.0.1/32;
            }
            family inet6;
        }
    }
}
routing-options {
    static {
        route 0.0.0.0/0 next-hop 10.100.3.254;
    }
    router-id 10.5.0.1;
    autonomous-system 65303;
}
protocols {
    router-advertisement {
        interface et-0/0/24.0;
        interface et-0/0/25.0;
    }
    bgp {
        group frr {
            type external;
            traceoptions {
                file bgp-frr.log;
                flag all;
            }
            advertise-peer-as;
            family inet {
                unicast {
                    extended-nexthop;
                }
            }
            family inet6 {
                unicast;
            }
            local-as 65303;
            multipath {
                multiple-as;
            }
        }
        traceoptions {
            file bgp.log;
            flag all;
        }
    }
}

qfx bgp log

Nov  3 21:02:34.678657 BGP SEND fe80::b233:a6ff:fe70:4913+179 -> fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:34.678661 BGP SEND message type 1 (Open) length 29
Nov  3 21:02:34.678665 BGP SEND version 4 as 65303 holdtime 90 id 10.5.0.1 parmlen 0
Nov  3 21:02:34.678670
Nov  3 21:02:34.678670 BGP SEND fe80::b233:a6ff:fe70:4913+179 -> fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:34.678673 BGP SEND message type 3 (Notification) length 21
Nov  3 21:02:34.678678 BGP SEND Notification code 6 (Cease) subcode 5 (Connection Rejected)
Nov  3 21:02:34.678687 bgp_listen_accept:5610: NOTIFICATION sent to fe80::225:90ff:fe5f:603d+53674 (proto): code 6 (Cease) subcode 5 (Connection Rejected), Reason: Connection attempt from unconfigured neighbor: fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:34.678693 Notify sent to fe80::225:90ff:fe5f:603d+53674 (proto), code 6, subcode 5
Nov  3 21:02:34.678741 task_delete: deleting task BGP_Proto.fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:34.678747 task_close: close socket 88 task BGP_Proto.fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:34.678750 task_reset_socket: task BGP_Proto.fe80::225:90ff:fe5f:603d+53674 socket 88
Nov  3 21:02:34.678804 task_job_delete_task: deleting all jobs for task BGP_Proto.fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:34.678811 task_job_deleted_task: no jobs found for task BGP_Proto.fe80::225:90ff:fe5f:603d+53674
Nov  3 21:02:46.886312 bgp_reuse_scan: Starting scan
Nov  3 21:03:06.745650 bgp_reuse_scan: Starting scan
Nov  3 21:03:26.598241 bgp_reuse_scan: Starting scan
Nov  3 21:03:46.577803 bgp_reuse_scan: Starting scan
Nov  3 21:04:06.578322 bgp_reuse_scan: Starting scan
Nov  3 21:04:26.416272 bgp_reuse_scan: Starting scan
Nov  3 21:04:34.182177 task_process_events_internal: accept ready for BGP_Listen.::+179
Nov  3 21:04:34.182201 task_accept: task BGP_Listen.::+179 socket 86 addr ::+179
Nov  3 21:04:34.182235 bgp_listen_accept: Connection attempt from unconfigured neighbor: fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182243 task_alloc: allocated task block for BGP_Proto priority 50
Nov  3 21:04:34.182253 bgp_listen_accept: Connection with incoming ifl 0xc6df500 instance 0xbdf2000(master)
Nov  3 21:04:34.182280 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324 socket 88 option TOS(16) value 192
Nov  3 21:04:34.182290 bgp_listen_accept: accepting connection from fe80::225:90ff:fe5f:5f21+41324 (local fe80::b233:a6ff:fe70:4912+179)
Nov  3 21:04:34.182298 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324 socket 88 option NonBlocking(8) value 1
Nov  3 21:04:34.182305 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324 socket 88 option RecvBuffer(0) value 16384
Nov  3 21:04:34.182311 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324 socket 88 option SendBuffer(1) value 16384
Nov  3 21:04:34.182318 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324 socket 88 option Linger(2) value { 0, 0 }
Nov  3 21:04:34.182334
Nov  3 21:04:34.182334 BGP SEND fe80::b233:a6ff:fe70:4912+179 -> fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182338 BGP SEND message type 1 (Open) length 29
Nov  3 21:04:34.182342 BGP SEND version 4 as 65303 holdtime 90 id 10.5.0.1 parmlen 0
Nov  3 21:04:34.182347
Nov  3 21:04:34.182347 BGP SEND fe80::b233:a6ff:fe70:4912+179 -> fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182350 BGP SEND message type 3 (Notification) length 21
Nov  3 21:04:34.182355 BGP SEND Notification code 6 (Cease) subcode 5 (Connection Rejected)
Nov  3 21:04:34.182366 bgp_listen_accept:5610: NOTIFICATION sent to fe80::225:90ff:fe5f:5f21+41324 (proto): code 6 (Cease) subcode 5 (Connection Rejected), Reason: Connection attempt from unconfigured neighbor: fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182370 Notify sent to fe80::225:90ff:fe5f:5f21+41324 (proto), code 6, subcode 5
Nov  3 21:04:34.182412 task_delete: deleting task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182417 task_close: close socket 88 task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182421 task_reset_socket: task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324 socket 88
Nov  3 21:04:34.182484 task_job_delete_task: deleting all jobs for task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.182519 task_job_deleted_task: no jobs found for task BGP_Proto.fe80::225:90ff:fe5f:5f21+41324
Nov  3 21:04:34.684921 task_process_events_internal: accept ready for BGP_Listen.::+179
Nov  3 21:04:34.684954 task_accept: task BGP_Listen.::+179 socket 86 addr ::+179
Nov  3 21:04:34.684989 bgp_listen_accept: Connection attempt from unconfigured neighbor: fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.684998 task_alloc: allocated task block for BGP_Proto priority 50
Nov  3 21:04:34.685002 bgp_listen_accept: Connection with incoming ifl 0xc6df680 instance 0xbdf2000(master)
Nov  3 21:04:34.685018 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:603d+53734 socket 88 option TOS(16) value 192
Nov  3 21:04:34.685024 bgp_listen_accept: accepting connection from fe80::225:90ff:fe5f:603d+53734 (local fe80::b233:a6ff:fe70:4913+179)
Nov  3 21:04:34.685032 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:603d+53734 socket 88 option NonBlocking(8) value 1
Nov  3 21:04:34.685041 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:603d+53734 socket 88 option RecvBuffer(0) value 16384
Nov  3 21:04:34.685046 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:603d+53734 socket 88 option SendBuffer(1) value 16384
Nov  3 21:04:34.685052 task_set_option_internal: task BGP_Proto.fe80::225:90ff:fe5f:603d+53734 socket 88 option Linger(2) value { 0, 0 }
Nov  3 21:04:34.685066
Nov  3 21:04:34.685066 BGP SEND fe80::b233:a6ff:fe70:4913+179 -> fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.685070 BGP SEND message type 1 (Open) length 29
Nov  3 21:04:34.685077 BGP SEND version 4 as 65303 holdtime 90 id 10.5.0.1 parmlen 0
Nov  3 21:04:34.685083
Nov  3 21:04:34.685083 BGP SEND fe80::b233:a6ff:fe70:4913+179 -> fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.685086 BGP SEND message type 3 (Notification) length 21
Nov  3 21:04:34.685091 BGP SEND Notification code 6 (Cease) subcode 5 (Connection Rejected)
Nov  3 21:04:34.685100 bgp_listen_accept:5610: NOTIFICATION sent to fe80::225:90ff:fe5f:603d+53734 (proto): code 6 (Cease) subcode 5 (Connection Rejected), Reason: Connection attempt from unconfigured neighbor: fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.685104 Notify sent to fe80::225:90ff:fe5f:603d+53734 (proto), code 6, subcode 5
Nov  3 21:04:34.685146 task_delete: deleting task BGP_Proto.fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.685154 task_close: close socket 88 task BGP_Proto.fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.685158 task_reset_socket: task BGP_Proto.fe80::225:90ff:fe5f:603d+53734 socket 88
Nov  3 21:04:34.685222 task_job_delete_task: deleting all jobs for task BGP_Proto.fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:34.685230 task_job_deleted_task: no jobs found for task BGP_Proto.fe80::225:90ff:fe5f:603d+53734
Nov  3 21:04:46.268312 bgp_reuse_scan: Starting scan

@djivey if you've got some pro services hours, see if they can take a look at this. I noticed that the qfx5k only started supporting unnumbered eBGP in 20.1, but this switch is on 20.1R1.11, so I think I'm good. Not sure what I am missing here.

If you want to reproduce this, take the latest from the frr branch and highstate computes. Change /etc/systemd/network/private.network to

[Match]
Name={{ your private interface }}
[Network]
DHCP=no

You should be able to ping around to the link-local IPv6 addresses once you get everything going. The challenge is getting IPv4 routes transported over IPv6 link-local next hops in accordance with https://tools.ietf.org/html/rfc5549, which I just can't seem to figure out. Let me know if you need more info...

Add octopus support

Add octopus support in CentOS8 and Ubuntu 20.04.

[BUG] salt 3000 ships with broken mysql module

Describe the bug

ref: saltstack/salt#56124

There are argument issues when using unix sockets on salt 3000.

OpenStack Service creation where #of endpoints > 1

Currently, openstack services are added to the service catalog with a simple screen-scrape check to see if they already exist. This works fine when there is only a single service endpoint being provisioned, but when you have >1 there is an extremely high likelihood that several of them will see an empty service catalog for the service they're responsible for, and then create multiple service entries for the same thing.

Unsure of the best way to address this.

[BUG] Current fernet deployment method requires huge universal import

Describe the bug
cryptography.fernet is required to generate fernet keys on keystone, but the current salt custom module method checks all minions for the presence of the library, which is not ideal.
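For reference, a Fernet key is just 32 random bytes encoded with URL-safe base64, which is what cryptography's Fernet.generate_key() returns. A sketch of generating one with the standard library only, so no minion needs the cryptography package just to produce a key; how it would be wired into the custom module is left open.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return a Fernet key: 32 random bytes, URL-safe base64 encoded (44 characters)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")
```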

Ceph Pool Creation

Currently, there is no good place to create the three required pools:

ceph osd pool create vms
ceph osd pool create images
ceph osd pool create volumes

and then run rbd pool init {{ pool }} on each as a follow-on.

You can't do it during the cephmon phase because there aren't enough OSDs to justify the quantity of PGs. The requiring services (glance, etc.) don't have access to client.admin, and I'd rather not give it to them.

Needs a solution; for now pools are created manually as a final step prior to deploying openstack.

Windows Instances BSOD due to unhandled MSRs

Ref: https://gist.github.com/jorritfolmer/d01194a00f440ad257bd56d51baddc2d

Modern Windows guests (10/2016+) access MSRs that KVM doesn't handle correctly, which causes the guest to BSOD. KVM should be configured to ignore these MSRs for the time being.

[BUG] Large number of flavor definitions causes nova bootstrap timeout

When a large number of flavors is defined in the pillar, nova may not be able to create all flavors in time prior to the orchestration timeout.

Flavors are created with two API calls - first a list of existing flavors is pulled, and then a flavor is created if the flavor name is not in the list.

Speedup ideas:
Write initial flavors directly to the database
Pull the list of flavors down one time and save it to a file (this would only be a ~50% speedup)
Push flavor creation to a post install user task
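A sketch of the second idea: fetch the existing flavor names once and compute which pillar flavors still need to be created, instead of listing before every create. The pillar shape is an assumption; with openstacksdk, the single list call could be conn.compute.flavors() followed by conn.compute.create_flavor(...) for each missing entry.

```python
def flavors_to_create(existing_names, desired):
    """Return only the pillar flavors that don't exist yet.

    existing_names: names fetched once (a single flavor list call).
    desired: hypothetical pillar data, {"m1.small": {"ram": 2048, "vcpus": 1, "disk": 20}, ...}
    """
    existing = set(existing_names)
    return {name: spec for name, spec in desired.items() if name not in existing}


pillar_flavors = {
    "m1.small": {"ram": 2048, "vcpus": 1, "disk": 20},
    "m1.large": {"ram": 8192, "vcpus": 4, "disk": 80},
}
print(flavors_to_create(["m1.small"], pillar_flavors))
```

This replaces one list call per flavor with a single list call for the whole run, which is where the estimated ~50% saving comes from.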

nvme device naming

NVMe devices appear to be (significantly) more susceptible to name changes (/dev/nvmeX) across reboots than traditional drives. The current disk assignment model doesn't account for this very well, and controllers with more than one NVMe device don't reliably work when explicit device names are used in the pillar. Need to either:

Make device names persistent
Re-write the states so that it consumes an arbitrary quantity of available disks, and not necessarily named ones.
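For the first option, udev already maintains stable symlinks under /dev/disk/by-id (built from model and serial), so the pillar could list those names and resolve them to the current kernel device at run time. A sketch, assuming the pillar would store by-id names; resolve_persistent is a hypothetical helper.

```python
import os
from pathlib import Path


def resolve_persistent(names, by_id_dir="/dev/disk/by-id"):
    """Map persistent device names to the kernel paths they currently point at.

    The by-id symlinks stay stable across reboots even when /dev/nvmeXn1 numbering changes.
    """
    base = Path(by_id_dir)
    resolved = {}
    for name in names:
        link = base / name
        if not link.exists():  # exists() follows the symlink
            raise FileNotFoundError(f"no persistent device named {name!r} in {by_id_dir}")
        resolved[name] = os.path.realpath(link)
    return resolved
```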

[BUG] Etcd quorum issue after orchestration

Describe the bug
The etcd service on spawning 0 is not running after a full orchestration. This will likely need a mechanism to reboot all members simultaneously after initial quorum is achieved (similar to how galera is handled).

[BUG] Windows Server 2019 no boot/error

Describe the bug
A Windows Server 2019 instance halts at the boot manager with a hardware error.

To Reproduce
Steps to reproduce the behavior:

Create a Windows Server 2019 instance. The error occurs on boot.

Expected behavior
Windows Server 2019 should boot fully and allow access to the OS.

Client Configuration (please complete the following information):

OS: Windows Server 2019
Interface: Horizon
- Chrome Version 76.0.3809.100 (Official Build) (64-bit)

Additional context
None

[FEATURE] Update to Train and remove python2

Update all OpenStack packages to Train and strip out all python2 code.

[FEATURE] Service creation scripts should not always try to run

All OpenStack services have a mkservice.sh script that configures the service and its endpoints within keystone. This script is always executed by the endpoint that is spawning zero. Currently, the script always shows as something that will run when salt is run in test mode, because it cannot know that the service already exists until it executes and performs its internal checks.

This should be dropped in favor of keystoneng, but should probably wait until saltstack/salt#58032 gets resolved.
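Until keystoneng can be used, one stopgap is to decide up front, from the current catalog, whether anything is actually missing, and only run the script (for example behind a state-level `unless`/`onlyif` check) when it is. A minimal sketch of that comparison; the input shapes (service names, and `(service, interface)` endpoint pairs as they might be read from `openstack service list` / `openstack endpoint list`) and the assumption that every service gets all three interfaces are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Endpoint = Tuple[str, str]  # (service name, interface), e.g. ("nova", "public")
INTERFACES = ("public", "internal", "admin")


def missing_catalog_entries(
    service: str,
    existing_services: Iterable[str],
    existing_endpoints: Iterable[Endpoint],
) -> List[str]:
    """Describe what mkservice.sh would still need to create for `service`.

    An empty list means the catalog is already complete, so the script
    can be skipped (and test mode has nothing to report).
    """
    missing = []
    if service not in set(existing_services):
        missing.append(f"service:{service}")
    endpoints: Set[Endpoint] = set(existing_endpoints)
    for interface in INTERFACES:
        if (service, interface) not in endpoints:
            missing.append(f"endpoint:{service}/{interface}")
    return missing


if __name__ == "__main__":
    print(missing_catalog_entries("glance", ["glance"], [("glance", "public")]))
```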

[BUG] Older firmware does not support all Redfish v1.0.1 API options

Describe the bug
My specific case is the SuperMicro X10 motherboard, whose Redfish firmware does not support the "BootSourceOverrideMode" option for selecting UEFI; SuperMicro will not be adding it because UEFI network boot is not supported on that motherboard. ipmitool and pyghmi can still set this option.
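A possible fallback, sketched under assumptions: the boot override is set by PATCHing the `Boot` object of a `/redfish/v1/Systems/<id>` resource, so the caller can first fetch that resource and include `BootSourceOverrideMode` only when the firmware reports the property, otherwise signalling that ipmitool/pyghmi should be used. The function name and the `None` fallback convention are hypothetical; the HTTP calls are left out.

```python
from typing import Optional


def boot_override_patch(
    system: dict, target: str = "Pxe", want_uefi: bool = True
) -> Optional[dict]:
    """Build a PATCH body for a Redfish ComputerSystem's Boot object.

    `system` is the already-fetched ComputerSystem resource as a dict.
    Returns None when UEFI was requested but the firmware does not expose
    BootSourceOverrideMode, so the caller can fall back to ipmitool/pyghmi.
    """
    boot = system.get("Boot", {})
    body = {
        "Boot": {
            "BootSourceOverrideEnabled": "Once",
            "BootSourceOverrideTarget": target,
        }
    }
    if want_uefi:
        if "BootSourceOverrideMode" not in boot:
            return None  # e.g. older X10 firmware: use IPMI instead
        body["Boot"]["BootSourceOverrideMode"] = "UEFI"
    return body


if __name__ == "__main__":
    old_firmware = {"Boot": {"BootSourceOverrideTarget": "None"}}
    newer_firmware = {"Boot": {"BootSourceOverrideMode": "Legacy"}}
    print(boot_override_patch(old_firmware))  # -> None
    print(boot_override_patch(newer_firmware))
```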

[BUG] database changes required when mysql or rmq addresses change

Describe the bug
When the address of the mysql server changes (or, where mysql connections are proxied, the address of the haproxy server), nova keeps certain values in the nova_api database that were hardcoded at service deployment and never get updated, namely:

MariaDB [nova_api]> select name, database_connection, transport_url from cell_mappings;
+-------+---------------------------------------------------------+------------------------------------------+
| name  | database_connection                                     | transport_url                            |
+-------+---------------------------------------------------------+------------------------------------------+
| cell0 | mysql+pymysql://nova:<password>@<mysql-host>/nova_cell0 | none:///                                 |
| cell1 | mysql+pymysql://nova:<password>@<mysql-host>/nova       | rabbit://openstack:<password>@<rmq-host> |
+-------+---------------------------------------------------------+------------------------------------------+
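Because nova reads these values from `cell_mappings` rather than nova.conf, they have to be rewritten when the addresses change; `nova-manage cell_v2 update_cell` takes `--cell_uuid`, `--transport-url` and `--database_connection` for this. A minimal sketch, assuming the current mappings have already been read into dicts (for example from `nova-manage cell_v2 list_cells --verbose`) and the desired values are the ones rendered into nova.conf, that emits update commands only for cells that drifted. The input shape and function name are hypothetical; cell0's transport URL is intentionally left as `none:///`.

```python
from typing import Dict, List


def update_cell_commands(
    cells: List[Dict[str, str]], desired_db: Dict[str, str], desired_transport: str
) -> List[List[str]]:
    """Return nova-manage commands that bring stale cell mappings up to date.

    cells: [{"name", "uuid", "database_connection", "transport_url"}, ...]
    desired_db: cell name -> database connection string from nova.conf.
    cell0 keeps transport_url "none:///", so only its database is compared.
    """
    commands = []
    for cell in cells:
        args: List[str] = []
        want_db = desired_db.get(cell["name"])
        if want_db and cell["database_connection"] != want_db:
            args += ["--database_connection", want_db]
        if cell["name"] != "cell0" and cell["transport_url"] != desired_transport:
            args += ["--transport-url", desired_transport]
        if args:
            commands.append(
                ["nova-manage", "cell_v2", "update_cell", "--cell_uuid", cell["uuid"]] + args
            )
    return commands
```

Running the resulting commands (and restarting the nova API services) would then be a follow-on step of the highstate.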

To Reproduce
Steps to reproduce the behavior:

Deploy kinetic
Change address of mysql or rmq
Highstate the environment
Observe nova api failures, even though nova.conf has correct values

Expected behavior
The environment should always function when the highstate is fully applied

Screenshots
N/A

Client Configuration (please complete the following information):
N/A

Additional context
N/A

[FEATURE] Add L3-to-the-host support

Is your feature request related to a problem? Please describe.
Scaling the cloud with traditional L2/L3 constructs is challenging and error-prone. STP, VLANs, massive broadcast domains, and related problems make horizontal scaling hard. L2 should be eliminated completely, and L3 should extend to every device.

Describe the solution you'd like
https://www.exertishammer.com/assets/uploads/resources/Cumulus%20Networks%20-%20Datasheet%20-%20Cumulus%20Networks;%20Routing%20on%20the%20Host.pdf

It gives a decent overview of the overall strategy rather than implementation details.

Additional context
This is a decent example: https://codingpackets.com/blog/linux-routing-on-the-host-with-frr/
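In the routing-on-the-host model described in the linked post, each host runs a routing daemon such as FRR, peers with its upstream switches using BGP unnumbered on each physical interface, and advertises a /32 loopback address. A minimal sketch that renders such an FRR bgpd snippet from per-host data; the ASN, router ID, interface names and the helper itself are hypothetical, and the exact configuration kinetic would need is not decided in this issue.

```python
from typing import List


def frr_bgp_config(asn: int, loopback: str, uplinks: List[str]) -> str:
    """Render an FRR bgpd snippet: BGP unnumbered per uplink, advertise a /32."""
    lines = [f"router bgp {asn}", f" bgp router-id {loopback}"]
    for iface in uplinks:
        # Interface peering discovers the neighbor via IPv6 link-local (unnumbered).
        lines.append(f" neighbor {iface} interface remote-as external")
    lines += [
        " !",
        " address-family ipv4 unicast",
        f"  network {loopback}/32",
        " exit-address-family",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values for one compute host with two uplinks.
    print(frr_bgp_config(65101, "10.0.0.11", ["ens3", "ens4"]), end="")
```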

[BUG] Windows 10 LTSC 2019 BSOD

Describe the bug
The Windows 10 LTSC 2019 image boots to a BSOD.

To Reproduce
Launch a Windows 10 LTSC 2019 instance.

Expected behavior
Windows 10 LTSC 2019 should boot normally and allow access to the OS.

Client Configuration (please complete the following information):

OS: Windows 10 LTSC 2019
Interface: Horizon
- Version 76.0.3809.100 (Official Build) (64-bit)

[FEATURE] 'other' file roots capability

Commit b9e4228 removes an explicit reference to a STIG file root. That mechanism should be replaced in the pillar with something generic and extensible, so users can import arbitrary file roots at creation time based on data in the answer file.
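One generic shape would be a list of extra roots in the answer file that gets merged into the master's `file_roots` for the `base` environment. A minimal sketch with hypothetical keys (`other_file_roots`, `path`) and a hypothetical default root; the actual answer-file schema is not defined in this issue.

```python
from typing import Dict, List


def build_file_roots(answers: dict, default_roots: List[str]) -> Dict[str, List[str]]:
    """Merge extra roots from an answer file into Salt's base file_roots.

    Salt searches an environment's roots in order, so the defaults come first;
    duplicates are dropped, keeping the first occurrence.
    """
    extra = [r["path"] for r in answers.get("other_file_roots", [])]
    roots: List[str] = []
    for path in default_roots + extra:
        if path not in roots:
            roots.append(path)
    return {"base": roots}


if __name__ == "__main__":
    # Hypothetical answer-file content: a STIG root becomes just another entry.
    answers = {"other_file_roots": [{"path": "/srv/stig"}, {"path": "/srv/site-extras"}]}
    print(build_file_roots(answers, ["/srv/salt"]))
```

The resulting dict would be written under the `file_roots` key of the master config.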

[BUG] Cannot manipulate object store from horizon

ref: https://bugs.launchpad.net/horizon/+bug/1880188

There is a workaround listed that modifies the swift client. This is probably a decent option for now.

Old mine data still present after re-spinning service

After re-spinning a service (orch.virtual, orch.bootstrap, orch.map), the mine data from the old minions of that service type is still present on the master and causes undesired (but harmless) behavior, such as creating user entries in mysql for hosts that no longer exist. The orchestration states should flush the mine for that service type as one of the first steps of their execution.
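A rough sketch of picking out what to flush, assuming minion IDs begin with the service type (e.g. `cache-...`, a hypothetical convention here) and that the orchestration knows which minions are live. Everything of that service type not in the live set is stale; at the very start of a re-spin, before new minions exist, that is the whole service type. The resulting IDs could then be cleared on the master, for example with Salt's `cache.clear_mine` runner, though how that runner treats minions whose keys are already deleted is not verified here.

```python
from typing import Dict, Iterable, List


def stale_mine_minions(
    mine: Dict[str, object], service_type: str, live_minions: Iterable[str]
) -> List[str]:
    """Return IDs of `service_type` minions that have mine data but are gone."""
    live = set(live_minions)
    prefix = f"{service_type}-"  # hypothetical "<type>-<id>" naming convention
    return sorted(m for m in mine if m.startswith(prefix) and m not in live)


if __name__ == "__main__":
    # Hypothetical mine snapshot after re-spinning the "cache" service.
    mine = {"cache-old1": {}, "cache-new1": {}, "controller-a": {}}
    print(stale_mine_minions(mine, "cache", ["cache-new1", "controller-a"]))
    # -> ['cache-old1']
```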

[BUG] share server issues

There is an SELinux problem with nfs-ganesha that prevents it from creating a working directory in /var/lib when it starts for the first time. Turning SELinux off, starting the service, then turning SELinux back on resolves the issue, since nfs-ganesha only needs to do this once, but that solution is lame.

Either -
a. write an selinux policy that addresses whatever the issue is
b. make the directory on behalf of nfs-ganesha
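Option b could look like this sketch: create the directory ahead of time, then restore its SELinux context so it matches the loaded policy. The issue doesn't name the exact directory under /var/lib, so the path is left to the caller, and the helper name is hypothetical; `restorecon` only runs when it is installed and `relabel=True`, and it only helps if the policy defines the right file context for that path.

```python
import os
import shutil
import subprocess


def precreate_dir(path: str, mode: int = 0o755, relabel: bool = True) -> bool:
    """Create `path` (and parents) if missing, then optionally run restorecon.

    The mode is subject to the process umask; ownership is left to the caller.
    Returns True if restorecon was run.
    """
    os.makedirs(path, mode=mode, exist_ok=True)
    if relabel and shutil.which("restorecon"):
        subprocess.run(["restorecon", "-R", path], check=True)
        return True
    return False
```

It would be called from a state that runs before nfs-ganesha starts for the first time, with the directory nfs-ganesha actually tries to create.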

Since public is now a dummy interface, users can't reach the share servers without an intermediary.

Either -
a. Modify the public network with a flag that allows for real addresses when necessary (this is the only case so far)
b. Proxy NFS requests through haproxy with complicated reactor rules. This will be ugly, but probably doable.
