
sonic-platform-daemons's People

Contributors

abdosi, alexrallen, anoopkamath, aravindmani-1, arunsaravananbalachandran, assrinivasan, bratashx, chiourung, jleveque, judyjoseph, junchao-mellanox, keboliu, kuanyu99, lguohan, liuh-80, liushilongbuaa, longhuan-cisco, michaelwangsmci, mihirpat1, mlok-nokia, mprabhu-nokia, prgeor, qiluo-msft, staphylo, stephenxs, sujinmkang, vdahiya12, vganesan-nokia, vivekrnv, zzhiyuan

sonic-platform-daemons's Issues

test_SfpStateUpdateTask_task_run_stop failed

When I build the master branch of sonic-buildimage (sonic-broadcom.bin), I sometimes get this error.

=================================== FAILURES ===================================
____________ TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop _____________

self = <tests.test_xcvrd.TestXcvrdScript object at 0x7f11132cb0d0>

    @patch('xcvrd.xcvrd_utilities.port_mapping.subscribe_port_config_change', MagicMock(return_value=(None, None)))
    def test_SfpStateUpdateTask_task_run_stop(self):
        port_mapping = PortMapping()
        retry_eeprom_set = set()
        stop_event = threading.Event()
        sfp_error_event = threading.Event()
        task = SfpStateUpdateTask(DEFAULT_NAMESPACE, port_mapping, retry_eeprom_set, stop_event, sfp_error_event)
        task.start()
        assert wait_until(5, 1, task.is_alive)
        task.raise_exception()
>       task.join()

tests/test_xcvrd.py:1041: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xcvrd/xcvrd.py:2177: in join
    raise self.exc
xcvrd/xcvrd.py:2152: in run
    self.task_worker(self.task_stopping_event, self.sfp_error_event)
xcvrd/xcvrd.py:1965: in task_worker
    port_mapping.handle_port_config_change(sel, asic_context, stopping_event, self.port_mapping, helper_logger, self.on_port_config_change)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

sel = None, asic_context = None
stop_event = <threading.Event object at 0x7f1113ccbc40>
port_mapping = <xcvrd.xcvrd_utilities.port_mapping.PortMapping object at 0x7f1113ccbcd0>
logger = <sonic_py_common.logger.Logger object at 0x7f1115fcf970>
port_change_event_handler = <bound method SfpStateUpdateTask.on_port_config_change of <SfpStateUpdateTask(SfpStateUpdateTask, stopped 139711322162944)>>

    def handle_port_config_change(sel, asic_context, stop_event, port_mapping, logger, port_change_event_handler):
        """Select CONFIG_DB PORT table changes, once there is a port configuration add/remove, notify observers
        """
        if not stop_event.is_set():
>           (state, _) = sel.select(SELECT_TIMEOUT_MSECS)
E           AttributeError: 'NoneType' object has no attribute 'select'

xcvrd/xcvrd_utilities/port_mapping.py:218: AttributeError
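
The traceback suggests an ordering problem: the test patches subscribe_port_config_change to return (None, None), so the worker calls select on None unless the stop event is already set when it gets there. The following is a minimal sketch of that behaviour, not the real xcvrd code. The function is a simplified stand-in for the one in the traceback, the SELECT_TIMEOUT_MSECS value is assumed, and outcome is a hypothetical helper.

```python
import threading

SELECT_TIMEOUT_MSECS = 1000  # assumed value, for illustration only


def handle_port_config_change(sel, stop_event):
    """Simplified stand-in for the function shown in the traceback."""
    if not stop_event.is_set():
        # With the test's mock, sel is None, so this line raises AttributeError.
        state, _ = sel.select(SELECT_TIMEOUT_MSECS)
        return state
    return None


def outcome(stop_first):
    """Report what the worker sees depending on whether stop was set first."""
    stop_event = threading.Event()
    if stop_first:
        stop_event.set()
    try:
        handle_port_config_change(None, stop_event)
        return "no error"
    except AttributeError as exc:
        return "AttributeError: " + str(exc)
```

If this reading is right, the test passes or fails depending on whether the stop request lands before the worker reaches the select call.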

Handling CMIS_STATE_FAILED state in xcvrd

This issue tracks the overall effort required to handle CMIS_STATE_FAILED in xcvrd, including a way to notify the user of this status and take corrective action where applicable.
Also, depending on the step at which CMIS_STATE_FAILED is set for a port, we may need to shut the port down and perform DP deinit if needed.

thermalctld no longer adds 'speed_tolerance' to the Redis database

Hi, I would like to know why the 'speed_tolerance' field was deleted.
As far as I know, the command 'show system-health summary' calls 'hardware_checker.py' (on the master branch), which reads 'speed_tolerance' from the Redis database.
If 'speed_tolerance' is deleted, 'show system-health summary' reports an error:

root@sonic:/home/admin# show system-health summary
System status summary

System status LED  amber
Services:
Status: OK
Hardware:
Status: Not OK
Reasons: Failed to get speed tolerance for Fantray7_2
	 Failed to get speed tolerance for Fantray7_1
	 Failed to get speed tolerance for Fantray6_2
	 Failed to get speed tolerance for Fantray6_1
	 Failed to get speed tolerance for Fantray5_2
	 Failed to get speed tolerance for Fantray5_1
	 Failed to get speed tolerance for Fantray4_2
	 Failed to get speed tolerance for Fantray4_1
	 Failed to get speed tolerance for Fantray3_2
	 Failed to get speed tolerance for Fantray3_1
	 Failed to get speed tolerance for Fantray2_2
	 Failed to get speed tolerance for Fantray2_1
	 Failed to get speed tolerance for Fantray1_2
	 Failed to get speed tolerance for Fantray1_1

root@sonic:/home/admin# show platform fan
  Drawer    LED         FAN    Speed    Direction    Presence    Status          Timestamp
--------  -----  ----------  -------  -----------  ----------  --------  -----------------
Fantray1  green  Fantray1_1      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray1  green  Fantray1_2      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray2  green  Fantray2_1      55%       INTAKE     Present        OK  20221222 15:29:07
Fantray2  green  Fantray2_2      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray3  green  Fantray3_1      55%       INTAKE     Present        OK  20221222 15:29:07
Fantray3  green  Fantray3_2      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray4  green  Fantray4_1      55%       INTAKE     Present        OK  20221222 15:29:07
Fantray4  green  Fantray4_2      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray5  green  Fantray5_1      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray5  green  Fantray5_2      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray6  green  Fantray6_1      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray6  green  Fantray6_2      56%       INTAKE     Present        OK  20221222 15:29:07
Fantray7  green  Fantray7_1      55%       INTAKE     Present        OK  20221222 15:29:07
Fantray7  green  Fantray7_2      56%       INTAKE     Present        OK  20221222 15:29:07
     N/A    N/A   PSU1_FAN1      29%       INTAKE     Present        OK  20221222 15:29:07
     N/A    N/A   PSU2_FAN1      29%       INTAKE     Present        OK  20221222 15:29:07
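
The "Failed to get speed tolerance" message is the kind of result a consumer produces when it requires a field that the producer has stopped writing. The following is a rough sketch of such a check, not the real hardware_checker.py code. The function name check_fan, the 'speed_target' field, and treating the tolerance as an absolute difference are all assumptions made for illustration, and a plain dict stands in for the Redis hash.

```python
def check_fan(name, fan_info):
    """Return an error string for one fan entry, or None if it looks healthy.

    fan_info is a plain dict standing in for the FAN_INFO hash in Redis.
    """
    tolerance = fan_info.get("speed_tolerance")
    if tolerance is None:
        # Same wording as the message in 'show system-health summary'.
        return "Failed to get speed tolerance for {}".format(name)
    speed = float(fan_info["speed"])
    target = float(fan_info["speed_target"])  # assumed field name
    if abs(speed - target) > float(tolerance):
        return "{} speed is out of range".format(name)
    return None
```

With a check of this shape, every fan is flagged as soon as the field disappears from the database, even though 'show platform fan' reports all fans as OK.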

400G-ZR: Allow all configurable laser frequency for coherent modules

Problem description:
At present only 75GHz grid frequency values are allowed for configuration (75GHz grid is hardcoded).
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1295

Configuring any other frequency causes the xcvrd child process to crash, because the current code raises an exception if the value is out of range or is not a 75GHz grid channel frequency.

Every frequency the module is capable of should be allowed for configuration/programming, and any frequency the module is not capable of should be rejected by the CLI.
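
As an illustration of the requested behaviour, the following sketch checks a frequency against any of several fixed grids instead of a hardcoded 75GHz grid. It assumes frequencies and grid spacings are given in MHz and that grids are anchored at 193.1 THz (the ITU-T G.694.1 anchor). The function names and parameters are hypothetical and are not taken from xcvrd.

```python
ANCHOR_MHZ = 193_100_000  # 193.1 THz, the ITU-T G.694.1 anchor frequency


def is_on_grid(freq_mhz, grid_mhz):
    """Return True if freq_mhz sits on a fixed grid with the given spacing."""
    return (freq_mhz - ANCHOR_MHZ) % grid_mhz == 0


def validate_frequency(freq_mhz, low_mhz, high_mhz, supported_grids_mhz):
    """Accept a frequency that is in range and on any grid the module supports."""
    if not low_mhz <= freq_mhz <= high_mhz:
        return False
    return any(is_on_grid(freq_mhz, grid) for grid in supported_grids_mhz)
```

A validator like this returns False for an unsupported value rather than raising, so a caller such as the CLI could reject the configuration without crashing the daemon.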

psu daemon doesn't update PSU FAN information

Issue:

  • The PSU FAN information is not updated in the redis-db.

Steps to reproduce

  • Load latest master image and check redis-db PSU Fan table.

Logs:
root@sonic:~# redis-cli -n 6 hgetall "FAN_INFO|PSU1 Fan"
1) "led_status"
2) "None"

To fix this issue:

  • Initialize self.presence and the other variables in PsuStatus.__init__ to False instead of True.
  • Import the datetime module.
  • Then, the following values will be seen.

root@sonic:/# redis-cli -n 6 hgetall "FAN_INFO|PSU1 Fan"
1) "presence"
2) "True"
3) "status"
4) "Updating"
5) "direction"
6) "exhaust"
7) "speed"
8) "67"
9) "timestamp"
10) "20201223 01:48:55"
11) "led_status"
12) "None"
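
One plausible reason the initial value matters: if a status object reports a change only when the new reading differs from the stored one, then starting from True hides the first "present" reading and nothing is written. The sketch below illustrates that assumption with a hypothetical PresenceTracker class. It is not the psud code.

```python
class PresenceTracker:
    """Tracks presence and reports whether a reading changed it."""

    def __init__(self, initial):
        self.presence = initial

    def set_presence(self, presence):
        """Return True only when the stored value actually changes."""
        if presence == self.presence:
            return False
        self.presence = presence
        return True


def first_reading_triggers_update(initial):
    """Simulate the first poll of a PSU that is physically present."""
    tracker = PresenceTracker(initial)
    return tracker.set_presence(True)
```

Under this assumption, an initial value of False makes the first poll count as a change, which is when the table would be populated.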

One more issue: the PSU fan status values are "Updating" or "N/A".
The system health daemon expects the PSU fan status to be "True" or "False".

Thermalctld also updates the PSU fan status value to "True/False".
So we need your input on whether the psu daemon can report the PSU fan status as "True/False" instead of "Updating/N/A".
If this is not done, a "PSU Fan is broken" error will be logged in the system health table in redis-db.

xcvrd: Modify xcvrd from multi process to multi threaded

Problem description:
At present, xcvrd creates child processes for the task workers that handle SFP state transitions and the CMIS handler. If one of these processes crashes, the crash goes unnoticed.

Move xcvrd from a multi-process model to a multi-threaded model.
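
One way a threaded design can make worker failures visible is to capture the exception inside the thread and re-raise it when the parent joins. The following is a minimal sketch of that pattern using only the standard library. WorkerThread and crashing_worker are hypothetical names, and this is not the actual xcvrd implementation.

```python
import threading


class WorkerThread(threading.Thread):
    """Thread that remembers a worker exception and re-raises it on join."""

    def __init__(self, worker):
        super().__init__()
        self.worker = worker
        self.exc = None

    def run(self):
        try:
            self.worker()
        except Exception as exc:  # keep the failure for the parent thread
            self.exc = exc

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc is not None:
            raise self.exc


def crashing_worker():
    """Stand-in for a task worker that fails."""
    raise RuntimeError("simulated task failure")
```

The parent thread then sees the failure at join time and can log it or exit, instead of continuing with a dead worker.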

xcvrd exception in transceiver eeprom during OIR

Symptoms:

  1. It takes around 10 seconds to read the eeprom of a transceiver with pages. sfp_optoe_base reads the eeprom too many times.
  2. The exception occurs when a transceiver is plugged in and then unplugged within 10 seconds.
  3. If the transceiver is unplugged more than 10 seconds later, the exception does not happen.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1366, in task_worker
    post_port_dom_info_to_db(logical_port_name, self.port_mapping, xcvr_table_helper.get_dom_tbl(asic_index), self.task_stopping_event, dom_info_cache=dom_info_cache)
  File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 434, in post_port_dom_info_to_db
    dom_info_dict = _wrapper_get_transceiver_dom_info(physical_port)
  File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 154, in _wrapper_get_transceiver_dom_info
    return platform_chassis.get_sfp(physical_port).get_transceiver_bulk_status()
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/sfp_optoe_base.py", line 28, in get_transceiver_bulk_status
    return api.get_transceiver_bulk_status() if api is not None else None
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 205, in get_transceiver_bulk_status
    self.vdm_dict = self.get_vdm()
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 1015, in get_vdm
    vdm = self.vdm.get_vdm_allpage() if not self.is_flat_memory() else {}
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py", line 178, in get_vdm_allpage
    vdm_current_page = self.get_vdm_page(page, vdm_flag_page)
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py", line 55, in get_vdm_page
    vdm_typeID = vdm_descriptor[1::2]
TypeError: 'NoneType' object is not subscriptable
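
The final frame fails because the descriptor read returned None, which is what one would expect if the module was removed while its pages were still being read. A defensive variant of the slicing step could look like the sketch below. The helper name extract_vdm_type_ids is hypothetical, and this is not the code in cmisVDM.py.

```python
def extract_vdm_type_ids(vdm_descriptor):
    """Return the type-ID bytes (every second byte, starting at index 1).

    Returns None when the descriptor could not be read, for example because
    the module was removed while its pages were still being read.
    """
    if vdm_descriptor is None:
        return None
    return vdm_descriptor[1::2]
```

The caller would still need to treat a None result as "no VDM data for this page" rather than continuing to parse.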

Build fails because of test failure in sonic-xcvrd/tests/test_xcvrd.py:test_SfpStateUpdateTask_task_run_stop()

This failure happens once in three builds. It is not consistent and may be a timing issue.

=================================== FAILURES ===================================
____________ TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop _____________

self = <tests.test_xcvrd.TestXcvrdScript object at 0x7fbd2e6508b0>

    @patch('xcvrd.xcvrd_utilities.port_mapping.subscribe_port_config_change', MagicMock(return_value=(None, None)))
    def test_SfpStateUpdateTask_task_run_stop(self):
        port_mapping = PortMapping()
        retry_eeprom_set = set()
        stop_event = threading.Event()
        sfp_error_event = threading.Event()
        task = SfpStateUpdateTask(DEFAULT_NAMESPACE, port_mapping, retry_eeprom_set, stop_event, sfp_error_event)
        task.start()
>       assert wait_until(5, 1, task.is_alive)
E       assert False
E        +  where False = wait_until(5, 1, <bound method Thread.is_alive of <SfpStateUpdateTask(SfpStateUpdateTask, stopped 140450521253632)>>)
E        +    where <bound method Thread.is_alive of <SfpStateUpdateTask(SfpStateUpdateTask, stopped 140450521253632)>> = <SfpStateUpdateTask(SfpStateUpdateTask, stopped 140450521253632)>.is_alive

tests/test_xcvrd.py:1039: AssertionError
=============================== warnings summary ===============================
/usr/lib/python3/dist-packages/_pytest/junitxml.py:446
/usr/lib/python3/dist-packages/_pytest/junitxml.py:446: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0. See:
https://docs.pytest.org/en/stable/deprecations.html#junit-family-default-value-change-to-xunit2
for more information.
_issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)

-- Docs: https://docs.pytest.org/en/stable/warnings.html

- generated xml file: /sonic/src/sonic-platform-daemons/sonic-xcvrd/test-results.xml -

----------- coverage: platform linux, python 3.9.2-final-0 -----------
Name                                         Stmts   Miss  Cover
----------------------------------------------------------------
xcvrd/__init__.py                                0      0   100%
xcvrd/xcvrd.py                                1544    341    78%
xcvrd/xcvrd_utilities/__init__.py                0      0   100%
xcvrd/xcvrd_utilities/port_mapping.py          191     26    86%
xcvrd/xcvrd_utilities/sfp_status_helper.py      25      1    96%
----------------------------------------------------------------
TOTAL                                         1760    368    79%

Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED tests/test_xcvrd.py::TestXcvrdScript::test_SfpStateUpdateTask_task_run_stop
=================== 1 failed, 69 passed, 1 warning in 16.38s ===================
[ FAIL LOG END ] [ target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl ]
make: *** [slave.mk:878: target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl] Error 1
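
The assertion output shows the thread already in the "stopped" state when is_alive is polled, which fits the timing explanation: if the worker exits right after start(), polling is_alive can never succeed. The sketch below reproduces that situation deterministically. The wait_until implementation here is a hypothetical stand-in matching only the call shape seen in the test (total wait, interval, predicate), not the helper from the test suite.

```python
import threading
import time


def wait_until(total_wait, interval, predicate):
    """Poll predicate every interval seconds; True once it holds, else False."""
    deadline = time.monotonic() + total_wait
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


def short_lived_worker():
    """Stand-in for a task whose worker returns or fails right after start()."""
    return


def alive_after_start():
    thread = threading.Thread(target=short_lived_worker)
    thread.start()
    thread.join()  # force the "worker already finished" ordering
    return wait_until(0.2, 0.05, thread.is_alive)
```

In the real test the ordering is not forced, so the result depends on how quickly the worker thread runs relative to the first poll.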

docker-platform-monitor failure

When running the sonic-vs image, the docker-platform-monitor is constantly rolling over with this error from Python:
Jul 24 13:48:14.209452 vlab-01 INFO pmon#thermalctld: Starting up...
Jul 24 13:48:14.209929 vlab-01 INFO pmon#supervisord: thermalctld Traceback (most recent call last):
Jul 24 13:48:14.210241 vlab-01 INFO pmon#supervisord: thermalctld File "/usr/bin/thermalctld", line 590, in
Jul 24 13:48:14.210501 vlab-01 INFO pmon#supervisord: thermalctld main()
Jul 24 13:48:14.210843 vlab-01 INFO pmon#supervisord: thermalctld File "/usr/bin/thermalctld", line 586, in main
Jul 24 13:48:14.211067 vlab-01 INFO pmon#supervisord: thermalctld thermal_control.run()
Jul 24 13:48:14.211349 vlab-01 INFO pmon#supervisord: thermalctld File "/usr/bin/thermalctld", line 547, in run
Jul 24 13:48:14.211599 vlab-01 INFO pmon#supervisord: thermalctld import sonic_platform.platform
Jul 24 13:48:14.211861 vlab-01 INFO pmon#supervisord: thermalctld ImportError: No module named sonic_platform.platform

I found a potential solution at https://www.gitmemory.com/jleveque. But it appears I cannot access the source code repository nor build the sonic-vs.img on my own to fix it.
Is there a workaround for this?

[xcvrd] : xcvrd exits and sfpshow doesnt fetch output for all ports for 50G profile

In the Z9100 T0 profile (C8D48), the xcvrd process exits with the error below, and sfpshow presence shows the transceivers as Not present. Transceiver output is shown only for Ethernet0.

root@sonic-z9100-02:/var/log# xcvrd
Traceback (most recent call last):
  File "/usr/bin/xcvrd", line 735, in <module>
    main()
  File "/usr/bin/xcvrd", line 732, in main
    xcvrd.run()
  File "/usr/bin/xcvrd", line 696, in run
    self.init()
  File "/usr/bin/xcvrd", line 680, in init
    post_port_sfp_dom_info_to_db(is_warm_start, self.stop_event)
  File "/usr/bin/xcvrd", line 238, in post_port_sfp_dom_info_to_db
    notify_media_setting(logical_port_name, transceiver_dict, app_port_tbl)
  File "/usr/bin/xcvrd", line 473, in notify_media_setting
    media_dict[media_key])
  File "/usr/bin/xcvrd", line 414, in get_media_val_str
    start_lane = logical_idx * num_lanes_per_logical_port
NameError: global name 'logical_idx' is not defined

With the fix below, we are able to fetch the values.

The function signature

def get_media_val_str(num_logical_ports, lane_dict):

is changed to

def get_media_val_str(num_logical_ports, logical_idx, lane_dict):

and the call at line 473

    for media_key in media_dict:
        if type(media_dict[media_key]) is dict:
            media_val_str = get_media_val_str(num_logical_ports, \
                                        media_dict[media_key])

is changed to

            media_val_str = get_media_val_str(num_logical_ports, \
                                        logical_idx, media_dict[media_key])
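
To show why the function needs logical_idx, here is a simplified sketch of selecting the lanes that belong to one logical port in breakout mode. The function name lanes_for_logical_port, the "laneN" key format, and the comma-joined return value are assumptions for illustration. This is not the xcvrd implementation.

```python
def lanes_for_logical_port(num_logical_ports, logical_idx, lane_dict):
    """Return the comma-joined lane values that belong to one logical port."""
    num_lanes = len(lane_dict)
    if num_logical_ports > 1 and num_lanes >= num_logical_ports:
        # Breakout mode: each logical port owns a contiguous slice of lanes.
        lanes_per_port = num_lanes // num_logical_ports
        start_lane = logical_idx * lanes_per_port
        indices = range(start_lane, start_lane + lanes_per_port)
    else:
        # No breakout: the single logical port uses every lane.
        indices = range(num_lanes)
    return ",".join(lane_dict["lane{}".format(i)] for i in indices)
```

Without logical_idx as a parameter, the start_lane computation has nothing to refer to, which matches the NameError in the traceback.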

root@sonic-z9100-02:~# sfpshow presence
Port         Presence
-----------  -----------
Ethernet0    Present
Ethernet2    Not present
Ethernet4    Not present
Ethernet6    Not present
Ethernet8    Not present
Ethernet10   Not present
Ethernet12   Not present
Ethernet14   Not present
Ethernet16   Not present
Ethernet18   Not present
Ethernet20   Not present
Ethernet22   Not present
Ethernet24   Not present
Ethernet28   Not present
Ethernet32   Not present
Ethernet36   Not present
Ethernet40   Not present
Ethernet42   Not present
Ethernet44   Not present
Ethernet46   Not present
Ethernet48   Not present
Ethernet50   Not present
Ethernet52   Not present
Ethernet54   Not present
Ethernet56   Not present
Ethernet58   Not present
Ethernet60   Not present
Ethernet62   Not present
Ethernet64   Not present
Ethernet66   Not present
Ethernet68   Not present
Ethernet70   Not present
Ethernet72   Not present
Ethernet74   Not present
Ethernet76   Not present
Ethernet78   Not present
Ethernet80   Not present
Ethernet82   Not present
Ethernet84   Not present
Ethernet86   Not present
Ethernet88   Not present
Ethernet90   Not present
Ethernet92   Not present
Ethernet94   Not present
Ethernet96   Not present
Ethernet98   Not present
Ethernet100  Not present
Ethernet102  Not present
Ethernet104  Not present
Ethernet108  Not present
Ethernet112  Not present
Ethernet116  Not present
Ethernet120  Not present
Ethernet122  Not present
Ethernet124  Not present
Ethernet126  Not present

[ycabled][dualtor][active-active] `show mux status` is slow and always return `inconsistent` HWSTATUS

On the 20220531.28 image, the show mux status command is slow and always returns inconsistent for HWSTATUS:

$ time show mux s
PORT        STATUS    SERVER_STATUS    HEALTH    HWSTATUS      LAST_SWITCHOVER_TIME
----------  --------  ---------------  --------  ------------  ---------------------------
Ethernet4   active    active           healthy   inconsistent  2023-May-30 05:46:51.795124
Ethernet8   active    active           healthy   inconsistent  2023-May-30 05:46:51.734891
Ethernet12  active    active           healthy   inconsistent  2023-May-30 05:46:51.956624
Ethernet16  active    active           healthy   inconsistent  2023-May-30 05:46:51.755520
Ethernet20  active    active           healthy   inconsistent  2023-May-30 05:46:52.007026
Ethernet24  active    active           healthy   inconsistent  2023-May-30 05:46:51.826122
Ethernet28  active    active           healthy   inconsistent  2023-May-30 05:46:52.048576
Ethernet32  active    active           healthy   inconsistent  2023-May-30 05:46:51.719157
Ethernet36  active    active           healthy   inconsistent  2023-May-30 05:46:51.845886
Ethernet40  active    active           healthy   inconsistent  2023-May-30 05:46:51.981137
Ethernet44  active    active           healthy   inconsistent  2023-May-30 05:46:51.912922
Ethernet48  active    active           healthy   inconsistent  2023-May-30 05:46:51.924832
Ethernet52  active    active           healthy   inconsistent  2023-May-30 05:46:52.120600
Ethernet56  active    active           healthy   inconsistent  2023-May-30 05:46:51.992982
Ethernet60  active    active           healthy   inconsistent  2023-May-30 05:46:51.771509
Ethernet64  active    active           healthy   inconsistent  2023-May-30 05:46:51.898908
Ethernet68  active    active           healthy   inconsistent  2023-May-30 05:46:52.034526
Ethernet72  active    active           healthy   inconsistent  2023-May-30 05:46:52.061691
Ethernet76  active    active           healthy   inconsistent  2023-May-30 05:46:52.095044
Ethernet80  active    active           healthy   inconsistent  2023-May-30 05:46:52.020575
Ethernet84  active    active           healthy   inconsistent  2023-May-30 05:46:51.687364
Ethernet88  active    active           healthy   inconsistent  2023-May-30 05:46:52.132286
Ethernet92  active    active           healthy   inconsistent  2023-May-30 05:46:52.110357
Ethernet96  active    active           healthy   inconsistent  2023-May-30 05:46:51.969622

real    0m26.449s
user    0m1.366s
sys     0m0.469s
$ show mux grpc mux
Port        Direction    Presence    PeerDirection    ConnectivityState
----------  -----------  ----------  ---------------  -------------------
Ethernet4   active       True        active           READY
Ethernet8   active       True        active           READY
Ethernet12  active       True        active           READY
Ethernet16  active       True        active           READY
Ethernet20  active       True        active           READY
Ethernet24  active       True        active           READY
Ethernet28  active       True        active           READY
Ethernet32  active       True        active           READY
Ethernet36  active       True        active           READY
Ethernet40  active       True        active           READY
Ethernet44  active       True        active           READY
Ethernet48  active       True        active           READY
Ethernet52  active       True        active           READY
Ethernet56  active       True        active           READY
Ethernet60  active       True        active           READY
Ethernet64  active       True        active           READY
Ethernet68  active       True        active           READY
Ethernet72  active       True        active           READY
Ethernet76  active       True        active           READY
Ethernet80  active       True        active           READY
Ethernet84  active       True        active           READY
Ethernet88  active       True        active           READY
Ethernet92  active       True        active           READY
Ethernet96  active       True        active           READY

Issue in transceiver eeprom during OIR

problem with "show interfaces transceiver eeprom -d"

We now have .06 installed on the switch and I have found a new, but seemingly related, issue.

When we installed .06 and rebooted/reloaded the switch, the eeprom data from the ONet AOCs seemed to be correct. Then, today, I continued with my testing and started getting test failures during our hotswap test. This is the test where we check the function of the optics before and after unplugging and re-inserting the optics.

Here is an example of what is happening.
Before unplugging the optic we have what looks like correct data.
Ethernet72: SFP EEPROM detected
Application Advertisement: 400GAUI-8 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4 200GAUI-4 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4
Connector: No separable connector
Encoding: Not supported
Extended Identifier: Power Class 1(10.0W Max)
Extended RateSelect Compliance: Not supported
Identifier: QSFP-DD Double Density 8X Pluggable Transceiver
Length Cable Assembly(m): 3.0
Nominal Bit Rate(100Mbs): Not supported
Specification compliance: active_cable_media_interface
Vendor Date Code(YYYY-MM-DD Lot): 2021-02-05 01
Vendor Name: O-NET
Vendor OUI: 34-78-77
Vendor PN: 1AT-5QAM03XX-10A
Vendor Rev: A
Vendor SN: 4QA-0000022
ChannelMonitorValues:
RX1Power: 1.4476dBm
RX2Power: 1.4826dBm
RX3Power: 1.502dBm
RX4Power: 1.3437dBm
RX5Power: 1.3792dBm
RX6Power: 1.4461dBm
RX7Power: 1.2031dBm
RX8Power: 1.213dBm
TX1Bias: 5.986mA
TX1Power: 1.9011dBm
TX2Bias: 6.27mA
TX2Power: 2.1024dBm
TX3Bias: 6.178mA
TX3Power: 2.0382dBm
TX4Bias: 6.104mA
TX4Power: 1.986dBm
TX5Bias: 6.058mA
TX5Power: 1.9529dBm
TX6Bias: 6.152mA
TX6Power: 2.02dBm
TX7Bias: 5.892mA
TX7Power: 1.8324dBm
TX8Bias: 6.082mA
TX8Power: 1.97dBm
ChannelThresholdValues:
RxPowerHighAlarm : 4.7699dBm
RxPowerHighWarning: 3.9799dBm
RxPowerLowAlarm : -6.9919dBm
RxPowerLowWarning : -6.0206dBm
TxBiasHighAlarm : 9.0000mA
TxBiasHighWarning : 8.5000mA
TxBiasLowAlarm : 4.0000mA
TxBiasLowWarning : 4.5000mA
TxPowerHighAlarm : 4.7699dBm
TxPowerHighWarning: 3.9799dBm
TxPowerLowAlarm : -6.9919dBm
TxPowerLowWarning : -6.0206dBm
ModuleMonitorValues:
Temperature: 43.3555C
Vcc: 3.3803Volts
ModuleThresholdValues:
TempHighAlarm : 75.0000C
TempHighWarning: 70.0000C
TempLowAlarm : -5.0000C
TempLowWarning : 0.0000C
VccHighAlarm : 3.5700Volts
VccHighWarning : 3.4650Volts
VccLowAlarm : 3.0400Volts
VccLowWarning : 3.1350Volts

Now, I unplug the optics and re-insert them.
Afterward, the data from the optic shows changes.

Ethernet72: SFP EEPROM detected
Application Advertisement: 400GAUI-8 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4 200GAUI-4 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4
Connector: N/A
Encoding: Not supported
Extended Identifier: N/A
Extended RateSelect Compliance: Not supported
Identifier: N/A
Length Cable Assembly(m): N/A
Nominal Bit Rate(100Mbs): Not supported
Specification compliance:
N/A
Vendor Date Code(YYYY-MM-DD Lot): N/A
Vendor Name: N/A
Vendor OUI: 34-78-77
Vendor PN: 1AT-5QAM03XX-10A
Vendor Rev: A
Vendor SN: 4QA-0000022
MonitorData:
RXPower: 1.3985dBm
TXBias: 5.978mA
TXPower: 1.919dBm
Temperature: 41.7695C
Vcc: 3.3795Volts
ThresholdData:
TempHighAlarm : 75.0000C
TempHighWarning: 70.0000C
TempLowAlarm : -5.0000C
TempLowWarning : 0.0000C
VccHighAlarm : 3.5700Volts
VccHighWarning : 3.4650Volts
VccLowAlarm : 3.0400Volts
VccLowWarning : 3.1350Volts
RxPowerHighAlarm : 4.7699dBm
RxPowerHighWarning: 3.9799dBm
RxPowerLowAlarm : -6.9919dBm
RxPowerLowWarning : -6.0206dBm
TxBiasHighAlarm : 9.0000mA
TxBiasHighWarning : 8.5000mA
TxBiasLowAlarm : 4.0000mA
TxBiasLowWarning : 4.5000mA
TxPowerHighAlarm : 4.7699dBm
TxPowerHighWarning: 3.9799dBm
TxPowerLowAlarm : -6.9919dBm
TxPowerLowWarning : -6.0206dBm

So, it looks like there remains some instability in the I2C reads in this switch.
Also, the whole structure of the output changes.

Things I know:

  • It doesn’t always happen the same way. Different fields can be affected on different iterations.
  • It seems that rebooting has a pretty good chance of clearing up the problem. I am not 100% sure this works all the time.
  • It happens on more than one port, but doesn’t seem to be guaranteed to happen on any given port.

Things I don’t know:

  • I don’t know if optics from any other vendor show the same behavior.
  • I don’t know if it happened before on an older version of the OS, because we had problems with the "before" measurement.
  • I don’t know if rebooting solves the problem all the time.

xcvrd exercises non-host_tx_ready change in value event from state-DB port-table

Below is the snippet where xcvrd processes an event that is not a host_tx_ready change in the state-DB port table. The full log is in the attachment "xcvrd_non_tx_ready_event_processing".

May 26 20:17:37.886250 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248 Forcing Tx laser OFF

May 26 20:17:37.904964 sonic WARNING pmon#xcvrd[32]: $$$ Ethernet248 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '', 'supported_fecs': 'rs', 'host_tx_ready': 'true'}
May 26 20:17:37.905200 sonic WARNING pmon#xcvrd[32]: *** Ethernet248STATE_DBPORT_TABLE handle_port_update_event() fvp {'host_tx_ready': 'true', 'index': '-1', 'key': 'Ethernet248', 'asic_id': 2, 'op': 'SET'}
<<<<<<<<<<<<<<<<<<<<<<<
May 26 20:17:37.942950 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248: 400G, lanemask=0xff, state=INSERTED, appl 1 host_lane_count 8 retries=0
May 26 20:17:38.008109 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248: Setting appl=1
May 26 20:17:38.070372 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248: Setting lanemask=0xff

Implement Global CLI to enable performance monitoring for coherent optics.

Implement global CLI to enable/disable the performance monitoring feature.
PM feature HLD : sonic-net/SONiC#1258

config
    pm         # Global-performance monitoring feature over capable coherent transceiver

config pm
    enable     # Enable performance monitoring on all ports
    disable    # Disable performance monitoring on all ports

https://github.com/sonic-net/SONiC/blob/d0364205764aa49decfca1eb669c76d7f0c8f4e0/doc/platform_api/CMIS_and_C-CMIS_support_for_ZR.md#751-configurations

Active Firmware version not available from CMIS modules without CDB feature

This is a regression in the latest code base: the Active Firmware version can no longer be obtained from modules that do not implement the CDB feature. The current logic tries to retrieve firmware information using a CDB command. Because CDB is an optional feature, it is not expected to work in all cases.

The active firmware version should instead be retrieved from the information advertised in page0.

root@sonic:/home/cisco# show int transceiver eeprom Ethernet192
Ethernet192: SFP EEPROM detected
Active Firmware: N/A
Active application selected code assigned to host lane 1: 2
Active application selected code assigned to host lane 2: 2
Active application selected code assigned to host lane 3: 2
Active application selected code assigned to host lane 4: 2
Active application selected code assigned to host lane 5: 2
Active application selected code assigned to host lane 6: 2
Active application selected code assigned to host lane 7: 2
Active application selected code assigned to host lane 8: 2
Application Advertisement: 100GAUI-2 C2M (Annex 135G) - Host Assign (0x55) - 100GBASE-DR (Cl 140) - Media Assign (0xf)
400GAUI-8 C2M (Annex 120E) - Host Assign (0x1) - 400GBASE-DR4 (Cl 124) - Media Assign (0x1)
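
As a rough illustration of reading the version from advertised memory instead of CDB, the sketch below formats a major.minor string from a raw byte dump. The byte offsets 39 and 40 are an assumption about where the active firmware major and minor revisions are advertised and should be checked against the CMIS specification. The function name is hypothetical.

```python
ACTIVE_FW_MAJOR_OFFSET = 39  # assumed offset of the major revision byte
ACTIVE_FW_MINOR_OFFSET = 40  # assumed offset of the minor revision byte


def active_firmware_version(lower_memory):
    """Format 'major.minor' from a bytes-like dump of the lower memory."""
    if len(lower_memory) <= ACTIVE_FW_MINOR_OFFSET:
        return "N/A"  # dump too short to contain both bytes
    major = lower_memory[ACTIVE_FW_MAJOR_OFFSET]
    minor = lower_memory[ACTIVE_FW_MINOR_OFFSET]
    return "{}.{}".format(major, minor)
```

A read like this does not depend on CDB support, so it would work on modules that omit that optional feature.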

[xcvrd] Python3 Compatibility - Exits abruptly on dynamic transceiver tuning

Xcvrd exits abruptly during dynamic transceiver tuning.
The issue is seen in the latest master image, which invokes xcvrd with Python 3 by default.

Logs:

/var/log/syslog.16.gz:6106:Jan  7 16:59:33.813092 sonic INFO pmon#/supervisord: xcvrd Traceback (most recent call last):
/var/log/syslog.16.gz:6107:Jan  7 16:59:33.813346 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/bin/xcvrd", line 8, in <module>
/var/log/syslog.16.gz:6108:Jan  7 16:59:33.813517 sonic INFO pmon#/supervisord: xcvrd     sys.exit(main())
/var/log/syslog.16.gz:6109:Jan  7 16:59:33.813698 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1376, in main
/var/log/syslog.16.gz:6110:Jan  7 16:59:33.813840 sonic INFO pmon#/supervisord: xcvrd     xcvrd.run()
/var/log/syslog.16.gz:6111:Jan  7 16:59:33.813980 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1324, in run
/var/log/syslog.16.gz:6112:Jan  7 16:59:33.814124 sonic INFO pmon#/supervisord: xcvrd     self.init()
/var/log/syslog.16.gz:6113:Jan  7 16:59:33.814270 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1289, in init
/var/log/syslog.16.gz:6114:Jan  7 16:59:33.814845 sonic INFO pmon#/supervisord: xcvrd     post_port_sfp_dom_info_to_db(is_warm_start, self.stop_event)
/var/log/syslog.16.gz:6115:Jan  7 16:59:33.814845 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 492, in post_port_sfp_dom_info_to_db
/var/log/syslog.16.gz:6116:Jan  7 16:59:33.814845 sonic INFO pmon#/supervisord: xcvrd     notify_media_setting(logical_port_name, transceiver_dict, app_port_tbl[asic_index])
/var/log/syslog.16.gz:6117:Jan  7 16:59:33.817393 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 727, in notify_media_setting
/var/log/syslog.16.gz:6118:Jan  7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd     key = get_media_settings_key(physical_port, transceiver_dict)
/var/log/syslog.16.gz:6119:Jan  7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 627, in get_media_settings_key
/var/log/syslog.16.gz:6120:Jan  7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd     vendor_key = string.upper(vendor_name_str) + '-' + vendor_pn_str
/var/log/syslog.16.gz:6121:Jan  7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd AttributeError: module 'string' has no attribute 'upper'
/var/log/syslog.16.gz:6213:Jan  7 16:59:35.027516 sonic INFO pmon#/supervisor-proc-exit-listener: Process xcvrd exited unxepectedly. Terminating supervisor...
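
The traceback points at string.upper(), a module-level function that Python 2 provided and Python 3 removed. A minimal sketch of the equivalent call using the str.upper() method follows; the function name build_vendor_key and the sample values are hypothetical and are not taken from xcvrd.

```python
def build_vendor_key(vendor_name_str, vendor_pn_str):
    # str.upper() is available on both Python 2 and Python 3,
    # unlike string.upper(), which no longer exists in Python 3.
    return vendor_name_str.upper() + '-' + vendor_pn_str


# Hypothetical vendor name and part number, for illustration only.
print(build_vendor_key("Acme", "PN-1234"))
```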

Error log "Error in obtaining media setting for EthernetX" not representing an error

Please take a look at the following error log:
https://github.com/sonic-net/sonic-platform-daemons/blob/3b969c3142210d0439d11aa480fb29afb1ac546a/sonic-xcvrd/xcvrd/xcvrd.py#L777C1-L777C105

On Nvidia platforms, we expect to see this log printed for every non-CMIS module as part of the code's regular operation, even though it doesn't represent an actual error. This is because our media_settings.json file intentionally contains data for CMIS modules only. That's why for CMIS modules we expect the data from the json to be published to APP_DB, and for the others we expect encountering this particular message.

Is there a possibility to modify this log from 'log_error()' to 'log_notice()'?

thermalctld does not report alarm logging correctly for critical alarms for thermal sensors

For any thermal sensor, when the current temperature exceeds the critical low/high threshold, the alarm message logged to syslog is incorrect.
It reports the high/low threshold value for a critical alarm, when it should report the critical high/low threshold value.
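
One way to picture the requested behaviour is a formatter that picks the threshold label from the alarm severity. This is only a sketch: format_thermal_alarm and its parameters are hypothetical, and the message wording follows the strings quoted in this report rather than the actual thermalctld source.

```python
def format_thermal_alarm(name, temperature, threshold, high=True, critical=False):
    # "High" or "Low" depending on which side of the range was crossed.
    level = "High" if high else "Low"
    # Critical alarms name the critical threshold instead of the regular one.
    prefix = "critical " if critical else ""
    return "{} temperature warning: {} current temperature {}C, {}{} threshold {}".format(
        level, name, temperature, prefix, level.lower(), threshold)


# Hypothetical sensor name and values.
print(format_thermal_alarm("PSU1 Outlet_Temp", 105.0, 102.0, high=True, critical=True))
```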

Current for critical thermal alarms:
"High/Low temperature warning: {} current temperature {}C, high/low threshold {}"

Expected for critical thermal alarms:
"High/low temperature warning: {} current temperature {}C, critical high/low threshold {}"

On config interface shutdown, xcvrd to not reset CMIS state

While reviewing the CMIS PR changeset (#254), I found the following issue.
After discussion with @prgeor, the following needs to be corrected/enhanced:

Issue: Why is the CMIS state reset when config interface shutdown is applied? It is supposed to avoid app_code reprogramming during this state.
@jaganbal-a updated: As mentioned in the PR comment, we can add a new state, CMIS_STATE_HOST_READY, to be set under “elif state == self.CMIS_STATE_DP_DEINIT:”, and self.CMIS_STATE_AP_CONF will be set only after host_tx_ready turns true. This prevents the state from moving from DP_DEINIT to AP_CONF by default.

thermalctld adds nonpresent PSUs sensors to the TEMPERATURE_INFO table

For any non-present PSU, thermalctld adds its sensors to the Redis database with all values set to N/A.
thermalctld should check for PSU presence before adding the PSU's sensors to the DB.

show platform temperature

            PSU0.0 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:04
            PSU0.1 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:04
            PSU0.2 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:04
            PSU1.0 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:04
            PSU1.1 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:04
            PSU1.2 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:04
            PSU2.0 Outlet_Temp         46.019       97.0      -5.0           102.0          -10.0      False  20230214 14:05:05
            PSU2.1 Outlet_Temp         46.019       97.0      -5.0           102.0          -10.0      False  20230214 14:05:05
            PSU2.2 Outlet_Temp         46.019       97.0      -5.0           102.0          -10.0      False  20230214 14:05:05
            PSU3.0 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU3.1 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU3.2 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU4.0 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU4.1 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU4.2 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU5.0 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU5.1 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:05
            PSU5.2 Outlet_Temp            N/A        N/A       N/A             N/A            N/A      False  20230214 14:05:0

db entry

"TEMPERATURE_INFO|PSU0.0 Outlet_Temp": {
    "expireat": 1676383627.929436,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "critical_high_threshold": "N/A",
      "critical_low_threshold": "N/A",
      "high_threshold": "N/A",
      "is_replaceable": "False",
      "low_threshold": "N/A",
      "maximum_temperature": "N/A",
      "minimum_temperature": "N/A",
      "temperature": "N/A",
      "timestamp": "20230214 14:06:59",
      "warning_status": "False"
    }
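
The presence check requested above could look roughly like the following. This is a sketch under assumptions: FakePsu and FakeThermal are hypothetical stand-ins for platform objects, and the method names get_presence() and get_all_thermals() are assumed to match what the platform API offers; thermalctld's real update loop is not reproduced here.

```python
class FakeThermal:
    """Hypothetical stand-in for a PSU thermal sensor."""
    def __init__(self, name):
        self.name = name


class FakePsu:
    """Hypothetical stand-in for a platform PSU object."""
    def __init__(self, present, thermals):
        self._present = present
        self._thermals = thermals

    def get_presence(self):
        return self._present

    def get_all_thermals(self):
        return self._thermals


def thermals_of_present_psus(psus):
    # Collect sensors only from PSUs that report presence, so that
    # absent PSUs never contribute all-N/A rows to the database.
    result = []
    for psu in psus:
        if not psu.get_presence():
            continue
        result.extend(psu.get_all_thermals())
    return result


psus = [
    FakePsu(False, [FakeThermal("PSU0.0 Outlet_Temp")]),
    FakePsu(True, [FakeThermal("PSU2.0 Outlet_Temp")]),
]
print([t.name for t in thermals_of_present_psus(psus)])
```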

Introduce SONiC CLI to fetch and display HW (ASIC, PHY) stats/counters

PR: https://github.com/sonic-net/sonic-buildimage/pull/11731
With the above fix/PR, “sonic-clear macsec” clears the stats, and the stats appear cleared in a subsequent “show macsec Ethernet<>” CLI call.

Looking a little deeper:
This clears the stats from the SW table (redis/FLEX) only and does not clear them at the PHY/device (HW) level.
A subsequent “show macsec Ethernet<>” CLI call displays the delta stats (i.e. [stats fetched from platform/HW] – [stats cleared]).

Impact:
This approach can cause a mismatch between the stats that SW displays and the stats available in the PHY.
During live traffic, this might complicate stats validation when debugging traffic drops, etc.
Mismatch:
After triggering “sonic-clear …”, stats in “show counters…” would be (HW – cached) counters, whereas stats at the PHY (device) level would still be the original HW stats/counters (since no one has cleared them), so the two would not match.
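
The mismatch can be illustrated with simple arithmetic. The sketch below assumes the clear operation only stores a snapshot of the HW counter in software; the function name and the counter values are hypothetical.

```python
def displayed_counter(hw_value, cleared_snapshot):
    # The CLI shows the HW counter minus the value cached at clear time;
    # the HW counter itself is never reset.
    return hw_value - cleared_snapshot


hw_at_clear = 1000          # HW counter when the clear was issued (hypothetical)
hw_now = 1500               # HW counter read later (hypothetical)

shown = displayed_counter(hw_now, hw_at_clear)
print("CLI shows:", shown, "PHY holds:", hw_now)
```

With these numbers the CLI view and the PHY view differ by exactly the cached snapshot, which is the discrepancy described above.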

Proposal A:
“sonic-clear macsec” can call the PHY-SAI API to clear the stats on the device (PHY/HW), so that the real stats fetched from the PHY are shown without calculations at the upper layer (SONiC SW table in redis/FLEX DB).
Or
Proposal B:
Do we have any command to display HW stats at SONiC (command) level?
If not, should we introduce one? At least, there would be a way to check and compare stats at SW and HW (device) level.

is_warm_reboot_enabled() in xcvrd.py can not correctly detect warm reboot due to late start of pmon

pmon has delayed start enabled, so it starts late.
On a warm reboot, WARM_RESTART_ENABLE_TABLE|system is set to true, and after WARMBOOT_FINALIZER completes it is set back to false.
However, pmon starts after WARMBOOT_FINALIZER has finished, so is_warm_reboot_enabled() returns false.

[2024-08-07 14:33:34.623] admin@sonic:~$ redis-cli -n 6 hgetall "WARM_RESTART_ENABLE_TABLE|system"
[2024-08-07 14:33:36.695] 1) "enable"
[2024-08-07 14:33:36.695] 2) "true"
[2024-08-07 14:33:36.695] admin@sonic:~$ docker ps
[2024-08-07 14:33:38.784] CONTAINER ID   IMAGE                             COMMAND                  CREATED          STATUS         PORTS     NAMES
[2024-08-07 14:33:38.784] 42ba626dd413   docker-router-advertiser:latest   "/usr/bin/docker-ini…"   13 minutes ago   Up 2 minutes             radv
[2024-08-07 14:33:38.804] ff130d42c83e   docker-fpm-frr:latest             "/usr/bin/docker_ini…"   13 minutes ago   Up 2 minutes             bgp
[2024-08-07 14:33:38.804] 13bcfde779d8   docker-syncd-brcm:latest          "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             syncd
[2024-08-07 14:33:38.842] c033a238d0ba   docker-teamd:latest               "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             teamd
[2024-08-07 14:33:38.842] 52c8cfbcae85   docker-orchagent:latest           "/usr/bin/docker-ini…"   13 minutes ago   Up 2 minutes             swss
[2024-08-07 14:33:38.842] 0bf4250f1268   docker-eventd:latest              "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             eventd
[2024-08-07 14:33:38.865] ab0e2e9b9120   docker-database:latest            "/usr/local/bin/dock…"   13 minutes ago   Up 2 minutes             database
[2024-08-07 14:33:38.865] admin@sonic:~$ redis-cli -n 6 hgetall "WARM_RESTART_ENABLE_TABLE|system"
[2024-08-07 14:33:41.034] 1) "enable"
[2024-08-07 14:33:41.034] 2) "false"
[2024-08-07 14:33:41.034] admin@sonic:~$ docker ps
[2024-08-07 14:33:43.004] CONTAINER ID   IMAGE                             COMMAND                  CREATED          STATUS         PORTS     NAMES
[2024-08-07 14:33:43.004] 5243a9eda309   docker-sonic-gnmi:latest          "/usr/local/bin/supe…"   11 minutes ago   Up 1 second              gnmi
[2024-08-07 14:33:43.025] 42ba626dd413   docker-router-advertiser:latest   "/usr/bin/docker-ini…"   13 minutes ago   Up 2 minutes             radv
[2024-08-07 14:33:43.025] ff130d42c83e   docker-fpm-frr:latest             "/usr/bin/docker_ini…"   13 minutes ago   Up 2 minutes             bgp
[2024-08-07 14:33:43.046] 13bcfde779d8   docker-syncd-brcm:latest          "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             syncd
[2024-08-07 14:33:43.046] c033a238d0ba   docker-teamd:latest               "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             teamd
[2024-08-07 14:33:43.046] 52c8cfbcae85   docker-orchagent:latest           "/usr/bin/docker-ini…"   13 minutes ago   Up 2 minutes             swss
[2024-08-07 14:33:43.088] 0bf4250f1268   docker-eventd:latest              "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             eventd
[2024-08-07 14:33:43.088] ab0e2e9b9120   docker-database:latest            "/usr/local/bin/dock…"   13 minutes ago   Up 2 minutes             database
[2024-08-07 14:33:43.103] admin@sonic:~$ redis-cli -n 6 hgetall "WARM_RESTART_ENABLE_TABLE|system"docker ps
[2024-08-07 14:33:49.514] CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS         PORTS     NAMES
[2024-08-07 14:33:49.514] c77c720ab127   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   11 minutes ago   Up 2 seconds             pmon
[2024-08-07 14:33:49.537] 2965e04beadd   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   11 minutes ago   Up 4 seconds             mgmt-framework
[2024-08-07 14:33:49.537] 4d653ad6155f   docker-lldp:latest                   "/usr/bin/docker-lld…"   11 minutes ago   Up 6 seconds             lldp
[2024-08-07 14:33:49.563] 5243a9eda309   docker-sonic-gnmi:latest             "/usr/local/bin/supe…"   11 minutes ago   Up 8 seconds             gnmi
[2024-08-07 14:33:49.563] 42ba626dd413   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   13 minutes ago   Up 2 minutes             radv
[2024-08-07 14:33:49.604] ff130d42c83e   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   13 minutes ago   Up 2 minutes             bgp
[2024-08-07 14:33:49.604] 13bcfde779d8   docker-syncd-brcm:latest             "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             syncd
[2024-08-07 14:33:49.604] c033a238d0ba   docker-teamd:latest                  "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             teamd
[2024-08-07 14:33:49.623] 52c8cfbcae85   docker-orchagent:latest              "/usr/bin/docker-ini…"   13 minutes ago   Up 2 minutes             swss
[2024-08-07 14:33:49.623] 0bf4250f1268   docker-eventd:latest                 "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes             eventd
[2024-08-07 14:33:49.655] ab0e2e9b9120   docker-database:latest               "/usr/local/bin/dock…"   13 minutes ago   Up 2 minutes             database

Enhance pcieutil to check actual presence of pcie device

pcieutil checks the sysfs entry to decide whether a PCIe device is present. For a non-hotplug PCIe device, the sysfs entry is not updated if, say, the downstream port of a PCIe switch to which the device is connected receives a hot reset. In this case the PCIe device is disconnected from the bus, but the sysfs entry does not reflect the missing device.
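
One possible approach, which this report does not confirm, is to look at the device's configuration space instead of relying on the sysfs directory alone: a vendor ID read from a device that no longer responds typically comes back as all ones (0xFFFF). The helper below is a hypothetical sketch that only parses bytes already read from a config-space file; how pcieutil would obtain them is left open.

```python
import struct


def vendor_id_indicates_presence(config_bytes):
    # The vendor ID is the first little-endian 16-bit word of PCI config space.
    if len(config_bytes) < 2:
        return False
    (vendor_id,) = struct.unpack_from("<H", config_bytes, 0)
    # 0xFFFF is what a read returns when no device answers.
    return vendor_id != 0xFFFF


# Hypothetical samples: a responding device and a missing one.
print(vendor_id_indicates_presence(bytes([0xE4, 0x14, 0x00, 0x00])))
print(vendor_id_indicates_presence(bytes([0xFF, 0xFF, 0xFF, 0xFF])))
```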

Add intelligence to xcvrd to understand process restart and config reload

Upon xcvrd/pmon docker restart, xcvrd sets the media settings in APP DB, which triggers spurious SI programming from OA when it receives them. xcvrd should not set the settings on a process restart or pmon restart.

So add intelligence to xcvrd to distinguish a process restart from a config reload.

  • xcvrd creates a new table "TABLE_xcvrd" in STATE_DB to hold a magic number and a process restart count upon fresh start (config reload/cold reboot).
  • Upon xcvrd process restart or pmon docker restart, xcvrd checks for the magic number while coming up to determine whether this is a restart or a fresh start.
  • Add a hook in the config reload CLI handler to clear this magic number upon config reload.
  • Mask spurious notifications during process restart.
  • Possibly add another flag to the same table to differentiate between config reload and cold boot.
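
The proposal above can be sketched with a plain dictionary standing in for STATE_DB. Everything except the table name is an assumption: the field names, the magic value, and the helper functions are hypothetical.

```python
XCVRD_TABLE = "TABLE_xcvrd"      # table name taken from the proposal
MAGIC = "0x5843"                 # hypothetical magic number


def detect_start_type(state_db):
    """Return "fresh" or "restart" and update the bookkeeping entry."""
    entry = state_db.get(XCVRD_TABLE)
    if entry is None or entry.get("magic") != MAGIC:
        # No magic number: cold boot or config reload.
        state_db[XCVRD_TABLE] = {"magic": MAGIC, "restart_count": "0"}
        return "fresh"
    # Magic number survived: the process or the pmon container restarted.
    entry["restart_count"] = str(int(entry["restart_count"]) + 1)
    return "restart"


def on_config_reload(state_db):
    # Hook for the config reload handler: drop the entry so that the
    # next xcvrd start is treated as fresh.
    state_db.pop(XCVRD_TABLE, None)


db = {}
print(detect_start_type(db))
print(detect_start_type(db))
on_config_reload(db)
print(detect_start_type(db))
```

On a "restart" result, xcvrd would skip publishing media settings; on "fresh" it would publish them as it does today.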

Consumer of same field subscribed to different DBs to have DBs/TABLE check prior to processing event

While reviewing the CMIS PR changeset (#254), I found the following issue:

Issue: A field with the same name in tables of different DBs (STATE/APPL DB) causes a conflict when an event occurs.
(a) Port table in APPL DB – caters to speed, lane, admin status, etc. changes
(b) Port table in STATE DB – caters to host_tx_ready, admin status, etc. changes
When config interface was applied on a port/interface, the admin status change was found to be false in one table and true in the other.

Plan: Reviewed this with @prgeor.
Add a DB type check when an event is received.
Also investigate and correct why this field (admin status, and any other field) is mismatched between the two tables.
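
A DB type check could be as simple as keying cached fields by the originating DB as well as by port, so that a same-named field from one table never overwrites the other. The sketch below is hypothetical: the function names and the cache layout are assumptions and do not mirror xcvrd's event handling.

```python
def record_port_event(cache, db_name, port, fields):
    # Key by (db_name, port) so that APPL_DB and STATE_DB updates for the
    # same field name are stored separately.
    cache.setdefault((db_name, port), {}).update(fields)


def get_port_field(cache, db_name, port, field):
    return cache.get((db_name, port), {}).get(field)


cache = {}
record_port_event(cache, "APPL_DB", "Ethernet0", {"admin_status": "up"})
record_port_event(cache, "STATE_DB", "Ethernet0", {"admin_status": "down"})

print(get_port_field(cache, "APPL_DB", "Ethernet0", "admin_status"))
print(get_port_field(cache, "STATE_DB", "Ethernet0", "admin_status"))
```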

Command 'sudo sfpshow eeprom' could not be executed

Error message:

sudo sfpshow eeprom
Traceback (most recent call last):
  File "/usr/local/bin/sfpshow", line 722, in <module>
    cli()
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/bin/sfpshow", line 646, in eeprom
    sfp.get_eeprom()
  File "/usr/local/lib/python3.9/dist-packages/utilities_common/multi_asic.py", line 157, in wrapped_run_on_all_asics
    func(self,  *args, **kwargs)
  File "/usr/local/bin/sfpshow", line 552, in get_eeprom
    self.intf_eeprom[interface] = self.convert_interface_sfp_info_to_cli_output_string(
  File "/usr/local/bin/sfpshow", line 449, in convert_interface_sfp_info_to_cli_output_string
    sfp_info_output = self.convert_sfp_info_to_output_string(sfp_info_dict)
  File "/usr/local/bin/sfpshow", line 337, in convert_sfp_info_to_output_string
    output += '{}{}: {}\n'.format(indent, data_map[key], sfp_info_dict[key])
KeyError: 'active_firmware'

This seems to have started with these two commits:

7792838

sonic-net/sonic-platform-common@796e89a

The firmware version fields were moved from the TRANSCEIVER_INFO table to the TRANSCEIVER_FIRMWARE_INFO table, but no corresponding change was made in the sfpshow script. This causes the KeyError, because the script still searches only the old TRANSCEIVER_INFO table, where the firmware version keys no longer exist.
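
A defensive lookup along the following lines would avoid the KeyError. This is a sketch, not the actual sfpshow fix: lookup_sfp_field is a hypothetical helper, and the two dictionaries stand for rows already read from TRANSCEIVER_INFO and TRANSCEIVER_FIRMWARE_INFO.

```python
def lookup_sfp_field(key, sfp_info_dict, firmware_info_dict):
    # Prefer the original table, then fall back to the firmware table,
    # and finally to a placeholder instead of raising KeyError.
    if key in sfp_info_dict:
        return sfp_info_dict[key]
    return firmware_info_dict.get(key, "N/A")


# Hypothetical rows for illustration.
sfp_info = {"vendor_name": "Acme"}
firmware_info = {"active_firmware": "1.2.3"}

print(lookup_sfp_field("active_firmware", sfp_info, firmware_info))
```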

[psud] Psu status led not set on init

When psud starts, it calls _set_psu_led only if there is a state change.
Because PsuStatus has some default state, such as its present attribute being True, there is never a state change to act upon.

The best way to solve this without breaking backward compatibility would be to initialize set_led in _update_single_psu_data to True on the first run.
This could be done by adding an attribute self.firstrun to DaemonPsud that would be initialized to True and later set to False after a loop iteration in the run method.
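
The first-run idea can be sketched as follows. The classes are heavily simplified, hypothetical stand-ins for PsuStatus and DaemonPsud; only the first_run flag logic is the point, and led_updates is a counter added here in place of a real LED call.

```python
class PsuStatusSketch:
    def __init__(self):
        self.presence = True      # default state, as described in the report

    def set_presence(self, presence):
        # Returns True only when the state actually changes.
        if presence == self.presence:
            return False
        self.presence = presence
        return True


class DaemonPsudSketch:
    def __init__(self):
        self.first_run = True
        self.status = PsuStatusSketch()
        self.led_updates = 0      # counts how often the LED would be set

    def _update_single_psu_data(self, presence):
        # Force one LED update on the first iteration, even without a change.
        set_led = self.first_run
        if self.status.set_presence(presence):
            set_led = True
        if set_led:
            self.led_updates += 1

    def run_once(self, presence):
        self._update_single_psu_data(presence)
        self.first_run = False    # cleared after the first loop iteration


daemon = DaemonPsudSketch()
daemon.run_once(True)
print(daemon.led_updates)
```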

CLI checker (out of range values) for tx power and freq configuration

Problem description:

Out-of-range values for laser frequency and Tx power configuration should be rejected before the configuration is committed.
Currently, out-of-range values are written to CONFIG_DB but are not programmed to the transceiver because of logical checks in the lower layer in xcvrd.
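
A CLI-side check might look like the sketch below. The helper name and the limits are hypothetical; real limits would have to come from what the module advertises, which this report does not specify.

```python
def check_range(name, value, low, high):
    # Reject the value before it is written to the configuration database.
    if value < low or value > high:
        raise ValueError(
            "{} {} is outside the supported range [{}, {}]".format(name, value, low, high))
    return value


# Hypothetical Tx power limits in dBm.
try:
    check_range("tx_power", 5.0, -10.0, 0.0)
except ValueError as err:
    print("rejected:", err)
```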

xcvrd NameError: global name 'select' is not defined

Current master gives me the following error in syslog:

Oct  4 00:09:52.935763 str-s6100-acs-1 INFO pmon#xcvrd: Start main loop
Oct  4 00:09:52.936035 str-s6100-acs-1 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
Oct  4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd   File "/usr/bin/xcvrd", line 449, in <module>
Oct  4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd     main()
Oct  4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd   File "/usr/bin/xcvrd", line 415, in main
Oct  4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd     status, port_dict = platform_sfputil.get_transceiver_change_event()
Oct  4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd   File "/usr/share/sonic/platform/plugins/sfputil.py", line 454, in get_transceiver_change_event
Oct  4 00:09:52.937327 str-s6100-acs-1 INFO pmon#supervisord: xcvrd     epoll = select.epoll()
Oct  4 00:09:52.937327 str-s6100-acs-1 INFO pmon#supervisord: xcvrd NameError: global name 'select' is not defined

Invalid port info in state DB after configuring split port in config_db.json without changing port_config.ini/platform.json

Description

We found that a user can configure a split port by directly changing config_db.json without modifying port_config.ini or platform.json. However, xcvrd still reads the port configuration from port_config.ini/platform.json, which results in wrong information in the DB.

Steps to reproduce the issue:

  1. Configure a split port on Ethernet0 as “1 split 4” in config_db.json
  2. Reboot the switch
  3. The split port works well, but xcvrd still reads the old port_config.ini and saves its contents to the DB.

Describe the results you received:

Invalid port information in state DB

Describe the results you expected:

State DB has correct port information

media settings is not getting applied for 100G xcvrs

In xcvrd, media settings are looked up by the media_settings key in the media_settings.json file. The key is derived from the compliance_code in the xcvr data.

In xcvrd, the media_settings key is derived only from the "10/40G Ethernet Compliance Code" for QSFP28 transceivers.
xcvrd needs to be changed so that other transceivers are also recognised when constructing the media_settings key, which can then be used to extract emphasis settings from media_settings.json.

Build failure due to TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop

We are seeing a build failure due to TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop. Below is the trace:

[2023-06-21T12:27:31.213Z] =================================== FAILURES ===================================
[2023-06-21T12:27:31.213Z] ____________ TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop _____________
[2023-06-21T12:27:31.213Z]
[2023-06-21T12:27:31.213Z] self = <tests.test_xcvrd.TestXcvrdScript object at 0x7ff1f404b610>
[2023-06-21T12:27:31.213Z]
[2023-06-21T12:27:31.213Z] @patch('xcvrd.xcvrd_utilities.port_mapping.subscribe_port_config_change', MagicMock(return_value=(None, None)))
[2023-06-21T12:27:31.213Z] def test_SfpStateUpdateTask_task_run_stop(self):
[2023-06-21T12:27:31.213Z] port_mapping = PortMapping()
[2023-06-21T12:27:31.213Z] stop_event = threading.Event()
[2023-06-21T12:27:31.213Z] sfp_error_event = threading.Event()
[2023-06-21T12:27:31.213Z] task = SfpStateUpdateTask(DEFAULT_NAMESPACE, port_mapping, stop_event, sfp_error_event)
[2023-06-21T12:27:31.213Z] task.start()
[2023-06-21T12:27:31.213Z] assert wait_until(5, 1, task.is_alive)
[2023-06-21T12:27:31.213Z] task.raise_exception()
[2023-06-21T12:27:31.213Z] > task.join()
[2023-06-21T12:27:31.213Z]
[2023-06-21T12:27:31.213Z] tests/test_xcvrd.py:1041:
[2023-06-21T12:27:31.214Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:2200: in join
[2023-06-21T12:27:31.214Z] raise self.exc
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:2175: in run
[2023-06-21T12:27:31.214Z] self.task_worker(self.task_stopping_event, self.sfp_error_event)
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:1987: in task_worker
[2023-06-21T12:27:31.214Z] self.init()
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:1905: in init
[2023-06-21T12:27:31.214Z] self.retry_eeprom_set = self._post_port_sfp_info_and_dom_thr_to_db_once(port_mapping_data, self.xcvr_table_helper, self.main_thread_stop_event)
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:1845: in _post_port_sfp_info_and_dom_thr_to_db_once
[2023-06-21T12:27:31.214Z] warmstart.initialize("xcvrd", "pmon")
[2023-06-21T12:27:31.214Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] app_name = 'xcvrd', docker_name = 'pmon', db_timeout = 0, isTcpConn = False
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] @staticmethod
[2023-06-21T12:27:31.214Z] def initialize(app_name, docker_name, db_timeout=0, isTcpConn=False):
[2023-06-21T12:27:31.214Z] > return _swsscommon.WarmStart_initialize(app_name, docker_name, db_timeout, isTcpConn)
[2023-06-21T12:27:31.214Z] E RuntimeError: Unable to connect to redis (unix-socket): Cannot assign requested address
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] /usr/lib/python3/dist-packages/swsscommon/swsscommon.py:3369: RuntimeError
[2023-06-21T12:27:31.214Z] =============================== warnings summary ===============================
[2023-06-21T12:27:31.214Z] /usr/lib/python3/dist-packages/_pytest/junitxml.py:446
[2023-06-21T12:27:31.214Z] /usr/lib/python3/dist-packages/_pytest/junitxml.py:446: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0. See:
[2023-06-21T12:27:31.214Z] https://docs.pytest.org/en/stable/deprecations.html#junit-family-default-value-change-to-xunit2
[2023-06-21T12:27:31.214Z] for more information.
[2023-06-21T12:27:31.214Z] _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] -- Docs: https://docs.pytest.org/en/stable/warnings.html
[2023-06-21T12:27:31.215Z] - generated xml file: /sonic/src/sonic-platform-daemons/sonic-xcvrd/test-results.xml -
[2023-06-21T12:27:31.215Z]
[2023-06-21T12:27:31.215Z] ----------- coverage: platform linux, python 3.9.2-final-0 -----------
[2023-06-21T12:27:31.215Z] Name Stmts Miss Cover
[2023-06-21T12:27:31.215Z] ----------------------------------------------------------------
[2023-06-21T12:27:31.215Z] xcvrd/init.py 0 0 100%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd.py 1548 358 77%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd_utilities/init.py 0 0 100%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd_utilities/port_mapping.py 191 26 86%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd_utilities/sfp_status_helper.py 27 2 93%
[2023-06-21T12:27:31.215Z] ----------------------------------------------------------------
[2023-06-21T12:27:31.215Z] TOTAL 1766 386 78%
[2023-06-21T12:27:31.215Z] Coverage HTML written to dir htmlcov
[2023-06-21T12:27:31.215Z] Coverage XML written to file coverage.xml
[2023-06-21T12:27:31.215Z]
[2023-06-21T12:27:31.215Z] =========================== short test summary info ============================
[2023-06-21T12:27:31.215Z] FAILED tests/test_xcvrd.py::TestXcvrdScript::test_SfpStateUpdateTask_task_run_stop
[2023-06-21T12:27:31.215Z] =================== 1 failed, 69 passed, 1 warning in 9.18s ====================
[2023-06-21T12:27:31.215Z] [ FAIL LOG END ] [ target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl ]

CMIS FSM (state machine) timer to be fetched from optics EEPROM instead of a hard-coded value

While working with a 400G ZR optical module, I found an issue in the CMIS FSM: it timed out on a hard-coded 1-minute value.
The optical module's DPInit timeout differs by optics type/variant and is defined in the spec (byte 144), with the lower nibble (bits 0-3) holding the DPInit timeout.

Fetch this value from the optics EEPROM, along with the optics variant/type, in the CMIS FSM instead of using the present hard-coded value.
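
Extracting the field described above is a one-line mask. The sketch below only isolates the 4-bit code from a byte that has already been read; the function name is hypothetical, and translating the code into an actual duration requires the encoding table from the CMIS specification, which is not reproduced here.

```python
def dpinit_timeout_code(byte_144):
    # The lower nibble (bits 0-3) of byte 144 carries the DPInit timeout code.
    if not 0 <= byte_144 <= 0xFF:
        raise ValueError("expected a single byte value")
    return byte_144 & 0x0F


# Hypothetical raw byte value read from the module EEPROM.
print(dpinit_timeout_code(0xA9))
```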

media_settings.json are not decoded properly on ports with unused lanes

I am adding the media_settings.json file for x86_64-arista_7800r3a_36d2_lc. This platform has 36 QSFP-DD ports. The media_settings.json contains the tuning data for lane0-lane7 on each QSFP-DD port. But this makes the 36x100G SKU unhappy because the xcvrd sets an 8-lane tuning data for a 4-lane 100G port. The Jupiter SAI will fail port serdes creation because the number of port serdes does not match the number of tuning parameters.

root@s1:~# sonic-db-cli -n asic1 APPL_DB hgetall "PORT_TABLE:Ethernet240"
{'admin_status': 'up', 'alias': 'Ethernet31/1', 'asic_port_name': 'Eth240-ASIC1', 'coreId': '0', 'corePortId': '25', 'description': 'Ethernet240-connected-to-nv419@eth56/1', 'fec': 'rs', 'index': '31', 'lanes': '40,41,42,43', 'mtu': '9100', 'numVoq': '8', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '100000', 'tpid': '0x8100', 'oper_status': 'up', 'main': '0x4e,0x4e,0x4b,0x4b,0x4b,0x4e,0x4b,0x4e', 'post1': '-0x16,-0x16,-0x14,-0x14,-0x14,-0x16,-0x14,-0x16', 'post2': '0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0', 'post3': '0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0', 'pre1': '-0x5,-0x5,-0x5,-0x5,-0x5,-0x5,-0x5,-0x5', 'pre2': '0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0'}

The problem is that the method "get_media_val_str" in sonic-platform-daemons/sonic-xcvrd/xcvrd/xcvrd.py cannot handle this case properly. I am thinking about how to handle this issue. Should we fix the logic in get_media_val_str, read the number of lanes from config_db, or make an SKU-specific media_settings.json file?
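
The first option, limiting the tuning values to the lanes the port actually uses, might be sketched as below. trim_media_values is a hypothetical helper, the per-lane dictionary layout ("lane0", "lane1", ...) is assumed from the description of media_settings.json above, and the sketch always keeps the lowest-numbered lanes, which may not be right for every breakout layout.

```python
def trim_media_values(media_dict, lane_count):
    # Keep only lane0 .. lane{lane_count - 1} for every tuning parameter.
    trimmed = {}
    for key, per_lane in media_dict.items():
        trimmed[key] = {
            "lane{}".format(i): per_lane["lane{}".format(i)]
            for i in range(lane_count)
            if "lane{}".format(i) in per_lane
        }
    return trimmed


# Hypothetical 8-lane entry applied to a 4-lane port.
media = {"main": {"lane{}".format(i): "0x4e" for i in range(8)}}
print(trim_media_values(media, 4))
```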

xcvrd: logical error in CMIS taskworker: double negate lead to module stuck in Tx_Disable state.

Problem scenario: 400G ports are up and then swss restarts. The following event sequence reproduces the issue.
Event 1: admin state is set to up -> with this event, the module is put into Tx disable.
Event 2: host_tx_ready is set to false -> with this event, a double-negate condition occurs, which leaves the optics in the Tx disable state while the SW state goes to CMIS_STATE_READY.

The relevant code is in the "application update is required" check and in the CMIS task_worker():
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1224
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1601
