sonic-net / sonic-platform-daemons
Platform module daemons for SONiC
License: Other
When I build the master branch of sonic-buildimage (sonic-broadcom.bin), I sometimes get this error:
=================================== FAILURES ===================================
____________ TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop _____________
self = <tests.test_xcvrd.TestXcvrdScript object at 0x7f11132cb0d0>
    @patch('xcvrd.xcvrd_utilities.port_mapping.subscribe_port_config_change', MagicMock(return_value=(None, None)))
    def test_SfpStateUpdateTask_task_run_stop(self):
        port_mapping = PortMapping()
        retry_eeprom_set = set()
        stop_event = threading.Event()
        sfp_error_event = threading.Event()
        task = SfpStateUpdateTask(DEFAULT_NAMESPACE, port_mapping, retry_eeprom_set, stop_event, sfp_error_event)
        task.start()
        assert wait_until(5, 1, task.is_alive)
        task.raise_exception()
>       task.join()
tests/test_xcvrd.py:1041:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xcvrd/xcvrd.py:2177: in join
raise self.exc
xcvrd/xcvrd.py:2152: in run
self.task_worker(self.task_stopping_event, self.sfp_error_event)
xcvrd/xcvrd.py:1965: in task_worker
port_mapping.handle_port_config_change(sel, asic_context, stopping_event, self.port_mapping, helper_logger, self.on_port_config_change)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sel = None, asic_context = None
stop_event = <threading.Event object at 0x7f1113ccbc40>
port_mapping = <xcvrd.xcvrd_utilities.port_mapping.PortMapping object at 0x7f1113ccbcd0>
logger = <sonic_py_common.logger.Logger object at 0x7f1115fcf970>
port_change_event_handler = <bound method SfpStateUpdateTask.on_port_config_change of <SfpStateUpdateTask(SfpStateUpdateTask, stopped 139711322162944)>>
    def handle_port_config_change(sel, asic_context, stop_event, port_mapping, logger, port_change_event_handler):
        """Select CONFIG_DB PORT table changes, once there is a port configuration add/remove, notify observers
        """
        if not stop_event.is_set():
>           (state, _) = sel.select(SELECT_TIMEOUT_MSECS)
E           AttributeError: 'NoneType' object has no attribute 'select'
xcvrd/xcvrd_utilities/port_mapping.py:218: AttributeError
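The traceback shows sel arriving as None, because the test patches subscribe_port_config_change to return (None, None). One possible way to tolerate that, sketched here with a stand-in function rather than the real xcvrd code, is to skip the select when no selector is available. The function name, the FakeSelect class and the timeout value are all assumptions made for illustration.

```python
import threading

SELECT_TIMEOUT_MSECS = 1000  # assumed value, for illustration only


def handle_port_config_change_safe(sel, stop_event, on_change):
    """Hypothetical variant that tolerates a missing selector.

    Returns True if a select was attempted, False if it was skipped.
    """
    if stop_event.is_set() or sel is None:
        return False
    state, _ = sel.select(SELECT_TIMEOUT_MSECS)
    on_change(state)
    return True


class FakeSelect:
    """Minimal stand-in for a selector object (not the swsscommon class)."""

    def select(self, timeout_ms):
        return ("TIMEOUT", None)


events = []
stop = threading.Event()
skipped = handle_port_config_change_safe(None, stop, events.append)
handled = handle_port_config_change_safe(FakeSelect(), stop, events.append)
```

A caller that loops on this function would still need to sleep or wait on the stop event when the select is skipped, otherwise the loop would spin.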
This issue tracks the overall effort required to handle CMIS_STATE_FAILED in xcvrd, including a way to notify the user of this status and to take corrective action where applicable.
Also, depending on the step at which CMIS_STATE_FAILED is set for a port, we may need to shut the port down and, if needed, perform DP deinit.
Hi, I would like to know why the 'speed_tolerance' field should be deleted.
As far as I know, the command 'show system-health summary' calls 'hardware_checker.py' (on the master branch), which reads 'speed_tolerance' from the Redis database.
If 'speed_tolerance' is deleted, 'show system-health summary' reports an error:
root@sonic:/home/admin# show system-health summary
System status summary
System status LED amber
Services:
Status: OK
Hardware:
Status: Not OK
Reasons: Failed to get speed tolerance for Fantray7_2
Failed to get speed tolerance for Fantray7_1
Failed to get speed tolerance for Fantray6_2
Failed to get speed tolerance for Fantray6_1
Failed to get speed tolerance for Fantray5_2
Failed to get speed tolerance for Fantray5_1
Failed to get speed tolerance for Fantray4_2
Failed to get speed tolerance for Fantray4_1
Failed to get speed tolerance for Fantray3_2
Failed to get speed tolerance for Fantray3_1
Failed to get speed tolerance for Fantray2_2
Failed to get speed tolerance for Fantray2_1
Failed to get speed tolerance for Fantray1_2
Failed to get speed tolerance for Fantray1_1
root@sonic:/home/admin# show platform fan
Drawer LED FAN Speed Direction Presence Status Timestamp
-------- ----- ---------- ------- ----------- ---------- -------- -----------------
Fantray1 green Fantray1_1 56% INTAKE Present OK 20221222 15:29:07
Fantray1 green Fantray1_2 56% INTAKE Present OK 20221222 15:29:07
Fantray2 green Fantray2_1 55% INTAKE Present OK 20221222 15:29:07
Fantray2 green Fantray2_2 56% INTAKE Present OK 20221222 15:29:07
Fantray3 green Fantray3_1 55% INTAKE Present OK 20221222 15:29:07
Fantray3 green Fantray3_2 56% INTAKE Present OK 20221222 15:29:07
Fantray4 green Fantray4_1 55% INTAKE Present OK 20221222 15:29:07
Fantray4 green Fantray4_2 56% INTAKE Present OK 20221222 15:29:07
Fantray5 green Fantray5_1 56% INTAKE Present OK 20221222 15:29:07
Fantray5 green Fantray5_2 56% INTAKE Present OK 20221222 15:29:07
Fantray6 green Fantray6_1 56% INTAKE Present OK 20221222 15:29:07
Fantray6 green Fantray6_2 56% INTAKE Present OK 20221222 15:29:07
Fantray7 green Fantray7_1 55% INTAKE Present OK 20221222 15:29:07
Fantray7 green Fantray7_2 56% INTAKE Present OK 20221222 15:29:07
N/A N/A PSU1_FAN1 29% INTAKE Present OK 20221222 15:29:07
N/A N/A PSU2_FAN1 29% INTAKE Present OK 20221222 15:29:07
Problem description:
At present, only 75 GHz grid frequency values can be configured (the 75 GHz grid is hardcoded).
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1295
Configuring any other frequency crashes the xcvrd child process: the current code raises an exception if the value is out of range or is not a 75 GHz grid channel frequency.
Every frequency the module is capable of should be allowed for configuration/programming, and any frequency the module is not capable of should be rejected by the CLI.
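As a rough illustration of rejecting a bad frequency without raising, the sketch below returns a result instead of throwing. It assumes frequencies are given in GHz, that grids are anchored at 193.1 THz, and that the supported range is passed in by the caller; the function name and the range used in the example are hypothetical and are not taken from xcvrd.

```python
ANCHOR_GHZ = 193100  # 193.1 THz anchor frequency, expressed in GHz


def check_frequency(freq_ghz, grid_ghz, low_ghz, high_ghz):
    """Return (ok, reason) for a requested laser frequency.

    Hypothetical helper: reports a problem to the caller instead of raising,
    so a bad value cannot take down the worker that applies it.
    """
    if not low_ghz <= freq_ghz <= high_ghz:
        return False, "out of range"
    if (freq_ghz - ANCHOR_GHZ) % grid_ghz != 0:
        return False, "off grid"
    return True, "ok"


# Example range only; a real module advertises its own limits.
LOW, HIGH = 191300, 196100
on_75 = check_frequency(193175, 75, LOW, HIGH)
off_75 = check_frequency(193150, 75, LOW, HIGH)
on_50 = check_frequency(193150, 50, LOW, HIGH)
too_low = check_frequency(190000, 75, LOW, HIGH)
```

The same check could run in the CLI before the value is written to the database, so that unsupported values never reach the daemon.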
Issue:
Steps to reproduce
Logs:
root@sonic:~# redis-cli -n 6 hgetall "FAN_INFO|PSU1 Fan"
1) "led_status"
2) "None"
To fix this issue:
root@sonic:/# redis-cli -n 6 hgetall "FAN_INFO|PSU1 Fan"
1) "presence"
2) "True"
3) "status"
4) "Updating"
5) "direction"
6) "exhaust"
7) "speed"
8) "67"
9) "timestamp"
10) "20201223 01:48:55"
11) "led_status"
12) "None"
One more issue: the PSU fan status values are "Updating" / "N/A".
In the system health daemon, the expected values for the PSU fan status are "True" / "False".
Thermalctld also updates the PSU fan status to "True" / "False".
So we need your input on whether the PSU daemon can report the PSU fan status as "True" / "False" instead of "Updating" / "N/A".
If this is not done, a "PSU Fan is broken" error will be logged in redis-db for the system health table.
Problem description:
At present, xcvrd creates child processes for the task workers that handle SFP state transitions and the CMIS handler. If one of these processes crashes, the crash goes unnoticed.
Proposal: move xcvrd from a multi-process model to a multi-threaded one.
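One common way to make a worker crash visible in a threaded design is for the thread to remember the exception and re-raise it from join(). The sketch below shows that pattern in isolation; the class and worker names are hypothetical, and it is not meant to reproduce the xcvrd task classes.

```python
import threading


class WorkerThread(threading.Thread):
    """Thread that remembers an exception raised by its worker."""

    def __init__(self, worker):
        super().__init__()
        self.worker = worker
        self.exc = None

    def run(self):
        try:
            self.worker()
        except Exception as e:
            # Keep the exception so the parent can see it later.
            self.exc = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc is not None:
            raise self.exc


def failing_worker():
    raise RuntimeError("worker crashed")


task = WorkerThread(failing_worker)
task.start()
try:
    task.join()
    crash_noticed = False
except RuntimeError:
    crash_noticed = True
```

With this shape, the main daemon loop can join its workers and decide whether to restart or exit when one of them fails.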
Symptoms:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1366, in task_worker
post_port_dom_info_to_db(logical_port_name, self.port_mapping, xcvr_table_helper.get_dom_tbl(asic_index), self.task_stopping_event, dom_info_cache=dom_info_cache)
File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 434, in post_port_dom_info_to_db
dom_info_dict = _wrapper_get_transceiver_dom_info(physical_port)
File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 154, in _wrapper_get_transceiver_dom_info
return platform_chassis.get_sfp(physical_port).get_transceiver_bulk_status()
File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/sfp_optoe_base.py", line 28, in get_transceiver_bulk_status
return api.get_transceiver_bulk_status() if api is not None else None
File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 205, in get_transceiver_bulk_status
self.vdm_dict = self.get_vdm()
File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 1015, in get_vdm
vdm = self.vdm.get_vdm_allpage() if not self.is_flat_memory() else {}
File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py", line 178, in get_vdm_allpage
vdm_current_page = self.get_vdm_page(page, vdm_flag_page)
File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py", line 55, in get_vdm_page
vdm_typeID = vdm_descriptor[1::2]
TypeError: 'NoneType' object is not subscriptable
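The TypeError comes from slicing a descriptor that was read back as None. A defensive version of that slicing step might look like the sketch below; the function name is hypothetical, and whether skipping the page is the right recovery is a decision for the maintainers.

```python
def split_vdm_descriptor(vdm_descriptor):
    """Return every second byte of the descriptor, or [] if the read failed."""
    if vdm_descriptor is None:
        # The EEPROM read returned nothing; report no entries instead of crashing.
        return []
    return list(vdm_descriptor[1::2])


type_ids = split_vdm_descriptor(bytes([0, 1, 0, 2]))
no_ids = split_vdm_descriptor(None)
```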
This failure happens about once in every three builds. It is not consistent and may be a timing issue.
=================================== FAILURES ===================================
____________ TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop _____________
self = <tests.test_xcvrd.TestXcvrdScript object at 0x7fbd2e6508b0>
    @patch('xcvrd.xcvrd_utilities.port_mapping.subscribe_port_config_change', MagicMock(return_value=(None, None)))
    def test_SfpStateUpdateTask_task_run_stop(self):
        port_mapping = PortMapping()
        retry_eeprom_set = set()
        stop_event = threading.Event()
        sfp_error_event = threading.Event()
        task = SfpStateUpdateTask(DEFAULT_NAMESPACE, port_mapping, retry_eeprom_set, stop_event, sfp_error_event)
        task.start()
        assert wait_until(5, 1, task.is_alive)
E       assert False
E + where False = wait_until(5, 1, <bound method Thread.is_alive of <SfpStateUpdateTask(SfpStateUpdateTask, stopped 140450521253632)>>)
E + where <bound method Thread.is_alive of <SfpStateUpdateTask(SfpStateUpdateTask, stopped 140450521253632)>> = <SfpStateUpdateTask(SfpStateUpdateTask, stopped 140450521253632)>.is_alive
tests/test_xcvrd.py:1039: AssertionError
=============================== warnings summary ===============================
/usr/lib/python3/dist-packages/_pytest/junitxml.py:446
/usr/lib/python3/dist-packages/_pytest/junitxml.py:446: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0. See:
https://docs.pytest.org/en/stable/deprecations.html#junit-family-default-value-change-to-xunit2
for more information.
_issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
TOTAL 1760 368 79%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
=========================== short test summary info ============================
FAILED tests/test_xcvrd.py::TestXcvrdScript::test_SfpStateUpdateTask_task_run_stop
=================== 1 failed, 69 passed, 1 warning in 16.38s ===================
[ FAIL LOG END ] [ target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl ]
make: *** [slave.mk:878: target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl] Error 1
When running the sonic-vs image, docker-platform-monitor keeps restarting with this Python error:
Jul 24 13:48:14.209452 vlab-01 INFO pmon#thermalctld: Starting up...
Jul 24 13:48:14.209929 vlab-01 INFO pmon#supervisord: thermalctld Traceback (most recent call last):
Jul 24 13:48:14.210241 vlab-01 INFO pmon#supervisord: thermalctld File "/usr/bin/thermalctld", line 590, in <module>
Jul 24 13:48:14.210501 vlab-01 INFO pmon#supervisord: thermalctld main()
Jul 24 13:48:14.210843 vlab-01 INFO pmon#supervisord: thermalctld File "/usr/bin/thermalctld", line 586, in main
Jul 24 13:48:14.211067 vlab-01 INFO pmon#supervisord: thermalctld thermal_control.run()
Jul 24 13:48:14.211349 vlab-01 INFO pmon#supervisord: thermalctld File "/usr/bin/thermalctld", line 547, in run
Jul 24 13:48:14.211599 vlab-01 INFO pmon#supervisord: thermalctld import sonic_platform.platform
Jul 24 13:48:14.211861 vlab-01 INFO pmon#supervisord: thermalctld ImportError: No module named sonic_platform.platform
I found a potential solution at https://www.gitmemory.com/jleveque. However, it appears that I can neither access the source code repository nor build sonic-vs.img on my own to fix it.
Is there a workaround for this?
As per the media-based port settings HLD, if the lookup with 'vendor_key = vendor_name + vendor_PN' fails, the key should be reduced to 'vendor_key = vendor_name' and media_settings searched again.
Current implementation:
get_media_settings_key
vendor_key = vendor_name_str.upper() + '-' + vendor_pn_str
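The fallback described in the HLD could be sketched as a two-step lookup. The function name and the shape of the settings dictionary below are assumptions made for illustration; only the key format (upper-cased vendor name, a dash, then the part number) comes from the current implementation.

```python
def find_media_settings(media_settings, vendor_name, vendor_pn):
    """Look up settings by 'NAME-PN' first, then by 'NAME' alone."""
    name = vendor_name.strip().upper()
    full_key = name + '-' + vendor_pn.strip()
    for key in (full_key, name):
        if key in media_settings:
            return key, media_settings[key]
    return None, None


# Hypothetical settings table with one exact entry and one vendor-wide entry.
settings = {
    "ACME-PN100": {"preemphasis": "0x1"},
    "OTHERCO": {"preemphasis": "0x2"},
}
exact = find_media_settings(settings, "Acme", "PN100")
vendor_only = find_media_settings(settings, "OtherCo", "PN999")
missing = find_media_settings(settings, "Nobody", "PN1")
```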
In the Z9100 T0 profile (C8D48), the xcvrd process exits with the error below, and sfpshow presence shows the transceivers as Not present. Transceiver output is shown only for Ethernet0.
root@sonic-z9100-02:/var/log# xcvrd
Traceback (most recent call last):
File "/usr/bin/xcvrd", line 735, in <module>
main()
File "/usr/bin/xcvrd", line 732, in main
xcvrd.run()
File "/usr/bin/xcvrd", line 696, in run
self.init()
File "/usr/bin/xcvrd", line 680, in init
post_port_sfp_dom_info_to_db(is_warm_start, self.stop_event)
File "/usr/bin/xcvrd", line 238, in post_port_sfp_dom_info_to_db
notify_media_setting(logical_port_name, transceiver_dict, app_port_tbl)
File "/usr/bin/xcvrd", line 473, in notify_media_setting
media_dict[media_key])
File "/usr/bin/xcvrd", line 414, in get_media_val_str
start_lane = logical_idx * num_lanes_per_logical_port
NameError: global name 'logical_idx' is not defined
With the fix below, we are able to fetch the values.
def get_media_val_str(num_logical_ports, lane_dict):
changed to
def get_media_val_str(num_logical_ports, logical_idx, lane_dict):
and, at line 473,
for media_key in media_dict:
    if type(media_dict[media_key]) is dict:
        media_val_str = get_media_val_str(num_logical_ports, \
                                          media_dict[media_key])
changed to
        media_val_str = get_media_val_str(num_logical_ports, \
                                          logical_idx, media_dict[media_key])
root@sonic-z9100-02:~# sfpshow presence
Port Presence
----------- -----------
Ethernet0 Present
Ethernet2 Not present
Ethernet4 Not present
Ethernet6 Not present
Ethernet8 Not present
Ethernet10 Not present
Ethernet12 Not present
Ethernet14 Not present
Ethernet16 Not present
Ethernet18 Not present
Ethernet20 Not present
Ethernet22 Not present
Ethernet24 Not present
Ethernet28 Not present
Ethernet32 Not present
Ethernet36 Not present
Ethernet40 Not present
Ethernet42 Not present
Ethernet44 Not present
Ethernet46 Not present
Ethernet48 Not present
Ethernet50 Not present
Ethernet52 Not present
Ethernet54 Not present
Ethernet56 Not present
Ethernet58 Not present
Ethernet60 Not present
Ethernet62 Not present
Ethernet64 Not present
Ethernet66 Not present
Ethernet68 Not present
Ethernet70 Not present
Ethernet72 Not present
Ethernet74 Not present
Ethernet76 Not present
Ethernet78 Not present
Ethernet80 Not present
Ethernet82 Not present
Ethernet84 Not present
Ethernet86 Not present
Ethernet88 Not present
Ethernet90 Not present
Ethernet92 Not present
Ethernet94 Not present
Ethernet96 Not present
Ethernet98 Not present
Ethernet100 Not present
Ethernet102 Not present
Ethernet104 Not present
Ethernet108 Not present
Ethernet112 Not present
Ethernet116 Not present
Ethernet120 Not present
Ethernet122 Not present
Ethernet124 Not present
Ethernet126 Not present
On the 20220531.28 image, the show mux status command is slow and always returns inconsistent for HWSTATUS:
$ time show mux s
PORT STATUS SERVER_STATUS HEALTH HWSTATUS LAST_SWITCHOVER_TIME
---------- -------- --------------- -------- ------------ ---------------------------
Ethernet4 active active healthy inconsistent 2023-May-30 05:46:51.795124
Ethernet8 active active healthy inconsistent 2023-May-30 05:46:51.734891
Ethernet12 active active healthy inconsistent 2023-May-30 05:46:51.956624
Ethernet16 active active healthy inconsistent 2023-May-30 05:46:51.755520
Ethernet20 active active healthy inconsistent 2023-May-30 05:46:52.007026
Ethernet24 active active healthy inconsistent 2023-May-30 05:46:51.826122
Ethernet28 active active healthy inconsistent 2023-May-30 05:46:52.048576
Ethernet32 active active healthy inconsistent 2023-May-30 05:46:51.719157
Ethernet36 active active healthy inconsistent 2023-May-30 05:46:51.845886
Ethernet40 active active healthy inconsistent 2023-May-30 05:46:51.981137
Ethernet44 active active healthy inconsistent 2023-May-30 05:46:51.912922
Ethernet48 active active healthy inconsistent 2023-May-30 05:46:51.924832
Ethernet52 active active healthy inconsistent 2023-May-30 05:46:52.120600
Ethernet56 active active healthy inconsistent 2023-May-30 05:46:51.992982
Ethernet60 active active healthy inconsistent 2023-May-30 05:46:51.771509
Ethernet64 active active healthy inconsistent 2023-May-30 05:46:51.898908
Ethernet68 active active healthy inconsistent 2023-May-30 05:46:52.034526
Ethernet72 active active healthy inconsistent 2023-May-30 05:46:52.061691
Ethernet76 active active healthy inconsistent 2023-May-30 05:46:52.095044
Ethernet80 active active healthy inconsistent 2023-May-30 05:46:52.020575
Ethernet84 active active healthy inconsistent 2023-May-30 05:46:51.687364
Ethernet88 active active healthy inconsistent 2023-May-30 05:46:52.132286
Ethernet92 active active healthy inconsistent 2023-May-30 05:46:52.110357
Ethernet96 active active healthy inconsistent 2023-May-30 05:46:51.969622
real 0m26.449s
user 0m1.366s
sys 0m0.469s
$ show mux grpc mux
Port Direction Presence PeerDirection ConnectivityState
---------- ----------- ---------- --------------- -------------------
Ethernet4 active True active READY
Ethernet8 active True active READY
Ethernet12 active True active READY
Ethernet16 active True active READY
Ethernet20 active True active READY
Ethernet24 active True active READY
Ethernet28 active True active READY
Ethernet32 active True active READY
Ethernet36 active True active READY
Ethernet40 active True active READY
Ethernet44 active True active READY
Ethernet48 active True active READY
Ethernet52 active True active READY
Ethernet56 active True active READY
Ethernet60 active True active READY
Ethernet64 active True active READY
Ethernet68 active True active READY
Ethernet72 active True active READY
Ethernet76 active True active READY
Ethernet80 active True active READY
Ethernet84 active True active READY
Ethernet88 active True active READY
Ethernet92 active True active READY
Ethernet96 active True active READY
problem with "show interfaces transceiver eeprom -d"
We now have .06 installed on the switch and I have found a new, but seemingly related, issue.
When we installed .06 and rebooted/reloaded the switch, the EEPROM data from the ONet AOCs seemed to be correct. Then, today, I continued with my testing and started getting test failures during our hotswap test. This is the test where we check the function of the optics before and after unplugging and re-inserting them.
Here is an example of what is happening.
Before unplugging the optic we have what looks like correct data.
Ethernet72: SFP EEPROM detected
Application Advertisement: 400GAUI-8 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4 200GAUI-4 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4
Connector: No separable connector
Encoding: Not supported
Extended Identifier: Power Class 1(10.0W Max)
Extended RateSelect Compliance: Not supported
Identifier: QSFP-DD Double Density 8X Pluggable Transceiver
Length Cable Assembly(m): 3.0
Nominal Bit Rate(100Mbs): Not supported
Specification compliance: active_cable_media_interface
Vendor Date Code(YYYY-MM-DD Lot): 2021-02-05 01
Vendor Name: O-NET
Vendor OUI: 34-78-77
Vendor PN: 1AT-5QAM03XX-10A
Vendor Rev: A
Vendor SN: 4QA-0000022
ChannelMonitorValues:
RX1Power: 1.4476dBm
RX2Power: 1.4826dBm
RX3Power: 1.502dBm
RX4Power: 1.3437dBm
RX5Power: 1.3792dBm
RX6Power: 1.4461dBm
RX7Power: 1.2031dBm
RX8Power: 1.213dBm
TX1Bias: 5.986mA
TX1Power: 1.9011dBm
TX2Bias: 6.27mA
TX2Power: 2.1024dBm
TX3Bias: 6.178mA
TX3Power: 2.0382dBm
TX4Bias: 6.104mA
TX4Power: 1.986dBm
TX5Bias: 6.058mA
TX5Power: 1.9529dBm
TX6Bias: 6.152mA
TX6Power: 2.02dBm
TX7Bias: 5.892mA
TX7Power: 1.8324dBm
TX8Bias: 6.082mA
TX8Power: 1.97dBm
ChannelThresholdValues:
RxPowerHighAlarm : 4.7699dBm
RxPowerHighWarning: 3.9799dBm
RxPowerLowAlarm : -6.9919dBm
RxPowerLowWarning : -6.0206dBm
TxBiasHighAlarm : 9.0000mA
TxBiasHighWarning : 8.5000mA
TxBiasLowAlarm : 4.0000mA
TxBiasLowWarning : 4.5000mA
TxPowerHighAlarm : 4.7699dBm
TxPowerHighWarning: 3.9799dBm
TxPowerLowAlarm : -6.9919dBm
TxPowerLowWarning : -6.0206dBm
ModuleMonitorValues:
Temperature: 43.3555C
Vcc: 3.3803Volts
ModuleThresholdValues:
TempHighAlarm : 75.0000C
TempHighWarning: 70.0000C
TempLowAlarm : -5.0000C
TempLowWarning : 0.0000C
VccHighAlarm : 3.5700Volts
VccHighWarning : 3.4650Volts
VccLowAlarm : 3.0400Volts
VccLowWarning : 3.1350Volts
Now, I unplug the optics and re-insert them.
Afterward, the data from the optic shows changes.
Ethernet72: SFP EEPROM detected
Application Advertisement: 400GAUI-8 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4 200GAUI-4 C2M (Annex 120E) - Active Cable assembly with BER < 2.6x10^-4
Connector: N/A
Encoding: Not supported
Extended Identifier: N/A
Extended RateSelect Compliance: Not supported
Identifier: N/A
Length Cable Assembly(m): N/A
Nominal Bit Rate(100Mbs): Not supported
Specification compliance:
N/A
Vendor Date Code(YYYY-MM-DD Lot): N/A
Vendor Name: N/A
Vendor OUI: 34-78-77
Vendor PN: 1AT-5QAM03XX-10A
Vendor Rev: A
Vendor SN: 4QA-0000022
MonitorData:
RXPower: 1.3985dBm
TXBias: 5.978mA
TXPower: 1.919dBm
Temperature: 41.7695C
Vcc: 3.3795Volts
ThresholdData:
TempHighAlarm : 75.0000C
TempHighWarning: 70.0000C
TempLowAlarm : -5.0000C
TempLowWarning : 0.0000C
VccHighAlarm : 3.5700Volts
VccHighWarning : 3.4650Volts
VccLowAlarm : 3.0400Volts
VccLowWarning : 3.1350Volts
RxPowerHighAlarm : 4.7699dBm
RxPowerHighWarning: 3.9799dBm
RxPowerLowAlarm : -6.9919dBm
RxPowerLowWarning : -6.0206dBm
TxBiasHighAlarm : 9.0000mA
TxBiasHighWarning : 8.5000mA
TxBiasLowAlarm : 4.0000mA
TxBiasLowWarning : 4.5000mA
TxPowerHighAlarm : 4.7699dBm
TxPowerHighWarning: 3.9799dBm
TxPowerLowAlarm : -6.9919dBm
TxPowerLowWarning : -6.0206dBm
So, it looks like there remains some instability in the I2C reads in this switch.
Also, the whole structure of the output changes.
Things I know:
It doesn’t always happen the same way. Different fields can be affected on different iterations.
It seems that rebooting has a pretty good chance of clearing up the problem. I am not 100% sure this works all the time.
It happens on more than one port, but doesn’t seem to be guaranteed to happen on any given port.
Things I don’t know:
I don’t know if optics from any other vendor show the same behavior.
I don’t know if it happened on older versions of the OS, because we had problems with the "before" measurement.
I don’t know if rebooting solves the problem all the time.
Below is the snippet where xcvrd processes the event without host_tx_ready in the STATE_DB port table. The full log is in the attachment "xcvrd_non_tx_ready_event_processing".
May 26 20:17:37.886250 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248 Forcing Tx laser OFF
May 26 20:17:37.904964 sonic WARNING pmon#xcvrd[32]: $$$ Ethernet248 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '', 'supported_fecs': 'rs', 'host_tx_ready': 'true'}
May 26 20:17:37.905200 sonic WARNING pmon#xcvrd[32]: *** Ethernet248STATE_DBPORT_TABLE handle_port_update_event() fvp {'host_tx_ready': 'true', 'index': '-1', 'key': 'Ethernet248', 'asic_id': 2, 'op': 'SET'}
<<<<<<<<<<<<<<<<<<<<<<<
May 26 20:17:37.942950 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248: 400G, lanemask=0xff, state=INSERTED, appl 1 host_lane_count 8 retries=0
May 26 20:17:38.008109 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248: Setting appl=1
May 26 20:17:38.070372 sonic NOTICE pmon#xcvrd[32]: CMIS: Ethernet248: Setting lanemask=0xff
Implement a global CLI to enable/disable the performance monitoring feature.
PM feature HLD: sonic-net/SONiC#1258
config
  pm            # Global performance monitoring feature for capable coherent transceivers
config pm
  enable        # Enable performance monitoring on all ports
  disable       # Disable performance monitoring on all ports
There is a regression in the latest code base when obtaining the active firmware version from modules that do not implement the CDB feature. The current logic tries to retrieve the firmware information using a CDB command. Since CDB is an optional feature, this is not expected to work in all cases.
The active firmware version should instead be retrieved from the information advertised in page 0.
root@sonic:/home/cisco# show int transceiver eeprom Ethernet192
Ethernet192: SFP EEPROM detected
Active Firmware: N/A
Active application selected code assigned to host lane 1: 2
Active application selected code assigned to host lane 2: 2
Active application selected code assigned to host lane 3: 2
Active application selected code assigned to host lane 4: 2
Active application selected code assigned to host lane 5: 2
Active application selected code assigned to host lane 6: 2
Active application selected code assigned to host lane 7: 2
Active application selected code assigned to host lane 8: 2
Application Advertisement: 100GAUI-2 C2M (Annex 135G) - Host Assign (0x55) - 100GBASE-DR (Cl 140) - Media Assign (0xf)
400GAUI-8 C2M (Annex 120E) - Host Assign (0x1) - 400GBASE-DR4 (Cl 124) - Media Assign (0x1)
Xcvrd exits abruptly during dynamic transceiver tuning.
The issue is seen in the latest master image, which by default invokes xcvrd with Python 3.
Logs:
/var/log/syslog.16.gz:6106:Jan 7 16:59:33.813092 sonic INFO pmon#/supervisord: xcvrd Traceback (most recent call last):
/var/log/syslog.16.gz:6107:Jan 7 16:59:33.813346 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/bin/xcvrd", line 8, in <module>
/var/log/syslog.16.gz:6108:Jan 7 16:59:33.813517 sonic INFO pmon#/supervisord: xcvrd sys.exit(main())
/var/log/syslog.16.gz:6109:Jan 7 16:59:33.813698 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1376, in main
/var/log/syslog.16.gz:6110:Jan 7 16:59:33.813840 sonic INFO pmon#/supervisord: xcvrd xcvrd.run()
/var/log/syslog.16.gz:6111:Jan 7 16:59:33.813980 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1324, in run
/var/log/syslog.16.gz:6112:Jan 7 16:59:33.814124 sonic INFO pmon#/supervisord: xcvrd self.init()
/var/log/syslog.16.gz:6113:Jan 7 16:59:33.814270 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1289, in init
/var/log/syslog.16.gz:6114:Jan 7 16:59:33.814845 sonic INFO pmon#/supervisord: xcvrd post_port_sfp_dom_info_to_db(is_warm_start, self.stop_event)
/var/log/syslog.16.gz:6115:Jan 7 16:59:33.814845 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 492, in post_port_sfp_dom_info_to_db
/var/log/syslog.16.gz:6116:Jan 7 16:59:33.814845 sonic INFO pmon#/supervisord: xcvrd notify_media_setting(logical_port_name, transceiver_dict, app_port_tbl[asic_index])
/var/log/syslog.16.gz:6117:Jan 7 16:59:33.817393 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 727, in notify_media_setting
/var/log/syslog.16.gz:6118:Jan 7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd key = get_media_settings_key(physical_port, transceiver_dict)
/var/log/syslog.16.gz:6119:Jan 7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 627, in get_media_settings_key
/var/log/syslog.16.gz:6120:Jan 7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd vendor_key = string.upper(vendor_name_str) + '-' + vendor_pn_str
/var/log/syslog.16.gz:6121:Jan 7 16:59:33.818147 sonic INFO pmon#/supervisord: xcvrd AttributeError: module 'string' has no attribute 'upper'
/var/log/syslog.16.gz:6213:Jan 7 16:59:35.027516 sonic INFO pmon#/supervisor-proc-exit-listener: Process xcvrd exited unxepectedly. Terminating supervisor...
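The AttributeError arises because the module-level function string.upper() existed only in Python 2; the str.upper() method works in both versions. A minimal sketch of building the key that way follows (the helper name is hypothetical):

```python
import string


def build_vendor_key(vendor_name_str, vendor_pn_str):
    # Use the str method rather than the Python 2 only string.upper() function.
    return vendor_name_str.upper() + '-' + vendor_pn_str


key = build_vendor_key("Acme", "PN100")
# On Python 3 the string module no longer provides upper().
has_legacy_upper = hasattr(string, "upper")
```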
Please take a look at the following error log:
https://github.com/sonic-net/sonic-platform-daemons/blob/3b969c3142210d0439d11aa480fb29afb1ac546a/sonic-xcvrd/xcvrd/xcvrd.py#L777C1-L777C105
On Nvidia platforms, we expect to see this log printed for every non-CMIS module as part of the code's regular operation, even though it doesn't represent an actual error. This is because our media_settings.json file intentionally contains data for CMIS modules only. So for CMIS modules we expect the data from the JSON file to be published to APP_DB, and for the others we expect to encounter this particular message.
Is there a possibility to modify this log from 'log_error()' to 'log_notice()'?
Convert unit tests to use Pytest, and also take advantage of pytest-cov coverage reporting.
For any thermal sensor, when the current temperature crosses the critical low/high threshold, the alarm logged to syslog is not correct.
For a critical alarm, it still reports the high/low threshold value when it should report the critical high/low threshold value.
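The intended distinction can be sketched as a small formatting helper that names the threshold kind according to the alarm severity. The function and its parameters are hypothetical; the message wording follows the expected format quoted in this issue.

```python
def temperature_alarm_message(name, temperature, threshold, high, critical):
    """Build the syslog text for a thermal alarm (illustrative helper)."""
    direction = "High" if high else "Low"
    kind = ("critical " if critical else "") + ("high" if high else "low")
    return "{} temperature warning: {} current temperature {}C, {} threshold {}".format(
        direction, name, temperature, kind, threshold)


critical_msg = temperature_alarm_message("CPU", 105, 102, high=True, critical=True)
normal_msg = temperature_alarm_message("CPU", -6, -5, high=False, critical=False)
```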
Current for critical thermal alarms:
"High/Low temperature warning: {} current temperature {}C, high/low threshold {}"
Expected for critical thermal alarms:
"High/low temperature warning: {} current temperature {}C, critical high/low threshold {}"
While reviewing the CMIS PR changeset (#254), the following issue was found.
Discussed with @prgeor; the following is to be corrected/enhanced:
Issue: Why is the CMIS state reset when an interface shutdown is configured? The state machine is supposed to avoid app_code reprogramming in this state.
@jaganbal-a updated: As mentioned in the PR comment, we can add a new state, CMIS_STATE_HOST_READY, to be set under “elif state == self.CMIS_STATE_DP_DEINIT:”; self.CMIS_STATE_AP_CONF will then be set only after host_tx_ready turns true. This prevents the state from moving from DP_DEINIT to AP_CONF by default.
For any nonpresent PSU, thermalctld will add its sensors to the redis database, with all values N/A.
thermalctld should check for PSU presence before adding their sensors to the db.
show platform temperature
PSU0.0 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:04
PSU0.1 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:04
PSU0.2 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:04
PSU1.0 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:04
PSU1.1 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:04
PSU1.2 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:04
PSU2.0 Outlet_Temp 46.019 97.0 -5.0 102.0 -10.0 False 20230214 14:05:05
PSU2.1 Outlet_Temp 46.019 97.0 -5.0 102.0 -10.0 False 20230214 14:05:05
PSU2.2 Outlet_Temp 46.019 97.0 -5.0 102.0 -10.0 False 20230214 14:05:05
PSU3.0 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU3.1 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU3.2 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU4.0 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU4.1 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU4.2 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU5.0 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU5.1 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:05
PSU5.2 Outlet_Temp N/A N/A N/A N/A N/A False 20230214 14:05:0
db entry
"TEMPERATURE_INFO|PSU0.0 Outlet_Temp": {
"expireat": 1676383627.929436,
"ttl": -0.001,
"type": "hash",
"value": {
"critical_high_threshold": "N/A",
"critical_low_threshold": "N/A",
"high_threshold": "N/A",
"is_replaceable": "False",
"low_threshold": "N/A",
"maximum_temperature": "N/A",
"minimum_temperature": "N/A",
"temperature": "N/A",
"timestamp": "20230214 14:06:59",
"warning_status": "False"
}
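A presence check before publishing could look roughly like the sketch below. The Fake classes are stand-ins written for this example; their method names (get_presence, get_all_thermals, get_name) are assumed to mirror the platform API, and the function name is hypothetical.

```python
class FakeThermal:
    """Stand-in for a thermal sensor object."""

    def __init__(self, name):
        self._name = name

    def get_name(self):
        return self._name


class FakePsu:
    """Stand-in for a PSU object with a presence flag and thermal sensors."""

    def __init__(self, present, thermals):
        self._present = present
        self._thermals = thermals

    def get_presence(self):
        return self._present

    def get_all_thermals(self):
        return self._thermals


def thermals_to_publish(psus):
    """Collect sensor names only from PSUs that are physically present."""
    sensors = []
    for psu in psus:
        if not psu.get_presence():
            continue  # skip absent PSUs so no all-N/A rows are created
        sensors.extend(t.get_name() for t in psu.get_all_thermals())
    return sensors


psus = [
    FakePsu(False, [FakeThermal("PSU0.0 Outlet_Temp")]),
    FakePsu(True, [FakeThermal("PSU2.0 Outlet_Temp")]),
]
published = thermals_to_publish(psus)
```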
PR: https://github.com/sonic-net/sonic-buildimage/pull/11731
“sonic-clear macsec” clears the stats with the above fix/PR, and the stats appear cleared in subsequent “show macsec Ethernet<>” CLI output.
Digging a little deeper into this:
This clears the stats from the SW table (redis/FLEX) only and does not clear them at the PHY/device (HW) level.
A subsequent “show macsec Ethernet<>” call displays the delta stats (i.e. [stats fetched from the platform/HW] – [stats cleared]).
Impact:
This approach can cause a mismatch between the stats that SW displays and what is available in the PHY.
During live traffic, this might complicate stats validation when debugging traffic drops, etc.
Mismatch:
After triggering “sonic-clear …”, the stats in “show counters…” would be (HW – cached) counters, whereas the stats at the PHY (device) level would still be the original HW counters (since nobody has invoked a clear on them), so the two would not match.
Proposal A:
“sonic-clear macsec” trigger can call the PHY-SAI API to clear the stats from the device (PHY/HW) and show the real stats fetched from the PHY without doing calculations at upper layer (SONiC/ SW table (redis/FLEX DB))
Or
Proposal B:
Do we have any command to display HW stats at SONiC (command) level?
If not, should we introduce one? At least, there would be a way to check and compare stats at SW and HW (device) level.
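To make the mismatch concrete, here is a minimal sketch of the software-side clear described above. The counter names and the helper `displayed_counters` are hypothetical and are not taken from the SONiC code; the point is only that the hardware value keeps growing while the CLI shows a delta.

```python
def displayed_counters(hw_counters, cleared_snapshot):
    """Return what a software-only clear would display: HW value minus the cached snapshot."""
    return {
        name: value - cleared_snapshot.get(name, 0)
        for name, value in hw_counters.items()
    }


# Hypothetical counter names, used for illustration only.
hw = {"in_pkts_ok": 1500, "out_pkts_encrypted": 900}

# "sonic-clear macsec" caches the current HW values; the PHY itself is untouched.
snapshot = dict(hw)

# Traffic continues, so the HW counter keeps counting from its original value.
hw["in_pkts_ok"] += 25

shown = displayed_counters(hw, snapshot)
print("CLI shows:", shown)
print("PHY holds:", hw)
```

With these numbers the CLI would report 25 for `in_pkts_ok` while the PHY still holds 1525, which is the SW/HW difference the issue is about.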
Using logger directly works, such as in PSUD:
https://github.com/Azure/sonic-platform-daemons/blob/master/sonic-psud/scripts/psud#L162
However, using self.logger directly doesn't:
https://github.com/Azure/sonic-platform-daemons/blob/master/sonic-syseepromd/scripts/syseepromd#L116
This issue tracks adding time-window-based performance monitoring (PM) statistics collection for coherent modules (400G-ZR for now).
The delayed setting of pmon is true, so pmon starts with a delay.
On a warm reboot, the value of WARM_RESTART_ENABLE_TABLE|system is set to true, and once the WARMBOOT_FINALIZER completes it is set back to false.
But pmon starts only after WARMBOOT_FINALIZER has finished, so is_warm_reboot_enabled() returns false.
[2024-08-07 14:33:34.623] admin@sonic:~$ redis-cli -n 6 hgetall "WARM_RESTART_ENABLE_TABLE|system"
[2024-08-07 14:33:36.695] 1) "enable"
[2024-08-07 14:33:36.695] 2) "true"
[2024-08-07 14:33:36.695] admin@sonic:~$ docker ps
[2024-08-07 14:33:38.784] CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[2024-08-07 14:33:38.784] 42ba626dd413 docker-router-advertiser:latest "/usr/bin/docker-ini…" 13 minutes ago Up 2 minutes radv
[2024-08-07 14:33:38.804] ff130d42c83e docker-fpm-frr:latest "/usr/bin/docker_ini…" 13 minutes ago Up 2 minutes bgp
[2024-08-07 14:33:38.804] 13bcfde779d8 docker-syncd-brcm:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes syncd
[2024-08-07 14:33:38.842] c033a238d0ba docker-teamd:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes teamd
[2024-08-07 14:33:38.842] 52c8cfbcae85 docker-orchagent:latest "/usr/bin/docker-ini…" 13 minutes ago Up 2 minutes swss
[2024-08-07 14:33:38.842] 0bf4250f1268 docker-eventd:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes eventd
[2024-08-07 14:33:38.865] ab0e2e9b9120 docker-database:latest "/usr/local/bin/dock…" 13 minutes ago Up 2 minutes database
[2024-08-07 14:33:38.865] admin@sonic:~$ redis-cli -n 6 hgetall "WARM_RESTART_ENABLE_TABLE|system"
[2024-08-07 14:33:41.034] 1) "enable"
[2024-08-07 14:33:41.034] 2) "false"
[2024-08-07 14:33:41.034] admin@sonic:~$ docker ps
[2024-08-07 14:33:43.004] CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[2024-08-07 14:33:43.004] 5243a9eda309 docker-sonic-gnmi:latest "/usr/local/bin/supe…" 11 minutes ago Up 1 second gnmi
[2024-08-07 14:33:43.025] 42ba626dd413 docker-router-advertiser:latest "/usr/bin/docker-ini…" 13 minutes ago Up 2 minutes radv
[2024-08-07 14:33:43.025] ff130d42c83e docker-fpm-frr:latest "/usr/bin/docker_ini…" 13 minutes ago Up 2 minutes bgp
[2024-08-07 14:33:43.046] 13bcfde779d8 docker-syncd-brcm:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes syncd
[2024-08-07 14:33:43.046] c033a238d0ba docker-teamd:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes teamd
[2024-08-07 14:33:43.046] 52c8cfbcae85 docker-orchagent:latest "/usr/bin/docker-ini…" 13 minutes ago Up 2 minutes swss
[2024-08-07 14:33:43.088] 0bf4250f1268 docker-eventd:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes eventd
[2024-08-07 14:33:43.088] ab0e2e9b9120 docker-database:latest "/usr/local/bin/dock…" 13 minutes ago Up 2 minutes database
[2024-08-07 14:33:43.103] admin@sonic:~$ redis-cli -n 6 hgetall "WARM_RESTART_ENABLE_TABLE|system"docker ps
[2024-08-07 14:33:49.514] CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[2024-08-07 14:33:49.514] c77c720ab127 docker-platform-monitor:latest "/usr/bin/docker_ini…" 11 minutes ago Up 2 seconds pmon
[2024-08-07 14:33:49.537] 2965e04beadd docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 11 minutes ago Up 4 seconds mgmt-framework
[2024-08-07 14:33:49.537] 4d653ad6155f docker-lldp:latest "/usr/bin/docker-lld…" 11 minutes ago Up 6 seconds lldp
[2024-08-07 14:33:49.563] 5243a9eda309 docker-sonic-gnmi:latest "/usr/local/bin/supe…" 11 minutes ago Up 8 seconds gnmi
[2024-08-07 14:33:49.563] 42ba626dd413 docker-router-advertiser:latest "/usr/bin/docker-ini…" 13 minutes ago Up 2 minutes radv
[2024-08-07 14:33:49.604] ff130d42c83e docker-fpm-frr:latest "/usr/bin/docker_ini…" 13 minutes ago Up 2 minutes bgp
[2024-08-07 14:33:49.604] 13bcfde779d8 docker-syncd-brcm:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes syncd
[2024-08-07 14:33:49.604] c033a238d0ba docker-teamd:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes teamd
[2024-08-07 14:33:49.623] 52c8cfbcae85 docker-orchagent:latest "/usr/bin/docker-ini…" 13 minutes ago Up 2 minutes swss
[2024-08-07 14:33:49.623] 0bf4250f1268 docker-eventd:latest "/usr/local/bin/supe…" 13 minutes ago Up 2 minutes eventd
[2024-08-07 14:33:49.655] ab0e2e9b9120 docker-database:latest "/usr/local/bin/dock…" 13 minutes ago Up 2 minutes database
pcieutil checks the sysfs entry to decide whether a PCIe device is present. For a non-hotplug PCIe device, the sysfs entry is not updated if, say, a downstream port of a PCIe switch (to which the PCIe device is connected) receives a "hot-reset". In this case the PCIe device is disconnected from the bus, but the sysfs entry does not reflect the missing device.
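One possible extra check, sketched here as an assumption rather than as what pcieutil does today, is to read the vendor ID from the device's config space in addition to checking that the sysfs entry exists. A read of all ones (0xFFFF) conventionally means nothing answered on the bus. The helper name `device_responds` is hypothetical, and the demo uses temporary stand-in files instead of a real `/sys/bus/pci/devices/<bdf>/config` path.

```python
import os
import tempfile


def device_responds(config_path):
    """Return True if the first two config-space bytes (vendor ID) are readable and not 0xFFFF."""
    try:
        with open(config_path, "rb") as f:
            raw = f.read(2)
    except OSError:
        return False
    if len(raw) < 2:
        return False
    return int.from_bytes(raw, "little") != 0xFFFF


# Demo with stand-in files; on a switch the path would be the device's sysfs "config" file.
with tempfile.TemporaryDirectory() as tmp:
    alive = os.path.join(tmp, "alive_config")
    gone = os.path.join(tmp, "gone_config")
    with open(alive, "wb") as f:
        f.write(bytes([0x86, 0x80]))  # vendor ID 0x8086, stored little-endian
    with open(gone, "wb") as f:
        f.write(b"\xff\xff")  # all ones: no device answered the read
    alive_ok = device_responds(alive)
    gone_ok = device_responds(gone)
    missing_ok = device_responds(os.path.join(tmp, "no_such_file"))

print(alive_ok, gone_ok, missing_ok)
```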
On an xcvrd/pmon docker restart, xcvrd sets the media settings in APPL DB again, which makes OA send spurious SI programming when it receives them. xcvrd should not set the settings on a process restart/pmon restart.
So add intelligence to xcvrd to distinguish a process restart from a config reload.
-On a fresh start (config reload/cold reboot), xcvrd creates a new table "TABLE_xcvrd" in STATE_DB and sets a magic number and a process restart count in it.
-On an xcvrd process restart/pmon docker restart, xcvrd checks for the magic number while coming up to determine whether this is a restart or a fresh start.
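The two steps above can be sketched as follows. This is only an illustration: a plain dict stands in for STATE_DB, and the magic value, field names and the function `on_xcvrd_start` are hypothetical.

```python
MAGIC = 0x58435652  # hypothetical magic number
TABLE = "TABLE_xcvrd"  # table name taken from the proposal above


def on_xcvrd_start(state_db):
    """Classify this start-up as 'fresh_start' or 'restart' using a marker kept in STATE_DB."""
    entry = state_db.get(TABLE)
    if entry is None or entry.get("magic") != MAGIC:
        # No marker: STATE_DB was rebuilt (config reload / cold reboot).
        state_db[TABLE] = {"magic": MAGIC, "restart_count": 0}
        return "fresh_start"
    # Marker survived: only the process or the pmon docker restarted.
    entry["restart_count"] += 1
    return "restart"


state_db = {}  # stand-in for STATE_DB
first = on_xcvrd_start(state_db)
second = on_xcvrd_start(state_db)
count_after_restart = state_db[TABLE]["restart_count"]

state_db.clear()  # stand-in for STATE_DB being rebuilt on config reload
third = on_xcvrd_start(state_db)
print(first, second, count_after_restart, third)
```

On a "restart" result xcvrd would skip re-posting the media settings; on "fresh_start" it would post them as it does today.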
Convert unit tests to use Pytest, and also take advantage of pytest-cov coverage reporting.
happens on 7050-qx32 platform
running inside docker, plugin is in directory /usr/share/sonic/platform/
While reviewing the CMIS PR changeset (#254), I found the following issue:
Issue: A field with the same name in different tables across STATE/APPL DB conflicts when an event occurs.
(a) Port table in APPL DB – caters to speed, lane, admin status etc. changes
(b) Port table in STATE DB – caters to host_tx_ready, admin status etc. changes
The admin status change was found to be false in one table and true in the other when an interface config was applied on a port/interface.
Plan: Reviewed this with @prgeor
Need to add a DB type check when an event is received.
Also, investigate and correct why this field (admin status and any other field) is mismatched between the two tables.
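A minimal sketch of the planned DB type check, assuming each event carries the name of the DB it came from. The function `handle_port_event` and the cache layout are hypothetical; the idea is simply that same-named fields from APPL DB and STATE DB are stored separately and never overwrite each other.

```python
KNOWN_DBS = ("APPL_DB", "STATE_DB")


def handle_port_event(cache, db_name, port, fields):
    """Store event fields under (db_name, port) so same-named fields stay separate per DB."""
    if db_name not in KNOWN_DBS:
        raise ValueError("unexpected DB type: {}".format(db_name))
    cache.setdefault((db_name, port), {}).update(fields)
    return cache


cache = {}
handle_port_event(cache, "APPL_DB", "Ethernet0", {"admin_status": "up", "speed": "400000"})
handle_port_event(cache, "STATE_DB", "Ethernet0", {"admin_status": "down", "host_tx_ready": "false"})

appl_admin = cache[("APPL_DB", "Ethernet0")]["admin_status"]
state_admin = cache[("STATE_DB", "Ethernet0")]["admin_status"]
print(appl_admin, state_admin)
```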
Add high power handling in sff_mgr.
Error message:
sudo sfpshow eeprom
Traceback (most recent call last):
File "/usr/local/bin/sfpshow", line 722, in <module>
cli()
File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/bin/sfpshow", line 646, in eeprom
sfp.get_eeprom()
File "/usr/local/lib/python3.9/dist-packages/utilities_common/multi_asic.py", line 157, in wrapped_run_on_all_asics
func(self, *args, **kwargs)
File "/usr/local/bin/sfpshow", line 552, in get_eeprom
self.intf_eeprom[interface] = self.convert_interface_sfp_info_to_cli_output_string(
File "/usr/local/bin/sfpshow", line 449, in convert_interface_sfp_info_to_cli_output_string
sfp_info_output = self.convert_sfp_info_to_output_string(sfp_info_dict)
File "/usr/local/bin/sfpshow", line 337, in convert_sfp_info_to_output_string
output += '{}{}: {}\n'.format(indent, data_map[key], sfp_info_dict[key])
KeyError: 'active_firmware'
It seems this started after those two commits were introduced:
sonic-net/sonic-platform-common@796e89a
The firmware version fields were moved from the TRANSCEIVER_INFO table to the TRANSCEIVER_FIRMWARE_INFO table, but no corresponding change was made in the sfpshow script. This causes the KeyError, because the script still searches only the old TRANSCEIVER_INFO table, where the firmware version keys no longer exist.
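One way the lookup could be made tolerant of the table split is sketched below. It is not the actual sfpshow fix: the function `build_output`, the labels and the sample values are hypothetical. It merges the two tables before formatting and falls back to "N/A" for keys that are absent from both.

```python
def build_output(data_map, info_table, firmware_table, indent="    "):
    """Format labelled fields, reading from both tables and tolerating missing keys."""
    merged = dict(info_table)
    merged.update(firmware_table)
    lines = []
    for key, label in data_map.items():
        lines.append("{}{}: {}".format(indent, label, merged.get(key, "N/A")))
    return "\n".join(lines)


# Hypothetical sample data for illustration.
data_map = {
    "vendor_name": "Vendor Name",
    "active_firmware": "Active Firmware",
    "inactive_firmware": "Inactive Firmware",
}
transceiver_info = {"vendor_name": "ACME"}
transceiver_firmware_info = {"active_firmware": "1.2.3"}

text = build_output(data_map, transceiver_info, transceiver_firmware_info)
print(text)
```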
When psud starts, it will only call _set_psu_led if there is a state change.
Because PsuStatus has some default state, like its present attribute set to True, there is never a state change to act upon.
The best way to solve this without breaking backward compatibility would be to initialize set_led in _update_single_psu_data to True on the first run.
This could be done by adding an attribute self.firstrun to DaemonPsud that would be initialized to True and later set to False after a loop iteration in the run method.
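The proposal can be sketched with simplified stand-in classes. These are not the real psud classes; `PsuStatusSketch`, `DaemonPsudSketch` and the `led_updates` counter (used here in place of a real _set_psu_led call) are hypothetical.

```python
class PsuStatusSketch:
    """Simplified stand-in for PsuStatus: defaults to 'present'."""

    def __init__(self):
        self.presence = True

    def set_presence(self, presence):
        """Return True only when the stored state actually changes."""
        if presence == self.presence:
            return False
        self.presence = presence
        return True


class DaemonPsudSketch:
    """Simplified stand-in for DaemonPsud with the proposed first-run flag."""

    def __init__(self):
        self.firstrun = True
        self.status = PsuStatusSketch()
        self.led_updates = 0  # counts calls that would go to _set_psu_led

    def _update_single_psu_data(self, presence):
        set_led = self.firstrun  # force one LED update on the first iteration
        if self.status.set_presence(presence):
            set_led = True
        if set_led:
            self.led_updates += 1

    def run(self, presence):
        self._update_single_psu_data(presence)
        self.firstrun = False  # later iterations act on state changes only


daemon = DaemonPsudSketch()
daemon.run(True)   # first run: LED is set although the state matches the default
after_first = daemon.led_updates
daemon.run(True)   # no change, no LED update
after_second = daemon.led_updates
daemon.run(False)  # real state change
after_third = daemon.led_updates
print(after_first, after_second, after_third)
```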
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
Problem description:
Out-of-range values for laser frequency and Tx power configuration should be rejected before the configuration is committed.
Currently, out-of-range values are written to the config DB but are not programmed to the transceiver because of logical checks in the lower layer in xcvrd.
Current master gives me the following error in syslog
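A minimal sketch of an up-front range check, assuming the supported ranges are available to the configuration path. The function `validate_zr_config` and the limit values are placeholders for illustration; real limits would have to come from what the module advertises.

```python
def validate_zr_config(freq_ghz, tx_power_dbm, limits):
    """Return a list of error strings; an empty list means the values may be committed."""
    errors = []
    if not limits["freq_min_ghz"] <= freq_ghz <= limits["freq_max_ghz"]:
        errors.append("laser frequency {} GHz is out of range".format(freq_ghz))
    if not limits["tx_power_min_dbm"] <= tx_power_dbm <= limits["tx_power_max_dbm"]:
        errors.append("Tx power {} dBm is out of range".format(tx_power_dbm))
    return errors


# Placeholder limits, NOT real module capabilities.
PLACEHOLDER_LIMITS = {
    "freq_min_ghz": 191300,
    "freq_max_ghz": 196100,
    "tx_power_min_dbm": -15.0,
    "tx_power_max_dbm": 0.0,
}

ok = validate_zr_config(193100, -10.0, PLACEHOLDER_LIMITS)
bad = validate_zr_config(200000, 5.0, PLACEHOLDER_LIMITS)
print(ok)
print(bad)
```

A CLI handler would refuse to write to the config DB whenever the returned list is not empty.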
Oct 4 00:09:52.935763 str-s6100-acs-1 INFO pmon#xcvrd: Start main loop
Oct 4 00:09:52.936035 str-s6100-acs-1 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
Oct 4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd File "/usr/bin/xcvrd", line 449, in <module>
Oct 4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd main()
Oct 4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd File "/usr/bin/xcvrd", line 415, in main
Oct 4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd status, port_dict = platform_sfputil.get_transceiver_change_event()
Oct 4 00:09:52.937167 str-s6100-acs-1 INFO pmon#supervisord: xcvrd File "/usr/share/sonic/platform/plugins/sfputil.py", line 454, in get_transceiver_change_event
Oct 4 00:09:52.937327 str-s6100-acs-1 INFO pmon#supervisord: xcvrd epoll = select.epoll()
Oct 4 00:09:52.937327 str-s6100-acs-1 INFO pmon#supervisord: xcvrd NameError: global name 'select' is not defined
Description
We found that a user can configure a split port by directly editing config_db.json without modifying port_config.ini or platform.json. The problem is that xcvrd still reads the port configuration from port_config.ini/platform.json, which results in wrong information in the DB.
Steps to reproduce the issue:
Describe the results you received:
Invalid port information in state DB
Describe the results you expected:
State DB has correct port information
In xcvrd, media settings are looked up in the media_settings.json file by a media_settings key. The key is derived from compliance_code in the xcvr data.
In xcvrd, the media_settings key is derived only from the "10/40G Ethernet Compliance Code" for QSFP28 transceivers.
xcvrd needs to be changed so that other transceivers are also recognised when constructing the media_settings key, which can then be used to extract the emphasis settings from media_settings.json.
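As a rough illustration of the requested change, the key derivation could try several compliance fields instead of only one. Everything here is an assumption made for the sketch: the function `media_settings_key`, the second field name, the "type-code" key format and the sample values are hypothetical and do not describe the current xcvrd behaviour.

```python
# Fields to try, in order. Only the first one is named in the issue; the second is an assumption.
COMPLIANCE_FIELDS = (
    "10/40G Ethernet Compliance Code",
    "Extended Specification Compliance",
)


def media_settings_key(xcvr_type, compliance):
    """Build a lookup key from the first usable compliance field, else fall back to the type."""
    for field in COMPLIANCE_FIELDS:
        code = compliance.get(field)
        if code and code != "Unknown":
            return "{}-{}".format(xcvr_type, code)
    return xcvr_type


key_a = media_settings_key("QSFP28", {"10/40G Ethernet Compliance Code": "40GBASE-CR4"})
key_b = media_settings_key(
    "QSFP28",
    {"10/40G Ethernet Compliance Code": "Unknown",
     "Extended Specification Compliance": "100GBASE-SR4"},
)
key_c = media_settings_key("QSFP-DD", {})
print(key_a, key_b, key_c)
```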
We are seeing a build failure due to TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop. The trace is below.
[2023-06-21T12:27:31.213Z] =================================== FAILURES ===================================
[2023-06-21T12:27:31.213Z] ____________ TestXcvrdScript.test_SfpStateUpdateTask_task_run_stop _____________
[2023-06-21T12:27:31.213Z]
[2023-06-21T12:27:31.213Z] self = <tests.test_xcvrd.TestXcvrdScript object at 0x7ff1f404b610>
[2023-06-21T12:27:31.213Z]
[2023-06-21T12:27:31.213Z] @patch('xcvrd.xcvrd_utilities.port_mapping.subscribe_port_config_change', MagicMock(return_value=(None, None)))
[2023-06-21T12:27:31.213Z] def test_SfpStateUpdateTask_task_run_stop(self):
[2023-06-21T12:27:31.213Z] port_mapping = PortMapping()
[2023-06-21T12:27:31.213Z] stop_event = threading.Event()
[2023-06-21T12:27:31.213Z] sfp_error_event = threading.Event()
[2023-06-21T12:27:31.213Z] task = SfpStateUpdateTask(DEFAULT_NAMESPACE, port_mapping, stop_event, sfp_error_event)
[2023-06-21T12:27:31.213Z] task.start()
[2023-06-21T12:27:31.213Z] assert wait_until(5, 1, task.is_alive)
[2023-06-21T12:27:31.213Z] task.raise_exception()
[2023-06-21T12:27:31.213Z] > task.join()
[2023-06-21T12:27:31.213Z]
[2023-06-21T12:27:31.213Z] tests/test_xcvrd.py:1041:
[2023-06-21T12:27:31.214Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:2200: in join
[2023-06-21T12:27:31.214Z] raise self.exc
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:2175: in run
[2023-06-21T12:27:31.214Z] self.task_worker(self.task_stopping_event, self.sfp_error_event)
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:1987: in task_worker
[2023-06-21T12:27:31.214Z] self.init()
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:1905: in init
[2023-06-21T12:27:31.214Z] self.retry_eeprom_set = self._post_port_sfp_info_and_dom_thr_to_db_once(port_mapping_data, self.xcvr_table_helper, self.main_thread_stop_event)
[2023-06-21T12:27:31.214Z] xcvrd/xcvrd.py:1845: in _post_port_sfp_info_and_dom_thr_to_db_once
[2023-06-21T12:27:31.214Z] warmstart.initialize("xcvrd", "pmon")
[2023-06-21T12:27:31.214Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] app_name = 'xcvrd', docker_name = 'pmon', db_timeout = 0, isTcpConn = False
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] @staticmethod
[2023-06-21T12:27:31.214Z] def initialize(app_name, docker_name, db_timeout=0, isTcpConn=False):
[2023-06-21T12:27:31.214Z] > return _swsscommon.WarmStart_initialize(app_name, docker_name, db_timeout, isTcpConn)
[2023-06-21T12:27:31.214Z] E RuntimeError: Unable to connect to redis (unix-socket): Cannot assign requested address
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] /usr/lib/python3/dist-packages/swsscommon/swsscommon.py:3369: RuntimeError
[2023-06-21T12:27:31.214Z] =============================== warnings summary ===============================
[2023-06-21T12:27:31.214Z] /usr/lib/python3/dist-packages/_pytest/junitxml.py:446
[2023-06-21T12:27:31.214Z] /usr/lib/python3/dist-packages/_pytest/junitxml.py:446: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0. See:
[2023-06-21T12:27:31.214Z] https://docs.pytest.org/en/stable/deprecations.html#junit-family-default-value-change-to-xunit2
[2023-06-21T12:27:31.214Z] for more information.
[2023-06-21T12:27:31.214Z] _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)
[2023-06-21T12:27:31.214Z]
[2023-06-21T12:27:31.214Z] -- Docs: https://docs.pytest.org/en/stable/warnings.html
[2023-06-21T12:27:31.215Z] - generated xml file: /sonic/src/sonic-platform-daemons/sonic-xcvrd/test-results.xml -
[2023-06-21T12:27:31.215Z]
[2023-06-21T12:27:31.215Z] ----------- coverage: platform linux, python 3.9.2-final-0 -----------
[2023-06-21T12:27:31.215Z] Name Stmts Miss Cover
[2023-06-21T12:27:31.215Z] ----------------------------------------------------------------
[2023-06-21T12:27:31.215Z] xcvrd/__init__.py 0 0 100%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd.py 1548 358 77%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd_utilities/__init__.py 0 0 100%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd_utilities/port_mapping.py 191 26 86%
[2023-06-21T12:27:31.215Z] xcvrd/xcvrd_utilities/sfp_status_helper.py 27 2 93%
[2023-06-21T12:27:31.215Z] ----------------------------------------------------------------
[2023-06-21T12:27:31.215Z] TOTAL 1766 386 78%
[2023-06-21T12:27:31.215Z] Coverage HTML written to dir htmlcov
[2023-06-21T12:27:31.215Z] Coverage XML written to file coverage.xml
[2023-06-21T12:27:31.215Z]
[2023-06-21T12:27:31.215Z] =========================== short test summary info ============================
[2023-06-21T12:27:31.215Z] FAILED tests/test_xcvrd.py::TestXcvrdScript::test_SfpStateUpdateTask_task_run_stop
[2023-06-21T12:27:31.215Z] =================== 1 failed, 69 passed, 1 warning in 9.18s ====================
[2023-06-21T12:27:31.215Z] [ FAIL LOG END ] [ target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl ]
While working with a 400G ZR optical module, I found an issue in the CMIS FSM whereby it timed out on a hard-coded 1 minute value.
An optical module's DPInit timeout value differs by optics type/variant and is defined in the spec (byte 144), in the lower nibble (bits 0-3).
Fetch this value from the optics EEPROM, along with the optics variant/type, in the CMIS FSM instead of using the present hard-coded value.
#290 : CMIS FSM (state machine) timer to be fetched from optics EEPROM instead of a hard-coded value
Once the above (#290) is in place, incorporate the fix for this issue in xcvrd (as the CMIS FSM orchestrator) to read the timer value advertised per the CMIS spec for the DP_INIT state transition timeout and act on it (instead of the current hard-coded value).
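A small sketch of the decoding step, assuming the raw value of byte 144 has already been read from the EEPROM. The lower-nibble extraction follows the description above; the code-to-milliseconds table is an assumption written from memory of the CMIS duration encoding and must be confirmed against the spec before use. The names `dpinit_timeout_ms` and `ASSUMED_MAX_DURATION_MS` are hypothetical.

```python
# ASSUMED mapping of duration code -> upper bound in milliseconds; verify against the CMIS spec.
ASSUMED_MAX_DURATION_MS = {
    0: 1, 1: 5, 2: 10, 3: 50, 4: 100, 5: 500, 6: 1000,
    7: 5000, 8: 10000, 9: 60000, 10: 300000, 11: 600000, 12: 3000000,
}

# Fallback equal to the hard-coded 1 minute mentioned above.
DEFAULT_TIMEOUT_MS = 60000


def dpinit_timeout_ms(byte_144, table=ASSUMED_MAX_DURATION_MS):
    """Decode the DPInit timeout from the lower nibble (bits 0-3) of byte 144."""
    code = byte_144 & 0x0F
    return table.get(code, DEFAULT_TIMEOUT_MS)


t1 = dpinit_timeout_ms(0xA7)  # upper nibble 0xA is ignored, lower nibble is 7
t2 = dpinit_timeout_ms(0x0F)  # code not in the table, falls back to the default
print(t1, t2)
```

The CMIS state machine would then compare the elapsed time in DP_INIT against this per-module value rather than a fixed constant.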
Add SI setting handling in sff_mgr
I am adding the media_settings.json file for x86_64-arista_7800r3a_36d2_lc. This platform has 36 QSFP-DD ports. The media_settings.json contains the tuning data for lane0-lane7 on each QSFP-DD port. This breaks the 36x100G SKU, because xcvrd sets 8-lane tuning data on a 4-lane 100G port. The Jupiter SAI then fails port serdes creation because the number of port serdes lanes does not match the number of tuning parameters.
root@s1:~# sonic-db-cli -n asic1 APPL_DB hgetall "PORT_TABLE:Ethernet240"
{'admin_status': 'up', 'alias': 'Ethernet31/1', 'asic_port_name': 'Eth240-ASIC1', 'coreId': '0', 'corePortId': '25', 'description': 'Ethernet240-connected-to-nv419@eth56/1', 'fec': 'rs', 'index': '31', 'lanes': '40,41,42,43', 'mtu': '9100', 'numVoq': '8', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '100000', 'tpid': '0x8100', 'oper_status': 'up', 'main': '0x4e,0x4e,0x4b,0x4b,0x4b,0x4e,0x4b,0x4e', 'post1': '-0x16,-0x16,-0x14,-0x14,-0x14,-0x16,-0x14,-0x16', 'post2': '0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0', 'post3': '0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0', 'pre1': '-0x5,-0x5,-0x5,-0x5,-0x5,-0x5,-0x5,-0x5', 'pre2': '0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0'}
The problem is that the method "get_media_val_str" in sonic-platform-daemons/sonic-xcvrd/xcvrd/xcvrd.py cannot handle this case properly. I am thinking about how to handle this issue. Should we fix the logic in get_media_val_str, read the number of lanes from config_db, or make an SKU-specific media_settings.json file?
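The first option (limit the tuning values to the port's lane count) could look roughly like the sketch below. It is not the existing get_media_val_str: the function `media_val_str_for_port` is hypothetical, and taking the first N lanes is an assumption that may not hold for every breakout layout. The sample values are the 'main' and 'lanes' entries from the dump above.

```python
def media_val_str_for_port(lane_values, num_lanes):
    """Join the tuning values of the first num_lanes lanes into a comma-separated string."""
    values = []
    for i in range(num_lanes):
        key = "lane{}".format(i)
        if key not in lane_values:
            raise KeyError("no tuning value for {}".format(key))
        values.append(lane_values[key])
    return ",".join(values)


# 8-lane 'main' tuning data as it appears in the PORT_TABLE dump above.
main_vals = "0x4e,0x4e,0x4b,0x4b,0x4b,0x4e,0x4b,0x4e".split(",")
main_by_lane = {"lane{}".format(i): v for i, v in enumerate(main_vals)}

# Lane count derived from the port's 'lanes' field (4 lanes for the 100G port).
port_lanes = "40,41,42,43"
num_lanes = len(port_lanes.split(","))

trimmed = media_val_str_for_port(main_by_lane, num_lanes)
print(trimmed)
```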
Problem scenario: 400G ports are up and then swss is restarted. The event sequence below reproduces the issue.
Event 1: admin state is set to up -> with this event, the module is put into TX disable
Event 2: host_tx_ready is set to false -> with this event, a double-negate condition occurs, which leaves the optics in the TX disable state while the SW state goes to CMIS_STATE_READY
See the "is application update required" check and the CMIS task_worker():
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1224
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1601