DeepMatch is a deep packet inspection capability designed for programmable network processors; it can be called as an extern in P4 programs. More broadly, our paper provides guidance for applications that need to touch every byte of the payload. Tasks that fit within the compute and memory bounds identified in the paper are candidates for implementation on this class of network processors, and the paper shows how to start thinking about implementing them.
For further details, see our CoNEXT 2020 paper and video.
I have a question about the motivation. What use cases do you envision for this, given that everything is going encrypted now?
Section 3 of the DeepMatch paper provides a few use cases.
I would argue that not everything is moving to encryption. In fact, many distributed web applications do not encrypt traffic between all components. For example, consider a cloud service offloading packet-analysis compute to the SmartNIC. Here, the SmartNIC performs packet classification and adds a new header containing information that a later component can pick up and use to quickly determine how to handle the packet without reprocessing it. DeepMatch has the capabilities needed to support such use cases.
That said, there are many applications that do use encryption. Note that encryption/decryption is another operation that needs to process every byte of the payload, so interesting future work building on this might offload decryption to the NPU. Decryption-then-pattern-match pipelines are already seen in some networks and are a possible direction to explore.
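As a minimal sketch of the header-tagging use case above (the header layout and field names here are illustrative assumptions, not the format DeepMatch actually emits), a classifier on the SmartNIC could prepend a small metadata header that a later pipeline stage reads instead of re-inspecting the payload:

```python
import struct

# Illustrative 4-byte metadata header: 1-byte version, 1-byte class id,
# 2-byte flags, in network byte order. This layout is an assumption for
# the sketch only.
META_FMT = "!BBH"
META_LEN = struct.calcsize(META_FMT)

def tag_packet(payload: bytes, class_id: int, flags: int = 0) -> bytes:
    """Prepend a classification-result header to the packet payload."""
    return struct.pack(META_FMT, 1, class_id, flags) + payload

def read_tag(packet: bytes):
    """Later stage: recover the class id without reparsing the payload."""
    version, class_id, flags = struct.unpack(META_FMT, packet[:META_LEN])
    return class_id, flags, packet[META_LEN:]
```

The later component pays a fixed-size header parse instead of a full payload scan, which is the point of doing classification once at the NIC.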
Building on the above question, what about using AI algorithms? It's a RISC platform, but it sounds like a possible compromise relative to a TPU (while sticking to integer math)?
Techniques such as knowledge distillation and quantization may be used to deploy such algorithms on network processors. Knowledge distillation transfers knowledge from a large model to a smaller one. Quantization approximates a large set of values with a smaller set, e.g., moving from real numbers to floating point to fixed-point integers.
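As a toy illustration of quantization (not taken from any particular system), mapping real-valued weights onto saturating 8-bit fixed-point integers with a single scale factor looks like:

```python
def quantize(values, scale):
    """Map real-valued weights to int8 fixed-point: q = round(v / scale)."""
    out = []
    for v in values:
        q = int(round(v / scale))
        # Saturate to the int8 range [-128, 127].
        out.append(max(-128, min(127, q)))
    return out

def dequantize(qvalues, scale):
    """Approximate recovery of the original values."""
    return [q * scale for q in qvalues]
```

Integer-only inference then works on the quantized values directly, which matches the NPU's integer datapath; the scale factor is folded in once at the end.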
More broadly, our paper provides guidance for anything that needs to touch every byte of the payload. Classification tasks that fit within the compute and memory bounds the paper identifies become candidates, and the paper shows how to start thinking about implementing them.
Netronome makes the right tradeoff in keeping their card small, compact, and real-time. The memories are not larger because larger memories would be slower and fewer FPCs would fit on the card, reducing throughput. Netronome gives us more of a real-time engine but puts the burden on the programmer to actually achieve real time. So simple fixes like "oh, I need more memory" would eat into the performance you get.
One limiting factor in the DeepMatch design is that each FPC is running a pretty tight DFA loop. This loop is a small fraction of the 8K instruction limit. So, if we could replicate the DFA loop in both code stores (when in shared code mode) then we would eliminate contention for that block of code; now the only contention would be for uncommon code that is used for packet headers. This should result in less impact due to code sharing. Unfortunately, such code placement is not supported by the version of Netronome's NFP-6000 that we used.
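The tight DFA loop itself is conceptually small: one table lookup per payload byte. A schematic version is below (in Python for readability; the real kernel is NFP micro-C, and the transition tables here are made up for illustration):

```python
def dfa_scan(transitions, accepting, payload):
    """Run payload bytes through a table-driven DFA.

    transitions: list indexed by state; each entry is a 256-entry list
                 mapping an input byte to the next state.
    accepting:   set of states that signal a pattern match.
    Returns the byte offset at which a match completes, or -1.
    """
    state = 0
    for i, b in enumerate(payload):
        state = transitions[state][b]   # one table lookup per byte
        if state in accepting:
            return i
    return -1
```

The loop body is a handful of instructions, which is why it occupies only a small fraction of the 8K instruction store; the memory placement of `transitions` (CLS/CTM/IMEM/EMEM) is what dominates performance.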
DeepMatch motivates a match-action processing pipeline design with better support for full-packet processing (not just headers). That said, the future is in co-design: software/high-level definitions of tasks that are then supported by specialized hardware. That means defining good abstractions and the compilers/tools to map them, along with developing architectures that provide flexibility and programmability as well as high performance and efficiency.
Install the Netronome SDK6 Run Time Environment, Hardware debug server, and BSP according to their documentation.
We used the following configuration for the DeepMatch host:
- Dell PowerEdge R720 with dual Intel Xeon E5-2650 v2 8-core 2.60 GHz processors and 64 GB DDR3 1600MHz RAM
- Netronome Agilio-CX Dual-Port 40 Gigabit Ethernet Intelligent Server Adapter (Part ID: ISA-4000-40-2-2)
- Ubuntu 16.04 LTS
- Python 2.7
- Enable SR-IOV in the BIOS
Install the Netronome SDK IDE according to their documentation. Note that the current version of DeepMatch is written in P4v14 (not P4v16). We used a Windows 10 host to run the IDE.
We used the following configuration for our traffic generator host:
- Dell R720 with dual Intel Xeon E5-2680 v2 10-core 2.80GHz processors and 256 GB DDR3 1866 MHz RAM
- Mellanox MCX456A-ECA ConnectX-4 dual-port QSFP28 100GbE
- Ubuntu 16.04 LTS
- Anaconda3 with Python 3.7.8 and Python 2.7.12
- dpdk-17.08.1 with 1 GB Hugepages
- pktgen-3.4.9
- latest version of tcpreplay
- latest version of python scapy
- python fabric (http://www.fabfile.org/) is used to automate many of our experiments, including interacting with the NFP card remotely.
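A sketch of how that automation can look: build the `rtecli` command lines once, then run them on the DeepMatch host over SSH with fabric (e.g. `Connection("deepmatch-host").run(cmd)` in fabric 2.x). The host name and file names below are placeholders:

```python
# Build the rtecli commands used to load DeepMatch firmware remotely.
# Running them would use fabric, e.g.:
#   from fabric import Connection
#   for cmd in load_commands(...): Connection("deepmatch-host").run(cmd)
RTE_PORT = 20207

def load_commands(nffw, p4cfg, design_json):
    """Return the firmware load/configure/status command sequence."""
    base = "rtecli -p %d " % RTE_PORT
    return [
        base + "design-load -p %s -f %s" % (design_json, nffw),
        base + "config-reload -c %s" % p4cfg,
        base + "status",
    ]
```

Keeping the command strings in one place makes it easy to sweep over firmware builds in experiments.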
We use an Arista DCS-7050QX-32-R 32-port 40G QSFP+ Layer 3 switch.
Use Netronome Programmer Studio (we used version 6.0.3.1 build 3241) to create a new project with the DeepMatch P4v14 source files.
Useful Programmer Studio Settings:
- Chip Setting:
- nfp-6xxxc-b0
- Project Configuration (see "P4/Managed C" tab):
- Number of worker MEs: 80
- Use Shared Code Store: enable
- Reduced thread usage: enable
- Optional Components (see "P4/Managed C" tab):
- GRO: disable
- Preprocessor definitions (see the "General" tab):
- PHAST_FX_PY: where X is the number of flows and Y is the out-of-order (OoO) buffer size; e.g. PHAST_F321_P100
- PHAST_DFA_X: where X identifies the DFA to load; e.g. PHAST_DFA_MAL_BACKDOOR
- PHAST_DLOC_X: where X identifies which memory to place the DFA; e.g. CLS, CTM, IMEM, EMEM
- PHAST_LOCK: enables all locks (needed when reordering is turned on)
Build the code and transfer the following files to the DeepMatch host:
- file.nffw
- file.p4cfg
- out/pif_design.json
Start the SDK6 Run Time Environment service:
systemctl start nfp-sdk6-rte1
Check that the RTE and hardware debug server processes are listening:
sudo netstat -tulpn | grep -E 'pif_rte|nfp-sdk-hwdbgs'
Load the design and configuration, then check that the firmware is loaded:
rtecli -p 20207 design-load -p pif_design.json -f file.nffw
rtecli -p 20207 config-reload -c file.p4cfg
rtecli -p 20207 status
Another way to check that firmware is loaded:
sudo /opt/netronome/bin/nfp-nffw status -n 1
sudo setup_experiment.py -o v
To unload the firmware:
rtecli -p 20207 design-unload
One way to check the status:
rtecli -p 20207 status
Another way to check the status:
sudo /opt/netronome/bin/nfp-nffw status -n 1
Use tcpreplay, dpdk/pktgen, or your favorite tools to transmit/receive packets to/from DeepMatch.
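For controlled tests it helps to craft payloads where the match pattern sits at a known offset; such payloads can then be written into a pcap (e.g. with scapy) and replayed with tcpreplay. A small helper (the pattern placement logic is ours, not part of the DeepMatch tooling):

```python
def make_payload(pattern, offset, total_len, filler=b"\x00"):
    """Place `pattern` at byte `offset` inside a payload of `total_len`."""
    if offset + len(pattern) > total_len:
        raise ValueError("pattern does not fit in payload")
    tail = total_len - offset - len(pattern)
    return filler * offset + pattern + filler * tail
```

Sweeping `offset` and `total_len` lets you check that matches are reported regardless of where the pattern falls in the packet.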
sudo /opt/netronome/bin/nfp-rtsym -n 1 -L | grep flow_ht
Use this output to determine the location and size of flow_ht. Then dump the variable's memory into a file and check the value.
sudo /opt/netronome/bin/nfp-rtsym -n 1 -L | grep pkt_ht
Use this output to determine the location and size of pkt_ht. Then dump the variable's memory into a file and check the value.
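Locating a symbol in the `nfp-rtsym -L` listing can be scripted. The sketch below assumes the listing prints whitespace-separated columns with the symbol name first and address/size after it; verify this against your SDK version's actual output format:

```python
def find_symbol(listing, name):
    """Scan `nfp-rtsym -L` output for `name`.

    Assumes whitespace-separated columns with the symbol name in the
    first column -- check against your SDK's actual format. Returns the
    matching line's fields, or None if the symbol is absent.
    """
    for line in listing.splitlines():
        fields = line.split()
        if fields and name in fields[0]:
            return fields
    return None
```

The returned fields give the location and size needed for the memory-dump step above, for both flow_ht and pkt_ht.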
One way to check:
sudo /opt/netronome/bin/nfp -n 1 -m mac -e show port stats 0 0-8
Another way to check:
rtecli -p 20207 counters list-system
We use Python fabric (http://www.fabfile.org/) to automate the manual evaluation steps.