YOLO 系列

YOLO - You Only Look Once

graph TD
    O(YOLO家谱)--> |Joseph Redmon| V1(2015 YOLOv1)
    V1 --> V2(2016 YOLOv2)
    V2 --> V3(2018 YOLOv3)
    V3 --> |Alexey Bochkovskiy| V4(2020 YOLOv4)
    V3 --> |百度| PP(2020 PP-YOLO)
    V3 --> |旷世| VX(2021 YOLOX)
    V4 --> V7(2022 YOLOv7)
    V4 --> |Ultralytics LLC| V5(2021 YOLOv5)
    V5 --> V8(2022 YOLOv8)
    V4 --> |美团| V6(2022 YOLOv6)
    V5 --> V6
    V4 --> |ChienYaoWang| VR(2021 YOLOR)

Papers

2015 YOLO V1 You Only Look Once: Unified, Real-Time Object Detection [paper] [netscope] [Code] [Code2 - AlexeyAB] [docs]
2016 YOLO V2 YOLO9000: Better, Faster, Stronger [paper] [netscope] [Code] [docs] [YOLO 9000]
2018 YOLOv3 An Incremental Improvement [paper] [YOLOv3 netscope] [Code] docs [docs]
2020 PP-YOLO PP-YOLO: An Effective and Efficient Implementation of Object Detector
[paper]
2021 YOLOv4 YOLOv4: Optimal Speed and Accuracy of Object Detection [paper]
2021 YOLOR You Only Learn One Representation: Unified Network for Multiple Tasks [paper]
2021 YOLOX YOLOX: Exceeding YOLO Series in 2021 [paper]
2022 YOLOv6 YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications [paper]
2022 YOLOv7 YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors [paper]

提升变化

YOLOv1

定位问题 --> 回归问题

YOLOv2

Darknet19
BN Layers
高分辨率分类器
使用聚类获取 Anchor Box
特征融合
多尺度训练
分类检测数据集联合训练

YOLOv3

Darknet53
FPN

YOLOv4

CSPDarkNet53
CutMix & Mosaic
Mish

Datasets Image2D Classification Pets

Oxford-IIIT Pets

由英国牛津大学视觉几何组（Visual Geometry Group）收集整理
Oxford-IIIT Pet Dataset 包含37种宠物分类，每种分类大概200张图片

Downloads

wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz

Annotations

trimaps/ 	Trimap annotations for every image in the dataset
		Pixel Annotations: 1: Foreground 2:Background 3: Not classified
xmls/		Head bounding box annotations in PASCAL VOC Format

list.txt	Combined list of all images in the dataset
		Each entry in the file is of following nature:
		Image CLASS-ID SPECIES BREED ID
		ID: 1:37 Class ids
		SPECIES: 1:Cat 2:Dog
		BREED ID: 1-25:Cat 1:12:Dog
		All images with 1st letter as captial are cat images while
		images with small first letter are dog images.
trainval.txt	Files describing splits used in the paper.However,
test.txt	you are encouraged to try random splits.

             (a)                          (b)                        (c)

(a) species and breed name;
(b) a tight bounding box (ROI) around the head of the animal;
(c) a pixel level foreground-background segmentation (Trimap).

list.txt

german_shorthaired_9 15 2 9
 |                    | | |— BREED ID
 |                    | |— SPECIES: 1:Cat 2:Dog
 |                    |— Class ids
 |— SPECIES

德国短毛犬( German Shorthaired)

Reference

Brief

Microsoft COCO Datasets

Detection
Keypoints
Segmentation
- Stuff - Semantic Segmentation
- Panoptic
Captions

Download

curl https://sdk.cloud.google.com | bash
mkdir train2017
gsutil -m rsync gs://images.cocodataset.org/train2017 train2017
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

COCO Challenges Tasks

Tasks	Labels
Object Detection
Stuff Segmentation
Panoptic Segmentation
Keypoint Detection
Captioning

Annotation format

Detection

annotation{
       "id" : int,
       "image_id" : int,
       "category_id" : int,
       "segmentation" : RLE or [polygon],
       "area" : float,
       "bbox" : [x,y,width,height],
       "iscrowd" : 0 or 1,
}

categories[{
       "id" : int,
       "name" : str, 
       "supercategory" : str,
}]

iscrowd=1 是否是 semantic segmentation

ML Tasks Image Retrieval CBIR SiameseNetwork

Siamese Network

Reference

Brief

Siamese网络是一种相似性度量方法
- 用于当类别数多，但每个类别的样本数量少的情况下可用于类别的识别、分类等。
- 传统分类方法是需要确切的知道每个样本属于哪个类，需要针对每个样本有确切的标签。而且相对来说标签的数量是不会太多的。当类别数量过多，每个类别的样本数量又相对较少的情况下，这些方法就不那么适用了。
主要**是通过一个函数将输入映射到目标空间，在目标空间使用简单的距离（欧式距离等）进行对比相似度。在训练阶段去最小化来自相同类别的一对样本的损失函数值，最大化来自不同类别的一堆样本的损失函数值。
Siamese network就是“连体的神经网络”，神经网络的“连体”是通过共享权值来实现的. 即左右两个神经网络的权重一模一样。

Siamese也就是“暹罗”人或“泰国”人。Siamese在英语中是“孪生”、“连体”的意思

Arch

	- 孪生神经网络有两个输入 - 两个输入进入两个神经网络 - 输入映射到新空间形成新表示 - 计算Loss来评价输入相似度。

伪孪生神经网络/Pseudo-Siamese Network
- 左右两边不共享权值
- 可以是不同的神经网络(LSTM/CNN), 也可以是相同类型的神经网络。

Network	siamese network	pseudo-siamese network
Diff	处理两个输入"比较类似"的情况	处理两个输入"有一定差别"的情况
UseCase	- 比较两张图片	- 验证标题与正文的描述是否一致 - 文字是否描述了一幅图片

输入不再是单个样本，而是一对样本，不再给单个的样本确切的标签，而且给定一对样本是否来自同一个类的标签，是就是0，不是就是1
设计了两个一模一样的网络，网络共享权值W，对输出进行了距离度量，可以说l1、l2等。
针对输入的样本对是否来自同一个类别设计了损失函数，损失函数形式有点类似交叉熵损失
优点是淡化了标签，使得网络具有很好的扩展性，可以对那些没有训练过的类别进行分类，这点是优于很多算法的。而且这个算法对一些小数据量的数据集也适用，变相的增加了整个数据集的大小，使得数据量相对较小的数据集也能用深度网络训练出不错的效果。

应用

图像匹配
图像块匹配
视频跟踪

ML Tasks Image Classification MobileNet

MobileNet

Reference

2017 MobileNet V1 MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [paper]
2018 MobileNet V2 MobileNetV2: Inverted Residuals and Linear Bottlenecks [paper]
2019 MobileNet V3 Searching for MobileNetV3 [paper]
MobileNets 论文笔记
深度解读谷歌MobileNet
Caffe Mobilenet
MobileNets: Open-Source Models for Efficient On-Device Vision
论文记录_MobileNets Efficient Convolutional Neural Networks for Mobile Vision Application
MobileNet论文详解
MobileNet V2 论文初读
轻量化网络：MobileNet-V2
目标检测：Mobilenet-SSD实现步骤
浅谈SSD检测算法

Brief

Google针对手机等嵌入式设备提出的一系列轻量级的深层神经网络，取名为MobileNet.
目前有三个版本：v1/v2/v3

加速深度模型: 不影响准确率的前提下减少参数数量，进而减少计算时间
- MobileNet v1 利用了深度可分离卷积提高了计算效率
- MobileNet v2 加入了线性bottlenecks和反转残差模块构成了高效的基本模块。

Name	Description
MobileNet V1	可分卷积
MobileNet V2	线性 Bottlenecks & 反转残差模块
MobileNet V3	NAS

量化

Quantization - 量化
- 嵌入式设备/移动设备 - 低计算能力/低内存/低存储空间/低功耗/...
- 压缩参数/提升速度/降低内存占用/低计算单元占用面积
- 精度损失
浮点运算 --> 整型运算: float32 --> int8
线性量化/非线性量化
逐层/组/通道量化
在线/离线量化
比特量化
权重/权重激活量化
OpenVINO INT8 Model

线性量化

$r = Round(S(q-Z))$
- $q$: 原始 float32 值
- $Z$: float32 数值的偏移量 Zero Point
- $S$: float32 缩放因子 Scale
- $Round$: 四舍五入或向上向下取整
- $r$: 量化后的整数值
对称量化/非对称量化

对称量化

输入数据的 最大值/最小值 映射到 [-128, 127]
输入数据零，对应量化后的零

非对称量化

输入数据的 最大值/最小值 映射到 [0, 255]

非线性量化

Pipeline

统计输入数据 Max/Min
选择量化方法 int8/uint8 ...
计算 Zero Point/Scale
根据Data对模型进行量化
验证量化模型性能, 若模型性能不好，则尝试不同的 Z/S 重复量化

Reference

模型量化详解

ML Tasks Image Retrieval CBIR VLAD

Reference

Brief

Vector of Locally Aggregated Descriptors , 由Jegou et al.在2010年提出，其核心**是aggregated(积聚)，主要应用于图像检索领域

在深度学习时代之前，图像检索领域以及分类主要使用的常规算法有BoW、Fisher Vector及VLAD等。

BoW: 核心**是提取出关键点描述子后利用聚类的方法训练一个码本，随后每幅图片中各描述子向量在码本中各中心向量出现的次数来表示该图片，该方法的缺点是需要码本较大
FV: 核心**是利用高斯混合模型(GMM)，通过计算高斯混合模型中的均值、协方差等参数来表示每张图像。该方法的优点是准确度高，但缺点是计算量较大
VLAD: 可以看做是一种简化的FV，其主要方法是通过聚类方法训练一个小的码本，对于每幅图像中的特征找到最近的码本聚类中心，随后所有特征与聚类中心的差值做累加，得到一个kd的vlad矩阵，其中k是聚类中心个数，d是特征维数(如sift是128维),随后将该矩阵扩展为一个(kd)维的向量，并对其L2归一化，所得到的向量即为VLAD。
NetVLAD: 在VLAD算法的基础上Arandjelovic et al.在 All about VLAD 一文中提出了一种改进方法。随后，其又结合深度卷积神经网络的相关内容，提出了NetVLAD。

VLAD

VLAD的优点：

提出一种图像表示方法，在拥有合理的向量维度下，具有很好的搜索准确率。
联合优化降维和索引效率。

Steps

读取图片文件路径及特征提取
使用聚类方法训练码本
将每张图片的特征与最近的聚类中心进行累加
对累加后的VLAD进行PCA降维并对其归一化
得到VLAD后，使用ADC方法继续降低储存空间和提高搜索速度

其中步骤4、5可选，在步骤3得到残差累加向量后进行L2归一化即可用欧氏距离等计算两张图片的相似性从而实现图片检索
VLAD 实现

C++
Python

NetVLAD

带有VLAD层的卷积神经网络结构

NetVLAD层，是一种受VLAD灵感激发的可泛化的VLAD层。
基于弱监督排序损失，提出了一种可训练的方法用于学习这种端到端框架的参数。

NetRVLAD

NetFV

Soft-DBoW

BOW

特征提取。在训练阶段，将图像用很多“块”（patch）表示，以SIFT特征为例，图像中每个关键点就是一个patch，每一个patch特征向量的维数128。
字典构建。假设共有M幅训练图像，字典的大小为100，即有100个词，用K-均值算法对所有的patch进行聚类，等k-均值收敛时，将得到每一个聚类最后的质心，这100个质心（维数128）就是词典里的100个词，词典构建完毕。
字典表示。在测试阶段，对每幅图像初始化一个维数100、值全为0的直方图，计算测试图像每个patch与字典中单词的距离，距离patch最近的单词对应的直方图计数加1，将所有patch计算完毕后的直方图即为图像的字典表示。
图像检索。训练图与测试图都以100维向量表示，对每个待检索图像，只需在字典中计算所有训练图与其距离，并返回最近的若干幅即可。

FV - Fisher Vector

一种类似于BOVW词袋模型的一种编码方式，如提取图像的SIFT特征，通过矢量量化（KMeans聚类），构建视觉词典（码本），FV采用混合高斯模型（GMM）构建码本，但是FV不只是存储视觉词典的在一幅图像中出现的频率，并且FV还统计视觉词典与局部特征（如SIFT）的差异。

Reference

Brief

SVD - Singular Value Decomposition - 奇异值分解
特征分解

特征值 & 特征向量 & 特征分解

定义	Description
$A$	n x n 维矩阵
$Ax = \lambda x$	特征向量和特征值关系
$x$	n 维向量
$\lambda$	矩阵 A 的一个特征值
$x$	矩阵 A 的特征值 $\lambda$ 对应的特征向量
$\left{ \lambda_{1} \leq \lambda_{2} \leq ... \leq \lambda_{n} \right}$	n 个特征值
$ \omega_{1}, \omega_{2},..., \omega_{n} $	n 个特征值对应的特征向量
$\Sigma$	n个特征值为主对角线的 n x n 矩阵
$W$	n 个特征向量组成的 n x n 矩阵
$A = W\Sigma W^{-1}$	矩阵 A 的特征分解表示
$A = W\Sigma W^{T}$	W 的 n 个特征向量标准化 $\omega_{i}^T\omega_{i}=1$, 即 $W^TW =I$, $W^T=W^{-1}$

SVD 分解

不要求分解矩阵为方阵

定义	Description
$A$	m x n 矩阵
$A=U\Sigma V^T$	A 的 SVD 分解
$U$	m x m 矩阵 $U^TU=I$
$\Sigma$	m x n 矩阵除主对角线外全是 0, 主对角线上的每个元素都称为奇异值
$V$	n x n 矩阵 $V^TV=I$
$(A^TA)v_{i}=\lambda_{i}v_{i}$	n x n 矩阵 $A^TA$ 的特征分解, 特征向量组成 V
$(AA^T)u_{i}=\lambda_{i}u_{i}$	m x m 矩阵 $AA^T$ 的特征分解, 特征向量组成 U
$\Sigma$	$A=U\Sigma V^T \Rightarrow AV=U\Sigma V^TV \Rightarrow AV=U\Sigma \Rightarrow Av_{i} = \sigma_{i}u_{i} \Rightarrow \sigma_{i} = Av_{i}/u_{i}$

UV 推导

$A=U\Sigma V^T \Rightarrow A^T=V\Sigma U^T$
$\Rightarrow A^TA = V\Sigma U^T U \Sigma V^T = V\Sigma^2V^T$ : 特征向量为 V 矩阵
$\Rightarrow AA^T= U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T$ : 特征向量为 U 矩阵
$\Rightarrow \sigma_{i} = \sqrt{\lambda_{i}} $ : 特征值和奇异值关系

Examples

特征值和特征向量计算

Steps	Calc
A 的定义，计算 $A^T$	$A = \begin{bmatrix} 0 & 1 \ 1 & 1 \ 1 & 0 \ \end{bmatrix}$ $A^T = \begin{bmatrix} 0 & 1 & 1 \ 1 & 1 & 0 \ \end{bmatrix}$
计算 $A^TA$ 与 $AA^T$	$A^TA=\begin{bmatrix} 2 & 1 \ 1 & 2 \ \end{bmatrix} $ $AA^T= \begin{bmatrix} 1 & 1 & 0 \ 1 & 2 & 1 \ 0 & 1 & 1 \ \end{bmatrix}$
计算特征值$A^TA$ $\lambda$	$(A-\lambda I)x=0 \Rightarrow \begin{vmatrix} 2-\lambda & 1 \ 1 & 2-\lambda \ \end{vmatrix} = 0 \Rightarrow (2-\lambda)^2 - 1=0 \Rightarrow \lambda_{1}=3, \lambda_{2}=1$
计算$A^TA$ 特征向量	$\lambda_{1}=3$, $\begin{bmatrix} 2 & 1 \ 1 & 2 \ \end{bmatrix}x=3x$ $\Rightarrow \begin{bmatrix} 2 & 1 \ 1 & 2 \end{bmatrix}\begin{bmatrix} a_{1} \ a_{2} \end{bmatrix}=\begin{bmatrix} 3a_{1} \ 3a_{2} \end{bmatrix}$ $\Rightarrow 2a_{1} + a_{2} = 3a_{1}, a_{1} + 2a_{2} = 3a_{2}$ $\Rightarrow a_{1} - a_{2}=0$ $\Rightarrow v_{2} = \begin{bmatrix} \frac{1}{\sqrt{2}} \ \frac{1}{\sqrt{2}} \end{bmatrix}$
↑	$\lambda_{2}=1$, $\begin{bmatrix} 2 & 1 \ 1 & 2 \ \end{bmatrix}x=1x$ $\Rightarrow \begin{bmatrix} 2 & 1 \ 1 & 2 \end{bmatrix}\begin{bmatrix} a_{1} \ a_{2} \end{bmatrix}=\begin{bmatrix} a_{1} \ a_{2} \end{bmatrix}$ $\Rightarrow 2a_{1} + a_{2} = a_{1}, a_{1} + 2a_{2} = a_{2}$ $\Rightarrow a_{1} + a_{2}=0$ $\Rightarrow v_{1} = \begin{bmatrix} -\frac{1}{\sqrt{2}} \ \frac{1}{\sqrt{2}} \end{bmatrix}$

特征向量标准化
$w^Tw=1 $
$\Rightarrow \begin{bmatrix} a_{1} & a_{2} \ \end{bmatrix} \begin{bmatrix} a_{1} \ a_{2} \end{bmatrix} = 1$
$\Rightarrow a_{1}^2 + a_{2}^2 = 1$

求解点云旋转矩阵

点云 $P \left{ p_i\right}$ 与点云 $Q \left{ q_i\right}$
1 计算两个点云质心 $p = \frac{1}{n}\sum_{i=1}^{n}p_i$ $q = \frac{1}{n}\sum_{i=1}^{n}q_i$
2 计算各点相对于质心的位移向量 $v_i^q =q_i -q$ $v_i^p =p_i -p$
3 利用质心位移向量，计算H矩阵 $H=\sum_{i=1}^{n}$
4 对H矩阵进行SVD分解 $H=UAV^T$
5 基于矩阵 U 和 V ，计算旋转矩阵 $R=UV$
6 验证结果: $det(R)=1$ 则 R 有效, $det(R)=-1$ 则算法失效？？
7 计算点云间的位移 $T=q-Rp$

模型结构可视化工具 netron

Visualizer for deep learning and machine learning models
Support model files
- ONNX (.onnx, .pb, .pbtxt)
- Keras (.h5, .keras)
- CoreML (.mlmodel)
- Caffe (.caffemodel, .prototxt)
- Caffe2 (predict_net.pb, predict_net.pbtxt)
- MXNet (.model, -symbol.json)
- PyTorch (.pth)
- Torch (.t7)
- CNTK (.model, .cntk)
- PaddlePaddle (model)
- scikit-learn (.pkl)
- TensorFlow Lite (.tflite)
- TensorFlow.js (model.json, .pb)
- TensorFlow (.pb, .meta, .pbtxt).

Install

pip install netron

Usecase

netron -b xxx.pb

http://localhost:8080/

netron xxx.onnx --port xxx_port --host xxx_host

access from http://xxx_host:xxx_port

Reference

Brief

CBIR - Content-based image retrieval - 基于内容的图像检索

Algos

Gradient Descent & Optimizer

使用梯度下降进行优化，是几乎所有优化器的核心**

梯度

向量(方向) 描述变化率
表示某一函数在该点处的方向导数沿着该方向取得最大值(即最大方向导数的方向)
函数在该点处沿着该方向变化最快，变化率最大（为该梯度的模）

梯度下降

通过寻找梯度下降的方式寻找最优点

对比

Reference

Global Average Pooling

Reference

Brief

GAP的思路就是将全连接层+dropout过程合二为一
对整个网路在结构上做正则化防止过拟合。其直接剔除了全连接层中黑箱的特征，直接赋予了每个channel实际的内别意义。
GAP可以实现任意图像大小的输入
使用GAP可能会造成收敛速度减慢。

GAP的工作原理

假设卷积层的最后输出是h × w × d 的三维特征图，具体大小为6 × 6 × 3，经过GAP转换后，变成了大小为 1 × 1 × 3 的输出值，也就是每一层 h × w 会被平均化成一个值。

AP vs MP

GAP vs GMP vs FC

GMP在本模型中表现太差，不值一提；而FC在前40次迭代时表现尚可，但到了40次后发生了剧烈变化，出现了过拟合现象（运行20次左右时的模型相对较好，但准确率不足70%，模型还是很差）；三者中表现最好的是GAP，无论从准确度还是损失率，表现都较为平稳，抗过拟合化效果明显（但最终的准确度70%，模型还是不行）。

VGG19
InceptionV3

读取 RTSP Camera

Code:

IP Camera: IPC-SR3321P-IP

#!/usr/bin/env python
# coding=utf-8
import cv2
cap = cv2.VideoCapture("rtsp://192.168.1.10:554/user=admin&password=&channel=1&stream=0.sdp?")
print (cap.isOpened())
while cap.isOpened():
        success,frame = cap.read()
        cv2.imshow("frame",frame)
        cv2.waitKey(1)

Reference

关于cv2.imread()读取图像为BGR问题

Read Images

cv2.imread
cv::ImreadModes
OpenCV 读取图像的格式为 BGR, 而其他很多工具(matplotlib/...)需要 RGB format Image, 可以使用 cv2.cvtColor() 转换

Mat cv::imread(const String & filename, int flags = IMREAD_COLOR)

retval = cv.imread( filename[, flags])

Support Formats

Windows bitmaps - *.bmp, *.dib (always supported)
JPEG files - *.jpeg, *.jpg, *.jpe (see the Note section)
JPEG 2000 files - *.jp2 (see the Note section)
Portable Network Graphics - *.png (see the Note section)
WebP - *.webp (see the Note section)
Portable image format - *.pbm, *.pgm, *.ppm *.pxm, *.pnm (always supported)
PFM files - *.pfm (see the Note section)
Sun rasters - *.sr, *.ras (always supported)
TIFF files - *.tiff, *.tif (see the Note section)
OpenEXR Image files - *.exr (see the Note section)
Radiance HDR - *.hdr, *.pic (always supported)
Raster and Vector geospatial data supported by GDAL (see the Note section)

Write Images

cv2.imwrite()

bool cv::imwrite( const String &  filename, InputArray 	img, const std::vector< int > &  params = std::vector< int >() )

retval = cv.imwrite(filename, img[, params])

BGR -> RGB

cv2.cvtColor

Converts an image from one color space to another.

Model Optimizer

Brief

MO - Model Optimizer 把各种框架训练的模型文件转换成 OpenVINO 可以识别的文件并优化模型
Support Framework : Tensorflow/MXNet/Kaldi/ONNX
- Pytorch ==> ONNX ==> IR

Parameters	Description
`--reverse_input_channels`	当输入为 3 通道图像时, 转换通道顺序 RGB -> BGR
--input_shape	输入的 Shape NCHW/NHWC
--mean_values	RGB Mean values ==> `image - mean`
--scale	归一化 ==> `image - mean) /scale`

models mean/scale value 和 Training 保持一致

UseCase

SSD_Mobilenet

ssd_mobilenet_v2_coco
ssd_mobilenet_v1_fpn_640

python3 /opt/intel/openvino/deployment_tools/model_optimizer/mo.py \
        --input_model frozen_inference_graph.pb \
        --tensorflow_use_custom_operations_config /opt/intel/openvino/deployment_tools/model_optimizer/extensions/front/tf/ssd_v2_support.json \
        --tensorflow_object_detection_api_pipeline_config pipeline.config \
        --data_type FP16\
        --output_dir FP16

YOLO

YOLO v3
- /opt/intel/openvino/deployment_tools/model_optimizer/extensions/front/tf/yolo_v3.json

python3 /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py -b 1 \
    --input_model ./frozen_darknet_yolov3_model.pb \
    --tensorflow_use_custom_operations_config ./yolo_v3.json \
    --data_type FP16\
    --output_dir ./FP16

YOLO tiny
- /opt/intel/openvino/deployment_tools/model_optimizer/extensions/front/tf/yolo_v3_tiny.json

python3 /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py -b 1 \
   --input_model ./frozen_darknet_yolov3_model.pb \
    --tensorflow_use_custom_operations_config ./yolo_v3_tiny.json \
    --data_type FP16\
    --output_dir ./tiny_FP16

Inference Engine

Reference

Brief

API & Libraries

C++ API
Python API

Core IE Library
- Linux* OS: libinference_engine.so
- Windows* OS: inference_engine.dll
Device-specific Plugin Libraries

Supported Devices

CPU
GPU
VPU
HDDL
FPGA
Heterogeneous: 优先使用第一个 Device 运行，若遇到不支持的 Layers, 则不支持的 Layers 放在第二个 Device 运行
Multi-Device: 多个 Devices 共同运行，因为一些共同资源的消耗，所以实际 performance 会比理论值略低
GNA

Uesed Plugins

PLUGIN	LIBRARY NAME FOR LINUX	DEPENDENCY LIBRARIES FOR LINUX
CPU	libMKLDNNPlugin.so	libmklml_tiny.so, libiomp5md.so
GPU	libclDNNPlugin.so	libclDNN64.so
FPGA	libdliaPlugin.so	libdla_compiler_core.so, libdla_runtime_core.so
MYRIAD	libmyriadPlugin.so	No dependencies
HDDL	libHDDLPlugin.so	libbsl.so, libhddlapi.so, libmvnc-hddl.so
GNA	libGNAPlugin.so	libgna_api.so
HETERO	libHeteroPlugin.so	Same as for selected plugins
MULTI	libMultiDevicePlugin.so	Same as for selected plugins

PLUGIN	LIBRARY NAME FOR WINDOWS	DEPENDENCY LIBRARIES FOR WINDOWS
CPU	MKLDNNPlugin.dll	mklml_tiny.dll, libiomp5md.dll
GPU	clDNNPlugin.dll	clDNN64.dll
FPGA	dliaPlugin.dll	dla_compiler_core.dll, dla_runtime_core.dll
MYRIAD	myriadPlugin.dll	No dependencies
HDDL	HDDLPlugin.dll	bsl.dll, hddlapi.dll, json-c.dll, libcrypto-1_1-x64.dll, libssl-1_1-x64.dll, mvnc-hddl.dll
GNA	GNAPlugin.dll	gna.dll
HETERO	HeteroPlugin.dll	Same as for selected plugins
MULTI	MultiDevicePlugin.dll	Same as for selected plugins

Pipeline

Steps	C++ Classes	Python Classes
读取IR	CNNNetReader CNNNetwork
指定 IO 格式	CNNNetwork
创建 IE Core 对象	Core
编译并载入 Network 到 Device	Core
设置输入 Data	ExecutableNetwork InferRequest
执行 Inference	InferRequest
获取输出	InferRequest

Benchmark Tool

Benchmark C++ Tool
Benchmark Python* Tool

ML Tasks Image Classification ResNet

Reference

2015 ResNet Deep Residual Learning for Image Recognition [paper]
2016 Identity Mappings in Deep Residual Networks [paper]
ResNet 解析
paper - Wide Residual Networks
ResNet详解
大话深度残差网络（DRN）ResNet网络原理
ResNet 分析
深度学习网络篇——ResNet
一文简述ResNet及其多种变体
极深网络（ResNet/DenseNet）: Skip Connection为何有效及其它
netscope - ResNet-50

Brief

ResNet - Residual Network - 残差网络
ResNet 提出于 2015 年，在 ImageNet 比赛 Classification 任务上获得第一名
解决的问题
- 简单地增加网络层数会导致梯度消失和梯度爆炸，表现为收敛变慢，准确率变差
- 正则化初始化和中间的正则化层(Batch Normalization) 会导致退化问题
- 精度下降并不是由过拟合导致的
网络中增加直连通道 - shortcut connection
- 让深度网络实现和浅层网络一样的性能，即让深度网络后面的层至少实现恒等映射的作用
很多其他网络借鉴了 ResNet: Alpha zero/MobileNetv2

梯度消失精度下降

Net

ResNet 101 - 101 Layers = 1 + 33 x 3 + 1

7x7x64 卷积
3x3 max poool
经过3 + 4 + 23 + 3 = 33个building block，每个block为3层，所以有33 x 3 = 99层
最后有个fc层(用于分类)

101层网络仅仅指卷积或者全连接层，而激活层或者Pooling层并没有计算在内

ResNet 34

shortcut connection

来源于 Highway Network 的**: 允许原始输入信息直接传到后面的层中
残差：观测值与估计值之间的差。

- H(x) : 观测值
- x : 估计值, 称为identity mapping
- F(x) : residual mapping
在极端情况下如果 identity mapping 就能达到最佳了, 则另 F(x) = 0

building block

ResNet34 buildingblock
ResNet50/101/152 bottleneck building block`
- 1x1x64 到 64 channel 的 featuremap
- 3x3x64 的卷积计算
- 1x1x256 恢复 256 channel
- 目的：缩小计算量，减小 training 时间

Residual Network Improve

paper - Identity Mappings in Deep Residual Networks

不同的 shortcut connections

Result

还是 original 比较好点

不同的 activation 用法

Result

pre-activation 有所提升
- 把 BN 和 ReLU 放在卷积的前面

ML Tasks Image Classification Inception

Inception

Reference

2014 Inception V1 / GoogleNet Going deeper with convolutions [paper]
2015 Inception V2 & V3 Rethinking the Inception Architecture for Computer Vision [paper]
2016 Inception V4 & Inception-ResNet Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning [paper]
一文概览Inception家族的「奋斗史」
图像分类丨Inception家族进化史「GoogleNet、Inception、Xception」

Brief

Inception V1 即 GoogLeNet
- GoogLeNet 是对 Yann LeCuns 的 LeNet 致敬
Inception V2 和 V3 是同一篇 paper 提出，各种 Improve 最终版的 V2 即 V3
Inception V4 和 Inception-ResNet 是同一篇 paper 提出

Improve

Inception v1

引入 1x1, 3x3, 5x5 Conv 并行结构 Inception Module

由于图像的突出部分可能有极大的尺寸变化

全局的信息应该使用大的内核卷积

局部的信息应该使用小内核卷积

Inception v2 & Inception v3

说明了 Inception Module 设计准则
- 避免大kernel 导致的表达瓶颈
- 高维特征更易处理
- 低维空间聚合无需担心丢失信息
- 平衡网络宽度和深度
使用了Batch Normalization
分解卷积 - Factorizing Convolutions
- 使用包括 2 个 3x3 的卷积 Mini-network 替换 5x5 的卷积
- 任意nxn的卷积都可以通过1xn卷积后接nx1 非对称卷积来替代
采用 Conv 和 Pooling 并行的方式来减小特征图大小，减少计算量
- 先 Pooling 后 Conv 会导致特征表示遇到瓶颈（特征缺失）
- 先 Conv 后 Pooling 会导致计算量很大
Label smoothing
- softmax loss过于注重使模型学习分类出正确的类别, 当在新数据集扩展时, 容易过拟合, 可以使用label的先验分布信息对其loss进行校正
RMSProp优化器

Inception v4 & Inception-ResNet

Inception v4
- 修改了Inception的Stem
- 添加了Reduction Block
Inception-ResNet v1/v2
- 添加了 ResNet

Cost Function 代价函数

Reference

代价函数(Cost function)

定义在整个训练集上面的，也就是所有样本的误差的总和的平均，也就是损失函数的总和的平均，有没有这个平均其实不会影响最后的参数的求解结果。
Total Loss

Cost Function vs Loss Function

Name	计算数据	Description
Cost Function	所有数据误差	Total Loss
Loss Function	单个数据或者 Batch数据误差	Batch Loss

Data Normalization 数据标准化

Reference

Brief

数据的标准化（normalization）是将数据按比例缩放，使之落入一个小的特定区间。

因为我们得到的样本数据中可能一个特征向量的某几个元素的值非常大，使得特征数据不在一个数量级，因此必须限定在一个合适的范围内。归一化就是为了后面数据处理的方便，其次是保证程序运行时收敛加
快。

归一化 Normalization
标准化 Normalization
正则化 Regularization

归一化和标准化的英文翻译是一致的，但是根据其用途（或公式）的不同去理解（或翻译）

线性归一化（min-max normalization）

也叫离差标准化，是对原始数据的线性变换，使结果落到[0，1]区间
该方法将某个变量的观察值减去该变量的最小值，然后除以该变量的离差，其标准化的数值落到[0,1]区间

x’=(x-min)/(max-min)

其中max为样本的最大值，min为样本的最小值。

该方法对原始数据进行线性变换，保持原始数据之间的联系，其缺陷是当有新数据加入时，可能导致max或min的变化，转换函数需要重新定义。

标准差标准化/ 零均值标准化/ z-score standardization

将某变量中的观察值减去该变量的平均数，然后除以该变量的标准差，标准化后的数据符合标准正态分布，即均值为0，标准差为1，转化函数为：

x’=(x-μ)/σ

其中μ为所有样本数据的均值，σ为所有样本数据的标准差。

该方法对离群点不敏感，当原始数据的最大值、最小值未知或离群点左右了Max-Min标准化时非常有用，Z-Score标准化目前使用最为广泛的标准化方法。

小数定标（decimal scaling）标准化

该方法通过移动数据的小数点位置来进行标准化。小数点移动多少位取决于变量取值中的最大绝对值。将某变量的原始值x使用小数定标标准化到x’的转换函数为：

x’=x/(10^j)

其中，j是满足使max(|x’|)<1成立的最小整数。假设变量X的值由-986到917，它的最大绝对值为986，为使用小数定标标准化，我们用1000（即，j=3）除以每个值，这样，-986被标准化为-0.986。

非线性归一化

Histogram 图像直方图

1D Histogram

X - 像素值
Y - 相应的像素数量

Gray	RGB

计算 & 绘制函数

cv.calcHist()
np.histogram()
np.bincount()
matplotlib.pyplot.hist()

2D Histogram

RGB -> HSV
- H - [0, 180]
- S - [0, 256]

计算方法

cv.calcHist([hsv], [0, 1], None, [180, 256], [0, 180, 0, 256])
np.histogram2d(h.ravel(),s.ravel(),[180,256],[[0,180],[0,256]])

3D Histogram

直方图均衡

将像素值范围均衡化到更大范围
改善图像的对比度

原图	cv.equalizeHist() - 全局直方图均衡	cv.createCLAHE() - 自适应直方图均衡

直方图反投影

应用 - 分割部分同色物体

Reference

Datasets Image2D Classification ImageNet

Reference

ImageNet Websit
2009 ImageNet: A Large-Scale Hierarchical Image Database@CVPR [Paper]
2014 ImageNet Large Scale Visual Recognition Challenge [Paper]

Brief

ImageNet
- ImageNet-1k - 1k Categories
- ImageNet-21k - 21k Sub-Categories
- ImageNet-22k - 22k Sub-Categories

ML AL Tools

Name	From	UI	Open Source
AIDE	Microsoft	√	√
VOTT	Microsoft	√	√
label-studio	Heartex	√	√
modAL	modAL	Jupyter notebook	√
ALiPy	NUAA-AL	×	√
PyTorch Active Learning	Robert Munro	×	√
active-learning	Google	×	√
EasyDL	Baidu	√	×
ModelArts	HuaWei	√	×

Reference

Brief

Build

Build with CPU only

git clone https://github.com/pjreddie/darknet
cd darknet
make

Build with OpenCV

Change the Makefile

vi Makefile

OPENCV=1

make

Test

./darknet imtest data/eagle.jpg

Run

wget https://pjreddie.com/media/files/yolov3.weights
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg

wget https://pjreddie.com/media/files/yolov3-tiny.weights
./darknet detect cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg

OpenVINO

Reference

Brief

Model Optimizer
Inference Engine
DevCloud
Workbench
Post-Training Optimization Toolkit
Security
OpenVINO INT8 Model
OpenVINO Compile Tool
OpenVINO Tips
Device Support
- CPU
- iGPU
- VPU
- FPGA

NASNet

Learning Transferable Architectures for Scalable Image Recognition
基于 Neural Architecture Search with Reinforcement Learning
Normal Cell - 输入和输出的特征图维度相同
Reduction Cell - 输出特征图长宽减半
NASNet Search Space
ScheduledDropPath
AO - RL + Policy Gradient

Arch

Model = Normal Cell x N + Reduction Cell x M
Normal/Reduction Cell = Block x B

在 paper 中 B = 5

Block = hidden states x 2 + Operation x 3

Arch	Blocks

Cell = `B Block` / Each Block predict 5 Steps

Step 3/4 可选 Operations	Step 5 可选 Operation / Combine two hidden states
identity	element-wise addition
1x3 then 3x1 convolution	concatenation
1x7 then 7x1 convolution
3x3 dilated convolution
3x3 average pooling
3x3 max pooling
5x5 max pooling
7x7 max pooling
1x1 convolution
3x3 convolution
3x3 depthwise-separable conv
5x5 depthwise-seperable conv
7x7 depthwise-separable conv

Search Result `NASNet-A/B/C`

Name	Description
NASNet-A	- B=5
NASNet-B	- B=4 - 最后没有Concatenate - `Layer Normalization` & `Instance Normalization`
NASNet-C	- B=4 - `Layer Normalization` & `Instance Normalization`

搜素到的相对比较好的 `Normal Cell` & `Reduction Cell` - NASNet-A

NASNet-B	NASNet-C

Layer Normalization
Instance Normalization

Reference

CVAT

Reference

Brief

CVAT - Computer Vision Annotation Tool
- CVAT is completely re-designed and re-implemented version of Video Annotation Tool
标注模式
- 标注图片
- 标注视频中的物体轨迹
- 标注物体属性
- 标注分界
Types of Shapes
- box
- polygon
- polyline
- points

Build

git clone https://github.com/opencv/cvat.git
cd cvat
docker-compose build
docker-compose up -d

http://localhost:8080

如果想从其他机器访问,并共享目录：

touch docker-compose.override.yml
vi docker-compose.override.yml

添加如下内容后，重启compose

version: "2.3"

services:
 cvat:
  environment:
   ALLOWED_HOSTS: '*'
   CVAT_SHARE_URL: "Mounted from ~/works/cvat_task_file host directory"
  ports:
   - "80:8080"
  volumes:
          - /home/serverx/works/cvat_task_file/:/home/django/share/:ro

docker-compose -f docker-compose.yml -f docker-compose.override.yml build

创建管理员用户

docker exec -it cvat bash -ic '/usr/bin/python3 ~/manage.py createsuperuser'

docker exec -it cvat bash -ic '/usr/bin/python3 ~/manage.py changepassword xxx'

Components

OpenVINO
Tensorflow
CUDA
Analytics

Build all components

docker-compose -f docker-compose.yml \
  -f components/tf_annotation/docker-compose.tf_annotation.yml \
  -f components/analytics/docker-compose.analytics.yml \
  -f components/cuda/docker-compose.cuda.yml \
  -f components/openvino/docker-compose.openvino.yml  \
  -f docker-compose.override.yml build

Run the docker container

docker-compose -f docker-compose.yml \
  -f components/tf_annotation/docker-compose.tf_annotation.yml \
  -f components/analytics/docker-compose.analytics.yml \
  -f components/cuda/docker-compose.cuda.yml \
  -f components/openvino/docker-compose.openvino.yml \
  -f docker-compose.override.yml  up -d

Tips

1 CVAT 创建任务时可选 Image quality

默认 quality 50 ，最大值 95
最大值95 是由 pillow 决定的, pillow open jpeg 时只能设置 1 - 95 的quality, 默认是75

ML Tasks Image Classification VGGNet

Reference

2014 VGGNet Very Deep Convolutional Networks for Large-Scale Image Recognition [paper] [code)
VGGNet网络结构
神经网络中卷积层的堆叠
卷积神经网络之VGG
tensorflow（八）tensorflow加载VGG19模型数据并可视化每一层的输出
经典神经网络 VGG 论文解读
VGG16学习笔记

Brief

VGGNet
- 牛津大学 Visual Geometry Group
- 2014年ImageNet亚军
基于 AlexNet 改进
- 使用了同样大小的33卷积核尺寸和22最大池化尺寸，网络结果简洁。
VGGNet最明显的改进就是降低了卷积核的尺寸，增加了卷积的层数。
VGG中使用的都是3×3卷积核，并且使用了连续多个卷积层。

小池化核，使用的是2×2
通道数更多，特征度更宽
层数更深
全连接转卷积

Arch

VGG16包含16层，VGG19包含19层。一系列的VGG在最后三层的全连接层上完全一样，整体结构上都包含5组卷积层，卷积层之后跟一个MaxPool。所不同的是5组卷积层中包含的级联的卷积层越来越多。

VGG16 : 16 weight layer (D)
VGG19 : 19 weight layer (E)

VGG16

VGG16 是由13个卷积层+3个全连接层叠加而成。

keras -> vgg16

import keras
model = keras.applications.vgg16.VGG16(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)
model.summary()

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000   
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________

VGG19

膨胀腐蚀

Reference

Brief

Name	Description	Examples
腐蚀	侵蚀前景对象的边界 `白色/高值` => `黑色/0`	=>
膨胀	增加前景对象的边界 `黑色/0` => `白色/高值`	=>
开运算	腐蚀后膨胀
闭运算	膨胀后腐蚀
形态梯度	膨胀腐蚀差值
顶帽	原图像和开运算的差值
黑帽	原图像和闭运算的差值

腐蚀

#coding=utf-8
import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('t.png', 0)
#img = cv2.cvtColor(oimg, cv2.COLOR_BGR2RGB)

m1 = plt.imshow(img)
m1.set_cmap('gray')
plt.show()

kernel = np.ones((5, 5), np.uint8)
erosion = cv2.erode(img, kernel)  # 腐蚀
fig = plt.figure()
m1 = plt.imshow(erosion)
m1.set_cmap('gray')
plt.show()

kernel = np.ones((10, 10), np.uint8)
erosion = cv2.erode(img, kernel)  # 腐蚀
fig = plt.figure()
m1 = plt.imshow(erosion)
m1.set_cmap('gray')
plt.show()

原图	5x5 腐蚀	10x10 腐蚀

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (10, 10))  # 矩形结构
erosion = cv2.erode(img, kernel)  # 腐蚀
fig = plt.figure()
m1 = plt.imshow(erosion)
m1.set_cmap('gray')
plt.show()

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (10, 10))  # 椭圆结构
erosion = cv2.erode(img, kernel)  # 腐蚀
fig = plt.figure()
m1 = plt.imshow(erosion)
m1.set_cmap('gray')
plt.show()

kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (10, 10))  # 十字形结构
erosion = cv2.erode(img, kernel)  # 腐蚀
fig = plt.figure()
m1 = plt.imshow(erosion)
m1.set_cmap('gray')
plt.show()

矩形结构腐蚀	椭圆结构腐蚀	十字结构腐蚀

膨胀

kernel = np.ones((5, 5), np.uint8)
dilation = cv2.dilate(img, kernel)  # 膨胀
fig = plt.figure()
m1 = plt.imshow(dilation)
m1.set_cmap('gray')
plt.show()

kernel = np.ones((10, 10), np.uint8)
dilation = cv2.dilate(img, kernel)  # 膨胀
fig = plt.figure()
m1 = plt.imshow(dilation)
m1.set_cmap('gray')
plt.show()

原图	5x5 膨胀	10x10 膨胀

开运算 & 闭运算

开运算 - 先腐蚀后膨胀（因为先腐蚀会分开物体，这样容易记住），其作用是：分离物体，消除小区域。
闭运算 - 先膨胀后腐蚀（先膨胀会使白色的部分扩张，以至于消除/"闭合"物体里面的小黑洞，所以叫闭运算）

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))  # 定义结构元素

img = cv2.imread('t.png', 0)
plt.imshow(img, cmap='gray')
plt.show()

opening = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)  # 开运算
fig = plt.figure()
plt.imshow(opening, cmap='gray')
plt.show()


closing = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)  # 闭运算
fig = plt.figure()
plt.imshow(closing, cmap='gray')
plt.show()

原图	开运算	闭运算

形态学梯度/顶帽/黑帽

img = cv2.imread('t.png', 0)
plt.imshow(img, cmap='gray')
plt.show()

gradient = cv2.morphologyEx(img, cv2.MORPH_GRADIENT, kernel)
plt.imshow(gradient, cmap='gray')
plt.show()

tophat = cv2.morphologyEx(img, cv2.MORPH_TOPHAT, kernel)
plt.imshow(tophat, cmap='gray')
plt.show()

blackhat = cv2.morphologyEx(img, cv2.MORPH_BLACKHAT, kernel)
plt.imshow(blackhat, cmap='gray')
plt.show()

原图	形态学梯度	顶帽	黑帽

Metrics

IoU - Intersection over Union
TP/FP/FN
Precision
Recall
TPR - True Positive Rate
FPR - False Positive Rate
Accuracy
AP - Average Precision
mAP - Mean Average Precison
AR - Average Recall
GIoU
Image/Object Level

Object Detection Metrics History

graph TD
    O(Object Detection \n Metrics History)--> |PASCAL VOC 2005| V1(TPR / FPR)
    V1 --> |PASCAL VOC 2007| V2("11 Point Interpolation AP \n (IoU=0.5)")
    V2 --> |PASCAL VOC 2010| V3("All Point Interpolation AP \n (IoU=0.5)")
    V3 --> |MS COCO 2014| V4(" 101 Point Interpolation AP\n([email protected] mAP@[0.5:0.05:0.95])")

TP/FP/FN

根据 IoU 的值确定 TP/FP/FN

Precision

$\huge Precision = \frac{TP}{TP+FP} = \frac{TP}{\text{Total Predictions}}$

识别为正例的样例中，识别正确的比例

Recall

$\huge Recall = \frac{TP}{TP+FN} = \frac{TP}{\text{Total Ground Truths}}$

样本所有正例中，识别正确的比例
也被称作查全率，即检测出来的样本/实际应该检测出的样本。

Tools

Open-Source Visual Interface for Object Detection Metrics

Reference

PascalVOC

2005 - 2012 Challenges - Detection/Segmentation
VOC 2012
- Train/Validation Data (1.9 GB)
- Test Data (1.8 GB)
- 20 classes
VOC 2007
- Train/Validation Data (439 MB)
- Test Data With Annotations (431 MB)
- 20 classes
Pascal SBD

Tasks

Classification
Detection
- Person Layout
Segmentation
Action Classification
- Boxless Action Classification
Large Scale Recognition

Datasets Downloads

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_11-May-2012.tar

Details

Year	Statistics	New developments	Notes
2005	Only 4 classes: bicycles, cars, motorbikes, people. Train/validation/test: 1578 images containing 2209 annotated objects.	Two competitions: classification and detection	Images were largely taken from exising public datasets, and were not as challenging as the flickr images subsequently used. This dataset is obsolete.
2006	10 classes: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep. Train/validation/test: 2618 images containing 4754 annotated objects.	Images from flickr and from Microsoft Research Cambridge (MSRC) dataset	The MSRC images were easier than flickr as the photos often concentrated on the object of interest. This dataset is obsolete.
2007	20 classes:Person: personAnimal: bird, cat, cow, dog, horse, sheepVehicle: aeroplane, bicycle, boat, bus, car, motorbike, trainIndoor: bottle, chair, dining table, potted plant, sofa, tv/monitorTrain/validation/test: 9,963 images containing 24,640 annotated objects.	Number of classes increased from 10 to 20Segmentation taster introducedPerson layout taster introducedTruncation flag added to annotationsEvaluation measure for the classification challenge changed to Average Precision. Previously it had been ROC-AUC.	This year established the 20 classes, and these have been fixed since then. This was the final year that annotation was released for the testing data.
2008	20 classes. The data is split (as usual) around 50% train/val and 50% test. The train/val data has 4,340 images containing 10,363 annotated objects.	Occlusion flag added to annotationsTest data annotation no longer made public.The segmentation and person layout data sets include images from the corresponding VOC2007 sets.
2009	20 classes. The train/val data has 7,054 images containing 17,218 ROI annotated objects and 3,211 segmentations.	From now on the data for all tasks consists of the previous years' images augmented with new images. In earlier years an entirely new data set was released each year for the classification/detection tasks.Augmenting allows the number of images to grow each year, and means that test results can be compared on the previous years' images.Segmentation becomes a standard challenge (promoted from a taster)	No difficult flags were provided for the additional images (an omission).Test data annotation not made public.
2010	20 classes. The train/val data has 10,103 images containing 23,374 ROI annotated objects and 4,203 segmentations.	Action Classification taster introduced.Associated challenge on large scale classification introduced based on ImageNet.Amazon Mechanical Turk used for early stages of the annotation.	Method of computing AP changed. Now uses all data points rather than TREC style sampling.Test data annotation not made public.
2011	20 classes. The train/val data has 11,530 images containing 27,450 ROI annotated objects and 5,034 segmentations.	Action Classification taster extended to 10 classes + "other".	Layout annotation is now not "complete": only people are annotated and some people may be unannotated.
2012	20 classes. The train/val data has 11,530 images containing 27,450 ROI annotated objects and 6,929 segmentations.	Size of segmentation dataset substantially increased.People in action classification dataset are additionally annotated with a reference point on the body.	Datasets for classification, detection and person layout are the same as VOC2011.

Reference

Datasets Image2D Classification mnist

Mnist

美国国家标准与技术研究所, National Institute of Standards and Technology (NIST).
250 个不同人手写的数字构成, 其中 50% 是高中学生, 50% 来自人口普查局 (the Census Bureau) 的工作人员

Samples

Download

Datasets	Description	Images
train-images-idx3-ubyte.gz	Training images (9.9 MB, 解压后 47 MB)	60000
train-labels-idx1-ubyte.gz	Training labels(29 KB, 解压后 60 KB)	60000
train-images-idx3-ubyte.gz	Test images (1.6 MB, 解压后 7.8 MB)	10000
t10k-labels-idx1-ubyte.gz	Test labels(5KB, 解压后 10 KB)	10000

Format

图像数据保存在二进制文件中
图像的宽高为28*28
MSB First.

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000801(2049) magic number (MSB first) 
0004     32 bit integer  60000            number of items 
0008     unsigned byte   ??               label 
0009     unsigned byte   ??               label 
........ 
xxxx     unsigned byte   ??               label
The labels values are 0 to 9.

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000803(2051) magic number 
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns 
0016     unsigned byte   ??               pixel 
0017     unsigned byte   ??               pixel 
........ 
xxxx     unsigned byte   ??               pixel

Reference

GStreamer 流媒体应用框架

Gstreamer 是一个通用跨平台的流媒体应用框架, 可以处理 Audio/Video/其他数据流
Gstreamer Install
Command Line Tools
Gstreamer Concepts
Gstreamer Plugins

Command Line Tools

Tools	Description
gst-inspect-1.0	查看 elements 信息 - src/sink/pad/Capabilities/...
gst-launch-1.0	创建 pipeline
[gst-device-monitor-1.0]	查看当前设备上的 `Device`
gst-discoverer-1.0	查看 media 相关信息 - codec/Channels/Sample rate/Bitrate/...
ges-launch-1.0	控制 timeline 开始时间/间隔/...

Arch

Pipeline

数据流通的定义(定义组件的顺序)
source - file/http/rtp/...
demux - 分离数据 audio/video/...
decoder - xxx format -> yuv/pcm
sink - 数据输出

Reference

VIA - VGG Image Annotator

Reference

Brief

VIA - VGG Image Annotator

Download

wget https://gitlab.com/vgg/via/-/archive/via-2.0.5/via-via-2.0.5.zip

UseCase

Start

Open the file via.html to start the app.

Attributes

Brisk 特征点匹配

brisk = cv2.BRISK_create()
kp = brisk.detect(img, None)
kp, des = brisk.compute(img, kp)
out_img = img.copy()
out_img = cv2.drawKeypoints(img, kp, out_img)
fig = plt.figure(figsize=(5, 5))
plt.imshow(out_img)

BF Match

matcher = cv2.BFMatcher()
matches = matcher.match(des, des_30)
out_img = cv2.drawMatches(img, kp, bk30_img, kp_30, matches[0:5], out_img,flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
plt.figure(figsize=(25, 15))
plt.imshow(out_img)

Knn match


FLANN_INDEX_LSH = 6
index_params = dict(algorithm=FLANN_INDEX_LSH,
                                 table_number=6,
                                 key_size=12,
                                 multi_probe_level=1)
search_params = dict(checks=100)
flann = cv2.FlannBasedMatcher(index_params, search_params)
knn_matches = flann.knnMatch(des, des_30, k=2)

good_matches = []
lowe_ratio_test = 0.3
min_match_count = 10

for m, n in knn_matches:
    if m.distance < n.distance * lowe_ratio_test:
        good_matches.append(m)

    if len(good_matches) > min_match_count:
        src_pts = np.float32([kp[m.queryIdx].pt for m in good_matches]).reshape(-1, 1, 2)
        dst_pts = np.float32([kp_30[m.trainIdx].pt for m in good_matches]).reshape(-1, 1, 2)
    else:
        src_pts = None
        dst_pts = None
M, mask = cv2.findHomography(src_pts, dst_pts, method=cv2.RANSAC, ransacReprojThreshold=4.0)
matches_mask = mask.ravel().tolist()

        # Apply homography matrix.
h, w, c = img.shape
# ref image
pts = np.float32([[0, 0], [0, h - 1], [w - 1, h - 1], [w - 1, 0]]).reshape(-1, 1, 2)
# test image
dst = cv2.perspectiveTransform(pts, M)

test_img = cv2.polylines(img=img_30, pts=[np.int32(dst)], isClosed=True,
                                 color=255, thickness=3, lineType=cv2.LINE_AA)

img_matches = np.empty(
            shape=(max(img.shape[0], img_30.shape[0]),
                   img.shape[1] + img_30.shape[1],
                   3),
            dtype=np.uint8)
out_img = cv2.drawMatches(img, kp,
                        test_img, kp_30,
                        matches1to2=good_matches,
                        outImg=img_matches,
                        matchesMask=matches_mask,
                        flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)

plt.figure(figsize=(10, 10))
plt.imshow(out_img)

Rotate

im30 =  cv2.imread('t30.png',3)
(kp_i30, des_i30) = brisk.detectAndCompute(im30, None)
bk30_img = im30.copy()
o30_img = im30.copy()
o30_img = cv2.drawKeypoints(bk30_img, kp_i30, o30_img)
plt.figure(figsize=(15, 10))
plt.imshow(o30_img)


# test image
points = np.int32(dst).reshape(4, 2)
rect = np.zeros((4, 2), dtype="float32")
rect[0], rect[1], rect[2], rect[3] = points[0], points[3], points[2], points[1]
# ref image
destination = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype="float32")
# homography matrix
h_mat = cv2.getPerspectiveTransform(rect, destination)
frame_wrap = cv2.warpPerspective(src=im30, M=h_mat, dsize=(w, h))
# test image overlay
frame_overlay = frame_wrap.copy()

plt.figure(figsize=(15, 10))
plt.imshow(frame_overlay)

Reference

Brief

LabelImg is a graphical image annotation tool.
保存格式
- Pascal VOC format
- YOLO fomat

Run on Ubuntu 16.04

git clone https://github.com/tzutalin/labelImg.git
cd labelimg
sudo apt-get install pyqt4-dev-tools
sudo pip install lxml
make qt4py2
python labelImg.py
python labelImg.py [IMAGE_PATH] [PRE-DEFINED CLASS FILE]

Saved xml file

<annotation>
	<folder>demo</folder>
	<filename>demo.jpg</filename>
	<path>/home/xxx/labelImg/demo/demo.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>576</width>
		<height>324</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>flowers</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>136</xmin>
			<ymin>102</ymin>
			<xmax>292</xmax>
			<ymax>259</ymax>
		</bndbox>
	</object>
</annotation>

Tips

errors `no module named libs.resources`

Pyrcc5 -o resources.py resources.qrc
cp resources.py libs/

Media Tools GStreamer gstlaunch

gst-launch

参数为 pipeline, 使用特定字符串描述

Pipeline

使用 ! 分隔 Element
可以创建多分支 Pipeline

gst-launch-1.0 playbin uri=file:///home/xxx/xxx.mp4

gst-launch-1.0 videotestsrc ! videoconvert ! autovideosink

gst-launch-1.0 videotestsrc pattern=11 ! videoconvert ! autovideosink

gst-launch-1.0 videotestsrc ! videoconvert ! tee name=t ! queue ! autovideosink t. ! queue ! autovideosink

Reference

gstreamer基础教程10-GStreamer tools

ML Tasks Image Segmentation Instance

Instance Segmentation 实例分割

Reference

Instance Segmentation入门总结
进展综述 - 单阶段实例分割（Single Shot Instance Segmentation）
Training Oxford-IIIT Pets Dataset with tensorflow/models junxnone/tio#253
Tensorflow/models - Run an Instance Segmentation Model
instance_segmentation - opencv/openvino_training_extensions
PaddleDetection
detectron2 - facebookresearch
- Tutorial.ipynb - colab

Brief

分类

Name	Description	example
Top-Down	`目标检测` -> `BBox` -> `语义分割` -> `Instance Seg`	Mask R-CNN
Bottom-Up	`语义分割` -> `聚类&度量学习` -> `Instance Seg`	Discriminative Loss Function
Single Shot		YOLACT SOLO PolarMask AdaptIS BlendMask

History

Date	Name /paper
2016.3	InstanceFCN - Instance-sensitive Fully Convolutional Networks
2016.11	FCIS - Fully Convolutional Instance-aware Semantic Segmentation
2019.4	YOLACT/YOLACT++ - You Only Look At CoefficienTs
2019.11	CenterMask : Real-Time Anchor-Free Instance Segmentation
2019.12	SOLO: Segmenting Objects by Locations
2020	BlendMask - Top-Down Meets Bottom-Up for Instance Segmentation

Metrics

ML Tasks Image Segmentation UNet

Reference

Brief

unet 网络用于图像分割任务
基于 FCN , 扩大了网络框架，使其能够使用很少的训练图像就得到很精确的分割结果。

上采样部分有大量的特征通道，允许网络将上下文信息传播到具有更高分辨率的层
网络没有任何全连接层，仅使用每个卷积的有效部分，即分割映射仅包含在输入图像中可获得完整上下文的像素。该策略允许通过重叠图像区策略无缝分割任意大小的图像
使用弹性变形( elastic deformations)来进行数据增强
使用加权损失，其中接触单元之间的分离背景标签在损失函数中获得较大的权重。

网络结构

part1 - 下采样，池化, 不同的卷积，学习深层次的特征
part2 - 上采样, 反卷积, 恢复原图大小
最后会使用 sigmoid 或 softmax 来实现二分类或多分类
输出为一张像素级的概率图或者概率矩阵

MobileNet-UNet

MobileNet 最为backbone产生feature map
UNet UpSampling 部分 + sigmoid 或 softmax 来实现二分类或多分类

Vanilla-Unet - 即原版Unet

看到 divamgupta/image-segmentation-keras 中使用了vanilla_encoder, 然后查到一个英文梗：

vanilla 的原意是香草，而香草口味的冰淇淋被吐槽和原味没什么区别，所以香草就有了原汁原味的意思

UNet vs FCN

U-Net采用了与FCN不同的特征融合方式
- FCN 中 Pool4/Pool3 中的feature map 是通过逐点相加(Add)的方式进行特征融合的
- U-Net使用Concat Layer将特征在channel维度拼接在一起，形成更“厚”的特征

Image Augmentation

3 x 3 grid smooth deformations
- random displacement vectors
- Gaussian distribution
- bicubic interpolation 方法计算位移

Test Result

RL 强化学习

RL - Reinforcement Learning - 强化学习
- 不断的重复、不断强化认知，从错误中学习而变强

定义

Agent - 智能体 作出智能决策的主体，RL的学习者
Environment - 环境能够被感知的外部系统
State - 状态马尔科夫状态 - 历史信息总结 - 对世界的完整描述，不会隐藏世界的信息
Observation - 观察对状态的部分描述，可能会遗漏一些信息
Action - 动作
- Action Space - 在给定的环境中，有效动作的集合经常被称为动作空间
  - 离散动作空间(discrete action spaces)
  - 连续动作空间(continuous action spaces)
Reward - 奖励环境根据状态和智能体采取的动作，产生一个标量信号作为奖励反馈
Policy - 策略智能体用于决定下一步执行什么行动的规则
Return - Cumulated Future Reward
- Discounted Return - 折扣汇报 - cumulative discounted future reward

当环境可被完全观察到时，强化学习问题被称为马尔可夫决策过程（markov decision process）

当状态不依赖于之前的操作时，我们称该问题为上下文赌博机（contextual bandit problem）

当没有状态，只有一组最初未知回报的可用动作时，这个问题就是经典的多臂赌博机（multi-armed bandit problem）。

Algos

按照环境是否已知划分
- Model-based 有环境模型 去学习和理解环境，学会用一个模型来模拟环境，通过模拟的环境来得到反馈,基于之前学习的知识
- Model-free 没有环境模型 不去学习和理解环境，环境给出什么信息就是什么信息, 重新探索，不使用之前经验
按照学习方式划分
- On-Policy agent必须本人在场，并且一定是本人边玩边学习
- Off-Policy agent可以选择自己玩，也可以选择看着别人玩，通过看别人玩来学习别人的行为准则
按照学习目标划分
- Policy-based 查找状态对应下一步所有动作的概率，根据概率来选取动作适用于非连续和连续的动作
  - 每个 State 寻找最佳 Action , 例如选择 State 对应的 Action - 上下左右的概率最大的那个
- Value-based 输出的是动作的价值，选择价值最高的动作。适用于非连续的动作
  - 每个 State 对应价值，选择能够达到更大价值的 State 的 Action
Actor Critic 结合了Value-Based和Policy-Based

Algos	Description
MDP
PG	Policy Gradient
Q-Learning
DQN
SARSA	State-Action-Reward-State-Action
DDPG	Deep Deterministic Policy Gradient
TRPO	Trust Region Policy Optimization
PPO	Proximal Policy Optimization

Platform

Name	Description
OpenAI Gym & Universe
DeepMind lab

应用

机器人控制
自动驾驶
游戏
交易系统
对话系统

Reference

Books

Semi-Supervised Learning 半监督学习

Reference

Brief

收集样本数据容易，但是给每个样本打标签成本就很高。因此出现了Semi-supervised Learning. 不同与 active learning，让学习过程不依赖外界的咨询交互，自动利用未标记样本所包含的分布信息的方法便是半监督学习（semi-supervised learning），即训练集同时包含有标记样本数据和未标记样本数据。

分类

生成式方法（generative methods）
S3VM/TSVM
图半监督学习

Active Learning vs Semi-Supervised Learning

Name	Active Learning	Semi-Supervised Learning
人工标注	✅	❌
挑选高价值样例	✅	✅
优点		利用了未标注样本
缺点		引入了噪声样本
Pipeline

Pytorch

History

Year	Description
~	基于Torch
2017	由 Facebook 发布
2018	兼并 Caffe2

Reference

PyTorch 官方 [Docs] [Github]
Awesome-PyTorch-Chinese
Awesome-Pytorch-list
pytorch-handbook
PyTorch学习资源汇总
the-incredible-pytorch
Neural Network Programming - Deep Learning with PyTorch - deeplizard.com
Docs CN
- Apachecn
- pytorch1.0-cn - fendouai
Videos
- PyTorch深度学习(系列)教程 - bilibili Videos
- 深度学习与PyTorch入门实战 - bilibili Videos
深入浅出PyTorc

Datasets Image2D Classification Cifar

Cifar

由加拿大高级研究所（Canadian Institute for Advanced Research）收集和整理的
Cifar10 常见的 10种物体
Cifar100 更精细的 100 类物体

Cifar10

60000 张 32x32 彩色图片
10 个种类，每种 6000 张图片
50000 用于 Training，10000 张用于 Test

Cifar100

总体有 20 类，20 类细分为 100 子类
100 个子类，每类 600 张图片
每类 500 张 Training，100 张 Test
fine 指子类，coarse 指父类

Superclass	Classes
aquatic mammals	beaver, dolphin, otter, seal, whale
fish	aquarium fish, flatfish, ray, shark, trout
flowers	orchids, poppies, roses, sunflowers, tulips
food containers	bottles, bowls, cans, cups, plates
fruit and vegetables	apples, mushrooms, oranges, pears, sweet peppers
household electrical devices	clock, computer keyboard, lamp, telephone, television
household furniture	bed, chair, couch, table, wardrobe
insects	bee, beetle, butterfly, caterpillar, cockroach
large carnivores	bear, leopard, lion, tiger, wolf
large man-made outdoor things	bridge, castle, house, road, skyscraper
large natural outdoor scenes	cloud, forest, mountain, plain, sea
large omnivores and herbivores	camel, cattle, chimpanzee, elephant, kangaroo
medium-sized mammals	fox, porcupine, possum, raccoon, skunk
non-insect invertebrates	crab, lobster, snail, spider, worm
people	baby, boy, girl, man, woman
reptiles	crocodile, dinosaur, lizard, snake, turtle
small mammals	hamster, mouse, rabbit, shrew, squirrel
trees	maple, oak, palm, pine, willow
vehicles 1	bicycle, bus, motorcycle, pickup truck, train
vehicles 2	lawn-mower, rocket, streetcar, tank, tractor

Issues

使用 tensorflow keras api `tensorflow.keras.datasets.cifar10.load_data()` 下载数据集时特别慢

手动下载数据集，并复制为 ~/.keras/datasets/cifar-10-batches-py.tar.gz

Reference

CIFAR Dataset
- cifar10 in github - YoongiKim
keras classification sample code
- cifar10_cnn.py
- cifar10_cnn_tfaugment2d.py
pytorch tutorial cifar10 classification
- pytorch colab notebook

ML Tasks Image Detection RCNNs

RCNN VS Fast RCNN VS Faster RCNN

Reference

Diff

Pipeline
R-CNN

Fast R-CNN

Faster R-CNN

RCNN

1.在图像中确定约1000-2000个候选框 (使用选择性搜索)
2. 每个候选框内图像块缩放至相同大小，并输入到CNN内进行特征提取
3.对候选框中提取出的特征，使用分类器判别是否属于一个特定类
4.对于属于某一特征的候选框，用回归器进一步调整其位置

Fast RCNN

1.在图像中确定约1000-2000个候选框 (使用选择性搜索)
2.对整张图片输进CNN，得到feature map
3.找到每个候选框在feature map上的映射patch，将此patch作为每个候选框的卷积特征输入到SPP layer和之后的层
4.对候选框中提取出的特征，使用分类器判别是否属于一个特定类
5.对于属于某一特征的候选框，用回归器进一步调整其位置

Faster RCNN

1.对整张图片输进CNN，得到feature map
2.卷积特征输入到RPN，得到候选框的特征信息
3.对候选框中提取出的特征，使用分类器判别是否属于一个特定类
4.对于属于某一特征的候选框，用回归器进一步调整其位置

Reference

Brief

卷积可视化工具 Interactive convnet features visualization for Keras

Install

pip install quiver_engine

If you want the latest version from the repo

pip install git+git://github.com/keplr-io/quiver.git

Usecase

Build your model in keras

model = Model(...)

Launch the visualization dashboard with 1 line of code

quiver_engine.server.launch(model, classes=['cat','dog'], input_folder='./imgs')

Explore layer activations on all the different images in your input folder.

Example

keras_mobilenet_quiver.py

import keras.applications as apps
from quiver_engine.server import launch

#model = apps.vgg16.VGG16()
model = apps.mobilenet.MobileNet()
launch(model, input_folder="./data")

mkdir -p data
mkdir -p tmp

copy your image to the data folder

python keras_mobilenet_quiver.py

in your brower:
localhost:5000
or
your_ip:5000

images

N.B. quiver_engine.server 中有个 gevent.wsgi 需要替换为 gevent.pywsgi

ML Tasks Image Retrieval CBIR TripletNetwork

Triplet Network

Reference

Brief

Triplet Network是Siamese Network的一种延伸，要解决的问题与Siamese Network的基本一致。
将图像映射到某个特征空间中，其中两幅图像输入CNN得到的特征向量之间的欧式距离即为相似度。
与Siamese Network不同的是，Triplet Network采用三个样本为一组：一个参考样本，一个同类样本，一个异类样本。
在contrastive loss的基础之上构建了一个新的loss函数，就是保持类内和类间距离有一个距离限制（margin）。

	- Tripelet Network由3个相同的前馈神经网络（彼此共享参数）组成 - 每次输入三个样本, 网络会输出两个值: 候选样本与同类样本，候选样本与异类样本，在embedding层的特征向量的L2距离 - 假设输入为: x, 候选样本: x− 异类样本: x+ - 这个网络对x− 和x+ 相对于x 的距离进行了编码

Triplet loss

Triplet Loss，即三元组损失，用于训练差异性较小的数据集，数据集中标签较多，标签的样本较少。输入数据包括锚（Anchor）示例⚓️、正（Positive）示例和负（Negative）示例，通过优化模型，使得锚示例与正示例的距离小于锚示例与负示例的距离，实现样本的相似性计算。其中锚示例是样本集中随机选取的一个样本，正示例与锚示例属于同一类的样本，而负示例与锚示例属于不同类的样本。

triplet loss的优势在于细节区分，即当两个输入相似时，triplet loss能够更好地对细节进行建模，相当于加入了两个输入差异性差异的度量，学习到输入的更好表示，从而在上述两个任务中有出色的表现。当然，triplet loss的缺点在于其收敛速度慢，有时不收敛。

an anchor(基准正例)
a positive of the same class as the anchor （正例）
a negative of a different class （负例）
两个具有同样标签的样本，他们在新的编码空间里距离很近。
两个具有不同标签的样本，他们在新的编码空间里距离很远。

L=max(d(a,p)−d(a,n)+margin,0)

最终的优化目标是拉近 a, p 的距离，拉远 a, n 的距离
easy triplets: L=0 即 d(a,p)+margin<d(a,n)，这种情况不需要优化，天然a, p的距离很近， a, n的距离远
hard triplets: d(a,n)<d(a,p), 即a, p的距离远
semi-hard triplets: d(a,p)<d(a,n)<d(a,p)+margin, 即a, n的距离靠的很近，但是有一个margin

卷积在 DL 中的应用

Reference

2016 A guide to convolution arithmetic for deep learning [paper] [Code]
直观理解深度学习卷积部分
如何理解深度学习中的卷积？
理解图像卷积操作的意义
paper - Multi-scale context aggregation by dilated convolutions
Multi-Scale Context Aggregation by Dilation Convolutions
对深度可分离卷积、分组卷积、扩张卷积、转置卷积（反卷积）的理解
卷积感受野计算
关于感受野的理解与计算

Brief

深度学习领域的卷积本质上是互相关 Cross-correlation, 过滤器不反转, 逐元素乘法和加法
卷积类型
- 2D/3D 卷积
- 转置卷积 - Transposed convolution
- 扩张卷积 - Dilated Convolutions
- 可分卷积
卷积核
卷积的用途
- 提取图像 feature
- UpSample/DownSample

2D 卷积


No padding, no strides	Arbitrary padding, no strides	Half padding, no strides	Full padding, no strides

No padding, strides	Padding, strides	Padding, strides (odd)

3D 卷积

转置卷积 - Transposed convolution

又名 去卷积/反卷积
用途
- Sementation 中 UpSampling 上采样恢复原图大小


No padding, no strides, transposed	Arbitrary padding, no strides, transposed	Half padding, no strides, transposed	Full padding, no strides, transposed

No padding, strides, transposed	Padding, strides, transposed	Padding, strides, transposed (odd)

扩张卷积 - Dilated Convolutions

又名 膨胀卷积/空洞卷积/Atrous Convolutions(DeepLab)
在标准的卷积核中注入空洞来增加感受野(使核膨胀)
多一个参数 dilation rate：卷积核的点的间隔数量(扩张率表示我们希望将核加宽的程度)
用于解决反卷积中丢失部分空间信息的问题


No padding, no stride, dilation

可分卷积

空间可分卷积
深度可分卷积

depthwise convolution & pointwise convolution

使用典型例子见 MobileNet

standard convolution

depthwise separable convolution

pointwise convolution

depthwise convolution 和 pointwise convolution 拼接起来，就是一个 depthwise separable convolution

分组卷积

不同的卷积核的意义

为什么大部分基于卷积的深层网络都在用小卷积核？

ImageNet ILSVRC 2014 上，VGGNet paper 中提出：

This (stack of three 3 × 3 conv layers) can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).

7 x 7 的卷积层的正则等效于 3 个 3 x 3 的卷积层的叠加

这样的设计的优点：

大幅度的减少参数
引入更多非线性，提高决策函数判决力

为什么CNN需要固定输入大小？

卷积层和池化层的输出尺寸都是和输入尺寸相关的，它们的输入是不需要固定图片尺寸的，真正需要固定尺寸的是最后的全连接层。
由于FC层的存在，普通的CNN通过固定输入图片的大小来使得全连接层输入固定。

解决方案

1. resize/crop 到固定大小
1. spatial pyramid pooling layer

Reference

Brief

通过算法获取待检测物体的 BBox

Algos

传统方法
- Cascade + Haar
- SIFT
- SVM + HOG
- DPM
Two Stage
- R-CNN
- Fast R-CNN
- Faster R-CNN
- SPP-net
- R-FCN
One Stage
- YOLO
  - YOLO V1
  - YOLO V2
  - YOLO V3
  - YOLO tiny
  - YOLO SPP
  - YOLOv4
  - YOLOv5
- SSD
- DenseBox
+RNN
- RRC
- Deformable CNN

优化方向

backbone - 特征提取
Loss
NMS
Anchor

History

直方图均衡

直方图均衡化是图像处理领域中利用图像直方图对对比度进行调整的方法。通过这种方法，亮度可以更好地在直方图上分布。这样就可以用于增强局部的对比度而不影响整体的对比度，直方图均衡化通过有效地扩展常用的亮度来实现这种功能。

直方图均衡经典算法对整幅图像的像素使用相同的变换，对于像素值分布比较均衡的图像来说，经典算法的效果不错。但是如果图像中包括明显亮的或者暗的区域，在这些部分的对比度并不能得到增强。

AHE - 自适应直方图均衡化(Adaptive histgram equalization)
CLAHE - 限制对比度自适应直方图均衡（Contrast Limited Adaptive histgram equalization)

AHE

AHE算法与经典算法的不同点在于它通过计算图像多个局部区域的直方图，并重新分布亮度，以此改变图像对比度。所以，该算法更适合于提高图像的局部对比度和细节部分。不过呢，AHE存在过度放大图像中相对均匀区域的噪音的问题。

CLAHE

CLAHE在AHE的基础上，对每个子块直方图做了限制，很好的控制了AHE带来的噪声。
CLAHE与AHE的不同主要在于其对于对比度的限幅，在CLAHE中，对于每个小区域都必须使用对比度限幅，用来克服AHE的过度放大噪音的问题。
在计算CDF前，CLAHE通过用预先定义的阈值来裁剪直方图以达到限制放大倍数的目的。该算法的优势在于它不是选择直接忽略掉那些超出限幅的部分，而是将这些裁剪掉的部分均匀分布到直方图的其他部分。

UseCase

import numpy as np
import cv2
import matplotlib.pyplot as plt
from six import BytesIO
from PIL import Image

img = cv2.imread('t.png',3)
fig = plt.figure(figsize=(20, 15))
plt.grid(False)
plt.imshow(img)
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
cl = clahe.apply(l)
limg = cv2.merge((cl, a, b))
fig = plt.figure(figsize=(20, 15))
plt.grid(False)
plt.imshow(limg)

原图
处理后的图像

Reference

Datasets Image2D Detection BITVehicle

BIT-Vehicle

Reference

BitVehicle
彩蛋 ctrl + u ~~ backup- BIT-Vehicle Dataset.html

Brief

The BIT-Vehicle dataset contains 9,850 vehicle images.

Files

├── README.txt
├── vehicle_0000001.jpg
├      ...
├── vehicle_0009849.jpg
├── vehicle_0009850.jpg
└── VehicleInfo.mat ==> matlab annotation information.

Annotation information.

name: The filename of the image.
height: The height of the image.
wideth: The width of the image.
nvehicles: The number of the vehicles in the image.
vehicles: This field is a struct array with the size of 1*n vehicles, and each element describes a vehicle. Each element contains five fileds: left, top, right, bottom, and category. The former four fileds characterize the location of the vehicle in the image, and the field "category" represents the type of the vehicle.

junxnone / aiwiki Goto Github PK

aiwiki's Introduction

AI Wiki

aiwiki's People

Contributors

Stargazers

Watchers

Forkers

aiwiki's Issues

YOLO 系列

Papers

提升变化

YOLOv1

YOLOv2

YOLOv3

YOLOv4

Oxford-IIIT Pets

Downloads

Annotations

list.txt

Reference

Reference

Brief

Download

COCO Challenges Tasks

Annotation format

Detection

Siamese Network

Reference

Brief

Arch

应用

MobileNet

Reference

Brief

量化

线性量化

对称量化

非对称量化

非线性量化

Pipeline

Reference

Reference

Brief

VLAD

Steps

NetVLAD

NetRVLAD

NetFV

Soft-DBoW

BOW

FV - Fisher Vector

Reference

Brief

特征值 & 特征向量 & 特征分解

SVD 分解

UV 推导

Examples

特征值和特征向量计算

求解点云旋转矩阵

模型结构可视化工具 netron

Install

Usecase

Reference

Reference

Brief

Algos

Gradient Descent & Optimizer

梯度

梯度下降

对比

Reference

Global Average Pooling

Reference

Brief

GAP的工作原理

AP vs MP

GAP vs GMP vs FC

读取 RTSP Camera

Code: