Jeremy Fix, https://jeremyfix.github.io/2024_ml4oceans/
April 23, 2025
Slides made with slidemakerFrom data that have a spatial structure (locally correlated), features can be extracted with convolutions.
On Images
That also makes sense for temporal series that have a structure in time.
What is a convolution : Example in 2D
Seen as a matrix multiplication
Given two 1D-vectors \(f, k\), say \(k = [c, b, a]\) \[ (f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix} \]
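A quick numpy check of this equivalence (an illustrative sketch, not from the original slides) :
import numpy as np

# Banded (Toeplitz) matrix of the slide for k = [c, b, a]:
# b on the diagonal, a just below, c just above
c, b, a = 1.0, 2.0, 3.0
kernel = np.array([c, b, a])
f = np.random.randn(8)
n = len(f)
M = np.diag(np.full(n, b)) + np.diag(np.full(n - 1, a), k=-1) + np.diag(np.full(n - 1, c), k=1)

# The matrix-vector product equals the "same"-mode convolution of f with k
assert np.allclose(M @ f, np.convolve(f, kernel, mode="same"))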
Convolution :
- size (e.g. \(3 \times 3\), \(5\times 5\))
- padding (e.g. \(1\), \(2\))
- stride (e.g. \(1\))
Pooling (max/average):
- size (e.g. \(2\times 2\))
- padding (e.g. \(0\))
- stride (e.g. \(2\))
We work with 4D tensors for 2D images, 3D tensors for nD temporal series (e.g. multiple simultaneous recordings), 2D tensors for 1D temporal series
In Pytorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow or CNTK, use BHWC (channel-last).
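For instance (an illustrative sketch, not from the original slides), the hyper-parameters listed above map directly onto the PyTorch layers, which operate on BCHW tensors :
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, stride=1)
pool = nn.MaxPool2d(kernel_size=2, padding=0, stride=2)

x = torch.randn(8, 3, 32, 32)   # Batch-Channel-Height-Width (BCHW)
y = pool(conv(x))
print(y.shape)                  # torch.Size([8, 16, 16, 16])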
Given :
Examples from ImageNet (see here)
Bounding boxes are given in the datasets (the predictor parametrization may differ) by : \([x, y, w, h]\), \([x_{min},y_{min},x_{max},y_{max}]\), … (see the conversion sketch below)
Datasets : Coco, ImageNet, Open Images Dataset
Recent survey : (Zou, Chen, Shi, Guo, & Ye, 2023) , (Terven, Córdova-Esparza, & Romero-González, 2023)
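For illustration (a minimal sketch, the helper name is mine), converting between these parametrizations is straightforward; torchvision.ops.box_convert provides the same utility :
import torch
from torchvision.ops import box_convert

def xywh_to_xyxy(boxes):
    # [x_min, y_min, w, h] -> [x_min, y_min, x_max, y_max]
    x, y, w, h = boxes.unbind(-1)
    return torch.stack((x, y, x + w, y + h), dim=-1)

boxes = torch.tensor([[10., 20., 30., 40.]])
assert torch.allclose(xywh_to_xyxy(boxes),
                      box_convert(boxes, in_fmt="xywh", out_fmt="xyxy"))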
With this task being defined, the questions to address are :
The metrics should ideally capture :
Given the “true” bounding boxes:
Quantify the quality of these predictions
A predictor should output labeled bounding boxes with a confidence score, and it typically outputs a lot of them.
Your metric should evaluate the fraction of bbox you correctly detect (TP) and the fraction of bbox you incorrectly detect (FP) or incorrectly miss (FN).
For every class individually, every prediction (with a sufficiently high confidence assigned by your predictor) on every image is considered :
The GT boxes that are not predicted are False Negatives (FN).
Examples from the Object detection metrics repository.
\[ \mbox{precision} = \frac{TP}{TP+FP} \] Which fraction of your detections are actually correct.
\[ \mbox{recall} = \frac{TP}{TP+FN} = \frac{TP}{\#\mbox{gt bbox}} \] Which fraction of labeled objects do you detect (can only increase with decreasing confidence)
If you lower your confidence threshold, your precision can either increase or decrease, your recall can either stall or increase.
AP is the average precision over different levels of recall. mAP is the average of the AP over all classes/categories. It depends on a specific IoU threshold to define TP/FP : Pascal VOC uses mAP@0.5 while COCO averages mAP@0.5-0.95.
Examples from the Object detection metrics repository.
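As an illustration (assuming the torchmetrics package, which is not part of the original slides), COCO-style mAP can be computed as follows; boxes are in \(x_{min},y_{min},x_{max},y_{max}\) format and the values are made up :
import torch
from torchmetrics.detection import MeanAveragePrecision

preds = [{
    "boxes": torch.tensor([[10., 10., 50., 50.]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12., 8., 48., 52.]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision(iou_type="bbox")  # COCO-style mAP@0.5-0.95
metric.update(preds, targets)
results = metric.compute()
print(results["map"], results["map_50"])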
Early (2016) proposals :
\[\begin{align*} \mathcal{L} &= \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(t_x-t_x^*)^2+(t_y-t_y^*)^2+(t_w-t_w^*)^2+(t_h-t_h^*)^2] \\ & +\sum_{i=0}^{S^2} \sum_{j=0}^{B} BCE(\mathbb{1}_{ij}^{obj}, \mbox{has_obj}_{ij}) \\ & +\sum_{i=0}^{S^2} \sum_{j=0}^{B} \sum_{k=0}^{K} BCE(\mbox{has_class}_{ijk}, p_{ijk}) \end{align*}\]
Dense predictors (e.g. Yolo) predict a lot of bounding boxes, hence a lot of negatives that overwhelm the positives.
\[\begin{align*} p\in[0,1], y \in \{0, 1\}, BCE(p, y) &= -y \log(p) -(1-y) \log(1-p)\\ FL(p, y) &= -y (1-p)^\gamma \log(p) - (1-y)p^\gamma \log(1-p) \end{align*}\]
Also, the bbox quality metric is usually IoU but the regression loss was an L2 loss.
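A minimal sketch of the focal loss above (binary case); for a tested implementation, see torchvision.ops.sigmoid_focal_loss (which works on logits), and torchvision.ops also provides IoU-based box losses such as generalized_box_iou_loss :
import torch

def focal_loss(p, y, gamma=2.0):
    # p: predicted probabilities in [0, 1], y: binary targets in {0, 1}
    return (-y * (1 - p) ** gamma * torch.log(p)
            - (1 - y) * p ** gamma * torch.log(1 - p)).mean()

p = torch.tensor([0.9, 0.1, 0.8])
y = torch.tensor([1.0, 0.0, 0.0])
print(focal_loss(p, y))   # the easy examples (first two) contribute almost nothing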
Data augmentation helps in regularizing your network : it consists of transforms applied to the inputs for which you can compute the effect on the target.
Libraries such as albumentations or torchvision greatly help in applying it (see also this list in albumentations).
With albumentations :
import albumentations as A
import albumentations.pytorch

# image: HWC numpy array, mask: HW numpy array,
# bboxes: list of [x_min, y_min, w, h], class_labels: list of labels
# bbox_params keeps the bounding boxes (COCO format) in sync with the geometric transforms
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.MaskDropout((1, 1), image_fill_value=255, p=1),  # requires a 'mask' target
    A.Blur(),
    A.RandomBrightnessContrast(p=0.2),
    A.pytorch.ToTensorV2()
], bbox_params=A.BboxParams(format='coco', label_fields=['class_labels']))

transformed = transform(image=image, mask=mask,
                        bboxes=bboxes, class_labels=class_labels)
With torchvision.transforms.v2 :
import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

transforms = v2.Compose(
    [
        v2.ToImage(),
        v2.RandomPhotometricDistort(p=1),
        v2.RandomZoomOut(fill={tv_tensors.Image: (123, 117, 104), "others": 0}),
        v2.RandomIoUCrop(),
        v2.RandomHorizontalFlip(p=1),
        v2.SanitizeBoundingBoxes(),
        v2.ToDtype(torch.float32, scale=True),
    ]
)

# img: a PIL image or tensor; boxes: a tv_tensors.BoundingBoxes instance so that
# the transforms update the boxes together with the image; extra keys are passed through
target = {
    "boxes": boxes,
    "labels": torch.arange(boxes.shape[0]),
    "this_is_ignored": ("arbitrary", {"structure": "!"})
}
out_img, out_target = transforms(img, target)
Mosaic Augmentation, MixUp, ….
Your predictor will output a lot of bounding boxes. Several might be overlapping.
Non-maximum suppression (NMS) removes lower scoring boxes that overlap (IOU) with other higher scoring boxes.
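In PyTorch, NMS is readily available in torchvision.ops (a small sketch with made-up boxes in \(x_{min},y_{min},x_{max},y_{max}\) format) :
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)   # tensor([0, 2]): box 1 overlaps too much with the higher-scoring box 0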
Suppose you have a single object to detect, can you localize it in the image ?
As reviewed in (Zhou et al., 2023) :
How can we proceed with multiple objects ? (Ross Girshick, Donahue, Darrell, & Malik, 2014) proposed to :
Revolution in the object detection community (vs. “traditional” HOG like features).
Drawback :
Notes : pretrained on ImageNet, finetuned on the considered classes with warped images. Hard negative mining (boosting).
Introduced in (R. Girshick, 2015). Idea:
Drawbacks:
Github repository. CVPR’15 slides
Notes : pretrained VGG16 on ImageNet. Fast training with multiple RoIs per image to build the mini-batch of \(128\) from \(N=2\) images, using \(64\) proposals per image : \(25\%\) with IoU>0.5 and \(75\%\) with \(IoU \in [0.1, 0.5[\). Data augmentation : horizontal flip. Per-layer learning rate, SGD with momentum, etc.
Multi task loss : \[ L(p, u, t, v) = -\log(p_u) + \lambda \mbox{smooth L1}(t, v) \]
The bbox is parameterized as in (Ross Girshick et al., 2014). Single scale is more efficient than multi-scale.
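A minimal sketch of this multi-task loss in PyTorch (the tensor names are mine, \(\lambda = 1\)) :
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, u, t, v, lam=1.0):
    # class_logits: (N, K+1) class scores, u: (N,) ground-truth classes
    # t, v: (N, 4) predicted / target box offsets for the true class
    cls_loss = F.cross_entropy(class_logits, u)   # -log p_u
    loc_loss = F.smooth_l1_loss(t, v)             # smooth L1 on the box parametrization
    return cls_loss + lam * loc_loss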
Introduced in (Ren, He, Girshick, & Sun, 2016). The first end-to-end trainable network. It introduces the Region Proposal Network (RPN). A RPN is a sliding Conv(\(3\times3\)) - Conv(\(1\times1\), k + 4k) network (see here). It also introduces anchor boxes with predefined scales and aspect ratios.
Check the paper for a lot of quantitative results. Small objects may not have a lot of features.
Bbox parametrization identical to (Ross Girshick et al., 2014), with smooth L1 loss. Multi-task loss for the RPN. Momentum(0.9), weight decay(0.0005), learning rate (0.001) for 60k minibatches, 0.0001 for 20k.
Multi-step training. Gradient is non-trivial due to the coordinate snapping of the boxes (see ROI align for a more continuous version)
With VGG-16, the conv5 layer is \(H/16,W/16\). For an image \(1000 \times 600\), there are \(60 \times 40 = 2400\) anchor boxes centers.
In practice, torchvision provides pretrained models for object detection, e.g. Faster RCNN models.
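For instance (a minimal sketch of the torchvision detection API) :
import torch
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           FasterRCNN_ResNet50_FPN_Weights)

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

images = [torch.rand(3, 600, 800)]   # list of CHW images with values in [0, 1]
with torch.no_grad():
    predictions = model(images)      # one dict per image: 'boxes', 'labels', 'scores'
print(predictions[0]["boxes"].shape)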
Feature Pyramid Networks (FPN) (Lin et al., 2017) introduced top-down path to propagate semantics up to the first layers,
Bottom-up path aggregation adds shortcuts to propagate accurate object boundaries to the top layers (Liu, Qi, Qin, Shi, & Jia, 2018)
(Liu et al., 2018) also introduced Adaptive Feature Pooling rather than the arbitrary assignment of proposals to one level of the pyramid as in (Lin et al., 2017).
The first one-stage detector. Introduced in (Redmon, Divvala, Girshick, & Farhadi, 2016). It outputs:
Bounding box encoding:
In Yolo v3, the network is Feature Pyramid Network (FPN) like, with downsampling and upsampling paths and predictions at 3 stages.
The loss is multi-task with :
\[\begin{align*} \mathcal{L} &= \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(t_x-t_x^*)^2+(t_y-t_y^*)^2+(t_w-t_w^*)^2+(t_h-t_h^*)^2] \\ & +\sum_{i=0}^{S^2} \sum_{j=0}^{B} BCE(\mathbb{1}_{ij}^{obj}, \mbox{has_obj}_{ij}) \\ & +\sum_{i=0}^{S^2} \sum_{j=0}^{B} \sum_{k=0}^{K} BCE(\mbox{has_class}_{ijk}, p_{ijk}) \end{align*}\]
In v1 and v2, the prediction losses were L2 losses.
Multi-labelling can occur in COCO (e.g. woman, person), hence the BCE for the classes.
Starting from Yolov4, several authors have released their own Yolo versions, see (Terven et al., 2023) :
Yolov4 (2020), YoloR (2021), Yolov7 (2022) : CSPResNet (cheap DenseNet), bag of freebies (Mosaic, CutMix, Cosine Annealing), bag of specials
YoloX (2021) by Megvii based on Yolo v3 : Anchor free, decoupled heads
Yolov5 (2020), Yolov8 (2023), Yolov11(2024) by Ultralytics
Example with the ultralytics Yolov11, released 09/2024, either as CLI or python/Pytorch :
from ultralytics import YOLO
# Load a model
model = YOLO("yolo11n.pt")
# Train the model
train_results = model.train(
    data="coco8.yaml",  # path to dataset YAML
    epochs=100,  # number of training epochs
    imgsz=640,  # training image size
    device="cpu",  # device to run on, i.e. device=0 or device=0,1,2,3 or device=cpu
)
# Evaluate model performance on the validation set
metrics = model.val()
# Perform object detection on an image
results = model("path/to/image.jpg")
results[0].show()
# Export the model to ONNX format
path = model.export(format="onnx") # return path to exported model
with coco8.yaml (COCO bbox format \(x_{min}, y_{min}, width, height\))
The bounding boxes do not have to be axis-aligned and can be oriented bounding boxes (OBB).
This model now also supports segmentation, object tracking and pose estimation.
Task: detecting diatoms using oriented bounding boxes (with M. Laviale, C. Pradalier, C. Regan, A. Venkataramanan, C. Galinier)
Using ultralytics and their wandb callbacks 😍
Code and instructions on https://github.com/jeremyfix/diatoms_yolo.
Running on IMagine and Teratotheca.
Given an image,
Semantic segmentation : predict the class of every single pixel. This is also called dense prediction / dense labelling.
Example image from MS Coco
Instance segmentation : classify all the pixels belonging to each countable object instance.
Example image from MS Coco
More recently, panoptic segmentation refers to instance segmentation for countable objects (e.g. people, animals, tools) and semantic segmentation for amorphous regions (grass, sky, road).
Metrics : see Coco panoptic evaluation
Some example networks : PSP-Net, U-Net, Dilated Net, ParseNet, DeepLab, Mask RCNN, …
Introduced in (Ciresan, Giusti, Gambardella, & Schmidhuber, 2012).
Drawbacks:
(on deep neural network calibration, see also (Guo, Pleiss, Sun, & Weinberger, 2017))
Introduced in (Long, Shelhamer, & Darrell, 2015). First end-to-end convolutional network for dense labeling with pretrained networks.
The upsampling can be :
Traditional approaches involve bilinear, bicubic, etc. interpolation.
For upsampling in a learnable way, we can use fractionally strided convolution. That’s one ingredient behind Super-Resolution (Shi, Caballero, Huszár, et al., 2016).
You can initialize the upsampling kernels with a bilinear interpolation kernel. To have some other equivalences, see (Shi, Caballero, Theis, et al., 2016). See ConvTranspose2d.
This can introduce artifacts, check (Odena, Dumoulin, & Olah, 2016). Some prefer a bilinear upsampling followed by convolutions.
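A small sketch of both options, with made-up channel sizes :
import torch
import torch.nn as nn

# Fractionally strided (transposed) convolution: learnable x2 upsampling
up = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)

# Alternative preferred by some works to avoid checkerboard artifacts
up_bilinear = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

x = torch.randn(1, 64, 32, 32)
print(up(x).shape, up_bilinear(x).shape)   # both torch.Size([1, 64, 64, 64])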
Several models follow the same architecture : SegNet (sum), U-Net (concat). Encoder-decoder architecture introduced in (Ronneberger, Fischer, & Brox, 2015)
There is :
To gather contextual information by enlarging the receptive fields, DeepLab employs “à trous” (dilated) convolutions rather than max pooling. DeepLabV1 (Chen, 2014), DeepLabV2 (Chen, Papandreou, Kokkinos, Murphy, & Yuille, 2017), DeepLabV3 (Chen, Papandreou, Schroff, & Adam, 2017), DeepLabV3+ (Chen, Zhu, Papandreou, Schroff, & Adam, 2018).
With torchvision
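For example (a minimal sketch assuming the pretrained DeepLabV3 weights shipped with torchvision) :
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT).eval()

x = torch.rand(1, 3, 520, 520)            # normally a normalized input image
with torch.no_grad():
    out = model(x)["out"]                 # (1, num_classes, 520, 520) logits
print(out.argmax(dim=1).shape)            # dense per-pixel class predictions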
Both torchvision (v2) and albumentations support augmentation on masks. See albumentations for example.
Task: segmenting non-living stuff vs living organisms in ZooScan images.
Inference on a new sample with tiled inference. We could have used averaged inference with overlapping tiles.
Mask produced with threshold@0.5 which may be suboptimal.
Introduced in (He, Gkioxari, Dollár, & Girshick, 2018) as an extension of Faster RCNN. It outputs a binary mask in addition to the class labels + bbox regressions.
It addresses instance segmentation by predicting a mask for individualised object proposals.
Proposed to use ROI-Align (with bilinear interpolation) rather than ROI-Pool.
There is no competition between the classes in the masks. Different objects may use different kernels to compute their masks.
Can be extended to keypoint detection, outputting a \(K\) depth mask for predicting the \(K\) joints.
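In practice, torchvision also ships a pretrained Mask RCNN (a brief sketch) :
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

with torch.no_grad():
    pred = model([torch.rand(3, 480, 640)])[0]
# One soft mask per detected instance, in addition to boxes, labels and scores
print(pred["masks"].shape)   # (N, 1, 480, 640)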
Ultralytics Yolo segment adopts the same strategy of predicting a mask for each bounding box.
from ultralytics import YOLO
# Load a model
model = YOLO("yolo11n-seg.pt")
# Train the model
train_results = model.train(data="coco8-seg.yaml", epochs=100, imgsz=640)
results = model("myimage.jpg") # predict on an image
Live demo with https://github.com/jeremyfix/deeplearning_demos and https://github.com/jeremyfix/onnx_models.
Introduced in (Vaswani et al., 2017), from NLP to Vision by ViT (Dosovitskiy et al., 2021).
Recent propositions for :
In practice, huggingface transformers implements DETR. Look at the paper, it is just a few lines of PyTorch.
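A short sketch with the huggingface transformers API, using the public facebook/detr-resnet-50 checkpoint :
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()

image = Image.open("path/to/image.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep the detections above a confidence threshold, rescaled to the image size
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=torch.tensor([image.size[::-1]]))[0]
print(results["labels"], results["scores"], results["boxes"])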