Introduction to deep learning based object detection and semantic segmentation

Jeremy Fix, https://jeremyfix.github.io/2024_ml4oceans/

April 23, 2025

Slides made with slidemaker

Reminder on Convolutional Neural Networks (CNN)

Extracting features with convolutions

From data that have a spatial structure (locally correlated), features can be extracted with convolutions.

On Images

Original image
Discrete Laplacian
Gaussian blur
Pattern matching
The pattern

That also makes sense for temporal series that have a structure in time.

A convolution as a sparse matrix multiply

What is a convolution : Example in 2D







Seen as a matrix multiplication

Given two 1D-vectors \(f, k\), say \(k = [c, b, a]\) \[ (f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix} \]
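A minimal numpy sketch of this view (the kernel values and the length 8 are arbitrary): we build the banded matrix of the kernel \(k = [c, b, a]\) and check it against a "same"-mode convolution.


import numpy as np

# Banded (Toeplitz) matrix of the kernel k = [c, b, a] for a signal of length 8
f = np.random.randn(8)
a, b, c = 0.25, 0.5, 0.25

K = np.zeros((8, 8))
for i in range(8):
    if i > 0:
        K[i, i - 1] = a
    K[i, i] = b
    if i < 7:
        K[i, i + 1] = c

# The matrix multiply matches the "same"-mode convolution of f with [c, b, a]
print(np.allclose(K @ f, np.convolve(f, [c, b, a], mode="same")))  # True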

CNN Vocabulary

The building blocks of the convolutional part of a vanilla CNN

Convolution :
- size (e.g. \(3 \times 3\), \(5\times 5\))
- padding (e.g. \(1\), \(2\))
- stride (e.g. \(1\))

Pooling (max/average):
- size (e.g. \(2\times 2\))
- padding (e.g. \(0\))
- stride (e.g. \(2\))

We work with 4D tensors (B, C, H, W) for 2D images, 3D tensors (B, C, T) for multi-channel temporal series (e.g. multiple simultaneous recordings), and 2D tensors for 1D temporal series.

In PyTorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow, default to BHWC (channel-last).
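A minimal PyTorch sketch of these conventions (the sizes are arbitrary):


import torch
import torch.nn as nn

# A batch of 16 RGB images of size 32x32, in BCHW order
x = torch.randn(16, 3, 32, 32)

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1, stride=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

print(conv(x).shape)        # torch.Size([16, 8, 32, 32]) : padding 1 preserves H, W
print(pool(conv(x)).shape)  # torch.Size([16, 8, 16, 16]) : stride-2 pooling halves H, W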

Object detection

Problem statement

Given :

  • images \(x_i\),
  • targets \(y_i\) containing the objects' bounding boxes and labels

Examples from ImageNet (see here)

ILSVRC2014_train_00005559 : few objects annotated.
ILSVRC2014_train_00029372 : 12 objects with occlusions

Bounding boxes are given in the datasets (the predictor parametrization may differ) as : \([x, y, w, h]\), \([x_{min},y_{min},x_{max},y_{max}]\), …

Datasets : Coco, ImageNet, Open Images Dataset

Recent surveys : (Zou, Chen, Shi, Guo, & Ye, 2023), (Terven, Córdova-Esparza, & Romero-González, 2023)

The dimensions to investigate

With this task being defined, the questions to address are :

  • what are the specific metrics used for measuring performance ? Classification accuracy, bbox regression quality, speed
  • what are the relevant losses ? L2 ? Huber ? CE ? Focal ? Dice ?
  • which data augmentation techniques can be applied ? Flip, mosaic, dropmask
  • which model architectures are suitable for this task ? End-to-end trainable 1-stage and 2-stage detectors
  • how do we solve the task in practice with PyTorch ?

Object detection : Metrics

Metrics to measure the performances

The metrics should ideally capture :

  • the quality of the bounding boxes
  • the quality of the label found in the bounding box

Given the “true” bounding boxes:

Pascal VOC 2012_000180.jpg

Quantify the quality of these predictions

Yolo v4 tiny predictions using darknet

A predictor outputs labeled bounding boxes with a confidence score, and it outputs a lot of them.

Your metric should evaluate the fraction of boxes you correctly detect (TP), the fraction you incorrectly detect (FP) and the fraction you miss (FN).

The Pascal/COCO mAP metric

For every class individually, every prediction (that has a sufficiently large confidence assigned by your predictor) of every image is considered :

  • a true positive (TP) if IoU with ground truth bbox of the same class is higher than a threshold (say \(IOU > 0.3\))
  • a false positive (FP) otherwise or if there is a TP for the same ground truth with a higher confidence (“5 detections (TP) of a single object is counted as 1 correct detection and 4 false detections”)
Example 24 predictions (red) with 15 ground truth (green) for one arbitrary class
The 24 predictions are either TP (blue) or FP (orange) with IoU>0.3

The GT boxes that are not predicted are False Negatives (FN).

Examples from the Object detection metrics repository.

The Pascal/COCO mAP metric

\[ \mbox{precision} = \frac{TP}{TP+FP} \] Which fraction of your detections are actually correct.

\[ \mbox{recall} = \frac{TP}{TP+FN} = \frac{TP}{\#\mbox{gt bbox}} \] Which fraction of labeled objects do you detect (can only increase with decreasing confidence)

If you lower your confidence threshold, your precision can either increase or decrease, your recall can either stall or increase.

Precision/recall for the ordered predictions
Precision recall curve

AP is the average precision over different levels of recall. mAP is the average of AP over all classes/categories. It depends on the IoU threshold used to define TP/FP : Pascal uses mAP@0.5 while COCO averages mAP over IoU thresholds 0.5 to 0.95.

Examples from the Object detection metrics repository.
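To illustrate the IoU thresholding step (a full AP computation additionally matches every ground truth to at most one prediction, by decreasing confidence), a minimal sketch with torchvision and made-up boxes:


import torch
from torchvision.ops import box_iou

# Made-up boxes in (xmin, ymin, xmax, ymax) format
pred_boxes = torch.tensor([[10., 10., 50., 50.], [60., 60., 90., 90.]])
gt_boxes = torch.tensor([[12., 12., 48., 48.]])

iou = box_iou(pred_boxes, gt_boxes)   # (num_predictions, num_gt) IoU matrix
is_tp = iou.max(dim=1).values > 0.3   # predictions matching a ground truth with IoU > 0.3
print(is_tp)                          # tensor([ True, False])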

Object detection : Losses, data augmentations, post-processing

Loss

Early (2016) proposals :

  • Multi-objective, with multi-label cross-entropy for classification and L2 for the bounding box coordinates, e.g. Yolo v1, v2, v3 :

\[\begin{align*} \mathcal{L} &= \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(t_x-t_x^*)^2+(t_y-t_y^*)^2+(t_w-t_w^*)^2+(t_h-t_h^*)^2] \\ & -\sum_{i=0}^{S^2} \sum_{j=0}^{B} BCE(\mathbb{1}_{ij}^{obj}, \mbox{has_obj}_{ij}) \\ & -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \sum_{k=0}^{K} BCE(\mbox{has_class}_{ijk}, p_{ijk}) \end{align*}\]

Dense predictors (e.g. Yolo) predict a lot of boxes, hence a lot of negatives that overwhelm the positives

  • Focal loss (Lin, Goyal, Girshick, He, & Dollár, 2020), sketched in PyTorch after this list

\[\begin{align*} p\in[0,1], y \in \{0, 1\}, BCE(p, y) &= -y \log(p) -(1-y) \log(1-p)\\ FL(p, y) &= -y (1-p)^\gamma \log(p) - (1-y)p^\gamma \log(1-p) \end{align*}\]

Also, the box quality metric is usually IoU whereas the regression loss was L2.

  • Losses based on Intersection over Union (IoU) started with Generalized IoU (Rezatofighi et al., 2019), then Distance IoU (Zheng et al., 2020), (Zheng et al., 2021). Generalized IoU is a differentiable IoU loss which also handles the case when the predicted and ground truth bounding boxes do not overlap.
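A minimal PyTorch sketch of the binary focal loss written above (without the \(\alpha\) class-balancing weight used in the paper):


import torch

def focal_loss(p, y, gamma=2.0, eps=1e-6):
    # Down-weights easy, well-classified examples; gamma=0 recovers the BCE
    p = p.clamp(eps, 1 - eps)
    return -(y * (1 - p) ** gamma * torch.log(p)
             + (1 - y) * p ** gamma * torch.log(1 - p)).mean()

p = torch.tensor([0.9, 0.1, 0.6])  # predicted probabilities
y = torch.tensor([1.0, 0.0, 1.0])  # binary targets
print(focal_loss(p, y))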

Data augmentations

Data augmentation helps regularize your network : it consists of transforms applied to the inputs for which you can compute the effect on the targets.

Libraries such as albumentations or torchvision greatly help applying it (see also this list in albumentations).

With albumentations :


import albumentations as A
import albumentations.pytorch

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.MaskDropout((1, 1), image_fill_value=255, p=1),
    A.Blur(),
    A.RandomBrightnessContrast(p=0.2),
    A.pytorch.ToTensorV2()
], bbox_params=A.BboxParams(format='coco', label_fields=['labels']))

# The bounding boxes (and their labels) are transformed consistently with the image
transformed = transform(image=image, bboxes=bboxes, labels=labels)

With torchvision.transforms.v2 :


import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

transforms = v2.Compose(
    [
        v2.ToImage(),
        v2.RandomPhotometricDistort(p=1),
        v2.RandomZoomOut(fill={tv_tensors.Image: (123, 117, 104), "others": 0}),
        v2.RandomIoUCrop(),
        v2.RandomHorizontalFlip(p=1),
        v2.SanitizeBoundingBoxes(),
        v2.ToDtype(torch.float32, scale=True),
    ]
)

# boxes is expected to be a tv_tensors.BoundingBoxes so that the transforms act on it
target = {
    "boxes": boxes,
    "labels": torch.arange(boxes.shape[0]),
    "this_is_ignored": ("arbitrary", {"structure": "!"})
}
out_img, out_target = transforms(img, target)

Mosaic Augmentation, MixUp, ….

Postprocessing : Non maximum suppression (NMS)

Your predictor will output a lot of bounding boxes. Several might be overlapping.

Non-maximum suppression (NMS) removes lower scoring boxes that overlap (IOU) with other higher scoring boxes.

Non maximum suppression (from (Terven et al., 2023))
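In PyTorch, NMS is readily available in torchvision; a minimal sketch with made-up boxes:


import torch
from torchvision.ops import nms

# Two overlapping boxes and one isolated box, in (xmin, ymin, xmax, ymax) format
boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [80., 80., 120., 120.]])
scores = torch.tensor([0.9, 0.75, 0.6])

keep = nms(boxes, scores, iou_threshold=0.5)  # indices of the kept boxes, by decreasing score
print(keep)  # tensor([0, 2]) : the lower-scoring overlapping box is suppressed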

Object detection : Models

A first step: object localization

Suppose you have a single object to detect, can you localize it in the image ?

The field is moving at a very high pace

As reviewed in (Zou et al., 2023) :

Milestones in object detection

Object detection : Two stage detectors

Region Based Convolutional Neural Network : R-CNN

How can we proceed with multiple objects ? (Ross Girshick, Donahue, Darrell, & Malik, 2014) proposed to :

  • use selective search to propose bounding boxes
  • classify with an SVM from the features extracted by a pretrained DNN
  • optimize the localization with linear bbox regressors

Revolution in the object detection community (vs. “traditional” HOG like features).

Drawback :

  • external algorithm (not in the computational graph, not trainable)
  • one forward pass per bounding box proposal (\(\sim 2K\) or so) \(\rightarrow\) training and test are slow (\(47 s.\) per image with VGG16)

Notes : pretrained on ImageNet, finetuned on the considered classes with warped images. Hard negative mining (boosting).

Fast R-CNN

Introduced in (R. Girshick, 2015). Idea:

  • just one forward pass
  • cropping the convolutional feature maps (e.g. \((1, 512, H/16, W/16)\) conv5 of VGG16)
  • max-pool the variable-sized crop to a fixed size (e.g. \(7 \times 7\)) grid before the dense layers: ROI pooling (sketched at the end of this slide)

Drawbacks:

  • external algorithm for ROI proposals: not trainable and slow
  • ROIs are snapped to the grid (see here) \(\rightarrow\) ROI align

Github repository. CVPR’15 slides

Notes : pretrained VGG16 on ImageNet. Fast training with multiple ROIs per image : the mini-batch of \(128\) ROIs is built from \(N=2\) images, using \(64\) proposals per image, \(25\%\) with IoU>0.5 and \(75\%\) with \(IoU \in [0.1, 0.5[\). Data augmentation : horizontal flip. Per-layer learning rate, SGD with momentum, etc.

Multi task loss : \[ L(p, u, t, v) = -\log(p_u) + \lambda \mbox{smooth L1}(t, v) \]

The bbox is parameterized as in (Ross Girshick et al., 2014). Single scale is more efficient than multi-scale.
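Both ROI pooling and the ROI align variant mentioned above are available in torchvision.ops; a minimal sketch (the \(1/16\) scale and the feature map size mimic a VGG16 conv5 map of a \(1000 \times 600\) image):


import torch
from torchvision.ops import roi_align, roi_pool

features = torch.randn(1, 512, 38, 63)  # (B, C, H/16, W/16)
# ROIs in image coordinates : (batch_index, xmin, ymin, xmax, ymax)
rois = torch.tensor([[0., 100., 100., 400., 300.],
                     [0., 500., 200., 900., 550.]])

pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape, aligned.shape)  # torch.Size([2, 512, 7, 7]) for both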

Faster R-CNN : 2-stages trained end-to-end

Introduced in (Ren, He, Girshick, & Sun, 2016). The first end-to-end trainable detection network. It introduces the Region Proposal Network (RPN) : a sliding Conv(\(3\times3\)) - Conv(\(1\times1\), k + 4k) network (see here). It also introduces anchor boxes with predefined scales and aspect ratios (later detectors, e.g. Yolo v2, learn the anchor priors by clustering/vector quantization).

Architecture
Region proposal network with anchors at 3 scales, 3 aspect ratios

Check the paper for a lot of quantitative results. Small objects may not have a lot of features.

Github repository

Bbox parametrization identical to (Ross Girshick et al., 2014), with smooth L1 loss. Multi-task loss for the RPN. Momentum(0.9), weight decay(0.0005), learning rate (0.001) for 60k minibatches, 0.0001 for 20k.

Multi-step training. Gradient is non-trivial due to the coordinate snapping of the boxes (see ROI align for a more continuous version)

With VGG-16, the conv5 layer is \(H/16,W/16\). For an image \(1000 \times 600\), there are \(60 \times 40 = 2400\) anchor boxes centers.

In practice

In practice, torchvision provides pretrained models for object detection, e.g. Faster RCNN models.

In practice


from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=...)

# Training
model.train()
model(images, targets)

# Inference
model.eval()
predictions = model(images)

During training :


targets = [
    {
        'boxes': FloatTensor[N1, 4],
        'labels': IntTensor[N1]
    },
    {
        'boxes': FloatTensor[N2, 4],
        'labels': IntTensor[N2]
    },
    ...
]
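A hypothetical minimal training step with this targets format (assuming the model and the lists images, targets defined above); in train mode, the torchvision detection models return a dict of losses:


import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

model.train()
loss_dict = model(images, targets)  # dict of classification/regression losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()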

During inference :


predictions = [
    {
        'boxes': FloatTensor[N1, 4],
        'labels': IntTensor[N1],
        'scores': FloatTensor[N1]
    },
    {
        'boxes': FloatTensor[N2, 4],
        'labels': IntTensor[N2],
        'scores': FloatTensor[N2]
    },
    ...
]

Feature pyramid networks and bottom-up path augmentation

Multi-scale features with semantic and localization

  • In deep classifiers, the top layers are semantically rich but spatially poor,
  • Feature Pyramid Networks (FPN) (Lin et al., 2017) introduced a top-down path to propagate semantics down to the earlier, high-resolution layers,

  • Bottom-up path aggregation adds shortcuts to propagate accurate object boundaries to the top layers (Liu, Qi, Qin, Shi, & Jia, 2018)

Top down semantics, Bottom-up localization
Adaptive feature pooling

(Liu et al., 2018) also introduced Adaptive Feature Pooling rather than the arbitrary assignment of proposals to one level of the pyramid as in (Lin et al., 2017).

Object detection : One stage detectors

You Only Look Once (Yolo v1,v2,v3)

The first one-stage detector. Introduced in (Redmon, Divvala, Girshick, & Farhadi, 2016). It outputs:

  • \(B\) bounding boxes \((x, y, w, h, conf)\) for each cell of a \(S\times S\) grid
  • the probabilities over the \(K\) classes
  • the output volume is \((5\times B+K)\times(S\times S)\) in Yolo v1 (e.g. \(7 \times 7 \times 30\) on Pascal VOC with \(S=7\), \(B=2\), \(K=20\)), then \((5+K)\times B \times (S\times S)\) from v2
YoLo v0 with one bounding box per cell B=1

Bounding box encoding:

From Yolo v2

In Yolo v3, the network is Feature Pyramid Network (FPN)-like, with a downsampling path and an upsampling path, and predictions at 3 scales.

You Only Look Once (Yolo v1,v2,v3)

The loss is multi-task with :

  • a term for the regression of the transformation of the anchor boxes
  • a term for detecting the presence of an object
  • a term for the (possibly multi-label) classification of the boxes

\[\begin{align*} \mathcal{L} &= \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(t_x-t_x^*)^2+(t_y-t_y^*)^2+(t_w-t_w^*)^2+(t_h-t_h^*)^2] \\ & -\sum_{i=0}^{S^2} \sum_{j=0}^{B} BCE(\mathbb{1}_{ij}^{obj}, \mbox{has_obj}_{ij}) \\ & -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \sum_{k=0}^{K} BCE(\mbox{has_class}_{ijk}, p_{ijk}) \end{align*}\]

Darknet code

In v1 and v2, the prediction losses were L2 losses.

Multi-labelling can occur in COCO (e.g. woman, person), hence the BCE for the classes

Yolo family

Starting from Yolov4, several groups have released their own Yolo versions, see (Terven et al., 2023)

  • Yolov1 (2016), v2, v3 (2018) by J. Redmon et al. : principle + anchors + multi-scale (FPN)
J. Redmon raised ethical questions about the use of these ML algorithms (Redmon & Farhadi, 2018)
  • Yolov4 (2020), YoloR (2021), Yolov7 (2022) : CSPResNet (cheap DenseNet), bag of freebies (Mosaic, CutMix, Cosine Annealing), bag of specials

  • YoloX (2021) by Megvii based on Yolo v3 : Anchor free, decoupled heads

  • Yolov5 (2020), Yolov8 (2023), Yolov11(2024) by Ultralytics

Yolov5 (2020)

  • Released in 2020 by Ultralytics
  • CSPLayer (cheap Darknet from Yolov3 (cheap ResNet))
  • Coupled heads, anchor based
  • Top-down FPN
  • Bottom-up PAN

See also this page

Yolov5 architecture, by mmyolo

Yolov8 (2023)

  • Released early 2023 by Ultralytics
  • different scaled versions available (nano … extra-large)
  • Updated CSPLayer : C2f module
  • Decoupled heads, Anchor free
Yolov8 architecture, by mmyolo

In practice

  • You could (possibly) implement Yolo yourself 😖 ,

Example with the ultralytics Yolov11, released 09/2024, usable either from the CLI or from Python/PyTorch :


from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")

# Train the model
train_results = model.train(
    data="coco8.yaml",  # path to dataset YAML
    epochs=100,  # number of training epochs
    imgsz=640,  # training image size
    device="cpu",  # device to run on, i.e. device=0 or device=0,1,2,3 or device=cpu
)

# Evaluate model performance on the validation set
metrics = model.val()

# Perform object detection on an image
results = model("path/to/image.jpg")
results[0].show()

# Export the model to ONNX format
path = model.export(format="onnx")  # return path to exported model

with coco.yaml (coco format \(x_{min}, y_{min}, width, height\))

The bounding boxes do not have to be axis-aligned and can be oriented bounding boxes (OBB).

This model family now also supports segmentation, object tracking and pose estimation.

In practice : detecting oriented diatoms

Task: detecting diatoms using oriented bounding boxes (with M. Laviale, C. Pradalier, C. Regan, A. Venkataramanan, C. Galinier)

Ground truth
Train batch with aug.
Prediction

Using ultralytics and their wandb callbacks 😍

Code and instructions on https://github.com/jeremyfix/diatoms_yolo.

Running on IMagine and Teratotheca.

In practice : detecting oriented diatoms (cont.)

Task: detecting diatoms using oriented bounding boxes (with M. Laviale, C. Pradalier, C. Regan, A. Venkataramanan, C. Galinier)

Metrics during training
Metrics of one run

Using ultralytics and their wandb callbacks 😍

Code and instructions on https://github.com/jeremyfix/diatoms_yolo.

Running on IMagine and Teratotheca.

Semantic, Instance and Panoptic segmentation

Problem statement

Given an image,

Semantic segmentation : predict the class of every single pixel, also called dense prediction / dense labelling.

Example image from MS Coco

Image labeled with stuff classes

Instance segmentation : group and classify the pixels belonging to each individual countable object

Example image from MS Coco

Image labeled with things classes

More recently, panoptic segmentation refers to instance segmentation for countable objects (e.g. people, animals, tools) and semantic segmentation for amorphous regions (grass, sky, road).

Metrics : see the COCO panoptic evaluation

Some example networks : PSP-Net, U-Net, Dilated Net, ParseNet, DeepLab, Mask RCNN, …

Sliding a classifier

Introduced in (Ciresan, Giusti, Gambardella, & Schmidhuber, 2012).

Segmentation with a sliding window pixel classifier
Class preserving data augmentation
  • The predicted segmentation is post-processed with a median filter
  • The output probability is calibrated to compensate for different priors on being a membrane / not being a membrane.

Drawbacks:

  • one forward pass per patch
  • overlapping patches do not share the computations

(on deep neural network calibration, see also (Guo, Pleiss, Sun, & Weinberger, 2017))

Semantic segmentation

Fully Convolutional Network (FCN)

Introduced in (Long, Shelhamer, & Darrell, 2015). First end-to-end convolutional network for dense labeling with pretrained networks.

FCN network architecture : the backbone downsamples so that pool5 has size \(H/32, W/32\) (the cumulated stride at pool5 is 32). The segmentation prediction from each pool layer is computed with a Conv \(1\times 1\).
Refined segmentation by incorporating early feature maps

The upsampling can be :

  • hardcoded but still differentiable: e.g. nearest, bilinear, bicubic interpolation
  • learned as a fractionally strided convolution (deconvolution),

Learned upsampling : fractionally strided convolution

Traditional approaches involve bilinear, bicubic, etc. interpolation.

For upsampling in a learnable way, we can use fractionally strided convolution. That’s one ingredient behind Super-Resolution (Shi, Caballero, Huszár, et al., 2016).

Learned upsampling with fractionally strided convolution

You can initialize the upsampling kernels with a bilinear interpolation kernel. For other equivalences, see (Shi, Caballero, Theis, et al., 2016). See ConvTranspose2d.

This can introduce checkerboard artifacts, check (Odena, Dumoulin, & Olah, 2016). Some prefer a bilinear upsampling followed by convolutions.
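A minimal sketch of both options (arbitrary sizes): the learned fractionally strided convolution and the bilinear upsampling followed by a convolution:


import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Learned x2 upsampling with a fractionally strided (transposed) convolution
up = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
print(up(x).shape)  # torch.Size([1, 64, 64, 64])

# Fixed bilinear upsampling followed by a convolution, less prone to checkerboard artifacts
alt = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
print(alt(x).shape)  # torch.Size([1, 64, 64, 64])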

Encoder/Decoder architectures

Several models follow the same architecture : SegNet, U-Net (skip connections, merged by summation or concatenation). The encoder-decoder architecture was introduced in (Ronneberger, Fischer, & Brox, 2015)

There is :

  • a contraction path from the image “down” to high level features at a coarse spatial resolution
  • an expansion upsampling path which integrates low level features for pixel wise labelling
U-Net contraction/expansion convolution/deconvolution architecture
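A minimal, hypothetical PyTorch sketch of this pattern with a single contraction/expansion level and a concatenation skip connection (real U-Nets stack several such levels):


import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, cin=3, num_classes=2, width=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(cin, width, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(width, width, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(width, num_classes, 1)

    def forward(self, x):
        e = self.enc(x)                          # high-resolution, low-level features
        b = self.bottleneck(self.down(e))        # coarse, semantically richer features
        d = self.up(b)                           # learned upsampling
        d = self.dec(torch.cat([d, e], dim=1))   # concatenation skip connection
        return self.head(d)                      # per-pixel class scores

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])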

DeepLab

To gather contextual information by enlarging the receptive fields, DeepLab employs atrous ("à trous", i.e. dilated) convolutions rather than max pooling. DeepLabV1 (Chen, 2014), DeepLabV2 (Chen, Papandreou, Kokkinos, Murphy, & Yuille, 2017), DeepLabV3 (Chen, Papandreou, Schroff, & Adam, 2017), DeepLabV3+ (Chen, Zhu, Papandreou, Schroff, & Adam, 2018).

  • Max-pooling in classification backbones induces invariance to translation and reduces the resolution \(\rightarrow\) atrous convolutions
  • How to deal with objects at multiple scales ? \(\rightarrow\) Atrous Spatial Pyramid Pooling (ASPP)
  • DeepLabV3+ combines this with a decoder made of upsampling layers
DeepLabV3+ combines ASPP with a decoder
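A minimal sketch of an atrous convolution in PyTorch (arbitrary sizes): the dilated \(3\times 3\) kernel covers a \(5\times 5\) neighbourhood without reducing the resolution:


import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
# padding = dilation keeps the spatial resolution unchanged
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(atrous(x).shape)  # torch.Size([1, 64, 56, 56])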

In practice

With torchvision


from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision.models import ResNet50_Weights

# Pretrained ResNet50 backbone, randomly initialized segmentation head with 1 output class
# (loading full DeepLabV3 weights requires num_classes to match the checkpoint, i.e. 21)
model = deeplabv3_resnet50(
    weights_backbone=ResNet50_Weights.DEFAULT,
    num_classes=1
)

With segmentation models pytorch (smp)


import segmentation_models_pytorch as smp

cin = 1
num_classes = 1
model = smp.DeepLabV3Plus(
    in_channels = cin, 
    classes = num_classes,
    encoder_name = "resnet18",
    encoder_weights = "imagenet"
)

Both torchvision (v2) and albumentations support augmentation on masks. See albumentations for example.

In practice

Task Segment non-living stuff vs living organisms in ZooScan images.

  • Large images \(\approx 15000 \times 23000\) containing objects belonging to \(89\) categories. Non living (label < 8) or living.
  • Using either :
    • a UNet with a pretrained backbone + random decoder (lab)
    • a DeepLabv3+ pretrained on ImageNet
  • Trained on \(512 \times 512\) overlapping patches, focal loss, F1 score for early stopping
  • 76K for training, 19K for validation
  • One epoch in 10 minutes on RTX 3090
  • Flip, Rotate, MaskDropout, Blur
Metrics during training

In practice

Task Segment non-living stuff vs living organisms in ZooScan images.

Inference on a new sample with tiled inference. We could have used averaged inference with overlapping tiles.

Mask produced by thresholding at 0.5, which may be suboptimal.

Inference on a new unseen scan with tiled inference
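A hypothetical sketch of the non-overlapping tiled inference (it assumes a fully convolutional model mapping a (1, C, h, w) tensor to a (1, num_classes, h, w) tensor and omits the overlapping-tile averaging mentioned above):


import torch

def tiled_inference(model, image, tile=512):
    # image : a (C, H, W) tensor of a large scan
    C, H, W = image.shape
    out = None
    model.eval()
    with torch.no_grad():
        for i in range(0, H, tile):
            for j in range(0, W, tile):
                patch = image[:, i:i + tile, j:j + tile].unsqueeze(0)
                pred = model(patch)[0]  # (num_classes, h, w) prediction for this tile
                if out is None:
                    out = image.new_zeros(pred.shape[0], H, W)
                out[:, i:i + pred.shape[1], j:j + pred.shape[2]] = pred
    return out

# masks = torch.sigmoid(tiled_inference(model, scan)) > 0.5   # threshold at 0.5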

Instance segmentation

Mask-RCNN

Introduced in (He, Gkioxari, Dollár, & Girshick, 2018) as an extension of Faster RCNN. It outputs a binary mask in addition to the class labels and bbox regressions.

It addresses instance segmentation by predicting a mask for each individual object proposal.

Proposed to use ROI-Align (with bilinear interpolation) rather than ROI-Pool.

ROI Align prevents quantization of ROI Pooling
Mask for the 80 object categories of COCO

There is no competition between the classes in the masks. Different objects may use different kernels to compute their masks.

It can be extended to keypoint detection, outputting a \(K\)-channel mask for predicting the \(K\) joints.
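In practice, torchvision also provides a pretrained Mask R-CNN; a minimal inference sketch (the image is random, just to show the shapes):


import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

images = [torch.rand(3, 480, 640)]  # list of CHW float images in [0, 1]
with torch.no_grad():
    predictions = model(images)

# Each prediction dict contains 'boxes', 'labels', 'scores' and per-instance soft 'masks'
print(predictions[0]["masks"].shape)  # (num_detections, 1, 480, 640)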

Yolo

Ultralytics Yolo segment adopts the same strategy of predicting a mask for each bounding box.


from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n-seg.pt")

# Train the model
train_results = model.train(data="coco8-seg.yaml", epochs=100, imgsz=640)

results = model("myimage.jpg")  # predict on an image

Live demo with https://github.com/jeremyfix/deeplearning_demos and https://github.com/jeremyfix/onnx_models.

Transformers are coming

Transformers

Introduced in (Vaswani et al., 2017), brought from NLP to vision by ViT (Dosovitskiy et al., 2021).

Recent propositions for :

  • Object detection : (Carion et al., 2020) proposed a transformer-based object detector (DETR). The decoder outputs a predefined, large number \(N\) of labeled bounding boxes.

In practice, huggingface transformers implements DETR. Look at the paper, it is just a few lines of PyTorch.

DETR architecture
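A minimal inference sketch with the huggingface implementation (assuming the facebook/detr-resnet-50 checkpoint; the post-processing turns the N queries into thresholded, labeled boxes):


import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("path/to/image.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep the queries with a confidence above 0.7, rescaled to image coordinates
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.7
)[0]
print(results["boxes"], results["labels"], results["scores"])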

References

Bibliography

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In Computer vision – ECCV 2020: 16th european conference, glasgow, UK, august 23–28, 2020, proceedings, part i (pp. 213–229). Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/978-3-030-58452-8_13
Chen, L.-C. (2014). Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv Preprint arXiv:1412.7062.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. Retrieved from https://arxiv.org/abs/1606.00915
Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. Retrieved from https://arxiv.org/abs/1706.05587
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. Retrieved from https://arxiv.org/abs/1802.02611
Ciresan, D., Giusti, A., Gambardella, L. M., & Schmidhuber, J. (2012). Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images, 9.
Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2023). Vision transformers need registers. arXiv:2309.16588.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.
Girshick, R. (2015). Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169
Girshick, Ross, Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [Cs]. Retrieved from http://arxiv.org/abs/1311.2524
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. arXiv:1706.04599 [Cs]. Retrieved from http://arxiv.org/abs/1706.04599
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2018). Mask R-CNN. arXiv:1703.06870 [Cs]. Retrieved from http://arxiv.org/abs/1703.06870
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., … Girshick, R. (2023). Segment anything. arXiv:2304.02643.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 936–944). https://doi.org/10.1109/CVPR.2017.106
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826
Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In 2018 IEEE/CVF conference on computer vision and pattern recognition (pp. 8759–8768). https://doi.org/10.1109/CVPR.2018.00913
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. arXiv:1411.4038 [Cs]. Retrieved from http://arxiv.org/abs/1411.4038
Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and Checkerboard Artifacts. Distill, 1(10), e3. https://doi.org/10.23915/distill.00003
Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., … Bojanowski, P. (2023). DINOv2: Learning robust visual features without supervision. arXiv:2304.07193.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640 [Cs]. Retrieved from http://arxiv.org/abs/1506.02640
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. CoRR, abs/1804.02767. Retrieved from http://arxiv.org/abs/1804.02767
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [Cs]. Retrieved from http://arxiv.org/abs/1506.01497
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 658–666). Los Alamitos, CA, USA: IEEE Computer Society. https://doi.org/10.1109/CVPR.2019.00075
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [Cs]. Retrieved from http://arxiv.org/abs/1505.04597
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., … Wang, Z. (2016). Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. arXiv:1609.05158 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1609.05158
Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Tejani, A., … Wang, Z. (2016). Is the deconvolution layer the same as a convolutional layer?, 7.
Terven, J., Córdova-Esparza, D.-M., & Romero-González, J.-A. (2023). A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Machine Learning and Knowledge Extraction, 5(4), 1680–1716. https://doi.org/10.3390/make5040083
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). Distance-IoU loss: Faster and better learning for bounding box regression. In The AAAI conference on artificial intelligence (AAAI) (pp. 12993–13000).
Zheng, Z., Wang, P., Ren, D., Liu, W., Ye, R., Hu, Q., & Zuo, W. (2021). Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Transactions on Cybernetics, 52(8), 8574–8586.
Zou, Z., Chen, K., Shi, Z., Guo, Y., & Ye, J. (2023). Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3), 257–276. https://doi.org/10.1109/JPROC.2023.3238524