Deep Learning for Image Segmentation
Overview
Image segmentation goes beyond classification: instead of assigning a single label to an entire image, the goal is to assign a label to every pixel. This makes segmentation a much richer and more challenging task, and one with high practical impact. Medical imaging relies on segmentation to delineate tumors, organs, and cells. Autonomous driving uses it to understand road scenes. Satellite imagery analysis uses it to map land use, detect buildings, or monitor deforestation.
Deep learning has transformed image segmentation. The Fully Convolutional Network (FCN) introduced the idea of end-to-end pixel-level prediction, and U-Net refined it with skip connections between encoder and decoder, becoming the standard architecture for biomedical segmentation. Since then, models like DeepLabV3, Feature Pyramid Networks, and SegFormer have pushed performance further. More recently, foundation models like Segment Anything (SAM) offer zero-shot segmentation without task-specific training. In this assignment, you will build, train, and evaluate segmentation models on a real-world dataset, and optionally compare your trained models against SAM’s zero-shot predictions.
Assignment Objectives
- Understand the difference between semantic segmentation and instance segmentation
- Implement and compare multiple segmentation architectures
- Build a proper data augmentation pipeline and measure its impact
- Work with pixel-level annotations and understand the associated data challenges, including class imbalance
- Evaluate segmentation quality using appropriate metrics (not just pixel accuracy), including per-class breakdowns
- Benchmark models for both accuracy and efficiency
Architectures
You should implement and compare at least three segmentation models. Suggested options:
- U-Net: the classic encoder-decoder architecture with skip connections, especially strong in medical imaging
- DeepLabV3 / DeepLabV3+: uses atrous (dilated) convolutions and atrous spatial pyramid pooling for multi-scale context
- Feature Pyramid Network (FPN): builds a feature pyramid for multi-scale predictions
- SegFormer: a Transformer-based segmentation model
- Pretrained backbones: for all of the above, you can use pretrained encoders (ResNet, EfficientNet, MobileNet) via libraries like segmentation_models_pytorch
You are encouraged to use the segmentation_models_pytorch (smp) library, which provides clean implementations of all major architectures with interchangeable backbones. HuggingFace Transformers also provides segmentation models.
Optional extension. Compare your best trained model against SAM (Segment Anything) used in zero-shot mode on your dataset. When does task-specific training beat a foundation model? This is an interesting and timely question.
Data Augmentation
A proper augmentation pipeline is essential for segmentation. Unlike classification, augmentations must be applied consistently to both the image and its mask.
You are required to:
Implement at least 3 augmentation strategies using AlbumentationsX (pip install albumentationsx), which handles joint image-mask transforms correctly. AlbumentationsX is the actively maintained successor to the original Albumentations library (same API, same import names). Examples include: random horizontal/vertical flips, random rotations, random crops, elastic deformations, color jitter, Gaussian blur, and coarse dropout.
Ablate their effect. Train your best architecture with and without augmentations and report the difference in validation mIoU. If time permits, compare individual augmentation strategies to identify which ones help most for your dataset.
For medical imaging datasets, elastic deformations and rotations tend to be particularly effective. For natural images, random cropping and color jitter are standard.
Datasets
Choose one primary dataset for your experiments. Your dataset should have pixel-level annotations and be publicly available. Suggested options:
- Medical imaging:
- Kvasir-SEG (polyp segmentation in colonoscopy images, ~1000 images)
- ISBI Cell Segmentation
- GlaS (gland segmentation in histology, well-defined evaluation protocol)
- Satellite/aerial imagery:
- Inria Aerial Image Labeling (building segmentation)
- LandCover.ai (land use segmentation)
- Massachusetts Buildings/Roads datasets
- Scene understanding:
- Pascal VOC 2012 (21 classes, ~2,900 segmentation images)
- CamVid (driving scene segmentation, 32 classes)
- Oxford-IIIT Pet Dataset (foreground/background/boundary, ~7,000 images, good for pipeline debugging)
- Cityscapes (if you have sufficient compute)
A note on dataset size: DRIVE (retinal vessels) has only 40 images and is too small for a meaningful deep learning project without very heavy augmentation. If you choose a small dataset, be aware that results will be noisy and validation protocol matters even more.
All of the above are publicly available and most run on Google Colab with appropriate batch sizes.
Validation Protocol
A sound validation protocol is necessary for meaningful results.
Train/validation/test split. Split your data into training, validation, and test sets. Use the validation set for hyperparameter tuning, model selection, and ablations. Evaluate on the held-out test set only once, for your final reported results.
Data leakage. Be careful with datasets that contain related images. For example, if you extract overlapping patches from the same source image (common in satellite and medical imaging), all patches from one source must go into the same split. A random split on patches will leak information and inflate your metrics.
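One way to enforce this is a group-aware split, sketched here with scikit-learn's GroupShuffleSplit; `source_ids`, mapping each patch to its source image, is an assumed array you would build from your own dataset:

```python
# Group-aware train/test split: all patches from one source image land in the
# same split, preventing leakage from overlapping patches.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

patch_ids = np.arange(100)          # 100 patches
source_ids = patch_ids // 10        # assumed: 10 patches per source image

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(patch_ids, groups=source_ids))

# No source image appears in both splits
assert set(source_ids[train_idx]).isdisjoint(set(source_ids[test_idx]))
```

A plain random split on `patch_ids` would put patches from the same source image in both splits, which is exactly the leakage described above.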
Reporting. State your split sizes and strategy clearly in the report.
Experimental Design
Your experiments should address the following:
Architecture comparison: train at least three models on the same dataset with the same training setup. Compare their performance, training time, and model size.
Backbone comparison: for one architecture (e.g., U-Net), compare at least two different pretrained backbones (e.g., ResNet-34 vs. EfficientNet-B3) to assess the impact of the encoder.
Loss functions: compare at least two loss functions appropriate for segmentation:
- Cross-entropy loss: the standard pixel-wise classification loss. For imbalanced datasets, consider using class weights or focal loss to handle underrepresented classes.
- Dice loss: directly optimizes the Dice coefficient, often better for imbalanced classes.
- Combined losses: e.g., cross-entropy + Dice.
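A combined loss can be as simple as a weighted sum. The sketch below shows one common formulation for binary segmentation; the 0.5/0.5 weights and the smoothing term are illustrative defaults, not prescribed values:

```python
# Soft Dice loss plus binary cross-entropy; weights and smoothing (eps) are
# illustrative and worth tuning for your dataset.
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss on sigmoid probabilities (binary masks, shape B,1,H,W)."""
    probs = torch.sigmoid(logits)
    num = 2 * (probs * targets).sum(dim=(1, 2, 3)) + eps
    den = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3)) + eps
    return 1 - (num / den).mean()

def combined_loss(logits, targets, w_ce=0.5, w_dice=0.5):
    ce = F.binary_cross_entropy_with_logits(logits, targets)
    return w_ce * ce + w_dice * dice_loss(logits, targets)

logits = torch.randn(4, 1, 64, 64)
targets = (torch.rand(4, 1, 64, 64) > 0.5).float()
loss = combined_loss(logits, targets)
```

For multi-class segmentation the same idea applies with softmax probabilities and a per-class Dice term averaged over classes.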
Augmentation ablation: compare training with and without your augmentation pipeline (see the Data Augmentation section).
Qualitative analysis: for your best model, show example predictions alongside ground truth masks. Include both good predictions and failure cases. Discuss what the model gets right and where it struggles.
Class imbalance. Many segmentation datasets are heavily imbalanced: background pixels vastly outnumber foreground, and some classes are much rarer than others. Beyond Dice loss, standard techniques include class weighting in cross-entropy and focal loss. Discuss how you handle class imbalance in your report.
Evaluation Metrics
Pixel accuracy alone is misleading for segmentation (a model that predicts “background” everywhere can achieve high pixel accuracy). Report the following:
- Intersection over Union (IoU / Jaccard index): the primary segmentation metric \[\text{IoU} = \frac{|P \cap G|}{|P \cup G|}\] where \(P\) is the predicted mask and \(G\) is the ground truth mask.
- Dice coefficient: closely related to IoU, widely used in medical imaging \[\text{Dice} = \frac{2|P \cap G|}{|P| + |G|}\] Note that the Dice coefficient is identical to the F1-score for binary segmentation. If you have seen F1 in classification contexts, this is the same formula applied to pixel-level predictions.
- Mean IoU (mIoU): IoU averaged across all classes (the standard benchmark metric)
- Per-class IoU breakdown: report IoU for each class individually, not just the mean. Identify which classes are hardest and discuss why.
- Pixel accuracy: report it, but discuss its limitations.
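For binary masks, both metrics reduce to a few set operations; a minimal NumPy sketch (with the convention that an empty prediction against an empty ground truth scores 1.0):

```python
# IoU = |P ∩ G| / |P ∪ G|;  Dice = 2|P ∩ G| / (|P| + |G|)
import numpy as np

def iou_and_dice(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    total = pred.sum() + gt.sum()
    dice = 2 * inter / total if total else 1.0
    return iou, dice

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
iou, dice = iou_and_dice(pred, gt)   # inter=2, union=4 -> IoU=0.5, Dice=4/6
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically; for mIoU, compute per-class IoU and average.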
Inference benchmarking. For each model, report: number of parameters, FLOPs (e.g., using fvcore or ptflops), and inference speed in FPS on your hardware. Discuss the accuracy vs. speed tradeoff across your models. In production settings, a model with slightly lower mIoU but much faster inference is often preferable.
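Parameter counts and a rough FPS estimate take only a few lines; this sketch uses a toy model and CPU timing for illustration, and a GPU measurement would additionally need `torch.cuda.synchronize()` around the timed loop:

```python
# Rough benchmarking helper: parameter count and frames per second for any
# nn.Module. Numbers are hardware-dependent; this is a CPU-only illustration.
import time
import torch
import torch.nn as nn

def benchmark(model, input_size=(1, 3, 256, 256), n_iters=20):
    model.eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_size)
    with torch.no_grad():
        model(x)                          # warm-up pass
        t0 = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        elapsed = time.perf_counter() - t0
    return n_params, n_iters / elapsed    # (parameters, FPS)

tiny = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, 1, 1))
params, fps = benchmark(tiny)
```

FLOPs are architecture-level and better left to fvcore or ptflops, as suggested above.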
Key Questions to Address
- Architecture comparison: which model achieves the best segmentation quality? Is there a clear winner, or does performance depend on the dataset characteristics?
- Encoder impact: how much does the choice of pretrained backbone matter? Does a heavier backbone always improve results?
- Loss function impact: does Dice loss improve performance over cross-entropy, especially for small or underrepresented classes?
- Augmentation impact: how much do augmentations improve validation performance? Which augmentations help most?
- Efficiency tradeoff: which model offers the best balance of accuracy and inference speed?
- Failure analysis: where does your best model fail? Are errors concentrated at object boundaries, on small objects, or on specific classes?
- Classification vs. segmentation: how does the difficulty and cost of segmentation compare to classification? What are the additional challenges (annotation cost, compute, evaluation)?
Compute Guidance
Google Colab provides limited GPU time and VRAM. A few practical suggestions:
- Mixed precision training (torch.cuda.amp) reduces memory usage and speeds up training. Use it by default.
- Gradient accumulation lets you simulate larger batch sizes when your GPU memory is limited. For example, accumulating gradients over 4 steps with batch size 4 approximates batch size 16.
- Short runs for ablations. When comparing augmentations, loss functions, or backbones, you do not need to train to convergence. Training for 10-20 epochs is often enough to see relative differences. Reserve longer training for your final model.
- Image resolution. Downscaling images (e.g., to 256x256 or 512x512) significantly reduces compute cost. State your resolution choice and justify it.
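Mixed precision and gradient accumulation combine naturally in one training loop. The sketch below uses a placeholder model and dummy batches, and falls back to full precision on CPU; your `model`, `loader`, and `criterion` would replace these:

```python
# Mixed precision (autocast + GradScaler) with gradient accumulation:
# one optimizer step per accum_steps batches approximates a 4x larger batch.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 1, 3, padding=1).to(device)        # placeholder model
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

accum_steps = 4                                          # 4 x batch 4 ~ batch 16
loader = [(torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64).round())
          for _ in range(8)]                             # dummy batches

optimizer.zero_grad()
for step, (images, masks) in enumerate(loader):
    images, masks = images.to(device), masks.to(device)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = criterion(model(images), masks) / accum_steps  # scale for accumulation
    scaler.scale(loss).backward()                        # gradients accumulate
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                           # step every accum_steps batches
        scaler.update()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the accumulated gradient on the same scale as a single large-batch step.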
Deliverables
You must submit:
- A written report (PDF)
- Maximum length: 10 pages
- No code in the report
- Clear explanation of your dataset, architectures, augmentation pipeline, experimental setup, and findings
- Include a table comparing all models (mIoU, per-class IoU, Dice, parameters, FPS)
- Include training curves
- Include qualitative results: at least 5 examples showing predicted masks vs. ground truth, covering both successes and failures
- Your code
- Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
- All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
- Well organized, with documented hyperparameters and a requirements.txt or environment.yml
Why This Matters
Image segmentation is at the core of many real-world computer vision systems. Radiologists use segmentation models to measure tumor volumes. Urban planners use them to map buildings from satellite images. Self-driving cars use them to understand road geometry. Unlike classification, segmentation requires the model to produce spatially precise outputs, which raises the bar for both model design and evaluation. This assignment gives you experience with a task that is closer to production computer vision than simple image classification: you will deal with augmentation pipelines, class imbalance, efficiency constraints, and per-class analysis, all of which are standard concerns in applied work.