Transfer Learning for Computer Vision
Overview
Transfer learning has become the default approach for computer vision in industry. Rather than training deep networks from scratch, which requires massive datasets and compute, practitioners fine-tune models pretrained on ImageNet (or similar large-scale datasets) for their specific task. Transfer learning often works surprisingly well even when the target domain differs from ImageNet, though the degree of benefit varies with domain gap.
In this assignment, you will apply transfer learning to a domain-specific image classification task of your choice. You will compare multiple pretrained architectures, explore different fine-tuning strategies, and investigate a central practical question: how little labeled data do you need for transfer learning to work well?
This is one of the most common workflows in deployed computer vision systems, from medical imaging to quality control in manufacturing. Understanding when and how transfer learning works is essential for any ML practitioner.
Assignment Objectives
- Apply pretrained deep learning models to a domain-specific image classification task
- Compare multiple architectures and fine-tuning strategies
- Investigate the effect of training set size on transfer learning performance
- Benchmark transfer learning against training from scratch
- Analyze model failures qualitatively, not just quantitatively
- Develop practical intuition for when transfer learning is sufficient and when it is not
Architectures
You should compare at least three pretrained architectures from different families. Suggested options include:
- CNNs: ResNet-50, EfficientNet-B0/B3, ConvNeXt
- Vision Transformers: ViT-B/16, DeiT, Swin Transformer
These models are available through torchvision, timm (PyTorch Image Models), or HuggingFace transformers. Note that torchvision includes ViT, Swin, ResNet, EfficientNet, and ConvNeXt, but DeiT requires timm or HuggingFace (torchvision’s ViT uses DeiT’s training recipe but does not expose the DeiT distillation architecture as a separate model). You are free to use any of these libraries.
Inference benchmarking. Beyond accuracy, report the following for each architecture: number of parameters, model size in MB, and inference latency (ms per image on GPU). Present these in a single comparison table alongside accuracy. In practice, model selection is driven by efficiency constraints as much as by raw performance.
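The benchmarking numbers above can be gathered with a short helper like the following sketch (assumptions: 224x224 RGB input, batch size 1; latency depends heavily on hardware, so always report which GPU you measured on):

```python
import time
import torch

@torch.no_grad()
def benchmark(model, device="cuda", n_warmup=10, n_runs=50):
    """Return parameter count, model size in MB, and mean latency in ms."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, 224, 224, device=device)
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
    # Warm-up runs so one-time costs (kernel compilation, caching) are excluded.
    for _ in range(n_warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # GPU work is async; sync before timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / n_runs * 1000
    return {"params": n_params, "size_mb": size_mb, "latency_ms": latency_ms}
```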
Fine-Tuning Strategies
You should explore and compare at least three of the following strategies:
- Feature extraction: freeze the pretrained backbone entirely, train only the classification head (typically one or two linear layers; the choice of head architecture can matter, so document what you use)
- Full fine-tuning: unfreeze all layers and train end-to-end with a small learning rate
- Gradual unfreezing: progressively unfreeze layers from top to bottom during training
- Discriminative learning rates: use smaller learning rates for early layers and larger ones for later layers
- Data augmentation: compare performance with and without augmentation (random crops, flips, color jitter, mixup, etc.)
For each strategy, report training curves, final accuracy, and any signs of overfitting.
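Two of the strategies above, feature extraction and discriminative learning rates, come down to how you select parameters and build optimizer groups. A minimal sketch, assuming a ResNet-style model with an `fc` head (adjust the attribute name for other architectures):

```python
import torch

def feature_extraction_params(model):
    """Freeze the backbone; return only the head parameters for the optimizer."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

def discriminative_lr_groups(model, base_lr=1e-4, head_lr=1e-3):
    """Smaller learning rate for the pretrained backbone, larger for the new head."""
    head_ids = {id(p) for p in model.fc.parameters()}
    backbone = [p for p in model.parameters() if id(p) not in head_ids]
    return [{"params": backbone, "lr": base_lr},
            {"params": list(model.fc.parameters()), "lr": head_lr}]
```

Either output feeds directly into an optimizer, e.g. `torch.optim.AdamW(discriminative_lr_groups(model))`. The learning rate values shown are placeholders; tune them for your task.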
Hyperparameter reporting. Include a table listing all hyperparameters for each experiment: optimizer, learning rate, learning rate schedule (e.g., cosine annealing, step decay), batch size, number of epochs, early stopping patience, and augmentation details. Without this, your results are not reproducible.
Datasets
Choose one primary dataset for your main experiments, and optionally a second dataset to test generalization of your findings. Your dataset should be domain-specific (not ImageNet or CIFAR), with a reasonable number of classes and enough images to be interesting. Suggested options:
- Medical imaging: MedMNIST variants (PathMNIST, DermaMNIST, BloodMNIST) are standardized, well-documented, and sized for Colab. Skin lesion classification (ISIC) and retinal disease (OCT) are also good choices. Note: CheXpert-full is 439 GB and requires a data use agreement; if you want chest X-rays, use CheXpert-small or a MedMNIST variant instead.
- Satellite/aerial: EuroSAT (land use), UC Merced Land Use
- Food: Food-101
- Fine-grained recognition: FGVC-Aircraft, Oxford Flowers, CUB-200 Birds
All of these are publicly available and run comfortably on Google Colab.
Data Efficiency Experiment
A key part of this assignment is to study how transfer learning performs as labeled data decreases. For your primary dataset:
- Select your best-performing architecture for this experiment. Train it using 100%, 50%, 25%, 10%, and 5% of the training data.
- Compare against the same architecture trained from scratch (randomly initialized) at each data level. You only need to run the from-scratch baseline for this one architecture; running all three from scratch at all data fractions is unnecessary.
- Plot accuracy vs. training set size for both pretrained and from-scratch models.
- Run each configuration with at least 3 random seeds. Report mean and standard deviation. Results at low data fractions (5%, 10%) are noisy from a single run, so multiple seeds are needed to draw reliable conclusions.
This experiment should clearly demonstrate the practical value of transfer learning in low-data regimes.
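Aggregating the multi-seed results for the plot is straightforward; a sketch with placeholder numbers (the `results` values here are invented, not expected outcomes):

```python
import numpy as np

# Map data fraction -> list of test accuracies across seeds (placeholder data).
results = {1.0: [0.91, 0.90, 0.92], 0.05: [0.62, 0.70, 0.66]}

# Mean and standard deviation per fraction, for an errorbar plot.
summary = {frac: (float(np.mean(accs)), float(np.std(accs)))
           for frac, accs in results.items()}
```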
Validation Strategy
Use a proper train/validation/test split. The test set is held out for final evaluation only; use the validation set for model selection and early stopping.
When subsampling training data (e.g., to 5%), keep the validation and test sets fixed at their original size. Subsample only the training portion. Use stratified sampling to preserve class proportions. If your dataset does not come with a predefined split, use a 70/15/15 or 80/10/10 split and document your choice.
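Stratified subsampling of the training indices can be done with scikit-learn's `train_test_split`; a sketch (assumes `labels` is the array of training labels, and only the returned indices are used for training):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def subsample_train(labels, fraction, seed=0):
    """Return a stratified subset of training indices; val/test stay untouched."""
    if fraction >= 1.0:
        return np.arange(len(labels))
    keep_idx, _ = train_test_split(
        np.arange(len(labels)),
        train_size=fraction,
        stratify=labels,      # preserve class proportions in the subset
        random_state=seed,    # vary this for your 3+ seeds
    )
    return keep_idx
```

With PyTorch, the returned indices plug into `torch.utils.data.Subset(train_dataset, keep_idx)`.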
Evaluation Metrics
Report the following metrics on a held-out test set:
- Overall accuracy
- Macro-averaged F1 score (important if classes are imbalanced)
- Per-class precision and recall for at least the best and worst performing classes
- Confusion matrix for your best model
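All of these metrics are available in scikit-learn; a sketch collecting them in one place (`y_true` and `y_pred` are integer class arrays from your test set):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             classification_report, confusion_matrix)

def evaluate(y_true, y_pred):
    """Compute the required test-set metrics in one pass."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "confusion": confusion_matrix(y_true, y_pred),
        # Per-class precision/recall/F1, as a nested dict keyed by class label.
        "per_class": classification_report(y_true, y_pred, output_dict=True),
    }
```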
Error Analysis
For your best model, go beyond aggregate metrics. Select 10–20 misclassified examples and inspect them. Group failures by category: ambiguous or visually similar classes, poor image quality, possible labeling errors, or cases that reflect genuine domain difficulty. Include representative examples in your report with brief commentary.
This kind of qualitative analysis is where real debugging intuition comes from. Confusion matrices tell you which classes are confused; looking at individual failures tells you why.
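One convenient way to select examples for this analysis is to rank mistakes by the model's confidence in its wrong prediction; confidently wrong examples are often the most informative. A sketch (`probs` is an (N, C) array of softmax outputs, `y_true` the true labels):

```python
import numpy as np

def worst_mistakes(probs, y_true, k=20):
    """Indices of misclassified examples, most confident mistakes first."""
    y_pred = probs.argmax(axis=1)
    wrong = np.where(y_pred != y_true)[0]
    conf = probs[wrong, y_pred[wrong]]       # confidence in the wrong class
    return wrong[np.argsort(-conf)][:k]
```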
Key Questions to Address
- Architecture comparison: which pretrained model performs best on your task? Does model size always help? How do architectures compare on inference efficiency?
- Fine-tuning strategy: does full fine-tuning always beat feature extraction? When is freezing the backbone sufficient?
- Data efficiency: how much labeled data is needed for transfer learning to match training from scratch with the full dataset?
- Domain gap: how different is your target domain from ImageNet? Does this affect which strategies work best?
- Practical recommendations: given your results, what advice would you give to a practitioner starting a new image classification project?
Deliverables
You must submit:
- A written report (PDF)
- Maximum length: 10 pages (this is tight given the number of experiments; be selective about which results you present in detail)
- No code in the report
- Clear explanation of your dataset, methodology, experiments, and findings
- Include relevant figures (training curves, accuracy vs. data size plots, confusion matrices, misclassified examples)
- Include the hyperparameter table and inference benchmarking table described above
- Your code
- Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
- All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
- Well organized, with a requirements.txt or environment.yml
Practical Tips
Preprocessing. Pretrained models expect inputs normalized with ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]). Use the same normalization even for non-ImageNet datasets. Getting this wrong is a common source of silently degraded performance.
Mixed precision training. If you are training on Colab, consider using torch.cuda.amp (automatic mixed precision). It roughly halves GPU memory usage and speeds up training, which helps when working within Colab’s resource limits.
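A minimal mixed-precision training step looks like the following sketch; `model`, `loader`, and `optimizer` are placeholders for your own objects, and autocast/scaling are disabled automatically when you run on CPU:

```python
import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One epoch with automatic mixed precision (no-op scaling on CPU)."""
    use_amp = device == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type=device, enabled=use_amp):
            loss = criterion(model(images), targets)   # forward in float16
        scaler.scale(loss).backward()                  # scale to avoid underflow
        scaler.step(optimizer)
        scaler.update()
```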
Compute planning. The full set of experiments (3 architectures, 3 strategies, data efficiency curves with multiple seeds) adds up. Plan your GPU time. The data efficiency experiment with the from-scratch baseline is the most expensive part; limit it to one architecture as described above. Run your other architecture and strategy comparisons at 100% data.
Why This Matters
Transfer learning is not a research curiosity; it is the standard approach for computer vision in production. Most companies do not have millions of labeled images for their specific task. They have hundreds or thousands. Understanding how to select a pretrained model, choose a fine-tuning strategy, and evaluate whether you have enough data is a core skill for any ML engineer working with images. This assignment gives you hands-on practice with exactly that workflow.