Text Classification: From Classical NLP to Transformers

DSS5104

Overview

Text classification is one of the most common NLP tasks in industry: spam filtering, sentiment analysis, customer intent detection, content moderation, and document routing all reduce to classifying text into categories. The landscape of methods has evolved dramatically, from bag-of-words models and TF-IDF features, to recurrent neural networks, to pretrained Transformers like BERT.

A key practical question has emerged: when is a fine-tuned Transformer actually worth the added complexity over a simple TF-IDF baseline? The answer is not always obvious. On some tasks, a logistic regression on TF-IDF features performs nearly as well as BERT. On others, the gap is substantial. Understanding where each method sits on the cost-accuracy spectrum is essential for making sound engineering decisions.

In this assignment, you will benchmark the full spectrum of text classification approaches, from the simplest classical baselines to fine-tuned Transformers, across two datasets chosen to test different hypotheses. All models run locally; no API calls are needed.


Assignment Objectives

  • Implement and compare classical, deep learning, and Transformer-based text classification methods
  • Evaluate performance across two datasets with different characteristics (one where you expect classical methods to be competitive, one where you expect Transformers to shine)
  • Investigate the effect of training set size on the relative performance of simple vs. complex models
  • Analyze the cost-accuracy trade-off: when is a larger model justified?
  • Perform error analysis on misclassified examples to understand failure modes
  • Develop practical recommendations for choosing a text classification approach

Methods

You should implement and compare models across three tiers of complexity.

Tier 1: Classical Baselines

  • TF-IDF + Logistic Regression: the standard baseline for text classification
  • TF-IDF + SVM (typically a linear kernel for text): often competitive with more complex models
  • Optionally: n-gram features, feature selection, or other classical NLP techniques

Tier 2: Neural Models (choose at least one)

  • FastText: lightweight word-embedding-based classifier
  • CNN for text (e.g., TextCNN): convolutional filters over word embeddings
  • BiLSTM: recurrent model with bidirectional context

Tier 3: Pretrained Transformers (choose at least two)

  • DistilBERT: a smaller, faster version of BERT that runs well on Colab
  • BERT-base or RoBERTa-base: standard fine-tuned Transformers
  • SetFit: a few-shot learning framework that fine-tunes sentence transformers with very little labeled data. Important: SetFit is designed for low-data regimes. In the data efficiency experiment, evaluate SetFit only at low data fractions (1%, 5%, and at most 10%), not at full dataset scale, where it becomes impractically slow due to contrastive pair generation.

Libraries: scikit-learn for classical models, HuggingFace Transformers and Datasets for Transformer-based models, SetFit library for few-shot experiments. All of these run on Google Colab.
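To make the Tier 1 setup concrete, here is a minimal sketch of a TF-IDF + logistic regression baseline as a scikit-learn Pipeline. The texts, labels, and hyperparameter values below are illustrative placeholders, not assignment data or required settings.

```python
# Tier 1 baseline sketch: TF-IDF features feeding logistic regression,
# assembled as a single scikit-learn Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in data; replace with your real dataset.
texts = ["great movie, loved it", "terrible plot and acting",
         "an excellent, moving film", "boring and badly written"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("lr", LogisticRegression(C=1.0, max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["what a wonderful film"]))
```

Wrapping the vectorizer and classifier in one Pipeline ensures the TF-IDF vocabulary is fit only on training data, which avoids leakage during tuning.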


Datasets

Choose two datasets covering different text classification scenarios:

  • One dataset where you expect classical methods (TF-IDF) to be competitive. For example, a topic classification task with distinctive vocabulary per class.
  • One dataset where you expect Transformers to have a clear advantage. For example, a task requiring contextual understanding, such as sentiment, irony detection, or fine-grained intent classification.

State your hypothesis about each dataset upfront in your report.

Suggested options

  • Sentiment analysis: IMDb movie reviews, SST-2, Amazon product reviews, Yelp reviews
  • Topic classification: AG News (4 topics), 20 Newsgroups
  • Intent detection: Banking77 (77 banking intents), ATIS
  • Short/noisy text: TweetEval (sentiment, emotion, hate speech, irony)
  • Domain-specific: Financial PhraseBank (sentiment on financial news)
  • Emotion detection: GoEmotions, SemEval emotion datasets

These are all publicly available through HuggingFace Datasets or Kaggle.

Dataset warnings

Be aware of the following known issues:

  • 20 Newsgroups: contains header metadata (e.g., “From:” fields) that leaks class labels. You must strip headers, footers, and quotes before using this dataset. In scikit-learn, use remove=('headers', 'footers', 'quotes') when loading.
  • Jigsaw Toxic Comment Classification: this is a multi-label problem (each comment can have multiple toxicity labels). The rest of this assignment assumes single-label, multi-class classification. If you use Jigsaw, you must handle multi-label evaluation (per-label binary classification or thresholding), or convert it to a single-label problem with a clear justification.
  • CLINC150: has 150 intent classes, which fundamentally changes the difficulty and makes cross-dataset comparisons harder. Fine as one of your two datasets, but acknowledge that the large label space affects results.
  • GoEmotions: has significant class imbalance across emotion categories. See the note on class imbalance below.

Validation Protocol

Use a proper train / validation / test split for all experiments:

  1. Training set: used to train models.
  2. Validation set: used for hyperparameter selection and model comparison during development.
  3. Test set: used only once to report final numbers. Do not use test performance to choose hyperparameters or select among model variants.

If the dataset provides a standard test split (e.g., SST-2, AG News), use it. Otherwise, hold out 10-15% of the data as a test set and split the remainder into train and validation.

In your report, confirm explicitly that your test set was not used for any model selection decisions.
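When no standard test split exists, the three-way split described above can be produced with two chained stratified splits. A hedged sketch, using synthetic placeholder data and a 15% test hold-out:

```python
# Three-way stratified split: hold out the test set first, then split the
# remainder into train and validation.
from sklearn.model_selection import train_test_split

X = [f"document {i}" for i in range(1000)]  # stand-in for real texts
y = [i % 4 for i in range(1000)]            # 4 balanced toy classes

# Hold out 15% as the final test set (touched only once).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Split the remainder into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.15, stratify=y_dev, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` makes the split reproducible across runs, and `stratify` keeps class proportions consistent in all three subsets.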


Hyperparameter Tuning

For a fair comparison across model tiers, each model must receive a minimal hyperparameter search. You do not need an exhaustive grid search, but you should not use arbitrary defaults either. At minimum:

  • Logistic Regression / SVM: tune the regularization parameter C over {0.1, 1, 10}. Consider experimenting with max vocabulary size and n-gram range (e.g., unigrams vs. bigrams).
  • Neural models: tune learning rate and number of epochs. Report the values you used.
  • Transformers: tune learning rate from {1e-5, 2e-5, 5e-5}, and train for 3-5 epochs. Use the validation set for early stopping.

Select hyperparameters based on validation set performance, then report final results on the test set.
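Since model selection here uses a fixed validation set rather than cross-validation, the minimal search for C can be a plain loop. A sketch with toy stand-in data:

```python
# Minimal hyperparameter search: pick C by validation accuracy, then report
# final numbers on the untouched test set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy spam-vs-ham stand-in data; replace with your real splits.
train_texts = ["cheap pills now", "meeting at noon",
               "win money fast", "lunch tomorrow?"] * 10
train_y = [1, 0, 1, 0] * 10
val_texts = ["free money offer", "see you at the meeting"]
val_y = [1, 0]

# Fit the vectorizer on training data only to avoid leakage.
vec = TfidfVectorizer().fit(train_texts)
Xtr, Xval = vec.transform(train_texts), vec.transform(val_texts)

best_C, best_acc = None, -1.0
for C in (0.1, 1, 10):
    model = LogisticRegression(C=C, max_iter=1000).fit(Xtr, train_y)
    acc = model.score(Xval, val_y)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(f"selected C={best_C} (val acc={best_acc:.2f})")
```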


Data Efficiency Experiment

A central part of this assignment is to study how model performance changes as labeled data decreases. For at least one of your datasets:

  1. Train all models using 100%, 50%, 25%, 10%, 5%, and 1% of the training data
  2. Plot accuracy (or F1) vs. training set size for each method
  3. Identify the crossover point: at what data size does TF-IDF + logistic regression match or beat a fine-tuned Transformer?

When subsampling, use stratified sampling to preserve class proportions.

Recall that SetFit should only be included at data fractions of 1%, 5%, and at most 10%.

This experiment directly addresses the practical question: how much labeled data justifies using a Transformer?
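One way to implement the stratified subsampling above is to reuse train_test_split with train_size set to each fraction. A sketch with synthetic imbalanced labels:

```python
# Stratified subsampling for the learning-curve experiment: train_size=frac
# with stratify preserves class proportions at every fraction.
from collections import Counter
from sklearn.model_selection import train_test_split

y_full = [0] * 800 + [1] * 200        # toy 80/20 imbalanced labels
X_full = list(range(len(y_full)))     # stand-in for documents

for frac in (0.5, 0.25, 0.10, 0.05, 0.01):
    X_sub, _, y_sub, _ = train_test_split(
        X_full, y_full, train_size=frac, stratify=y_full, random_state=0)
    print(frac, len(y_sub), Counter(y_sub))
```

Keeping `random_state` fixed per seed means each smaller subsample is drawn consistently, so differences across fractions reflect data size rather than sampling luck.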


Reproducibility and Statistical Reliability

To ensure your comparisons are meaningful:

  • Run each experiment with at least 3 different random seeds. Report the mean and standard deviation of your metrics. A 0.5% difference in F1 with high variance is noise, not a real gap.

  • Include a requirements.txt (or equivalent) so that a TA can set up the environment and reproduce your results.
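Aggregating a metric across seeds needs nothing beyond the standard library. A sketch, where the per-seed scores are illustrative numbers rather than real results:

```python
# Report mean and sample standard deviation of a metric across random seeds.
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

f1_by_seed = [0.842, 0.851, 0.838]  # e.g., macro-F1 from seeds 0, 1, 2
mean, std = summarize(f1_by_seed)
print(f"macro-F1 = {mean:.3f} +/- {std:.3f}")
```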


Evaluation Metrics

Report the following on a held-out test set (mean and standard deviation across seeds):

  • Accuracy: overall classification accuracy
  • Macro-averaged F1 score: accounts for class imbalance
  • Per-class F1 for at least the best and worst performing classes
  • Training time and inference time for each model (important for practical comparisons)
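All of the label-based metrics above are available from scikit-learn. A sketch using toy prediction arrays in place of real model output:

```python
# Accuracy, macro-F1, and per-class F1 from true vs. predicted labels.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]  # toy ground truth
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]  # toy predictions

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per class

print(acc, macro_f1, per_class_f1)
```

`average=None` returns one F1 per class (ordered by sorted label), which is what you need to identify the best- and worst-performing classes.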

Error Analysis

For your best-performing model and your worst-performing model on each dataset:

  1. Examine 20-30 misclassified examples from the test set.
  2. Categorize the failure modes. Common categories include: ambiguous ground truth labels, short or uninformative text, sarcasm or implicit meaning, out-of-distribution vocabulary, and label noise.
  3. Discuss: are the two models failing on the same examples, or different ones? What does this tell you about their respective strengths?

This analysis is at least as valuable as the quantitative metrics. In industry, understanding why a model fails drives improvement decisions.


Practical Considerations

Class imbalance

Several suggested datasets (GoEmotions, Jigsaw, TweetEval hate speech) have significant class imbalance. If your chosen dataset is imbalanced, you should address this explicitly. Options include: using class weights in your loss function, stratified sampling, or oversampling the minority class. At minimum, report class distributions and discuss how imbalance affects your results. Macro-averaged F1 is more informative than accuracy in this setting.
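For the class-weight option, scikit-learn can derive inverse-frequency weights directly. A sketch with synthetic labels:

```python
# "Balanced" class weights: n_samples / (n_classes * class_count), so the
# minority class receives the larger weight.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # toy 90/10 imbalance
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# Equivalently, let the classifier compute the same weights internally:
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```

For neural models, the analogous move is passing per-class weights to the loss function (e.g., the `weight` argument of PyTorch's cross-entropy loss).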

Text length and token limits

Transformer models have a maximum input length (typically 512 tokens for BERT). TF-IDF handles arbitrary-length text without truncation. Report the distribution of text lengths in your datasets (mean, median, 95th percentile in tokens). If a significant fraction of examples exceed 512 tokens, discuss how truncation affects Transformer performance and whether this gives classical methods an advantage on longer documents.
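The length statistics above are easy to compute. The sketch below uses whitespace token counts as a rough proxy on toy texts; for exact counts against a model's 512-token limit, use that model's own tokenizer (e.g., a HuggingFace AutoTokenizer) instead.

```python
# Report the text-length distribution: mean, median, 95th percentile, and
# the fraction of documents exceeding the 512-token Transformer limit.
import numpy as np

# Toy documents; replace with your dataset's texts.
texts = ["short text",
         "a somewhat longer document " * 30,
         "medium length example here"]
lengths = np.array([len(t.split()) for t in texts])  # whitespace proxy

print("mean:", lengths.mean())
print("median:", np.median(lengths))
print("95th pct:", np.percentile(lengths, 95))
print("frac > 512 tokens:", (lengths > 512).mean())
```

Note that subword tokenizers typically produce more tokens than whitespace splitting, so the whitespace proxy understates how many documents will be truncated.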


Key Questions to Address

  1. Baseline strength: on which tasks does TF-IDF + logistic regression perform surprisingly well? Why?
  2. Transformer advantage: on which tasks do Transformers clearly outperform classical methods? What characteristics of the task or data explain this?
  3. Data efficiency: how does the ranking of methods change as labeled data decreases? When does SetFit (few-shot) become competitive?
  4. Cost-accuracy trade-off: considering training time, model size, and inference speed alongside accuracy, which method offers the best trade-off for each task?
  5. Failure modes: what types of examples does each model struggle with, and do models at different tiers fail on the same examples?
  6. Practical recommendation: if a company asked you to build a text classifier for a new task with limited labeled data, what would you recommend as a starting point?

Deliverables

You must submit:

  1. A written report (PDF)
    • Maximum length: 10 pages
    • No code in the report
    • Clear explanation of your datasets (including hypotheses about where each model tier will shine), methodology, experiments, and findings
    • Include comparison tables, learning curves (accuracy vs. data size), and training time comparisons
    • Include the error analysis section with categorized failure modes
    • State your validation protocol and confirm test set integrity
  2. Your code
    • Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
    • All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
    • Well organized, with a requirements.txt or environment.yml

Why This Matters

Text classification is one of the first ML tasks deployed in most organizations. Customer support teams classify tickets. Legal teams classify documents. Marketing teams classify social media mentions. The choice between a simple TF-IDF model and a fine-tuned Transformer has real consequences: compute costs, latency requirements, maintenance burden, and labeling budgets all factor in. This assignment trains you to make that choice with evidence rather than hype, which is exactly what industry practitioners need to do every day.