Critique: Text Classification Assignment

Strengths

The assignment gets several things right.

The framing around “when is a Transformer worth it?” is the single most valuable question an ML practitioner can ask, and building the entire project around it is a strong pedagogical choice. Most courses either skip classical methods entirely or treat them as afterthoughts; this one forces students to take TF-IDF seriously, which matches real industry practice.

The data efficiency experiment (training on 1%–100% of data) is well designed. It forces students to confront the reality that labeled data is expensive and that fancy models don’t always win in low-data regimes. This is a lesson many junior ML engineers learn painfully on the job.

Requiring training time and inference time alongside accuracy is good. Too many academic projects optimize for accuracy alone. The “Why This Matters” section is honest and grounded, not selling hype.

The tiered structure (classical, neural, Transformer) gives a clear scaffold and makes grading criteria transparent.

Gaps for Career Relevance

Several things that matter in industry are absent.

No error analysis requirement. The assignment asks students to report metrics but never asks them to look at misclassified examples, build a confusion matrix, or ask “why did this fail?” In industry, error analysis drives 80% of model improvement decisions. Add a mandatory section: “For your best model, examine 20–50 misclassified examples. Categorize the failure modes. What would you fix first?”
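A minimal sketch of the inspection loop such a section could require, using sklearn's `confusion_matrix` on hypothetical labels and predictions (the toy data here is illustrative only):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions from a trained model.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1])
texts = [f"example {i}" for i in range(len(y_true))]

# Confusion matrix: rows are true classes, columns are predictions.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Pull out misclassified examples for manual inspection and categorization.
errors = [(t, yt, yp) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]
for text, true_label, pred_label in errors[:50]:
    print(f"true={true_label} pred={pred_label}: {text}")
```

The categorization step (label noise? sarcasm? out-of-domain vocabulary?) is the part that drives improvement decisions; the code is just the on-ramp.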

No data preprocessing discussion. Real text data is messy: HTML artifacts, encoding issues, mixed languages, class label noise, duplicate entries. The assignment assumes clean HuggingFace datasets. Consider requiring students to document their preprocessing pipeline and justify decisions (lowercasing? removing stopwords? handling URLs and @-mentions?).
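A minimal sketch of such a documented cleanup step, assuming tweet-like input; each regex is a judgment call students should justify, not the one right answer:

```python
import re

def clean_text(text: str) -> str:
    """Minimal cleanup: strip URLs, @-mentions, and HTML tags, then lowercase.
    Each step is a documented decision, not a default."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # @-mentions
    text = re.sub(r"<[^>]+>", " ", text)        # HTML artifacts
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text.lower()

print(clean_text("Check this out @user https://example.com <br> GREAT movie"))
# → "check this out great movie"
```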

No model deployment or serving considerations. Model size, memory footprint, and latency under load are absent. Even a brief requirement to report model parameter count and single-example inference latency on CPU would add value.
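For the classical tier both numbers are cheap to obtain; a minimal sketch with a toy TF-IDF + logistic regression model (the data is illustrative only):

```python
import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good movie", "bad movie", "great film", "terrible film"] * 10
labels = [1, 0, 1, 0] * 10

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# "Parameter count" for a linear model: weights plus intercepts.
n_params = clf.coef_.size + clf.intercept_.size
print(f"parameters: {n_params}")

# Single-example CPU latency, averaged over repeats to reduce timer noise.
x = vec.transform(["a great movie"])
start = time.perf_counter()
for _ in range(100):
    clf.predict(x)
latency_ms = (time.perf_counter() - start) / 100 * 1000
print(f"latency: {latency_ms:.3f} ms/example")
```

For Transformers, `sum(p.numel() for p in model.parameters())` plus the same timing pattern gives comparable numbers.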

No experiment tracking. Students should learn to log experiments systematically. Requiring Weights & Biases, MLflow, or even a structured CSV of all runs would teach a habit that every industry team expects.
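A structured-CSV logger is a dozen lines; a minimal sketch with hypothetical column names:

```python
import csv
from pathlib import Path

LOG_PATH = Path("runs.csv")
FIELDS = ["run_id", "model", "dataset", "data_fraction", "macro_f1", "train_seconds"]

def log_run(row: dict) -> None:
    """Append one experiment to a structured CSV, writing the header once."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_run({"run_id": 1, "model": "tfidf_lr", "dataset": "ag_news",
         "data_fraction": 1.0, "macro_f1": 0.91, "train_seconds": 12.4})
```

Even this minimal version makes "which run produced that number?" answerable a week later.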

No discussion of text length distribution. Transformers have token limits. TF-IDF handles arbitrary length. This is a real constraint that affects model choice, and the assignment should ask students to report and discuss text length statistics for their datasets.
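A minimal sketch of such a report, using whitespace token counts as a cheap proxy for subword length (the ×1.3 factor is a rough rule of thumb for English BERT-style tokenizers, not an exact conversion):

```python
import numpy as np

# Hypothetical corpus; in the assignment this would be a HuggingFace dataset.
texts = ["short text",
         "a somewhat longer example sentence here",
         "one more document " * 40]

# Whitespace token counts approximate document length cheaply.
lengths = np.array([len(t.split()) for t in texts])
print(f"mean={lengths.mean():.1f}  p95={np.percentile(lengths, 95):.0f}  max={lengths.max()}")

# Estimated fraction of documents truncated at a 512-subword limit.
over_limit = (lengths * 1.3 > 512).mean()
print(f"fraction over limit: {over_limit:.2f}")
```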

No confidence calibration. In production, you rarely just want a label; you want to know how confident the model is. Asking students to plot reliability diagrams or compute expected calibration error (ECE) for at least one model would be a useful addition.
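For illustration, a minimal ECE implementation that bins predictions by confidence; the toy inputs are perfectly calibrated by construction:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the |accuracy − confidence|
    gap per bin, weighted by the fraction of predictions in that bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy case: 80% confidence, 80% accuracy.
conf = [0.8] * 10
corr = [1] * 8 + [0] * 2
print(expected_calibration_error(conf, corr))  # → 0.0
```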

Scope Concerns

For a group of up to 5 students, the scope is reasonable but risks being shallow.

The requirement is: 2 classical models, at least 1 neural model, at least 2 Transformers, across at least 3 datasets, with a data efficiency experiment on at least one dataset. That is roughly 5 models x 3 datasets = 15 training runs at full data, plus ~30 more for the efficiency curve. On Colab with free GPUs, fine-tuning BERT-base on 3 datasets is feasible but tight. Students may spend most of their time on engineering plumbing rather than analysis.

Risk: the breadth encourages “run everything, analyze nothing.” Consider reducing to 2 datasets (one easy, one hard) and requiring deeper analysis, or explicitly stating that depth of analysis matters more than number of models.

For 5 students, there is also the coordination problem. The assignment does not suggest how to divide work. Adding a note like “we recommend one person owns the classical pipeline, one owns the neural model, one owns Transformer fine-tuning, one owns the data efficiency experiment, and one owns the final report and analysis” would help.

Technical Accuracy

A few claims need correction or nuance.

“DistilBERT… runs well on Colab”: this is true for fine-tuning on small datasets with a free T4 GPU, but students should be warned about Colab session timeouts and the need to checkpoint. If a dataset has 500K+ examples, fine-tuning even DistilBERT on free Colab may be painful.

SetFit framing: SetFit is described as “a few-shot learning framework,” which is correct, but the assignment then asks students to train it with 100% of data in the efficiency experiment. SetFit is not designed for that regime and will be slow (it generates contrastive pairs). Clarify that SetFit should only be evaluated at low data fractions (1%, 5%, maybe 10%), not at full scale.

“TF-IDF + Logistic Regression” as “the standard baseline”: this is accurate, but the assignment should note that the choice of regularization strength (C parameter) and max features matters a lot. Without tuning, students may get misleadingly bad results for the classical baseline and draw wrong conclusions. A note about hyperparameter tuning being necessary for fair comparison across all tiers would help.
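One way to phrase that note concretely: a small grid search over C and the n-gram range, sketched here on toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["great fun", "awful bore", "loved it", "hated it",
         "fun and great", "boring and awful", "really loved this", "truly hated this"] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the settings that most affect the classical baseline.
grid = GridSearchCV(pipe, {
    "clf__C": [0.1, 1.0, 10.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}, cv=5, scoring="f1_macro")
grid.fit(texts, labels)
print(grid.best_params_, round(grid.best_score_, 3))
```

Even this small grid often moves the classical baseline by several F1 points on real datasets, which can flip the Transformer-vs-classical conclusion.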

Missing: tokenizer choices matter. For Transformer models, the tokenizer is fixed. For classical models, choices like word-level vs. character n-grams, vocabulary size, and subword handling affect results significantly. The assignment does not mention this.
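The difference is easy to demonstrate; a minimal sketch contrasting word-level and character n-gram vectorizers on elongated spellings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cooool moviiie", "cool movie"]

# Word-level features treat elongated spellings as unrelated tokens...
word_vec = TfidfVectorizer(analyzer="word")
word_vec.fit(texts)
print(sorted(word_vec.vocabulary_))  # four distinct "words"

# ...while character n-grams capture the shared subsequences.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
char_vec.fit(texts)
print(len(char_vec.vocabulary_))
```

On noisy short text (tweets, chat), character n-grams are often a stronger classical baseline than word unigrams for exactly this reason.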

Missing Practical Considerations

Class imbalance. Several suggested datasets (Jigsaw, GoEmotions) have severe class imbalance. The assignment mentions macro F1 but does not discuss strategies for handling imbalance (oversampling, class weights, stratified splitting). Students who pick Jigsaw without addressing imbalance will get misleading results.
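Both mitigations are one-liners in sklearn; a minimal sketch on a synthetic 9:1 imbalanced problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A synthetic 9:1 imbalanced problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([0] * 180 + [1] * 20)

# Stratified splitting keeps the minority-class ratio intact in both splits...
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
print(y_tr.mean(), y_te.mean())  # both ≈ 0.10

# ...and class weights penalize minority-class errors more heavily.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
```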

Hyperparameter tuning. The assignment does not mention it at all. In a fair comparison, you need to tune each model. Learning rate for Transformers, C for logistic regression, number of epochs, batch size: these all matter. State explicitly whether students should do hyperparameter search, and if so, how (grid search? a few manual runs?). Without this, results are not comparable.

Validation strategy. No mention of train/validation/test splits or cross-validation. Students should hold out a validation set for model selection and report final numbers on a test set they never tuned on. This is fundamental and should be stated explicitly.
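The two-stage split that protocol implies can be sketched as:

```python
from sklearn.model_selection import train_test_split

data = list(range(1000))
labels = [i % 2 for i in data]

# First carve off a test set that is never touched during development...
dev_X, test_X, dev_y, test_y = train_test_split(
    data, labels, test_size=0.2, stratify=labels, random_state=42)

# ...then split the remainder into train (for fitting) and validation
# (for model selection and hyperparameter tuning).
train_X, val_X, train_y, val_y = train_test_split(
    dev_X, dev_y, test_size=0.125, stratify=dev_y, random_state=42)

print(len(train_X), len(val_X), len(test_X))  # 700 / 100 / 200
```

When a HuggingFace dataset ships its own test split, the same discipline applies: derive the validation set from the training split only.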

Reproducibility. “Fully reproducible” is listed as a requirement but no guidance is given. Require: random seeds, environment files (requirements.txt or conda env), and clear instructions to run.

Statistical significance. When comparing models, a 0.5% difference in F1 is often noise. Consider requiring students to run at least 3 seeds and report mean and standard deviation, or at minimum to discuss whether observed differences are meaningful.
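The multi-seed reporting can be sketched as follows (the scores below are invented for illustration):

```python
import statistics

# Hypothetical macro-F1 scores from the same model trained with 3 seeds.
scores = {"tfidf_lr": [0.912, 0.909, 0.915],
          "distilbert": [0.917, 0.905, 0.921]}

for model, runs in scores.items():
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)
    print(f"{model}: {mean:.3f} ± {std:.3f}")

# Rule of thumb: if the means differ by less than about one standard
# deviation, treat the gap as noise rather than a real ranking.
```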

Dataset Suggestions

The suggested datasets are mostly fine, but a few deserve comment.

Good choices: AG News (clean, balanced, 4 classes, fast to train), SST-2 (standard binary sentiment, small), IMDb (binary sentiment, longer texts).

Problematic choices:

- CLINC150 has 150 classes. This is a very different problem from 2–4 class classification and will make cross-dataset comparisons harder. Fine for one of three datasets, but warn students it changes the game.
- Jigsaw is multi-label, not multi-class. The assignment frames everything as single-label classification. Either exclude it or add a note about multi-label handling.
- 20 Newsgroups is old and has known header-leakage issues (the “from:” field predicts the class). If used, students must strip headers, which is a known gotcha.

Better alternatives to consider:

- MASSIVE (Amazon’s multilingual intent dataset): 60 intents, realistic enterprise use case, available on HuggingFace.
- Banking77: 77 banking intents, realistic customer service scenario, moderate difficulty.
- Tweet Eval (sentiment, emotion, hate speech, irony): short noisy text, closer to real-world messy data than IMDb reviews.
- Financial PhraseBank: sentiment on financial news, domain-specific, shows that domain matters.

Suggested Improvements

  1. Add mandatory error analysis. Require students to inspect 30+ misclassified examples for their best and worst models and categorize failure modes. This is the single highest-value addition.

  2. Require a validation protocol section. Students must describe their train/val/test split strategy and confirm that test data was not used for any model selection.

  3. State hyperparameter tuning expectations. Either provide a minimal tuning budget (e.g., “tune learning rate from {1e-5, 2e-5, 5e-5} for Transformers, C from {0.1, 1, 10} for logistic regression”) or require students to justify their choices.

  4. Reduce breadth, increase depth. Change “at least three datasets” to “exactly two datasets: one where you expect classical methods to be competitive and one where you expect Transformers to shine.” This frees time for deeper analysis.

  5. Clarify SetFit scope. State that SetFit should only be evaluated at low data fractions (1%–10%), not at full dataset scale.

  6. Add a “production readiness” section to the report. Ask: “For each model, estimate memory usage, inference latency on CPU for a single example, and what infrastructure you would need to serve it at 100 requests/second.” Even rough estimates build the right thinking habits.

  7. Warn about 20 Newsgroups header leakage and Jigsaw multi-label. Add footnotes or a “dataset notes” section so students do not fall into known traps.

  8. Require reproducibility artifacts. Mandate a requirements.txt, fixed random seeds, and a README that lets a TA reproduce results with one command.

  9. Add a team roles suggestion. For groups of 3–5, suggest a division of labor so coordination overhead does not eat into analysis time.

  10. Include a one-page “executive summary” in the report. Ask students to write a recommendation memo as if they were advising a product team. This exercises a communication skill that is as important as the technical work.