On-Device AI via Model Compression
Educating Peers on the Knowledge Distillation Technique
Christopher A. Murphy
Winter Term, 2025
Project Overview
This independent research project explored the concept of knowledge distillation, a powerful technique used to compress large language models (LLMs) into smaller, faster, and more efficient versions without significantly sacrificing performance. The goal was to analyze the mechanics and benefits of distillation and then translate this complex topic into an accessible educational presentation for fellow graduate students.
Through visual aids, curated examples, and technical breakdowns, the presentation helped peers understand the challenges of deploying LLMs in real-world scenarios — particularly on devices with limited resources — and how knowledge distillation serves as a practical solution.
Video Presentation
Problem Statement
Modern decoder models such as GPT and encoder models such as BERT demand substantial memory and compute. These demands can limit equitable access to advanced AI, especially in resource-constrained, on-device environments.
Methods
Research & Topic Framing: Reviewed key concepts in knowledge distillation and various NLP models.
Model Comparison & Analysis: Compared BERT with smaller distilled variants in terms of speed, size, and accuracy, and presented applications in edge environments: mobile apps, IoT, and embedded AI.
Educational Content Creation: Designed a slide deck using storytelling to guide learners from basic concepts to real-world integrations.
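The distillation mechanics covered in the presentation can be sketched in a few lines. This is a minimal, pure-Python illustration of the Hinton-style loss (cross-entropy against hard labels blended with cross-entropy against the teacher's temperature-softened distribution); the function names are illustrative, not from any specific toolkit.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T yields a softer distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """Hinton-style KD loss: a weighted sum of
    (1) cross-entropy with the hard label, and
    (2) cross-entropy with the teacher's softened output distribution."""
    # Hard-label term uses the ordinary (T=1) student probabilities.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_label])

    # Soft-target term compares softened teacher and student distributions.
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    soft_loss = -sum(p * math.log(q) for p, q in zip(soft_teacher, soft_student))

    # The T^2 factor rescales the soft term's gradients (Hinton et al., 2015).
    return alpha * hard_loss + (1 - alpha) * (T ** 2) * soft_loss
```

In practice the student is trained by minimizing this combined loss over the teacher's outputs on the training data; the temperature T exposes the "dark knowledge" in the teacher's near-zero probabilities.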
Key Learnings
This project emphasized the importance of communicating complex AI concepts clearly — a critical skill in applied data science roles.
Knowledge distillation is essential for real-world deployment of NLP — especially in low-resource environments or embedded systems.
Distilled models aren’t just smaller — they often generalize better, especially when trained with noise or from ensembles of teachers.
Visualizations
Competencies Employed
Research
Locate and leverage active research to communicate technology trends.
Deep Learning
Apply neural networks to tasks like image classification and NLP.
Natural Language Processing (NLP)
Analyze and model human language for applications like sentiment analysis or information extraction.
Insights & Recommendations
Turn analysis into clear, prioritized stakeholder actions with rationale, trade-offs, and measurable outcomes.
Neural Networking
Design, train, and tune deep-learning models for vision, language, and prediction tasks, with attention to data preparation, hyper-parameter optimization, model evaluation, and serialization.
Transfer Learning
Apply pre-trained models to new tasks to accelerate learning and improve performance with limited data.
Model Fine-Tuning
Adjust pre-trained neural networks on domain-specific data to optimize performance for a specific application.
Communication
Succinctly communicate complicated technical concepts.
Additional Technical Information
Data Source(s)
Rather than building a model, this project focused on meta-analysis of techniques from leading academic and industry sources:
Scientific surveys from IJCV, ACL, and ACM TKDD.
Hugging Face blog tutorials and open-source toolkits.
Foundational papers (see presentation in GitHub repo):
Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
DistilBERT (Sanh et al., 2019)
TinyBERT, MiniLLM (and others focused on NLP compression).
Results Summary
Delivered a clear, visual explanation of the knowledge distillation pipeline.
Discussed DistilBERT and TinyBERT, showing real-world NLP uses and comparing their accuracy, size, and latency with their non-distilled BERT counterparts.
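The headline comparison can be made concrete with the figures reported in the DistilBERT paper (Sanh et al., 2019): roughly 40% fewer parameters and about 60% faster inference while retaining about 97% of BERT's language-understanding performance. The snippet below is a small illustrative sketch using those published parameter counts, not measurements from this project.

```python
# Approximate parameter counts (in millions) from published model cards;
# used here only to illustrate the compression ratio.
models = {
    "bert-base-uncased": {"params_m": 110, "layers": 12},
    "distilbert-base-uncased": {"params_m": 66, "layers": 6},
}

def size_reduction(teacher, student):
    """Percent reduction in parameter count from teacher to student."""
    t = models[teacher]["params_m"]
    s = models[student]["params_m"]
    return round(100 * (1 - s / t))

print(size_reduction("bert-base-uncased", "distilbert-base-uncased"))  # → 40
```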
Future Improvements
To expand the educational and analytical value of this project, future enhancements could include:
A hands-on notebook that walks through training a student model with open-source tools.
A benchmark comparing latency, memory usage, and throughput across devices (e.g., laptop vs. mobile CPU) to highlight the operational efficiency of distilled models.
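The benchmarking item could start from a small timing harness like the one below. It is a minimal sketch using only the standard library; the `benchmark` function and its parameters are hypothetical and would wrap whatever model-inference call is being measured.

```python
import statistics
import time

def benchmark(fn, *args, warmup=3, runs=20):
    """Return the median wall-clock latency of fn(*args) in milliseconds.

    Warmup iterations are discarded so caches and lazy initialization
    do not skew the measurement; the median resists outlier runs.
    """
    for _ in range(warmup):
        fn(*args)
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(times_ms)
```

Running the same harness against a distilled and a non-distilled model on each target device would yield the latency column of the comparison; memory and throughput would need their own instrumentation.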
