Synoptic Key of Life for Biological Classification: SKOL Part 2
Scalable Paragraph Classification for Automated Species Identification in Mycological Literature
La Monte Henry Piggy Yarroll, Christopher A. Murphy, David Caspers
Winter Term, 2025
Project Overview
Building on the Semantic Search for Mycology Literature project (SKOL Part 1), this effort focused on replacing manual paragraph classification with an automated PySpark-based classification pipeline. Paragraphs were labeled as Nomenclature (species names), Description, or Miscellaneous Exposition (general background text), then formatted with the Taxon class used for embedding into the scientific literature search platform (SKOL Part 1).
The project is owned and led by fellow student collaborator La Monte Henry Piggy Yarroll.
Problem Statement
Mycological literature is rich in species descriptions but highly unstructured, often mixing Latin names, morphological details, and unrelated exposition. Traditional species identification workflows require manual curation or rigid decision trees.
Methods
Preprocessing & Feature Engineering: parsed YEDDA labels; segmented OCR corpus with heuristics (indentation, keywords, line breaks); merged fragments and removed empty segments (see the segmentation sketch below).
Feature Extraction: TF-IDF to weight taxonomic terms; 2–4-char suffix features (e.g., -phore, -spore, -aceae, -mycetes).
Modeling at Scale: Logistic Regression (baseline/enhanced) and Random Forest; PCA was attempted for dimensionality reduction but dropped due to extreme TF-IDF dimensionality.
Evaluation: trained models on an 80/20 train-test split and evaluated with precision, recall, F1-score, and accuracy (see the PySpark pipeline sketch below).
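As a rough illustration of the segmentation heuristics above, the following is a minimal sketch assuming plain-text OCR input; the blank-line/indentation regular expression, the five-word merge threshold, and the function name are illustrative assumptions rather than the project's actual rules.

```python
import re

def segment_ocr_text(raw_text, min_words=5):
    """Split raw OCR text into candidate paragraphs using layout heuristics
    (blank lines and indentation), then merge short fragments and drop empties."""
    # Start a new paragraph on a blank line, or on a line break followed by deep indentation.
    rough = re.split(r"\n\s*\n|\n(?=\s{4,})", raw_text)
    # Drop empty segments and collapse internal line breaks into single spaces.
    rough = [" ".join(seg.split()) for seg in rough if seg.strip()]

    merged = []
    for seg in rough:
        # Fragments shorter than min_words are treated as OCR spill-over
        # and merged into the preceding paragraph.
        if merged and len(seg.split()) < min_words:
            merged[-1] = merged[-1] + " " + seg
        else:
            merged.append(seg)
    return merged
```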
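The feature extraction and modeling steps can be combined into a single PySpark ML pipeline. The sketch below is a minimal illustration under stated assumptions: the input path, the column names ("text", "label_str"), the hashing dimension, and the hyperparameters are hypothetical, not the project's actual configuration.

```python
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import (HashingTF, IDF, CountVectorizer,
                                VectorAssembler, StringIndexer)
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("skol-paragraph-classifier").getOrCreate()

# Hypothetical input: one row per paragraph, with its class name.
df = spark.read.parquet("labeled_paragraphs.parquet")

# Tokenize, then derive 2-4 character suffixes from each token.
df = df.withColumn("tokens", F.split(F.lower(F.col("text")), r"\W+"))

@F.udf(T.ArrayType(T.StringType()))
def token_suffixes(tokens):
    return [t[-n:] for t in (tokens or []) for n in (2, 3, 4) if len(t) >= n]

df = df.withColumn("suffixes", token_suffixes("tokens"))

pipeline = Pipeline(stages=[
    HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="tfidf"),
    CountVectorizer(inputCol="suffixes", outputCol="suffix_counts"),
    VectorAssembler(inputCols=["tfidf", "suffix_counts"], outputCol="features"),
    StringIndexer(inputCol="label_str", outputCol="label"),
    LogisticRegression(featuresCol="features", labelCol="label", maxIter=50),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
preds = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction")
for metric in ("accuracy", "weightedPrecision", "weightedRecall", "f1"):
    print(metric, evaluator.evaluate(preds, {evaluator.metricName: metric}))
```

HashingTF is used here instead of a fitted vocabulary so the word-level features stay fixed-width regardless of corpus size; either choice fits the TF-IDF-plus-suffix design described above.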
Key Learnings
Features matter: TF-IDF is a strong baseline, but adding taxonomic suffixes (-spore, -aceae, -ous) yields even better results.
Data & scale: OCR noise and fragmentation are the major bottlenecks (heuristics help; deep-learning segmenters could do better). Logistic regression works well for sparse text, and Spark scales, but a single-node Colab runtime limits processing gains versus a truly distributed run.
Visualizations
Competencies Employed
Big Data Analytics
Working with high-volume, high-velocity datasets using distributed computing tools.
Data Wrangling
Cleaning, transforming, and preparing data for analysis and modeling.
Applied Machine Learning
Implementing ML algorithms to solve real-world problems and optimize outcomes.
Natural Language Processing (NLP)
Analyzing and modeling human language for applications like sentiment analysis or information extraction.
Predictive Modeling
Using statistical or machine learning models to make future predictions.
Data Engineering
Designing and implementing systems for data collection, storage, and access.
Communication
Succinctly communicate complicated technical concepts.
Project Management
Planning, executing, and managing data science projects across teams and phases.
Information Retrieval
Extracting relevant information from unstructured sources like documents or text corpora.
Scripting for Analysis
Automating data processes and analysis using scripting languages like Python or R.
Cloud & Scalable Computing
Leveraging cloud platforms and parallel computing for large-scale data tasks.
Data Management
Managing structured and unstructured data using databases and data warehouses.
Additional Technical Information
Data Source(s)
The project used digitized corpora from the journals Mycologia, Mycotaxon, and Persoonia (1909–2010).
• Labeled Corpus (annotated using the YEDDA standard): 6,192 Nomenclature and 48,564 Description paragraphs.
• Unlabeled Corpus (OCR): 1,021 journal issues, 25.5 million words; highly variable due to formatting and OCR artifacts.
Results Summary
The best model was logistic regression using both TF-IDF and suffix features.
Accuracy: 94.21% | Precision: 96.40% | Recall: 96.30% | F1: 94.19%
Even suffix-only models performed nearly as well, confirming that morphological patterns hold strong discriminative value in taxonomy.
Random Forest struggled, likely due to overfitting on noisy data.
Potential Improvements
Test neural-network-based segmenters to address OCR fragmentation and add domain cues (suffix dictionary / character-level signals).
Replace or augment TF-IDF with contextual models (BERT/BioBERT) to manage ambiguity (see the sketch after this list).
Run real-time inference on a cloud Spark cluster and feed outputs to SKOL for embedding.
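As one possible direction for the contextual-model idea above, here is a hedged sketch assuming the Hugging Face transformers library and a generic BERT checkpoint (a BioBERT checkpoint could be substituted); the mean-pooling strategy and the scikit-learn classifier are illustrative choices, not a tested design.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Checkpoint name is an assumption; swap in a BioBERT checkpoint if available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(paragraphs, batch_size=16):
    """Return one mean-pooled vector per paragraph."""
    vecs = []
    with torch.no_grad():
        for i in range(0, len(paragraphs), batch_size):
            batch = tokenizer(paragraphs[i:i + batch_size], padding=True,
                              truncation=True, max_length=256, return_tensors="pt")
            hidden = encoder(**batch).last_hidden_state        # (batch, tokens, dim)
            mask = batch["attention_mask"].unsqueeze(-1)       # (batch, tokens, 1)
            vecs.append((hidden * mask).sum(1) / mask.sum(1))  # masked mean over tokens
    return torch.cat(vecs).numpy()

# train_texts / train_labels stand in for the labeled paragraph corpus.
# clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
```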





