
Synoptic Key of Life for Biological Classification: SKOL Part 2

Scalable Paragraph Classification for Automated Species Identification in Mycological Literature

La Monte Henry Piggy Yarroll, Christopher A. Murphy, David Caspers

Winter Term, 2025

Project Overview


Building on the Semantic Search for Mycology Literature project (SKOL Part 1), this effort replaced manual paragraph classification with an automated PySpark-based classification pipeline. Paragraphs were labeled as Nomenclature (species names), Description, or Miscellaneous Exposition (general background text), then formatted with the Taxon class used for embedding into the scientific literature search platform (SKOL Part 1).


The project is owned and led by fellow student collaborator La Monte Henry Piggy Yarroll.

Problem Statement


Mycological literature is rich in species descriptions but highly unstructured, often mixing Latin names, morphological details, and unrelated exposition. Traditional species identification workflows require manual curation or rigid decision trees.

Methods


  • Preprocessing & Feature Engineering: parsed YEDDA labels; segmented the OCR corpus with heuristics (indentation, keywords, line breaks); merged fragments and removed empty paragraphs.

  • Feature Extraction: TF-IDF to weight taxonomic terms; 2–4-char suffix features (e.g., -phore, -spore, -aceae, -mycetes).

  • Modeling at Scale: Logistic Regression (baseline/enhanced) and Random Forest; PCA was attempted for dimensionality reduction but dropped due to extreme TF-IDF dimensionality.

    • Trained models on 80/20 train-test splits and evaluated with precision, recall, F1-score, and accuracy (a pipeline sketch follows this list).
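
A minimal sketch of this pipeline in PySpark is shown below. The input path, column names, and the suffix UDF are illustrative assumptions rather than the project's actual code; only the overall shape (tokenize, extract 2-4 character suffixes, TF-IDF both token views, 80/20 split, logistic regression) mirrors the steps above.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("skol-part2-sketch").getOrCreate()

    # Hypothetical input: one row per segmented paragraph, with columns
    # `text` and `label` (Nomenclature / Description / Misc. Exposition).
    paragraphs = spark.read.parquet("paragraphs.parquet")

    tokens = Tokenizer(inputCol="text", outputCol="words").transform(paragraphs)

    # Trailing 2-4 character substrings of each token, approximating the
    # taxonomic suffix features (-phore, -spore, -aceae, -mycetes).
    @F.udf("array<string>")
    def suffixes(words):
        return [w[-n:] for w in (words or []) for n in (2, 3, 4) if len(w) > n]

    feats = tokens.withColumn("sufs", suffixes("words"))

    # TF-IDF over word tokens and over suffix tokens, concatenated into a
    # single sparse feature vector (no PCA, per the note above).
    for col, n in [("words", 1 << 18), ("sufs", 1 << 12)]:
        feats = HashingTF(inputCol=col, outputCol=col + "_tf", numFeatures=n).transform(feats)
        feats = IDF(inputCol=col + "_tf", outputCol=col + "_tfidf").fit(feats).transform(feats)

    assembled = VectorAssembler(
        inputCols=["words_tfidf", "sufs_tfidf"], outputCol="features").transform(feats)
    indexed = StringIndexer(inputCol="label", outputCol="target").fit(assembled).transform(assembled)

    # 80/20 split and the sparse-friendly logistic regression baseline.
    train, test = indexed.randomSplit([0.8, 0.2], seed=42)
    model = LogisticRegression(featuresCol="features", labelCol="target", maxIter=50).fit(train)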

Key Learnings


  • Features matter: TF-IDF is a strong baseline, but adding taxonomic suffixes (-spore, -aceae, -ous) yields even better results.

  • Data & scale: OCR noise/fragmentation are major bottlenecks (heuristics help; DL could do better); logistic regression works well for sparse text, and Spark scales but single-node Colab limits processing gains vs true distributed runs.


Competencies Employed

Big Data Analytics

Working with high-volume, high-velocity datasets using distributed computing tools.

Data Wrangling

Cleaning, transforming, and preparing data for analysis and modeling.

Applied Machine Learning

Implementing ML algorithms to solve real-world problems and optimize outcomes.

Natural Language Processing (NLP)

Analyzing and modeling human language for applications like sentiment analysis or information extraction.

Predictive Modeling

Using statistical or machine learning models to make future predictions.

Data Engineering

Designing and implementing systems for data collection, storage, and access.

Communication

Succinctly communicating complicated technical concepts.

Project Management

Planning, executing, and managing data science projects across teams and phases.

Information Retrieval

Extracting relevant information from unstructured sources like documents or text corpora.

Scripting for Analysis

Automating data processes and analysis using scripting languages like Python or R.

Cloud & Scalable Computing

Leveraging cloud platforms and parallel computing for large-scale data tasks.

Data Management

Managing structured and unstructured data using databases and data warehouses.

Additional Technical Information

Data Source(s)


The project used digitized journal corpora: Mycologia, Mycotaxon, and Persoonia (1909–2010).

• Labeled Corpus (annotated using the YEDDA format; a label-parsing sketch follows this list): 6,192 Nomenclature, 48,564 Description.

• Unlabeled Corpus (OCR): 1,021 journal issues, 25.5 million words; highly variable due to formatting and OCR artifacts.
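
For reference, YEDDA stores labels inline in the annotated text, wrapping each span as [@annotated text#LabelName*]. A minimal parser for that convention might look like the sketch below; the file name is hypothetical.

    import re

    # Matches YEDDA's inline span convention: [@span text#Label*]
    SPAN = re.compile(r"\[@(.+?)#(.+?)\*\]", re.S)

    def parse_yedda(path):
        """Yield (label, paragraph_text) pairs from a YEDDA-annotated file."""
        with open(path, encoding="utf-8") as fh:
            for text, label in SPAN.findall(fh.read()):
                yield label.strip(), " ".join(text.split())

    pairs = list(parse_yedda("mycotaxon_annotated.txt"))  # hypothetical file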

Results Summary


  • The best model was logistic regression using both TF-IDF and suffix features.

    • Accuracy: 94.21% | Precision: 96.40% | Recall: 96.30% | F1: 94.19% (see the evaluation sketch after this list)

  • Even suffix-only models performed nearly as well, confirming that morphological patterns hold strong discriminative value in taxonomy.

  • Random Forest struggled, likely due to overfitting on noisy data.
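
The reported metrics can be computed from a fitted model with PySpark's MulticlassClassificationEvaluator; a brief sketch, reusing the `model` and `test` names from the Methods sketch above:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    preds = model.transform(test)
    for metric in ("accuracy", "weightedPrecision", "weightedRecall", "f1"):
        evaluator = MulticlassClassificationEvaluator(
            labelCol="target", predictionCol="prediction", metricName=metric)
        print(metric, round(evaluator.evaluate(preds), 4))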


Potential Improvements

  • Test neural-network-based segmenters to address OCR fragmentation, and add domain cues (suffix dictionary / character-level signals).

  • Replace or augment TF-IDF with contextual models (BERT/BioBERT) to manage ambiguity (see the embedding sketch after this list).

  • Run real-time inference on a cloud Spark cluster and feed outputs to SKOL for embedding.
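
As a rough illustration of the BERT/BioBERT direction, the sketch below mean-pools BioBERT token embeddings into paragraph vectors with the Hugging Face transformers library. The checkpoint is a public BioBERT release; nothing here is part of the current pipeline.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
    bert = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")
    bert.eval()

    def embed(paragraphs):
        """Return one mean-pooled 768-d vector per input paragraph."""
        batch = tok(paragraphs, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**batch).last_hidden_state      # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
        return (hidden * mask).sum(1) / mask.sum(1)       # masked mean pool

    vectors = embed(["Amanita muscaria (L.) Lam. Pileus 8-20 cm, scarlet."])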
