Synoptic Key of Life for Biological Classification: SKOL Part 1

Formatting Species Identification Content and Embedding into a Vector Search Application

La Monte Henry Piggy Yarroll, Christopher A. Murphy, Jennifer Balasi, Shintaro Osuga

Fall Term, 2024

Project Overview

This project enhanced a scientific literature search platform for mycology articles by integrating modern natural language processing (NLP) techniques. The goal was to improve the retrieval of relevant fungal taxonomy literature based on formal descriptions submitted by researchers.

The team extended two existing open-source, domain-specific search tools (SKOL and MycoSearch) by standardizing content structure with a Taxon class, embedding species names and descriptions, and ranking results by cosine similarity.

The project is owned and led by fellow student collaborator La Monte Henry Piggy Yarroll.

Problem Statement

How can researchers efficiently retrieve relevant fungal species descriptions from dense, multilingual academic literature, particularly when traditional keyword search methods fall short?

Methods

Labeled ~79k paragraphs and standardized them into a dataset with metadata; sanity-checked structure via paragraph-distance visuals.
Built a Taxon class to group name + description, attach metadata (file/page), and format records for transformer input.
Created embeddings for "Taxon" content and user queries; ranked results with cosine similarity.
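
The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: the field names on Taxon, the helper names, and the two species records are invented for the example, and toy_embed is a bag-of-words stand-in for a sentence-transformer encoder so that the sketch runs with the standard library alone. Only the cosine-similarity ranking step is meant literally.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Taxon:
    """Hypothetical record: a species name, its description, and source metadata."""
    name: str
    description: str
    filename: str
    page: int

    def as_text(self) -> str:
        # Name first, then description, so every record has the same shape.
        return f"{self.name}. {self.description}"


def toy_embed(text: str) -> Counter:
    # Stand-in for a transformer encoder (assumption): a word-count vector.
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared terms divided by the product of vector norms.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, taxa: list[Taxon]) -> list[tuple[float, Taxon]]:
    # Embed the query once, score every record, and sort best match first.
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(t.as_text())), t) for t in taxa]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented example records, not taken from the project's dataset.
taxa = [
    Taxon("Agaricus exemplum",
          "Pileus convex, brown, with free gills and a ring on the stipe",
          "a.txt", 12),
    Taxon("Boletus fictus",
          "Pileus red, pores yellow, stipe reticulate",
          "b.txt", 3),
]

results = rank("brown convex pileus with free gills", taxa)
for score, taxon in results:
    print(f"{score:.2f}  {taxon.name}  ({taxon.filename}, p. {taxon.page})")
```

In a real pipeline, toy_embed would be replaced by the chosen sentence-transformer model, and the ranking logic would stay the same.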

Key Learnings

Vector search vastly outperforms keyword search when working with scientific text.
Structuring unstructured data (via the Taxon class) is critical for reliable embeddings.
A multi-stage pipeline (manual + model labeling + visual validation) ensures data quality.
Sentence transformers such as SBERT perform well for domain-specific semantic search.

Visualizations

Competencies Employed

Python Coding

Developing data pipelines, machine learning models, and automation tools using Python’s data science ecosystem (e.g., pandas, scikit-learn, TensorFlow).

Research

Locate and leverage active research to communicate technology trends.

Vector Embeddings

Creating and using vector representations of text to support semantic search and similarity matching.

LLM Integration

Connecting large language models (LLMs) to applications via APIs (e.g., OpenAI, Anthropic, Gemini).

Insights & Recommendations

Turn analysis into clear, prioritized stakeholder actions with rationale, trade-offs, and measurable outcomes.

Data Collection

Acquire, ingest, validate, and organize data using reproducible workflows and transformations to ensure compatibility with downstream data-science algorithms.

Embedding-Based Search

Implementing semantic search using vector databases.

Communication

Succinctly communicate complicated technical concepts.

Data Visualization

Presenting data insights clearly using charts, dashboards, and visual storytelling.

Project Management

Planning, executing, and managing data science projects across teams and phases.

Scripting for Analysis

Automating data processes and analysis using scripting languages like Python or R.

Data Management

Managing structured and unstructured data using databases and data warehouses.

Additional Technical Information

GitHub Repository

https://github.com/piggyatbaqaqi/skol/tree/main/IST664

Data Source(s)

The team worked with a curated set of ~7,000 scientific articles focused on fungal taxonomy, sourced via partnerships with institutions like the Mycological Society of America and the Imperial Institute of Agricultural Research.

Results Summary

The final system let users enter a partial or full species description and receive semantically matched entries. Each returned result includes the title, source URL, a color-coded similarity score, and the full species description.

Search accuracy was improved by focusing embeddings on:

Only nomenclature + description
A consistent structure (e.g., a "name precedes description" rule)

Results could be exported to CSV or displayed in a color-coded terminal interface.
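
As a rough illustration of the two output paths, the sketch below colors a similarity score with ANSI escape codes and writes the same rows to CSV. The score thresholds, column names, function names, and sample rows are assumptions made for the example. They are not taken from the project repository.

```python
import csv
import io

# ANSI escape codes for terminal colors.
GREEN, YELLOW, RED, RESET = "\033[92m", "\033[93m", "\033[91m", "\033[0m"
FIELDS = ["title", "source_url", "similarity", "description"]


def color_for(score: float) -> str:
    # Assumed cut-offs: strong match >= 0.75, moderate >= 0.50, weak otherwise.
    if score >= 0.75:
        return GREEN
    if score >= 0.50:
        return YELLOW
    return RED


def format_row(row: dict) -> str:
    # One terminal line per result, with only the score colored.
    score = row["similarity"]
    return f"{color_for(score)}{score:.2f}{RESET}  {row['title']}  {row['source_url']}"


def to_csv(rows: list[dict]) -> str:
    # Serialize results with a header row; returns the CSV text.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample results with placeholder URLs.
rows = [
    {"title": "Example article A", "source_url": "https://example.org/a",
     "similarity": 0.91, "description": "Pileus convex, brown"},
    {"title": "Example article B", "source_url": "https://example.org/b",
     "similarity": 0.42, "description": "Pileus red, pores yellow"},
]

for row in rows:
    print(format_row(row))

csv_text = to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of file paths; passing an open file handle to csv.DictWriter instead would produce the export on disk.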

Future Improvements

To expand this project’s impact and usability, future development could include:

Adding a Latin-capable language model with translation capabilities.
Launching a public-facing website so researchers can search online.
Incorporating more journals and broader taxonomic literature to enhance coverage.