Synoptic Key of Life for Biological Classification: SKOL Part 1
Formatting Species Identification Content and Embedding into a Vector Search Application
La Monte Henry Piggy Yarroll, Christopher A. Murphy, Jennifer Balasi, Shintaro Osuga
Fall Term, 2024
Project Overview
This project enhanced a scientific literature search platform for mycology articles by integrating modern natural language processing (NLP) techniques. The goal was to improve the retrieval of relevant fungal taxonomy literature based on formal descriptions submitted by researchers.
The team extended two existing open-source, domain-specific search tools (SKOL and MycoSearch) by standardizing content structure with a Taxon class, embedding species names and descriptions, and ranking results by cosine similarity.
The project is owned and led by fellow student collaborator La Monte Henry Piggy Yarroll.
Problem Statement
How can researchers efficiently retrieve relevant fungal species descriptions from dense, multilingual academic literature, particularly when traditional keyword search methods fall short?
Methods
Labeled ~79k paragraphs and standardized them into a dataset with metadata; sanity-checked structure via paragraph-distance visuals.
Built a Taxon class to group name + description, attach metadata (file/page), and format records for transformer input.
Created embeddings for "Taxon" content and user queries; ranked results with cosine similarity.
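The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names, the `as_text` and `rank` helpers, and the sample records are all hypothetical, and the `embed` function is a toy bag-of-words stand-in for a sentence-transformer model so that the example runs without downloading anything. Only the overall shape (group name and description, attach file/page metadata, embed, rank by cosine similarity) comes from the description above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Taxon:
    """Groups a species name with its description and source metadata."""
    name: str
    description: str
    source_file: str  # hypothetical field name for the file metadata
    page: int         # hypothetical field name for the page metadata

    def as_text(self) -> str:
        # Name precedes description, so every record has the same layout.
        return f"{self.name}. {self.description}"


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-transformer: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, taxa: list[Taxon]) -> list[tuple[float, Taxon]]:
    """Score every taxon against the query and sort best match first."""
    q = embed(query)
    scored = [(cosine_similarity(q, embed(t.as_text())), t) for t in taxa]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented records for illustration only; these are not real taxa.
taxa = [
    Taxon("Agaricus exemplum",
          "Pileus convex, brown, with white gills and a ring on the stipe",
          "example_a.txt", 12),
    Taxon("Boletus fictus",
          "Pileus red, pores yellow, stipe reticulate",
          "example_b.txt", 40),
]

results = rank("brown convex pileus with white gills", taxa)
for score, taxon in results:
    print(f"{score:.3f}  {taxon.name}  ({taxon.source_file}, p. {taxon.page})")
```

In a setup closer to the one described, `embed` would presumably call a sentence-transformer model and return dense vectors, while the ranking step would stay the same.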
Key Learnings
On this corpus, vector search retrieved relevant scientific text far better than keyword search.
Structuring unstructured data (via the Taxon class) is critical for reliable embeddings.
A multi-stage pipeline (manual + model labeling + visual validation) ensures data quality.
Sentence transformers like SBERT are performant for domain-specific semantic search.
Visualizations
Competencies Employed
Python Coding
Developing data pipelines, machine learning models, and automation tools using Python’s data science ecosystem (e.g., pandas, scikit-learn, TensorFlow).
Research
Locate and leverage active research to communicate technology trends.
Vector Embeddings
Creating and using vector representations of text to support semantic search and similarity matching.
LLM Integration
Connecting large language models (LLMs) to applications via APIs (e.g., OpenAI, Anthropic, Gemini).
Insights & Recommendations
Turn analysis into clear, prioritized stakeholder actions with rationale, trade-offs, and measurable outcomes.
Data Collection
Acquire, ingest, validate, and organize data using reproducible workflows and transformations to ensure compatibility with downstream data-science algorithms.
Embedding-Based Search
Implementing semantic search using vector databases.
Communication
Succinctly communicate complicated technical concepts.
Data Visualization
Presenting data insights clearly using charts, dashboards, and visual storytelling.
Project Management
Planning, executing, and managing data science projects across teams and phases.
Scripting for Analysis
Automating data processes and analysis using scripting languages like Python or R.
Data Management
Managing structured and unstructured data using databases and data warehouses.
Additional Technical Information
Data Source(s)
The team worked with a curated set of ~7,000 scientific articles focused on fungal taxonomy, sourced via partnerships with institutions like the Mycological Society of America and the Imperial Institute of Agricultural Research.
Results Summary
The final system allowed users to input partial or full species descriptions and receive semantically matched entries. Returned results include the title, source URL, color-coded similarity score, and full species description.
Search accuracy was improved by focusing embeddings on:
Only nomenclature + description
A consistent structure (e.g., "name precedes description" rule)
Results could be exported to CSV or displayed via a color-coded terminal interface.
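One way the two output paths could look is sketched below. The score thresholds, ANSI color choices, dictionary keys, and sample results are assumptions made for illustration; the source does not state the actual cutoffs or column names.

```python
import csv
import io

# Hypothetical ANSI colors and thresholds; the project's real cutoffs are not stated.
GREEN, YELLOW, RED, RESET = "\033[92m", "\033[93m", "\033[91m", "\033[0m"
FIELDS = ["title", "url", "score", "description"]


def color_for(score: float) -> str:
    """Map a similarity score to a terminal color."""
    if score >= 0.75:
        return GREEN
    if score >= 0.50:
        return YELLOW
    return RED


def format_row(result: dict) -> str:
    """Render one result as a line with a color-coded score."""
    color = color_for(result["score"])
    return f"{color}{result['score']:.2f}{RESET}  {result['title']}  {result['url']}"


def to_csv(results: list[dict]) -> str:
    """Serialize results to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Invented results for illustration only.
results = [
    {"title": "Example article A", "url": "https://example.org/a",
     "score": 0.82, "description": "Pileus convex, brown."},
    {"title": "Example article B", "url": "https://example.org/b",
     "score": 0.41, "description": "Pileus red, pores yellow."},
]

for r in results:
    print(format_row(r))

csv_text = to_csv(results)
```

Writing to an in-memory buffer keeps the sketch free of file paths; replacing `io.StringIO()` with an opened file (using `newline=""`) would write the same rows to disk.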
Future Improvements
To expand this project’s impact and usability, future development could include:
Adding a Latin-capable language model with translation capabilities.
Launching a public-facing website so researchers can search online.
Incorporating more journals and broader taxonomic literature to enhance coverage.

