Identification Engine

Sequence-based taxonomic identification against the BOLD reference library.

The identification engine compares barcode sequences from a selected recordset against the BOLD reference database using a two-tier BLAST search strategy. Each query returns a ranked list of reference matches with taxonomic context, sequence identity, overlap coverage, and supporting hit counts.

How It Works

Sequences are queried against a compiled BOLD reference database using BLAST.

When an identification job is submitted, the platform exports the nucleotide sequences from the selected recordset into FASTA format and runs them as queries against a pre-compiled BLAST database built from BOLD reference sequences. Each query sequence is searched twice: first at a high-identity threshold to find close matches, then at a lower threshold to recover more distant relationships when no close match exists.

The search results are merged with the reference taxonomy, ranked by sequence identity and overlap, and returned as a per-query hit table. A query that has strong close matches will show those first; a query with no close matches will show the best available lower-identity hits instead.
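The ranking step can be illustrated with a minimal sketch. It assumes hits are ordered by identity first and overlap second; the text above names both criteria but does not state the tie-break order. The `Hit` type and its field names are hypothetical and are not taken from the platform.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical hit record; field names are illustrative only.
    query_id: str
    reference_id: str
    identity: float  # percent identity, 0-100
    overlap: int     # aligned length in base pairs


def rank_hits(hits):
    """Group hits by query and sort each group best-first.

    Assumed ordering: highest identity first, ties broken by longest overlap.
    """
    by_query = {}
    for hit in hits:
        by_query.setdefault(hit.query_id, []).append(hit)
    for group in by_query.values():
        group.sort(key=lambda h: (-h.identity, -h.overlap))
    return by_query
```

For example, two hits at 99% identity would be ordered by their overlap, and both would appear ahead of a 97% hit for the same query.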

Barcode sequences as input

Sequences are extracted from the active recordset and filtered to the selected marker (COI-5P by default).

BOLD reference database

The reference library is a pre-built BLAST database compiled from BCDM-formatted barcode sequences with linked taxonomy.

Ranked hit table

Each query returns an ordered list of reference matches with identity, overlap, taxon name, BIN URI, and process ID.

Downloadable package

Results are packaged as a downloadable ZIP including TSV hit tables, JSONL records, an HTML summary report, and runtime metadata.

Search Tiers

Two search passes are run per query at different identity thresholds.

A single BLAST pass risks missing good matches when sequences have variable coverage or minor divergence. The identification engine addresses this by running two sequential passes with different word sizes and identity floors. Tier 1 pairs a long word size with a high identity floor, which keeps the search fast and selective; Tier 2 pairs a short word size with a lower floor, which costs more compute but is more sensitive.

Tier	Word size	Min identity	Max hits	Purpose
Tier 1	48 bp	96 %	300 per query	High-confidence matches — species-level identification when reference coverage is good.
Tier 2	12 bp	80 % (configurable)	100 per query	Wider net — recovers genus- or family-level hits when a species-level match is absent from the reference.

The minimum overlap requirement (how many base pairs must align between query and reference) is controlled by the sensitivity parameter selected at submission time.
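The tier settings in the table can be written down as a small configuration, together with one possible way of choosing which tier's hits to report. This is a sketch under stated assumptions: the `Tier` class and the `keep` and `select_hits` helpers are hypothetical, hits are reduced to `(identity, overlap)` tuples that are already sorted best-first, and the "fall back to Tier 2 only when Tier 1 yields nothing" rule is one reading of the behaviour described here, not a confirmed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    word_size: int       # BLAST word size in bp
    min_identity: float  # identity floor in percent
    max_hits: int        # cap on hits kept per query


# Values copied from the tier table; Tier 2 identity is configurable.
TIER_1 = Tier("tier1", word_size=48, min_identity=96.0, max_hits=300)
TIER_2 = Tier("tier2", word_size=12, min_identity=80.0, max_hits=100)


def keep(tier, hits, min_overlap):
    """Keep (identity, overlap) hits that clear the tier floor and overlap."""
    kept = [(i, o) for i, o in hits if i >= tier.min_identity and o >= min_overlap]
    return kept[:tier.max_hits]


def select_hits(tier1_hits, tier2_hits, min_overlap):
    """Prefer Tier 1 hits; use Tier 2 only if no Tier 1 hit survives."""
    close = keep(TIER_1, tier1_hits, min_overlap)
    if close:
        return TIER_1.name, close
    return TIER_2.name, keep(TIER_2, tier2_hits, min_overlap)
```

With a 150 bp overlap requirement, a query whose only Tier 1 hit aligns over 100 bp would fall through to its Tier 2 hits.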

Parameters

Job parameters adjust the reach and stringency of the search.

Identification jobs accept a small set of parameters that control how broadly sequences are matched and how many results are returned. Defaults are set for standard COI-5P barcode analysis.

Sensitivity

Controls the minimum sequence overlap required between query and reference hit. Low requires 250 bp overlap; medium requires 150 bp; high (most permissive) requires 100 bp. Default: medium.
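The mapping from sensitivity level to overlap requirement is small enough to express as a lookup table. The sketch below uses the three values given above; the function name and the choice to raise `ValueError` on an unknown level are assumptions made for illustration.

```python
# Overlap thresholds in base pairs, as documented for each sensitivity level.
MIN_OVERLAP_BP = {"low": 250, "medium": 150, "high": 100}


def min_overlap_for(sensitivity="medium"):
    """Return the minimum query-reference overlap (bp) for a sensitivity level."""
    try:
        return MIN_OVERLAP_BP[sensitivity.lower()]
    except KeyError:
        valid = ", ".join(MIN_OVERLAP_BP)
        raise ValueError(f"unknown sensitivity {sensitivity!r}; expected one of: {valid}") from None
```

Note that a higher sensitivity setting lowers the overlap requirement, so more hits pass the filter.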

Minimum identity

The floor below which hits are discarded, expressed as a percentage. Applies to Tier 2 only; Tier 1 is always pinned to 96%. Default: 80%.

Marker

The target locus to use for identification. Records without a sequence for the selected marker are excluded from the analysis. Default: COI-5P.

Max hits

The maximum number of reference matches to return per query sequence. Default: 100. Reducing this limits result file size for large jobs.

Min hits

Queries returning fewer than this number of hits are flagged in the output. Default: 1 (any hit is acceptable).
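As an illustration of the min hits rule, the following sketch flags queries whose hit count falls below the threshold. The function name and the input shape (a mapping from query ID to hit count) are hypothetical; how the flag appears in the actual output files is not specified here.

```python
def flag_low_hit_queries(hit_counts, min_hits=1):
    """Return the sorted query IDs whose hit count is below min_hits.

    hit_counts maps a query ID to the number of reference hits it returned.
    """
    return sorted(query for query, count in hit_counts.items() if count < min_hits)
```

With the default of 1, only queries that returned no hits at all are flagged.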

Execution Pipeline

The identification job moves through the standard five-stage analysis scaffold.

Like all analysis jobs on this platform, identification follows a staged execution pattern: validate inputs, filter to the relevant records, convert to the required format, run the computation, and package the outputs. Stage timings are recorded and included in the result package.

Validate

1.validate.sh

Confirm that params.json and records.jsonl are present and well-formed before proceeding.
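The kind of check this stage performs can be sketched in Python, although the stage itself is a shell script whose contents are not shown here. The sketch assumes that "well-formed" means `params.json` parses as a JSON object and every non-blank line of `records.jsonl` parses as JSON; the function name and the problem messages are hypothetical.

```python
import json
from pathlib import Path


def validate_inputs(job_dir):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    params_path = Path(job_dir) / "params.json"
    records_path = Path(job_dir) / "records.jsonl"

    if not params_path.is_file():
        problems.append("params.json is missing")
    else:
        try:
            params = json.loads(params_path.read_text(encoding="utf-8"))
            if not isinstance(params, dict):
                problems.append("params.json is not a JSON object")
        except json.JSONDecodeError:
            problems.append("params.json is not valid JSON")

    if not records_path.is_file():
        problems.append("records.jsonl is missing")
    else:
        with records_path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    problems.append(f"records.jsonl line {number} is not valid JSON")
    return problems
```

Collecting every problem before returning, rather than stopping at the first, makes it easier to report all input issues in one pass.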

Filter

2.filter.sh

Retain records that have a sequence for the selected marker and meet any sequence-length requirements.
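A minimal sketch of this filter follows. It assumes each record is a dictionary carrying the marker in a `marker_code` field and the nucleotide sequence in a `nuc` field; these names are assumptions about the record layout and should be checked against the actual BCDM export. The `min_length` parameter is a hypothetical stand-in for the sequence-length requirement, which is not quantified here.

```python
def filter_records(records, marker="COI-5P", min_length=0):
    """Keep records that have a non-empty sequence for the selected marker.

    Alignment gap characters ("-") are not counted toward sequence length.
    """
    kept = []
    for record in records:
        sequence = (record.get("nuc") or "").replace("-", "")
        if record.get("marker_code") == marker and sequence and len(sequence) >= min_length:
            kept.append(record)
    return kept
```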

Convert

3.convert.sh

Extract nucleotide sequences from the filtered BCDM records and write them as a FASTA query file.
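The conversion can be sketched as below. It assumes each record exposes a `processid` and a `nuc` field (field names to be checked against the actual records), uses the process ID as the FASTA header, and strips gap characters before writing. The gap stripping and the 80-column wrapping are illustrative choices, not documented behaviour.

```python
def to_fasta(records, line_width=80):
    """Render records as FASTA text, one entry per record.

    Headers use the process ID; sequences are upper-cased, stripped of
    gap characters, and wrapped at line_width characters.
    """
    lines = []
    for record in records:
        sequence = record["nuc"].replace("-", "").upper()
        lines.append(f">{record['processid']}")
        lines.extend(
            sequence[start:start + line_width]
            for start in range(0, len(sequence), line_width)
        )
    return "".join(line + "\n" for line in lines)
```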

Execute

4.execute.sh

Prepare BLAST queries, run both tiers against the reference database, annotate hits with taxonomy from the reference index, and produce per-query result files in JSONL and TSV format.
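To make the two-pass search concrete, the sketch below builds one command line per tier for the BLAST+ `blastn` program. That the stage calls `blastn` with these exact options is an assumption; the flags themselves (`-query`, `-db`, `-word_size`, `-perc_identity`, `-max_target_seqs`, `-outfmt`, `-out`) are standard BLAST+ options, and `-outfmt 15` requests single-file JSON output. The file and database names in the usage note are placeholders.

```python
def blastn_command(query_fasta, database, word_size, min_identity, max_hits, out_path):
    """Build an argument list for one blastn pass (nothing is executed here)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-word_size", str(word_size),
        "-perc_identity", str(min_identity),
        "-max_target_seqs", str(max_hits),
        "-outfmt", "15",  # single-file BLAST JSON
        "-out", out_path,
    ]


def tier_commands(query_fasta, database, tier2_min_identity=80):
    """Return the Tier 1 and Tier 2 commands using the documented tier settings."""
    return [
        blastn_command(query_fasta, database, 48, 96, 300, "tier1.json"),
        blastn_command(query_fasta, database, 12, tier2_min_identity, 100, "tier2.json"),
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ installed, for example with `tier_commands("queries.fasta", "bold_reference")`.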

Package

5.package.sh

Generate summary charts, assemble the HTML report, and bundle all output files into a downloadable ZIP archive.

Result Output

Results are returned as a structured package including hit tables and an HTML report.

When the job completes, the workbench makes the result package available for download from the analysis queue. The package contains several files for different downstream uses.

per_query_results.tsv

Tab-separated hit table with one row per query–reference pair. Columns include process ID, marker, BIN URI, taxon ID, public status, identity, and overlap.
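A downstream script might reduce this table to the single best hit per query. The sketch below does so with the standard `csv` module; the column names `query_id` and `identity` are hypothetical and should be replaced with the actual headers found in the file.

```python
import csv
import io


def best_hit_per_query(tsv_text, query_column="query_id", identity_column="identity"):
    """Return a mapping from query ID to its highest-identity row.

    Rows are dictionaries keyed by the TSV header; on a tie, the first
    row encountered is kept.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        query = row[query_column]
        current = best.get(query)
        if current is None or float(row[identity_column]) > float(current[identity_column]):
            best[query] = row
    return best
```

To use it on a downloaded package, read the file contents first, for example `best_hit_per_query(open("per_query_results.tsv", encoding="utf-8").read())`.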

per_query_results.jsonl

Machine-readable JSONL with the full ranked hit list per query, including all BLAST scores and taxonomy fields for further downstream processing.

HTML summary report

An in-browser summary with charts showing identity distributions, hit counts, and taxonomic breakdowns across the queried recordset.

blast_result.json

Raw BLAST results in JSON format, before taxonomy annotation. Retained in the package for auditability and reprocessing.

filtered_records.jsonl

The BCDM records actually used as input after filtering — useful for confirming which records were included in the analysis.

stage_timings.json

Per-stage runtime in seconds for validate, filter, convert, and execute. Included in the package metadata sidecar.

Identification results reflect the composition of the reference database at the time the job was run. Results for the same query sequences may differ if the reference database is updated between runs.