Biological Database Search

Search 8 major biological databases from the command bar — UniProt, NCBI Gene, PDB, PubMed, ChEMBL, GEO, KEGG, and NCBI Nucleotide

Overview

The command bar connects to 8 major biological databases through a unified search interface. Select a tag (e.g., Protein, Gene, Structure), then type your query to search that database without leaving the canvas.

Each search result creates a canvas node tailored to its data type — 3D protein structures rendered in Mol*, gene information cards, literature citations, interactive KEGG pathway maps, and circular genome visualizations. Results are cached locally for fast repeat access and can be arranged, connected, and annotated directly on the canvas.

Quick Reference

Database	Tag	Best For	Example Query	Creates
UniProt	Protein	Protein info, sequences, structures	`TP53`, `P04637`	3D structure or InfoCard
NCBI Gene	Gene	Gene metadata, chromosome location	`BRCA1`, `EGFR`	Gene InfoCard
RCSB PDB	Structure	3D molecular structures	`1CRN`, `ribosome`	3D Mol* viewer
PubMed	Literature	Scientific papers, reviews	`CRISPR`, `"gene therapy"`	Citation InfoCard
ChEMBL	Compound	Drugs, bioactive molecules	`aspirin`, `CHEMBL25`	Compound InfoCard
NCBI GEO	Dataset	Gene expression datasets	`scRNA PBMC`	Dataset InfoCard
KEGG	Pathway	Metabolic and signaling pathways	`glycolysis`, `MAPK`	Interactive pathway
NCBI Nucleotide	Sequence	Genomes, plasmids, gene sequences	`pUC19`, `SARS-CoV-2`	Genome map

UniProt (Protein Search)

UniProt searches query the reviewed Swiss-Prot database, returning curated protein entries with function annotations, domain architecture, disease associations, and cross-references. When a protein has an associated PDB structure, the result node renders a 3D viewer. Otherwise it displays a detailed InfoCard.

Example Queries

Query	Type	Expected Result
`TP53`	Gene symbol	Human tumor protein p53 with 3D structure
`P04637`	UniProt accession	Direct lookup — fastest path to a specific entry
`insulin receptor`	Free text	Top matches ranked by annotation score
`BRCA2_HUMAN`	Entry name	Exact entry for human BRCA2
`kinase AND organism:mouse`	Advanced	Mouse kinases filtered by organism

Tips

UniProt accession IDs (e.g., P04637, Q9Y6K9) return results instantly because they bypass full-text search and resolve directly.

Gene symbols like TP53 or EGFR work well for human proteins. For other organisms, append the species: TP53 mouse.
Partial protein names use fuzzy matching — tumor suppressor p53 finds the same entry as TP53.
Results include GO annotations, subcellular location, and links to external databases (PDB, InterPro, Pfam).
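
A client can tell an accession apart from a gene symbol or free text by checking the query against the accession pattern published in UniProt's help pages. The sketch below is illustrative only: `looks_like_uniprot_accession` is a hypothetical helper, and how the command bar actually detects accessions is not documented here.

```python
import re

# Accession pattern from UniProt's help pages (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def looks_like_uniprot_accession(query: str) -> bool:
    """Return True if the whole query has the shape of a UniProt accession.

    Hypothetical helper: the product's real detection logic may differ.
    """
    return UNIPROT_ACCESSION.fullmatch(query.strip().upper()) is not None
```

With this check, `P04637` and `Q9Y6K9` are treated as accessions, while `TP53`, `BRCA2_HUMAN`, and `insulin receptor` fall through to text search.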

NCBI Gene

NCBI Gene searches return gene metadata including chromosomal location, aliases, RefSeq identifiers, and functional summaries. Results appear as Gene InfoCards with direct links to the NCBI Gene page.

Example Queries

Query	Type	Expected Result
`BRCA1`	HGNC symbol	Breast cancer type 1, chr17
`EGFR`	HGNC symbol	Epidermal growth factor receptor, chr7
`HER2`	Alias	Resolves to ERBB2
`7157`	Gene ID	Direct lookup for TP53
`apolipoprotein E`	Full name	APOE gene card

Tips

Use HGNC-approved gene symbols for the most reliable results. Aliases (e.g., HER2 for ERBB2) are resolved but may return multiple matches.
The search defaults to human genes. Specify the organism explicitly if you need a different species: BRCA1 rat.
Gene IDs (numeric) perform direct lookups and are the fastest query type.

Gene InfoCards show the official symbol, full name, chromosome band, and a one-paragraph functional summary pulled from NCBI's curated RefSeq records.

RCSB PDB (Structure Search)

PDB searches query the RCSB Protein Data Bank for experimentally determined 3D structures. Results create a MolstarNode — a fully interactive molecular viewer powered by Mol* with rotation, zoom, surface/cartoon toggles, and chain highlighting.

Example Queries

Query	Type	Expected Result
`1CRN`	PDB ID	Crambin crystal structure (direct load)
`ribosome`	Free text	Top ribosome structures by resolution
`kinase X-ray`	Text + method	X-ray crystallography kinase structures
`6LU7`	PDB ID	SARS-CoV-2 main protease
`hemoglobin NMR`	Text + method	NMR-resolved hemoglobin structures

Tips

PDB IDs are exactly 4 characters (one digit followed by three alphanumeric characters, e.g., 1CRN, 6LU7). When you enter a valid PDB ID, the structure loads directly without a search step.
Add an experimental method to narrow results: kinase X-ray, antibody cryo-EM, peptide NMR.
The Mol* viewer supports multiple representations (cartoon, ball-and-stick, surface) and can highlight individual chains, ligands, or residue ranges.

Double-click any residue in the 3D viewer to center and highlight it. Right-click for options including distance measurement and surface coloring.
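
The PDB ID rule described in the tips (one digit followed by three alphanumeric characters) can be written as a short pattern check. This is a minimal sketch under that stated rule; `is_pdb_id` is a hypothetical name, not a function exposed by the product.

```python
import re

# One digit followed by three alphanumeric characters, e.g. 1CRN or 6LU7.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")

def is_pdb_id(query: str) -> bool:
    """Return True if the query has the shape of a 4-character PDB ID."""
    return PDB_ID.fullmatch(query.strip()) is not None
```

A four-digit number such as the Gene ID `7157` also fits this shape, so a pattern check alone cannot replace the tag selection.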

PubMed (Literature Search)

PubMed searches query titles, abstracts, and MeSH (Medical Subject Headings) terms across the full MEDLINE database. Results appear as Citation InfoCards showing the title, authors, journal, year, and abstract excerpt.

Example Queries

Query	Type	Expected Result
`CRISPR`	Keyword	Recent CRISPR papers ranked by relevance
`"single cell sequencing"`	Exact phrase	Papers containing the exact phrase
`machine learning drug discovery`	Multi-word	Papers matching all terms
`PMID:32015507`	PMID lookup	Direct retrieval of a specific paper
`scRNA-seq AND pancreas NOT cancer`	Boolean	Filtered pancreas single-cell papers

Tips

Wrap multi-word phrases in double quotes for exact matching. "gene therapy" finds papers with that exact phrase, while gene therapy matches papers containing both words anywhere in the text.

Boolean operators (AND, OR, NOT) work as expected. Use them to refine broad topics.
MeSH terms improve precision for established concepts. PubMed automatically maps common terms to their MeSH equivalents.
Prefix a PMID with PMID: for direct paper lookup without a search round-trip.
Results are sorted by relevance, so recent and highly cited papers tend to appear first.
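
The quoting and Boolean rules above combine into a single query string. The sketch below shows one way to assemble such a string and URL-encode it for NCBI's public E-utilities `esearch` endpoint. Whether the command bar uses E-utilities internally is an assumption, and `quote_phrase`, `build_pubmed_term`, and `esearch_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed backend for illustration).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def quote_phrase(text: str) -> str:
    """Wrap multi-word text in double quotes so it matches as an exact phrase."""
    text = text.strip()
    return f'"{text}"' if " " in text and not text.startswith('"') else text

def build_pubmed_term(required, excluded=()):
    """Join required terms with AND, then append a NOT clause per excluded term."""
    term = " AND ".join(quote_phrase(t) for t in required)
    for t in excluded:
        term += f" NOT {quote_phrase(t)}"
    return term

def esearch_url(term: str, retmax: int = 20) -> str:
    """Build the request URL without sending it."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"
```

For example, `build_pubmed_term(["scRNA-seq", "pancreas"], ["cancer"])` produces the same Boolean query shown in the example table.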

ChEMBL (Compound Search)

ChEMBL searches query the EMBL-EBI database of bioactive molecules with drug-like properties. Results include molecular structure, clinical development phase, mechanism of action, and target information.

Example Queries

Query	Type	Expected Result
`aspirin`	Drug name	Acetylsalicylic acid — Approved
`imatinib`	Drug name	Gleevec — Approved, BCR-ABL inhibitor
`CHEMBL25`	ChEMBL ID	Direct lookup for aspirin
`kinase inhibitor`	Mechanism	Compounds targeting kinases
`CHEMBL941`	ChEMBL ID	Direct lookup for imatinib

Tips

Both common drug names and ChEMBL accession IDs work. Accession IDs (e.g., CHEMBL25) perform direct lookups.
Compound InfoCards display the clinical phase (Approved, Phase I-III, Preclinical), molecular formula, and key physicochemical properties.
Target information links compounds to their protein targets, enabling cross-referencing with UniProt results on the same canvas.

Search for a drug target protein in UniProt, then search for compounds against that target in ChEMBL. Place both nodes on the canvas to build a target-compound relationship map.

NCBI GEO (Dataset Search)

GEO searches query the Gene Expression Omnibus for publicly available gene expression, methylation, and other functional genomics datasets. Results appear as Dataset InfoCards showing the GSE accession, title, organism, sample count, and platform.

Example Queries

Query	Type	Expected Result
`scRNA PBMC`	Keywords	Single-cell RNA-seq datasets from PBMCs
`GSE198765`	GSE accession	Direct lookup of a specific dataset
`cancer methylation human`	Keywords + organism	Human cancer methylation arrays
`ATAC-seq mouse brain`	Method + tissue	Chromatin accessibility in mouse brain
`COVID-19 bulk RNA-seq`	Disease + method	COVID transcriptomics datasets

Tips

Be specific about organism, method, and tissue type. scRNA PBMC human returns more relevant results than single cell.
GSE accession numbers (e.g., GSE198765) perform direct lookups and are the fastest way to retrieve a known dataset.
Dataset InfoCards show the number of samples and platform, helping you assess dataset suitability before downloading.

Found a dataset you want to analyze? Ask Lobster to download it directly: "Download GSE198765 and run QC." The data expert agent handles GEO downloads and format conversion automatically.

KEGG (Pathway Search)

KEGG searches query the Kyoto Encyclopedia of Genes and Genomes for metabolic pathways, signaling cascades, and disease pathways. Results create a PathwayNode — an interactive, zoomable pathway map with clickable entities (genes, compounds, reactions).

Example Queries

Query	Type	Expected Result
`glycolysis`	Pathway name	Glycolysis / Gluconeogenesis (hsa00010)
`MAPK`	Pathway name	MAPK signaling pathway (hsa04010)
`mTOR`	Pathway name	mTOR signaling pathway (hsa04150)
`hsa04110`	KEGG ID	Direct lookup for Cell cycle pathway
`purine metabolism`	Pathway name	Purine metabolism (hsa00230)

Tips

Use standard pathway names as they appear in KEGG. Common abbreviations (MAPK, mTOR, TCA cycle) are recognized.
KEGG pathway IDs (e.g., hsa04110) perform direct lookups. The hsa prefix indicates human pathways.
Double-click any entity (gene box, compound circle) in the pathway map to view its details or search for it in the corresponding database.
Pathway maps are interactive: zoom with scroll, pan by dragging, and click entities to highlight connected reactions.

For a full walkthrough of pathway navigation, entity resolution, and cross-database linking from pathway maps, see the Pathway Exploration page.

NCBI Nucleotide (Sequence Search)

NCBI Nucleotide searches query the GenBank and RefSeq sequence databases for genomes, plasmids, chromosomes, and individual gene sequences. Results create a CgviewNode, an interactive circular or linear genome map rendered with CGView.js that shows annotated features such as genes, regulatory elements, and restriction sites.

Example Queries

Query	Type	Expected Result
`pUC19`	Plasmid name	pUC19 cloning vector (circular map)
`E. coli K-12`	Organism	E. coli K-12 reference genome
`SARS-CoV-2`	Organism	SARS-CoV-2 reference genome (linear)
`NC_000913`	RefSeq accession	E. coli K-12 MG1655 genome
`lambda phage`	Common name	Bacteriophage lambda genome

Tips

Well-annotated sequences (reference genomes, common plasmids) produce the richest genome maps with full feature annotations.
RefSeq accession numbers (e.g., NC_000913) and GenBank accessions perform direct lookups.
Circular maps are generated for circular molecules (plasmids, bacterial chromosomes). Linear maps are used for linear genomes and chromosomal segments.
Zoom into specific regions of the genome map to see individual gene annotations, reading frames, and regulatory features.
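
GenBank flat files record the molecule topology on the `LOCUS` line, which is one plausible source for the circular-versus-linear decision. The sketch below reads that field. It is an assumption about how the choice could be made, not a description of the product's logic, and `topology_from_locus` is a hypothetical helper.

```python
def topology_from_locus(locus_line: str) -> str:
    """Return 'circular' or 'linear' based on a GenBank LOCUS line.

    Records that omit the topology field are treated as linear.
    """
    tokens = locus_line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a GenBank LOCUS line")
    return "circular" if "circular" in tokens else "linear"

# Synthetic LOCUS lines for illustration only (not real records).
plasmid = "LOCUS       EXAMPLE_PLASMID   2686 bp    DNA     circular SYN 01-JAN-2000"
virus = "LOCUS       EXAMPLE_VIRUS    29903 bp    RNA     linear   VRL 01-JAN-2000"
```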

Search Behavior

Caching

Search results are cached locally to speed up repeat queries and reduce load on external databases.

Setting	Value	Description
Search results TTL	60 seconds	Same query returns instantly within the window
Detail results TTL	300 seconds	Expanded node data (structures, full records)
Maximum entries	500	LRU eviction removes least-recently-used entries first
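
The combination of a time-to-live and LRU eviction in the table above can be sketched with an ordered dictionary. This is a minimal illustration, not the product's implementation: `TTLCache` is a hypothetical class, and it assumes the 500-entry limit applies to each cache separately, which the table does not specify.

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache whose entries also expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds, max_entries=500, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]  # entry has outlived its TTL
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

# TTLs taken from the table; one cache per result type is an assumption.
search_cache = TTLCache(ttl_seconds=60)
detail_cache = TTLCache(ttl_seconds=300)
```

Reading an entry refreshes its position in the LRU order but not its expiry time, so a popular query is still refetched once its TTL has elapsed.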

Rate Limits

Database Group	Limit	Notes
NCBI (Gene, PubMed, GEO, Nucleotide)	3 req/s default, 10 req/s with API key	All NCBI databases share the same rate limit pool
UniProt, PDB, ChEMBL	No strict rate limits	Fair-use policies apply
Per-user unified endpoint	20 searches/minute	Applies across all databases combined

If you have an NCBI API key, configure it in your account settings to increase the NCBI rate limit from 3 to 10 requests per second. Get a free key at ncbi.nlm.nih.gov/account/settings.
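
A sliding-window counter is one simple way to enforce limits like those listed above. The sketch below is an assumption about the mechanism (the page only states the limits, not how they are enforced), and `SlidingWindowLimiter` is a hypothetical class.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of period_seconds."""

    def __init__(self, max_calls, period_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period_seconds
        self.clock = clock  # injectable for testing
        self._stamps = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False

# Limits from the table above.
per_user = SlidingWindowLimiter(max_calls=20, period_seconds=60)
ncbi_default = SlidingWindowLimiter(max_calls=3, period_seconds=1)
ncbi_with_key = SlidingWindowLimiter(max_calls=10, period_seconds=1)
```

Because all NCBI databases share one pool, a single limiter instance would be consulted for Gene, PubMed, GEO, and Nucleotide requests alike.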

Error Handling

If an external database is temporarily unavailable, the search degrades gracefully: it returns "No results found" rather than an error. Retry after a few seconds.
Each database query has a timeout of 3-5 seconds. Slow upstream responses are terminated cleanly.
No 500 errors propagate to the UI. All failures are caught and displayed as user-friendly messages.
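
The behavior described above amounts to wrapping each upstream call so that timeouts and other failures become an empty result with a friendly message. The sketch below assumes a hypothetical `fetch` callable that accepts a `timeout` keyword and raises on failure. `search_with_fallback` and the shape of the returned dictionary are illustrative, not the product's actual interface.

```python
def search_with_fallback(fetch, query, timeout_seconds=5.0):
    """Call fetch(query, timeout=...) and never let an exception escape.

    `fetch` is a hypothetical upstream client function. Any failure,
    including a timeout, is reported as an empty result set.
    """
    try:
        results = fetch(query, timeout=timeout_seconds)
    except Exception:
        # Upstream outage, timeout, or malformed response: degrade gracefully.
        return {"results": [], "message": "No results found"}
    message = None if results else "No results found"
    return {"results": results, "message": message}
```

Passing the timeout down to the HTTP client, rather than abandoning a running request from the outside, lets the slow connection be closed by the code that owns it.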

Next Steps

Command Bar

Learn all 14 tags and keyboard shortcuts

Pathway Exploration

Navigate KEGG pathways and resolve entities

Getting Started

First steps with Omics-OS Cloud
