Biological Database Search
Search 8 major biological databases from the command bar — UniProt, NCBI Gene, PDB, PubMed, ChEMBL, GEO, KEGG, and NCBI Nucleotide
Overview
The command bar connects to 8 major biological databases through a unified search interface. Select a tag (e.g., Protein, Gene, Structure) then type your query to search any database without leaving the canvas.
Each search result creates a canvas node tailored to its data type — 3D protein structures rendered in Mol*, gene information cards, literature citations, interactive KEGG pathway maps, and circular genome visualizations. Results are cached locally for fast repeat access and can be arranged, connected, and annotated directly on the canvas.
Quick Reference
| Database | Tag | Best For | Example Query | Creates |
|---|---|---|---|---|
| UniProt | Protein | Protein info, sequences, structures | TP53, P04637 | 3D structure or InfoCard |
| NCBI Gene | Gene | Gene metadata, chromosome location | BRCA1, EGFR | Gene InfoCard |
| RCSB PDB | Structure | 3D molecular structures | 1CRN, ribosome | 3D Mol* viewer |
| PubMed | Literature | Scientific papers, reviews | CRISPR, "gene therapy" | Citation InfoCard |
| ChEMBL | Compound | Drugs, bioactive molecules | aspirin, CHEMBL25 | Compound InfoCard |
| NCBI GEO | Dataset | Gene expression datasets | scRNA PBMC | Dataset InfoCard |
| KEGG | Pathway | Metabolic and signaling pathways | glycolysis, MAPK | Interactive pathway |
| NCBI Nucleotide | Sequence | Genomes, plasmids, gene sequences | pUC19, SARS-CoV-2 | Genome map |
UniProt (Protein Search)
UniProt searches query the reviewed Swiss-Prot database, returning curated protein entries with function annotations, domain architecture, disease associations, and cross-references. When a protein has an associated PDB structure, the result node renders a 3D viewer. Otherwise it displays a detailed InfoCard.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| TP53 | Gene symbol | Human tumor protein p53 with 3D structure |
| P04637 | UniProt accession | Direct lookup; fastest path to a specific entry |
| insulin receptor | Free text | Top matches ranked by annotation score |
| BRCA2_HUMAN | Entry name | Exact entry for human BRCA2 |
| kinase AND organism:mouse | Advanced | Mouse kinases filtered by organism |
Tips
UniProt accession IDs (e.g., P04637, Q9Y6K9) return results instantly because they bypass full-text search and resolve directly.
- Gene symbols like `TP53` or `EGFR` work well for human proteins. For other organisms, append the species: `TP53 mouse`.
- Partial protein names use fuzzy matching: `tumor suppressor p53` finds the same entry as `TP53`.
- Results include GO annotations, subcellular location, and links to external databases (PDB, InterPro, Pfam).
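The direct-lookup behavior described above depends on recognizing an accession before falling back to full-text search. The following is a minimal sketch of such a check, using the accession pattern published in UniProt's own documentation. The function name `is_uniprot_accession` is hypothetical and does not describe how the command bar actually routes queries.

```python
import re

# Accession pattern as published in the UniProt help pages
# (6-character and 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(query: str) -> bool:
    """Return True when the whole query looks like a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(query.strip().upper()) is not None


# Hypothetical usage: accessions go to a direct lookup, everything else
# goes to full-text search.
for q in ["P04637", "Q9Y6K9", "TP53", "insulin receptor"]:
    route = "direct lookup" if is_uniprot_accession(q) else "full-text search"
    print(f"{q!r}: {route}")
```

Gene symbols such as TP53 and entry names such as BRCA2_HUMAN do not match the pattern, so in this sketch they would fall through to the search path.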
NCBI Gene
NCBI Gene searches return gene metadata including chromosomal location, aliases, RefSeq identifiers, and functional summaries. Results appear as Gene InfoCards with direct links to the NCBI Gene page.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| BRCA1 | HGNC symbol | Breast cancer type 1, chr17 |
| EGFR | HGNC symbol | Epidermal growth factor receptor, chr7 |
| HER2 | Alias | Resolves to ERBB2 |
| 7157 | Gene ID | Direct lookup for TP53 |
| apolipoprotein E | Full name | APOE gene card |
Tips
- Use HGNC-approved gene symbols for the most reliable results. Aliases (e.g., `HER2` for `ERBB2`) are resolved but may return multiple matches.
- The search defaults to human genes. Specify the organism explicitly if you need a different species: `BRCA1 rat`.
- Gene IDs (numeric) perform direct lookups and are the fastest query type.
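The human default and the trailing-species convention can be pictured as a small query translation step. The sketch below is an assumption about how such a translation might look, not the application's actual code: the `ORGANISMS` table and `build_gene_term` are hypothetical names, and the output uses the `[Gene Name]` and `[Organism]` field tags from NCBI's Entrez query syntax. It treats the text as a gene symbol, so full names such as "apolipoprotein E" would need a different field.

```python
# Hypothetical mapping from a trailing species word to a scientific name.
ORGANISMS = {
    "human": "Homo sapiens",
    "mouse": "Mus musculus",
    "rat": "Rattus norvegicus",
}


def build_gene_term(query: str, default_organism: str = "human") -> dict:
    """Translate a command-bar query into a Gene ID lookup or an Entrez term."""
    q = query.strip()
    if q.isdigit():
        # Numeric queries are treated as Gene IDs and skip the search step.
        return {"kind": "id", "gene_id": q}
    words = q.split()
    organism = default_organism
    # A recognized species as the last word overrides the human default.
    if len(words) > 1 and words[-1].lower() in ORGANISMS:
        organism = words.pop().lower()
    symbol = " ".join(words)
    term = f'{symbol}[Gene Name] AND "{ORGANISMS[organism]}"[Organism]'
    return {"kind": "search", "term": term}


print(build_gene_term("7157"))
print(build_gene_term("EGFR"))
print(build_gene_term("BRCA1 rat"))
```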
Gene InfoCards show the official symbol, full name, chromosome band, and a one-paragraph functional summary pulled from NCBI's curated RefSeq records.
RCSB PDB (Structure Search)
PDB searches query the RCSB Protein Data Bank for experimentally determined 3D structures. Results create a MolstarNode — a fully interactive molecular viewer powered by Mol* with rotation, zoom, surface/cartoon toggles, and chain highlighting.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| 1CRN | PDB ID | Crambin crystal structure (direct load) |
| ribosome | Free text | Top ribosome structures by resolution |
| kinase X-ray | Text + method | X-ray crystallography kinase structures |
| 6LU7 | PDB ID | SARS-CoV-2 main protease |
| hemoglobin NMR | Text + method | NMR-resolved hemoglobin structures |
Tips
- PDB IDs are exactly 4 characters (one digit followed by three alphanumeric characters, e.g., `1CRN`, `6LU7`). When you enter a valid PDB ID, the structure loads directly without a search step.
- Add an experimental method to narrow results: `kinase X-ray`, `antibody cryo-EM`, `peptide NMR`.
- The Mol* viewer supports multiple representations (cartoon, ball-and-stick, surface) and can highlight individual chains, ligands, or residue ranges.
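The 4-character rule and the method keywords can be expressed as a short parser. This is an illustrative sketch only: `parse_structure_query` and the `METHODS` list are hypothetical, and the method vocabulary is limited to the three keywords that appear in the examples above.

```python
import re

# One digit followed by three alphanumeric characters, as described above.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")

# Method keywords taken from the example queries; the real list may differ.
METHODS = ("x-ray", "cryo-em", "nmr")


def parse_structure_query(query: str) -> dict:
    """Split a query into either a direct PDB ID load or a text + method search."""
    q = query.strip()
    if PDB_ID.fullmatch(q):
        return {"kind": "id", "pdb_id": q.upper()}
    words = q.split()
    # The first recognized method keyword becomes the filter (None if absent).
    method = next((w for w in words if w.lower() in METHODS), None)
    text = " ".join(w for w in words if w.lower() not in METHODS)
    return {"kind": "search", "text": text, "method": method}


print(parse_structure_query("6LU7"))
print(parse_structure_query("antibody cryo-EM"))
print(parse_structure_query("ribosome"))
```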
Double-click any residue in the 3D viewer to center and highlight it. Right-click for options including distance measurement and surface coloring.
PubMed (Literature Search)
PubMed searches query titles, abstracts, and MeSH (Medical Subject Headings) terms across the full MEDLINE database. Results appear as Citation InfoCards showing the title, authors, journal, year, and abstract excerpt.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| CRISPR | Keyword | Recent CRISPR papers ranked by relevance |
| "single cell sequencing" | Exact phrase | Papers containing the exact phrase |
| machine learning drug discovery | Multi-word | Papers matching all terms |
| PMID:32015507 | PMID lookup | Direct retrieval of a specific paper |
| scRNA-seq AND pancreas NOT cancer | Boolean | Filtered pancreas single-cell papers |
Tips
Wrap multi-word phrases in double quotes for exact matching. "gene therapy" finds papers with that exact phrase, while gene therapy matches papers containing both words anywhere in the text.
- Boolean operators (`AND`, `OR`, `NOT`) work as expected. Use them to refine broad topics.
- MeSH terms improve precision for established concepts. PubMed automatically maps common terms to their MeSH equivalents.
- Prefix a PMID with `PMID:` for direct paper lookup without a search round-trip.
- Results are sorted by relevance. The most recent and highest-cited papers appear first.
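The difference between a quoted phrase and unquoted words is easiest to see on a single sentence. The toy matcher below is only a sketch of that distinction. It does not reproduce PubMed's real matching, which also applies MeSH term mapping, and both function names are hypothetical.

```python
import re


def matches_exact_phrase(text: str, phrase: str) -> bool:
    """Quoted query: the words must appear together and in order."""
    return phrase.lower() in text.lower()


def matches_all_words(text: str, query: str) -> bool:
    """Unquoted query: every word must appear somewhere in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(w in words for w in query.lower().split())


abstract = "Therapy options that target a single gene are expanding."

# Both words occur, but never as the adjacent phrase "gene therapy".
print(matches_all_words(abstract, "gene therapy"))
print(matches_exact_phrase(abstract, "gene therapy"))
```

In this example the unquoted form matches and the quoted form does not, which is why quoting narrows the result set.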
ChEMBL (Compound Search)
ChEMBL searches query the EMBL-EBI database of bioactive molecules with drug-like properties. Results include molecular structure, clinical development phase, mechanism of action, and target information.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| aspirin | Drug name | Acetylsalicylic acid, Approved |
| imatinib | Drug name | Gleevec, Approved, BCR-ABL inhibitor |
| CHEMBL25 | ChEMBL ID | Direct lookup for aspirin |
| kinase inhibitor | Mechanism | Compounds targeting kinases |
| CHEMBL941 | ChEMBL ID | Direct lookup for imatinib |
Tips
- Both common drug names and ChEMBL accession IDs work. Accession IDs (e.g., `CHEMBL25`) perform direct lookups.
- Compound InfoCards display the clinical phase (Approved, Phase I-III, Preclinical), molecular formula, and key physicochemical properties.
- Target information links compounds to their protein targets, enabling cross-referencing with UniProt results on the same canvas.
Search for a drug target protein in UniProt, then search for compounds against that target in ChEMBL. Place both nodes on the canvas to build a target-compound relationship map.
NCBI GEO (Dataset Search)
GEO searches query the Gene Expression Omnibus for publicly available gene expression, methylation, and other functional genomics datasets. Results appear as Dataset InfoCards showing the GSE accession, title, organism, sample count, and platform.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| scRNA PBMC | Keywords | Single-cell RNA-seq datasets from PBMCs |
| GSE198765 | GSE accession | Direct lookup of a specific dataset |
| cancer methylation human | Keywords + organism | Human cancer methylation arrays |
| ATAC-seq mouse brain | Method + tissue | Chromatin accessibility in mouse brain |
| COVID-19 bulk RNA-seq | Disease + method | COVID transcriptomics datasets |
Tips
- Be specific about organism, method, and tissue type. `scRNA PBMC human` returns more relevant results than `single cell`.
- GSE accession numbers (e.g., `GSE198765`) perform direct lookups and are the fastest way to retrieve a known dataset.
- Dataset InfoCards show the number of samples and platform, helping you assess dataset suitability before downloading.
Found a dataset you want to analyze? Ask Lobster to download it directly: "Download GSE198765 and run QC." The data expert agent handles GEO downloads and format conversion automatically.
KEGG (Pathway Search)
KEGG searches query the Kyoto Encyclopedia of Genes and Genomes for metabolic pathways, signaling cascades, and disease pathways. Results create a PathwayNode — an interactive, zoomable pathway map with clickable entities (genes, compounds, reactions).
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| glycolysis | Pathway name | Glycolysis / Gluconeogenesis (hsa00010) |
| MAPK | Pathway name | MAPK signaling pathway (hsa04010) |
| mTOR | Pathway name | mTOR signaling pathway (hsa04150) |
| hsa04110 | KEGG ID | Direct lookup for Cell cycle pathway |
| purine metabolism | Pathway name | Purine metabolism (hsa00230) |
Tips
- Use standard pathway names as they appear in KEGG. Common abbreviations (`MAPK`, `mTOR`, `TCA cycle`) are recognized.
- KEGG pathway IDs (e.g., `hsa04110`) perform direct lookups. The `hsa` prefix indicates human pathways.
- Double-click any entity (gene box, compound circle) in the pathway map to view its details or search for it in the corresponding database.
- Pathway maps are interactive: zoom with scroll, pan by dragging, and click entities to highlight connected reactions.
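A KEGG pathway ID combines a short lowercase prefix with a five-digit pathway number, as in the `hsa` IDs listed above. A minimal sketch of splitting an ID into those parts follows. The function name and the returned keys are hypothetical, and the prefix length range of two to four letters is an assumption meant to cover organism codes as well as reference prefixes such as `map` and `ko`.

```python
import re

# Lowercase prefix (organism code or reference prefix) plus five digits.
KEGG_PATHWAY_ID = re.compile(r"([a-z]{2,4})(\d{5})")


def parse_kegg_pathway_id(query: str):
    """Return the parts of a KEGG pathway ID, or None if the query is not an ID."""
    match = KEGG_PATHWAY_ID.fullmatch(query.strip().lower())
    if match is None:
        return None
    prefix, number = match.groups()
    return {
        "organism_prefix": prefix,
        "pathway_number": number,
        "is_human": prefix == "hsa",
    }


print(parse_kegg_pathway_id("hsa04110"))
print(parse_kegg_pathway_id("mmu04010"))
print(parse_kegg_pathway_id("glycolysis"))
```

Pathway names such as `glycolysis` or `MAPK` return `None` here, which in this sketch would mean falling back to a name search.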
For a full walkthrough of pathway navigation, entity resolution, and cross-database linking from pathway maps, see the Pathway Exploration page.
NCBI Nucleotide (Sequence Search)
NCBI Nucleotide searches query the GenBank and RefSeq sequence databases for genomes, plasmids, chromosomes, and individual gene sequences. Results create a CgviewNode, an interactive circular or linear genome map rendered with CGView.js that shows annotated features such as genes, regulatory elements, and restriction sites.
Example Queries
| Query | Type | Expected Result |
|---|---|---|
| pUC19 | Plasmid name | pUC19 cloning vector (circular map) |
| E. coli K-12 | Organism | E. coli K-12 reference genome |
| SARS-CoV-2 | Organism | SARS-CoV-2 reference genome (linear) |
| NC_000913 | RefSeq accession | E. coli K-12 MG1655 genome |
| lambda phage | Common name | Bacteriophage lambda genome |
Tips
- Well-annotated sequences (reference genomes, common plasmids) produce the richest genome maps with full feature annotations.
- RefSeq accession numbers (e.g., `NC_000913`) and GenBank accessions perform direct lookups.
- Circular maps are generated for circular molecules (plasmids, bacterial chromosomes). Linear maps are used for linear genomes and chromosomal segments.
- Zoom into specific regions of the genome map to see individual gene annotations, reading frames, and regulatory features.
Search Behavior
Caching
Search results are cached locally to speed up repeat queries and reduce load on external databases.
| Cache Setting | Value | Description |
|---|---|---|
| Search results TTL | 60 seconds | Same query returns instantly within the window |
| Detail results TTL | 300 seconds | Expanded node data (structures, full records) |
| Maximum entries | 500 | LRU eviction removes least-recently-used entries first |
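The combination of a time-to-live and LRU eviction can be sketched with an ordered dictionary. This is an illustration of the technique under stated assumptions, not the application's cache: the `TTLCache` class is hypothetical, and it assumes the 500-entry cap applies to each cache separately, which the table does not specify.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small cache with per-entry expiry and least-recently-used eviction."""

    def __init__(self, ttl_seconds, max_entries=500, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]  # expired entries count as misses
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry


# TTL values from the table above.
search_cache = TTLCache(ttl_seconds=60)
detail_cache = TTLCache(ttl_seconds=300)
search_cache.put(("Protein", "TP53"), ["P04637"])
print(search_cache.get(("Protein", "TP53")))
```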
Rate Limits
| Database Group | Limit | Notes |
|---|---|---|
| NCBI (Gene, PubMed, GEO, Nucleotide) | 3 req/s default, 10 req/s with API key | All NCBI databases share the same rate limit pool |
| UniProt, PDB, ChEMBL | No strict rate limits | Fair-use policies apply |
| Per-user unified endpoint | 20 searches/minute | Applies across all databases combined |
If you have an NCBI API key, configure it in your account settings to increase the NCBI rate limit from 3 to 10 requests per second. Get a free key at ncbi.nlm.nih.gov/account/settings.
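A limit such as 3 requests per second or 20 searches per minute can be enforced with a sliding-window counter. The sketch below shows one way to do that. `SlidingWindowLimiter` and `ncbi_requests_per_second` are hypothetical names, and the source does not say which algorithm the service actually uses.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of period_seconds."""

    def __init__(self, max_calls, period_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period_seconds
        self.clock = clock  # injectable for testing
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def try_acquire(self) -> bool:
        now = self.clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


def ncbi_requests_per_second(api_key) -> int:
    """Limits from the table above: 10 req/s with an API key, 3 req/s without."""
    return 10 if api_key else 3


# One shared limiter for all NCBI databases, one for the per-user search cap.
ncbi_limiter = SlidingWindowLimiter(ncbi_requests_per_second(None), 1.0)
user_limiter = SlidingWindowLimiter(20, 60.0)
print(ncbi_limiter.try_acquire(), user_limiter.try_acquire())
```

A caller that receives `False` would typically wait briefly and retry rather than send the request.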
Error Handling
- If an external database is temporarily unavailable, the search returns "No results found" rather than an error. This is graceful degradation — retry after a few seconds.
- Each database query has a timeout of 3-5 seconds. Slow upstream responses are terminated cleanly.
- No 500 errors propagate to the UI. All failures are caught and displayed as user-friendly messages.
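The pattern described in this list, a bounded wait plus a fallback to an empty result, can be sketched as a wrapper around any upstream call. Everything here is an assumption for illustration: `safe_search`, the `fetch` callable, and the shape of the returned dictionary are hypothetical, and the 5-second default is simply the upper end of the range quoted above.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_search(fetch, query, timeout_seconds=5.0):
    """Run fetch(query) with a time limit and never raise to the caller."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fetch, query)
    try:
        results = future.result(timeout=timeout_seconds)
    except Exception:
        # Covers timeouts as well as upstream failures: degrade to an
        # empty result instead of propagating an error.
        return {"results": [], "message": "No results found"}
    finally:
        # Do not block on a slow upstream call that is still running.
        executor.shutdown(wait=False)
    return {"results": results, "message": None if results else "No results found"}


def failing_upstream(query):
    raise RuntimeError("upstream returned HTTP 500")


print(safe_search(lambda q: [q], "TP53"))
print(safe_search(failing_upstream, "TP53"))
```

Note that in this sketch a timed-out call keeps running in its worker thread until it finishes; only the wait is abandoned.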