Microarray Data Standards, Annotations, Ontologies and Databases:
Ontology Working Group
Chris Stoeckert's ontology based on Mike Bittner's sample classifications.
The following was posted in several mailings, which have been combined here without editing. They predate the Heidelberg meeting and do not reflect those discussions.
Overall view:
Establish terminology not schema.
Establish rules (e.g., distinguish between isa and has relationships)
"or" is not exclusive. "xor" is exclusive
Experimental architecture:
Suggest: top level: Study (all experiments in a paper)
Study has a Group ("multiple determination")
experiments can belong to more than one group
Group is Ordered xor Unordered
Ordered is Time or Factor (e.g., drug study)
Time is Response or Age
Unordered is Replicates (repeats of same sample) or Type (same type of sample)
Group has an Experiment ("single determination")
If you view a 2-color microarray as two experiments (as I do), then
Experiment has a Reference
Experiment is a Reference
If you view a 2-color microarray as 1 experiment
Experiment has a Sample (2 samples if 2-color)
Sample is a Reference
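As a rough sketch of the architecture above, the Study / Group / Experiment terms could be written down as follows. It takes the "two experiments per 2-color array" view, and it assumes a Group may carry several kinds (since "or" is not exclusive) as long as it stays Ordered xor Unordered. All class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import FrozenSet, List, Optional


class GroupKind(Enum):
    TIME_RESPONSE = "time: response"   # Ordered
    TIME_AGE = "time: age"             # Ordered
    FACTOR = "factor"                  # Ordered (e.g., drug study)
    REPLICATES = "replicates"          # Unordered
    TYPE = "type"                      # Unordered


ORDERED_KINDS = frozenset(
    {GroupKind.TIME_RESPONSE, GroupKind.TIME_AGE, GroupKind.FACTOR}
)


@dataclass
class Experiment:
    name: str
    # "Experiment has a Reference": another Experiment, if any.
    reference: Optional["Experiment"] = None


@dataclass
class Group:
    name: str
    kinds: FrozenSet[GroupKind]
    experiments: List[Experiment] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.kinds = frozenset(self.kinds)
        if not self.kinds:
            raise ValueError("a Group needs at least one kind")
        ordered = bool(self.kinds & ORDERED_KINDS)
        unordered = bool(self.kinds - ORDERED_KINDS)
        if ordered and unordered:
            raise ValueError("Group is Ordered xor Unordered, not both")

    @property
    def is_ordered(self) -> bool:
        return bool(self.kinds & ORDERED_KINDS)


@dataclass
class Study:
    title: str
    groups: List[Group] = field(default_factory=list)


# Hypothetical example data: one experiment belongs to two groups.
control = Experiment("control")
treated = Experiment("treated", reference=control)
time_course = Group("time course", {GroupKind.TIME_RESPONSE}, [control, treated])
replicates = Group("replicates", {GroupKind.REPLICATES}, [treated])
study = Study("example study", [time_course, replicates])
```

Keeping group membership as plain lists lets the same Experiment appear in more than one Group without copying it.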
Treatment of samples:
The top level is Sample.
Sample has a Treatment
Treatment is qualitative xor quantitative
quantitative has units and values
units are standard (found in a dictionary - we can debate whose) xor descriptive (text)
Treatment has Agents
Agents are chemical, physical, xor biological
chemical agents are found in the Merck Index (or some other reference)
biological agents are MeSH terms or descriptive (text)
physical agents are descriptive (text unless we can find a reference for things like stretching, heat, shock, etc.)
Agents are constant xor variable
Treatment has a Medline reference
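One possible reading of the Treatment terms, as a sketch only: each "xor" becomes either an enumeration (exactly one member can be chosen) or a boolean flag. The names Agent, Quantity and Treatment, the single-quantity simplification, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AgentKind(Enum):
    # "Agents are chemical, physical, xor biological"
    CHEMICAL = "chemical"
    PHYSICAL = "physical"
    BIOLOGICAL = "biological"


@dataclass
class Agent:
    name: str
    kind: AgentKind
    constant: bool  # True = constant, False = variable (constant xor variable)


@dataclass
class Quantity:
    value: float
    unit: str
    unit_is_standard: bool  # True = dictionary unit, False = descriptive text


@dataclass
class Treatment:
    agents: List[Agent] = field(default_factory=list)
    # None means the treatment is qualitative; otherwise it is quantitative.
    quantity: Optional[Quantity] = None
    medline_id: Optional[str] = None

    @property
    def is_quantitative(self) -> bool:
        return self.quantity is not None


# Hypothetical examples; the values are placeholders, not real data.
heat = Treatment(agents=[Agent("heat", AgentKind.PHYSICAL, constant=True)])
dosed = Treatment(
    agents=[Agent("some compound", AgentKind.CHEMICAL, constant=False)],
    quantity=Quantity(value=10.0, unit="mg", unit_is_standard=True),
)
```

Because quantity is either present or absent, a Treatment cannot be qualitative and quantitative at the same time.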
Detectors - Array elements:
Top level is Array
Array has a Manufacturer
Manufacturer has a Protocol (can be a Medline reference).
Array has a Version
Array is Spotted xor Lithographic
Spotted Arrays have Dimensions
Spotted Arrays have Spot-spacing
Spotted Arrays have Number of spots
Spotted Arrays have Spot sets (I call them Spot Families in RAD schema).
Spot set is a Gene set (set of clones for the same gene, may be replicates)
Spot set has an Alternate identifier (accession not always available)
Alternate identifier is clone ID (IMAGE, dbEST) xor cluster ID (Unigene but could be TIGR, DOTS) (can use to distinguish splicing variants here)
Alternate identifier is sequence verified xor not sequence verified
Spot set has Spots
Spot has an Identifier (may be different from others in same set, can distinguish splicing here)
Spot has Portion of Identifier used
Portion is all xor insert (if clone identifier) or start-to-stop (if sequence accession) - needs clarifying. At worst text, at best controlled vocabulary.
Spot is an Element
Lithographic Arrays have Probe sets.
Probe set is a Gene set
Probe sets have a Match set
Match sets have Probe Elements
Probe Element is an Element
Probe Elements have Sequence (literal)
Probe sets have a Mismatch set
Mismatch sets have Probe Elements
Lithographic Arrays have Number of probe sets
Gene sets have an Identifier (EMBL/DDBJ/GenBank)
Identifier is for RNA (cDNA/EST) xor DNA (genomic) or Reference
Gene sets have a Description (text)
Gene sets have a Sequence coverage
Sequence coverage has a Start (integer, relative to Identifier sequence).
Sequence coverage has a Stop (integer, relative to Identifier sequence).
Gene sets have Elements
Elements are Typed
Types are oligo xor cDNA xor clone xor genomic (i.e., controlled vocabulary of terms. Note that an array can have different types of elements such as cDNA and genomic).
Elements have an ArrayLocation (grid/row/column or row/column)
Elements have a Description (text)
Elements are Good or Bad (flag)
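The Element terms above lend themselves to a small sketch of what a controlled vocabulary buys: an unknown type term is rejected instead of being stored as free text. The class and field names below are illustrative assumptions, as are the example locations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ElementType(Enum):
    # "Types are oligo xor cDNA xor clone xor genomic"
    OLIGO = "oligo"
    CDNA = "cDNA"
    CLONE = "clone"
    GENOMIC = "genomic"


@dataclass(frozen=True)
class ArrayLocation:
    row: int
    column: int
    grid: Optional[int] = None  # grid/row/column, or just row/column


@dataclass
class Element:
    type: ElementType
    location: ArrayLocation
    description: str = ""
    good: bool = True  # the Good/Bad flag


# Looking a term up by value enforces the vocabulary:
# ElementType("cDNA") succeeds, ElementType("protein") raises ValueError.
elements = [
    Element(ElementType("cDNA"), ArrayLocation(row=1, column=1, grid=1)),
    Element(ElementType("genomic"), ArrayLocation(row=1, column=2, grid=1)),
]

# One array may mix element types, as noted above.
types_on_array = {e.type for e in elements}
```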
Here is a pass at samples. It is straightforward except for organism-dependent attributes. These are Anatomy, DevelopmentalStage, and Pathology. As discussed earlier, we will need to come up with ontologies for each rather than try to represent everything with the same set of terms.
We can try to cover the most widely studied organisms initially, but this will require an ongoing group to guide addition of ontologies for experiments investigating organisms not covered. Alternatively, we can identify ontologies for a model organism representing major taxa (vertebrates [mammals, fish, birds, reptiles, amphibians], invertebrates [insects, worms], plants [monocots, dicots, fungi]) that can be generalized. Correct me if I'm wrong, but I don't think anatomy, developmental stage, and pathology are applicable to bacteria (archaea, etc.). Did I miss any?
If we go with ontologies for the 10 areas (mammals, fish, birds, reptiles, amphibians, insects, worms, monocots, dicots, fungi), this should be manageable if we divide it up. I can do mammals - any volunteers for the others?
I refer to RAD below. It is available at http://www.cbil.upenn.edu/RAD2.
Cell Samples:
Top level: Sample
Sample has a Treatment (see previous mail)
Sample has an Organism
in RAD we call this Taxonomy and use the table from GSDB (NCBI)
Sample has an Anatomy
Anatomy is organism dependent
in RAD we call this Anatomy and combine human and mouse:
http://www.cbil.upenn.edu/anatomy.php3
Sample has a DevelopmentalStage
DevelopmentalStage is organism dependent
mouse and human stages can be obtained from:
http://www.ana.ed.ac.uk/anatomy/database/humat/
http://genex.hgu.mrc.ac.uk/Databases/Anatomy/MAstaging.shtml
Sample has a Pathology
Pathology is also organism dependent. In RAD for human we use the ICD-9 classification:
http://www.genome.ad.jp/kegg/kegg2.html
Sample has a Genotype
Genotype is a strain
strain has a name, source, catalog #, description
Genotype is transgenic
transgene has an ID in some database, a reference, a description
Genotype has an allele
allele is wild type xor mutant
Genotype has a marker
marker has a chromosome, location, units, type, source
units: centirays, centimorgan, Mb, band
type: STS, EST, microsatellite, RFLP
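To illustrate what "organism dependent" could mean in practice, here is a sketch in which the anatomy vocabulary is looked up per organism area, alongside the allele and marker vocabularies listed above. The dictionary ANATOMY_TERMS and its terms are placeholders, not a proposed ontology, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Set

# Placeholder vocabularies keyed by organism area; real ones would come
# from the per-organism ontologies discussed above.
ANATOMY_TERMS: Dict[str, Set[str]] = {
    "mammals": {"liver", "brain"},
    "dicots": {"leaf", "root"},
}


class Allele(Enum):
    WILD_TYPE = "wild type"
    MUTANT = "mutant"


class MarkerUnit(Enum):
    CENTIRAY = "centirays"
    CENTIMORGAN = "centimorgan"
    MB = "Mb"
    BAND = "band"


class MarkerType(Enum):
    STS = "STS"
    EST = "EST"
    MICROSATELLITE = "microsatellite"
    RFLP = "RFLP"


@dataclass
class Marker:
    chromosome: str
    location: str  # kept as text so that band names fit as well as numbers
    units: MarkerUnit
    type: MarkerType
    source: str


@dataclass
class Sample:
    organism_area: str
    anatomy: Optional[str] = None  # None where anatomy does not apply

    def __post_init__(self) -> None:
        if self.anatomy is None:
            return
        allowed = ANATOMY_TERMS.get(self.organism_area)
        if allowed is None:
            raise ValueError(f"no anatomy vocabulary for {self.organism_area!r}")
        if self.anatomy not in allowed:
            raise ValueError(
                f"{self.anatomy!r} is not an anatomy term for {self.organism_area!r}"
            )


mouse_liver = Sample("mammals", anatomy="liver")
```

With this arrangement a term such as "leaf" is accepted for a dicot sample but rejected for a mammalian one, so each organism area keeps its own set of terms.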
Detection Methods:
RNA extraction, sample labeling, hybridization methods, and reader/scanner (image) are part of Experiment. Data extraction is part of reader/scanner (image).
Arrays were covered previously.
Top level is Experiment
Experiment has a Protocol
Protocol has a description (text)
Protocol has a reference (Medline ID)
RNA Extraction is a Protocol
RNA Extraction has a yield
Sample Labeling is a Protocol
Hybridization Methods is a Protocol
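The Protocol terms form a small "is a" hierarchy, sketched below. This assumes an Experiment may hold several protocols (extraction, labeling, hybridization), which is one reading of the outline. The field yield_amount stands in for "yield" because yield is a reserved word in Python; all names and example values are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protocol:
    description: str
    medline_id: Optional[str] = None  # "Protocol has a reference (Medline ID)"


@dataclass
class RNAExtraction(Protocol):
    yield_amount: Optional[float] = None  # "RNA Extraction has a yield"


@dataclass
class SampleLabeling(Protocol):
    pass


@dataclass
class HybridizationMethod(Protocol):
    pass


@dataclass
class Experiment:
    protocols: List[Protocol] = field(default_factory=list)


# Hypothetical example; descriptions and numbers are placeholders.
experiment = Experiment(
    protocols=[
        RNAExtraction("total RNA extraction", yield_amount=50.0),
        SampleLabeling("labeling"),
        HybridizationMethod("overnight hybridization"),
    ]
)

extractions = [p for p in experiment.protocols if isinstance(p, RNAExtraction)]
```

Only RNAExtraction carries a yield, while description and Medline reference are inherited by every kind of Protocol.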