Microarray Data Standards, Annotations, Ontologies and Databases:
Ontology Working Group


Chris Stoeckert's ontology based on Mike Bittner's sample classifications.

The following was posted in several mailings which have been combined together without editing. These predated the Heidelberg meeting and do not reflect those discussions.

Overall view:
	Establish terminology, not schema.
	Establish rules (e.g., distinguish between isa and has relationships)
	"or" is not exclusive. "xor" is exclusive
Experimental architecture:
	Suggest: top level: Study (all experiments in a paper)
	Study has a Group ("multiple determination")
	Experiments can belong to more than one group
	Group is Ordered xor Unordered
	Ordered is Time or Factor (e.g., drug study)
	Time is Response or Age
	Unordered is Replicates (repeats of same sample) or Type (same type of sample)
	Group has an Experiment ("single determination")
	
	If you view a 2-color microarray as two experiments (as I do), then
	Experiment has a Reference
	Experiment is a Reference

	If you view a 2-color microarray as 1 experiment
	Experiment has a Sample (2 samples if 2-color)
	Sample is a Reference
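
To make the rules concrete, here is a minimal Python sketch of the Study / Group / Experiment terms. It is only an illustration: the class names follow the terms above, but the field names, the GroupKind enumeration, and the example values are assumptions, not part of the proposal. The non-exclusive "or" is modeled as a set of kinds, and "Ordered xor Unordered" as a check that all kinds come from exactly one side.

```python
from dataclasses import dataclass, field
from enum import Enum


class GroupKind(Enum):
    TIME = "time"              # Ordered
    FACTOR = "factor"          # Ordered (e.g., drug study)
    REPLICATES = "replicates"  # Unordered
    TYPE = "type"              # Unordered


ORDERED_KINDS = {GroupKind.TIME, GroupKind.FACTOR}


@dataclass
class Experiment:
    name: str  # a "single determination"


@dataclass
class Group:
    kinds: frozenset  # "or" is not exclusive, so more than one kind is allowed
    experiments: list = field(default_factory=list)

    def __post_init__(self):
        ordered = self.kinds & ORDERED_KINDS
        unordered = self.kinds - ORDERED_KINDS
        # "xor" is exclusive: kinds must come from exactly one side
        if bool(ordered) == bool(unordered):
            raise ValueError("Group must be Ordered xor Unordered")

    @property
    def is_ordered(self) -> bool:
        return bool(self.kinds & ORDERED_KINDS)


@dataclass
class Study:
    title: str  # all experiments in a paper
    groups: list = field(default_factory=list)


# An experiment can belong to more than one group.
e1, e2 = Experiment("0h"), Experiment("4h")
time_course = Group(frozenset({GroupKind.TIME}), [e1, e2])
replicates = Group(frozenset({GroupKind.REPLICATES}), [e1])
study = Study("example study", [time_course, replicates])
```

Mixing an Ordered kind with an Unordered kind in one Group raises ValueError, while a Group that is both Time and Factor is accepted.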

Treatment of samples:
The top level is Sample.
Sample has a Treatment
Treatment is qualitative xor quantitative
        		quantitative treatment has units and values
                		units are standard (found in a dictionary - we can debate whose) xor descriptive (text)
Treatment has Agents
        		Agents are chemical, physical, xor biological
                		chemical agents are found in the Merck Index (or some other reference)
                		biological agents are MeSH terms or descriptive (text)
                		physical agents are descriptive (text unless we can find a reference for things like stretching, heat, shock, etc.)
        		Agents are constant xor variable
Treatment has a Medline reference
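
The Treatment terms could be sketched in Python along these lines. This is a rough illustration under assumptions: the field names, the use of a single numeric value per treatment, and the example agent are mine, and lookups against the Merck Index or MeSH are not attempted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AgentKind(Enum):
    CHEMICAL = "chemical"
    PHYSICAL = "physical"
    BIOLOGICAL = "biological"


@dataclass
class Agent:
    name: str
    kind: AgentKind  # chemical xor physical xor biological
    constant: bool   # True = constant, False = variable


@dataclass
class Treatment:
    agents: list
    value: Optional[float] = None     # present only when quantitative
    units: Optional[str] = None       # standard or descriptive text
    medline_id: Optional[str] = None  # Medline reference, if any

    def __post_init__(self):
        # A quantitative treatment has both units and a value;
        # a qualitative treatment has neither.
        if (self.value is None) != (self.units is None):
            raise ValueError("units and value must be given together")

    @property
    def is_quantitative(self) -> bool:
        return self.value is not None


heat = Agent("heat", AgentKind.PHYSICAL, constant=True)
quantitative = Treatment([heat], value=42.0, units="degrees Celsius")
qualitative = Treatment([heat])
```

Because kind is a single enumeration member, an Agent cannot be both chemical and biological, which is one way to enforce the xor.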

Detectors - Array elements:
	Top level is Array
	Array has a Manufacturer
		Manufacturer has a Protocol (can be a Medline reference).
	Array has a Version
	Array is Spotted xor Lithographic
		Spotted Arrays have Dimensions
		Spotted Arrays have Spot-spacing
		Spotted Arrays have Number of spots
		Spotted Arrays have Spot sets (I call them Spot Families in RAD schema).
		Spot set is a Gene set (set of clones for the same gene, may be replicates)
		Spot set has an Alternate identifier (accession not always available)
			Alternate identifier is clone ID (IMAGE, dbEST) xor cluster ID (Unigene but could be TIGR, DOTS) (can use to distinguish splicing variants here)
			Alternate identifier is sequence verified xor not sequence verified
			Spot set has Spots
				Spot has an Identifier (may be different from others in same set, can distinguish splicing here)
				Spot has Portion of Identifier used
					Portion is all xor insert (if clone identifier) or start-to-stop (if sequence accession); needs clarifying: at worst text, at best a controlled vocabulary
				Spot is an Element
		Lithographic Arrays have Probe sets.
			Probe set is a Gene set
			Probe sets have a Match set
				Match sets have Probe Elements
					Probe Element is an Element
					Probe Elements have Sequence (literal) 
			Probe sets have a Mismatch set
				Mismatch sets have Probe Elements
		Lithographic Arrays have Number of probe sets

Gene sets have an Identifier (EMBL/DDBJ/GenBank) 
	Identifier is for RNA (cDNA/EST) xor DNA (genomic) or Reference 
Gene sets have a Description (text)
Gene sets have a Sequence coverage
Sequence coverage has a Start (integer, relative to Identifier sequence). 	
Sequence coverage has a Stop (integer, relative to Identifier sequence). 		
Gene sets have Elements 

Elements are Typed
Types are oligo xor cDNA xor clone xor genomic (i.e., controlled vocabulary of terms. Note that an array can have different types of elements such as cDNA and genomic).
	Elements have an ArrayLocation (grid/row/column or row/column) 
	Elements have a Description (text)
	Elements are Good or Bad (flag)
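
One possible reading of the Gene set and Element terms, as a Python sketch. The class and field names are assumptions chosen to mirror the terms above, and the identifier and descriptions in the example are placeholders rather than real accessions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ElementType(Enum):
    # controlled vocabulary of element types
    OLIGO = "oligo"
    CDNA = "cDNA"
    CLONE = "clone"
    GENOMIC = "genomic"


@dataclass(frozen=True)
class ArrayLocation:
    row: int
    column: int
    grid: Optional[int] = None  # None for plain row/column layouts


@dataclass
class Element:
    type: ElementType
    location: ArrayLocation
    description: str = ""
    good: bool = True  # Good or Bad flag


@dataclass
class SequenceCoverage:
    start: int  # relative to the Identifier sequence
    stop: int   # relative to the Identifier sequence


@dataclass
class GeneSet:
    identifier: str  # EMBL/DDBJ/GenBank accession
    description: str = ""
    coverage: Optional[SequenceCoverage] = None
    elements: list = field(default_factory=list)


# Element types may differ, e.g. cDNA and genomic on the same array.
gene_set = GeneSet(
    identifier="PLACEHOLDER_ACCESSION",
    description="example gene",
    coverage=SequenceCoverage(start=1, stop=500),
    elements=[
        Element(ElementType.CDNA, ArrayLocation(row=3, column=7)),
        Element(ElementType.GENOMIC, ArrayLocation(row=1, column=2, grid=4)),
    ],
)
```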

Here is a pass at samples. It is straightforward except for the organism-dependent attributes: Anatomy, DevelopmentalStage, and Pathology. As discussed earlier, we will need to come up with ontologies for each rather than try to represent everything with the same set of terms. We can try to cover the most widely studied organisms initially, but this will require an ongoing group to guide the addition of ontologies for experiments investigating organisms not covered. Alternatively, we can identify ontologies for a model organism representing each of the major taxa (vertebrates [mammals, fish, birds, reptiles, amphibians], invertebrates [insects, worms], plants [monocots, dicots, fungi]) that can be generalized. Correct me if I'm wrong, but I don't think anatomy, developmental stage, and pathology are applicable to bacteria (archaea, etc.). Did I miss any? If we go with ontologies for the 10 areas (mammals, fish, birds, reptiles, amphibians, insects, worms, monocots, dicots, fungi), this should be manageable if we divide it up. I can do mammals - any volunteers for the others?

I refer to RAD below. It is available at http://www.cbil.upenn.edu/RAD2.

Cell Samples:
	Top level: Sample
	Sample has a Treatment (see previous mail)
	Sample has an Organism
		in RAD we call this Taxonomy and use the table from GSDB (NCBI)
	Sample has an Anatomy
		Anatomy is organism dependent
		in RAD we call this Anatomy and combine human and mouse:
			http://www.cbil.upenn.edu/anatomy.php3
	Sample has a DevelopmentalStage
		DevelopmentalStage is organism dependent 
		mouse and human stages can be obtained from:
			http://www.ana.ed.ac.uk/anatomy/database/humat/
			http://genex.hgu.mrc.ac.uk/Databases/Anatomy/MAstaging.shtml
	Sample has a Pathology
		Pathology is also organism dependent. In RAD for human we use the ICD-9 classification:
			http://www.genome.ad.jp/kegg/kegg2.html
	Sample has a Genotype
	Genotype is a strain
		strain has a name, source, catalog #, description
	Genotype is transgenic 
		transgene has an ID in some database, a reference, a description
	Genotype has an allele
		allele is wild type xor mutant
	Genotype has a marker
		marker has a chromosome, location, units, type, source
			units: centirays, centimorgan, Mb, band
			type: STS, EST, microsatellite, RFLP
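
Since Anatomy, DevelopmentalStage, and Pathology are organism dependent, one simple way to picture this is a separate vocabulary per organism group. The Python sketch below is only an assumption about how that might look: the dictionary name, the function name, and the terms themselves are placeholders, not taken from RAD or from any of the ontologies linked above.

```python
# Placeholder vocabularies; real ones would come from per-organism ontologies.
ANATOMY_BY_ORGANISM_GROUP = {
    "mammals": {"liver", "kidney", "brain"},
    "insects": {"wing", "antenna"},
}


def check_anatomy(organism_group: str, term: str) -> bool:
    """Return True if term is in the vocabulary for organism_group.

    Raises KeyError when no vocabulary has been defined for the group,
    which flags an organism area nobody has covered yet.
    """
    return term in ANATOMY_BY_ORGANISM_GROUP[organism_group]
```

The same term can be valid for one organism group and invalid for another, and a group with no vocabulary fails loudly instead of silently accepting any text.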

Detection Methods:
RNA extraction, sample labeling, hybridization methods, and reader/scanner (image) are part of Experiment. Data extraction is part of reader/scanner (image).
Arrays were covered previously.

Top level is Experiment
Experiment has a Protocol
Protocol has a description (text)
Protocol has a reference (Medline ID)
RNA Extraction is a Protocol
RNA Extraction has a yield 
	
Protocol is Sample Labeling
Protocol is Hybridization Methods