microarray-ontol-digest     Monday, June 19 2000     Volume 01 : Number 001

----------------------------------------------------------------------

Date: Tue, 6 Jun 2000 15:52:26 +0200 (DFT)
From: Paolo Romano
Subject: [microarray-ontol] Ontologies for biological samples

Dear microarray-ontologists,

though I'm not directly involved in any microarray research, and therefore can't comment very much on the current efforts of this group, I'm particularly interested in the MGED experience and in the activity of this group because of its methodological basis and the important implications it may have for my main activity, the management of the CABRI site, where the electronic catalogues of some of the most important European culture collections can be searched in an integrated way through SRS (see http://www.cabri.org/ ).

After having followed the recent discussions on the ontology for microarray information and participated in the Heidelberg meetings of the working group, I would like to point out one aspect of the above mentioned CABRI experience that could help in identifying controlled vocabularies for the description of biological samples.

CABRI was funded by the European Union from 1997 to 1999 and is now managed by an informal co-ordination of the collections involved (CBS, BCCM, ECACC, DSMZ, CABI, ICLC). One of the main objectives of the CABRI project was the quality of the information systems and the definition of related standards. So, one of the results of the project was the delivery of standard descriptions for the following biological resources: animal cell lines, fungi and yeasts, bacteria and archaea, plasmids and phages, plant cells and viruses, and DNA probes.

Starting from the data structures of the databases of the participating collections, a consensus on Minimum Data Sets (MDS) was achieved for each of the organism types. The MDSs include all the information that is needed to clearly identify the strain.
Moreover, a Recommended Data Set (RDS) was added to take into account information that could be useful but was not normally recorded in the catalogues. So, as a result of the CABRI project, an MDS and an RDS are available for each of the organism types included in the participating collections. Each item of information included in the MDSs and RDSs is described by a textual description of its contents and of the input process, i.e., data type, values to be used, reference lists, vocabularies, etc.

I'm wondering if these standards can be useful in defining the sample attributes and ontologies.... What do you think?

Being in charge of the technical developments of the CABRI site and search engine, I'm really interested in your comments, also in view of a possible integration of our system with a standard microarray database and of the definition of an XML schema/IDL interface for catalogue contents.

Bye for now.

Paolo Romano

--
Paolo Romano (paolo@ist.unige.it)
Biotechnology Department, Natl Inst. for Cancer Research
c/o Advanced Biotechnology Centre
Largo Rosanna Benzi, 10, I-16132, Genova, Italy
Tel: +39-010-5737-288    Fax: +39-010-5737-295

------------------------------

Date: Tue, 6 Jun 2000 15:56:43 +0200 (DFT)
From: Paolo Romano
Subject: [microarray-ontol] Ontologies for biological samples (Addendum)

Dear microarray-ontologists,

sorry, I forgot to mention, for those interested, the URL where the MDS can be retrieved. Here it is:

http://www.cabri.org/CABRI/home/guidelines/catalogue/CPdata.html

Best regards.

Paolo Romano

--
Paolo Romano (paolo@ist.unige.it)
Biotechnology Department, Natl Inst.
for Cancer Research
c/o Advanced Biotechnology Centre
Largo Rosanna Benzi, 10, I-16132, Genova, Italy
Tel: +39-010-5737-288    Fax: +39-010-5737-295

------------------------------

Date: Wed, 07 Jun 2000 14:15:52 -0400
From: Chris Stoeckert
Subject: [microarray-ontol] Re:

Hi Mike,

I like the last part of what you did, from "A strain has an ID" on down. I'm having trouble, though, with your descriptions of liquid culture because they go from a general term to a specific term and back to a general term, which raises issues of inheritance. For example, you say "Liquid culture contains [has?] unicellular organisms" and then "A unicellular organism has a strain." But you could say the same thing about multicellular organisms (e.g., an embryoid body) or cell lines, which also have strains (and phenotypes), and that raises the issue of how strain is different for each.

Also, "Biological sample is a liquid culture" bothers me - this means that biological samples are a subset of liquid cultures and therefore all properties of a liquid culture are held by biological samples. This doesn't work because not all biological samples have media, for example. If you mean the reverse, that is, a liquid culture is a biological sample, then your definition of "a biological sample is a collection of cells" doesn't work because a liquid culture is not a collection of cells (again, not all collections of cells have media).

As an alternative, here's what I had posted earlier (prior to the Heidelberg meeting and with some modifications) for sample and how different experiments on a sample are related.

xor = exclusive or

Note that it needs to be expanded to include the concepts of history, environment, phenotype, and others we discussed in the working group sessions. To that end, I appended environment to indicate where I would put Liquid culture.

Study (e.g., all experiments in a paper)
  Study has a Group ("multiple determination")
    experiments can belong to more than one group
    Group is Ordered xor Unordered
      Ordered is Time or Factor (e.g., drug study)
        Time is Response or Age
      Unordered is Replicates (repeats of same sample) or Type (same type of sample)
    Group has an Experiment ("single determination")
      Experiment has a Sample
        Sample has an Organism
          Organism has a Taxonomy
          Organism has an Anatomy
          Organism has a DevelopmentalStage
          Organism has a Pathology
          Organism has a Strain
            Strain has a name, source, catalog #, description
            Strain has a Genotype
              Genotype has an Allele
                Allele is wild type xor mutant
              Genotype has a Marker
                Marker has a chromosome, location, units, type, source
                  units: centirays, centimorgan, Mb, band
                  type: STS, EST, microsatellite, RFLP
              Transgenic is a Genotype
                transgene has an ID in some database, a reference, a description
        Sample has a Treatment
          Treatment is qualitative xor quantitative
            quantitation has units and values
              units are standard (found in a dictionary) xor descriptive (text)
          Treatment has Agents
            Agents are chemical, physical, xor biological
              chemical agents are found in the Merck Index (or some other reference)
              biological agents are MeSH terms or descriptive (text)
              physical agents are descriptive (text unless we can find a reference for things like stretching, heat, shock, etc.)
            Agents are constant xor variable
          Treatment has a Medline reference
        Sample has an Environment
          Liquid culture is an environment
            Liquid culture has a media
            ..

Chris

Michael Eisen wrote:

> Greetings. Haven't seen any traffic here, so to keep the ball rolling
> on the ontology effort, I thought I'd solicit some feedback from the
> group on the following things I jotted down on the train ride home.
> Does this make any sort of sense?
>
> As I sat down and tried to describe some of our yeast and human
> datasets, it occurred to me that it might be better to start with the
> base concept of a cell, and to describe all biological samples in
> terms of collections of cells. It seems to unify all of the types of
> biological sample I can think of into one conceptual framework.
> Roughly sketched out, you get something like.
>
> There are cells
>
> · A biological sample is a collection of cells {possibly only a
>   single cell}. Since collections of biological samples are also
>   collections of cells, collections of biological samples are also
>   biological samples.
>
> · An organism is a biological sample
>     o Organisms have a taxonomy. There are already standard
>       dictionaries of taxonomies
> · An anatomical unit is a collection of cells from an organism
>   and is thus a biological sample
>     o Anatomical units have an organism
>     o Anatomical units have an anatomy, which is drawn from an
>       organism.taxonomy specific dictionary, some of which exist {NOTE: for
>       the purposes of the ontology, pathological designations should be
>       considered as anatomical units since they are also collections of
>       cells from an organism}
>
> · A cell-line is a biological sample
>     o A cell-line has a source, which is an anatomical unit
> · A liquid culture is a biological sample
>     o A culture is made up of cells
>     o The cells in a liquid culture can be organisms
>     o The cells in a liquid culture can be cell-lines
>     o Liquid cultures can be mixed {i.e. contain more than one type of
>       biological sample}
>     o Liquid cultures have a media
>
> · A colony is a biological sample
>     o A colony is made up of cells
>
> As I was trying to somewhat formally describe some of our yeast
> experiments, it seemed to work nicely to place most of the other
> concepts under some notion of history, where a history is divided into
> events (i.e. perturbations of the system) and observations (i.e. any
> description of the system).
> There are obviously different types of
> events (adding something, removing something, changing the
> environment) and different types of observations (measurements {e.g.
> temperature, density}, descriptions {color, developmental stage},
> inferences {something you believe to be true based on other evidence
> but which was not explicitly observed or measured, e.g. cell-cycle
> stage for a nominally synchronized culture}). One specific type of
> event would be a sampling - an event that spawns a new biological
> sample. It clearly makes sense to describe the history once for an
> experiment sample (like a culture, or a set of animals), rather than
> individually for each sample (timepoint, individual) that is destined
> for an array. A sampling from one biological sample creates a new
> biological sample that inherits the properties of the parent.
>
> So, for example, you end up with something like
>
> · Biological sample is a liquid culture
>     o Liquid culture contains unicellular organisms
>         - Unicellular organism is an organism
>             · An organism has a taxonomy
>                 o The taxonomy of this organism is
>                   [genus=Saccharomyces, species=cerevisiae]
>         - A unicellular organism has a strain
>             · A strain has an ID = DBY328
>             · A strain has an ID Source = David Botstein, Stanford University
>             · A strain has a contact = botstein@genome.Stanford.edu
>             · A strain has a genotype
>                 o Genotype [ade2, ura3, ho, a/alpha] is
>                   from a taxonomy dependent dictionary
>             · A strain has a phenotype, which is drawn from a
>               taxonomy dependent ontology [haven't dealt with
>               this yet]
>     o Liquid culture has a media
>         - Media is YPD [description of contents here]
>         - Media has a volume [=500ml]
>         - Media has a container [Erlenmeyer flask]
>
> · Biological sample has a history
> · History is a set of events and observations
>     o Event
>         - type=start {all events have a type}
>         - time = [universal time designation], type=absolute {all events have a
>           time or a time range, which is recorded
>           in absolute or relative time}
>
>     o Observation
>         - type=measurement
>             · measurement type=temperature
>                 o time = 0, type=relative, units=hours
>                 o value=30
>                 o units=degrees Celsius
>                 o instrument=thermometer
>
>     o Observation
>         - type=measurement
>             · measurement type=cell density
>                 o time = 0, type=relative, units=hours
>                 o value=0.05
>                 o units=OD 595
>                 o instrument=spec
>
>     o Observation
>         - type=measurement
>             · measurement type=temperature
>                 o time = 1, type=relative, units=hours
>                 o value=31
>                 o units=degrees Celsius
>                 o instrument=thermometer
>
>     o Observation
>         - type = measurement
>             · measurement type=cell density
>                 o time = 1, type=relative, units=hours
>                 o value=0.25
>                 o units=OD 595
>                 o instrument=spec
>
>     o Event
>         - type = addition
>             · time = 5h relative
>             · additive = alpha-factor
>             · amount = 2.5 g
>             · form = liquid
>             · comment = to synchronize cells
>
>     o Observation
>         - type = inference
>             · inference type = cell cycle stage
>                 o time = 10h relative
>                 o value = G0
>
>     o Observation
>         - type = measurement
>             · measurement type = mitotic index
>                 o time = 10h relative
>                 o value =
>
>     o Event
>         - type = sampling
>             · time = 10h relative
>             · amount = 20 ml
>
>     o Observation
>         - type=measurement
>             · measurement type=cell density
>                 o time = 0, type=relative, units=hours
>                 o value=0.55
>                 o units=OD 595
>                 o instrument=spec
>     o Observation
>         - type=measurement
>             · measurement type=temperature
>                 o time = 0, type=relative, units=hours
>                 o value=30
>                 o units=degrees Celsius
>                 o instrument=thermometer
>     o Event
>         - type = sampling
>             · time = 10h15m relative
>             · amount = 20 ml
>
>     o Observation
>         - type=measurement
>             · measurement type=cell density
>                 o time = 0, type=relative, units=hours
>                 o value=0.58
>                 o units=OD 595
>                 o instrument=spec
>     o Observation
>         - type=measurement
>             · measurement type=temperature
>                 o time = 0, type=relative, units=hours
>                 o value=30
>                 o units=degrees Celsius
>                 o instrument=thermometer
>     o Event
>         - type =
>           sampling
>             · time = 10h30m relative
>             · amount = 20 ml
>
>     o Event
>         - type = sampling
>             · time = 10h45m relative
>             · amount = 20 ml
>
>     o Event
>         - type = sampling
>             · time = 11h relative
>             · amount = 20 ml
>
> You get the idea. You would then describe the processing of the group
> of samples up to the point where they go onto the array.
>
> Finally, it also seems that there is an abstraction of this which
> should be defined when data is submitted, and that is the notion of a
> collection of hybridized samples. Collection of hybridized samples
> would be described by a set of constants and a set of variables. These
> could either be extracted from the explicit description above, or
> defined by the experimenter. For example, for the yeast experiment
> described above, one would have constants of strain, media, and
> temperature, and experimental variables of time, density, cell-cycle
> stage, mitotic index. For a gene expression survey of a type of tumor,
> you might have constants of pathological designation and removal
> method with all sorts of clinical variables. These abstracted values
> would be the easiest way for someone accessing the data from the array
> side to understand in a simple form what was done. If they wanted to
> see more detail, they would go into the more formal description of the
> experiment and sample generation.
>
> Michael B. Eisen, Ph.D.
> Lawrence Berkeley National Labs, and
> Department of Molecular and Cellular Biology
> University of California at Berkeley
> Berkeley, CA
> Email: mbeisen@lbl.gov (or eisen@genome.stanford.edu)
> FAX: 786-549-0137
> ***During Fall 1999-Summer 2000 I am in Belgium***
> I can be reached at my usual email addresses

--
Chris Stoeckert, Ph.D.
Center for Bioinformatics, University of Pennsylvania
1316 Blockley Hall, 418 Guardian Drive
Philadelphia, PA 19104-6021
215-573-4409    215-573-3111 FAX
stoeckrt@pcbi.upenn.edu

------------------------------

Date: Wed, 14 Jun 2000 15:01:26 +0100
From: Alvis Brazma
Subject: [microarray-ontol] minimum information - second draft

Dear group,

Here is the next draft of the "minimum information" from the annotations group. If there are no comments, this, after minor editing, may go into a publication from the MGED 2 meeting. I hope there will be comments.

Best,

- Alvis Brazma

------------------------------------------------------------------

MINIMUM INFORMATION ABOUT A PUBLISHED MICROARRAY EXPERIMENT
(to ensure its interpretability and reproducibility)

Draft, June 14, based on the recommendations from MGED 1 - 2 meetings and following discussions.

---------------------------------------------------------------

GOAL:

To specify the minimum information that should be reported about a microarray-based gene expression experiment to ensure the interpretability of the results and their potential reproducibility. The background aim is to help to establish public repositories for gene expression data. Scientific journals will be encouraged to adopt editorial policies requiring data submissions to a repository, once such a repository (or repositories) is established.

INTRODUCTION:

The definition of the minimum information is aimed at cooperative authors, and is not meant as a legal document designed to close all possible loopholes in providing the information.

One of the concepts in the definition is a list of "qualifier, value" pairs, where the authors are allowed to define their own qualifiers and provide the appropriate values. The list as a whole should give enough information to interpret the particular part of the experiment, but it is left to the authors to define their own "qualifier, value" pairs.
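
To make the "qualifier, value" idea concrete, the following is a minimal sketch in Python of how such a list might sit next to an obligatory field. It is only one possible reading of the draft, not something the draft specifies: the names `SampleDescription`, `add` and `lookup` are hypothetical, and the example values ("Mus musculus", "female", "8 weeks") are made up for illustration. The qualifiers "sex" and "age" are taken from the sample list in part 3 of the draft.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SampleDescription:
    """Hypothetical container: one obligatory field plus author-defined pairs."""

    organism: str  # obligatory part (the draft suggests NCBI taxonomy)
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, qualifier: str, value: str) -> None:
        # Authors define their own qualifiers, so nothing is checked against
        # a controlled vocabulary here; only empty names are rejected.
        if not qualifier.strip():
            raise ValueError("qualifier must not be empty")
        self.qualifiers.append((qualifier.strip(), value.strip()))

    def lookup(self, qualifier: str) -> List[str]:
        # A qualifier may be given more than once, so return every value.
        return [v for q, v in self.qualifiers if q == qualifier]


# Illustrative values only (assumptions, not data from the draft).
sample = SampleDescription(organism="Mus musculus")
sample.add("sex", "female")
sample.add("age", "8 weeks")
print(sample.lookup("age"))
```

Keeping the pairs as an ordered list rather than a dictionary allows a qualifier to be repeated, and a later ontology could replace the free-text qualifier names without changing the shape of the record.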
In future these voluntary pair lists may be supplemented or substituted by respective ontologies developed in collaboration with the Ontology working group.

DEFINITION:

The minimum information about a published microarray based gene expression experiment consists of the following 6 parts:

1. The set of the hybridisation experiments as a whole
2. The arrays used in the experiment and each spot on the array
3. Sample, extract preparation and labeling
4. Hybridisation procedure
5. Expression level measurements
6. Controls

The details of 1 -- 6 are the following:

1. The set of the hybridisation experiments as a whole

a) author (submitter), laboratory, contact information, links (URL)
b) the aim of the experiment (free text description)
c) type of the experiment - one line (e.g.,
   * normal vs. diseased comparison
   * treated vs. untreated comparison
   * dose response
   * effect of gene knock-out
   * effect of gene knock-in (transgenics)
   * shock
   )
d) free text description or a link to a publication
e) list of platforms used;
f) comparative or absolute measurements,
g) single or multiple hybridisations,
   * For multiple hybridisations:
   * ordered/unordered
   * serial (yes/no)
   * type (e.g., time course)
   * grouping (yes/no)
   * type (e.g., dose response)
   * list of the samples and arrays used in the experiment and description of the relationship between them (each sample and each array should be assigned a unique id in the experiment set, all the relationships should be listed with appropriate comments)
h) quality related indicators
   * does a related peer-reviewed publication exist
   * number of replicates and description (type of replicates)

2. The array and each spot on the array

COMMENT: for commercial or other standard arrays this information will normally be provided only once (by the array provider) and referenced by the users (once a repository for such information is established).

a) array
   * array design name (e.g., "Stanford Human 10K set")
   * platform type: (spotted, synthesized, cDNA, oligos, PCR products, plasmids, colonies, etc)
   * provider (source)
   * unique ID from the provider
   * array dimensions
   * spot dimensions
   * number of columns and rows
   * substrate material (e.g., glass, nylon)
b) each element (spot) on the array
   * clone information obligatory for cDNA elements:
       clone ID, clone provider, date, availability
   * sequence information obligatory for non cDNA elements:
       sequence accession number in DDBJ, EMBL, or GenBank if known
       sequence itself (if databases do not contain it)
       number of oligos and the reference sequence (or accession number) for Affymetrix type chips, plus the oligosequences, if given
   * gene name and links to appropriate databases (e.g., SWISS-PROT, or organism specific databases), if known and relevant
   * PCR status
   * checking of the DNA quality (none, resequenced, quality check by gel separation, amount of DNA)
   * if the element can be used for normalization or control (e.g., element should have expected value)
   * position on the array

3. Sample, extract preparation and labeling

a) sample source and treatment:
   * organism (NCBI taxonomy)
   * cell source and type (if derived from primary sources (s))
   * "qualifier, value" list (following qualifiers are possible but not exclusive for this item:
       * sex
       * age
       * development stage
       * organism part (tissue)
       * animal/plant strain or line (if applicable)
       * genetic variation (gene knockout, transgenic variation, ...)
       * individual (if applicable)
       * individual genetic characteristics (disease alleles, polymorphisms, etc.)
       * disease state (or normal)
       * target cell type
       * separation technique (none, trimming, microdissection, FACS, ...)
       * cell line and source (if applicable)
       * in vivo treatments (organism or individual treatments)
       * in vitro treatments (cell culture conditions)
       * treatment type (e.g., small molecule, heat shock, cold shock, food deprivation, ...)
       * compound
       * separation technique (none, trimming, microdissection, FACS, ...)
     )
     COMMENT. The "qualifier : value" list should be sufficient to describe the sample and the treatment and is chosen by the author. With ontology for sample description being developed, the obligatory part "organism and cell source and type" will be expanded. (See Introduction to this document)
   * laboratory protocol
b) hybridisation extract preparation
   * extraction method
   * whether total RNA, mRNA, or genomic is extracted
   * amount of nucleic acids labeled
   * target amplification (RNA polymerases, PCR)
   * which label is used (e.g., Cy3, Cy5, 33P)
   * the labeling ratio (efficiency)
   * "qualifier, value" list (see Introduction)
   * laboratory protocol (free text)

4. Hybridisation procedure

   * the solution (e.g., concentration of solutes)
   * blocking agent
   * wash procedure
   * time, concentration, volume, temperature
   * description of the hybridisation instruments
   * "qualifier, value" list (see Introduction)
   * laboratory protocol (free text)

5.
Expression level measurements

a) raw data from the hybridised microarray scanning and annotation;
   a1) scanning information
       * scanning hardware
       * scanning software
       * parsed header of the TIFF file, including laser power, spatial resolution, pixel space
       * laboratory protocol (free text)
   a2) the TIFF image file from the hybridised microarray scanning;
b) image analysis and quantitation
   b1) the image analysis output (of the particular image analysis software) for each spot, for each channel;
   b2) image analysis information
       * image analysis software specification and version, availability
       * relevant parameters
c) summarized spot quantitation information
   c1) derived value summarizing each spot used by the author (e.g., a background subtracted intensity ratio typically used for Stanford or Incyte technologies);
   c2) reliability of the quantitation of the spot (either a numerical value or "unknown")
   c3) specification (formula) for c1 and c2
       c1 and c2 specification should be based on b1
d) summarized information from possible replicates
   d1) derived measurement value summarizing the replicates of the spot used by the author (e.g., mean value)
   d2) reliability indicator summarizing the replicates used by the author (e.g., standard deviation). May be "unknown"
   d3) specification (formula) for d1 and d2
       d1 and d2 specification should be based on b1 or c1-2

?QUESTION
This is an old question. It seems that most potential database users will be interested only in c) or even only in d) values. Are a) and b) therefore a necessary part of the minimum information? I've asked this question many times and always got the answer that, for reproducibility and for people to be confident in the data, a) and b) are needed. On the other hand, I know that some emerging databases plan to store only c) or even only d). Any more comments?

6.
Controls

   * control type (prelabeled and added at hybridisation [calibration of scan intensity to quantity]; added at sample labeling [quantitate sample labeling]; added at sample amplification [IVT or PCR control], ...)
   * ID for the controls
   * associated normalization type array elements

============================================================
Alvis Brazma, PhD                     Tel:+44-(0)1223 494658
EMBL Outstation -- Hinxton            Fax:+44-(0)1223 494468
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD UK        Email:brazma@ebi.ac.uk
============================================================

------------------------------

Date: Mon, 19 Jun 2000 12:14:45 -0400 (EDT)
From: Chris Stoeckert
Subject: [microarray-ontol] microarray-ontol home page

Hi all,

I've put up a bare bones home page for our working group at:

http://www.cbil.upenn.edu/Ontology/MGED_ontology.html

Please note that it contains the Biological Sample concepts we worked on in Heidelberg. Also note that it's not a listserv archive. If traffic gets to the point where that's important, we can implement something.

One thing missing that I plan to add is a list of resources that we can draw on for ontologies, so please submit these.

FYI - Peter Karp has an article in Bioinformatics (vol 16, p. 269, 2000): "An ontology for biological function based on molecular interactions."

Chris

Chris Stoeckert, Ph.D.
Center for Bioinformatics, University of Pennsylvania
1316 Blockley Hall, 418 Guardian Drive
Philadelphia, PA 19104-6021
215-573-4409    215-573-3111 FAX
stoeckrt@pcbi.upenn.edu

------------------------------

End of microarray-ontol-digest V1 #1
************************************