microarray-ontol-digest     Tuesday, March 19 2002     Volume 01 : Number 021

----------------------------------------------------------------------

Date: Fri, 8 Mar 2002 13:40:56 -0500
From: Chris Stoeckert
Subject: [microarray-ontol] Fwd: processing CV: a starting draft (proposed by E. Manduchi and S. McWeeney) (fwd)

Related efforts in the data processing working group.

Chris

> ---------- Forwarded message ----------
> Date: Fri, 8 Mar 2002 11:23:26 -0500 (EST)
> From: Elisabetta Manduchi
> To: microarray-norm@ebi.ac.uk
> Cc: microarray-ontol@ebi.ac.uk,
>     Shannon McWeeney
> Subject: processing CV: a starting draft (proposed by E. Manduchi and S.
>     McWeeney)
>
> In order to concretely start working on a "processing CV", we are
> submitting below a proposal for its main skeleton (which follows up on
> two emails we previously sent to the processing list, also attached to
> this message for convenience).
>
> We are cc-ing this email to the ontology group as suggested by Alvis.
>
> As we understand it, our goal is to develop the means to describe clearly
> all that data undergo AFTER quantification (so it is assumed that the
> image quantification software utilized has been specified) and PRIOR to
> input into downstream analyses (differential expression analyses,
> clustering, etc.). That is, we are aiming at establishing a common
> terminology which investigators could use to explain to each other what
> their preprocessing steps were in a manner which renders them
> reproducible.
>
> In our view we need:
>
> (a) To establish a CV for what the main "processing operations" are
> (e.g. filtering, normalization, etc.).
>
> (b) For each processing operation, to establish a CV for the currently
> available methods. These CVs would be aimed at capturing the main ideas
> of the methods, not the specific implementations.
> (Ideally a list of assumptions made by each method should be compiled as
> well and posted on the processing webpage.)
> The specific implementation could be illustrated by reference to a
> published or available algorithm or, when this is "ad hoc", it could be
> specified for example as suggested by Yves (his message is also
> attached).
>
> Given the above, for a clear illustration of what was done to the data
> prior to inputting them into downstream analyses, an author would have to
> specify:
> (i) which processing operations were used and in which order
> (ii) for each such processing operation, the method used
> (iii) for each such method, the implementation used.
>
> Here is our proposed CV for (a). Do those who have submitted their
> protocols to the list (e.g. besides us, John Quackenbush, Gavin Sherlock,
> and Maureen Gwinn) feel that our proposal adequately captures the
> series of steps they use? Same question for those who have not submitted
> their protocols, but who have one. If not, what changes should be made to
> this proposal? I guess once we all agree on how to structure (a) we can
> move on to (b).
>
> ---PROPOSED CV for (a): a terminology for processing operations---
>
> Here we list the proposed terms; explanations follow below.
>
> * CHOICE OF SIGNAL MEASURES
> * INDIVIDUAL FILTERING
> * CONTEXT FILTERING
> * WITHIN-SLIDE NORMALIZATION: (i) probes used, (ii) method used
> * ACROSS-SLIDE NORMALIZATION: (i) probes used, (ii) method used
> * WITHIN-SLIDE PROBE COMBINATIONS
> * ACROSS-SLIDE PROBE COMBINATIONS
> * OTHER TRANSFORMATIONS
> * DIAGNOSTICS
>
> NOTES:
> 1. The above are the basic operations. A processing protocol will consist
> of a series of (some of) the above operations, the input of each being
> the output of the previous (diagnostics is an exception, in that it might
> not transform the input but simply assess it; in this case its output
> coincides with its input).
> A given operation might be used more than once in a protocol.
>
> 2. By "probe" here we mean "spot" on a spotted array, or any of
> "probe cell", "probe pair", or "probe set" on an Affy array (in some
> cases people working with Affy do normalization at the probe cell, or
> probe pair level, rather than the probe set, and then combine the values
> for a probe set value).
>
> ---EXPLANATIONS---
>
> CHOICE OF SIGNAL MEASURES:
>
> This should specify the signal intensity used for each probe in terms of
> the measurements output for that probe by the quantification software
> used (in the case of Affy one would specify what "probe" means here: PM
> or MM, PM-MM, probe set). That is, this would be a precise formula,
> involving some specific foreground and possibly (if background
> subtraction is used) some specific background measurement. Such a formula
> would therefore clearly specify in a compact way:
> (i) which of the possibly several measurements for foreground/background
> were used
> (ii) if background subtraction occurred
> (iii) if ratios or other transformations of these measurements
> were taken to obtain the value attached to each probe
> (iv) if any thresholding occurred (the formula could include "if"
> conditions).
>
> INDIVIDUAL FILTERING
> Here a probe is flagged based on some criteria, e.g. visual inspection,
> spot was a blank or an anchor, PCR failure, flag from quantification
> software, flag based on a cutoff for a given measurement (or combination
> of measurements).
>
> Will need a controlled vocabulary to specify the flagging criteria (this
> is the (b) CV for this operation).
>
> CONTEXT FILTERING
> Here a probe is flagged based on its behaviour as part of a group of
> probes whether within an array (e.g. replicate spots on the same array
> have too high a standard deviation) or across a set of arrays.
>
> WITHIN-SLIDE NORMALIZATION
> a.
> probes used to compute the normalization function (all, housekeeping
> genes, spiked controls, etc.).
> b. how the normalization function is computed and used (e.g.
> median, lowess, print-tip lowess)
>
> ACROSS-SLIDE NORMALIZATION
> E.g. for 2-channel arrays, when scale normalization is used on top of
> within-slide normalization; or for filter or short oligo arrays when a
> baseline is used, or when a quantile normalization is used, or when
> "a la Astrand" normalization is used, etc.
> a. probes used to compute the normalization function.
> b. how the function is computed and used.
>
> WITHIN-SLIDE PROBE COMBINATIONS
> E.g. averaging (or other transformation) of genes multiply-spotted on an
> array, or averaging of clones representing the same gene, or combining
> probe pair values to yield a probe set gene expression index on an Affy
> array, etc.
>
> ACROSS-SLIDE PROBE COMBINATIONS
> E.g. averaging of values for a given probe across a group of replicate
> arrays.
>
> OTHER TRANSFORMATIONS (on individual probes or probe combinations):
> logs, etc.
>
> DIAGNOSTICS
> E.g. any plots used to assess the data at any given point of the series
> of processing transformations, also to guide decisions on what
> operation/method is most suitable as next step.
> -----
>
> Elisabetta and Shannon

Attachment: YvesMsg.txt

> Date: Thu, 21 Feb 2002 13:37:25 +0100
> From: Yves Moreau
> To: microarray-norm@ebi.ac.uk
> Subject: RE: [microarray-norm] Checking in
>
> Hi,
>
> Following up on Neil, I agree that replicates are an integral part of the
> experiment and that they add another meaningful level to the hierarchy.
> But I would think an experiment is simply any collection of arrays you
> want to analyze together (so I would not say a replicate with a second
> source of biomaterial creates the need for a separate experiment [for
> example, two tumors from different patients but with the same
> histopathology give rise to a replicate with different source of
> biomaterial but are usually seen as part of the same experiment]).
>
> I also agree with Alvis that we have to be careful to avoid making an
> extremely complex hierarchy that we are not able to understand anymore.
> To accommodate this complexity, maybe we should think in terms of a
> multidimensional structure (I hope not to lose the biologists among us
> with what follows).
>
> Let us consider the following example: the study of the development of a
> culture of human cells with a time-course experiment with 4 time-points
> using cDNA microarrays with a 22K design on 5 glass slides with 4608
> clones in duplicates (right half of the microarray is a copy of the left
> half) spotted by 12 pins. Please bear with me with the following lengthy
> description, but it is better to be concrete. To make the design of a
> single slide clearer, you can visualize it as follows (f123 is feature
> number 123 for this slide):
>
> f1    ... ... f32   f1    ... ... f32   |
> f33   ... ... f64   f33   ... ... f64   |
> ...   ... ... ...   ...   ... ... ...   | spotted by pin 1
> ...   ... ... ...   ...   ... ... ...   |
> f353  ... ... f384  f353  ... ... ...   |
>
> f385  ... ... f416  f385  ... ... f416  |
> ...   ... ... ...   ...   ... ... ...   | spotted by pin 2
> f737  ... ... f768  f737  ... ... f768  |
>
> ...   ... ... ...   ...   ... ... ...
>
> f4225 ... ... f4256 f4225 ... ... f4256 |
> ...   ... ... ...   ...   ... ... ...   | spotted by pin 12
> f4577 ... ... f4608 f4577 ... ... f4608 |
>
> Now consider an expression record as a vector of measurements together
> with some attributes. The measurements could be at this point only the
> red intensity R (however we define this), the green intensity G, and the
> M and A values (M=log2(R/G) and A=1/2*log2(R*G)). We identify a record
> uniquely by an index value. (I personally do not like to view the data as
> a matrix because the number of records can vary unpredictably during
> operations such as filtering.) The attributes are the following:
> * Slide number   Slide1-Slide20
> * Time-point     T1-T4
> * Slide design   Design1-Design5
> * Side           SideLeft-SideRight
> * Pin            Pin1-Pin12
> * Reporter       Clone1-Clone23040
> So we get records of the form:
> (19169; R19169, G19169, M19169, A19169; Slide5, T1, Design5, SideLeft,
> Pin2, Clone19169), which is the reporter spotted by pin 2 at position
> f737 on the left side of the slide with design 5 for time point 1. In the
> same way, the corresponding reporter on the right side of the slide will
> have record
> (23777; R23777, G23777, M23777, A23777; Slide5, T1, Design5, SideRight,
> Pin2, Clone19169).
>
> Now, with such records, normalization amounts to saying that a
> measurement should be corrected so that it is independent of some other
> measurement or some attribute (or any combination thereof). The data
> necessary to determine the correction can also be determined easily from
> the records.
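
The record structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Record`, the helper `center_M_by`, and the sample intensities are hypothetical and do not come from any of the messages. Only the formulas M=log2(R/G) and A=1/2*log2(R*G) and the attribute names are taken from the text. Mean centering of M within each slide and attribute group is used here as one simple example of a correction.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Record:
    """One expression record: an index, measurements, and attributes."""
    index: int
    R: float        # red intensity
    G: float        # green intensity
    slide: str      # attribute, e.g. "Slide5"
    pin: str        # attribute, e.g. "Pin2"
    reporter: str   # attribute, e.g. "Clone19169"

    @property
    def M(self) -> float:
        # Log ratio: M = log2(R/G)
        return math.log2(self.R / self.G)

    @property
    def A(self) -> float:
        # Average log intensity: A = 1/2 * log2(R*G)
        return 0.5 * math.log2(self.R * self.G)


def center_M_by(records, attribute):
    """Return {index: corrected M} after subtracting each group's mean M.

    A group is the set of records sharing the same slide and the same
    value of `attribute` (for example "pin").
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.slide, getattr(rec, attribute))].append(rec)
    corrected = {}
    for members in groups.values():
        offset = mean(r.M for r in members)
        for r in members:
            corrected[r.index] = r.M - offset
    return corrected


# Made-up intensities, for illustration only.
records = [
    Record(1, R=800.0, G=200.0, slide="Slide1", pin="Pin1", reporter="Clone1"),
    Record(2, R=400.0, G=400.0, slide="Slide1", pin="Pin1", reporter="Clone2"),
    Record(3, R=100.0, G=400.0, slide="Slide1", pin="Pin2", reporter="Clone385"),
]
corrected = center_M_by(records, "pin")
```

Because the result is keyed by record index rather than by matrix position, dropping records during filtering does not disturb the remaining ones.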
> To describe the normalization, we essentially have to describe the plot
> we use, which means giving the x axis, the y axis, the records that are
> plotted, and the function (of the x axis) we compute from this data
> (lowess curve, mean, trimmed mean, median, ...)
> Examples of normalization are
> * Lowess centering of M by A across all the points on the same slide and
>   the same pin
> * Mean centering of M (=log ratio) by pin across all the points of the
>   same slide
>
> Obviously, many other attributes can be tracked, for example the batch
> number. Another form of normalization would then be
> * Median centering of log ratio (for example, previously corrected for
>   intensity and pin effects) by batch number
>
> This somewhat convoluted explanation does not go much further than John
> Q's presentation. But the point is that with a bit of systematic
> accounting and a clear definition of the attributes, any kind of
> normalization could be decomposed into a sequence of such steps. And if
> anyone would wish to introduce a normalization for some new effect, it
> would only amount to defining a new attribute. I think other operations
> such as filtering or averaging across replicates would benefit from the
> use of such attributes.
>
> What do people think?
>
> Yves

Attachment: ShannonMsg.txt

Date: Thu, 21 Feb 2002 08:56:38 -0500 (EST)
From: Shannon McWeeney
To: Yves Moreau
Cc: microarray-norm@ebi.ac.uk
Subject: RE: [microarray-norm] Checking in

Yves,
I think you made an excellent point about being explicit about terms
and attributes.
It also is important that we keep the ontology working
group / MAGE OM in mind when we are dealing with terminology/attributes
that extend beyond normalization such as experiment, replicate, biosource
etc., instead of coming up with new terminology.

From your example, several stages/attributes jump to mind for discussion:

1. Filtering: Process of removing spots from data set due to
pre-determined criteria. This can be seen as transformation/reduction
of original data matrix. These criteria include:
   a. Flags
      There can be flags from visual inspection by the user set manually
      or by the software based on automated or user-defined criteria.
      These flags are output in the file from the quantification software
      (arrayvision, genepix etc). If flag does not equal value X, spot is
      removed.
   b. Spot quality is determined by Signal to Noise Ratio
   c. etc.

2. Transformation of signal intensity
   a. Selection of foreground signal intensity (median, mean etc.)
   b. Selection of background intensity (median, mean etc.)
   c. Correction of signal intensity
      i. Background Subtraction
         Examination of coefficient of variation (CV) plots to determine
         amount of variability introduced by background subtraction.
         Decision made by user to background subtract or not.
      ii. For 2-channel, representation of signal as ratio. Must define
         transformation of ratio (log(base), etc.). Should also specify
         components of ratio explicitly (e.g. M ratio = log2(R/G)).
      iii. Types of Protocols for correction
         STAGE: WITHIN SLIDE
         a. Global scaling using constant factor to correct for
            difference between R and G (i.e. dye effects). Must specify
            factor and how it was derived. Assumption: No intensity or
            spatial effects
         b. Global Lowess
         c. print-tip group lowess normalization
         d.
            scaled-print tip group lowess normalization
         STAGE: ACROSS SLIDE

Obviously, a lot needs to be filled in but in my mind, this is how I see
the progression: transformation of the matrix, transformation of the
signal: definition of signal, correction of signal (background
subtraction, within-slide protocols, across-slide protocols).

Suggestions??

Attachment: ElisabettaMsg.txt

Date: Thu, 21 Feb 2002 10:17:04 -0500 (EST)
From: Elisabetta Manduchi
To: microarray-norm@ebi.ac.uk
Subject: a sample protocol

Below is an outline of my sample protocol for data processing, where the
starting point is the output of image quantification software.

The protocol REFERS TO 2-channel array data and to a situation when
only one slide is used for each pair of samples (as opposed to the
multislide experiment described by Yves).

I've tried to list the steps with titles which reflect what I currently
view as the main components of the processing workflow (and which
basically agree with the workflow breakdown suggested by Shannon in the
message she sent this morning).

(I have also dealt with filter arrays and Affy data, but on a smaller
scale and so far have used a very basic protocol which I'm not reporting.
However I think it's important that we consider this kind of data as well
when developing the cv. As a pointer, there was an interesting workshop on
low level analysis of Affy data in Washington DC where various people
presented their protocols.
A report is at
http://oz.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/genelogic2001.html.)

----- SAMPLE PROCESSING PROTOCOL for 2-channel data -----

Since I typically deal with data from a variety of collaborators, who use
different image quantification software, the decision (i) on which
foreground and background measurements to use, (ii) on whether or not to
subtract background, and (iii) on what quality measures to use for
filtering, depends on which quantification software was utilized. In
steps 1 and 2 below I only provide some examples of what I've done in the
past with some specific software.

1. MEASUREMENTS USED
As explained above, this varies according to the image quantification
software utilized.
With the Spot software (see
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html)
I've been using the mean foreground measurement and the morphological
opening background measurement.
With ArrayVision, I've been using median foreground and background.
With ScanAlyze, mean foreground and median background.
Etc.

2. TO SUBTRACT OR NOT TO SUBTRACT BACKGROUND
What I've done in the past depended again on the way the particular
quantification software calculates background.
For the Spot software, I typically do subtract the morphological opening
value (in this case the background value varies from spot to spot, but
is computed from an intermediate step background image over the whole
slide, rather than from pixels in the immediate vicinity of the spot).
For software like Genepix or ArrayVision, I first try to have a look at
the diagnostic plots (see 4), both with and without background
subtraction, before making a decision on whether to subtract background
(I look for fish tails in the M vs A plots, etc.)

3. FILTERING
a.
   Filter out those spots which were assigned flags upon visual inspection
   during quantification, due to scratches, blotches, etc.
b. Possibly filter out other spots which do not pass given cutoffs for
   selected quality measures from the image quantification software
   utilized (I've been changing the criteria for the latter and the
   criteria depended on the quality measures output by the image
   quantification software, so I don't have standards yet).
c. When background is subtracted, filter out spots with negative signal.

4. DIAGNOSTIC PLOTS
Among other things (like data quality assessment), these help me in the
choice of the normalization method. I have been using the sma R-package
from the T. Speed group for this purpose (available at
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html)
combined, on occasions, with additional code I wrote to deal with specific
needs I had.
The plots below were drawn both before normalization, and
after various choices of normalization methods to guide me to the most
suitable choice. In the case in which not all spots of the array but a
selected subset needs to be used to build the normalization curve(s)
(lowess or print tip lowess), these curves (based only on the subset)
should be drawn and the selected spots plotted in a different color.

For each slide:
a. M=log_2(R/G) vs A=(log_2(R)+log_2(G))/2 plots, with fitted lowess
   (global and print tip) curves.
b. boxplots of M print tip by print tip, to compare variance from one
   print tip to another.
c. Spatial plots (with the plot.spatial function of the sma package)

If the experiment involves more than one slide (as it usually does),
besides the plots above, I also look at the boxplots of M slide by slide
to compare variance from one slide to another.

5. NORMALIZATION of M

a.
   Spots used to compute the normalization curve(s):

This depends on the nature of the biosources as well as on which genes are
monitored by the specific slide. In view of these two factors, if using
all genes on the array satisfies the required assumptions for
normalization, all genes can be used. Else controls need to be used.

b. Methods:

I've been using the methods (or slight modifications of these) from the
sma R-package of the T. Speed group (see also
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html).
i. If the layout of the array permits, I've used print-tip lowess
   normalization. Else (e.g. if genes have been spotted on the array
   on the same print-tip when in the same functional groups, or in general
   where the array layout was biased) I use global lowess.
ii. If print-tip normalization is appropriate and the print-tip M boxplots
   reveal large differences in spread, I consider scaled-print-tip
   normalization, but use this with caution, as it involves further
   assumptions.
iii. If I have a series of slides, I consider across-slide
   scale normalization (again from the sma package) on top of the
   within-slide normalization as in a and b above, but this too I use
   with caution.

6. OTHER TRANSFORMATIONS
I might or might not apply other transformations (e.g.
unlog M, etc.)
before inputting into downstream analysis programs, according to what
program I'm using for the questions at hand.

-- 
Elisabetta Manduchi

Computational Biology and Informatics Laboratory | phone: 215 573-4408
Center for Bioinformatics                        | fax: 215 573-3111
University of Pennsylvania                       |
----------------------------------------------------------------------
1428 Blockley Hall                               | email: manduchi@pcbi.upenn.edu
423 Guardian Drive                               |
Philadelphia                                     | http://www.cbil.upenn.edu/~manduchi
PA 19104-6021                                    |
----------------------------------------------------------------------

------------------------------

Date: Mon, 11 Mar 2002 21:57:33 -0500
From: Chris Stoeckert
Subject: [microarray-ontol] updated web site

Dear Group,
I've reorganized and updated the Ontology Working Group web site (see
http://www.cbil.upenn.edu/Ontology). Still needs a couple tweaks and I
haven't updated the MGED ontology yet. Let me know if you want me to
change/add/remove anything.

Cheers,
Chris

------------------------------

Date: Tue, 12 Mar 2002 08:28:47 +0100
From: Tim Eyres
Subject: Re: [microarray-ontol] updated web site

Chris Stoeckert wrote:
>
> Dear Group,
> I've reorganized and updated the Ontology Working Group web site (see
> http://www.cbil.upenn.edu/Ontology).

Thanks a lot for the work on the site, I find the whole site much more
readable.
I can't however find any links to the Ontology tools such as Oil etc.
Where can I find them or could they be added if not there? Also I think a
direct link on the homepage to the MGED Ontology on Sourceforge could be
useful. If one doesn't know anything about Sourceforge/open source it may
be that one doesn't know to look at the 'MGED open source software site'
link to get the ontology.

Tim

------------------------------

Date: Tue, 12 Mar 2002 10:11:47 -0500
From: Chris Stoeckert
Subject: Re: [microarray-ontol] updated web site

Tim,
Thanks for the feedback. The tools are at
http://www.cbil.upenn.edu/Ontology/Build_Ontology2.html#ontologytools.
I will move these to their own section on the main page. I will also add a
direct link (with appropriate text) to
http://mged.sourceforge.net/Ontologies.shtml in the MGED links section.

Cheers,
Chris

On Tuesday, March 12, 2002, at 02:28 AM, Tim Eyres wrote:

> Chris Stoeckert wrote:
>>
>> Dear Group,
>> I've reorganized and updated the Ontology Working Group web site (see
>> http://www.cbil.upenn.edu/Ontology).
>
> Thanks a lot for the work on the site, I find the whole site much more
> readable.
> I can't however find any links to the Ontology tools such as Oil etc.
> Where can I find them or could they be added if not there? Also I think
> a direct link on the homepage to the MGED Ontology on Sourceforge could
> be useful. If one doesn't know anything about Sourceforge/open source it
> may be that one doesn't know to look at the 'MGED open source software
> site' link to get the ontology.
>
> Tim

------------------------------

Date: Tue, 19 Mar 2002 12:36:25 +0000
From: Susanna Sansone
Subject: [microarray-ontol] MIAMEv1.1-MAGE-OntologyDraft2v1.0

Hi all,

the MIAME-MAGE-Ontology mapping doc will be replaced today (on the mged
web site) with an updated version, reflecting the 'change' in MIAMEv1.1.
At MGED4 it has become very clear that we need to improve the glossary, so
please send comments and suggestions!

Thanks,
Susanna

-- 
*******************************
Susanna Assunta Sansone, PhD
Microarray Informatics
EBI - The European Bioinformatics Institute
EMBL Outstation - Hinxton, Wellcome Trust Genome Campus
Cambridge CB10 1SD, UK
email: sansone@ebi.ac.uk
direct: +44 (0)1223 494 691
fax: +44 (0)1223 494 468
http://www.ebi.ac.uk/microarray
*******************************

------------------------------

End of microarray-ontol-digest V1 #21
*************************************