microarray-ontol-digest   Thursday, September 20 2001   Volume 01 : Number 012

----------------------------------------------------------------------

Date: Sun, 9 Sep 2001 21:12:33 -0400
From: Chris Stoeckert
Subject: [microarray-ontol] Fwd: Urgent message from Declan Butler, Nature

Dear Group,

This may be of interest. Note the reference to MGED!

Chris

From Declan Butler. d.butler@nature-france.com

Opinion in Nature 413,1-3 (2001) Sept. 6

> The future of the electronic scientific literature
>
> The Internet's transformation of scientific communication has only begun, but already much of its promise is within reach. The vision below may change in its detail, but experimentation and lack of dogmatism are undoubtedly the way forward.
>
> "The Internet is easier to invent than to predict" is a maxim that time has proven to be a truism. Much the same might be said of scientific publishing on the Internet, the history of which is littered with failed predictions. Technological advance itself will, of course, bring dramatic changes - and it is a safe bet that bright software minds will punctually overturn any vision. But it is becoming clear that developing common standards will be critical in determining both the speed and extent of progress towards a scientific web.
>
> 'Standards' for managing electronic content are hardly a riveting topic for researchers. But they are key to a host of issues that affect scientists, such as searching, data mining, functionality and the creation of stable, long-term archives of research results. Moreover, just as the Internet and web owe their success to agreed network protocols on which others were able to build, common standards in science will provide a foundation for a diversity of publishing models and experiments and be a better alternative to 'one-size-fits-all' solutions.
>
> This explains why the Open Archives Initiative (OAI), one of many alternatives now being offered to scientists to disseminate their work, has now broadened its focus from e-prints to promoting common web standards for digital content.
>
> The reason is that some of the most promising emerging technologies will only realize their full promise if they are adopted in a consensual fashion by entire communities. At the level of the online scientific 'paper', one major change, for example, is a shift in format to make papers more computer-readable. Searches will become much more powerful; tables and figures will cease to be flat, lifeless objects, and instead will be able to be queried and manipulated by users, using suites of online visualization and data-analysis tools.
>
> This is being made possible by Extensible Mark-up Language (XML), which allows a document to be tagged with machine-readable 'metadata', in effect converting it into a sort of mini-database. Most web pages today are coded in HTML. But this contains information only about a page's appearance. Whereas HTML specifies title and author information, for example as simple headings, such as:
>
> The future of the electronic scientific literature
> by John Smith
>
> XML specifies these in a way that computers can understand:
>
> The future of the electronic scientific literature
> John Smith.
>
> The possibilities for tagging are endless. But a major need now is for stakeholders to agree on common metadata standards for the basic structure of scientific papers. This would allow more specific queries to be made across large swathes of the literature. Indeed, what is above all hampering the usefulness of today's online journals, e-print archives and scientific digital libraries is the lack of means to federate these resources through unified interfaces.
>
> The OAI has agreed metadata standards to facilitate improved searching across participating archives, which can therefore be queried by users as if they were one seamless site. The OAI is attractive compared with centralized archives in that it allows any group to create an archive while, by agreeing common standards, they become part of a greater whole. The idea is catching on: it is supported by the Digital Library Federation (DLF), a consortium of US libraries and agencies, including the Online Computer Library Center. CrossRef, a collaboration of 78 learned society and commercial publishers, in which Nature's publishers are taking a leading role, is also actively developing common metadata standards that would allow better cross-searching of the 3 million articles they hold.
>
> Minimal options
>
> As metadata are expensive to create - it is estimated that tagging papers with even minimal metadata can add as much as 40% to costs - OAI is developing its core metadata as a lowest common denominator to avoid putting an excessive burden on those who wish to take part. But even these skimpy metadata already allow one to improve retrieval.
> This strategy is sensible as it acknowledges the fact that the value and nature of scientific information are heterogeneous.
>
> Minimal metadata will suffice for much of the literature. But there will increasingly be sophisticated and novel forms of publications built around highly organized communities working off large, shared data sets. These hubs will stand out by their large investment in rich metadata and sophisticated databases. The future electronic landscape should see such high added-value hubs evolving as overlays to vast but largely automated literature archives and databases.
>
> In such an early stage of development, it is essential to avoid dogmatic solutions. Not all papers will warrant the costs of marking up with metadata, nor will much of the grey literature, such as conference proceedings or the large internal documentation of government agencies. Many high-cost, low-circulation print journals could be replaced by digital libraries. Overheads would be kept low, and the economics argues that the cheapest means of handling the bulk of the literature may be automated digital libraries. Tags automatically generated from machine analysis of the text, for example, might minimize the quantity of manual metadata needed.
>
> Or take ResearchIndex, software produced by the computer company NEC, which builds digital libraries with little human intervention. It gathers scientific papers from around the web and, using simple rules based on document formatting, can extract the title, abstract, author and references. It interprets the latter, and can conduct automatic citation analyses for all the papers indexed. Such digital libraries will also provide new tools, for example to generate new metrics based on user behaviour, which will complement and even surpass citation rankings and impact factors.
>
> At the other end of the spectrum, specialized communities organized around shared data sets will produce highly sophisticated electronic 'publications', making it much more arduous for authors to submit information because of the amount and detail they will be required to enter in machine-readable form. Take the Alliance for Cellular Signaling (AfCS), a 10-year, multimillion-dollar, multidisciplinary project run by a consortium of 20 US institutions. It is taking a systems view of proteins involved in signalling, and integrating large amounts of data into models that will piece together how cellular signalling functions as a whole in the cell. Here, authors would be required to input information, for example, on the protocols, tissues, cell types, specific concentration factors used and the experimental outcomes. Inputs would be chosen from menus of strictly defined terms and ranges, corresponding to predefined knowledge representations and vocabularies for cell signalling.
>
> The idea is that, rather than simply producing their own data, communities instead create a vast, shared pool of well-structured information, and benefit by being able to make much more powerful queries, simulations and data mining. A series of 'molecule pages' would also pull together virtually all published data and literature about individual molecules in relation to signalling.
>
> Indeed, the high-throughput nature of much of modern research means that, increasingly, important results can be fully expressed only in electronic rather than print format. Systems biology in particular is driving research that seeks to describe the function of whole pathways and networks of genes and proteins, and to cover scales ranging from atoms and molecules to organisms.
> Increasingly, the literature and biological databases will converge to create new forms of publications. Other disciplines stand to benefit, too.
>
> Helping machines make sense of science on the web
>
> Many communities, including the AfCS, are building ontologies to underpin such schemes. Ontologies mean different things to different people, but they are in effect representations that attempt to hard-code human knowledge about a topic and the intrinsic relationships in ways that computers can use. The microarray community has been very active in this area. The Microarray Gene Expression Database group has coordinated global standards; as a result, users will be able to query vast shared data sets to find all experiments that use a specified type of biological material, test the effects of a specified treatment or measure the expression of a specified gene, and much more.
>
> One major problem is that genes and proteins often have different names in different organisms, and these often say little about what they do. To get round this problem, the Gene Ontology (GO) Consortium is creating tree-like ontologies of the 'molecular function', 'biological process' and 'cellular component' of gene products. All genes involved in 'DNA repair', for example, would be mapped to the corresponding GO term, irrespective of their name or source organism. A microarray gene-expression analysis that previously yielded only names of expressed genes would in addition carry mapped GO terms that might reveal, say, that half the genes are involved in 'protein folding'. GO terms can also help to federate disparate databases.
>
> Ontologies can also be used to tag literature automatically, and will be particularly useful for grey literature and archival material for which manual tagging was not justified.
> Papers tagged automatically with concepts can be matched, grouped into topic maps and mined. By breaking down terminological barriers between disciplines, this should also enhance interdisciplinary understanding and even serendipity. Nature is actively investigating such possibilities.
>
> The GO ontologies are still very incomplete, however, and the internal relationships need to be enriched. Moreover, caution is required against prematurely pigeon-holing gene functions, given the uncertainty of most annotations. Ontologies are also the focus of intensive research in computing science, and biology is not yet up to speed on this. Efforts such as GO and the Bio-Ontologies Consortium deserve support. Indeed, given the shortcomings of existing ontologies and controlled vocabularies, there may be a case for creating a more organized international effort to ensure economy of effort, interoperability and sharing of expertise.
>
> The advent of structured papers that are increasingly held in literature databases blurs further the distinction between the scientific paper and entries in biological databases. Already, entries in the biological databases are often hyperlinked to relevant articles in the literature and vice versa, and CrossRef is developing standards for such linking. As text becomes more structured, it will be possible to increase the sophistication of both linking, data manipulation and retrieval.
>
> Biological databases and journals have evolved relatively independently of one another. Database annotations lack the prestige of published papers; indeed, their value is largely ignored by citation metrics, and their upkeep is often regarded as a thankless task.
> Database curation has consequently lacked the quality control typical of good journals. The convergence between databases and the literature means that database annotators and curators will increasingly perform the functions of journal editors and reviewers, while publishers will develop sophisticated database platforms and tools.
>
> New ways in
>
> Database- and metadata-driven systems will drive interfaces to publications from simple keyword search models to ones that reflect the structure of biological information. Visualization tools of chromosomal location, biochemical pathways and structural interactions may become the obvious portals to the wider literature, given that there are far fewer protein structures or gene sequences than there are articles about them. As Mark Gerstein, a bioinformaticist at Yale University, points out: "One might 'fly through' a large three-dimensional molecular structure, such as the ribosome, where various surface patches would be linked to publications describing associated chemical binding studies."
>
> Future electronic literature will therefore be much more heterogeneous than the current journal system, and dogmatic solutions should therefore be resisted. It is significant and sensible that both CrossRef and OAI have made key strategic choices favouring openness and adaptability. They seek to federate distributed actors rather than to create centralized structures. They also make their work independent of the type of content, which makes it flexible enough to incorporate and link seamlessly not just papers but news, books and other media.
>
> Crucially, both OAI and CrossRef have also decided to build systems independent of the economic mechanisms surrounding that content.
> Many publishers, in particular some learned societies, may be willing to make their content free, perhaps after a certain delay. Others are exploring business models where authors or sponsors pay, which would allow free access to articles on publication. The open technological frameworks also mean that particular communities, such as scientists with specific metadata needs for their discipline, are free to build in more complex data structures; the higher overheads incurred may require charging for added-value services.
>
> Neutrality
>
> The OAI and CrossRef strategies therefore differ fundamentally from more centralized systems proposed by PubMed Central (PMC), operated by the US National Library of Medicine, and E-Biosci, being developed by the European Molecular Biology Organization.
>
> But PMC and E-Biosci highlight the urgent need to index the full text of papers and their metadata and not just abstracts, as is the practice of PubMed and other aggregators. Services that require publishers to deposit full text only for indexing and improving search are useful.
>
> Unfortunately, PMC, unlike E-Biosci, confounds this primarily technological issue with an economic one, by requiring that all text be made available free after, at most, one year. It is regrettable that PMC has not in the first instance sought full-text indexing itself as a goal, as this in itself would be an immediate boon to researchers. It would also probably have been more successful in attracting publishers.
>
> The reality is that all of those involved in scientific publishing are in a period of intense experimentation, the outcome of which is difficult to predict. Getting there will require novel forms of collaboration between publishers, databases, digital libraries and other stakeholders.
> It would be unwise to put all of one's eggs in the basket of any one economic or technological 'solution'. Diversity is the best bet.
>
> This Opinion article has been inspired by many of the contributions to Nature's web forum on "Future e-access to the primary literature". The current table of contents of the forum can be found at the following address: http://www.nature.com/nature/debates/e-access/
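
The article's point that XML tagging turns a paper into "a sort of mini-database" can be illustrated with a minimal Python sketch. The element names used here (paper, title, author, firstname, surname) are assumptions chosen for illustration; they are not taken from the article or from any OAI or CrossRef schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names are illustrative assumptions,
# not a real metadata standard.
PAPER_XML = """
<paper>
  <title>The future of the electronic scientific literature</title>
  <author>
    <firstname>John</firstname>
    <surname>Smith</surname>
  </author>
</paper>
"""


def describe(xml_text: str) -> dict:
    """Pull the title and author surnames out of a tagged paper."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title"),
        "surnames": [a.findtext("surname") for a in root.findall("author")],
    }


record = describe(PAPER_XML)
print(record["title"])
print(record["surnames"])
```

Because the title and surname are labelled fields rather than headings, a program can ask for them by name, which is roughly what a query across many tagged papers would rely on.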
------------------------------

Date: Thu, 20 Sep 2001 18:07:10 -0400
From: Chris Stoeckert
Subject: [microarray-ontol] BiosourceOntologyEntry instances

Dear Group,

Have started thinking about adding instances of the different classes of BiosourceOntologyEntry. First, I am going to add attributes to the DatabaseEntry class based on the MAGE class diagrams.

class: DatabaseEntry
  attribute: accession
  attribute: accession_version
  attribute: URI

Then for the instance of BiosourceOntologyEntry:Organism

NCBI_taxonomy:
  value: Homo sapiens
  description: human species
  ID: 9606
  database entry
    accession 9606
    accession_version
    URI: http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wgetorg?mode=Info&id=

In this case ID and accession are equivalent but may not always be. Accession is being used here as what you stick in the URI. Couldn't find a version. For this URI, you stick the accession on the end but may want to use a placeholder to cover cases where the term added is internal. Is there a standard URI syntax for a placeholder?

This would be an instance of OrganismPart

FlyBase:
  value: endocrine system
  description: organ system
  ID: FBcv0006479
  database entry
    accession FBcv0006479
    accession_version 433
    URI: http://flybase.bio.indiana.edu:82/.bin/cvreport.html?id=

This would be another instance of OrganismPart

CBIL_CV:
  value: pancreas
  description: source=GXD
  ID: 56
  database entry
    accession 56
    accession_version
    URI: http://www.cbil.upenn.edu/servlet/allgenes-dev/servlet?page=anat&id=

CBIL_CV can also be used for TargetedCellType so I'm using the same class instance (CBIL_CV) but different attributes. Does this make sense?

CBIL_CV:
  value: beta cell
  description: GXD
  ID: 60
  database entry
    accession 60
    accession_version
    URI: http://www.cbil.upenn.edu/servlet/allgenes-dev/servlet?page=anat&id=

Here's an instance of DevelopmentalStage where the accession is the html page and the URI is the directory path. Does this make sense?

MouseAnatomicalDictionary:
  value: Stage 28
  description: Postnatal development
  ID:
  database entry
    accession stage28.shtml
    accession_version
    URI: http://www.informatics.jax.org/mgihome/GXD/AD/

Finally, I will need to create instances of the DatabaseEntry class for each of these instances of the different BiosourceOntologyEntry classes. Then I can link the instances together using the attribute has_database_entry.

Attribute values will not be specified in the ontology initially because I don't want to deal with that now. They can be which would enforce that only those terms could be used but runs into problems such as being in synch with the original data source.

Chris

------------------------------

End of microarray-ontol-digest V1 #12
*************************************
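
The DatabaseEntry attributes and the "stick the accession on the end" convention from the BiosourceOntologyEntry message can be sketched in Python as follows. This is only one possible reading: the class layout, the field name source and the method name resolved_uri are assumptions made for illustration and are not part of the MAGE class diagrams.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatabaseEntry:
    """The three attributes listed in the message."""
    accession: str
    uri: str                                 # URI prefix as given in the message
    accession_version: Optional[str] = None  # None where no version was found

    def resolved_uri(self) -> str:
        # Hypothetical helper: append the accession to the end of the URI.
        return self.uri + self.accession


@dataclass
class BiosourceOntologyEntry:
    source: str       # assumed field name, e.g. "NCBI_taxonomy"
    value: str
    description: str
    id: str           # may be empty, as in the DevelopmentalStage instance
    has_database_entry: DatabaseEntry


human = BiosourceOntologyEntry(
    source="NCBI_taxonomy",
    value="Homo sapiens",
    description="human species",
    id="9606",
    has_database_entry=DatabaseEntry(
        accession="9606",
        uri="http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wgetorg?mode=Info&id=",
    ),
)

stage = BiosourceOntologyEntry(
    source="MouseAnatomicalDictionary",
    value="Stage 28",
    description="Postnatal development",
    id="",
    has_database_entry=DatabaseEntry(
        accession="stage28.shtml",
        uri="http://www.informatics.jax.org/mgihome/GXD/AD/",
    ),
)

print(human.has_database_entry.resolved_uri())
print(stage.has_database_entry.resolved_uri())
```

Plain concatenation covers both the query-string case (Organism) and the directory-path case (DevelopmentalStage). It does not address terms that are internal and have no external page, which is the case the placeholder question is about.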