1. Introduction
Rather than gathering information about a particular protein from books, scientists can access a three-dimensional picture that pools the latest information on the molecule, gathered by researchers around the world and posted on the Internet. The picture is constructed from experimental data obtained by X-ray crystallography and Nuclear Magnetic Resonance (NMR). This information is currently stored in the Protein Data Bank (PDB) format and, since 1999, has been managed by the Research Collaboratory for Structural Bioinformatics (RCSB: www.rcsb.org).
The Bioinformatics students on our team aim to create a new format, the “Universal Research Interchange Format” (URI), that improves access to the information currently stored in PDB files by giving scientists a user-friendly interface for queries, reports and data entry.
2. Background Research
Over the more than 30 years during which information has been collected in the PDB, the content, level of detail and structure of PDB data records have changed. In the early years, the information in PDB files was provided as unstructured text. Later, many new data types and remarks were added, and the information became more structured, taking a form with over 30 fixed-format data record types.
The most recent version of the PDB file format is described in a document that can be found at http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html.
The Data Uniformity Project (http://www.rcsb.org/pdb/nar_pdb_du.pdf) chose a new file format, mmCIF (Macromolecular Crystallographic Information File). To correct missing, erroneous and inconsistently reported data, nomenclature and functional annotation, the project used a file-by-file approach (evaluating the complete entry for every structure in the PDB archive) as well as a record-by-record approach (concentrating on specific PDB records across the complete archive). The mmCIF format uses name-value pairs, with the names described in a dictionary, and can be converted back to the PDB format if needed.
The search methods currently available on the RCSB web site do not yet take advantage of the results of the Data Uniformity Project; they still use the old PDB files. The online publication “Nucleic Acids Research” (2004, Vol. 32, available at http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D223?ijkey=53RmSrjFHWdLg&keytype=ref%202004) states that “the success of the data uniformity project and new information technologies have enabled the development of a newly re-engineered system. At the time of this writing (September 2003), the PDB has an internal alpha release of this new system and it is anticipated that a public beta release will be available in the first quarter of 2004. Details will be made available on the PDB website.”
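To illustrate the name-value layout, the following is a minimal Python sketch that reads simple mmCIF items into a nested dictionary. It is not a full mmCIF parser: it assumes one `_category.item value` pair per line and skips `loop_` tables and multi-line values. The function name and the sample values are hypothetical and serve only as an illustration.

```python
# Hypothetical sample; the values are invented for illustration only.
SAMPLE_MMCIF = """\
data_EXAMPLE
_entry.id EXAMPLE
_cell.length_a 50.0
_cell.length_b 60.0
_struct.title 'Hypothetical example protein'
"""


def parse_simple_mmcif_items(text):
    """Collect single-line `_category.item value` pairs into a nested dict.

    Assumption: loop_ blocks and multi-line (semicolon-delimited) values
    are out of scope and are silently skipped.
    """
    data = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith("_"):
            continue  # data_ header, comment, blank line, or loop_ row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a bare name, e.g. a column header inside loop_
        name, value = parts[0], parts[1].strip()
        # Remove a matching pair of surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        category, _, item = name[1:].partition(".")
        data.setdefault(category, {})[item] = value
    return data


items = parse_simple_mmcif_items(SAMPLE_MMCIF)
print(items["cell"]["length_a"])
```

Because every value is addressed by a dictionary-defined name rather than by column position, a lookup such as `items["struct"]["title"]` does not depend on how the line is laid out.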
3. Problem Definition
From the section above, it can be concluded that a new query system might soon be released on the PDB site to take advantage of the results of the Data Uniformity Project and its mmCIF format. However, we did not find any mention of work being done to create a user-friendly interface for searching within a given PDB file, or for data entry and editing.
Many biologists still work with files in the PDB format, which today remain plain-text (ASCII) files made up of 80-character records. Each data item has to be entered within a fixed range of character positions in one of the PDB record types. As a result, scientists working directly with PDB files find it difficult to search for information within the contents of the files.
The PDB format was developed many years ago. Since then, computer programming technology has improved significantly, especially with respect to database formats. The URI format that our group proposes should give scientists access to protein information through a user-friendly interface, rather than requiring them to search through PDB format files and extract the pertinent information manually. Our data model will organize and represent protein data more efficiently.
The new URI format will be based on XML (Extensible Markup Language). In its short lifetime, XML has proven to be very useful for representing data, exchanging data between environments and applications, and transporting data over distributed networks. XML is particularly useful for hierarchically structured data, and XML linking mechanisms support network data representations. The new URI format takes advantage of these XML features for encoding hierarchically organized information. By providing an explicit representation of knowledge in a particular domain (protein sequences), URI capitalizes on both the general strengths of XML and the domain knowledge that it represents.
Currently, only one XML-based format, BSML (Bioinformatic Sequence Markup Language), exists for the life science community. BSML was created as an evolving public domain standard for the bioinformatics community. However, BSML is not intended to be the answer to every issue of knowledge representation in the life sciences, and at present only data from a few databases (GenBank, EMBL, Ensembl, Swiss-Prot and DDBJ) can be converted to BSML automatically. No converter is available for PDB. The objective of the URI format is not only to translate the PDB format into a new XML format, but also to improve several aspects of PDB. Examples of possible improvements are:
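The fixed character positions can be made concrete with a short Python sketch that slices one ATOM record into named fields. The column ranges follow the ATOM record layout of the PDB format guide; the sample line is assembled from fixed-width pieces with invented values, and the names `ATOM_FIELDS` and `parse_atom_record` are our own illustrative choices, not part of any existing tool.

```python
# Field name -> (start, end, converter), using 0-based Python slice indices
# that correspond to the 1-based column ranges of the ATOM record.
ATOM_FIELDS = {
    "serial": (6, 11, int),         # columns 7-11
    "atom_name": (12, 16, str),     # columns 13-16
    "res_name": (17, 20, str),      # columns 18-20
    "chain_id": (21, 22, str),      # column 22
    "res_seq": (22, 26, int),       # columns 23-26
    "x": (30, 38, float),           # columns 31-38
    "y": (38, 46, float),           # columns 39-46
    "z": (46, 54, float),           # columns 47-54
    "occupancy": (54, 60, float),   # columns 55-60
    "temp_factor": (60, 66, float), # columns 61-66
    "element": (76, 78, str),       # columns 77-78
}

# Illustrative record built from fixed-width pieces (values are made up).
SAMPLE_LINE = (
    "ATOM  " + "    1" + " " + " N  " + " " + "MET" + " " + "A" + "   1"
    + " " + "   " + "  27.340" + "  24.430" + "   2.614" + "  1.00"
    + "  9.67" + " " * 10 + " N" + "  "
)


def parse_atom_record(line):
    """Slice an ATOM/HETATM line into a dict of typed fields."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    # Pad short lines so that every slice below is well defined.
    width = max(end for _, end, _ in ATOM_FIELDS.values())
    line = line.ljust(width)
    return {
        name: convert(line[start:end].strip())
        for name, (start, end, convert) in ATOM_FIELDS.items()
    }


atom = parse_atom_record(SAMPLE_LINE)
print(atom["res_name"], atom["x"], atom["y"], atom["z"])
```

The sketch shows why manual searching is error-prone: a value shifted by a single character falls into the wrong field or fails to convert.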
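As a rough illustration of hierarchical encoding, the sketch below uses Python's standard `xml.etree.ElementTree` module to nest atoms inside residues, residues inside chains, and chains inside a protein. The element and attribute names (`protein`, `chain`, `residue`, `atom`) are placeholders we chose for this example; they are not the URI schema, which has not been designed yet. The atom values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one dict per atom (values are made up).
SAMPLE_ATOMS = [
    {"chain_id": "A", "res_name": "MET", "res_seq": 1,
     "atom_name": "N", "x": 27.34, "y": 24.43, "z": 2.614},
    {"chain_id": "A", "res_name": "MET", "res_seq": 1,
     "atom_name": "CA", "x": 26.266, "y": 25.413, "z": 2.842},
    {"chain_id": "A", "res_name": "ALA", "res_seq": 2,
     "atom_name": "N", "x": 25.112, "y": 24.88, "z": 3.649},
]


def atoms_to_xml(protein_id, atoms):
    """Build a protein > chain > residue > atom element tree."""
    root = ET.Element("protein", id=protein_id)
    chains = {}    # chain id -> <chain> element
    residues = {}  # (chain id, residue number) -> <residue> element
    for atom in atoms:
        chain_id = atom["chain_id"]
        if chain_id not in chains:
            chains[chain_id] = ET.SubElement(root, "chain", id=chain_id)
        key = (chain_id, atom["res_seq"])
        if key not in residues:
            residues[key] = ET.SubElement(
                chains[chain_id], "residue",
                name=atom["res_name"], number=str(atom["res_seq"]),
            )
        ET.SubElement(
            residues[key], "atom", name=atom["atom_name"],
            x=str(atom["x"]), y=str(atom["y"]), z=str(atom["z"]),
        )
    return root


root = atoms_to_xml("EXAMPLE", SAMPLE_ATOMS)
print(ET.tostring(root, encoding="unicode"))
```

With this nesting, a question such as "all atoms of residue 1 in chain A" becomes a path lookup, for example `root.findall("chain[@id='A']/residue[@number='1']/atom")`, instead of a scan over fixed-width text lines.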
4. Functional Requirements
From the discussions we had with our project’s clients, Dr. Martin and other biological researchers at the University of Rhode Island, we identified the following possible queries for our system to address:
In order for our URI system to satisfy the above user requirements, we will need to create the following software components:
5. Use Cases
Here are a few examples of Use-Case scenarios:
Scenario: Search for a protein with three given amino acids in a row.
Scenario: Find phi or psi angles in a protein.
Scenario: Find previous revisions of a protein.
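The first two scenarios can be sketched in a few lines of Python. The sketch assumes that the residue names and the backbone atom coordinates have already been extracted from a file; the function names are hypothetical. The phi angle of residue i is taken as the dihedral of the atoms C(i-1), N(i), CA(i), C(i), and psi as the dihedral of N(i), CA(i), C(i), N(i+1), which is the standard backbone definition.

```python
import math


def find_consecutive_residues(sequence, motif):
    """Return the start indices where `motif` occurs as a run in `sequence`."""
    n = len(motif)
    return [
        i for i in range(len(sequence) - n + 1)
        if list(sequence[i:i + n]) == list(motif)
    ]


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points in 3D."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """Phi and psi of one residue from its backbone atoms and neighbours."""
    phi = dihedral_degrees(prev_c, n, ca, c)
    psi = dihedral_degrees(n, ca, c, next_n)
    return phi, psi


# Usage with an invented residue sequence (three-letter codes).
residues = ["MET", "GLY", "ALA", "SER", "GLY", "ALA", "SER"]
print(find_consecutive_residues(residues, ["GLY", "ALA", "SER"]))
```

The third scenario (previous revisions) depends on how revision records are stored, so it is not sketched here.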
6. Constraints
At this stage we have not yet decided in detail on all the software resources we will need in order to accomplish our project’s goals. We are currently considering the following software.
7. Project Milestones
Task # | Description | Members Leading Effort | Target Completion |
---|---|---|---|
1 | Determine the characteristics of the currently used system. Define existing problems. Determine user requirements. | All Team | Weeks 1-3 |
2 | Once the queries that URI should address have been determined, research the PDB data structure and decide what type of relational database to use. | Keeley, Catalina | Weeks 4-5 |
3 | Design the database. Create an XML DTD for our URI format. | Anita, Jing | Weeks 4-5 |
4 | If time permits, create a converter from PDB to URI. | Catalina, Jing | Weeks 5-8 |
5 | Create the XML-based URI according to the DTD. | Anita, Jing | Weeks 6-7 |
6 | Map XML to the relational database. | Anita, Jing | Weeks 8-9 |
7 | Create XML interfaces with HTML. | Mike | Weeks 10-12 |
8 | Testing. Make necessary adjustments to the interface based on test results. | All Team | Week 13 |
9 | Create documentation and user’s guide. | All Team | Week 14 |
10 | Presentation. | All Team | Week 15 |
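The mapping of XML to a relational database could look roughly like the following Python sketch, which uses the standard `sqlite3` and `xml.etree.ElementTree` modules. SQLite is used here only because it needs no setup; we have not yet chosen a database. The single `atom` table, its column names, and the XML element names are all assumptions made for illustration, and the sample values are invented.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML in a placeholder layout (not the final URI format).
SAMPLE_XML = """
<protein id="EXAMPLE">
  <chain id="A">
    <residue name="MET" number="1">
      <atom name="N" x="27.34" y="24.43" z="2.614"/>
      <atom name="CA" x="26.266" y="25.413" z="2.842"/>
    </residue>
  </chain>
</protein>
"""


def load_xml_into_db(xml_text, conn):
    """Flatten protein > chain > residue > atom into rows of one table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS atom (
               protein_id TEXT, chain_id TEXT,
               res_name TEXT, res_number INTEGER,
               atom_name TEXT, x REAL, y REAL, z REAL)"""
    )
    root = ET.fromstring(xml_text.strip())
    rows = []
    for chain in root.findall("chain"):
        for residue in chain.findall("residue"):
            for atom in residue.findall("atom"):
                rows.append((
                    root.get("id"), chain.get("id"),
                    residue.get("name"), int(residue.get("number")),
                    atom.get("name"), float(atom.get("x")),
                    float(atom.get("y")), float(atom.get("z")),
                ))
    conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
load_xml_into_db(SAMPLE_XML, conn)
for row in conn.execute(
    "SELECT atom_name, x, y, z FROM atom WHERE res_name = ?", ("MET",)
):
    print(row)
```

A normalized design would likely split chains, residues and atoms into separate tables linked by keys; the flat table above only demonstrates the direction of the mapping.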
8. Project Deliverables
We realize that our project’s goals might be too ambitious for the limited amount of time we have available until the end of the semester.
While we envision our URI system to be ultimately applied generally to any PDB file, we shall concentrate our current work on only one PDB file representing a specific protein.
We will fully document our work and present it at the end of the semester.
We hope to design the database so that it organizes the data from the existing PDB file into relational tables, and to create all the necessary XML interfaces for user queries, so that we can present a working prototype.
If time permits, we might attempt the following: