Requirement Document

1. Introduction

Rather than gathering information about a particular protein from books, scientists can access a three-dimensional picture that pools the latest information on the molecule, gathered by researchers around the world and posted on the Internet. The picture is constructed from experimental data obtained by X-ray crystallography and Nuclear Magnetic Resonance (NMR). This information is currently stored in the Protein Data Bank (PDB) format, and since 1999 it has been managed by the Research Collaboratory for Structural Bioinformatics (RCSB: www.rcsb.org).

The objective of the Bioinformatics students in our team is to create a new format named “Universal Research Interchange Format” (URI) that will improve access to the information currently stored in PDB files by providing scientists with a user-friendly interface for queries, reports and data entry.


2. Background Research

Over more than 30 years of collecting the information stored in the PDB, the content, level of detail and structure of PDB data records have changed. In the early years, the information in PDB files was provided as unstructured text. Later, many new data types and remarks were added, and the information became more structured, settling into a form with over 30 fixed-format record types.

The most recent version of the PDB file format is described in a document that can be found at http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html.

The Data-Uniformity-Project (http://www.rcsb.org/pdb/nar_pdb_du.pdf) adopted a new file format, mmCIF (Macromolecular Crystallographic Information File). To correct missing, erroneous and inconsistently reported data, nomenclature and functional annotation, the project used a file-by-file approach (evaluating the complete entry for every structure in the PDB archive) as well as a record-by-record approach (concentrating on specific PDB records across the complete archive). The new mmCIF format uses name-value pairs, with the names described in a dictionary, and can be converted back to the PDB format if needed.
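To make the name-value idea concrete, the following is a minimal sketch in Python of reading simple (non-looped) items from mmCIF-style text. It is an illustration under assumptions, not a full mmCIF parser: it ignores loop_ tables and multi-line values, the function name parse_simple_items is hypothetical, and the sample entry (including the id XXXX) is made up.

```python
import shlex

# Made-up sample text; XXXX is a placeholder, not a real PDB entry.
SAMPLE_MMCIF = """\
data_XXXX
_entry.id XXXX
_struct.title 'Hypothetical example protein'
_cell.length_a 50.000
"""


def parse_simple_items(text: str) -> dict[str, str]:
    """Collect single-line '_category.item value' pairs into a dictionary.

    Simplification: loop_ blocks and multi-line values are skipped.
    """
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue  # data_ headers, comments, blank lines
        parts = shlex.split(line)  # handles quoted values
        if len(parts) == 2:
            name, value = parts
            items[name] = value
    return items


if __name__ == "__main__":
    for name, value in parse_simple_items(SAMPLE_MMCIF).items():
        print(name, "=", value)
```

Each name combines a category and an item (for example, _cell.length_a), which is what allows a dictionary to describe every name in one place.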

The search methods currently available on the RCSB web site do not yet take advantage of the results of the Data-Uniformity-Project; they still rely on the old PDB files. An article in the journal “Nucleic Acids Research” (2004, Vol. 32, available at http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D223?ijkey=53RmSrjFHWdLg&keytype=ref%202004) states: “the success of the data uniformity project and new information technologies have enabled the development of a newly re-engineered system. At the time of this writing (September 2003), the PDB has an internal alpha release of this new system and it is anticipated that a public beta release will be available in the first quarter of 2004. Details will be made available on the PDB website.”


3. Problem Definition

Although the above section suggests that a new query system may soon be released on the PDB site to take advantage of the Data Uniformity Project results and the mmCIF format, we found no mention of any work on a user-friendly interface for searching within a given PDB file or for data entry and editing.

Many biologists still work with files in the PDB format. At present, PDB files are still plain text (ASCII) files made up of 80-character records, in which each data item must be entered within a fixed range of character positions in one of the PDB record types. As a result, scientists working directly with PDB files find it difficult to search for information within them.
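As an illustration of what "a fixed range of character positions" means in practice, here is a minimal Python sketch that reads one ATOM record by slicing columns. The column ranges follow the published PDB format guide; the Atom class, the parse_atom_line function and the sample coordinates are our own illustrative choices, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


# Illustrative record, assembled field by field so the columns are visible.
SAMPLE_LINE = (
    "ATOM  "        # columns  1-6   record name
    "    1"         # columns  7-11  atom serial number
    " "             # column  12     unused
    " N  "          # columns 13-16  atom name
    " "             # column  17     alternate location
    "MET"           # columns 18-20  residue name
    " "             # column  21     unused
    "A"             # column  22     chain identifier
    "   1"          # columns 23-26  residue sequence number
    "    "          # columns 27-30  insertion code + unused
    "  27.340"      # columns 31-38  x coordinate
    "  24.430"      # columns 39-46  y coordinate
    "   2.614"      # columns 47-54  z coordinate
    "  1.00"        # columns 55-60  occupancy
    "  9.67"        # columns 61-66  temperature factor
    "          "    # columns 67-76  unused
    " N"            # columns 77-78  element symbol
    "  "            # columns 79-80  charge
)


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM record into typed fields (0-based slices)."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        element=line[76:78].strip(),
    )


if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_LINE))
```

Every value depends on its exact position in the line, so a single misplaced space changes the meaning of the record. This is the kind of detail a friendlier interface would hide from the user.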

The PDB format was developed many years ago. Since then, computer programming technology has improved significantly, especially with respect to database formats. The URI format that our group proposes should give scientists access to protein information through a user-friendly interface, rather than requiring them to search through PDB format files and manually extract the pertinent information. Our data model will organize and represent protein data more efficiently.

The new URI format will be based on XML (Extensible Markup Language). In its short lifetime, XML has proven to be very useful for representing data, exchanging data between environments and applications, and transporting data over distributed networks. XML is particularly useful for hierarchically structured data, and XML linking mechanisms support network data representations. The new URI format takes advantage of XML features for encoding hierarchically organized information. By providing an explicit representation of knowledge in a particular domain (protein sequences), URI capitalizes on both the general strengths of XML and the domain knowledge that URI represents.
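The hierarchical encoding described above can be sketched with Python's standard xml.etree.ElementTree module. The URI format has not been designed yet, so everything about the layout here is an assumption made for illustration: the "protein", "atoms" and "atom" element names, the attribute names, the build_uri_element function and the placeholder id XXXX. Only the "PDBid" and "sequence" tag names come from the use cases in this document.

```python
import xml.etree.ElementTree as ET


def build_uri_element(pdb_id, sequence, atoms):
    """Build a hypothetical URI <protein> element.

    atoms is an iterable of (serial, name, chain, element, x, y, z) tuples.
    """
    protein = ET.Element("protein")
    ET.SubElement(protein, "PDBid").text = pdb_id
    ET.SubElement(protein, "sequence").text = sequence
    atoms_el = ET.SubElement(protein, "atoms")
    for serial, name, chain, element, x, y, z in atoms:
        ET.SubElement(
            atoms_el,
            "atom",
            {
                "serial": str(serial),
                "name": name,
                "chain": chain,
                "element": element,
                # Three decimals, matching the precision of PDB coordinates.
                "x": f"{x:.3f}",
                "y": f"{y:.3f}",
                "z": f"{z:.3f}",
            },
        )
    return protein


if __name__ == "__main__":
    # Made-up values for illustration only.
    element = build_uri_element(
        "XXXX", "MKV", [(1, "N", "A", "N", 27.34, 24.43, 2.614)]
    )
    print(ET.tostring(element, encoding="unicode"))
```

Because each value carries its own name, a query tool can locate data by tag or attribute instead of by character position.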

Currently, the only XML-based format for the life science community is BSML (Bioinformatic Sequence Markup Language). BSML was created as an evolving public domain standard for the bioinformatics community. However, BSML is not intended to be the answer to every issue of knowledge representation in the life sciences, and at present only data from a few databases (GenBank, EMBL, Ensembl, Swiss-Prot and DDBJ) can be automatically converted to BSML format. No converter is available for PDB. The objective of the URI format is not only to translate the PDB format into a new XML format, but also to improve several aspects of PDB. Examples of possible improvements are:


4. Functional Requirements

From the discussions we had with our project’s clients, Dr. Martin and other biological researchers at the University of Rhode Island, we identified the following possible queries for our system to address:

  1. Where in the amino acid sequence are the disulfide bonds, alpha helices, and beta-pleated sheets located?
  2. Where in the protein are the Phi and Psi angles?
  3. When was the last revision and what type of information did it consist of?
  4. What are the coordinates of the last revision?
  5. Where in the amino acid sequence are the binding sites?
  6. Which reported regions of the protein’s structure are best defined (based on the error values associated with the data)?
  7. Which prolines are cis? Which prolines are trans?
  8. Are there any crystallographic waters? If so, where are they in the amino acid sequence?
  9. What is all the data concerning only the heavy chain or only the light chain?
  10. What is the data concerning only oxygen (or carbon, nitrogen, etc.) atoms?
  11. What does the PDB file of this molecule look like?
We plan to have further discussions on this topic, and we might adjust our list accordingly.
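Queries 9 and 10 above amount to filtering atoms by chain or by element. The following Python sketch shows how such a filter might look against an XML file, using the limited XPath support in the standard xml.etree.ElementTree module. The document layout, the attribute names, the chain labels H and L, the select_atoms function and the placeholder id XXXX are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical URI fragment; values are made up for illustration.
SAMPLE_XML = """\
<protein>
  <PDBid>XXXX</PDBid>
  <atoms>
    <atom serial="1" name="N" chain="H" element="N"/>
    <atom serial="2" name="CA" chain="H" element="C"/>
    <atom serial="3" name="O" chain="H" element="O"/>
    <atom serial="4" name="O" chain="L" element="O"/>
  </atoms>
</protein>
"""


def select_atoms(root, chain=None, element=None):
    """Return <atom> elements matching the given chain and/or element.

    The values are placed directly into the path, so they are assumed
    to be simple identifiers without quote characters.
    """
    path = ".//atom"
    if chain is not None:
        path += f"[@chain='{chain}']"
    if element is not None:
        path += f"[@element='{element}']"
    return root.findall(path)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_XML)
    for atom in select_atoms(root, chain="H", element="O"):
        print(atom.attrib)
```

Leaving both arguments unset returns every atom, so the same function could also serve a "show all data" request.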

In order for our URI system to satisfy the above user requirements, we will need to create the following software components:

  1. Create a software program to find hydrogen information from PDB.
  2. Create a software program to combine the hydrogen information with other useful information to format the new URI format.
  3. Create the DTD (Document Type Definition) for our URI format.
  4. Create the URI format and map it to a relational database in which the URI files will be stored.
  5. Create a new interface based on the required queries. The query result is an XML document.
  6. Create a query GUI (Graphical User Interface).
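Component 4 above (mapping the URI format to a relational database) could be sketched as follows with Python's built-in sqlite3 and xml.etree.ElementTree modules. No database product has been chosen for the project, so SQLite is used here only because it needs no setup. The table layout, the XML layout, the function names and the placeholder id XXXX are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical relational layout: one row per protein, one row per atom.
SCHEMA = """
CREATE TABLE protein (
    pdb_id   TEXT PRIMARY KEY,
    sequence TEXT NOT NULL
);
CREATE TABLE atom (
    pdb_id  TEXT NOT NULL REFERENCES protein(pdb_id),
    serial  INTEGER NOT NULL,
    name    TEXT NOT NULL,
    chain   TEXT NOT NULL,
    element TEXT NOT NULL,
    PRIMARY KEY (pdb_id, serial)
);
"""

# Hypothetical URI document; values are made up for illustration.
SAMPLE_XML = """\
<protein>
  <PDBid>XXXX</PDBid>
  <sequence>MKV</sequence>
  <atoms>
    <atom serial="1" name="N" chain="A" element="N"/>
    <atom serial="2" name="CA" chain="A" element="C"/>
    <atom serial="3" name="O" chain="A" element="O"/>
  </atoms>
</protein>
"""


def load_uri_document(conn, xml_text):
    """Insert one URI document into the protein and atom tables."""
    root = ET.fromstring(xml_text)
    pdb_id = root.findtext("PDBid")
    rows = [
        (pdb_id, int(a.get("serial")), a.get("name"),
         a.get("chain"), a.get("element"))
        for a in root.iter("atom")
    ]
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO protein VALUES (?, ?)",
            (pdb_id, root.findtext("sequence")),
        )
        conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?)", rows)
    return pdb_id


def atoms_by_element(conn, pdb_id, element):
    """Return (serial, name) pairs for one element of one protein."""
    cur = conn.execute(
        "SELECT serial, name FROM atom"
        " WHERE pdb_id = ? AND element = ? ORDER BY serial",
        (pdb_id, element),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    pdb_id = load_uri_document(conn, SAMPLE_XML)
    print(atoms_by_element(conn, pdb_id, "O"))
```

Keeping atoms in their own table keyed by PDB id and serial number lets element or chain queries run as ordinary SQL filters.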


5. Use Cases

Here are a few examples of Use-Case scenarios:
Scenario: Searching for a protein containing three given amino acids in a row

  1. User opens interface.
  2. User selects "Search by Sequence" from drop down search selection menu.
  3. User enters amino acid sequence into the text box, then clicks search button.
  4. Text entered is searched through all "sequence" tags in the database.
  5. Match is made between entered text and sequence.
  6. Protein is displayed on the interface.
  7. Search continues until all sequences matching the entered text are found.
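Steps 4 to 7 of this scenario can be sketched in Python as a substring search over stored sequences. This is a minimal illustration under assumptions: sequences are taken to be one-letter amino acid strings held in a dictionary keyed by PDB id, the function name find_sequence_matches is hypothetical, and the ids and sequences shown are made up.

```python
def find_sequence_matches(sequences: dict[str, str], query: str) -> dict[str, list[int]]:
    """Return, for each protein containing the query, its 1-based match positions.

    Overlapping matches are reported, and proteins without a match are omitted.
    """
    query = query.strip().upper()
    if not query:
        raise ValueError("query must not be empty")
    results = {}
    for pdb_id, sequence in sequences.items():
        positions = []
        start = sequence.find(query)
        while start != -1:
            positions.append(start + 1)  # biologists count residues from 1
            start = sequence.find(query, start + 1)
        if positions:
            results[pdb_id] = positions
    return results


if __name__ == "__main__":
    # Made-up ids and sequences for illustration only.
    stored = {"P001": "MKVLAAGKVL", "P002": "GGGG"}
    print(find_sequence_matches(stored, "kvl"))
```

The search continues past the first hit within each sequence and across all stored proteins, which corresponds to step 7.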

Scenario: Find phi or psi angles in a protein.

  1. User opens interface.
  2. User selects "Phi / Psi Angles" from drop down search selection menu.
  3. User enters PDB id# for protein into text box.
  4. User clicks search button.
  5. Text entered is searched through all "PDBid" tags in the database.
  6. Protein is found matching PDB id#.
  7. The protein's atomic coordinates are used to calculate the phi/psi angles.
  8. Results of phi/psi angles are displayed on the interface.
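Step 7 of this scenario is a dihedral angle calculation on backbone atoms: phi is the torsion angle defined by C of the previous residue, N, CA and C, and psi is the torsion angle defined by N, CA, C and N of the next residue. The Python sketch below uses only the standard library. The function names and the input layout (a list of dictionaries holding N, CA and C coordinates in chain order) are assumptions made for illustration, and the sketch assumes there are no breaks in the chain.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees for four points, in (-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(residues):
    """Return a (phi, psi) pair per residue; undefined angles are None.

    residues: list of dicts with "N", "CA" and "C" coordinate tuples,
    in chain order and assumed to be consecutively bonded.
    """
    angles = []
    for i, res in enumerate(residues):
        phi = psi = None
        if i > 0:  # phi needs the previous residue's C
            phi = dihedral(residues[i - 1]["C"], res["N"], res["CA"], res["C"])
        if i < len(residues) - 1:  # psi needs the next residue's N
            psi = dihedral(res["N"], res["CA"], res["C"], residues[i + 1]["N"])
        angles.append((phi, psi))
    return angles
```

The first residue has no phi and the last has no psi, so the interface would need to display those entries as undefined rather than as zero.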

Scenario: Find previous revisions of a protein.

  1. User opens interface.
  2. User selects "Revisions" from drop down search selection menu.
  3. User enters PDB id# for protein into text box.
  4. User clicks search button.
  5. Text entered is searched through all "PDBid" tags in the database.
  6. Protein is found matching PDB id#.
  7. The protein's data for the previous revision is found in the XML data file.
  8. Previous revision data is displayed on the interface.


6. Constraints

At this stage we have not decided in detail about all the software resources we shall need to use in order to accomplish our project’s goals. We are currently considering using the following software.


7. Project Milestones

Task # | Description | Members Leading Effort | Target Completion
1 | Determine characteristics of currently used system. Define existing problems. Determine user requirements. | All Team | First three weeks
2 | Once we have determined which queries URI should address, conduct research on the PDB data structure and decide on what type of relational database to use. | Keeley, Catalina | Weeks 4,5
3 | Design the database. Create XML DTD for our URI format. | Anita, Jing | Weeks 4,5
4 | If time permits, create a converter from PDB to URI. | Catalina, Jing | Weeks 5,6,7,8
5 | According to the DTD, create XML-based URI. | Anita, Jing | Weeks 6,7
6 | Map XML to relational database. | Anita, Jing | Weeks 8,9
7 | Create XML interfaces with HTML. | Mike | Weeks 10,11,12
8 | Testing. Make necessary adjustments to interface based on results of test. | All Team | Week 13
9 | Create documentation and user’s guide. | All Team | Week 14
10 | Presentation. | All Team | Week 15

8. Project Deliverables

We realize that our project’s goals might be too ambitious for the limited amount of time we have available until the end of the semester.

While we envision our URI system to be ultimately applied generally to any PDB file, we shall concentrate our current work on only one PDB file representing a specific protein.

We will fully document our work and present it at the end of the semester.

We hope that we shall be able to design the database to organize the data from the existing PDB file into relational tables, and to create all the necessary XML interfaces for user queries, so that we shall present a working prototype.

If time permits, we might attempt the following:

  1. Create a tool for conversion from PDB to URI format and vice versa.
  2. Create a software program to find hydrogen information from PDB.
  3. Create a software program to combine the hydrogen information with other useful information to format the new URI format.