----------------------------------------------------------------------
Protein Data Bank Quarterly Newsletter
Release #69                                                  July 1994
----------------------------------------------------------------------

July 1994 PDB Release

2684 full-release atomic coordinate entries (249 new additions)

    2478 proteins, enzymes and viruses
     177 DNA's
      10 RNA's
       9 tRNA's
      10 carbohydrates

357 structure factor entries
31 NMR experimental entries

The total size of the atomic coordinate entry database is 826 Mbytes uncompressed.

----------------------------------------------------------------------
The latest version of the Electronic Deposition Form should be obtained from the FTP /pub directory before depositing data.

----------------------------------------------------------------------
What's New at the PDB

In an effort to provide better service to users and depositors, PDB has implemented several new procedures in the past few months. These include a new Electronic Deposition Form, the addition of multicolor images of structures on the PDB Gopher and World Wide Web (WWW) servers, and the introduction of an entry tracking system.

The new Electronic Deposition Form is now available from FTP. We encourage all depositors to use it. For information on how to retrieve a copy, please see the article entitled "New Electronic Deposition Form" on page 7.

A series of 474 multicolor images of PDB structures has been added to the PDB Gopher and WWW servers. Dr. Manuel Peitsch of The Glaxo Institute for Molecular Biology in Geneva prepared the images, designing them so that each one depicts the most important aspects of the entry or experiment that it represents. Users may access these images in the FTP directory /images (also reachable from Gopher or WWW via "The PDB's Anonymous FTP").

Because the PDB cannot issue an ident code until an entry is processed, a tracking number uniquely defining each submission is issued at the time of initial deposit.
This number enables users, journals and the PDB to follow the progress of every entry through the entire screening process via our Gopher system. Additional information on this tracking system may be found in the article below.

We encourage you to take advantage of the new tools offered by PDB. We are very excited about them and hope that you will find them to be useful additions. As always, your comments are welcome at pdb@bnl.gov. Thank you.

- Joel L. Sussman

----------------------------------------------------------------------
New Entry Tracking System

Many depositors, users and journals that refer to PDB have asked us to provide a mechanism for checking the status of entries from the depositor's initial contact with PDB to the time when the work is published. This is now possible due to the establishment of tracking codes, one of which is given to each depositor upon deposition of an entry. This code can be used by depositors, users and journals to track an entry's progress until full release.

Upon receipt of coordinates for an entry, PDB issues a tracking number and sends a letter to the primary and secondary contacts that includes this number. The format of the tracking number is a number preceded by the letter T (e.g., T9999).

PDB issues the PDB ident code, required by most journals, only after running some preliminary checking programs and receiving all required documentation, including (p)reprints.

The tracking number is published, for every entry that is currently being processed by PDB, in the file /pub/pending_waiting.list which is available from FTP. If you use Gopher to access PDB, this file is indexed by tracking number, author and compound name in order to speed searches.
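
The tracking number format described above (the letter T followed by a number, e.g., T9999) is simple enough to check mechanically before searching the pending list. The following C fragment is only a sketch: the function name is_tracking_number is hypothetical, and it assumes an upper-case T followed by one or more decimal digits of any length, which the text does not state explicitly.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s looks like a PDB tracking number
 * (the letter 'T' followed by one or more decimal digits), 0 otherwise.
 * Assumption: the number of digits is not fixed; the newsletter shows
 * only the example T9999. */
int is_tracking_number(const char *s)
{
    size_t i;

    if (s == NULL || s[0] != 'T' || s[1] == '\0')
        return 0;
    for (i = 1; s[i] != '\0'; i++) {
        if (!isdigit((unsigned char)s[i]))
            return 0;   /* anything but a digit after the T is rejected */
    }
    return 1;
}
```

A check of this kind would accept T9999 and reject an ident code such as 1ABC, which may help keep the two kinds of identifier apart in scripts.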
Status for each pending and waiting entry is issued from the following list:

    INCOMPLETE - if PDB is waiting for additional materials or
                 information from depositor to make a complete
                 deposition
    PROCESSING - if PDB is checking and verifying entry
    DEPOSITOR  - if entry is with depositor for approval
    REVIEW     - if entry is undergoing final review
    REL        - if entry is in current release
    HLD        - if entry is on hold at present time

The tracking number is not to be construed as an ident code. Its purpose is to speed up our searches, make our responses to inquiries faster and more accurate, and allow depositors to track the progress of their entries.

----------------------------------------------------------------------
CIF and the PDB

Most of you have undoubtedly heard something about CIF (The Crystallographic Information File); many of you have asked for more information about CIF and how it will affect PDB users. There will be a series of Newsletter articles pertaining to CIF. This first article describes CIF from a computer user's point of view and deals with the nuts and bolts issues of using CIF on a computer. Articles in future Newsletters will discuss dictionaries and CIF from a crystallographer's point of view.

One of the most frequently asked questions is, "Why have a new file format at all?" The answer will become clearer over the course of this article and articles to follow in future Newsletters. The main issue is that PDB format is simply too inflexible to deal with the ever-expanding types of data that it is asked to contain. At the time PDB format was first conceived, 40 characters seemed adequate to store a molecule's "Functional Classification", 99,999 atoms in a chain seemed perfectly adequate, and one only needed a good memory to locate the interesting structures, not a computer. None of these conditions is true today, and we can foresee the need to adapt at a faster pace over the next several years.
Therefore, some changes in format were inevitable and CIF seemed to be a possible candidate.

CIF is a direct descendant of the STAR (Self-defining Text Archive and Retrieval) file created by Sydney Hall in 1990 [S. R. Hall, The STAR File: A New Format for Electronic Data Transfer and Archiving. J. Chem. Inf. Comput. Sci. 31, 326-333 (1991)]. Hall, with Frank Allen and David Brown, put some limitations on the STAR definition to arrive at CIF [S. R. Hall, F. H. Allen and I. D. Brown, The Crystallographic Information File (CIF): A New Standard Archive File for Crystallography. Acta Cryst. A47, 655 (1991)].

The stated goals of STAR (and CIF) were to have a format that would be able to store all kinds of data, be machine independent, be simple to read and access, and be flexible to future change. It was also a goal that the format not be tied to any one database model. How well STAR and CIF succeed at reaching each of these goals is open to discussion, but most people would agree that CIF is capable of storing a wide range of ASCII data, is mostly machine independent, is easily machine readable and, perhaps most importantly, is flexible.

So in the end, what is a CIF? It is a text file consisting of the printable ASCII characters from space to "~" and with some sort of new line character. The line lengths are limited to 80 characters. A CIF is divided into a number of data blocks, each of which has a name. A block is introduced by a character string of the form:

    data_code

where code may be up to 32 characters in length. A block is terminated by either the beginning of another block or the end-of-file. The block name cannot be duplicated in a CIF.

Each block consists of a sequence of data items which comes in two forms. The first and simplest gives the name of the item and its value. The names of all data items start with an "_", are limited to 32 characters, and must also be unique.
All data items should simply be seen as text strings (we will delve deeper into this in a future article) which are delimited by spaces. The text strings can be quoted by either single or double quotes if they contain spaces. In addition, very long (more than 80 character) strings can be represented by starting a line with a semicolon and ending the data value with a subsequent line starting with a semicolon. Then, the string of characters from the first non-blank character following the first semicolon to the last non-blank character preceding the second semicolon is considered to be the value of the data item. End-of-line characters are discarded and multiple spaces are replaced with one space. (These last two restrictions are not part of the specification, but are needed for practical purposes.)

The second form is used to represent lists. It starts with the keyword "loop_" and is followed by a sequence of N data names, then a sequence of K*N data values. The names and values conform to the restrictions mentioned in the previous paragraph.

An example will make this clearer. The following CIF consists of two data blocks (address and papers) and demonstrates the various methods of data representation. The second data block contains a simple loop.

    data_address
    _name          "Sydney R. Hall"
    _department    "Crystallography Centre"
    _organization  'University of Western Australia'
    _city          Nedlands
    _code          6009
    _country       Australia

    data_papers
    loop_
    _title
    _date
; The Crystallographic Information File (CIF): a New Standard
  Archive File for Crystallography
;
    1991
; The STAR File: A New Format for Electronic Data Transfer and
  Archiving
;
    1990

It is important that one understands that the data in a CIF are constrained neither in the order in which they appear, nor in their format. For example, the following CIF conveys the same information as the preceding example, albeit in a less (humanly) readable form.

    data_address _name
;Sydney R. Hall
;
    _city Nedlands _code 6009 _country Australia
    _department 'Crystallography Centre' _organization
; University of Western Australia
;
    data_papers loop_ _date _title 1991
; The Crystallographic Information File (CIF): a New Standard
  Archive File for Crystallography
;
    1990
; The STAR File: A New Format for Electronic Data Transfer and
  Archiving
;

Those accustomed to PDB format view this as a major disadvantage, as some of the UNIX user's favorite tools (diff and grep) fail miserably with this format.

Where CIF has an advantage over PDB format is in dealing with the question, "What do you mean by a data item?" In PDB format, one must look back at a document written in English that describes what is really meant by the data field of interest. If the definition is present at all (do we all understand and agree with the meaning of insertion code?) this can be interpreted in many ways due to the ambiguity inherent in all natural languages (e.g., are Au and Australia both acceptable country names and are they equivalent?). In contrast each CIF has associated with it one or more dictionaries (also stored in a CIF), describing the attributes and in some cases acceptable values of every data item stored in the CIF. This allows for the possibility of automatically validating a CIF as being "correct" relative to a dictionary. If the people who write the dictionaries are careful, then it should be possible to provide air-tight definitions of each and every data item stored in a CIF. To an organization (like PDB) that must automate all of its operations, this is a huge advantage.

While there are some drawbacks to CIF as it currently exists, the community hopes to address many of these in the coming months via various working committees. The free format makes using familiar tools such as grep and awk difficult, therefore requiring new tools in order to access all of the data in the file.
The limitations on line and word lengths are a bit archaic, and the failure to specify clearly the acceptable character set and escapes will cause some difficulty if not addressed (e.g., storage of binary files and other CIFs within a CIF is not currently possible). Also, CIF does not address the issue of building more complex data types (e.g., records or structs) as most data representation schemes do. Still, one must remember that CIF has been used in the small molecule field, and we look forward to similar success.

You are encouraged to look at two dictionaries that will be of greatest concern to macromolecular crystallographers over the next several years - the Core Dictionary and the mmCIF Dictionary - both of which are available on the PDB file server in the directory /pub/Cif. The Core Dictionary is an IUCr standard and the mmCIF dictionary has been published for comments after an enormous effort by Paula Fitzgerald, Helen Berman, Phil Bourne and Keith Watenpaugh, with contributions and suggestions from many other scientists. The mmCIF dictionary has been designed to be capable of representing the results of an experiment. In the next Newsletter, the issue of CIF dictionaries and DDLs (data definition languages) will be discussed in much greater detail.

The success or failure of CIF will depend to a large degree on the number and quality of software tools available to users. To begin with, a small library of C and FORTRAN callable functions which provide primitive access to the contents of a CIF has been placed on the PDB file server. These functions do not (at this time!) use any dictionaries to validate CIF files, but simply provide access. The primitive functions are:

    int CifRead1(char* filename)
        Reads an entire CIF into memory. Returns 0 if successful,
        non-zero if failed. A diagnostic is written to the stderr.

    int CifLoopCount1(char* block, char* name)
        Returns the number of values associated with the data item
        named "name" in the data block named "block". A return value
        of 0 implies that the named data item does not exist.

    int CifExists1(char* block, char* name, int index)
        Returns a non-zero value if the item named "name" in the
        block called "block" has index values associated with it.
        (The first occurrence has an index of 0!)

    char* CifGet1(char* block, char* name, int index)
        Returns a pointer to a character string giving the values of
        the specified data item.

With these functions, if the above examples of CIF are in the file hall.cif, the following code will read and output all of the titles of the papers stored in the CIF:

    #include <stdio.h>
    #include <stdlib.h>

    /* Prototypes as listed above. */
    int   CifRead1(char* filename);
    int   CifLoopCount1(char* block, char* name);
    char* CifGet1(char* block, char* name, int index);

    int main()
    {
        int nPapers;
        int iPaper;

        /* CifRead1 returns non-zero on failure. */
        if (CifRead1("hall.cif")) {
            fprintf(stderr, "could not read the file hall.cif\n");
            exit(1);
        }

        /* Number of _title values in the block "papers". */
        nPapers = CifLoopCount1("papers", "_title");
        for (iPaper = 0; iPaper < nPapers; iPaper++) {
            printf("%s\n", CifGet1("papers", "_title", iPaper));
        }
        return 0;
    }

These functions should be considered to be beta-versions and are subject to having bugs and being less than optimal, though they are constantly being improved. Your input is gratefully accepted. In the next Newsletter, an article will discuss those CIF dictionaries which are standards and those which are experimental. Some additional CIF tools will also be discussed. More sophisticated tools are available on the network; the anonymous FTP server at cuhhca.hhmi.columbia.edu is an excellent place to start searching.

Finally, it is important for the user community to understand that PDB's interest in CIF will not adversely impact your day to day operations. PDB's plans with respect to CIF are perfectly clear in one respect. We will never suddenly change the format of every entry held by PDB. Our plans call for:

- Making sure that current entries are complete in PDB format and contain as much information as possible so that converting a PDB entry to a CIF entry can be automated.

- Providing a tool that will take a CIF description of a PDB entry and render it in PDB format on a best effort basis. All existing files will be available in PDB format. All new entries submitted in PDB format will also be available in that form. All new entries submitted in CIF will probably have some limitations in their ability to be converted to PDB format (due to the greater information content of CIF), but for the time being, the two formats should be regarded as interchangeable.

- Supporting the community in accepting and archiving data which is simply not representable in PDB format.

- Working with the major distributors of codes and applications that have produced and used PDB format in the past to help them make the conversion to CIF-based representation.

We look forward to your opinions. Please feel free to contact Dave Stampf (drs@bnl.gov).

----------------------------------------------------------------------
Improve Descriptions of Sequence Information

PDB plans to include new record types to improve descriptions of sequence information now given in the SEQRES records. The new records will provide annotation mechanisms to correlate information stored in PDB entries with that found in the various sequence databases (e.g., PIR, SWISS-PROT). The following is a summary description for the new record types.

    Record Type - Purpose

    DBREF  - To provide cross-reference links between PDB sequences
             and corresponding sequence database entries.
    SEQADV - To identify conflicts in sequence information between
             PDB sequences and sequence database entries.
    MODRES - To provide descriptions of modifications (e.g., chemical
             or post-translational) to protein and nucleic residues.
             Included will be a mapping between residue names given
             in a PDB entry and the standard 20 protein and 5 nucleic
             acid residues.

A number of recently-released PDB entries contain REMARK records that already give this information.
PDB now wishes to standardize the format and representation for these data. In addition, these record types will help facilitate conversion of PDB entries into CIF format.

In general, PIR and SWISS-PROT entries contain information on the wild-type molecule. Each entry normally contains the sequence for one gene product, and some entries include the complete precursor sequence. Annotation is provided to describe residue modifications. In both databases, the residue names used are limited to the 20 standard amino acids. In contrast, PDB entries contain multichain molecules with sequences that may be wild type, variant, or synthetic. Sequences may also have been modified through site-directed mutagenesis experiments (engineered). A number of PDB entries report structures of domains cleaved from larger molecules.

The DBREF record was designed to account for these differences by providing explicit correlations between contiguous segments of sequences as given in PDB ATOM records and PIR or SWISS-PROT entries. Several cases are easily represented by means of pointers between the databases using DBREF. PDB entries containing heteropolymers will be linked to different sequence database entries. In some cases, such as those PDB entries containing immunoglobulin Fab fragments, each chain will be linked to two different PIR and/or SWISS-PROT entries. This facility is needed, because these databases represent sequences for the various immunoglobulin domains as separate entries. DBREF should also be able to represent molecules engineered by altering the gene (fusing genes, altering sequences, creating chimeras, or circularly permuting sequences). This design has one additional advantage, i.e., it will be possible to construct pointers to other databases such as those describing sequence motifs (e.g., PROSITE, BLOCKS).

Selection of the appropriate sequence database entry or entries to be linked to a PDB entry will be done on the basis of molecule name and source.
Questions on entry assignment that may arise will be resolved by consultation with PIR and SWISS-PROT staff.

In a number of cases, conflicts between the sequences found in PDB entries and in PIR or SWISS-PROT entries have been noted. There are several possible reasons for these conflicts: PDB may contain variant sequences or engineered sequences (mutants), polymorphic sequences, or ambiguous or conflicting experimental results. These discrepancies, which were previously described in REMARK records, will now be reported in SEQADV records.

Finally, residues modified post-translationally or by design will be described in MODRES records. In those cases where PDB has opted to use a non-standard residue name for the residue, MODRES will also provide a mapping to the precursor standard residue name.

There remain a number of unresolved issues related to the SEQRES records. Most significant is the need to represent sequences for the multiply-branched polysaccharides. PDB plans to address this in the future.

PDB wishes to solicit your input regarding these new records. Please send comments and suggestions to Enrique Abola (abola1@bnl.gov).

-- Record Descriptions

Record Name: DBREF

    Cols.     Contents and Description
    01 - 05   DBREF
    07 - 10   PDB ident code
    12        Chain name
    14 - 17   Initial sequence number of PDB sequence segment
    18        Initial insertion code of PDB sequence segment
    20 - 23   Ending sequence number of PDB sequence segment
    24        Ending insertion code of PDB sequence segment
    26 - 31   Sequence database name (PIR, SWS, GDB, BLOCKS, PROSIT)
    33 - 40   Sequence database accession code
    42 - 53   Sequence database identification code
    55 - 60   Initial sequence number in the sequence database
    62 - 66   Ending sequence number in the sequence database

DBREF records have been developed identifying sequence correlations between PDB ATOM records and corresponding PIR or SWISS-PROT entries.
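
Because DBREF is a fixed-column record, a reader can pull each field out by column position alone. The C fragment below is a minimal sketch of that idea, not PDB software: the names DbrefRecord, copy_cols and parse_dbref are hypothetical, the column numbers are taken from the DBREF layout above, and it assumes that blank padding inside a field carries no meaning and that a blank numeric field may be read as 0.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the fields of one DBREF record. */
typedef struct {
    char ident[5];          /* cols  7-10: PDB ident code               */
    char chain;             /* col  12   : chain name                   */
    int  seq_begin;         /* cols 14-17: initial PDB sequence number  */
    char ins_begin;         /* col  18   : initial insertion code       */
    int  seq_end;           /* cols 20-23: ending PDB sequence number   */
    char ins_end;           /* col  24   : ending insertion code        */
    char db_name[7];        /* cols 26-31: sequence database name       */
    char db_accession[9];   /* cols 33-40: database accession code      */
    char db_id[13];         /* cols 42-53: database identification code */
    long db_seq_begin;      /* cols 55-60: initial number in database   */
    long db_seq_end;        /* cols 62-66: ending number in database    */
} DbrefRecord;

/* Copy columns first..last (1-based, inclusive) of line into out,
 * dropping leading and trailing blanks. Columns beyond the end of the
 * line are treated as blank. out must hold last - first + 2 chars. */
static void copy_cols(const char *line, int first, int last, char *out)
{
    size_t len = strlen(line);
    int n = 0;
    int col;

    for (col = first; col <= last; col++) {
        char c = ((size_t)col <= len) ? line[col - 1] : ' ';
        if (c == '\n' || c == '\r')
            c = ' ';
        if (c != ' ' || n > 0)      /* skip leading blanks */
            out[n++] = c;
    }
    while (n > 0 && out[n - 1] == ' ')
        n--;                        /* drop trailing blanks */
    out[n] = '\0';
}

/* A blank single-column field is reported as a space character. */
static char first_or_blank(const char *s)
{
    return s[0] != '\0' ? s[0] : ' ';
}

/* Returns 1 and fills rec if line begins with "DBREF", otherwise 0. */
int parse_dbref(const char *line, DbrefRecord *rec)
{
    char buf[16];

    if (line == NULL || rec == NULL || strncmp(line, "DBREF", 5) != 0)
        return 0;

    copy_cols(line,  7, 10, rec->ident);
    copy_cols(line, 12, 12, buf); rec->chain        = first_or_blank(buf);
    copy_cols(line, 14, 17, buf); rec->seq_begin    = atoi(buf);
    copy_cols(line, 18, 18, buf); rec->ins_begin    = first_or_blank(buf);
    copy_cols(line, 20, 23, buf); rec->seq_end      = atoi(buf);
    copy_cols(line, 24, 24, buf); rec->ins_end      = first_or_blank(buf);
    copy_cols(line, 26, 31, rec->db_name);
    copy_cols(line, 33, 40, rec->db_accession);
    copy_cols(line, 42, 53, rec->db_id);
    copy_cols(line, 55, 60, buf); rec->db_seq_begin = atol(buf);
    copy_cols(line, 62, 66, buf); rec->db_seq_end   = atol(buf);
    return 1;
}
```

Reading by column rather than by splitting on blanks matters here, since optional fields such as the insertion code or the identification code may be empty. A production reader would probably also want to tell a blank numeric field apart from a genuine zero, which this sketch does not attempt.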
PDB entries containing chains for which residues are missing primarily due to disorder will contain several DBREF records, each linking an observed sequence segment to a sequence database entry.

Examples:

DBREF 1ABC A    1A  100  SWS    P10725   ALR_BACSU        15   114
DBREF 1ABC A    1A  100  PIR    JS0443                    15   114

Record Name: SEQADV

    Cols.     Contents and Description
    01 - 06   SEQADV
    08 - 11   PDB ident code
    13 - 15   PDB residue name
    17        PDB chain name
    19 - 22   PDB sequence number
    23        PDB insertion code
    25 - 30   Sequence database name (PIR, SWS, PROSIT)
    32 - 39   Sequence database accession number
    40 - 42   Sequence database residue name
    44 - 47   Sequence database sequence number
    50 - 70   Conflict comment

SEQADV describes conflicts between residue sequences given by PDB ATOM records and those in the appropriate sequence database entry, such as residues missing due to disorder.

Examples of possible conflict comments:

    Cloning artifact
    Engineered
    Disordered
    Variant
    Insertion
    Deletion

When conflicts arise which are not classifiable by these terms then a reference to either a published paper, a PDB entry, or a REMARK within the entry will be given. References will be given in the form YY-VOL-PAGE-CSDCODEN where YY is year of publication, VOL is the journal volume number, PAGE is the starting page and CSDCODEN is the 4-digit code assigned to journals by PDB and the Cambridge Structural Database (CSD). When reference is made to a PDB entry, then the form is PDB: 1ABC, where 1ABC is the relevant entry ident code. Finally, the comment "SEE REMARK XXX" will be included, where XXX is the remark number within the entry in which the primary explanation to the discrepancy is given.

Examples:

SEQADV 1ABC ASN A  100A SWS    P10725  ASP  100  1994-300-1200-0070
SEQADV 2ABC ASN A  100A SWS    P10725  ASP  100  PDB: 1ABC
SEQADV 1ABC MET A   -1  SWS    P10725            CLONING ARTIFACT
SEQADV 1ABC GLY A   50  SWS    P10725  VAL   50  ENGINEERED

Record Name: MODRES

    Cols.     Contents and Description
    01 - 06   MODRES
    08 - 11   PDB ident code
    13 - 15   Residue name
    17        Chain name
    19 - 22   Sequence number
    23        Insertion code
    25 - 27   Standard residue name
    30 - 70   Modification description

Examples of modification descriptions:

    Glycosylation site
    Post-translational modification
    Designed chemical modification
    Phosphorylation site
    Blocked N-terminus
    Aminated C-terminus

Examples:

MODRES 1ABC ASN A   22A ASN  GLYCOSYLATION SITE
MODRES 1ABC TTQ A   50A TRP  POST-TRANSLATIONAL MODIFICATION

----------------------------------------------------------------------
PDB User Group

A User Group has been instituted for PDB, headed by Jane Richardson of Duke University. There is a Coordinating Committee to represent the diverse spectrum of users: so far the members of this committee are Mike Summers (University of Maryland) who represents the NMR community and Judy Voet (Swarthmore College) who represents those who use the PDB in teaching. At the broad level, of course, everyone interested is by definition a member of the User Group.

The User Group aims to facilitate communication in both directions between PDB and all categories of its users, improving knowledge of what is already available or in progress, collecting user feedback, diagnosing problems quickly, and collectively arriving at the best ideas and innovations for the future.

Thus far the User Group's main activities, based on user input, have been:

- Initiating the setup of an on-line, user-queriable directory that gives the status of all entries between deposition and release. A query is carried out by reaching PDB via Gopher (gopher.pdb.bnl.gov), choosing the indexed search "*NEW* check the status of a pending entry by ident code, tracking, author, etc.", and supplying the desired search word. Alternatively, the file /pub/pending_waiting.list can be retrieved from FTP, Gopher or WWW and searched for the desired entry.
  Please see additional information in the article entitled "New Entry Tracking System" on page 1.

- Initiating efforts by PDB to provide a definition of the functioning biological unit for entries and a procedure by which this may be generated. On an experimental basis, the full coordinates generated for these biological units will be made available from FTP. Please see the article below for additional information.

- Establishing an on-line directory known as "/user_group" in which the User Group will have available various subsets and annotations of PDB entries, for such purposes as teaching and structural analysis. This will be available from FTP.

Your input is solicited. Please let us know what type of user you are and what you consider priorities for the future. Respond by e-mail to:

    PDBusrgp@suna.biochem.duke.edu

or by postal mail to:

    Jane Richardson, PDB User Group,
    Box 3711 DUMC, Durham, NC 27710, USA.

----------------------------------------------------------------------
The Biological Molecule

PDB plans to distribute new files that contain the coordinates for complete multimeric molecules. These files are to be called bio1abc.xyz where 1abc is the PDB ident code for the entry from which the multimer is generated. Multimers will be generated when it is determined that the physiologically active molecule (or biologically relevant molecule) includes subunits for which coordinates are currently not given in the entry. These multimers are described in PDB entries explicitly by non-crystallographic operators (MTRIX records) or implicitly via the crystallographic space group operators. Thus, unlike regular PDB entries, these files may include coordinates for crystallographically-related subunits.

This work is being carried out in response to Jane Richardson's request to have these files available.
In an informal survey carried out as part of her efforts to organize the new PDB User Group, Jane found that availability of these files was desired by a large number of users.

Several steps have already been taken by PDB to ensure that the complete description of the biologically active molecules is available. Most significant is the inclusion of questions in PDB's new Electronic Deposition Form requesting information on the subunit composition of the molecule. For those entries in the current release of the PDB, subunit compositions are generated using information available from the corresponding SWISS-PROT (sequence database) entry.

Loading of these files on the PDB anonymous server will start shortly and is expected to be completed (current with the release) late this fall. Please send comments and suggestions on the contents of these files directly to PDB or to the User Group.

----------------------------------------------------------------------
Getentry

If you use C shell on a UNIX system, a script for easily locating and transferring PDB files is now available. Named getentry, this script, written by J. P. Rose, E. E. Abola and M. D. Libeson, allows straightforward searches, based on compound, author name, source and resolution, as well as retrieval of coordinate entries and some other files.

To obtain the script, retrieve the file /pub/getentry from FTP. Installation instructions are found at the top of the file.

Searches can make use of UNIX regular expressions, allowing flexibility. Options allow searching certain of the index files from FTP, retrieving the new Electronic Deposition Form, retrieving the file /pub/pending_waiting.list, and obtaining addresses of crystallographers from the electronic list maintained by Dr. Martha M. Teeter, Department of Chemistry, Boston College, Chestnut Hill, MA. Other options may be added in the future.

----------------------------------------------------------------------
Deposition of Data

-- New Electronic Deposition Form

A new Electronic Deposition Form is now available. Our goal in designing the Form was to create a document that would be easy for depositors to fill out and also for PDB staff to process electronically, minimizing errors and increasing efficiency. The Form has provisions for depositors to include information necessary in a complete entry, i.e., refinement details and sequence information, especially regarding any differences between the full protein sequence and the sequence of residues for which there are actual atomic coordinates. The Form also requests information for the cross-referencing of sequence information to the protein sequence databases (PIR, SWISS-PROT), making it easier to use information stored in PDB for homology model building. In addition, the Form clearly distinguishes between NMR and X-ray data.

The Electronic Deposition Form begins with a description of the standard structure of a PDB coordinate file and provides guidelines for preparing data for deposition. These guidelines should be studied and followed carefully for the most efficient processing of your data.

Additional information not included in PDB's old form is now requested. These data items will significantly help our staff to prepare an entry for distribution. The Form leads you through the needed material, from depositor information to the description of the experiment, bibliographic references, crystallographic data, het group description, etc.

Finally, secondary structure information is now being prepared by PDB using the Kabsch and Sander algorithm [Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-bonded and Geometrical Features. Biopolymers 22, 2577-2637 (1983)], as implemented in the program Procheck, making submission of HELIX, SHEET and TURN records optional.
However, we will gladly also include your secondary structure specifications if you wish. The most current version of the Electronic Deposition Form should be used when preparing your deposition. You may use FTP (ftp.pdb.bnl.gov), Gopher (gopher.pdb.bnl.gov) or WWW (www.pdb.bnl.gov) to retrieve the file /pub/dep_form.txt. You may request the Form by sending e-mail to pdb@bnl.gov. Please read and follow the instructions carefully. PDB will be happy to receive your comments on the design and ease of use of the Electronic Deposition Form. For help in filling out the form or to give us your comments, send e-mail to pdb@bnl.gov. -- Guidelines for Deposition PDB accepts depositions of biological macromolecule struc- tures and the corresponding crystallographic structure factors or NMR experimental data. Types of structures accepted include proteins, carbohydrates, viruses, DNA and RNA. We convert deposited data to standard PDB format, run verification, checking and quality control programs on the data, and archive and distribute the data worldwide. A deposition has three essential components, all of which must be received by PDB before we can begin processing an entry: the completed Elec- tronic Deposition Form, (p)reprints of all referenced papers, and the atom coordinates formatted as PDB ATOM and HETATOM records. The Electronic Deposition Form and coordinates must be submitted in machine readable form via one of the following: e-mail: pdb@bnl.gov FTP: connect to ftp.pdb.bnl.gov using anonymous FTP cd to the /new_uploads directory upload your files postal mail: Protein Data Bank Depositions Chemistry Department, Building 555 Brookhaven National Laboratory P.O. Box 5000 Upton, NY 11973-5000 USA The paper (p)reprints must be sent via facsimile (516-282-5751) or air mail. Please obtain the latest version of the new Electronic Deposi- tion Form from the FTP directory /pub each time that you make a deposit; edit and return the completed Form to us electroni- cally. 
Be sure to include identifying information, including your name, postal mailing address, e-mail address, facsimile number and telephone number, in the header of all items submitted.

-- Deposition Contact Persons

PDB requires the names of two contact persons - a primary and a secondary contact - on the Electronic Deposition Form, for each new structure. The primary contact is the person who has the authority to grant approval for release of the completed PDB entry. Both contacts should be able to help in preparing the entry if we need any answers or clarification. The secondary contact should be able to help locate the primary contact if this becomes necessary.

The prepared PDB entry and request for release approval will be sent to the primary contact. A note will be sent to the secondary contact stating that these files have been sent to the primary contact.

----------------------------------------------------------------------
Discontinuation of Tape Distribution

Due to the small number of tapes being ordered, PDB is contemplating discontinuing distribution on this medium. Information will be forthcoming on the status and time frame involved. Anyone who anticipates a continuing need for tapes should contact PDB now.

----------------------------------------------------------------------
Help PDB to Identify JRNL Records from Years Ago

Over the years, a number of PDB entries have been entered into the database before their respective journal articles appeared in print. Consequently, instead of giving full proper references, their JRNL records read only "TO BE PUBLISHED" or "IN PRESS". Unfortunately, in many cases these records were never updated, so that entries from years ago still read this way.

The PDB staff is therefore asking the scientific community, particularly the original depositors of these entries, to please inform us of the complete corrected JRNL references of such entries as soon as possible. Thank you very much.
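
Depositors who keep local copies of their entries may find it convenient to check them for such placeholder references. The following Python sketch is only an illustration, not a PDB tool: the function name and the sample records are made up, and it assumes nothing more than that JRNL records begin with the record name JRNL and contain one of the two placeholder phrases quoted above.

```python
# Placeholder phrases quoted in the article above.
UNPUBLISHED_MARKERS = ("TO BE PUBLISHED", "IN PRESS")


def unpublished_jrnl_lines(lines):
    """Return the JRNL records that still carry a placeholder reference.

    Hypothetical helper: only lines whose record name is JRNL are examined.
    """
    hits = []
    for line in lines:
        if not line.startswith("JRNL"):
            continue  # other record types are ignored
        text = line.upper()
        if any(marker in text for marker in UNPUBLISHED_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


# Made-up sample records, for illustration only.
sample = [
    "HEADER    EXAMPLE ENTRY (MADE-UP DATA)",
    "JRNL        AUTH   A.N.AUTHOR",
    "JRNL        REF    TO BE PUBLISHED",
    "REMARK   1 IN PRESS here is ignored: not a JRNL record",
]

for record in unpublished_jrnl_lines(sample):
    print(record)
```

To check a real file, the same function can be given the open file object, since it only iterates over lines.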
----------------------------------------------------------------------
Missing Data

PDB is now handling approximately 90 new coordinate depositions per month. Almost all coordinate data arrives electronically, as does the Electronic Deposition Form. The (p)reprints still generally arrive by postal mail. The following should be obvious to all of us:

- Transmission errors occur occasionally. These are easy to resolve if the depositor still has a copy of the file sent.

- E-mail sometimes does not arrive at its destination. If this problem could be resolved, everyone using the Internet would breathe a sigh of relief.

- Computers crash in unexpected ways.

- We try to be perfect in our operations but admit that mistakes can happen.

Each item that arrives (data, Electronic Deposition Form, (p)reprints, letters, etc.) must be identified, associated with the other items of the same deposition and, after the data has arrived, logged into our entry tracking system.

We strongly recommend that depositors check the pending_waiting list on FTP (see additional information in article entitled "New Entry Tracking System" beginning on page 1) to make sure that their data has been logged. Allowing for our fluctuating work load, we suggest that you contact us if your data does not appear on this list within a month of your sending it to us.

Please keep copies of everything you send us.

----------------------------------------------------------------------
Access to PDB

-- FTP

PDB has an anonymous FTP account on the computer system ftp.pdb.bnl.gov (Internet address 130.199.144.1). Files may be transferred to and from this system using anonymous as the FTP user name and your e-mail address as the password. Besides downloading entries, data files and documentation, it is possible to upload any files that you may wish to send to PDB, but only into the directory /new_uploads. Those using VMS may need to place quotes around file names.
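
For readers who script their transfers, the anonymous FTP procedure described above might be sketched as follows in Python, using the standard ftplib module. This is a minimal sketch under stated assumptions, not a PDB-supplied program: the host name and the /new_uploads directory are taken from this newsletter, while the function names are hypothetical and the e-mail address passed as the password is the caller's own.

```python
import os
from ftplib import FTP

PDB_FTP_HOST = "ftp.pdb.bnl.gov"  # host name as given in this newsletter
UPLOAD_DIR = "/new_uploads"       # the only directory that accepts uploads


def retr_command(remote_path):
    """Build the FTP command that retrieves one remote file."""
    return "RETR " + remote_path


def stor_command(local_path):
    """Build the FTP command that stores a file under its base name."""
    return "STOR " + os.path.basename(local_path)


def fetch(remote_path, local_path, email):
    """Download one file, logging in as 'anonymous' with an e-mail password."""
    with FTP(PDB_FTP_HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        with open(local_path, "wb") as out:
            ftp.retrbinary(retr_command(remote_path), out.write)


def deposit(local_path, email):
    """Upload one file into the /new_uploads directory."""
    with FTP(PDB_FTP_HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(UPLOAD_DIR)
        with open(local_path, "rb") as src:
            ftp.storbinary(stor_command(local_path), src)
```

For example, fetch("/pub/dep_form.txt", "dep_form.txt", "you@your.site") would retrieve the Electronic Deposition Form, where the e-mail address shown is a placeholder. No connection is made until one of the two transfer functions is called.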

-- Gopher

PDB has a Gopher server on the system gopher.pdb.bnl.gov (130.199.144.1). This server is accessible using a Gopher client connecting to the following link:

    Name = Protein Data Bank FTP server
    Type = 1
    Host = gopher.pdb.bnl.gov
    Port = 70
    Path = 1/

As a Gopher client, you may navigate through a hierarchy of directories and documents or ask an index server to return a list of all documents that contain one or more specified words. For instance, you can choose "The PDB Anonymous FTP" after reaching PDB's Gopher server in order to search and download the same information and coordinate files as through FTP. Alternatively, you can select "An (almost) full-text search of the PDB Bibliographic Headers" in order to search PDB using any keyword.

-- World Wide Web (WWW)

PDB has a World Wide Web (WWW) server on the computer system www.pdb.bnl.gov (130.199.144.1). This server is accessible using the Document URL http://www.pdb.bnl.gov/. Besides including links to the PDB FTP and Gopher servers, the WWW server includes links to many other useful databases and information servers.

-- Listserv

PDB has a mailing list devoted to discussions concerning its operation, contents, and access procedures. To subscribe, send e-mail to listserv@pdb.pdb.bnl.gov with the one-line message:

    subscribe PDB-L Firstname Lastname

To find out what can be done with this mailing list, send e-mail to the same address (listserv@pdb.pdb.bnl.gov) with the one-line message of "help."

To send a message to all PDB-L subscribers, e-mail the message to:

    PDB-L@pdb.pdb.bnl.gov

----------------------------------------------------------------------
Ident Codes

-- Issuance Of

Each PDB entry is uniquely identified by a four-character ident code (also sometimes referred to as an accession code). Presently PDB practice is to issue ident codes without regard for the structure name.
However, we recognize that many depositors would like to have ident codes that are related mnemonically to the names of their structures. Should you have a preference for a particular ident code, PDB requests that you inform us about this on your Electronic Deposition Form. All reasonable suggestions will be considered. Of course, if the ident code that you are suggesting has already been used for an existing entry, then an alternative code will have to be issued by PDB.

-- Obtainment for a New Entry

PDB issues the ident code of a new entry only after receiving the complete deposition and verifying the correctness and integrity of the data. The prepared entry as well as a letter stating the ident code, describing any problems found in the entry and requesting approval for release is sent to the primary contact. A note will also be sent to the secondary contact stating that these files have been sent to the primary contact (for additional information see article entitled "Deposition Contact Persons" on page 7).

To facilitate issuance of the ident code, the Electronic Deposition Form should be filled out and returned electronically. It is important to pick up the latest version of the Electronic Deposition Form from FTP and fill it out completely and accurately. Coordinate data must be in PDB format (see the latest Format Description document) and (p)reprints of all journal articles referenced must be sent via facsimile (516-282-5751) or air mail.

-- Location for an Existing Entry

A PDB ident code is helpful for retrieving the file for a particular macromolecular structure. Tables of all PDB entries with their ident codes can be obtained electronically from FTP, Gopher, WWW, the PDB Listserver or by postal mail upon request. Also, some published articles reporting results of structural analyses of biological macromolecules provide the PDB ident code.
If you are searching PDB using FTP (also Gopher or WWW via "The PDB's Anonymous FTP"), there are two directories that are especially useful for locating ident codes. The first such directory is /index. This contains the following files, which are updated continuously:

    author.idx     - Ident code and author
    cmpd_res.idx   - Ident code, resolution and compound name
    compound.idx   - Ident code and full compound name from the COMPND records
    crystal.idx    - Ident code, unit-cell dimensions, space group and Z value
    entries.idx    - Ident code, classification, accession date, compound, source, author list, resolution and experiment type
    molecule.idx   - Ident code, resolution and macromolecule name
    resolu.idx     - Ident code and resolution
    source.idx     - Ident code and biological source from the SOURCE records
    src_simple.idx - Ident code and biological source

Other index files may be added from time to time.

The second useful directory is /newsletter/newsletterYYmon (e.g., /newsletter/newsletter94jan). This directory contains text (.txt) and PostScript (.ps) files of tables listing all currently available and pending entries.

You may retrieve and peruse index files, tables and directory listings to determine whether or not a structure of interest is available and, if it is, locate its ident code.

As entries are added to FTP between quarterly releases, they are placed in the directory /newly_released. Be sure to scan this directory for the latest PDB entries. The /current_release directory contains the entire up-to-date PDB collection, while the /fullrelease directory contains the last quarterly release.

Coordinate files are named pdb1abc.ent, structure factor files are named r1abcsf.ent and NMR experimental data files are named 1abc.mr, where 1abc is the PDB ident code.
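
The naming convention just described can be captured in a few lines of Python. This is a hedged sketch only: the function name is hypothetical, and the validation (exactly four alphanumeric characters, lower-cased as in the 1abc example) is an assumption drawn from the examples in this article rather than a stated PDB rule.

```python
def entry_filenames(ident_code):
    """Return the three file names used for one PDB ident code.

    Hypothetical helper. Assumes a four-character alphanumeric ident code
    and lower-case file names, as in the '1abc' example above.
    """
    code = ident_code.strip().lower()
    if len(code) != 4 or not code.isalnum():
        raise ValueError("expected a four-character ident code: %r" % ident_code)
    return {
        "coordinates": "pdb" + code + ".ent",       # atomic coordinate entry
        "structure_factors": "r" + code + "sf.ent",  # structure factor file
        "nmr_data": code + ".mr",                    # NMR experimental data
    }


names = entry_filenames("1ABC")
print(names["coordinates"])        # pdb1abc.ent
print(names["structure_factors"])  # r1abcsf.ent
print(names["nmr_data"])           # 1abc.mr
```

Which directory a given file lives in is not encoded here, since that depends on whether the entry is in /newly_released, /current_release or /fullrelease.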
----------------------------------------------------------------------
UNIX-Based Filters and Browser

The response to our PC-based PDB-Shell has been so positive that a similar system, PDB-Browse, has been developed for PDB's UNIX users. The PDB-Browse system consists of three parts.

The first of these builds indices (based on UNIX standard dbm files) of PDB entries as they exist on the user's computer system. The requirements for running this part are PERL and 50 Mbytes of disk space (the disk requirement can be lowered by indexing fewer fields in the PDB file). Currently indexed are the AUTHOR, COMPND, CRYST1, JRNL, SOURCE, EXPDAT, HET, FORMUL and REMARK records, plus the accession date, functional classification (as it appears in the HEADER record) and resolution fields, as well as the file location as a function of ident code.

The second part of PDB-Browse consists of a number of PERL programs that act as filters and scan the above indices. The programs auth.pl, comp.pl, cryst.pl, expd.pl, jrnl.pl, rem.pl and sour.pl scan the appropriate records. For example,

    expd.pl -a NMR

returns the ident codes of all (-a) entries for which NMR techniques were used to collect the data. (Adding a -v flag would report all non-NMR entries.) To find out which of these entries describe DNA, execute the command:

    expd.pl -a NMR | comp.pl DNA

which will return the entries satisfying both conditions. To find out which of these have a SYNTHETIC source, execute

    expd.pl -a NMR | comp.pl DNA | sour.pl SYNTHETIC

to produce the ident codes of those (currently 24 in number). On a Silicon Graphics workstation, with the indices residing on an NFS-mounted file system, the above query took less than 10 seconds.

A PERL program called loc.pl is also provided which, given an ident code, will provide the location of an entry.
Combining loc.pl with the pipeline, one could view each of the above files with the command:

    view `expd.pl -a NMR | comp.pl DNA | sour.pl SYNTHETIC | loc.pl`

PERL scripts are also provided to scan the accession date (before.pl and after.pl), functional classification (head.pl), and the resolution as it appears in remark 2 (resoluge.pl and resolule.pl). In addition, full.pl searches the full file for a particular expression including wild cards, list.pl lists an entire index and lookup.pl extracts certain records or fields from a list of ident codes. For example, to find the authors (not ident codes) from the above set, one would execute:

    expd.pl -a NMR | comp.pl DNA | sour.pl SYNTHETIC | lookup.pl auth

The third part of the PDB-Browse system, called browse, is a graphical user interface (GUI) front-end for all of these filters (as well as custom filters that may be written by the user). This front-end requires the user to install tcl and tk (Tool Command Language and Tool Kit, public domain utilities written by John Ousterhout of UC Berkeley and available via anonymous FTP from harbor.ecn.purdue.edu). It is also helpful, but not required, to have a graphical viewing program such as RASMOL (by Roger Sayle, available from ftp.dcs.ed.ac.uk) or MidasPlus (by Conrad Huang and Thomas Ferrin of UCSF).

All of the programs making up the PDB-Browse system are available in source form via each of PDB's distribution methods, in a directory named /pub/pdbbrowse. Before downloading, be sure to read the file ReadMeFirst in that directory. Please send all comments and suggestions for improvements to Dave Stampf (drs@bnl.gov).

----------------------------------------------------------------------
Affiliated Centers

Twenty-one affiliated centers offer DATAPRTP information for distribution. These centers are members of the Protein Data Bank Service Association (PDBSA).
Centers designated with an asterisk (*) may distribute DATAPRTP information both on-line and on magnetic or optical media; those without an asterisk are on-line distributors only.

BMERC
    BioMolecular Engineering Research Center
    College of Engineering, Boston University
    Boston, Massachusetts
    Kathleen Klose (617-353-7123)
    klose@darwin.bu.edu

*BIOSYM
    BIOSYM Technologies, Inc.
    San Diego, California
    Laurel Frey (619-546-5509)
    rcenter@biosym.com or laurel@biosym.com

CAN/SND
    Canadian Scientific Numeric Data Base Service
    Ottawa, Ontario, Canada
    Roger Gough (613-993-3294)
    cansnd@vm.nrc.ca

CAOS/CAMM
    Dutch National Facility for Computer Assisted Chemistry
    Nijmegen, The Netherlands
    Jan Noordik (+1 31-80-653386)
    noordik@caos.caos.kun.nl

*CCDC
    Cambridge Crystallographic Data Centre
    Cambridge, United Kingdom
    David Watson (+1 44-223-336394)
    dgwl@chemcrys.cam.ac.uk

CINECA
    NE Italy Interuniversity Computing Center
    Casalecchio di Reno (BO), Italy
    Laura Setti (+1 39-51-598411)
    asltco@icineca.cineca.it

EMBL
    European Molecular Biology Laboratory
    Heidelberg, Germany
    Hans Doebbeling (+1 49-6221-387-247)
    hans.doebbeling@embl-heidelberg.de

INN
    Israeli National Node
    Weizmann Institute of Science
    Rehovot, Israel
    Leon Esterman (+1 972-8-343934)
    lsestern@weizmann.weizmann.ac.il

*JAICI
    Japan Association for International Chemical Information
    Tokyo, Japan
    Hideaki Chihara (+1 81-3-5978-3608)

*MAG
    Molecular Applications Group
    Palo Alto, California
    Hilary Jensen (415-473-3039)
    hilary@suerte.mag.com

*MSI
    Molecular Simulations Inc.
    Burlington, Massachusetts
    Lance J. Ransom Wright (617-229-9800)
    lance@msi.com

NCHC
    National Center for High-Performance Computing
    Hsinchu, Taiwan, ROC
    Jyh-Shyong Ho (+1 886-35-776085; ex: 342)
    c00jsh00@nchc.gov.tw

NCSA
    National Center for Supercomputing Applications
    University of Illinois at Urbana-Champaign
    Champaign, Illinois
    Marcia Miller (217-244-0634)
    mmiller@ncsa.uiuc.edu

National Center for Biotechnology Information
    National Library of Medicine
    National Institutes of Health
    Bethesda, Maryland
    Stephen Bryant (301-496-2475)
    bryant@ncbi.nlm.nih.gov

*OML
    Oxford Molecular Ltd.
    Oxford, United Kingdom
    Steve Gardner (+1 44-865-784600)
    steve@gardner.demon.co.uk

*Osaka University
    Institute for Protein Research
    Osaka, Japan
    Yoshiki Matsuura (+1 81-6-879-8605)
    matsuura@protein.osaka-u.ac.jp

Pittsburgh Supercomputing Center
    Pittsburgh, Pennsylvania
    Hugh Nicholas (412-268-4960)
    nicholas@cpwpsca.bitnet

*Protein Science
    Princeton, New Jersey
    Joseph Villafranca (609-252-3573)
    villafranca@bms.com

SDSC
    San Diego Supercomputer Center
    San Diego, California
    Lynn Ten Eyck (619-534-8189)
    teneyckl@sdsc.bitnet

SEQNET
    Daresbury Laboratory
    Warrington, United Kingdom
    User Interface Group (+1 44-925-603351)
    uig@daresbury.ac.uk

*Tripos
    Tripos Inc.
    St. Louis, Missouri
    Akbar Nayeem (314-647-1099; ex: 3224)
    akbar@tripos.com

----------------------------------------------------------------------
To Contact PDB

    Protein Data Bank
    Chemistry Department, Bldg. 555
    Brookhaven National Laboratory
    P.O. Box 5000
    Upton, NY 11973-5000 USA

    Telephone: +1 516-282-3629
    Facsimile: +1 516-282-5751

    Internet:  pdb@bnl.gov (general correspondence)
               orders@pdb.pdb.bnl.gov (order information)
               sysadmin@pdb.pdb.bnl.gov (network services)
               listserv@pdb.pdb.bnl.gov (Listserver subscriptions)
               pdb-l@pdb.pdb.bnl.gov (Listserver postings)

Please include your name, postal mailing address, e-mail address, facsimile number and telephone number in all correspondence.
----------------------------------------------------------------------
Statement of Support

PDB is supported by a combination of Federal Government Agency funds (work supported by the U.S. National Science Foundation; the U.S. Public Health Service, National Institutes of Health, National Center for Research Resources, National Institute of General Medical Sciences and National Library of Medicine; and the U.S. Department of Energy under contract DE-AC02-76CH00016) and user fees.

----------------------------------------------------------------------
PDB Staff

    Joel L. Sussman, Head
    David R. Stampf, Sr. Project Mgr.
    Enrique E. Abola, Science Coordinator
    Frances C. Bernstein
    Judith A. Callaway
    Minette Cummings
    Betty R. Deroski
    Pamela A. Esposito
    Arthur Forman
    Thomas F. Koetzle
    Patricia A. Langdon
    Michael D. Libeson
    Nancy O. Manning
    John E. McCarthy
    Regina K. Shea
    John G. Skora
    Karen E. Smith
    Dejun Xue

----------------------------------------------------------------------