From cavanaug from ncbi.nlm.nih.gov Tue Apr 14 11:43:10 2009
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Tue Apr 14 19:22:55 2009
Subject: [Genbank-bb] GenBank 171.0 Close-Of-Data
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC05C83391@NIHCESMLBX15.nih.gov>

Greetings GenBank Users,

Close-of-data for the upcoming GenBank Release 171.0 occurred on Friday
April 10 2009 at approximately 1:30am EDT. The subsequently generated
GenBank Incremental Update files nc0410.aso, nc0410.flat, etc. contain
data through the close.

Note: Release processing often does not begin until sometime during
business hours on the close date. As a result, a number of sequence
records processed *after* 1:30am are likely to be present in the
GenBank 171.0 release files, even though they are "post-close".
Similarly, the first GenBank Incremental Update that is generated after
the close date is likely to contain a number of sequence records that
are unchanged, compared to their appearance in the release files.

We expect to make the GenBank 171.0 data files available later today.
Our apologies for the lack of advance notice about the close date.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

From cavanaug from ncbi.nlm.nih.gov Tue Apr 14 18:14:06 2009
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Tue Apr 14 19:23:04 2009
Subject: [Genbank-bb] GenBank Release 171.0 Now Available
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC05C834E0@NIHCESMLBX15.nih.gov>

Greetings GenBank Users,

GenBank Release 171.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 171.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 171.0

Close-of-data for GenBank 171.0 occurred on 04/10/2009.
Uncompressed, the Release 171.0 flatfiles require roughly 395 GB
(sequence files only) or 422 GB (including the 'short directory',
'index' and the *.txt files). The ASN.1 data require approximately
360 GB.

Recent statistics for non-WGS, non-CON sequences:

  Release  Date      Base Pairs    Entries

  170      Feb 2009  101467270308  101815678
  171      Apr 2009  102980268709  103335421

Recent statistics for WGS sequences:

  Release  Date      Base Pairs    Entries

  170      Feb 2009  143797800446  49036947
  171      Apr 2009  144522542010  48948309

During the 56 days between the close dates for GenBank Releases 170.0
and 171.0, the non-WGS/non-CON portion of GenBank grew by 1,512,998,401
basepairs and by 1,519,743 sequence records. During that same period,
1,040,778 records were updated. An average of about 45,723
non-WGS/non-CON records were added and/or updated per day.

Between releases 170.0 and 171.0, the WGS component of GenBank grew by
724,741,564 basepairs and the number of records **decreased** by 88,638.
A decrease in the overall number of WGS records can occasionally occur,
as a result of genome re-assemblies which yield larger (but fewer)
records, and due to the submission of completed genomes which supersede
WGS projects.

For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 171.0 and Upcoming Changes) have been appended
below for your convenience.

                ** Important Notes **

* This is the final release for which the PROJECT linetype will be
  present in GenBank flatfiles. The new DBLINK linetype replaces
  PROJECT. Post-171.0 GenBank Update files will have only the new
  DBLINK linetype within about two weeks, and the same will be true of
  all future complete releases. See Section 1.3.1 for more information.

* GenBank 'index' files are now provided without any EST content, and
  without most GSS content.
  See Section 1.3.5 of the release notes for further details. NCBI is
  considering ceasing support for the index files, so we encourage
  affected users to review that section and provide feedback.

Release 171.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (eg: April 15 2009, 171.0). If they are not,
interrupt the remaining transfers and then request assistance from the
NCBI Service Desk.

A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, perhaps:

        gzcat $i | head -10 | grep Release

If you encounter problems while ftp'ing or uncompressing Release 171.0,
please send email outlining your difficulties to:

        info@ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov
GenBank
NCBI/NLM/NIH/HHS

1.3 Important Changes in Release 171.0

1.3.1 PROJECT linetype to be replaced by DBLINK (April 2009)

  The new DBLINK linetype was introduced as of the February 2009
GenBank Release 170.0. Genome Project IDs and Trace Assembly Archive
IDs are now presented via DBLINK, in conjunction with the legacy
PROJECT linetype, as this mock-up for CP000964 illustrates:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471
DBLINK      Project:28471
            Trace Assembly Archive:123456
....
COMMENT     The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

  PROJECT and DBLINK have co-existed for GenBank releases 170.0 and
171.0. But subsequent to this April release, the PROJECT line will be
removed from the flatfile format. In its final state, the above mock-up
for CP000964 becomes:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
DBLINK      Project:28471
            Trace Assembly Archive:123456
....
COMMENT     The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

  In summary: The PROJECT linetype will cease to be displayed in
post-171.0 GenBank Update data products within the next two weeks. The
same will be true of all future complete GenBank releases.

1.3.2 Organizational changes

The total number of sequence data files increased by 34 with this
release:

  - the BCT division is now composed of  40 files (+2)
  - the CON division is now composed of 131 files (+3)
  - the ENV division is now composed of  13 files (+1)
  - the EST division is now composed of 860 files (+22)
  - the GSS division is now composed of 335 files (+13)
  - the INV division is now composed of  15 files (+1)
  - the MAM division is now composed of   5 files (+1)
  - the PAT division is now composed of  67 files (+2)
  - the PLN division is now composed of  38 files (+1)

1.3.3 CON-division records for 'segmented sets' restored.

  A previously overlooked problem in Release 170.0 processing resulted
in the exclusion of roughly 14,000 CON-division entries for a type of
record referred to as a 'segmented set'.
Segmented sets consist of small sequence fragments of an incompletely
sequenced molecule, packaged together, with a top-level sequence that
specifies the order of the underlying fragments. That top-level
sequence can be displayed as a CON division record. AH000819 is an
example:

  ASN.1 for AH000819:

    http://www.ncbi.nlm.nih.gov/nuccore/405204?report=asn1&log$=seqview

  Flatfile view of the nine sequenced fragments:

    http://www.ncbi.nlm.nih.gov/nuccore/405204

  These (largely legacy) CON-division records have been restored in
Release 171.0 and can be found in gbcon131.seq.

1.3.4 File header problem for EST and GSS files

  A new method of generating the EST and GSS sequence files has been
developed, which has reduced the time required to generate a GenBank
release by one day. However, a minor problem in the formatting of the
header of the sequence files was inadvertently introduced: a leading
space exists before the filename on the very first line. For example
(note the extra space before the filename):

     GBGSS100.SEQ         Genetic Sequence Data Bank
                               April 15 2009

  It should be:

    GBGSS100.SEQ         Genetic Sequence Data Bank
                              April 15 2009

  The problem affects all EST files and most GSS files. We doubt that
it will cause significant problems for users; however, the problem will
be corrected for our next release.

1.3.5 Changes in the content of index files

  As described in the GB 153 release notes, the 'index' files which
accompany GenBank releases (see Section 3.3) are considered to be a
legacy data product by NCBI, generated mostly for historical reasons.
FTP statistics of January 2005 seem to support this: the index files
were transferred only half as frequently as the files of sequence
records. The inherent inefficiencies of the index file format also lead
us to suspect that they have little serious use by the user community,
particularly for EST and GSS records.

  The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0).
The required multi-server queries which obtained and sorted many
millions of rows of terms from several different databases simply
outgrew the capacity of the hardware used for GenBank Release
generation.

  Our short-term solution is to cease generating some index-file
content for all EST sequence records, and for GSS sequence records that
originate via direct submission to NCBI.

  The three gbacc*.idx index files continue to reflect the entirety of
the release, including all EST and GSS records; however, the file
contents are unsorted.

  These 'solutions' are really just stop-gaps, and we will likely
pursue one of two options:

  a) Cease support of the 'index' file products altogether.

  b) Provide new products that present some of the most useful data
     from the legacy 'index' files, and cease support for other types
     of index data.

  If you are a user of the 'index' files associated with GenBank
releases, we encourage you to make your wishes known, either via the
GenBank newsgroup, or via email to NCBI's Service Desk:

        info@ncbi.nlm.nih.gov

  Our apologies for any inconvenience that these changes may cause.

1.3.6 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems,
depending on their origin, and the dumps from those systems occur in
parallel. Because the second dump (for example) has no prior knowledge
of exactly how many GSS files will be dumped by the first, it does not
know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers
for seventy-two of the GSS flatfiles in Release 171.0. Consider
gbgss264.seq :

GBGSS1.SEQ           Genetic Sequence Data Bank
                          April 15 2009

                NCBI-GenBank Flat File Release 171.0

                           GSS Sequences (Part 1)

   87206 loci,    64290178 bases, from    87206 reported sequences

  Here, the filename and part number in the header are "1", though the
file has been renamed as "264" based on the number of files dumped from
the other system.
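  Users who want to locate the affected files themselves could compare
each file's name on disk with the name recorded on the first header
line. The following is only a minimal sketch in Python, not an NCBI
tool: the function names are hypothetical, and it assumes uncompressed
flatfiles whose first line begins with a name of the form GBxxxN.SEQ,
optionally preceded by the stray leading space described in Section
1.3.4.

```python
import os
import re

# Hypothetical pattern: optional leading whitespace, then a header name
# such as GBGSS1.SEQ (the digits are optional for single-file divisions).
HEADER_NAME = re.compile(r"\s*(GB[A-Z]+\d*\.SEQ)\b")


def header_filename(first_line):
    """Return the lower-cased filename found in a header line, or None."""
    match = HEADER_NAME.match(first_line)
    return match.group(1).lower() if match else None


def check_header(path):
    """Compare a flatfile's on-disk name with the name in its header."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        first_line = handle.readline()
    on_disk = os.path.basename(path).lower()
    in_header = header_filename(first_line)
    return {
        "file": on_disk,
        "header": in_header,
        # True when the Section 1.3.4 leading-space problem is present.
        "leading_space": first_line.startswith(" "),
        # False when the header names a different file (Section 1.3.6).
        "matches": in_header == on_disk,
    }
```

  Calling check_header() on each gb*.seq file and listing those where
"matches" is false should reveal the renamed GSS files, provided the
assumptions above hold for your copy of the release.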
  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.

1.4 Upcoming Changes

  There are no scheduled format changes for GenBank.
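
  For users who maintain their own flatfile parsers, the PROJECT to
DBLINK transition described in Section 1.3.1 could be handled by
reading both linetypes and preferring DBLINK when it is present. The
Python sketch below is illustrative only: the function name is
hypothetical, and it assumes the usual flatfile layout in which the
keyword occupies the first 12 columns and continuation lines leave
those columns blank.

```python
def project_ids(record_text):
    """Collect project IDs from DBLINK lines, else legacy PROJECT lines."""
    from_dblink = []
    from_project = []
    current = None  # keyword of the linetype currently being read

    for line in record_text.splitlines():
        keyword = line[:12].strip()   # assumed 12-column keyword field
        value = line[12:].strip()
        if keyword:
            current = keyword         # a new linetype starts here
        if current == "DBLINK" and value.startswith("Project:"):
            from_dblink.append(value.split(":", 1)[1].strip())
        elif current == "PROJECT" and value.startswith("GenomeProject:"):
            from_project.append(value.split(":", 1)[1].strip())

    # Prefer DBLINK; fall back to PROJECT for older releases and updates.
    return from_dblink or from_project
```

  Written this way, the same code should accept records from releases
170.0 and 171.0, which carry both linetypes, as well as later data
products that carry only DBLINK.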