2010-12-20 15:10:33 UTC
I'd like to be able to use the "Get Microbial Data" tool in our
local Galaxy install, which appears as though it could allow
access to a local copy of the NCBI "Bacteria" FTP site.
From looking at the tool's source code, I see I must populate the
microbial_data.loc file; however, the microbial_data.loc.sample
is not very helpful:
#This is a sample file distributed with Galaxy that enables tools
#to retrieve microbial data via a URL
What this doesn't tell me is the meaning of the columns. Apparently
this is really three tables in one, determined by the first entry.
ORG entries are used by this tool for the selection of the kingdom
and species. They appear to have the following columns, one per
0. The "ORG" column itself, not counted in the XML offsets
5. Comma separated list of chromosomes/plasmids
6. URL for NCBI genome project
The CHR entries don't seem to be used directly by this tool.
There is one entry per chromosome/plasmid.
0. The "CHR" entry, not counted in the XML offsets
2. Description including species and chromosome/plasmid
4. Length of sequence (nucleotides)
5. GI number
7. URL for NCBI nucleotide database
Then there are the DATA entries, which appear to reference
local files. There are multiple DATA entries per CHR entry:
0. The "DATA" entry, not counted in the XML offsets
1. Identifier (composite of ORG id, CHR id, and data type)
2. Identifier of ORG line
3. Identifier of CHR line
4. Data type (CDS, tRNA, rRNA, sequence, GeneMark, Glimmer3)
5. File format (fasta or bed)
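To make the "three tables in one" reading concrete, here is a minimal
Python sketch that splits loc lines by their first entry. It assumes the
file is tab-separated and that lines starting with # are comments; neither
is stated in the sample file. The names split_loc_tables and data_record
are made up for illustration, and only the DATA columns listed above are
given names.

```python
from collections import defaultdict

# DATA columns 1-5 as listed above; any later columns are kept unnamed.
DATA_FIELDS = ["id", "org_id", "chr_id", "data_type", "file_format"]


def split_loc_tables(lines):
    """Group loc lines into tables keyed by the first field (ORG, CHR, DATA).

    Assumption: fields are tab-separated and '#' starts a comment line.
    """
    tables = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Column 0 is the record type; store only the remaining columns.
        tables[fields[0]].append(fields[1:])
    return tables


def data_record(columns):
    """Name the first five DATA columns; keep the remainder under 'extra'."""
    record = dict(zip(DATA_FIELDS, columns))
    record["extra"] = columns[len(DATA_FIELDS):]
    return record
```

Calling split_loc_tables(open("microbial_data.loc")) would then give one
list of rows per record type, and data_record() can be applied to each
row under the "DATA" key.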
What I want to do is generate a microbial_data.loc file
from a local mirror of ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
In addition to understanding the loc file format, it also seems
I need to generate some bed files from the NCBI provided
data, e.g. for NC_008265 which is one of the examples in
the sample loc files, I'd need the following files:
Referring to the NCBI FTP site for this organism, we have:
I can see for example how to map *.ptt (protein tables) into *.CDS.bed,
and similarly for the Glimmer3 and GeneMark predictions. I could
also probably parse *.gbk to generate bed tabular files for any
annotated tRNA and rRNA entries (and the CDS entries, of course).
But rather than reinventing the wheel, how do you do this at Penn State?
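As an illustration of the *.ptt mapping mentioned above, a minimal Python
sketch might look like the following. It assumes the usual NCBI protein
table layout (a title line, a count line, a header line, then tab-separated
rows beginning with a 1-based "start..end" location, strand, length, PID,
gene and synonym). The function name ptt_to_bed, the choice of the synonym
column as the BED name, and the constant score of 0 are all assumptions;
the exact bed columns the Galaxy tool expects are not confirmed here.

```python
def ptt_to_bed(ptt_lines, chrom):
    """Yield tab-separated BED6 lines for each protein row in a *.ptt table.

    Assumption: data rows start with a 'start..end' location in 1-based
    inclusive coordinates; rows that do not match (title, protein count,
    column header) are skipped.
    """
    for line in ptt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or ".." not in fields[0]:
            continue
        start_text, end_text = fields[0].split("..", 1)
        if not (start_text.isdigit() and end_text.isdigit()):
            continue
        start, end = int(start_text), int(end_text)
        strand = fields[1]
        name = fields[5]  # synonym (locus tag) column, used as the BED name
        # BED is 0-based and half-open, so only the start coordinate shifts.
        yield "\t".join([chrom, str(start - 1), str(end), name, "0", strand])
```

For example, passing the lines of NC_008265.ptt with chrom="NC_008265"
would produce one BED line per protein, which could then be written out
as the corresponding *.CDS.bed file.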
Also, I'd like to offer access to the chromosome, CDS, tRNA, and rRNA
sequences themselves (as FASTA files, not just bed tabular). Am I right
that currently the "Get Microbial Data" tool doesn't offer this?