FASTA
FASTA is a sequence alignment algorithm first described by David J. Lipman and William R. Pearson in 1985 (Rapid and sensitive protein similarity searches).
Many bioinformatics programs require gene or protein sequence data in FASTA format, as in the example below.
>sequence name
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
FASTA is a sequence alignment package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985 in the article "Rapid and sensitive protein similarity searches". The original FASTP program was designed for protein sequence similarity searching. FASTA, described in 1988 ("Improved tools for biological sequence comparison"), added the ability to do DNA:DNA searches and translated protein:DNA searches, and provided a more sophisticated shuffling program for evaluating statistical significance. The package contains several programs for aligning protein sequences and DNA sequences. FASTA is pronounced "FAST-Aye" and stands for "FAST-All", because it works with any alphabet; it is an extension of the "FAST-P" (protein) and "FAST-N" (nucleotide) alignment programs.
The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (with frameshifts), and ordered or unordered peptide searches. In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm. A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance or whether it can be used to infer homology. The FASTA package is available from ftp.virginia.edu/pub/fasta.
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. A sequence ends when another line starting with ">" appears, which marks the start of the next sequence. An example of a sequence in FASTA format:
>gi|5524211|gb|AAD44166.1| cytochrome b Elephas maximus maximus
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
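The format rules above can be illustrated with a minimal reader, sketched here in Python. This is not code from the FASTA package. The names `parse_fasta` and `_make_record` are hypothetical, and the sketch assumes that blank lines and any text before the first ">" line can be ignored.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (identifier, description, sequence)


def _make_record(header: str, chunks: List[str]) -> Record:
    # The first word after ">" is the identifier; the rest is the description.
    # Both are optional, so either may come back as an empty string.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one (identifier, description, sequence) tuple per record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new ">" line ends the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Sequence data may span many lines; join them without separators.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    text = (
        ">gi|5524211|gb|AAD44166.1| cytochrome b Elephas maximus maximus\n"
        "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV\n"
        "IENY\n"
    )
    for identifier, description, sequence in parse_fasta(text.splitlines()):
        print(identifier, "|", description, "|", len(sequence))
```

Because the function is a generator, it can be fed an open file object directly, so large sequence files do not need to be loaded into memory at once.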
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
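The clean-up described above might be sketched as follows. The function name `normalize_query` and its parameter are hypothetical, and two choices are assumptions rather than rules from the text: whitespace is dropped, and when a replacement letter is given, each digit is replaced by one copy of it.

```python
from typing import Optional


def normalize_query(seq: str, replace_digits_with: Optional[str] = None) -> str:
    """Upper-case a query sequence and remove or replace numerical digits."""
    out = []
    for ch in seq:
        if ch.isspace():
            continue  # assumption: whitespace carries no sequence information
        if ch in "0123456789":
            # Digits are either dropped or replaced by a letter code such as
            # "N" (unknown nucleic acid residue) or "X" (unknown amino acid).
            if replace_digits_with is not None:
                out.append(replace_digits_with)
            continue
        out.append(ch.upper())  # lower-case letters map to upper-case
    return "".join(out)


if __name__ == "__main__":
    print(normalize_query("acgt 10 acgt"))  # digits removed
    print(normalize_query("ac1gt", "N"))    # digit replaced by N
```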
The nucleic acid codes supported are:
Nucleic Acid Code | Meaning |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
U | Uracil |
R | G A (puRine) |
Y | T C (pYrimidine) |
K | G T (Ketone) |
M | A C (aMino group) |
S | G C (Strong interaction) |
W | A T (Weak interaction) |
B | G T C (not A) (B comes after A) |
D | G A T (not C) (D comes after C) |
H | A C T (not G) (H comes after G) |
V | G C A (not T, not U) (V comes after U) |
N | A G C T (aNy) |
- | gap of indeterminate length |
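As an illustration of how the ambiguity codes in this table can be used, the following Python sketch stores each code with the bases it stands for and tests whether two codes could represent the same base. The names `IUPAC_NUCLEOTIDES` and `codes_compatible` are hypothetical. Treating U as equivalent to T is an assumption made for the comparison, and the gap character is deliberately left out of the mapping.

```python
# Bases represented by each nucleic acid code, copied from the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def _bases(code: str) -> set:
    # Assumption: U is compared as if it were T, so DNA and RNA codes can match.
    return set(IUPAC_NUCLEOTIDES[code.upper()].replace("U", "T"))


def codes_compatible(a: str, b: str) -> bool:
    """Return True if the two codes share at least one possible base."""
    return bool(_bases(a) & _bases(b))


if __name__ == "__main__":
    print(codes_compatible("R", "A"))  # R covers G or A
    print(codes_compatible("R", "Y"))  # purines and pyrimidines do not overlap
```

A code that is not in the table, including the gap character, raises `KeyError` in this sketch, which keeps unexpected input visible rather than silently ignored.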
The amino acid codes supported are:
Amino Acid Code | Meaning |
---|---|
A | Alanine |
B | Aspartic acid or Asparagine |
C | Cysteine |
D | Aspartic acid |
E | Glutamic acid |
F | Phenylalanine |
G | Glycine |
H | Histidine |
I | Isoleucine |
K | Lysine |
L | Leucine |
M | Methionine |
N | Asparagine |
P | Proline |
Q | Glutamine |
R | Arginine |
S | Serine |
T | Threonine |
U | Selenocysteine |
V | Valine |
W | Tryptophan |
Y | Tyrosine |
Z | Glutamic acid or Glutamine |
X | any |
* | translation stop |
- | gap of indeterminate length |
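A simple check against the amino acid codes listed above could look like the following Python sketch. The names `AMINO_ACID_CODES` and `invalid_residues` are hypothetical; the accepted alphabet is taken only from this table, and lower-case letters are treated as their upper-case equivalents.

```python
# Every code from the amino acid table, including U, X, "*" and the gap "-".
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_residues(seq: str) -> list:
    """Return (position, character) pairs for characters not in the table."""
    return [
        (i, ch)
        for i, ch in enumerate(seq)
        if ch.upper() not in AMINO_ACID_CODES
    ]


if __name__ == "__main__":
    print(invalid_residues("LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWG"))
    print(invalid_residues("MKJO*"))  # J and O are not in the table above
```

Positions are zero-based, so they can be used directly as indexes into the original string.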