Recognition of Genes in DNA Sequences
One more data set, donated by Noordewier et al. [1], was downloaded from ftp://ftp.ncc.up.pt/pub/statlog/. The authors of the original work used knowledge-based neural networks to model the existence of splice junctions within a DNA sequence based on 60 nucleotides, 30 on each side of a given point. Each symbolic variable representing one of the four nucleotides (A, C, G and T) was coded as a set of three binary variables (100, 010, 001 and 000, respectively). As a result, there were 180 input variables (coded #1, #2, ..., #180) for each sample, and each sample belonged to one of three output classes: an EI site, an IE site or neither.

The data set came with separate training and validation sample files, and the training file contained a different number of samples for each output class. To keep the results as comparable as possible with those in [1] (where the number of samples is slightly different from that of the downloaded data), the sample sets were left unchanged for this experiment: 2000 training and 1186 validation samples.

To cut computation time, a 180 x 18 x 3 network was first trained five times as a preliminary process. The ten most important inputs were selected based on the interpretation of the trained network, so the automatic feature selection process started with a network structure having ten input and three output nodes. With the numbering of the original inputs, the ten most important ones were: a: #82; b: #84; c: #85; d: #90; e: #93; f: #94; g: #95; h: #96; i: #97; and j: #105. Each training session continued for 300 cycles, and the process was repeated to get a total of six sets of results. The order in which the inputs were deleted, the average MCSR* and the average CASR** at each iteration during the six processes are shown in the table below. For the last iteration, the average CASR was not calculated (shown as NA) where the MCSR was zero.
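The three-bit coding described above can be illustrated with a minimal sketch. The function name `encode_window` and the example window are hypothetical, and the sketch assumes the inputs are laid out nucleotide by nucleotide, three bits each; it is not the original preprocessing code.

```python
# Three-bit code taken from the text: A=100, C=010, G=001, T=000.
CODE = {"A": (1, 0, 0), "C": (0, 1, 0), "G": (0, 0, 1), "T": (0, 0, 0)}


def encode_window(seq):
    """Encode a 60-nucleotide window as a flat list of 180 binary inputs.

    Assumption: inputs #1..#3 belong to the first nucleotide,
    #4..#6 to the second, and so on.
    """
    if len(seq) != 60:
        raise ValueError("expected a window of 60 nucleotides")
    bits = []
    for base in seq.upper():
        bits.extend(CODE[base])  # raises KeyError for anything but A, C, G, T
    return bits


# Made-up window, only to show the shape of the result.
window = "ACGT" * 15
inputs = encode_window(window)
```

Under this layout, T is represented by all three bits being zero, so no single input directly signals a T.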
| Measure | Iteration | Prel. | Process 1 | Process 2 | Process 3 | Process 4 | Process 5 | Process 6 |
|---|---|---|---|---|---|---|---|---|
| Inputs Used | 1 | All 180 | abcdefghij | abcdefghij | abcdefghij | abcdefghij | abcdefghij | abcdefghij |
| | 2 | | abcdefgh j | abcdefgh j | abcdefgh j | abcdefgh j | abcdefgh j | abcdefgh j |
| | 3 | | abcdefgh | bcdefgh j | bcdefgh j | bcdefgh j | bcdefgh j | bcdefgh j |
| | 4 | | bcdefgh | bcdefgh | cdefgh j | bcdefgh | bcdefgh | bcdefgh |
| | 5 | | cdefgh | cdefgh | cdefgh | cdefgh | cdefgh | cdefgh |
| | 6 | | cdef h | cdef h | cdef h | cdef h | cdef h | cdef h |
| | 7 | | cdef | cdef | cdef | cde h | cdef | def h |
| | 8 | | def | def | def | de h | c ef | def |
| | 9 | | de | de | de | de | c e | de |
| | 10 | | d | d | d | d | e | d |
| Average MCSR | 1 | 92.071 | 93.714 | 93.740 | 93.273 | 93.643 | 93.714 | 93.571 |
| | 2 | | 91.280 | 91.617 | 92.460 | 91.023 | 91.881 | 92.435 |
| | 3 | | 82.571 | 87.789 | 87.723 | 88.647 | 88.317 | 88.911 |
| | 4 | | 83.929 | 83.929 | 85.809 | 83.929 | 84.071 | 84.143 |
| | 5 | | 85.000 | 85.000 | 85.000 | 85.000 | 85.000 | 85.000 |
| | 6 | | 73.571 | 73.571 | 73.571 | 73.571 | 73.571 | 73.571 |
| | 7 | | 62.500 | 62.500 | 62.500 | 62.500 | 62.500 | 74.286 |
| | 8 | | 63.214 | 63.214 | 63.214 | 63.214 | 59.736 | 63.214 |
| | 9 | | 52.143 | 52.143 | 52.143 | 52.143 | 51.429 | 52.143 |
| | 10 | | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Average CASR | 1 | 93.052 | 94.334 | 94.165 | 94.165 | 94.283 | 94.148 | 94.384 |
| | 2 | | 93.153 | 93.288 | 93.440 | 93.255 | 93.390 | 93.491 |
| | 3 | | 90.320 | 92.209 | 92.243 | 92.411 | 92.310 | 92.293 |
| | 4 | | 90.304 | 90.304 | 91.315 | 90.472 | 90.067 | 90.405 |
| | 5 | | 89.241 | 89.713 | 89.477 | 89.241 | 89.713 | 88.887 |
| | 6 | | 84.435 | 83.895 | 84.435 | 83.963 | 83.659 | 84.132 |
| | 7 | | 79.848 | 79.848 | 79.848 | 79.933 | 79.848 | 75.717 |
| | 8 | | 73.187 | 73.187 | 73.187 | 73.187 | 68.718 | 73.187 |
| | 9 | | 70.658 | 70.658 | 70.658 | 70.658 | 66.105 | 70.658 |
| | 10 | | NA | NA | NA | NA | NA | NA |
As inputs were deleted, the MCSR and CASR values decreased as an overall trend. With the ten selected inputs, all the rates were higher than those obtained with all 180 inputs, both in this experiment and in the results reported by Noordewier et al. [1]. Given the way the nucleotides were coded, the importance of inputs d and e (i.e., #90 and #93 in the sequence) indicates that, to determine whether a splice junction exists at a given point, and its type if it does, it is very important to know whether or not the neighboring nucleotides are of type G. If the inputs are numbered consecutively with three per nucleotide, #90 and #93 are the third bits (the G indicators) of nucleotides 30 and 31, the two nucleotides on either side of the point. Further interpretation and verification of the results from the last two experiments will be left to specialists with corresponding domain knowledge.
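The mapping from an input number back to a sequence position and nucleotide can be sketched as follows. This assumes the inputs are numbered consecutively, three per nucleotide, in the bit order A, C, G given by the coding (100, 010, 001); the function name `describe_input` is hypothetical.

```python
# Bit order within each nucleotide, following the coding A=100, C=010, G=001.
BASE_FOR_BIT = ("A", "C", "G")

# The ten selected inputs, as labeled in the text.
SELECTED = {"a": 82, "b": 84, "c": 85, "d": 90, "e": 93,
            "f": 94, "g": 95, "h": 96, "i": 97, "j": 105}


def describe_input(index):
    """Return (position, nucleotide) for a 1-based input number.

    Assumption: inputs #1..#3 encode nucleotide 1, #4..#6 nucleotide 2, etc.
    Position is 1-based within the 60-nucleotide window.
    """
    if not 1 <= index <= 180:
        raise ValueError("input number must be between 1 and 180")
    position, bit = divmod(index - 1, 3)
    return position + 1, BASE_FOR_BIT[bit]


for label, index in SELECTED.items():
    position, base = describe_input(index)
    print(f"{label}: #{index} -> position {position}, indicator for {base}")
```

Under this assumption, all ten selected inputs fall between positions 28 and 35, close to the candidate junction between positions 30 and 31.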
Reference:
[1] M. O. Noordewier, G. G. Towell and J. W. Shavlik, "Training knowledge-based neural networks to recognize genes in DNA sequences," Advances in Neural Information Processing Systems (R. P. Lippmann, J. E. Moody and D. S. Touretzky, Eds.), Morgan Kaufmann Publishers, San Mateo, CA, vol. 3, pp. 530-536, 1991.
__________
*MCSR: The Minimum Class Success Rate is the lowest success rate among all the target classes. The average MCSR is the MCSR averaged over the five training sessions within each process.
**CASR: The Class Average Success Rate is the success rate averaged over all the target classes. The average CASR is the CASR averaged over the five training sessions within each process.
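As a rough illustration of the two definitions, the sketch below computes per-class success rates, then the MCSR and CASR, for one set of predictions. The function names and the toy labels are made up for the example and are not taken from the experiment.

```python
def class_success_rates(true_labels, predicted_labels):
    """Percentage of correctly classified samples for each target class."""
    totals, hits = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == guess)
    return {cls: 100.0 * hits[cls] / totals[cls] for cls in totals}


def mcsr(rates):
    """Minimum Class Success Rate: the lowest per-class rate."""
    return min(rates.values())


def casr(rates):
    """Class Average Success Rate: the unweighted mean of per-class rates."""
    return sum(rates.values()) / len(rates)


# Toy labels (hypothetical): "N" stands for the "neither" class.
true = ["EI", "EI", "IE", "IE", "N", "N", "N", "N"]
pred = ["EI", "IE", "IE", "IE", "N", "N", "N", "EI"]
rates = class_success_rates(true, pred)
print(rates, mcsr(rates), casr(rates))
```

Because both measures are built from per-class rates, neither is dominated by the largest class. A network driven by a single binary input can produce at most two distinct outputs, so at least one of the three classes is never predicted, which would be consistent with the zero MCSR in the last iteration.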