Recent advances in genetic testing have produced novel ways of mapping an individual’s gene expression. Microarrays, biotechnological chips used for creating graphical representations of gene expression levels, are tools that have led to many groundbreaking discoveries about the human genome. By visualizing the unique and intricate structure of an individual’s genetic profile, microarrays offer a promising gateway to diagnosing genetic diseases.
A genetic profile refers to information about specific genes, including variations and gene expression in an individual or a certain type of tissue. Gene expression involves the transcription of information inside DNA into coding RNA (mRNA) or non-coding RNA (ncRNA) sequences. In the case of diseases like cancer, mutations in the original genes lead to harmful changes in protein production and ncRNA behaviour. Protein production can be altered by the transcription of mutated mRNA sequences, leading to the up- or downregulation of gene products, or to malfunctioning proteins. Meanwhile, defective ncRNA can lead to adverse epigenetic effects, such as gene silencing or altered DNA methylation.
The method of gene analysis chosen in this paper is non-coding RNA profiling by array, which involves the extraction of ncRNA sequences (especially miRNA) from tissue. Non-coding sequences account for roughly 98% of the human genome. The extracted ncRNA is reverse-transcribed into complementary DNA sequences, which are then put through microarray analysis. This involves the detection of targeted gene sequences through microscopic probes and a comparison to a control gene profile using coloured fluorescent dyes (figure 1). The result of microarray analysis is a graphically represented profile of an individual’s normal and abnormal levels of expression in each gene. Though too complex for manual analysis, the differences between gene expression levels in healthy individuals and cancer patients can be analyzed with artificial neural networks, computational models loosely inspired by the brain.
A massive database of labelled gene expression data was vital for reliable AI training. Fortunately, the GEO (Gene Expression Omnibus) database, maintained by the NCBI, holds thousands if not millions of labelled hybridization arrays, chips, and microarrays.
However, most datasets of Homo sapiens gene expression contained at most a thousand microarray samples, which is insufficient for deep learning. To obtain a sufficient quantity of data, labelled microarrays from a massive 2018 research project titled "Integrated extracellular microRNA profiling for ovarian cancer screening" were used. This database provided not only 40,000 patients’ miRNA profiles, but also labelled them with 12 different types of cancer and a non-cancer label based on future diagnoses. Using GEO’s API, 4 large datasets from this experiment, totalling up to 13,000 microarrays, were processed in series matrix format.
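As an illustration of the series matrix format, the following is a minimal sketch of a parser, not the code used in this project. It assumes the usual layout of a GEO series matrix file: metadata lines starting with "!", and a tab-separated table enclosed by "!series_matrix_table_begin" and "!series_matrix_table_end" whose header row starts with "ID_REF". The function name, sample IDs, and probe names below are hypothetical.

```python
import io

def _to_float(text):
    """Convert a table cell to float; empty or non-numeric cells become None."""
    try:
        return float(text)
    except ValueError:
        return None

def parse_series_matrix(handle):
    """Return (sample_ids, {probe_id: [expression values]}) from a series matrix file."""
    sample_ids, table, in_table = [], {}, False
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
            continue
        if line.startswith("!series_matrix_table_end"):
            break
        if not in_table:
            continue  # metadata lines are ignored in this sketch
        fields = [f.strip('"') for f in line.split("\t")]
        if fields[0] == "ID_REF":
            sample_ids = fields[1:]          # header row: one column per microarray
        else:
            table[fields[0]] = [_to_float(v) for v in fields[1:]]
    return sample_ids, table

# Tiny made-up file with two samples and two probes (values are illustrative only).
demo = (
    '!Series_title\t"demo"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"hsa-miR-21"\t1.5\t2.0\n'
    '"hsa-miR-155"\t\t0.25\n'
    '!series_matrix_table_end\n'
)
sample_ids, table = parse_series_matrix(io.StringIO(demo))
```

Each column of the table corresponds to one microarray, so transposing it yields one input vector per sample.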
To analyze the learning progress of the AI, 10% of the dataset was allocated for validation. Afterwards, a multilayer perceptron neural network was created to fit the data. Unlike convolution-based models, where neighbouring inputs are pooled and downscaled for general pattern recognition, this fully connected model gives each input its own weights and makes no assumption about which inputs are neighbours, making it a better fit for the task of complex gene expression analysis.
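A 10% hold-out split can be sketched as follows. This is a minimal illustration that assumes a random shuffle with a fixed seed; the paper does not state how the validation samples were chosen, and the function name and parameters are hypothetical.

```python
import random

def split_train_validation(samples, labels, validation_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction of them for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)      # seeded so the split is reproducible
    n_val = int(round(len(indices) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    train = (pick(samples, train_idx), pick(labels, train_idx))
    validation = (pick(samples, val_idx), pick(labels, val_idx))
    return train, validation

# Example with 100 placeholder samples and 13 placeholder labels.
(train_x, train_y), (val_x, val_y) = split_train_validation(
    list(range(100)), [i % 13 for i in range(100)]
)
```

With classes as unevenly represented as they are here, a stratified split (holding out 10% of each label) may be preferable, since a purely random split can leave very few samples of a rare class in the validation set.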
Overall, this model had 1,186,825 trainable parameters. To prevent overfitting, and to help it learn faster, dropout layers and batch normalization layers were added, and the model had 5,024 non-trainable parameters. Dropout itself has no parameters; in typical implementations, the non-trainable parameters are the running statistics kept by batch normalization. The dropout layers had a 30% dropout rate, meaning that on each training iteration, a random 30% of the neuron activations were zeroed, making it much harder for the network to overfit the training data. The batch normalization layers rescaled each layer's activations to have zero mean and unit standard deviation, making the data easier to work with.
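The two regularization steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the layers used in the model: it shows "inverted" dropout, which zeroes each activation with the given probability and rescales the survivors, and the normalization step of batch normalization for a single feature, omitting the learned scale and shift and the running statistics that library implementations keep.

```python
import math
import random

def dropout(activations, rate=0.3, rng=random):
    """Zero each activation with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def batch_normalize(column, eps=1e-5):
    """Normalize one feature across a batch to zero mean and (near) unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return [(x - mean) / math.sqrt(var + eps) for x in column]

dropped = dropout([1.0] * 1000, rate=0.3, rng=random.Random(0))
normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
```

Dropout is applied only during training; at inference time all activations are kept, which is why the surviving activations are rescaled during training.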
Each hidden layer of the network used the ReLU (rectified linear unit) activation function, which passes positive activations through unchanged and sets negative ones to zero. The output layer used the Softmax activation function to convert the vector of activations into a vector of probabilities. This output vector represented the network’s confidence in the presence of each of the 13 labels.
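Both activation functions are short enough to write out. The sketch below uses the standard definitions, with the usual max-subtraction trick for numerical stability in Softmax; the 13-element input vector is made up for illustration.

```python
import math

def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Convert raw activations into probabilities that sum to 1."""
    shift = max(logits)                       # subtracting the max avoids overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One hypothetical output vector with 13 entries, one per label.
probabilities = softmax([-2.0, 0.5, 3.0] + [0.0] * 10)
predicted_label = probabilities.index(max(probabilities))
```

The index of the largest probability is taken as the predicted label, and the probability itself as the network's confidence.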
As shown in figure 2, the perceptron model started with an accuracy of less than 60% and reached 98% accuracy on the training dataset and 93% accuracy on the validation dataset. The small gap between the two suggests that the AI did not substantially overfit the training data, and that it can reliably perform a cancer diagnosis on a Homo sapiens microarray sample.
predictions while the vertical axis represents the doctor-diagnosed values. The model is above 90% accurate on 11 of the 13 labels. However, due to the small number of glioma microarray samples provided in the dataset, it is only 36% accurate in classifying glioma and often confuses it with lung cancer.
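Per-label accuracy of this kind can be read off a confusion matrix. The sketch below assumes that rows hold the doctor-diagnosed labels and columns hold the predictions, and that "accuracy on a label" means the fraction of that label's samples that were predicted correctly (per-class recall). The 3-label matrix is made up for illustration and does not reproduce the paper's results.

```python
def per_class_recall(confusion):
    """For each true label (row), return the fraction of samples predicted correctly."""
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        recalls.append(row[i] / total if total else None)
    return recalls

# Hypothetical counts: rows are true labels, columns are predicted labels.
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 3, 1],   # a rare class that is mostly mistaken for the second class
]
recalls = per_class_recall(toy_confusion)  # [0.8, 0.9, 0.25]
```

A class with few samples, like the third row here, can have low recall while barely affecting overall accuracy, which is why the per-label view is useful.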
For the purpose of data visualization, the microarrays were converted to 45 by 57 images, representing each gene’s expression on a green-to-red scale. Figure 4 shows 8 randomly selected microarrays diagnosed by the AI. Within these diagnoses, the AI’s median certainty for a specific type of cancer is 99%.
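The conversion can be sketched as a reshape followed by a colour mapping. This is an illustration under several assumptions not stated in the paper: that each microarray has 45 × 57 = 2,565 expression values, that the image has 45 rows and 57 columns, that values are min-max scaled per sample, and that green marks the lowest expression and red the highest.

```python
def to_image(values, height=45, width=57):
    """Map a flat expression vector to rows of (R, G, B) pixels on a green-to-red scale."""
    if len(values) != height * width:
        raise ValueError("expected %d values, got %d" % (height * width, len(values)))
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                   # avoid division by zero for flat input
    pixels = []
    for v in values:
        t = (v - lo) / span                   # 0.0 = lowest expression, 1.0 = highest
        pixels.append((round(255 * t), round(255 * (1 - t)), 0))
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

# Placeholder expression vector with steadily increasing values.
image = to_image([float(i) for i in range(45 * 57)])
```

The resulting nested list can be handed to any imaging library for display; the layout carries no biological meaning and serves only as a visual aid.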
Artificial intelligence’s pattern recognition of miRNA expression levels is highly accurate and is a powerful tool for conducting genetic disease screening, along with other forms of gene expression analysis. With the 93% accurate machine learning model presented in this paper, medical professionals can automatically analyze genetic samples for the diagnosis of 12 cancers, and work with an increasingly reliable second opinion. Furthermore, the success of the program indicates the presence of distinguishable patterns in genetic expression amongst different cancers. Differential gene analysis techniques combined with the AI’s readings can provide meaningful insight into the specific combination of genes responsible for cancers.