Unveiling Horse Breed Diversity: An Advanced Neural Network Approach

Understanding the genetic underpinnings of horse breed diversity is crucial for effective conservation and breeding strategies. This article delves into the application of Artificial Neural Networks (ANNs) to analyze high-density Single Nucleotide Polymorphism (SNP) data, aiming to identify key genetic markers that differentiate horse breeds. We explore the data preparation, ANN model construction, and feature selection methodologies employed in this advanced research.

Training and Testing Datasets for Genetic Analysis

The foundation of this study lies in comprehensive genetic data. The training dataset comprised 795 animals from 37 distinct horse breeds, genotyped on an Illumina 50K SNP BeadChip. The data were filtered with a call rate threshold of 99% to limit missing genotypes and ensure data quality for subsequent analysis. The raw predictor data took the form of an SNP matrix in which each marker is a separate variable holding one of three input values (0, 1, or 2), one for each possible genotype at that locus.
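The call-rate filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say whether the 99% threshold was applied per marker or per animal (per marker is assumed here), missing calls are represented as `None`, and the function names and toy matrix are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1, 2, or None for a missing call (assumed encoding)


def call_rate(column: List[Genotype]) -> float:
    """Fraction of non-missing genotypes for one marker."""
    return sum(g is not None for g in column) / len(column)


def filter_markers(matrix: List[List[Genotype]],
                   min_call_rate: float = 0.99) -> List[int]:
    """Return indices of marker columns whose call rate meets the threshold."""
    n_markers = len(matrix[0])
    keep = []
    for j in range(n_markers):
        column = [row[j] for row in matrix]
        if call_rate(column) >= min_call_rate:
            keep.append(j)
    return keep


# Toy data: 4 animals (rows) x 3 markers (columns); the third marker
# has two missing calls and is dropped by the filter.
genotypes = [
    [0, 1, None],
    [2, 1, 0],
    [1, 0, None],
    [0, 2, 1],
]
kept = filter_markers(genotypes, min_call_rate=0.99)
```

With real data the same logic would run over tens of thousands of columns, so a vectorized array library would be the more practical choice.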

To evaluate the performance and predictive power of the ANN models, a separate testing dataset was assembled. This dataset included 120 individuals from eight different breeds. Preprocessing involved identifying the SNP markers common to the 50K and 2M SNP panels. Following quality control (a 99% call rate), approximately 14,000 markers remained for detailed analysis. Further details regarding the validation data can be found in Schaefer et al.
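Finding the markers shared by two panels amounts to a set intersection on marker identifiers. The sketch below assumes markers can be matched by a common ID string; the IDs shown are made up, and the source does not describe how matching was actually done (for example by name or by genomic position).

```python
from typing import List


def common_markers(panel_a: List[str], panel_b: List[str]) -> List[str]:
    """Return marker IDs present on both panels, keeping panel_a's order."""
    in_b = set(panel_b)
    return [marker for marker in panel_a if marker in in_b]


# Hypothetical marker IDs for illustration only.
panel_50k = ["snp_1", "snp_2", "snp_3", "snp_4"]
panel_2m = ["snp_9", "snp_3", "snp_1", "snp_7"]
shared = common_markers(panel_50k, panel_2m)
```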

Constructing and Refining Artificial Neural Network Models

Artificial neural networks are sophisticated computational models inspired by the structure of the human brain, composed of interconnected units known as neurons. Various network architectures exist, including Multilayer Perceptrons (MLPs) and Self-Organizing Maps (SOMs). For this genomic data analysis, two primary ANN architectures were employed: a deep neural network (DNN) featuring two hidden layers, and a standard single hidden layer neural network (ANN) utilizing a back-propagation algorithm for weight adjustments.

The architecture of a single hidden layer ANN, commonly used in similar studies, is illustrated in Figure 1. The models were built in the R software environment with the neuralnet and NeuralNetTools packages, which were used to select informative and unique SNP markers within each breed. Two algorithms, Garson's and Olden's, were applied to the trained ANNs to estimate the relative importance of each variable in characterizing breed diversity.

Figure 1: A generalized representation of a single hidden layer neural network architecture, often employed in genetic analyses. Bias and other detailed neuronal parameters are omitted for clarity.

Addressing the challenge of large SNP datasets potentially causing computational errors, the study adopted a strategy proposed by De Oña and Garrido. Instead of a single, large network, the high-density SNP chip data was partitioned into sub-datasets of manageable dimensions. These sub-datasets were then used as input to identify discriminant SNPs.
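A minimal sketch of this partitioning step follows. The chunk size is an assumption chosen for the toy example; the source does not state the dimensions of the sub-datasets, and the marker IDs are placeholders.

```python
from typing import List


def partition_markers(marker_ids: List[str], chunk_size: int) -> List[List[str]]:
    """Split a marker list into consecutive sub-lists of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [marker_ids[i:i + chunk_size]
            for i in range(0, len(marker_ids), chunk_size)]


# Ten placeholder markers split into sub-datasets of up to four markers each.
markers = [f"snp_{i}" for i in range(10)]
chunks = partition_markers(markers, chunk_size=4)
```

Each sub-list would then define the input columns for one separately trained network.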

Advanced Feature Selection: Garson, Olden, and DNN Approaches

Feature selection is a critical step in identifying the most influential genetic markers. This study employed two established methods: the Garson approach and the Olden approach. The Garson method, originally described by Garson and refined by Goh, assesses the relative importance of input variables based on the calculated weights within the connections of a supervised neural network. These importance values range from 0 to 1, indicating the magnitude of a variable’s influence. The Olden approach, proposed by Olden and Jackson, also utilizes connection weights to evaluate variable contributions.
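To make the two weight-based measures concrete, the sketch below computes both for a network with one hidden layer and a single output. It follows one commonly described formulation of each method and is not taken from the study's R code: Garson's measure normalizes absolute weight products within each hidden neuron and sums the resulting shares, while Olden's measure sums the signed products. Bias terms are ignored, and the weight values are invented.

```python
from typing import List


def garson_importance(w_in: List[List[float]], w_out: List[float]) -> List[float]:
    """Relative importance in [0, 1], summing to 1 across inputs.

    w_in[i][j] is the weight from input i to hidden neuron j;
    w_out[j] is the weight from hidden neuron j to the output.
    """
    n_inputs, n_hidden = len(w_in), len(w_out)
    shares = [0.0] * n_inputs
    for j in range(n_hidden):
        contrib = [abs(w_in[i][j] * w_out[j]) for i in range(n_inputs)]
        total = sum(contrib)
        if total == 0:
            continue  # this hidden neuron carries no signal
        for i in range(n_inputs):
            shares[i] += contrib[i] / total
    grand_total = sum(shares)
    if grand_total == 0:
        return [0.0] * n_inputs
    return [s / grand_total for s in shares]


def olden_importance(w_in: List[List[float]], w_out: List[float]) -> List[float]:
    """Signed importance: sum over hidden neurons of w_in * w_out."""
    return [sum(w_in[i][j] * w_out[j] for j in range(len(w_out)))
            for i in range(len(w_in))]


# Toy network: 2 inputs, 2 hidden neurons, 1 output (values are made up).
w_in = [[0.5, 1.0],
        [1.5, 1.0]]
w_out = [2.0, -1.0]
garson = garson_importance(w_in, w_out)
olden = olden_importance(w_in, w_out)
```

In this toy case the first input has an Olden value of zero because its two pathways cancel, while its Garson value stays positive. That is the practical difference between the two: Garson discards the sign of each contribution and Olden keeps it.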

A Deep Neural Network (DNN) approach, utilizing an ANN with two hidden layers, was also implemented to pinpoint discriminant SNPs within breeds. Determining the optimal number of nodes in the hidden layers is crucial for model performance. Through systematic testing, 40 nodes were identified for the first hidden layer and 38 for the second. The ANN models incorporating the Garson and Olden algorithms featured 40 nodes in their single hidden layer.

The selection of genetic markers was based on the final fitted weights of the neural network. In the DNN approach, a linear relationship between variables and the response was assumed, represented by the equation Y = Xg + e. Here, Y represents the observed values for the breeds, g is a vector of SNP marker weights, X is the design matrix linking marker weights to observed values, and e is the vector of residual terms. The hypothesis was that higher coefficient values in this regression-like equation significantly impact the output variable. Consequently, the maximum absolute weight obtained via the DNN was used to select SNP markers responsible for breed diversity.

Figure 2 illustrates the overall analytical process. Following the convergence of the neural network, feature selection was performed based on the absolute values of the weights in the first hidden layer. Equation 2 provides a threshold for selecting informative SNP markers: E(|W_ij|) + sqrt(Var(|W_ij|)), that is, the mean of the absolute weights plus one standard deviation. This formula starts from the assumption that all variables contribute and then applies a selection threshold to identify a subset of effective markers. The method acknowledges that not all variables have equal effects and allows the data to reveal the influence of each marker.
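The thresholding rule can be sketched as follows. This is one interpretation rather than the study's code: the mean and standard deviation are taken over all absolute first-layer weights, a marker is scored by its maximum absolute weight across hidden nodes, and the population standard deviation is used. The weights and marker names are invented.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def select_markers(weights: List[List[float]],
                   marker_ids: List[str]) -> Tuple[List[str], float]:
    """Select markers whose largest |weight| exceeds mean + sd of all |weights|.

    weights[i][j] is the weight from marker i to first-hidden-layer node j.
    """
    abs_w = [[abs(w) for w in row] for row in weights]
    flat = [w for row in abs_w for w in row]
    threshold = mean(flat) + pstdev(flat)
    selected = [marker for marker, row in zip(marker_ids, abs_w)
                if max(row) > threshold]
    return selected, threshold


# Toy example: 4 markers feeding 2 hidden nodes.
weights = [
    [0.10, -0.20],
    [0.90, -1.50],
    [0.05, 0.10],
    [-0.30, 0.20],
]
marker_ids = ["snp_a", "snp_b", "snp_c", "snp_d"]
selected, threshold = select_markers(weights, marker_ids)
```

Only `snp_b` clears the threshold here, since its largest absolute weight (1.5) sits well above the mean plus one standard deviation of the absolute weights.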

Figure 2: A flowchart outlining the key stages of the research methodology, from data processing to feature selection and analysis.

Individual Assignment Analysis and Model Evaluation

The evaluation of a classification model hinges on its ability to flexibly and accurately predict outcomes for new, unseen data. For this study, individual assignment analysis was conducted using established genetic assignment approaches. The method described by Paetkau et al., which has demonstrated high effectiveness in individual assignment when significant genetic differentiation exists between reference populations, was employed. Notably, SNP markers were used in place of the traditional microsatellites.

The performance of the assignment procedure was quantitatively assessed using log-likelihood ratios (LLRs). The LLR compares the probability of an individual being assigned to its true population against the probability of it being assigned to any other population (Equations 3 and 4).

$$LLR = \log_{10}(T(g|i_a)) - \log_{10}(T(g|i_b))$$

where,

$$\log_{10}(T(g|i)) = \sum_{j} \log_{10}(T(g_{jk'}|i))$$

Different stringency thresholds were applied to establish confidence levels for assignment precision. Four levels were used (LLR > 1, 2, 3, and 4), signifying that a multi-locus genotype should be 10, 100, 1000, or 10000 times more similar to the true population than any other. Individuals with an LLR below these thresholds would fail to be assigned to their unique origin and would be categorized under a pseudo-reference population. Correct assignment occurred when the calculated LLR exceeded the selected stringency levels.
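The sketch below shows how an LLR and the stringency check could be computed for biallelic SNPs. It rests on several assumptions not spelled out in the source: genotypes are coded as counts (0, 1, 2) of one allele, genotype probabilities follow Hardy-Weinberg proportions from each population's allele frequency, frequencies are clamped away from 0 and 1 to avoid log(0), and the comparison population is the most likely alternative. Breed names and frequencies are invented.

```python
import math
from typing import Dict, List


def log10_likelihood(genotype: List[int], allele_freqs: List[float],
                     eps: float = 1e-3) -> float:
    """Sum over loci of log10 P(genotype | population allele frequency)."""
    total = 0.0
    for g, p in zip(genotype, allele_freqs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log10(0)
        q = 1.0 - p
        prob = (q * q, 2.0 * p * q, p * p)[g]  # Hardy-Weinberg proportions
        total += math.log10(prob)
    return total


def assignment_llr(genotype: List[int],
                   freqs_by_pop: Dict[str, List[float]],
                   true_pop: str) -> float:
    """log10 likelihood in the true population minus the best alternative."""
    scores = {pop: log10_likelihood(genotype, freqs)
              for pop, freqs in freqs_by_pop.items()}
    best_other = max(s for pop, s in scores.items() if pop != true_pop)
    return scores[true_pop] - best_other


def is_assigned(llr: float, stringency: float) -> bool:
    """An individual is assigned only if its LLR exceeds the stringency level."""
    return llr > stringency


# Two hypothetical breeds with strongly diverged allele frequencies at 3 SNPs.
freqs = {"breed_a": [0.9, 0.8, 0.9], "breed_b": [0.1, 0.2, 0.1]}
llr_clear = assignment_llr([2, 2, 2], freqs, true_pop="breed_a")
llr_ambiguous = assignment_llr([1, 1, 1], freqs, true_pop="breed_a")
```

The homozygous individual is about 10^5 times more likely under `breed_a` and passes all four stringency levels, whereas the fully heterozygous individual is equally likely under both breeds and is not assigned at any level.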

Ethical Considerations in Genetic Research

All animal sampling and procedures were conducted with strict adherence to ethical guidelines and were approved by relevant international and national governing bodies. For the training dataset, DNA was collected via jugular venipuncture by licensed veterinarians or from mane or tail hairs by owners or researchers. Approvals were obtained from institutions including the University of Minnesota, the University of Kentucky, University College Dublin, and various ethical boards across Europe.

Similarly, DNA samples for the testing dataset were collected with prior approval from Animal Care and Use Committees at respective institutions. These approvals encompassed protocols from institutions like the University of California, Davis, Cornell University, and others across Sweden, Israel, Germany, and Switzerland. No commercial animals were involved in this study. Written informed consent was obtained from private owners, detailing the study’s purpose, procedures, potential risks and benefits, and contact information.

References

Petersen, J. L. et al. Genetic Diversity in the modern horse illustrated from genome-wide SNP data. PLoS ONE 8, e54997. https://doi.org/10.1371/journal.pone.0054997 (2013).
Milne, L. In AI-Conference 571–571 (World Scientific Publishing).
Schaefer, R. J. et al. Developing a 670k genotyping array to tag ~2M SNPs across 24 horse breeds. BMC Genom. 18, 565. https://doi.org/10.1186/s12864-017-3943-8 (2017).
Ince, D. & Sofu, A. Estimation of lactation milk yield of Awassi sheep with artificial neural network modeling. Small Ruminant Res. 113, 15–19 (2013).
Arbib, M. A. The Handbook of Brain Theory and Neural Networks (MIT press, 2003).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536. https://doi.org/10.1038/323533a0 (1986).
Cilimkovic, M. Neural networks and back propagation algorithm. Institute of Technology Blanchardstown, Blanchardstown Road North Dublin 15 (2015).
Fritsch, S. & Guenther, F. neuralnet: Training of Neural Networks. https://journal.r-project.org/archive/2010/RJ-2010-006/index.html (2016).
Beck, M. NeuralNetTools: Visualization and Analysis Tools for Neural Networks. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6262849/ (2016).
De Oña, J. & Garrido, C. Extracting the contribution of independent variables in neural network models: A new approach to handle instability. Neural Comput. Appl. 25, 859–869. https://doi.org/10.1007/s00521-014-1573-5 (2014).
Garson, G. D. Interpreting neural-network connection weights. AI Expert 6, 46–51 (1991).
Goh, A. T. C. Back-propagation neural networks for modeling complex systems. Artif. Intell. Eng. 9, 143–151. https://doi.org/10.1016/0954-1810(94)00011-S (1995).
Olden, J. D. & Jackson, D. A. Illuminating the “black box”: A randomization approach for understanding variable contributions in artificial neural networks. Ecol. Model. 154, 135–150 (2002).
He, J. & Zelikovsky, A. In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2840–2843 (IEEE).
R Core Team. R: A Language and Environment for Statistical Computing. https://www.R-project.org/ (2017).
Paetkau, D., Calvert, W., Stirling, I. & Strobeck, C. Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. 4, 347–354 (1995).
Rannala, B. & Mountain, J. L. Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94, 9197–9201 (1997).
Cornuet, J. M., Piry, S., Luikart, G., Estoup, A. & Solignac, M. New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics 153, 1989–2000 (1999).
Wilkinson, S. et al. Evaluation of approaches for identifying population informative markers from high density SNP Chips. BMC Genet. 12, 45. https://doi.org/10.1186/1471-2156-12-45 (2011).