In 2012, Tarleton State University created one of the first programs for a Master's Degree in Mathematics with Emphasis in Data Science (originally called Data Mining). Our graduates have secured high-paying data science jobs with rapid career advancement at prestigious employers such as Amazon, Capital One Finance, HAVI Global Solutions, JB Hunt, Informatica, the US Missile Defense Agency, the National Center for Atmospheric Research, and Oak Ridge National Laboratory. Others have pursued PhDs at institutions including Baylor University, Southern Methodist University, and Washington University in St. Louis. In the past three years, all data science graduates have found positions within six months of graduating.

For current degree plan information, please see the page Master of Science in Mathematics.

## Recent Data Mining Research Projects

## Detecting Anomalous Crop Insurance Claims using Satellite Images

Research assistants Rebecca Ator, Charles Gibson, Dan Mysnyk, and Adam Wisseman implemented a method for screening crop insurance claims for fraud using satellite images.

From the red and near-infrared bands of a satellite image, it is possible to calculate the normalized difference vegetation index (NDVI): the difference between the two bands divided by their sum. NDVI serves as a proxy for the amount of green vegetation in a given geographic region and, therefore, for the health of crops being grown in that area. A *k*-means algorithm was applied to cluster NDVI curves for Nebraska crop insurance claims, resulting in a relatively healthy cluster (Cluster 1) and an unhealthy one (Cluster 2).
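As a rough illustration of the two steps described here, the following sketch computes NDVI curves from band reflectances and clusters them with a plain Lloyd's *k*-means. The reflectance values, the four example fields, and the choice to seed the centroids with the first *k* curves are all assumptions made for the example; the project's actual data and *k*-means configuration are not described on this page.

```python
import math

def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel or region."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total

def kmeans(curves, k=2, iterations=20):
    """Plain Lloyd's k-means on equal-length curves (lists of floats)."""
    # Deterministic start (an assumption): seed centroids with the first k curves.
    centroids = [list(c) for c in curves[:k]]
    labels = [0] * len(curves)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(c, centroids[j]))
            for c in curves
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids

# Hypothetical (NIR, red) reflectances at four dates for four insured fields.
fields = [
    [(0.30, 0.20), (0.45, 0.12), (0.55, 0.08), (0.40, 0.15)],  # healthy
    [(0.25, 0.22), (0.28, 0.21), (0.30, 0.22), (0.26, 0.23)],  # stressed
    [(0.32, 0.19), (0.47, 0.11), (0.56, 0.07), (0.42, 0.14)],  # healthy
    [(0.24, 0.22), (0.27, 0.22), (0.29, 0.23), (0.25, 0.23)],  # stressed
]
curves = [[ndvi(nir, red) for nir, red in season] for season in fields]
labels, centroids = kmeans(curves, k=2)
```

With curves this well separated, the two healthy fields share one label and the two stressed fields share the other. Real NDVI curves are noisier, so a production analysis would typically use several random restarts rather than a fixed seed.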

This clustering was then compared to spot checklist (SCL) flags, which the Center for Agribusiness Excellence (CAE) uses to flag anomalous insurance claims. A Fisher's exact test comparing the clustering to the SCL flags resulted in a *p*-value less than $10^{-5}$, demonstrating a highly statistically significant association between the NDVI clusters and the SCL flags.
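For readers unfamiliar with the test, here is a minimal sketch of a two-sided Fisher's exact test on a 2×2 table of cluster membership against SCL flag status. The counts are invented for illustration and are not the project's data; the function name is likewise hypothetical.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9)
    )

# Hypothetical counts: rows = NDVI cluster (1, 2), columns = SCL flag (no, yes).
counts = [[40, 5], [12, 23]]
p_value = fisher_exact_two_sided(counts)
```

The test conditions on the row and column totals, so it remains valid for small cell counts where a chi-square approximation would be unreliable.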

Pictured: Charles, Adam, Rebecca, and Dan speaking with Kirk Bryant, Deputy Director for Strategic Data Acquisition and Analysis for the USDA Risk Management Agency, at the National Consortium for Data Science Data Showcase.

## Bayesian Ensemble Models of Climate Variability in South Texas

Possibly the most important application of data mining in the 21st century is building and refining models of climate change and then using those models to predict climate behavior in local regions. Juliann Booth and Nina Culver are using Bayesian model averaging to predict future precipitation in South Texas, an important concern, given the projected decline in water availability in this region by 2050.

Thirty-five CMIP5 models $f_1, \ldots, f_{35}$ for temperature and precipitation were obtained from the World Climate Research Programme's Working Group on Coupled Modeling. For each model $f_k$, the probability of observing a temperature/precipitation measurement $y$ is $p(y \mid f_k)$, and the probability that $f_k$ is the best model given observed target data $y_T$ is $p(f_k \mid y_T)$. Synthesizing these two types of probabilities using Bayes' theorem yields the overall probability of observing a future temperature/precipitation measurement $y$ as follows.
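In the notation above, the standard Bayesian model averaging predictive density is a mixture of the individual model densities, weighted by each model's posterior probability. This is the textbook form and may differ in detail from the formulation used in the project:

$$
p(y \mid y_T) = \sum_{k=1}^{35} p(y \mid f_k)\, p(f_k \mid y_T),
\qquad
p(f_k \mid y_T) = \frac{p(y_T \mid f_k)\, p(f_k)}{\sum_{j=1}^{35} p(y_T \mid f_j)\, p(f_j)}.
$$

A minimal Python sketch of the same calculation is given below, assuming Gaussian model errors with a common standard deviation and a uniform prior over models. The three hindcast series, the forecasts, and all function names are hypothetical and chosen only to make the weights easy to follow.

```python
import math

def normal_logpdf(y, mean, sd):
    """Log of the Gaussian density, an assumed form for p(y | f_k)."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def posterior_weights(hindcasts, observed, sd):
    """p(f_k | y_T) via Bayes' theorem with a uniform prior over models."""
    log_liks = [
        sum(normal_logpdf(y, m, sd) for y, m in zip(observed, run))
        for run in hindcasts
    ]
    top = max(log_liks)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(ll - top) for ll in log_liks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def bma_density(y, forecasts, weights, sd):
    """Mixture density p(y | y_T) = sum_k p(y | f_k) * p(f_k | y_T)."""
    return sum(
        w * math.exp(normal_logpdf(y, f, sd))
        for f, w in zip(forecasts, weights)
    )

# Hypothetical data: three models' hindcasts against four observed values.
observed = [20.1, 21.4, 22.0, 21.2]
hindcasts = [
    [20.0, 21.5, 22.1, 21.0],  # close to the observations
    [19.0, 20.2, 21.0, 20.1],  # biased about one unit low
    [22.5, 23.9, 24.6, 23.8],  # biased about 2.5 units high
]
weights = posterior_weights(hindcasts, observed, sd=1.0)

forecasts = [23.0, 22.1, 25.4]  # each model's projection for a future period
bma_mean = sum(w * f for w, f in zip(weights, forecasts))
```

Models that tracked the observed record closely receive most of the weight, so the averaged forecast leans toward their projections without discarding the others entirely.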

Here is a visualization of temperature predictions for the thirty-five CMIP5 models for the South Texas region being studied.

## Modeling Nitrate Contamination in Water Wells Based on Proximity to CAFOs

Nitrate contamination of groundwater is a serious health concern that can lead to conditions such as methemoglobinemia (blue baby syndrome), miscarriages, and non-Hodgkin lymphoma. The EPA has therefore set a maximum contaminant level (MCL) for nitrate of 10 mg/L. Proximity of concentrated animal feeding operations (CAFOs) to water wells has been linked to nitrate contamination of those wells, and Charles Tintera and Lain Tomlinson are currently applying data mining techniques to model this relationship more accurately.

A novel feature of this project is modeling flowpaths in the aquifer from a given CAFO using the hydraulic gradient obtained from a geographic information system (GIS). By taking into account the distance from a well to a CAFO's flowpath, the length of that flowpath, and the waste application rate at that CAFO, a CAFO Migration Score (CMS) is calculated to summarize the overall impact of CAFOs on the well under consideration. The Epanechnikov kernel is applied to model the diminished probability of contamination that results from increased distance from the flowpath.
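The exact CMS formula is not given on this page, so the sketch below should be read only as one plausible shape for such a score. The Epanechnikov kernel itself is standard; the rule for combining it with flowpath length and waste application rate, the bandwidth, and all names and numbers are assumptions made for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def cafo_migration_score(cafos, bandwidth_m):
    """Hypothetical CMS: kernel-weighted sum of waste application rates.

    Each CAFO is a tuple of (distance from the well to the flowpath in
    meters, flowpath length in meters, waste application rate). The
    combination rule below is an assumption, not the project's formula.
    """
    score = 0.0
    for distance_m, flowpath_m, rate in cafos:
        # Weight falls to zero once the well is a full bandwidth away.
        weight = epanechnikov(distance_m / bandwidth_m)
        # Assumed attenuation: longer flowpaths contribute less.
        score += rate * weight / (1.0 + flowpath_m / 1000.0)
    return score

# Hypothetical well with three nearby CAFOs.
cafos = [(50.0, 800.0, 120.0), (400.0, 1500.0, 300.0), (900.0, 600.0, 200.0)]
cms = cafo_migration_score(cafos, bandwidth_m=500.0)
```

Because the kernel has compact support, the third CAFO (900 m from its flowpath with a 500 m bandwidth) contributes nothing to the score, which matches the idea that distant flowpaths should not affect a well.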

Once CAFO migration scores were computed, a logistic regression model demonstrated a highly statistically significant relationship between CMS and nitrate contamination (*p* = $7.19 \times 10^{-12}$). In the image below, 344 wells have been divided into deciles based on CAFO migration score, so each point in this plot represents approximately 34 wells. The *x*-coordinate of each point is the average CMS value for wells in that decile, and the *y*-coordinate is the observed number of wells in that decile with nitrate concentrations exceeding 3 mg/L. The plot indicates strong agreement between the observed data and the logistic regression model, as confirmed with a Hosmer-Lemeshow goodness-of-fit test.
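The decile comparison described above can be sketched as follows. The example fits a one-predictor logistic regression by Newton's method, groups wells into deciles by score, and computes the Hosmer-Lemeshow statistic from observed and expected counts. The 344 scores and outcomes are simulated, not the project's data, and the helper names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """One-predictor logistic regression fitted by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def hosmer_lemeshow(x, y, b0, b1, groups=10):
    """Return (mean score, observed, expected) per group and the HL statistic."""
    # With one predictor, sorting by score orders wells by fitted risk too.
    order = sorted(range(len(x)), key=lambda i: x[i])
    rows, stat = [], 0.0
    for g in range(groups):
        idx = order[g * len(x) // groups:(g + 1) * len(x) // groups]
        n = len(idx)
        observed = sum(y[i] for i in idx)
        expected = sum(sigmoid(b0 + b1 * x[i]) for i in idx)
        rows.append((sum(x[i] for i in idx) / n, observed, expected))
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / n))
    return rows, stat

def chi2_sf_even(stat, df):
    """Upper-tail chi-square probability; closed form valid for even df."""
    half = stat / 2.0
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )

# Simulated stand-in for 344 wells (scores and outcomes are hypothetical).
random.seed(0)
cms = [random.random() for _ in range(344)]
high = [1 if random.random() < sigmoid(-2.0 + 4.0 * c) else 0 for c in cms]

b0, b1 = fit_logistic(cms, high)
rows, hl_stat = hosmer_lemeshow(cms, high, b0, b1)
hl_p = chi2_sf_even(hl_stat, df=len(rows) - 2)
```

A large Hosmer-Lemeshow *p*-value means the observed counts in each decile are consistent with the fitted probabilities, which is the sense in which the test supports the model's fit.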

Charles and Lain are now working to extend this analysis to include more variables, such as depth to water table, pH, total dissolved solids, percent clay, percent organic matter, and annual rainfall. They are also applying random forests, support vector machines, *k*-nearest neighbors, and other classification algorithms to improve the model’s classification accuracy. Because testing for nitrate contamination is expensive, the goal is to provide a tool that will help farmers estimate a well’s probability of being contaminated using readily available information about that well.
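As a small illustration of one of the classifiers mentioned, the sketch below estimates a well's contamination probability as the fraction of its *k* nearest neighbors that are contaminated. The two features (a CMS value and a depth to water table, both rescaled to the range 0 to 1), the six training wells, and the function name are all hypothetical.

```python
import math

def knn_probability(train_x, train_y, query, k=3):
    """Fraction of the k nearest training wells that are contaminated."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical wells: (scaled CMS, scaled depth to water table).
train_x = [
    (0.9, 0.2), (0.8, 0.3), (0.7, 0.1),  # high score, shallow water table
    (0.2, 0.8), (0.1, 0.9), (0.3, 0.7),  # low score, deep water table
]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = contaminated, 0 = not contaminated

risk_near_cafo = knn_probability(train_x, train_y, (0.75, 0.25))
risk_far_from_cafo = knn_probability(train_x, train_y, (0.15, 0.85))
```

Distance-based methods like this are sensitive to feature scale, so variables such as pH and annual rainfall would need to be standardized before being combined in practice.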