# M.S. in Mathematics with Emphasis in Data Mining

Tarleton State University houses the Center for Agribusiness Excellence (CAE), which uses data mining techniques to screen all of the USDA's crop insurance data for fraud. In 2010, CAE was awarded a Top 10 Data Mining Case Study by the Institute of Electrical and Electronics Engineers (IEEE). In partnership with CAE, the mathematics department offers an M.S. in mathematics with emphasis in data mining, and students in this program are eligible for $25,000 research assistantships.

Research assistants have a 15-month appointment, beginning on June 1 and concluding on August 31 of the following year. During the first 12 months, they work 20 hours per week for CAE while completing coursework, and during the last three months, they work 40 hours per week for CAE.

## Fifteen-month Curriculum

Programming Languages Covered: SQL, R, Python, and SAS

Data Mining I and II: Data analytics with R, Python, and SAS, covering selected topics from decision trees, *k*-nearest neighbors, naive Bayes classification, support vector machines, neural networks, gradient descent, ensemble methods, clustering, anomaly detection, nonlinear regression, text mining, and image processing.

Statistical Models: Statistical programming in R, with applications including multiple linear regression, factor variables, interaction terms, diagnostics and remedial measures, logistic regression, discriminant analysis, and principal components analysis.

Data Warehousing: Management of data warehouses using SQL.

### 1st Summer

MATH 5362: Data Warehousing

MATH 5301: Nonparametric Statistics

### Fall

MATH 5364: Data Mining I

MATH 5305: Statistical Models

MATH 5308: Abstract Algebra

MATH 5198: Research Analysis

### Spring

MATH 5366: Data Mining II

MATH 5350: Linear Algebra

MATH 5320: Real Analysis

### 2nd Summer

MATH 5699: Internship Course (Credit received for research assistantship at CAE).

## Recent Data Mining Research Projects

### Detecting Anomalous Crop Insurance Claims using Satellite Images

Research assistants Rebecca Ator, Charles Gibson, Dan Mysnyk, and Adam Wisseman implemented a method for screening crop insurance claims for fraud using satellite images.

From the red and near-infrared bands in a satellite image, it is possible to calculate the normalized difference vegetation index (NDVI): the difference between near-infrared and red reflectance divided by their sum. NDVI serves as a proxy for the amount of green vegetation in a given geographic region and, therefore, the health of crops being grown in that area. A *k*-means algorithm was applied to cluster NDVI curves for Nebraska crop insurance claims, resulting in a relatively healthy cluster (Cluster 1) and an unhealthy one (Cluster 2).
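The NDVI calculation and clustering step can be sketched in a few lines of Python. This is a minimal illustration, not the research assistants' implementation: the reflectance values are hypothetical, and `ndvi` and `kmeans` are names introduced here, with `kmeans` being a plain Lloyd's algorithm on equal-length curves.

```python
import random

def ndvi(nir, red):
    """NDVI = (NIR - red) / (NIR + red) for one pixel or region."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0

def kmeans(curves, k=2, iterations=50, seed=0):
    """Plain Lloyd's k-means on equal-length curves (lists of floats)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(curves, k)]
    labels = [0] * len(curves)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(curve, centers[j])),
            )
            for curve in curves
        ]
        # Update step: each center becomes the mean of its member curves.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centers[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers

# Hypothetical (NIR, red) reflectance pairs on five dates for four fields.
fields = [
    [(0.50, 0.08), (0.60, 0.06), (0.70, 0.05), (0.65, 0.06), (0.50, 0.09)],
    [(0.48, 0.09), (0.58, 0.07), (0.68, 0.05), (0.62, 0.07), (0.49, 0.10)],
    [(0.30, 0.20), (0.32, 0.22), (0.35, 0.21), (0.30, 0.24), (0.28, 0.25)],
    [(0.31, 0.21), (0.33, 0.20), (0.34, 0.22), (0.29, 0.23), (0.27, 0.24)],
]
curves = [[ndvi(nir, red) for nir, red in field] for field in fields]
labels, centers = kmeans(curves, k=2)
```

In this toy data the first two fields have consistently high NDVI and the last two consistently low, so the two clusters separate healthy from unhealthy curves. Which cluster receives label 0 depends on the random initialization.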

This clustering was then compared to spot checklist (SCL) flags, used by CAE to flag anomalous insurance claims. A Fisher's exact test comparing the clustering to the SCL flags resulted in a *p*-value less than 10^{-5}, demonstrating a highly statistically significant association between the NDVI clusters and the SCL flags.
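For readers unfamiliar with Fisher's exact test, the sketch below computes a two-sided *p*-value for a 2×2 table of cluster membership against flag status. The counts are hypothetical and are not the project's data; `fisher_exact_two_sided` is a name introduced here for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-7)
    )

# Hypothetical counts: rows are Cluster 1 / Cluster 2,
# columns are flagged / not flagged.
p_value = fisher_exact_two_sided(5, 95, 30, 70)
```

A small *p*-value indicates that flag status and cluster membership are unlikely to be independent. In practice, a library routine such as `scipy.stats.fisher_exact` would normally be used instead of a hand-written version.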

Photo: Charles, Adam, Rebecca, and Dan speaking with Kirk Bryant, Deputy Director for Strategic Data Acquisition and Analysis for the USDA Risk Management Agency, at the National Consortium for Data Science Data Showcase.

### Bayesian Ensemble Models of Climate Variability in South Texas

Possibly the most important application of data mining in the 21st century is building and refining models of climate change and then using those models to predict climate behavior in local regions. Juliann Booth and Nina Culver are using Bayesian model averaging to predict future precipitation in South Texas, an important concern, given the projected decline in water availability in this region by 2050.

Thirty-five CMIP5 models $f_1, \ldots, f_{35}$ for temperature and precipitation were obtained from the World Climate Research Programme's Working Group on Coupled Modeling. For each model $f_k$, the probability of observing a temperature/precipitation measurement $y$ is $p(y \mid f_k)$, and the probability that $f_k$ is the best model given observed target data $y_T$ is $p(f_k \mid y_T)$. Synthesizing these two types of probabilities using Bayes' theorem yields the overall probability of observing a future temperature/precipitation measurement $y$ as follows.

$$p(y \mid y_T) = \sum_{k=1}^{35} p(y \mid f_k)\, p(f_k \mid y_T)$$
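As a rough sketch of how Bayesian model averaging combines these quantities, the code below weights each model's predictive density by its posterior probability and sums over models. It makes several assumptions that are not from the project: independent Gaussian errors with a common standard deviation, equal prior probabilities, and three hypothetical models rather than thirty-five.

```python
from math import exp, pi, sqrt

def normal_pdf(y, mean, sd):
    """Gaussian density, used here as an assumed form for p(y | f_k)."""
    return exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def posterior_weights(target, hindcasts, sd, priors=None):
    """p(f_k | y_T) by Bayes' theorem: likelihood times prior, normalized."""
    k = len(hindcasts)
    priors = priors or [1.0 / k] * k
    unnormalized = []
    for model_values, prior in zip(hindcasts, priors):
        likelihood = 1.0
        for y_t, m in zip(target, model_values):
            likelihood *= normal_pdf(y_t, m, sd)
        unnormalized.append(likelihood * prior)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bma_density(y, forecasts, weights, sd):
    """p(y | y_T) = sum over k of p(y | f_k) * p(f_k | y_T)."""
    return sum(w * normal_pdf(y, f, sd) for f, w in zip(forecasts, weights))

# Hypothetical temperatures (degrees C): observed target data and
# three models' simulations of the same period.
observed = [24.2, 25.0, 26.0]
hindcasts = [[24.0, 25.1, 26.2], [22.0, 23.0, 24.0], [27.0, 28.5, 29.0]]
forecasts = [26.5, 24.5, 29.5]  # each model's hypothetical future value

weights = posterior_weights(observed, hindcasts, sd=1.0)
bma_mean = sum(w * f for w, f in zip(weights, forecasts))
density_at_26 = bma_density(26.0, forecasts, weights, sd=1.0)
```

Models that reproduce the observed record closely receive larger weights, so the averaged forecast leans toward them without discarding the others entirely.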

Here is a visualization of temperature predictions for the thirty-five CMIP5 models for the South Texas region being studied.

### Modeling Nitrate Contamination in Water Wells Based on Proximity to CAFOs

Nitrate contamination of groundwater is a serious health concern that can lead to conditions such as methemoglobinemia (blue baby disease), miscarriages, and non-Hodgkin lymphoma. The EPA has therefore set a maximum contaminant level (MCL) for nitrate of 10 mg/L. Proximity of concentrated animal feeding operations (CAFOs) to water wells has been linked to nitrate contamination of those wells, and Charles Tintera and Lain Tomlinson are currently applying data mining techniques to model this relationship more accurately.

A novel feature of this project is modeling flowpaths in the aquifer from a given CAFO using the hydraulic gradient obtained from a geographic information system (GIS). By taking into account the distance from a well to a CAFO's flowpath, the length of that flowpath, and the waste application rate at that CAFO, a CAFO Migration Score (CMS) is calculated to summarize the overall impact of CAFOs on the well under consideration. The Epanechnikov kernel is applied to model diminished probabilities of contamination that result from increased distances from the flowpath.
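The Epanechnikov kernel itself is standard: it equals 0.75(1 − u²) for |u| ≤ 1 and zero otherwise, so influence falls off smoothly with distance and vanishes beyond a chosen bandwidth. The page does not give the exact CMS formula, so the score below is purely illustrative: it weights each CAFO's application rate by the kernel of its scaled distance and omits flowpath length. The function names, bandwidth, and data are all hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def illustrative_score(cafos, bandwidth_km):
    """Hypothetical score (not the project's CMS formula):
    application rate weighted by the kernel of scaled distance."""
    return sum(
        rate * epanechnikov(distance_km / bandwidth_km)
        for distance_km, rate in cafos
    )

# Hypothetical (distance from well to flowpath in km, application rate) pairs.
cafos_near_well = [(0.5, 10.0), (1.5, 20.0), (3.0, 50.0)]
score = illustrative_score(cafos_near_well, bandwidth_km=2.0)
```

The third CAFO lies beyond the 2 km bandwidth, so it contributes nothing to the score despite having the highest application rate.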

Once CAFO migration scores were computed, a logistic regression model demonstrated a highly statistically significant relationship between CMS and nitrate contamination (*P =* 7.19 × 10^{-12}). In the image below, 344 wells have been divided into deciles based on CAFO migration score, so each point in this plot represents approximately 34 wells. The *x*-coordinate of each point is the average CMS value for wells in that decile, and the *y*-coordinate is the observed number of wells in that decile with nitrate concentrations exceeding 3 mg/L. The plot indicates strong agreement between the observed data and the logistic regression model, as confirmed with a Hosmer-Lemeshow goodness of fit test.
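The decile comparison described above can be sketched as follows. The scores and outcomes are synthetic, and the logistic coefficients are assumed rather than fitted, so the numbers have no connection to the project's results. The code only illustrates how observed and expected counts per decile feed a Hosmer-Lemeshow statistic.

```python
import random
from math import exp

def logistic(x, intercept, slope):
    """Logistic model probability for score x."""
    return 1.0 / (1.0 + exp(-(intercept + slope * x)))

def decile_table(scores, outcomes, intercept, slope, groups=10):
    """Sort wells by score, split into near-equal groups, and return
    (mean score, observed count, expected count, group size) per group."""
    paired = sorted(zip(scores, outcomes))
    n = len(paired)
    rows = []
    for g in range(groups):
        chunk = paired[g * n // groups:(g + 1) * n // groups]
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk)
        expected = sum(logistic(s, intercept, slope) for s, _ in chunk)
        rows.append((mean_score, observed, expected, len(chunk)))
    return rows

def hosmer_lemeshow(rows):
    """Hosmer-Lemeshow statistic: sum of (O - E)^2 / (E * (1 - E / n_g))."""
    return sum(
        (obs - exp_) ** 2 / (exp_ * (1.0 - exp_ / size))
        for _, obs, exp_, size in rows
    )

# Synthetic data: 344 wells with scores in [0, 1] and outcomes drawn from
# an assumed logistic model (intercept -2, slope 3).
rng = random.Random(1)
scores = [i / 343 for i in range(344)]
outcomes = [1 if rng.random() < logistic(s, -2.0, 3.0) else 0 for s in scores]

rows = decile_table(scores, outcomes, intercept=-2.0, slope=3.0)
hl_statistic = hosmer_lemeshow(rows)
```

With 344 wells, each group holds 34 or 35 wells. The statistic is conventionally compared to a chi-squared distribution with the number of groups minus two degrees of freedom, and a small value indicates good agreement between observed and expected counts.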

Charles and Lain are now working to extend this analysis to include more variables, such as depth to water table, pH, total dissolved solids, percent clay, percent organic matter, and annual rainfall. They are also applying random forests, support vector machines, *k*-nearest neighbors, and other classification algorithms to improve the model's classification accuracy. Because testing for nitrate contamination is expensive, the goal is to provide a tool that will help farmers estimate a well's probability of being contaminated using readily available information about that well.

## Application Instructions

- Complete the Graduate Assistantship Application. (This application is free and takes about 10 minutes to complete.)
- Email unofficial transcripts to Dr. Jesse Crawford at jcrawford@tarleton.edu.

We will begin phone interviews in late March and will begin making offers in early April. If you are offered a research assistantship, you will also need to apply to the College of Graduate Studies.