The Murillo Group is active in the areas of data science (DS) and machine learning (ML). Some of our recent work in these areas is described on this page. If you would like to know more or collaborate on related topics, please contact us. We are looking for highly motivated data scientists to collaborate with, whether you are an undergraduate, graduate student, postdoc or seasoned professional.

In addition to the research you will find below, we also like to have fun with DS and ML. Click here for a DS study of wine and here for the ML follow-up.

#### Modeling Animal Movement with GPS Data

Agent-based modeling (ABM) is a technique that defines rules for interacting agents to yield a bottom-up prediction of collective behavior. One can think of this as a generalization of molecular dynamics, where we define the atomic-scale force laws and examine the many-body behavior, to social, financial and political systems, and beyond. Here, we use ABM to examine the spread of a disease; specifically, our goal is to model how chronic wasting disease (CWD) spreads in deer populations. In this first work, we use a data-driven approach (from GPS collar data) to produce a movement model. ML (specifically, the fused LASSO) was used to capture trends in the movement data, which were revealed to have two distinct components: a stationary random walk punctuated with “basin hops” in which the deer relocate to another “basin” where they again temporarily perform a random walk. Other methods, such as Fourier analysis, reveal features such as daily movement patterns. A second paper that specifically addresses the spread of CWD should be published soon.
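
To make the two-component picture concrete, here is a minimal toy simulation of a walk that stays near a basin center and occasionally hops to a new basin. It is only a sketch: the function `simulate_track`, the weak pull toward the basin center (our way of keeping the walk stationary between hops), and every parameter value are illustrative assumptions, not the movement model fitted to the GPS collar data.

```python
import numpy as np

def simulate_track(n_steps=2000, hop_prob=0.005, pull=0.1,
                   step_sd=1.0, hop_sd=50.0, seed=0):
    """Toy 2D track: a walk confined near a basin center, with rare basin hops.

    All parameters are hypothetical and are not fitted to any GPS data.
    """
    rng = np.random.default_rng(seed)
    center = np.zeros(2)                     # current basin center
    pos = np.zeros(2)                        # current animal position
    track = np.empty((n_steps, 2))
    basin_id = np.empty(n_steps, dtype=int)  # which basin the animal is in
    current = 0
    for t in range(n_steps):
        if rng.random() < hop_prob:
            # Basin hop: the center jumps; the walk then relaxes toward it.
            center = center + rng.normal(0.0, hop_sd, size=2)
            current += 1
        # Random step plus a weak pull toward the center, so the walk stays
        # statistically stationary between hops (an assumed mechanism).
        pos = pos + pull * (center - pos) + rng.normal(0.0, step_sd, size=2)
        track[t] = pos
        basin_id[t] = current
    return track, basin_id

track, basin_id = simulate_track()
print("basins visited:", basin_id[-1] + 1)
```

Plotting `track` colored by `basin_id` shows tight clusters joined by occasional long relocations, which is the qualitative structure described above.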

Please read our paper on this if you have more interest:

#### Capturing Turbulent Behavior with Numerical Data

Turbulence is often used as an example of a grand challenge physics problem. Less well known is its economic impact; one example is the drag it causes on the ships that carry the world’s supply chains. Here, Dr. Jouybari simulated turbulent flows across synthetic surfaces with various types and degrees of roughness and employed several ML algorithms for learning the resulting behavior, thus replacing previous models with a much more predictive ML model.

In addition to being a fine research result, a portion of this work was completed as a capstone project in Prof. Murillo’s Applied Machine Learning course.

Please read our paper on this if you have more interest:

#### Large-scale Molecular Dynamics with Density Functional Theory Fidelity

Molecular dynamics (MD) is an important simulation method for describing the dynamics of microscopic many-body systems. MD is, however, limited to fairly small samples because of its treatment of individual atoms (or molecules or ions or electrons). The computational cost is exacerbated when the forces between the particles must be treated through a three-body (or higher) potential or, worse, an on-the-fly N-body potential. Use of such potentials can cause finite-size errors because of the immense cost of simulating large-enough systems. A fundamental question, therefore, is: under what physical circumstances can we safely use fast pair potentials and, conversely, what physical conditions require three-body or higher potentials? To answer this question, data are produced from expensive N-body-potential simulations and the optimal pair potential is learned; because this uses not a *model* for the pair potential but the *optimal* pair potential (as defined by minimizing a non-parametric loss function), we can quantify when three-body physics emerges.
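
One common way to learn a pair interaction from reference forces is force matching with a tabulated (binned) pair force, and the sketch below illustrates that idea with NumPy. It is an assumption-laden toy, not necessarily the procedure used in the paper: the reference forces are synthetic (generated from a Yukawa pair force rather than an expensive N-body calculation), and the helper names, bin count, cutoff and particle configuration are all hypothetical.

```python
import numpy as np

def yukawa_force(r):
    """Force magnitude from u(r) = exp(-r)/r (stand-in reference physics)."""
    return np.exp(-r) * (1.0 + r) / r**2

def reference_forces(pos, r_cut):
    """Synthetic stand-in for forces taken from an expensive simulation."""
    n = len(pos)
    forces = np.zeros((n, 3))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = pos[i] - pos[j]
            r = np.linalg.norm(d)
            if r < r_cut:
                forces[i] += yukawa_force(r) * d / r
    return forces

def pair_design_matrix(pos, edges):
    """Build A so that A @ f gives force components for a binned pair force f."""
    n, n_bins = len(pos), len(edges) - 1
    A = np.zeros((3 * n, n_bins))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = pos[i] - pos[j]
            r = np.linalg.norm(d)
            b = np.searchsorted(edges, r, side="right") - 1  # radial bin of r
            if 0 <= b < n_bins:
                A[3 * i:3 * i + 3, b] += d / r  # unit vector from j to i
    return A

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 4.0, size=(40, 3))   # hypothetical configuration
r_cut = 3.0
edges = np.linspace(0.0, r_cut, 16)         # 15 radial bins (arbitrary choice)

F = reference_forces(pos, r_cut)
A = pair_design_matrix(pos, edges)

# Least-squares "optimal" binned pair force for this data set.
f_fit, *_ = np.linalg.lstsq(A, F.ravel(), rcond=None)
residual = np.linalg.norm(A @ f_fit - F.ravel()) / np.linalg.norm(F.ravel())
print("relative force residual:", residual)
```

In this spirit, the part of the reference forces that no pair force can reproduce (the leftover residual) is one simple indicator that many-body effects are present, although the binning itself also contributes to the residual in this toy.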

Note that this was a Featured Article in Physics of Plasmas with an associated AIP Invited Talk.

Please read our paper on this if you have more interest:

#### Extracting More Information From “Cheap Data”

There is a relentless push toward ever more accurate computational methods, pushed by scientific and engineering needs and pulled by ever faster computer hardware. It is tempting to believe that when a new method appears we should abandon its less accurate predecessors. What use can lower fidelity be? Here, we explore this question and show for the case of plasma transport properties that the use of low-fidelity data can, *sometimes*, dramatically increase the accuracy of high-fidelity results. Seem impossible? Keep reading…
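
A toy example hints at why this can happen. In the sketch below, a cheap model is available everywhere while the expensive one is sampled at only five points; feeding the cheap model's output in as an extra regression feature lets the scarce high-fidelity data be spent on learning the correction alone. Both functions, the sample sizes and the linear fit are illustrative assumptions chosen so that the effect is visible, and they are not the plasma transport data or the method of the paper.

```python
import numpy as np

def low_fidelity(x):
    """Cheap, biased model (hypothetical)."""
    return np.sin(x)

def high_fidelity(x):
    """Expensive reference (hypothetical), related to the cheap model."""
    return 1.2 * np.sin(x) + 0.3 * x

def features_hi_only(x):
    """Features that ignore the cheap model: [1, x]."""
    return np.column_stack([np.ones_like(x), x])

def features_with_lo(x):
    """Same features plus the cheap model's prediction: [1, x, y_lo(x)]."""
    return np.column_stack([np.ones_like(x), x, low_fidelity(x)])

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

# Only five expensive samples are available for training.
x_train = np.linspace(0.0, 6.0, 5)
y_train = high_fidelity(x_train)
x_test = np.linspace(0.0, 6.0, 200)
y_test = high_fidelity(x_test)

coef_a, *_ = np.linalg.lstsq(features_hi_only(x_train), y_train, rcond=None)
coef_b, *_ = np.linalg.lstsq(features_with_lo(x_train), y_train, rcond=None)

rmse_a = rmse(features_hi_only(x_test) @ coef_a, y_test)
rmse_b = rmse(features_with_lo(x_test) @ coef_b, y_test)
print("RMSE without low-fidelity input:", rmse_a)
print("RMSE with low-fidelity input:   ", rmse_b)
```

The gain here is large because the cheap model was constructed to carry the structure the expensive one needs; when the two fidelities are poorly related, the extra feature helps little, which is the caveat behind "sometimes".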

Please read our paper on this if you have more interest:

#### Learning Interpretable Physics Relationships From Data: White Box ML

Supervised ML models tend to predict real values from real-valued inputs (features); this is regression. It is often the case, however, that we seek not the functional relationship embedded in the ML estimator, but a form that is human interpretable. That is, we require “white-box” ML. An exciting approach to white-box learning is to learn the underlying *equations*. Here, a new ML approach is proposed that exploits feature engineering with feature transformation, polynomial feature generation and recursive feature elimination (RFE). Because the best features are known polynomial combinations of the base features, a predictive equation emerges. A side benefit given by RFE is that the human learns what is most important to making predictions, and what is not. Importantly, the human is then a physics guide who can add suspected “best” features; for example, inverse powers are extremely common in science and notoriously difficult for most ML estimators to learn (on their own).

Please read our paper on this if you have more interest: