Learning Continuous Models for Continuous Physics

Machine learning approaches to modeling dynamical systems are typically trained on discrete data, using ML methods that are unaware of the underlying continuity properties. The resulting models may fail to capture the continuous dynamics of the system of interest, leading to poor accuracy outside the training data. To address this challenge, we develop a convergence test based on numerical analysis methods to validate whether a neural network has correctly learned the underlying continuous dynamics. ML models that pass this test are better able to extrapolate and interpolate on a number of different dynamical systems prediction tasks.
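
The idea behind such a convergence test can be illustrated with a classical integrator standing in for a learned update rule (a toy sketch under that assumption, not the project's actual models): a method that has genuinely learned continuous dynamics should exhibit a consistent convergence order as the time step is refined.

```python
import math

def euler_error(h, T=1.0):
    # Integrate dx/dt = -x with forward Euler from x(0) = 1,
    # then compare to the exact solution e^{-T}
    x, n = 1.0, round(T / h)
    for _ in range(n):
        x += h * (-x)
    return abs(x - math.exp(-T))

def observed_order(err_h, err_h2):
    # If err ~ C * h^p, halving h gives p = log2(err_h / err_h2)
    return math.log(err_h / err_h2, 2)

p = observed_order(euler_error(0.1), euler_error(0.05))
# For forward Euler, p should be close to 1; a learned model whose error
# does not shrink consistently as h shrinks fails the convergence test
```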

Led by Aditi Krishnapriyan, Michael Mahoney

Learning Differentiable Solvers for Systems with Hard Constraints

Machine learning has become a more commonly used approach to model physical systems. However, many challenges remain, as incorporating physical information during the machine learning process can often lead to difficult optimization. We design a differentiable neural network layer that is able to enforce physical laws exactly and demonstrate that it can solve many problem instances of parameterized partial differential equations (PDEs) efficiently and accurately.

Led by Aditi Krishnapriyan, Michael Mahoney

ROI Hide-and-Seek

We created the region of interest (ROI) Hide-and-Seek protocol, an algorithm for stress-testing an arbitrary image classification task. It hides the ROI (here, the lungs in a chest X-ray) before classifying the image as pneumonia, normal, or COVID-19. Surprisingly, classification performance remained high even though important parts of the image had been removed. The results showed that naïve interpretations of potentially biased data sources could lead to false COVID-19 diagnoses, since the actual lungs were absent from the input. This work raises awareness of the role of different data sources and accuracy metrics in current deep learning classification of lung imaging.

Led by Dani Ushizima

The Chemical Universe through the Eyes of Generative Adversarial Neural Networks

This project is developing generative machine learning models that can discover new scientific knowledge about molecular interactions and structure-function relationships in chemical sciences. The aim is to create a deep learning network that can predict properties from structural information but can also tackle the “inverse problem,” that is, deducing structural information from properties. To demonstrate the power of the neural network, we focus on bond breaking in mass-spectrometry, combining experimental data with HPC computational chemistry data. Funded by a Lab Directed Research and Development (LDRD) grant.

Led by Bert de Jong

Union of Intersections

Parametric models are ubiquitous in science, and the inferred parameters of those models fit to data can be used to gain insight into the underlying data generation process. Union of Intersections is a novel statistical-machine learning framework that enables improved feature-selection and estimation capabilities in diverse parametric models fit to data. Together, these enhanced inference capabilities lead to improved predictive performance and model interpretability.

Led by Kristofer Bouchard

Uncovering Dynamics

Natural systems are dynamic systems. Often, scientists want to uncover dynamic low-dimensional latent structures embedded in high-dimensional noisy observations. However, most unsupervised learning methods focus on capturing variance, which may not be related to dynamics. We have developed novel dimensionality-reduction methods that optimize an objective function designed directly for dynamic data: the predictive information between the past and future of a time series. We have created two methods based on maximizing predictive information: a linear dimensionality-reduction method, Dynamical Components Analysis (DCA), and a non-linear method, Compressed Predictive Information Coding (CPIC).
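
For stationary Gaussian data, the predictive-information objective has a closed form in terms of log-determinants of the past, future, and joint covariance matrices. A minimal sketch on an AR(1) process (an illustrative example, not the DCA implementation):

```python
import numpy as np

def gaussian_predictive_info(cov_joint, T):
    # I(past; future) = 0.5*(log|S_past| + log|S_future| - log|S_joint|)
    ld = lambda m: np.linalg.slogdet(m)[1]
    return 0.5 * (ld(cov_joint[:T, :T]) + ld(cov_joint[T:, T:]) - ld(cov_joint))

# Stationary AR(1) with coefficient a: Cov(x_t, x_{t+k}) = a^|k| / (1 - a^2)
a, T = 0.8, 5
idx = np.arange(2 * T)
cov = a ** np.abs(idx[:, None] - idx[None, :]) / (1 - a**2)
pi = gaussian_predictive_info(cov, T)
# For a Markov process this collapses to I(x_t; x_{t+1}) = -0.5*log(1 - a^2)
```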

Led by Kristofer Bouchard

Topological Optimization

Optimization, a key tool in machine learning and statistics, relies on regularization to reduce overfitting. Traditional regularization methods control a norm of the solution to ensure its smoothness. We propose a method that instead builds on insights from topological data analysis. This approach enables a faster and more precise topological regularization, the benefits of which we illustrate with experimental evidence. Furthermore, we extend our method to more general topological losses.

Led by Dmitriy Morozov

Large-scale, Self-driving 5G Network for Science

The nation’s emerging fifth-generation (5G) network is significantly faster than previous network generations and has the potential to improve connectivity across the scientific infrastructure. Potential applications include linking remote experimental facilities and distributed sensing instrumentation with supercomputing resources to facilitate transporting and managing the huge volume of data generated by today’s scientific experiments. This project uses artificial intelligence (AI) combined with network virtualization to support complex end-to-end network connectivity – from edge 5G sensors to supercomputing facilities like the National Energy Research Scientific Computing Center (NERSC).

Led by Mariam Kiran

Adaptable Deep Learning and Probabilistic Graphical Model System for Semantic Segmentation

Semantic segmentation algorithms based on deep learning architectures have been applied to a diverse set of problems. Consequently, new methodologies have emerged to push the state-of-the-art in this field forward, and the need for powerful, user-friendly software increased significantly. We introduce a new encoder-decoder system that enables adaptability for complex scientific datasets and model scalability by combining convolutional neural networks (CNN) and conditional random fields (CRF) in an end-to-end training approach. The full integration of CNNs with CRFs enables more efficient training and more accurate segmentation results, in addition to learning representations of the data. Moreover, the CRF model enables the use of prior knowledge from scientific datasets that can be used for better explainability, interpretability, and uncertainty quantification.

Led by Talita Perciano, Matthew Avaylon

Produced Water Application for Beneficial Reuse, Environmental Impact and Treatment Optimization (PARETO)

The Produced Water Application for Beneficial Reuse, Environmental Impact and Treatment Optimization (PARETO) is specifically designed for produced water management and beneficial reuse. The major deliverable of this project will be an open-source, optimization-based, downloadable and executable produced water decision-support application, PARETO, that can be run by upstream operators, midstream companies, technology providers, water end users, research organizations, and regulators.

Led by Dan Gunter

Lorentz Covariant Neural Network

We introduced approximately Lorentz-equivariant neural network (NN) architectures to address key high-energy physics (HEP) data processing tasks. Symmetries arising from physical laws, such as Einstein's special relativity, can improve the expressiveness, interpretability, and resource efficiency of NNs.
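
One common way to respect such symmetries, shown here as an illustrative sketch rather than the specific architectures developed in this project, is to build network inputs from Minkowski inner products of four-momenta, which are unchanged by any Lorentz boost:

```python
import numpy as np

ETA = np.diag([1.0, -1.0, -1.0, -1.0])   # Minkowski metric, (+,-,-,-) convention

def minkowski_dot(p, q):
    # Lorentz-invariant inner product of two four-vectors (E, px, py, pz)
    return p @ ETA @ q

def boost_z(p, beta):
    # Lorentz boost along the z axis with velocity beta
    g = 1.0 / np.sqrt(1 - beta**2)
    L = np.array([[g, 0, 0, g * beta],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [g * beta, 0, 0, g]])
    return L @ p

p = np.array([5.0, 1.0, 2.0, 3.0])
q = np.array([4.0, 0.5, 1.0, 2.0])
# Features built from minkowski_dot survive any boost, so a network fed
# only these features is Lorentz-invariant by construction
```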

Led by Xiangyang Ju

Domain-Aware, Physics-Constrained Autonomous Experimentation

The prior probability density functions of Gaussian processes are entirely defined by a covariance or kernel function and a prior mean function. Chosen correctly, the kernel function has the capability to constrain the solution space to only contain functions with certain domain-knowledge-adhering properties. Furthermore, the training itself can be formulated as a constrained optimization problem. gpCAM, our python software package for autonomous experimentation and Gaussian-process function approximation, is tailored to allow the user to inject physical knowledge via the kernel and the prior mean function. The solutions are significantly more accurate, with more realistically-estimated uncertainties, and can approximate functions using fewer data points.
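
How a kernel can encode domain knowledge is easy to illustrate in plain NumPy (a toy sketch, not gpCAM's API): if a signal is known to repeat with a known period, a periodic kernel lets the posterior mean extrapolate a full period beyond the data.

```python
import numpy as np

def periodic_kernel(x1, x2, length=1.0, period=2 * np.pi):
    # Encodes the prior knowledge that the signal repeats with a known period
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length**2)

def gp_posterior_mean(x_train, y_train, x_test, kernel, noise=1e-6):
    # Standard GP regression mean: K_* (K + noise*I)^{-1} y
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    return kernel(x_test, x_train) @ np.linalg.solve(K, y_train)

x = np.linspace(0, 2 * np.pi, 8, endpoint=False)
y = np.sin(x)
# A period-matched kernel extrapolates one full period past the data
pred = gp_posterior_mean(x, y, x + 2 * np.pi, periodic_kernel)
```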

Led by Marcus Noack

Normalizing Flows for Statistical Data Analysis

We use Normalizing Flows to develop fast Bayesian statistical analysis methods for scientific data analysis that can be applied to a wide range of scientific domains and problems. The methods can be used for posterior sampling and global optimization applications, with or without gradient information. Recent examples are DLMC (Deterministic Langevin Monte Carlo) and Preconditioned Monte Carlo (PocoMC).
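
The building block of any normalizing flow is the change-of-variables formula. A one-transform toy example (illustrative only, not the DLMC or PocoMC implementations) checks it against the Gaussian density it induces:

```python
import math

# A single affine flow y = exp(s) * x + t maps a standard normal to N(t, exp(2s))
def log_prob_y(y, s, t):
    # Change of variables: log p_y(y) = log p_x(f^{-1}(y)) - log|df/dx|
    x = (y - t) * math.exp(-s)
    log_px = -0.5 * (x**2 + math.log(2 * math.pi))
    return log_px - s

# Check against the closed-form N(t, exp(2s)) log-density
s, t, y = 0.5, 1.0, 2.0
sigma = math.exp(s)
closed = -0.5 * ((y - t) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)
```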

Led by Uros Seljak.

NAWI Water Treatment Model Development (WaterTAP)

Water treatment, as a sophisticated environmental and chemical engineering practice, designs physical, chemical, and biological processes to produce clean water. We provide computational and modeling solutions to optimize the performance, energy use, and economic cost of existing and developing water treatment processes and infrastructure. Conventional linear and nonlinear programming are applied to theory- and data-informed equation systems describing the engineering systems at different scales. We ultimately deliver these optimization capabilities as user-friendly, open-source software, and we explore the potential of novel ML and AI algorithms to complement conventional numerical optimization approaches, tackling complexity and dynamics challenges in broad water systems.

Led by Dan Gunter

Institute for the Design of Advanced Energy Systems (IDAES)

The IDAES integrated platform (IDAES-IP) brings the most advanced modeling and optimization capabilities to challenges around the reliable, environmentally sustainable, and cost-efficient transformation and decarbonization of the world’s energy systems. IDAES utilizes state-of-the-art equation-oriented optimization solvers and algorithms to enable the design, optimization, and operation of complex, innovative steady state and dynamic processes.

Led by Dan Gunter

Transformers for Topic Modeling and Recommendation

This project focuses on mining scientific article repositories such as OSTI, Springer, and other databases using Bidirectional Encoder Representations from Transformers (BERT). This technique turns text data into information that helps identify key topics within certain science domains: for example, the main technologies for quality control of batteries, key designs for avoiding short circuits, and new polymeric elements that improve the insulation of electrodes. Beyond topic modeling, our schemes use BERT to provide an unsupervised way to organize scientific articles and enable recommendations that are aware of text semantics.

Led by Dani Ushizima

Codesign of Ultra-Low-Voltage, Beyond CMOS Microelectronics

The goal of this project is to develop an atoms-to-architectures co-design framework enabling sub-100 mV switching of non-volatile logic-in-memory and ultra-efficient digital signal processing for applications such as IoT, sensors, and detectors. The co-design of next-generation hardware is currently a static process involving human-in-the-loop evolution via repeated experiments, modeling, and design-space exploration. Using AI, our goal is to accelerate the pace at which we can iterate on co-designing beyond-CMOS microelectronics.

Led by Lavanya Ramakrishnan

4DCamera Distillery (National Center for Electron Microscopy)

4DCamera Distillery is a program that will develop and deploy AI- and ML-based methods and tools to analyze electron scattering information from the data streams of fast direct electron detectors. The team behind the effort, composed of researchers from Brookhaven National Laboratory, Oak Ridge National Laboratory, Argonne National Laboratory, Sandia National Laboratories, and Los Alamos National Laboratory, as well as Berkeley Lab, will both address the critical need for data reduction tools for these detectors and capitalize on the scientific opportunities to create new modes of measurement and experimentation enabled by fast electron detection.

Led by Andrew Minor, Colin Ophus

Flash Drought Prediction

Flash droughts come on seemingly without warning, and sometimes with devastating effects. In this NSF PREEVENTS project, we are working to advance the understanding and subseasonal-to-seasonal prediction of flash droughts and their associated heat extremes. We are developing machine-learning-based estimates of photosynthesis, evapotranspiration, and respiration at high resolution, in combination with remotely sensed estimates of evaporative stress, to define and characterize flash drought.

Led by Trevor Keenan

HGDL for Hybrid Global Deflated Local Optimization

HGDL is an optimization algorithm specialized in finding not just one but a diverse set of optima, alleviating the non-uniqueness challenges that are common in modern applications such as inversion problems and the training of machine learning models. HGDL is customized for distributed high-performance computing: the workers can be distributed across any number of nodes or cores, and all local optimizations are then executed in parallel. As solutions are found, they are deflated, which effectively removes those optima from the function so that they cannot be re-identified by subsequent local searches.
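
Deflation can be illustrated on a toy root-finding problem for the gradient (a sketch of the idea only, not the HGDL implementation): dividing out roots already found prevents Newton iterations from rediscovering them, so repeated searches from the same start yield new solutions.

```python
def newton_deflated(g, dg, x0, found=(), tol=1e-10, max_iter=100):
    # Newton's method on g(x) = 0, with each previously found root r
    # divided out: u(x) = g(x)/(x - r), u'(x) = (g'(x) - u(x))/(x - r)
    x = x0
    for _ in range(max_iter):
        gx, dgx = g(x), dg(x)
        for r in found:
            q = gx / (x - r)
            dgx = (dgx - q) / (x - r)
            gx = q
        if abs(gx) < tol:
            break
        x -= gx / dgx
    return x

# Critical points of f(x) = (x^2 - 1)^2: solve f'(x) = 4x^3 - 4x = 0
g = lambda x: 4 * x**3 - 4 * x
dg = lambda x: 12 * x**2 - 4

r1 = newton_deflated(g, dg, 0.9)             # nearest critical point, x = 1
r2 = newton_deflated(g, dg, 0.9, (r1,))      # same start; deflation yields x = 0
r3 = newton_deflated(g, dg, 0.9, (r1, r2))   # and then x = -1
```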

Led by Paolo Calafiura

Rotational Dynamics and Transition Mechanisms of Surface-Adsorbed Proteins

Living systems create a wide range of biomolecular arrays with complex functions and exquisite organization. This has inspired synthetic equivalents for a range of applications, enabling new high-throughput approaches to biocomposites, diagnostics, and materials research. This study with data analyses helps broaden the physical understanding of biomolecular assembly by tracking motion at unprecedented resolution and defining a general procedure for using in situ visualization and machine learning to explore such dynamics. This research characterizes the “energy landscape” for protein orientation and, by analyzing the motion of the proteins, shows how that energy landscape controls the rate of motion between different orientations.

Led by Oliver Rübel

Mobiliti

Mobiliti is a traffic simulator for large-scale transportation networks, enabling users to model dynamic congestion and vehicle rerouting behavior in response to hypothetical traffic and network events. We are currently exploring the use of deep reinforcement learning along with graphical models (RL and DCRNNs) to predict traffic, optimize traffic signal controllers, and quantify their impact on system-level congestion.

Led by Cy Chan

Exa.TrkX: HEP Pattern Recognition at the Exascale

Reconstructing the trajectories of thousands of charged particles from a collision event as they fly through a HEP detector is a combinatorially hard pattern recognition problem. Exa.TrkX, a DOE CompHEP project and a collaboration of data scientists and computational physicists from the ATLAS, CMS, and DUNE experiments, is developing Graph Neural Network models aimed at reconstructing millions of particle trajectories per second from Petabytes of raw data produced by the next generation of detectors at the Energy and Intensity Frontiers. Exa.TrkX is also exploring the scaling of distributed training of GNN models on DOE pre-exascale systems and deploying GNN models with microsecond latencies on FPGA-based real-time processing systems.

Led by Paolo Calafiura

Inferring Properties of Nanoporous Materials with Machine Learning and Topology

We use persistent homology to describe the geometry of nanoporous materials at various scales. We combine our topological descriptor with traditional structural features and investigate the relative importance of each to the prediction tasks. Our results not only show a considerable improvement compared to the baseline, but they also highlight that topological features capture information complementary to the structural features. Furthermore, by investigating the importance of individual topological features in the model, we are able to pinpoint the location of the relevant pores, contributing to our atom-level understanding of structure-property relationships.

Led by Dmitriy Morozov

Towards Fast and Accurate Predictions of Radio Frequency Power Deposition and Current Profile via Data-Driven Modeling

Three machine learning techniques (multilayer perceptron, random forest, and Gaussian process) provide fast surrogate models for lower hybrid current drive (LHCD) simulations. A single GENRAY/CQL3D simulation without radial diffusion of fast electrons requires several minutes of wall-clock time to complete, which is acceptable for many purposes, but too slow for integrated modeling and real-time control applications. The machine learning models use a database of more than 16,000 GENRAY/CQL3D simulations for training, validation, and testing. Latin hypercube sampling methods ensure that the database covers the range of nine input parameters with sufficient density in all regions of parameter space. The surrogate models reduce the inference time from minutes to milliseconds with high accuracy across the input parameter space.
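
Latin hypercube sampling itself is simple to sketch (a minimal illustration, not the tooling used to build the database): each input dimension is divided into one stratum per sample, and the stratum order is shuffled independently per dimension so every one-dimensional projection is evenly covered.

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    # Each dimension gets exactly one sample per stratum [s/n, (s+1)/n)
    rng = random.Random(seed)
    cols = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        cols.append([(s + rng.random()) / n_samples for s in strata])
    return list(zip(*cols))  # n_samples points in the unit hypercube

# e.g. 16 design points over nine normalized input parameters
design = latin_hypercube(16, 9)
```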

Led by Talita Perciano, Zhe Bai

Evaluating State Space Discovery by Persistent Cohomology in the Spatial Representation System

Persistent cohomology is a powerful technique for discovering topological structure in data. Strategies for its use in neuroscience are still undergoing development. We comprehensively and rigorously assess its performance in simulated neural recordings of the brain's spatial representation system. Our results reveal how dataset parameters affect the success of topological discovery and suggest principles for applying persistent cohomology to experimental neural recordings.

Led by Dmitriy Morozov

Cosmic Inference: Constraining Parameters with Observations and a Highly Limited Number of Simulations

Cosmological probes pose an inverse problem: measurements are obtained through observations, and the objective is to infer the values of model parameters that characterize the underlying physical system (our universe) from these observations and theoretical forward-modeling. The only way to accurately forward-model physical behavior on small scales is via expensive numerical simulations, which are in turn "emulated" due to their high cost. Emulators are commonly built with a set of simulations covering the parameter space, with the aim of establishing an approximately constant prediction error across the hypercube. We describe a novel statistical framework for obtaining accurate parameter constraints. The proposed framework uses multi-output Gaussian process emulators that are adaptively constructed using Bayesian optimization methods, with the goal of maintaining a low emulation error in the region of the hypercube preferred by the observational data. We compare several approaches for constructing multi-output emulators that allow us to take possible inter-output correlations into account while maintaining the efficiency needed for inference.

Led by Dmitriy Morozov, Zarija Lukić

Self-Supervised Learning for Cosmological Surveys

Sky surveys are the largest data generators in astronomy, making automated tools for extracting meaningful scientific information a necessity. The rich, unlabeled image data from such surveys can be used to develop powerful self-supervised AI models that distill low-dimensional representations robust to symmetries, uncertainties, and noise in each image. These semantically meaningful representations make the self-supervised model immediately useful for downstream tasks like morphology classification, redshift estimation, similarity search, and detection of rare events, paving new pathways for scientific discovery.

Led by Peter Harrington

New Battery Designs and Quality Control with Deep Learning

In a world with ever-growing demand for zero-emission clean energy, vehicle electrification will make major contributions: each clean car that substitutes for a fossil-fuel vehicle could save 1.5 tons of carbon dioxide per year. To expand the e-vehicle fleet, new energy storage solutions must deliver lighter batteries with longer range and more power, such as solid-state lithium metal batteries (LMBs). Unlike traditional lithium-ion batteries, LMBs use solid electrodes and electrolytes, providing superior electrochemical performance and high energy density. One challenge of this new technology is predicting cycling stability and preventing lithium dendrite growth. These morphologies are key to LMB quality, and they can be captured and analyzed using X-ray microtomography. This project develops new deep learning models based on U-Net, Y-Net, and vision transformers (ViTs) for detection and segmentation of defects in LMBs.

Led by Dani Ushizima

Python-based Surrogate Modeling Objects (PySMO)

PySMO is an open-source tool for generating accurate algebraic surrogates that are directly integrated with an equation-oriented (EO) optimization platform, specifically IDAES and its underlying optimization library, Pyomo. PySMO includes implementations of several sampling and surrogate methods (polynomial regression, Kriging, and RBFs), providing a breadth of capabilities suitable for a variety of engineering applications. PySMO surrogates have been demonstrated to be very useful for enabling the algebraic representation of external simulation codes, black-box models, and complex phenomena in IDAES and other related projects.
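
The idea of an algebraic surrogate can be sketched in a few lines (plain NumPy here, not PySMO's actual API): fit a low-order polynomial to a handful of samples from an expensive model, then evaluate the cheap algebraic form in its place.

```python
import numpy as np

# Stand-in for an expensive simulation: y = 1 + 0.5*x - 0.25*x^2
x_samples = np.linspace(0.0, 2.0, 9)
y_samples = 1.0 + 0.5 * x_samples - 0.25 * x_samples**2

# Least-squares quadratic surrogate; exact here because the data is quadratic
surrogate = np.polynomial.Polynomial.fit(x_samples, y_samples, deg=2)

# Cheap algebraic evaluation replaces a call to the simulation code
y_pred = surrogate(1.5)
```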

Led by Oluwamayowa Amusat

Cosmological Hydrodynamic Modeling with Deep Learning

Multi-physics cosmological simulations are powerful tools for studying the formation and evolution of structure in the universe but require extreme computational resources. In particular, modeling the hydrodynamic interactions of baryonic matter adds significant expense but is required to accurately capture small-scale phenomena and create realistic mock-skies for key observables. This project uses deep neural networks to reconstruct important hydrodynamical quantities from coarse or N-body-only simulations, vastly reducing the amount of compute resources required to generate high-fidelity realizations while still providing accurate estimates with realistic statistical properties.

Led by Peter Harrington

Surrogate Model for Simulating Hadronization Processes

We developed a neural network-based surrogate model for simulating the process whereby partons are converted to hadrons in high-energy physics. This is the first step toward a fully data-driven, neural network-based hadronization simulator.

Led by Xiangyang Ju

Supercomputing-Scale AI on the Perlmutter System at NERSC

The Perlmutter system is a world-leading AI supercomputer consisting of over 6,000 NVIDIA A100 GPUs, an all-flash filesystem, and a novel high-speed network. The National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab also works closely with vendors to ensure optimized software for AI at large computing scale and provides consulting, joint projects, and training to enable the community to exploit these resources.

Led by Steven Farrell

FourCastNet

FourCastNet, short for Fourier Forecasting Neural Network, is a global data-driven weather forecasting model that provides accurate short to medium-range global predictions at high resolution. FourCastNet accurately predicts high-resolution, fast-timescale variables such as the surface wind speed, precipitation, and atmospheric water vapor. It can generate forecasts with extreme computational savings compared to standard numerical weather prediction models. It has important implications for planning wind energy resources and predicting extreme weather events such as tropical cyclones, extra-tropical cyclones, and atmospheric rivers.
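
Fourier-based forecasting models operate on spatial frequency modes. A toy one-dimensional sketch of the spectral mixing step (illustrative of the general idea only, not FourCastNet's architecture):

```python
import numpy as np

def spectral_mix(signal, mode_weights):
    # FFT to frequency space, reweight the lowest modes, zero the rest,
    # then return to physical space: the core of Fourier-operator layers
    n_modes = len(mode_weights)
    spec = np.fft.rfft(signal)
    spec[:n_modes] *= mode_weights
    spec[n_modes:] = 0.0
    return np.fft.irfft(spec, n=len(signal))

n = 64
x = np.sin(2 * np.pi * 3 * np.arange(n) / n)   # a single low-frequency mode
out = spectral_mix(x, np.ones(8))              # identity weights keep it intact
```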

Led by Shashank Subramanian

MLPerf HPC

MLPerf HPC is a machine learning performance benchmark suite for scientific ML workloads on large supercomputers. It measures the time to train deep learning models on massive scientific datasets as well as full system scale throughput for training many models concurrently. MLPerf HPC has had two successful annual submission rounds featuring results on systems around the world, including the Perlmutter system at NERSC.

Led by Steven Farrell

Cyber Security of Power Distribution Systems by Detecting Differences Between Real-time Micro-Synchrophasor Measurements and Cyber-Reported SCADA

The power distribution grid, like many cyber-physical systems, was developed with careful consideration for safe operation, but a number of features of the power system make it particularly vulnerable to cyber attacks via IP networks. The goal of this project was to design and implement a measurement network that can detect and report the resultant impact of cyber security attacks on the distribution system network. The result is a system that provides an independent, integrated picture of the distribution grid’s physical state, which is difficult for a cyber-attacker to subvert using data-spoofing techniques.

Led by Sean Peisert

Surrogate Modeling for Biofuel and Bioproduct Production

This project uses complex process simulation models for advanced biofuel and bioproduct production to develop and train machine learning (ML)-based surrogate models. Researchers need flexibility to explore different scenarios and understand how their work may impact upstream and downstream processes, as well as cost and greenhouse gas emissions. To address this need, the team uses the Tree-Based Pipeline Optimization Tool (TPOT) to automatically identify the best ML pipelines for predicting cost and mass/energy flow outputs. This approach has been used with two promising bio-based jet fuel blendstocks: limonane and bisabolane. The results show that ML algorithms trained on simulation trials may serve as powerful surrogates for accurately approximating model outputs at a fraction of the computational expense.

Led by Corinne Scown

gpCAM for Domain-Aware Autonomous Experimentation

The gpCAM project consists of an API and software designed to make autonomous data acquisition and analysis for experiments and simulations faster, simpler, and more widely available by leveraging active learning. The tool is built around a flexible and powerful Gaussian process regression core, which provides the ability to compute surrogate models and associated uncertainties. The flexibility and agnosticism stem from the modular design of gpCAM, which allows users to implement and import their own Python functions to customize and control almost every aspect of the software. That makes it possible to tune the algorithm to account for various kinds of physics and other domain knowledge and constraints, and to identify interesting features and function characteristics. A specialized function optimizer in gpCAM can take advantage of high-performance computing (HPC) architectures for fast analysis and reactive autonomous data acquisition.

Led by Marcus Noack

Securing Automated, Adaptive Learning-Driven Cyber-Physical System Processes

Numerous DOE-relevant processes are becoming automated and adaptive using machine learning techniques. Such processes include vehicle and traffic navigation guidance, intelligent transportation systems, adaptive control of grid-attached equipment, and large scientific instruments. This creates a vulnerability for a cyber attacker to sabotage processes through tainted training data or specially crafted inputs. Consequences might be tainted manufactured output, traffic collisions, power outages, or damage to scientific instruments or experiments. This project is developing secure machine learning methods that will enable the safer operation of automated, adaptive, learning-driven “cyber-physical system” processes.

Led by Sean Peisert

Supervisory Parameter Adjustment for Distribution Energy Storage (SPADES)

The Supervisory Parameter Adjustment for Distribution Energy Storage (SPADES) project will develop methodology and tools allowing energy storage systems (ESS) to automatically reconfigure themselves to counteract cyberattacks against both the ESS control system directly and indirectly through the electric distribution grid. The reinforcement learning defensive algorithms will be integrated into the National Rural Electric Cooperative Association (NRECA) Open Modeling Framework (OMF), thereby allowing defensive strategies to be tailored on a utility-specific basis. The major outcomes of this project will be the tools to isolate the component of the ESS control system that has been compromised during a cyberattack, as well as policies for changing the control parameters of ESS to mitigate a wide variety of cyberattacks on both the ESS device itself and the electric distribution grid.

Led by Daniel Arnold

Securing Solar for the Grid (S2G)

Berkeley Lab is leading two working groups on cybersecurity issues in inverter-based resources (IBR) and distributed energy resources (DER). The first working group is examining cybersecurity issues in AI-based automation for IBR/DER. Automation has brought significant advantages to the power grid for ensuring stability, increasing efficiency, and even providing cybersecurity benefits. At the same time, automation significantly increases cybersecurity risks because automated systems can be remotely attackable and have vulnerabilities similar to other types of computing systems. The second working group is examining data confidentiality issues for IBR/DER. Many data privacy and confidentiality issues arise when data is shared, but at the same time, data sharing is essential to planning, research, and efficient operation. Understanding the intersection of confidentiality concerns and the role of privacy-preserving methods could enable both data sharing and confidentiality.

Led by Sean Peisert

Automating Data Acquisition & Analysis

Stochastic Processes for Function Approximation and Autonomous Data Acquisition at Large-Scale Experimental Facilities

Scientists are increasingly faced with ever more complex experiments. The vast dimensionality of parameter spaces underlying investigations in the biological, chemical, physical, and materials sciences challenges the most advanced data acquisition and analysis systems. While growing data-acquisition rates offer some relief, the complexity of experiments and the subtle dependence of the model function on input parameters remain daunting due to the sheer number of variables.

This project aims to develop new stochastic process-based mathematical and computational methods to achieve high-quality, domain-aware function approximation, uncertainty quantification, and, by extension, autonomous experimentation. One product of this project is gpCAM, a simple-to-use, flexible, and HPC-ready Python-based software tool for Gaussian process-based function approximation and autonomous experimentation. 

Led by Marcus Noack

Machine Learning-Enabled Surrogate Modeling for Biofuel and Bioproduct Production 

This project uses complex process simulation models for advanced biofuel and bioproduct production to develop and train machine learning (ML)-based surrogate models. Researchers need flexibility to explore different scenarios and understand how their work may impact upstream and downstream processes, as well as cost and greenhouse gas emissions. To address this need, the team uses the Tree-Based Pipeline Optimization Tool (TPOT) to automatically identify the best ML pipelines for predicting cost and mass/energy flow outputs. This  approach has been used with two promising bio-based jet fuel blendstocks: limonane and bisabolane. The results show that ML algorithms trained on simulation trials may serve as powerful surrogates for accurately approximating model outputs at a fraction of the computational expense. The web-based surrogate models are posted on

Led by Corinne Scown.

Scientific Machine Learning for Simulation and Control in Large Scale Power Systems

With the growing penetration of wind, solar, and storage technologies, all interfaced to the grid via fast-acting power electronic converters (PECs), our power systems are rapidly evolving. One challenge with increased PEC penetration is running time-series computer simulations of these systems, which are critical for understanding and reliably operating our electrical networks. Under current approaches, the addition of PECs makes these simulations take significantly longer due to the converters' fast response rates and spatial diversity.

This project aims to develop new tools at the intersection of scientific machine learning (SciML) and power systems engineering. These tools will accelerate the simulation of power systems with high penetrations of PECs to ensure that we can simulate these systems in near real-time. This acceleration will be achieved by 1) using SciML to develop accurate models of aggregations of PECs in order to reduce the number of equations we need to solve and 2) using SciML to improve the mathematical techniques we use for solving these equations.

Led by Duncan Callaway.

Science Search

As scientific datasets increase in both size and complexity, the ability to label, filter, and search this deluge of information has become a laborious, time-consuming, and sometimes impossible task without the help of automated tools. To overcome this challenge, researchers at Berkeley Lab are developing innovative machine learning tools to pull contextual information from scientific datasets and automatically generate metadata tags for each file. Scientists can then search these files via Science Search, a web-based search engine for scientific data. 

Led by Katie Antypas.


CAMERA

Experimental science is evolving. With the advent of new technology, scientific facilities are collecting data at increasing rates and higher resolution. But making sense of this data is becoming a major bottleneck. New mathematics and algorithms are needed to extract useful information from these experiments. To address these growing needs, the Center for Advanced Mathematics for Energy Research Applications (CAMERA) is working with scientists across disciplines to develop fundamental mathematics and algorithms, delivered as data analysis software that can accelerate scientific discovery.

Led by James Sethian.

MetaBio IDS

The overarching objective of this interdisciplinary science project is to leverage new theory and observations in land-, atmosphere-, and space-based research to accurately partition global carbon fluxes between terrestrial ecosystems and the atmosphere at high spatial and temporal resolution. Machine learning methods, in particular simple and deep neural networks and generalized additive models, have proven to be powerful tools for this task. This project uses ML tools both to diagnose biases in global land surface models and to derive new information from time series of carbon fluxes between ecosystems and the atmosphere provided by distributed sensing networks such as AmeriFlux. Doing so both provides novel model diagnostics and amplifies the impact and utility of DOE investments in observational platforms.

Led by Trevor Keenan.

Route Choice Behavior at Urban Scale

Knowing how individuals move between places on the urban scale is critical to infrastructure and transportation systems planning. However, current route-choice models are stymied by the messy host of human factors that play into individual routing decisions. 

Using machine learning techniques and location data generated as a byproduct of smartphone application use, this project aims to better understand how individuals organize their travel plans into a set of routes and how similar behavior patterns emerge among distinct individual choices. This technique has the potential to inform demand management strategies that target individual users while generating large-scale estimates that can be used in urban-wide traffic planning.

Led by Marta C. González.


Deep Learning and Satellite Imagery to Estimate Air Quality Impact at Scale

Deep Learning and Satellite Imagery to Estimate Air Quality Impact at Scale, or DeepAir, uses deep learning algorithms to analyze satellite images combined with traffic information from cell phones and data already being collected by environmental sensors to improve air quality predictions. Scientists already use sophisticated models that consider factors such as wind speed, pressure, precipitation, and temperature to make predictions about pollution levels. DeepAir uses an array of distributed, existing data sources, including mobile phones, to help inventory man-made pollutants (such as vehicle exhaust and power plant emissions) as they actually enter the environment.

The resulting analysis aims to ultimately inform the design of more efficient and more timely interventions, such as the San Francisco Bay Area's “Spare the Air” days.

Led by Marta C. González.


Exa.TrkX

Reconstructing the trajectories of charged particles from a collision event as they fly through a High Energy Physics (HEP) detector is a combinatorially difficult pattern recognition problem. Exa.TrkX is a collaboration of data scientists and computational physicists who are developing graph neural network models aimed at reconstructing millions of particle trajectories per second from petabytes of raw data produced by the next generation of particle tracking detectors at the energy and intensity frontiers. Exa.TrkX is also exploring the scaling of distributed training of graph neural networks on U.S. Department of Energy pre-exascale systems and the deployment of graph neural network models with microsecond latencies on field-programmable gate array-based (FPGA-based) real-time processing systems.
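A rough, hypothetical sketch of the graph-construction step that typically precedes the neural network in such pipelines: hits on adjacent detector layers are connected when they fall within a small angular window, and the resulting candidate edges are what a graph neural network would later score as track segments. The hit coordinates and window size below are invented for illustration; real pipelines work with full 3D detector geometry.

```python
import numpy as np

# Hypothetical hits: (layer, phi), a layer index and an azimuthal angle in radians.
hits = np.array([
    [0, 0.10], [0, 1.50],
    [1, 0.12], [1, 1.55],
    [2, 0.15], [2, 1.52],
])

def build_edges(hits, max_dphi=0.1):
    """Connect hits on adjacent layers whose phi difference is small.
    These candidate edges would then be classified by a graph neural network."""
    edges = []
    for i, (li, pi) in enumerate(hits):
        for j, (lj, pj) in enumerate(hits):
            if lj == li + 1 and abs(pj - pi) < max_dphi:
                edges.append((i, j))
    return edges

edges = build_edges(hits)
```

Here the six hits yield two disjoint chains of candidate edges, one per underlying track; at collider scale, the same windowing idea keeps the graph sparse enough for the GNN to process millions of hits.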

Led by Paolo Calafiura.

AR1K: Engineering Agriculture through Machine Learning in BioEPIC

In an effort to revolutionize agriculture and create sustainable farming practices that benefit both the environment and farms, researchers from Berkeley Lab, the University of Arkansas, and Glennoe Farms are bringing together molecular biology, biogeochemistry, environmental sensing technologies, and machine learning and applying them to soil research. This project aims to reduce the need for chemical fertilizers and enhance soil carbon uptake to improve the long-term viability of land and increase crop yields.  

Led by Ben Brown.


IDEAL

The high data-throughput of scientific instruments has made image recognition one of the most challenging problems in scientific research today. Supported by a U.S. DOE Early Career Award, Image across Domains, Experiments, Algorithms and Learning (IDEAL) focuses on computer vision and machine learning algorithms and software to enable timely interpretation of experimental data recorded as 2D or multispectral images.

Led by Daniela Ushizima.

Data Analytics for Commercial Buildings

State-of-the-art analytics software and modeling tools can provide valuable insights into efficiency opportunities. However, prior research has shown that key barriers include relatively limited data sources (smart meters and weather being the most common in commercial tools) and reliance upon user-provided inputs for which default values may be the fallback. There is great opportunity to apply techniques based on multi-stream data fusion and machine learning to overcome these challenges.

Led by Jessica Granderson.


ExaLearn

As supercomputers become ever more capable in their march toward exascale levels of performance, scientists can run increasingly detailed and accurate simulations to study problems ranging from cleaner combustion to the nature of the universe. The challenge is that these powerful simulations are “computationally expensive,” consuming 10 to 50 million CPU hours for a single simulation. The ExaLearn project aims to develop new tools to help scientists overcome this challenge by applying machine learning to very large experimental datasets and simulations.

Led by Peter Nugent.

Feedstock to Function

The goal of this project is to improve bio-based product and fuel development through adaptive technoeconomic and performance modeling. Toward this end, we are developing a comprehensive Feedstock to Function software tool (F2FT) that harnesses the power of machine learning to predict properties of high-potential molecules (fuels, fuel co-products, and other bioproducts) derived from biomass. This tool can also be used to evaluate the cost, benefits, and risk of promising biobased molecules or biofuels to enable faster, less expensive bioprocess optimization and scale-up.

Led by Vi Rapp.

Machine Learning for High Energy Physics

This project develops both simulation-based and simulation-independent deep learning techniques for high energy physics. In particular, we are finding machine learning solutions to make the best use of physics-based simulations for inference. Many of these physics-based simulations are too slow, so we are also developing deep learning solutions to augment or supplant them using generative models.

In parallel, we are developing simulation-independent techniques to broaden our analysis sensitivity. These methods require less than full supervision and include anomaly detection to search for new particles at the Large Hadron Collider.
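The actual LHC analyses are far more sophisticated, but the core idea of density-based anomaly detection can be sketched: flag events that fall in sparsely populated regions of a feature distribution, without ever labeling what "signal" looks like. The feature, the injected resonance, and the histogram density estimator below are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
background = rng.normal(0.0, 1.0, 100000)  # smooth, Standard Model-like feature
signal = rng.normal(5.0, 0.1, 50)          # hypothetical rare resonance
events = np.concatenate([background, signal])

# Estimate the event density with a histogram, then score each event
# by how improbable its local bin is (high score = low local density).
counts, edges = np.histogram(events, bins=100, density=True)
bins = np.clip(np.digitize(events, edges) - 1, 0, 99)
score = -np.log(counts[bins] + 1e-12)

# Keep only the most anomalous 0.1 percent of events.
anomalies = events[score > np.quantile(score, 0.999)]
```

Note that no labels were used: the unsupervised score surfaces the injected cluster near 5 alongside the far tails of the background, which is the sense in which such searches are "less than supervised."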

Led by Benjamin Nachman.


ExaSheds

Population growth, changes in land use, climate change, and extreme weather are a few of the factors threatening not only freshwater supplies but all the other systems that rely on watersheds, including hydropower and agriculture.

To help meet these challenges, the DOE BER-funded ExaSheds project seeks to fundamentally change how watershed function is understood and predicted. Combining leadership-class computers, big data, and machine learning with learning-assisted physics-based simulation tools, ExaSheds scientists are working to provide a full treatment of water flow and biogeochemical reactive transport at watershed to river basin scales.

Led by Carl Steefel.

CIGAR: Cybersecurity via Inverter-Grid Automatic Reconfiguration

The U.S. power distribution grid connects thousands of power plants to hundreds of millions of electricity customers across the country. In today's highly connected world, it is unrealistic to assume that energy delivery systems are isolated or immune from online threats; thus, as the grid is modernized, new features must be deployed to protect against cyberattacks. Berkeley Lab’s CIGAR project is developing and testing machine learning algorithms to counteract cyber-physical attacks that can compromise multiple systems in the electric grid. Funded by the DOE Office of Cybersecurity, Energy Security, and Emergency Response (CESER)'s Cybersecurity for Energy Delivery Systems program.

Led by Sean Peisert and Dan Arnold.


DAPHNE

Networks are the essential links connecting science collaborations around the world. But as these collaborations grow and science experiments generate ever more data, meeting these needs with upgraded hardware alone, like routers or optical fiber, can get expensive. That’s why the Deep and Autonomously Performing High-Speed Networks (DAPHNE) project is also looking into applying artificial intelligence software tools to design and manage distributed network architectures. These tools could effectively improve data transfers, guarantee high throughput, and advance traffic engineering by understanding and better predicting network traffic flows.

Led by Mariam Kiran.

Joint Social Sequence Analysis to Predict Travel Behavior

The analysis of categorical and longitudinal time series, called sequences, has great value for social science applications to study the span of life trajectories, careers, decision points, and family structure. The goal of this project is to investigate lifelong trajectory dynamics based on demographic characteristics, education, and other lifestyle variables. By analyzing entire lifelong sequences, it is possible to discover representative patterns from the overall life trajectory of a given individual’s characteristics and the pathway through which one arrives at a given state, travel decision, or mobility behavior. In contrast to traditional big-data approaches that aim to solve the "largeness" of numerical data, we aim to tackle the "largeness" that arises from data types, dimensionality, and heterogeneity in data quality that are common in categorical social sequences. This work has resulted in one IEEE conference publication and one journal article published by the Association for Computing Machinery.

Led by Ling Jin, Anna Spurlock, and Annika Todd.

Statistical Mechanics for Interpretable Learning Algorithms

Machine learning has the potential to revolutionize scientific discovery, but it also has some limitations. One of them is interpretability. As machine learning networks grow larger and more complex, understanding how the networks behave and how they reach the results is difficult, if not impossible. This project is using statistical mechanics to interpret how popular machine learning algorithms behave, give users more control over these systems, and enable them to reach the results faster. As a proof of concept, this project is working closely with Berkeley Lab’s Distributed Acoustic Sensing project.

Another machine learning challenge is that data is the only guide for these tools. One way to improve interpretability for scientific applications is to integrate the laws of physics into the learning process. This project will also explore machine learning algorithms informed by physics.

Led by John Wu, Michael Mahoney, and Jonathan Ajo-Franklin.

Sensor Data Integration

Although sensor and meter data are becoming more available in buildings, their application is limited. This project employs advanced data analytics and inverse modeling techniques, integrating the sensor and meter data with physics-based models to evaluate and improve building energy efficiency and demand flexibility in support of grid-interactive efficient buildings (GEB). One barrier associated with sensor data mining is privacy. To address this concern, we used a generative adversarial network (GAN) to anonymize smart meter data. GAN is a machine learning technique that can recover an unknown probability distribution purely from data. Using GAN, we can remove sensitive privacy information while capturing the key statistics of the original data. This could motivate data owners to share their data. We also applied inverse modeling and parameter identification techniques to extract thermal dynamics of residential buildings from a large smart-thermostat dataset; these data are used to estimate the demand flexibility potential of the U.S. residential sector. 
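A GAN for full smart-meter time series is beyond a short example, but the adversarial training loop can be sketched on a one-dimensional toy "statistic." The linear generator, logistic discriminator, and hand-coded gradients below are deliberate simplifications, not the project's model.

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(2.0, 0.5, size=(10000,))  # stand-in for a meter-data statistic

# Linear generator G(z) = wg*z + bg; logistic discriminator D(x) = sigmoid(wd*x + bd)
wg, bg = 1.0, 0.0
wd, bd = 0.1, 0.0
lr = 0.05

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for step in range(3000):
    xr = rng.choice(real, 64)          # minibatch of real samples
    z = rng.normal(size=64)            # latent noise
    xf = wg * z + bg                   # fake samples
    # Discriminator: gradient ascent on log D(xr) + log(1 - D(xf))
    dr, df = sigmoid(wd * xr + bd), sigmoid(wd * xf + bd)
    wd += lr * np.mean((1 - dr) * xr - df * xf)
    bd += lr * np.mean((1 - dr) - df)
    # Generator: gradient ascent on log D(G(z)) (non-saturating loss)
    df = sigmoid(wd * xf + bd)
    wg += lr * np.mean((1 - df) * wd * z)
    bg += lr * np.mean((1 - df) * wd)

fake = wg * rng.normal(size=10000) + bg
```

The anonymization point is that `fake` shares the key statistics of `real` while no synthetic sample corresponds to any actual household's record.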

Led by Tianzhen Hong.

AlphaBuilding: Machine Learning for Advanced Building Controls

Modern buildings are becoming increasingly complex to manage and control. Buildings are significant energy consumers and carbon emitters, and people spend 90 percent of their time indoors, so a comfortable, productive, and healthy indoor environment is crucial for occupants' well-being. Conventional building controls, such as schedule-based setpoint tracking, fail to optimize the multiple objectives of building operation. Model-based techniques such as model predictive control can achieve better performance but are hard to implement and scale up. We applied reinforcement learning to optimize building control, with the goals of reducing energy consumption, enhancing demand flexibility, curtailing carbon emissions, and improving occupants' well-being. We propose a two-stage training process: imitation learning of the state-of-the-art building industry control standard for pre-training, followed by fine-tuning through interaction with the environment in real buildings. Physics-informed learning that introduces key building thermal dynamics parameters is also being explored.

Led by Tianzhen Hong.

Combining Data-driven and Science-based Generative Models

This project investigates the many connections between data-driven and science-driven generative models. When do scientists use physical models to create synthetic data for science applications? When do we supplement them with data-driven machine learning models? Conversely, can researchers use physical models to improve on current data-driven generative models in machine learning? Recently, this project took an important step toward integrating deep learning and numerical simulations by developing FlowPM, a new distributed N-body cosmological simulation code created in Mesh TensorFlow. Funded by a Lab Directed Research and Development (LDRD) grant.

Led by Uros Seljak.

Accelerator Advancements Through Machine Learning

The Advanced Light Source (ALS) is a third-generation synchrotron light source offering high-brightness, ultra-stable x-rays to experiments at over three dozen beamlines. The stability of this operating DOE user facility is now being further improved through the application of machine learning. Neural networks have, for the first time, been successfully trained to enable a novel feed-forward that increases source-size stability by up to an order of magnitude compared to conventional physics-model-based approaches. Training such networks takes substantially less dedicated machine time than the previously required model calibration measurements. In addition, deep learning is being studied to accelerate and improve design algorithms for optimizing future synchrotron lattices. The ultra-high-brightness ALS-U storage ring lattice presently under design is serving as an especially relevant test case. Jointly funded by DOE BES (ADRP) and ASCR programs.

Led by Simon C. Leemann.

Interactive Machine Learning for Tomogram Segmentation and Annotation

Three-dimensional (3D) cellular bioimaging can help us understand how living systems respond to genes, regulation, and environment in time and space. However, interpreting these 3D images is a time-consuming, manually intensive process that can take up to three months. This project uses machine learning techniques to speed up the segmentation of 3D cryo-electron tomograms, with the goal of reducing image interpretation time to around one hour, about the same amount of time it takes to capture a multi-gigabyte 3D image.

Funded by a Lab Directed Research and Development (LDRD) grant. 

Led by Nicholas K. Sauter.

The Chemical Universe through the Eyes of Generative Adversarial Neural Networks

This project is developing generative machine learning models that can discover new scientific knowledge about molecular interactions and structure-function relationships in the chemical sciences. The aim is to create a deep learning network that can predict properties from structural information but can also tackle the “inverse problem,” that is, deducing structural information from properties. To demonstrate the power of the neural network, we focus on bond breaking in mass spectrometry, combining experimental data with HPC computational chemistry data. Funded by a Lab Directed Research and Development (LDRD) grant.

Led by Wibe Albert de Jong.

Occupancy-Responsive Model Predictive Control at Room, Building, and District Levels

This project is focused on developing, testing, and demonstrating an open source computational framework that implements model predictive control (MPC) at three scales - room, building, and district (a group of buildings) - to optimize building operation and thus reduce energy use and improve occupant comfort. An accurate prediction of internal heat gains and occupants’ thermal demands is the prerequisite for developing and implementing MPC. Machine learning techniques are used to infer occupant count from WiFi data, recognize electricity consumption patterns, and predict plug-load and internal heat gains. 
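As a simplified, hypothetical sketch of the WiFi-to-occupancy inference step: a least-squares fit maps WiFi-connected device counts to occupant counts, which then feed internal heat gain estimates for the MPC. The paired observations below are invented, and real models are richer than a single linear fit.

```python
import numpy as np

# Hypothetical paired observations: WiFi device counts vs. ground-truth occupants.
wifi = np.array([0, 5, 12, 20, 28, 35], dtype=float)
occupants = np.array([0, 3, 8, 13, 19, 23], dtype=float)

# Fit occupants ~ a * wifi + b by least squares.
A = np.column_stack([wifi, np.ones_like(wifi)])
(a, b), *_ = np.linalg.lstsq(A, occupants, rcond=None)

# Inferred occupancy for a new reading, e.g. 16 devices observed.
est = a * 16 + b
```

The slope absorbs the fact that occupants may carry zero, one, or several devices; periodic re-calibration against ground-truth counts keeps the mapping honest as device habits drift.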

Machine learning and occupant modeling led by Tianzhen Hong. Project led by Mary Ann Piette.

Deep Learning for Science

Deep learning tools are increasingly being adopted by the scientific community because they help scientists address a number of data-intensive analytics problems, such as identifying and analyzing extreme weather events and enabling precise measurements of the parameters that describe dark energy. To support these efforts, Berkeley Lab’s DL4SCI initiative is focused on three key computing challenges - handling complex datasets, developing interpretable methods, and improving performance and scaling - across multiple science areas, including cosmology, electron microscopy, and nuclear physics. Funded by a Lab Directed Research and Development (LDRD) grant.