Automating Data Acquisition & Analysis

Stochastic Processes for Function Approximation and Autonomous Data Acquisition at Large-Scale Experimental Facilities

Scientists are increasingly faced with ever more complex experiments. The vast dimensionality of parameter spaces underlying investigations in the biological, chemical, physical, and materials sciences challenges the most advanced data acquisition and analysis systems. While growing data-acquisition rates offer some relief, the complexity of experiments and the subtle dependence of the model function on input parameters remain daunting due to the sheer number of variables.

This project aims to develop new stochastic process-based mathematical and computational methods to achieve high-quality, domain-aware function approximation, uncertainty quantification, and, by extension, autonomous experimentation. One product of this project is gpCAM, a simple-to-use, flexible, and HPC-ready Python-based software tool for Gaussian process-based function approximation and autonomous experimentation. 

Led by Marcus Noack.
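
To make the idea of Gaussian process-based function approximation and autonomous experimentation concrete, here is a minimal sketch using only NumPy. It does not use gpCAM or its API. The kernel choice, the `hypothetical_instrument` function, and the rule "measure next where the posterior variance is largest" are illustrative assumptions, not the project's actual method.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = rbf_kernel(x_query, x_query).diagonal() - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)

def hypothetical_instrument(x):
    """Made-up stand-in for an expensive measurement (an assumption)."""
    return np.sin(6.0 * x) + 0.5 * x

def autonomous_loop(n_steps=10):
    """Repeatedly measure where the GP is most uncertain."""
    candidates = np.linspace(0.0, 1.0, 101)
    x = np.array([0.0, 1.0])  # two initial measurements
    y = hypothetical_instrument(x)
    for _ in range(n_steps):
        _, var = gp_posterior(x, y, candidates)
        x_next = candidates[np.argmax(var)]  # highest-uncertainty point
        x = np.append(x, x_next)
        y = np.append(y, hypothetical_instrument(x_next))
    mean, var = gp_posterior(x, y, candidates)
    return x, y, candidates, mean, var
```

Calling `autonomous_loop()` returns the chosen measurement points together with the final posterior mean and variance on the candidate grid. In a real experiment the acquisition rule and kernel would be chosen with domain knowledge, which is the "domain-aware" part the project description refers to.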

Machine Learning-Enabled Surrogate Modeling for Biofuel and Bioproduct Production 

This project uses complex process simulation models for advanced biofuel and bioproduct production to develop and train machine learning (ML)-based surrogate models. Researchers need flexibility to explore different scenarios and understand how their work may impact upstream and downstream processes, as well as cost and greenhouse gas emissions. To address this need, the team uses the Tree-Based Pipeline Optimization Tool (TPOT) to automatically identify the best ML pipelines for predicting cost and mass/energy flow outputs. This approach has been used with two promising bio-based jet fuel blendstocks: limonane and bisabolane. The results show that ML algorithms trained on simulation trials may serve as powerful surrogates for accurately approximating model outputs at a fraction of the computational expense. The web-based surrogate models are posted on

Led by Corinne Scown.
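
The surrogate idea described above (train a cheap model on simulation trials, then query it instead of the simulator) can be sketched as follows. This is an illustration under stated assumptions: `hypothetical_process_simulation` is an invented formula standing in for a slow process model, and a fixed polynomial least-squares fit stands in for the ML pipelines that TPOT would search over. It does not use TPOT.

```python
import numpy as np

def hypothetical_process_simulation(inputs):
    """Made-up stand-in for a slow process simulation (an assumption).

    inputs[:, 0] is a yield fraction, inputs[:, 1] a feedstock cost;
    the return value is an invented production cost.
    """
    yield_frac, feed_cost = inputs[:, 0], inputs[:, 1]
    return feed_cost / yield_frac + 0.3 * feed_cost + 1.5

def features(inputs):
    """Polynomial features used by the surrogate."""
    y, c = inputs[:, 0], inputs[:, 1]
    return np.column_stack([np.ones_like(y), y, c, y * c, y ** 2, c * y ** 2])

def fit_surrogate(inputs, outputs):
    """Least-squares fit of the surrogate coefficients."""
    coeffs, *_ = np.linalg.lstsq(features(inputs), outputs, rcond=None)
    return coeffs

def predict(coeffs, inputs):
    """Cheap surrogate prediction."""
    return features(inputs) @ coeffs

def evaluate_surrogate(n_train=200, n_test=50, seed=0):
    """Train on simulated trials, return relative errors on held-out trials."""
    rng = np.random.default_rng(seed)

    def sample(n):
        return np.column_stack([rng.uniform(0.5, 0.9, n),
                                rng.uniform(1.0, 3.0, n)])

    train_x = sample(n_train)
    coeffs = fit_surrogate(train_x, hypothetical_process_simulation(train_x))
    test_x = sample(n_test)
    truth = hypothetical_process_simulation(test_x)
    return np.abs(predict(coeffs, test_x) - truth) / truth
```

`evaluate_surrogate()` returns the held-out relative errors, which is the kind of check one would run before trusting a surrogate in place of the full simulation.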

Scientific Machine Learning for Simulation and Control in Large Scale Power Systems

With the growing penetration of wind, solar, and storage technologies, all interfaced to the grid via fast-acting power electronic converters (PECs), our power systems are rapidly evolving. One challenge of adding more PECs is running the computer time-series simulations that are critical for understanding and reliably operating our electrical networks. Because PECs respond quickly and are spatially dispersed, these simulations take significantly longer under current approaches.

This project aims to develop new tools at the intersection of scientific machine learning (SciML) and power systems engineering. These tools will accelerate the simulation of power systems with high penetration of PECs to ensure that we can simulate these systems in near real-time. This acceleration will be achieved by 1) using SciML to develop accurate models of aggregations of PECs in order to reduce the number of equations we need to solve and 2) using SciML to improve the mathematical techniques we use for solving these equations.

Led by Duncan Callaway.

Science Search

As scientific datasets increase in both size and complexity, the ability to label, filter, and search this deluge of information has become a laborious, time-consuming, and sometimes impossible task without the help of automated tools. To overcome this challenge, researchers at Berkeley Lab are developing innovative machine learning tools to pull contextual information from scientific datasets and automatically generate metadata tags for each file. Scientists can then search these files via Science Search, a web-based search engine for scientific data.

Led by Katie Antypas.


Experimental science is evolving. With the advent of new technology, scientific facilities are collecting data at increasing rates and higher resolution. But making sense of this data is becoming a major bottleneck. New mathematics and algorithms are needed to extract useful information from these experiments. To address these growing needs, the Center for Advanced Mathematics for Energy Research Applications (CAMERA) is working with scientists across disciplines to develop fundamental mathematics and algorithms, delivered as data analysis software that can accelerate scientific discovery. 

Led by James Sethian.

MetaBio IDS

The overarching objective of this interdisciplinary science project is to leverage new theory and observations in land, atmosphere, and space-based research to accurately partition global carbon fluxes between terrestrial ecosystems and the atmosphere at high spatial and temporal resolution. Machine learning methods, in particular simple and deep neural networks and generalized additive models, have proven to be powerful tools for doing so. This project uses ML tools both to diagnose biases in global land surface models and to derive new information from time series of carbon fluxes between ecosystems and the atmosphere provided by distributed sensing networks such as AmeriFlux. Doing so both provides novel model diagnostics and amplifies the impact and utility of DOE investments in observational platforms.

Led by Trevor Keenan.

Route Choice Behavior at Urban Scale

Knowing how individuals move between places on the urban scale is critical to infrastructure and transportation systems planning. However, current route-choice models are stymied by the messy host of human factors that play into individual routing decisions. 

Using machine learning techniques and location data generated as a byproduct of smartphone application use, this project aims to better understand how individuals organize their travel plans into a set of routes and how similar behavior patterns emerge among distinct individual choices. This technique has the potential to inform demand management strategies that target individual users while generating large-scale estimates that can be used in city-wide traffic planning.

Led by Marta C. González.


Deep Learning and Satellite Imagery to Estimate Air Quality Impact at Scale

Deep Learning and Satellite Imagery to Estimate Air Quality Impact at Scale, or DeepAir, uses deep learning algorithms to analyze satellite images combined with traffic information from cell phones and data already being collected by environmental sensors to improve air quality predictions. Scientists already use sophisticated models that consider factors such as wind speed, pressure, precipitation, and temperature to make predictions about pollution levels. DeepAir uses an array of distributed, existing data sources, including mobile phones, to help inventory man-made pollutants (such as vehicle exhaust and power plant emissions) as they actually enter the environment.

The resulting analysis aims to ultimately inform the design of more efficient and more timely interventions, such as the San Francisco Bay Area's “Spare the Air” days.

Led by Marta C. González.


Reconstructing the trajectories of charged particles from a collision event as they fly through a High Energy Physics (HEP) detector is a combinatorially difficult pattern recognition problem. Exa.TrkX is a collaboration of data scientists and computational physicists who are developing graph neural network models aimed at reconstructing millions of particle trajectories per second from petabytes of raw data produced by the next generation of particle tracking detectors at the energy and intensity frontiers. Exa.TrkX is also exploring the scaling of distributed training of graph neural networks on U.S. Department of Energy pre-exascale systems and the deployment of graph neural network models with microsecond latencies on field-programmable gate array-based (FPGA-based) real-time processing systems.

Led by Paolo Calafiura.

AR1K: Engineering Agriculture through Machine Learning in BioEPIC

In an effort to revolutionize agriculture and create sustainable farming practices that benefit both the environment and farms, researchers from Berkeley Lab, the University of Arkansas, and Glennoe Farms are bringing together molecular biology, biogeochemistry, environmental sensing technologies, and machine learning and applying them to soil research. This project aims to reduce the need for chemical fertilizers and enhance soil carbon uptake to improve the long-term viability of land and increase crop yields.  

Led by Ben Brown.


The high data-throughput of scientific instruments has made image recognition one of the most challenging problems in scientific research today. Supported by a U.S. DOE Early Career Award, Image across Domains, Experiments, Algorithms and Learning (IDEAL) focuses on computer vision and machine learning algorithms and software to enable timely interpretation of experimental data recorded as 2D or multispectral images. 

Led by Daniela Ushizima.

Data Analytics for Commercial Buildings

State-of-the-art analytics software and modeling tools can provide valuable insights into efficiency opportunities. However, prior research has shown that key barriers include relatively limited data sources (smart meter and weather data being the most common in commercial tools) and reliance on user-provided inputs, for which default values may be the fallback. There is great opportunity to apply techniques based on multi-stream data fusion and machine learning to overcome these challenges.

Led by Jessica Granderson.


As supercomputers become ever more capable in their march toward exascale levels of performance, scientists can run increasingly detailed and accurate simulations to study problems ranging from cleaner combustion to the nature of the universe. The challenge is that these powerful simulations are “computationally expensive,” consuming 10 to 50 million CPU hours for a single simulation. The ExaLearn project aims to develop new tools to help scientists overcome this challenge by applying machine learning to very large experimental datasets and simulations. 

Led by Peter Nugent.

Feedstock to Function

The goal of this project is to improve bio-based product and fuel development through adaptive technoeconomic and performance modeling. Toward this end we are developing a comprehensive Feedstock to Function software tool (F2FT) that harnesses the power of machine learning to predict properties of high-potential molecules (fuels, fuel co-products, and other bioproducts) derived from biomass. This tool can also be used to evaluate the cost, benefits, and risk of promising biobased molecules or biofuels to enable faster, less expensive bioprocess optimization and scale-up. 

Led by Vi Rapp.

Machine Learning for High Energy Physics

This project develops both simulation-based and simulation-independent deep learning techniques for high energy physics. In particular, we are finding machine learning solutions to make the best use of physics-based simulations for inference. Many of these physics-based simulations are too slow, so we are also developing deep learning solutions that use generative models to augment or supplant them.

In parallel, we are developing simulation-independent techniques to broaden our analysis sensitivity.  These methods are less-than-supervised and include anomaly detection to search for new particles at the Large Hadron Collider. 

Led by Benjamin Nachman.


Population growth, changes in land use, climate change, and extreme weather are a few factors threatening not only freshwater supplies but all the other systems that rely on watersheds, including hydropower and agriculture.

To help meet these challenges, the DOE BER-funded ExaSheds project seeks to fundamentally change how watershed function is understood and predicted. Combining leadership-class computers, big data, and machine learning with learning-assisted physics-based simulation tools, ExaSheds scientists are working to provide a full treatment of water flow and biogeochemical reactive transport at watershed to river basin scales.

Led by Carl Steefel.

CIGAR: Cybersecurity via Inverter-Grid Automatic Reconfiguration

The U.S. power distribution grid connects thousands of power plants to hundreds of millions of electricity customers across the country. In today's highly connected world, it is unrealistic to assume that energy delivery systems are isolated or immune from online threats; thus, as the grid is modernized, new features must be deployed to protect against cyberattacks. Berkeley Lab’s CIGAR project is developing and testing machine learning algorithms to counteract cyber-physical attacks that can compromise multiple systems in the electric grid. Funded by the Cybersecurity for Energy Delivery Systems program of the DOE Office of Cybersecurity, Energy Security, and Emergency Response (CESER).

Led by Sean Peisert and Dan Arnold.


Networks are the essential links connecting science collaborations around the world. But as these collaborations grow and science experiments generate ever more data, meeting these needs with upgraded hardware alone, like routers or optical fiber, can get expensive. That’s why the Deep and Autonomously Performing High-Speed Networks (DAPHNE) project is also looking into applying artificial intelligence software tools to design and manage distributed network architectures. These tools could effectively improve data transfers, guarantee high throughput, and advance traffic engineering by understanding and better predicting network traffic flows.

Led by Mariam Kiran.

Joint Social Sequence Analysis to Predict Travel Behavior

The analysis of categorical and longitudinal time series, called sequences, has great value for social science applications to study the span of life trajectories, careers, decision points, and family structure. The goal of this project is to investigate life-long lifecycle trajectory dynamics based on demographic characteristics, education, and other lifestyle variables. By analyzing entire lifelong sequences, it is possible to discover representative patterns from the overall life trajectory of a given individual’s characteristics and the pathway through which one arrives at a given state, travel decision, or mobility behavior. In contrast to traditional big-data approaches that aim to solve the "largeness" of numerical data, we aim to tackle the "largeness" that arises from data types, dimensionality, and heterogeneity in data quality that are common in categorical social sequences. This work has resulted in one IEEE conference publication and one journal article published by the Association for Computing Machinery.

Led by Ling Jin, Anna Spurlock, and Annika Todd.

Statistical Mechanics for Interpretable Learning Algorithms

Machine learning has the potential to revolutionize scientific discovery, but it also has some limitations. One of them is interpretability. As machine learning networks grow larger and more complex, understanding how the networks behave and how they reach the results is difficult, if not impossible. This project is using statistical mechanics to interpret how popular machine learning algorithms behave, give users more control over these systems, and enable them to reach the results faster. As a proof of concept, this project is working closely with Berkeley Lab’s Distributed Acoustic Sensing project.

Another machine learning challenge is that data is the only guide for these tools. One way to improve interpretability for scientific applications is to integrate the laws of physics into the learning process. This project will also explore machine learning algorithms informed by physics.

Led by John Wu, Michael Mahoney, and Jonathan Ajo-Franklin.

Sensor Data Integration

Although sensor and meter data are becoming more available in buildings, their application is limited. This project employs advanced data analytics and inverse modeling techniques, integrating the sensor and meter data with physics-based models to evaluate and improve building energy efficiency and demand flexibility in support of grid-interactive efficient buildings (GEB). One barrier associated with sensor data mining is privacy. To address this concern, we used a generative adversarial network (GAN) to anonymize smart meter data. A GAN is a machine learning technique that can recover an unknown probability distribution purely from data. Using a GAN, we can remove privacy-sensitive information while capturing the key statistics of the original data. This could motivate data owners to share their data. We also applied inverse modeling and parameter identification techniques to extract the thermal dynamics of residential buildings from a large smart-thermostat dataset; these data are used to estimate the demand flexibility potential of the U.S. residential sector.

Led by Tianzhen Hong.
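
As an illustration of what inverse modeling and parameter identification from thermostat data can look like, the sketch below fits a first-order thermal model to synthetic data with least squares. Everything here is an assumption made for illustration: the single-zone model T[k+1] = T[k] + a (T_out[k] - T[k]) + b q[k], the synthetic bang-bang thermostat, and all parameter values. It is not the project's actual model or dataset.

```python
import numpy as np

def simulate_thermostat_data(n_steps=2000, dt_hours=1.0 / 12.0, rc_hours=10.0,
                             heat_gain=0.3, noise_std=0.005, seed=0):
    """Generate synthetic indoor temperatures for a hypothetical home.

    a_true = dt / (R*C) is the envelope loss coefficient per step;
    heat_gain is the temperature rise per step while the heater runs.
    """
    rng = np.random.default_rng(seed)
    a_true = dt_hours / rc_hours
    hours = np.arange(n_steps + 1) * dt_hours
    t_out = 5.0 + 5.0 * np.sin(2.0 * np.pi * hours / 24.0)  # outdoor temp
    t_in = np.empty(n_steps + 1)
    heater = np.zeros(n_steps)
    t_in[0] = 20.0
    for k in range(n_steps):
        heater[k] = 1.0 if t_in[k] < 20.0 else 0.0  # bang-bang thermostat
        t_in[k + 1] = (t_in[k] + a_true * (t_out[k] - t_in[k])
                       + heat_gain * heater[k])
    t_obs = t_in + rng.normal(0.0, noise_std, size=n_steps + 1)  # sensor noise
    return t_obs, t_out[:-1], heater, a_true

def identify_parameters(t_obs, t_out, heater):
    """Estimate (a, b) by regressing temperature changes on the two drivers."""
    delta = np.diff(t_obs)
    design = np.column_stack([t_out - t_obs[:-1], heater])
    (a_hat, b_hat), *_ = np.linalg.lstsq(design, delta, rcond=None)
    return a_hat, b_hat
```

With the estimates in hand, the thermal time constant follows as `dt_hours / a_hat`. A quantity like this indicates how long a home can coast without heating, which is one ingredient in estimating demand flexibility.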

AlphaBuilding: Machine Learning for Advanced Building Controls

Modern buildings are becoming increasingly complex to manage and control. Buildings are a significant energy consumer and carbon emitter, and people spend 90 percent of their time indoors. A comfortable, productive, and healthy indoor environment is crucial for occupants’ well-being. Conventional building controls such as schedule-based setpoint tracking fail to optimize the multiple objectives of building operation. Model-based control techniques such as model predictive control can achieve better performance but are hard to implement and scale up. We applied reinforcement learning to optimize building control, with the goal of reducing energy consumption, enhancing demand flexibility, curtailing carbon emissions, and improving occupants’ well-being. A two-stage training process is proposed: imitation learning pre-trains the controller on the state-of-the-art building industry control standard, and fine-tuning then enhances performance through interaction with the environment in real buildings. Physics-informed learning, which introduces key building thermal dynamic parameters, is also being explored.

Led by Tianzhen Hong.

Combining Data-driven and Science-based Generative Models

This project investigates the many connections between data-driven and science-driven generative models. When do scientists use physical models to create synthetic data for science applications? When do we supplement them with data-driven machine learning models? Conversely, can researchers use physical models to improve on the current data-driven generative models in machine learning? Recently this project took an important step towards integrating deep learning and numerical simulations by developing FlowPM, a new distributed N-body cosmological simulation code created in Mesh TensorFlow. Funded by a Lab Directed Research and Development (LDRD) grant.

Led by Uros Seljak.

Accelerator Advancements Through Machine Learning

The Advanced Light Source is a third-generation synchrotron light source offering high-brightness, ultra-stable x-rays to experiments at over three dozen beamlines. The stability of this operating DOE user facility is now being further improved through the application of machine learning. Neural networks have for the first time been successfully trained to allow for a novel feed-forward approach that increases source size stability by up to an order of magnitude compared to conventional physics model-based approaches. Training such networks takes substantially less dedicated machine time than the model calibration measurements previously required. In addition, deep learning is being studied to accelerate and improve design algorithms for optimizing future synchrotron lattices. The ultra-high brightness ALS-U storage ring lattice presently under design is serving as an especially relevant test case. Jointly funded by DOE BES (ADRP) and ASCR Programs.

Led by Simon C. Leemann.

Interactive Machine Learning for Tomogram Segmentation and Annotation

Three-dimensional (3D) cellular bioimaging can help us understand how living systems respond to genes, regulation, and environment in time and space. However, interpreting these 3D images is a time-consuming, manually intensive process that can take up to three months. This project uses machine learning techniques to speed up manual segmentation of 3D cryo-electron tomography with the goal of reducing image interpretation time to around one hour, about the same amount of time it takes to capture a multi-gigabyte 3D image.

Funded by a Lab Directed Research and Development (LDRD) grant. 

Led by Nicholas K. Sauter.

The Chemical Universe through the Eyes of Generative Adversarial Neural Networks

This project is developing generative machine learning models that can discover new scientific knowledge about molecular interactions and structure-function relationships in chemical sciences. The aim is to create a deep learning network that can not only predict properties from structural information but also tackle the “inverse problem,” that is, deducing structural information from properties. To demonstrate the power of the neural network, we focus on bond breaking in mass spectrometry, combining experimental data with HPC computational chemistry data. Funded by a Lab Directed Research and Development (LDRD) grant.

Led by Wibe Albert de Jong.

Occupancy-Responsive Model Predictive Control at Room, Building, and District Levels

This project is focused on developing, testing, and demonstrating an open source computational framework that implements model predictive control (MPC) at three scales - room, building, and district (a group of buildings) - to optimize building operation and thus reduce energy use and improve occupant comfort. An accurate prediction of internal heat gains and occupants’ thermal demands is the prerequisite for developing and implementing MPC. Machine learning techniques are used to infer occupant count from WiFi data, recognize electricity consumption patterns, and predict plug-load and internal heat gains. 

Machine learning and occupant modeling led by Tianzhen Hong. Project led by Mary Ann Piette.

Deep Learning for Science

Deep learning tools are increasingly being adopted by the scientific community because they help scientists address a number of data-intensive analytics problems, such as identifying and analyzing extreme weather events and enabling precise measurements of the parameters that describe dark energy. To support these efforts, Berkeley Lab’s DL4SCI initiative is focused on three key computing challenges - handling complex datasets, developing interpretable methods, and improving performance and scaling - across multiple science areas, including cosmology, electron microscopy, and nuclear physics. Funded by a Lab Directed Research and Development (LDRD) grant.