Datasets

As a Department of Energy National Laboratory, Berkeley Lab hosts many publicly available scientific datasets. This list features a selection of datasets of potential interest for machine learning applications. Select the dataset name to learn more.

The CaloGAN dataset has been used to develop deep generative models to accelerate slow simulations of particle interactions with dense material. Such simulations are critical for particle and nuclear physics experiments with thick calorimeters. In contrast to many industrial datasets for generative models, these image data are sparse and have irregular pixel sizes. These features pose a unique challenge that has sparked extensive study of generative models for related applications. Contact: bpnachman@lbl.gov

The latest CosmoFlow dataset includes around 10,000 cosmological N-body dark matter simulations. The simulations are run using MUSIC to generate the initial conditions, and are evolved with pyCOLA, a multithreaded Python/Cython N-body code. The output of these simulations is then binned into a 3D histogram of particle counts in a cube of size 512x512x512, which is sampled at 4 different redshifts. More details on the process of generating these datasets can be found in the original CosmoFlow paper.
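The binning step described above can be illustrated with a minimal sketch. This is not the CosmoFlow pipeline itself: the function name `bin_particles`, the box length, the random particle positions, and the small 16x16x16 grid (used in place of 512x512x512 to keep the example light) are all assumptions made for illustration.

```python
import numpy as np


def bin_particles(positions, box_size, n_bins):
    """Bin particle positions of shape (N, 3) into an n_bins^3 grid of counts.

    Hypothetical helper for illustration; not part of MUSIC or pyCOLA.
    """
    # Use the same evenly spaced bin edges along each of the three axes.
    edges = [np.linspace(0.0, box_size, n_bins + 1)] * 3
    counts, _ = np.histogramdd(positions, bins=edges)
    return counts.astype(np.int32)


# Stand-in data: uniformly random positions instead of real N-body output.
rng = np.random.default_rng(0)
box_size = 512.0  # assumed box length, arbitrary units
positions = rng.uniform(0.0, box_size, size=(10_000, 3))

# A 16^3 grid here; the dataset described above uses 512^3.
grid = bin_particles(positions, box_size, n_bins=16)
print(grid.shape, grid.sum())
```

In a real workflow one such grid would be produced per redshift snapshot, giving four 3D histograms per simulation.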

Simulation-based inference is a form of likelihood-free inference in which one has access to a high-fidelity simulation but the probability density itself is analytically intractable. These simulated particle collisions offer a testing ground for likelihood-free inference techniques where the data are high-dimensional, variable-length sets of particle momenta. Contact: bpnachman@lbl.gov

This dataset is from a community challenge to develop anomaly detection techniques for the Large Hadron Collider. Participants are given a list of hundreds of particles per simulated collision and are asked to determine whether new particles are present in the dataset. Unlike many industrial anomaly detection applications, anomalies in this dataset would manifest as over-densities rather than out-of-sample examples. Furthermore, the data are not images (they are lists of particle momenta) and the number of particles per collision is variable (zero-padded to provide a fixed-length list). Contact: bpnachman@lbl.gov
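Zero-padding variable-length particle lists to a fixed length can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual loader: the helper `pad_events`, the three-features-per-particle layout, and the toy values are hypothetical.

```python
import numpy as np


def pad_events(events, max_particles, n_features=3):
    """Zero-pad variable-length particle lists to a fixed length.

    Hypothetical helper: returns the padded array and a boolean mask
    marking which slots hold real particles rather than padding.
    """
    padded = np.zeros((len(events), max_particles, n_features), dtype=np.float32)
    mask = np.zeros((len(events), max_particles), dtype=bool)
    for i, particles in enumerate(events):
        particles = np.asarray(particles, dtype=np.float32).reshape(-1, n_features)
        n = min(len(particles), max_particles)  # truncate overly long events
        padded[i, :n] = particles[:n]
        mask[i, :n] = True
    return padded, mask


# Toy events with 2, 1, and 0 particles; three assumed features per particle.
events = [
    [[120.5, 0.1, 1.2], [45.0, -0.7, 2.9]],
    [[300.2, 1.5, -0.4]],
    [],
]
padded, mask = pad_events(events, max_particles=4)
print(padded.shape)      # fixed-length array, one row per event
print(mask.sum(axis=1))  # number of real particles in each event
```

Keeping a mask alongside the padded array lets downstream models distinguish real particles from padding, which matters when a feature value of zero could also be physical.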

AmeriFlux is a network of PI-managed sites measuring ecosystem CO2, water, and energy fluxes in North, Central and South America. It was established to connect research on field sites representing major climate and ecological biomes, including tundra, grasslands, savanna, crops, and conifer, deciduous, and tropical forests. As a grassroots, investigator-driven network, the AmeriFlux community has tailored instrumentation to suit each unique ecosystem. This “coalition of the willing” is diverse in its interests, use of technologies and collaborative approaches. As a result, the AmeriFlux Network continually breaks new ground. Contact: ameriflux-support@lbl.gov

Ambient environmental radiological (gamma-ray and neutron) data alongside a suite of contextual sensors (video, lidar, hyperspectral). Contact: bjquiter@lbl.gov

Fruitfly functional genomics data repository. Contacts: BPBowen@lbl.gov, ORuebel@lbl.gov

The U.S. Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a new data archive for Earth and environmental science data. ESS-DIVE is funded by the Data Management program within the Climate and Environmental Science Division under the DOE’s Office of Biological and Environmental Research program (BER), and is maintained by the Lawrence Berkeley National Laboratory. ESS-DIVE will archive and publicly share data obtained from observational, experimental, and modeling research that is funded by the DOE’s Office of Science under its Subsurface Biogeochemical Research (SBR) and Terrestrial Ecosystem Science (TES) programs within the Environmental Systems Science (ESS) activity. Contact: ess-dive-support@lbl.gov

Today, eddy covariance measurements of carbon dioxide and water vapor exchange are being made routinely on all continents. The flux measurement sites are linked across a confederation of regional networks in North, Central and South America, Europe, Asia, Africa, and Australia, forming a global network called FLUXNET. This global network includes more than eight hundred active and historic flux measurement sites, dispersed across most of the world’s climate space and representative biomes. This global FLUXNET dataset was built using data through 2006. Contact: fluxdata-support@lbl.gov

By computing properties of all known materials, the Materials Project aims to remove guesswork from materials design in a variety of applications. Experimental research can be targeted to the most promising compounds from computational data sets. Researchers will be able to data-mine scientific trends in materials properties. By providing materials researchers with the information they need to design better, the Materials Project aims to accelerate innovation in materials research. Contact: feedback@materialsproject.org

Mass spectrometry imaging (MSI) is widely applied to image complex samples for applications spanning health, microbial ecology, and high-throughput screening of high-density arrays. MSI has emerged as a technique suited to resolving metabolism within complex cellular systems, where understanding the spatial variation of metabolism is vital for making a transformative impact on science. OpenMSI provides a web-based gateway for the management and storage of MSI data, the visualization of the hyper-dimensional contents of the data, and statistical analysis. Contacts: BPBowen@lbl.gov, ORuebel@lbl.gov

A number of features of our power distribution grid make it particularly vulnerable to cyber attacks. By installing micro phasor measurement units (µPMUs) in key locations in the electric distribution system and evaluating the data from them, we aim to design and implement a measurement network that can detect and report the resultant impact of cyber security attacks. The data collected by these units supports a variety of projects to determine whether refined measurement of voltage phase angles can enable advanced diagnostic, monitoring, and control methodologies in distribution systems, and to begin developing algorithms for diagnostic applications based on µPMU data. Contact: sppeisert@lbl.gov

Unfolding is the particle physics analog of deconvolution: the goal is to infer the spectra of particle momenta prior to detector distortions. Unlike traditional deconvolution with image data, these data are lists of particle momenta. A further complication is that there are many particles per collision and the number of particles is variable. Contact: bpnachman@lbl.gov

Today, eddy covariance measurements of carbon dioxide and water vapor exchange are being made routinely on all continents. The flux measurement sites are linked across a confederation of regional networks in North, Central and South America, Europe, Asia, Africa, and Australia, forming a global network called FLUXNET. This global network includes more than eight hundred active and historic flux measurement sites, dispersed across most of the world’s climate space and representative biomes. This global FLUXNET dataset was built using data through 2015. Contact: fluxdata-support@lbl.gov

Flow is a traffic control benchmarking framework that provides a suite of traffic control scenarios (benchmarks), tools for designing custom traffic scenarios, and integration with deep reinforcement learning and traffic microsimulation libraries. Flow software and datasets are open-source for public use under the MIT license. Contact: flow.berkeley@gmail.com