How to use OpenSCMDB¶
In this notebook we describe how to use the OpenSCMDB class.
This class is useful for writing data to
and reading data from a database of files.
Imports¶
import concurrent.futures
import contextlib
import itertools
import tarfile
import tempfile
import traceback
from functools import partial
from pathlib import Path
import filelock
import numpy as np
import pandas as pd
import pandas_indexing as pix
import tqdm
from pandas_openscm.db import (
DATA_BACKENDS,
INDEX_BACKENDS,
AlreadyInDBError,
EmptyDBError,
OpenSCMDB,
)
from pandas_openscm.parallelisation import ParallelOpConfig
Basics¶
The database is file based.
We simply write files of data to a directory
and keep track of where the data is using an index file.
As a result, every OpenSCMDB instance
needs to know which directory to write into.
In addition, the instance needs to know which file format, i.e. backend, to use to write and read the data and the index. The available data and index backends are below. You can, of course, write your own backend that matches the required interface.
DATA_BACKENDS.options
(('csv', pandas_openscm.db.csv.CSVDataBackend),
('feather', pandas_openscm.db.feather.FeatherDataBackend),
('in_memory', pandas_openscm.db.in_memory.InMemoryDataBackend),
('netCDF', pandas_openscm.db.netcdf.netCDFDataBackend))
INDEX_BACKENDS.options
(('csv', pandas_openscm.db.csv.CSVIndexBackend),
('feather', pandas_openscm.db.feather.FeatherIndexBackend),
('in_memory', pandas_openscm.db.in_memory.InMemoryIndexBackend),
('netCDF', pandas_openscm.db.netcdf.netCDFIndexBackend))
Knowing the directory in which to work and the back-ends to use, we can initialise a database.
db = OpenSCMDB(
db_dir=Path(tempfile.mkdtemp()),
backend_data=DATA_BACKENDS.get_instance("csv"),
backend_index=INDEX_BACKENDS.get_instance("csv"),
)
db
OpenSCMDB(backend_data=CSVDataBackend(ext='.csv'), backend_index=CSVIndexBackend(ext='.csv'), db_dir=PosixPath('/tmp/tmppxl3pdmo'), index_file_lock=<filelock._unix.UnixFileLock object at 0x792d12633e90>)
Empty databases¶
If you try and operate on an empty database, you will either get an error or nothing will happen.
# Deleting an empty database does nothing
db.delete()
# Trying to load an empty database raises an error
try:
db.load()
except EmptyDBError:
traceback.print_exc(limit=0, chain=False)
pandas_openscm.db.openscm_db.EmptyDBError: The database is empty: db=OpenSCMDB(backend_data=CSVDataBackend(ext='.csv'), backend_index=CSVIndexBackend(ext='.csv'), db_dir=PosixPath('/tmp/tmppxl3pdmo'), index_file_lock=<filelock._unix.UnixFileLock object at 0x792d12633e90>)
Saving data¶
If we have some data, we can save it to the database. In theory, this will work with any data type. However, it is optimised to timeseries data i.e. data that has a multi-index with metadata values for each timeseries, columns which define the time axis and values for each timeseries.
df_timeseries_like = pd.DataFrame(
np.arange(12).reshape(4, 3),
columns=[2010, 2015, 2025],
index=pd.MultiIndex.from_tuples(
[
("scenario_a", "climate_model_a", "Temperature", "K"),
("scenario_b", "climate_model_a", "Temperature", "K"),
("scenario_b", "climate_model_b", "Temperature", "K"),
("scenario_b", "climate_model_b", "Ocean Heat Uptake", "J"),
],
names=["scenario", "climate_model", "variable", "unit"],
),
)
df_timeseries_like
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| scenario | climate_model | variable | unit | |||
| scenario_a | climate_model_a | Temperature | K | 0 | 1 | 2 |
| scenario_b | climate_model_a | Temperature | K | 3 | 4 | 5 |
| climate_model_b | Temperature | K | 6 | 7 | 8 | |
| Ocean Heat Uptake | J | 9 | 10 | 11 |
# Saving is then trivial
db.save(df_timeseries_like)
# Loading metadata is trivial.
# A MultiIndex is returned by default.
metadata = db.load_metadata()
metadata
MultiIndex([('climate_model_a', 'scenario_a', 'K', 'Temperature'),
('climate_model_a', 'scenario_b', 'K', 'Temperature'),
('climate_model_b', 'scenario_b', 'K', 'Temperature'),
('climate_model_b', 'scenario_b', 'J', 'Ocean Heat Uptake')],
names=['climate_model', 'scenario', 'unit', 'variable'])
# The MultiIndex can be trivially turned into a DataFrame if you wish
metadata.to_frame(index=False)
| climate_model | scenario | unit | variable | |
|---|---|---|---|---|
| 0 | climate_model_a | scenario_a | K | Temperature |
| 1 | climate_model_a | scenario_b | K | Temperature |
| 2 | climate_model_b | scenario_b | K | Temperature |
| 3 | climate_model_b | scenario_b | J | Ocean Heat Uptake |
Data¶
Then we can load data.
# By default, we just load everything
db.load()
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | |||
| climate_model_a | scenario_a | K | Temperature | 0 | 1 | 2 |
| scenario_b | K | Temperature | 3 | 4 | 5 | |
| climate_model_b | scenario_b | K | Temperature | 6 | 7 | 8 |
| J | Ocean Heat Uptake | 9 | 10 | 11 |
Sub-selecting¶
If we don't want all the data, we can sub-select.
# For example, this can be done using the previously loaded metadata
db.load(metadata[:2])
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | |||
| climate_model_a | scenario_a | K | Temperature | 0 | 1 | 2 |
| scenario_b | K | Temperature | 3 | 4 | 5 |
When combined with pandas-indexing, this provides particularly powerful functionality.
db.load(pix.isin(scenario="scenario_b"))
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | |||
| climate_model_a | scenario_b | K | Temperature | 3 | 4 | 5 |
| climate_model_b | scenario_b | K | Temperature | 6 | 7 | 8 |
| J | Ocean Heat Uptake | 9 | 10 | 11 |
db.load(pix.ismatch(variable="Ocean*"))
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | |||
| climate_model_b | scenario_b | J | Ocean Heat Uptake | 9 | 10 | 11 |
Deleting¶
If we wish, we can delete our database with delete.
# The database is not empty to start
db.is_empty
False
db.delete()
# Now the database is empty
db.is_empty
True
Summary so far¶
The functionality shown up to here is the key functionality. If all you need to do is basic saving and loading of data, this is all you need. If you are looking to support more complex use cases (which you probably are, otherwise you wouldn't be here), read on.
Advanced topics¶
Sharing the database¶
If you need to share a database, you can zip it and pass it to someone else.
We start by putting some data in a database.
top_level_dir = Path(tempfile.mkdtemp())
db_start = OpenSCMDB(
db_dir=top_level_dir / "start",
backend_data=DATA_BACKENDS.get_instance("csv"),
backend_index=INDEX_BACKENDS.get_instance("csv"),
)
db_start.save(df_timeseries_like)
Then we create a gzipped tar archive of our database.
gzipped = top_level_dir / "db_archive.tar.gz"
db_start.to_gzipped_tar_archive(gzipped)
PosixPath('/tmp/tmp9lqbi1y_/db_archive.tar.gz')
To demonstrate that this does not rely on the original data, we delete the original database.
db_start.delete()
We can inspect the tar file's contents.
with tarfile.open(gzipped) as tar:
print(f"{tar.getmembers()=}")
tar.getmembers()=[<TarInfo 'db' at 0x792d127bb280>, <TarInfo 'db/0.csv' at 0x792d127bb100>, <TarInfo 'db/filemap.csv' at 0x792d127bb340>, <TarInfo 'db/index.csv' at 0x792d127ba680>]
A new database can be initialised from the gzipped tar archive.
db_moved = OpenSCMDB.from_gzipped_tar_archive(
gzipped,
db_dir=top_level_dir / "moved",
)
db_moved
OpenSCMDB(backend_data=CSVDataBackend(ext='.csv'), backend_index=CSVIndexBackend(ext='.csv'), db_dir=PosixPath('/tmp/tmp9lqbi1y_/moved'), index_file_lock=<filelock._unix.UnixFileLock object at 0x792d126a8850>)
As above, we remove the archive to demonstrate that there is no reliance on it for the following operations.
gzipped.unlink()
You can then use this database like normal, but now from the new location (whether on your machine or someone else's).
db_moved.load()
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | |||
| climate_model_a | scenario_a | K | Temperature | 0 | 1 | 2 |
| scenario_b | K | Temperature | 3 | 4 | 5 | |
| climate_model_b | scenario_b | K | Temperature | 6 | 7 | 8 |
| J | Ocean Heat Uptake | 9 | 10 | 11 |
db_moved.load(pix.isin(unit="J"))
| 2010 | 2015 | 2025 | ||||
|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | |||
| climate_model_b | scenario_b | J | Ocean Heat Uptake | 9 | 10 | 11 |
We clean up the files before moving onto the next demonstration.
db_moved.delete()
Grouping data¶
For many use cases, it won't make sense to save all your data in a single file. You can control how the data is grouped while saving using the arguments shown below.
df_many_timeseries = pd.DataFrame(
np.arange(5 * 3 * 4 * 4 * 3).reshape(5 * 3 * 4 * 4, 3),
columns=[2010, 2015, 2025],
index=pd.MultiIndex.from_tuples(
[
(scenario, variant, climate_model, variable, unit)
for scenario, variant, climate_model, (variable, unit) in itertools.product(
["scenario_a", "scenario_b", "scenario_c", "scenario_d", "scenario_e"],
["high", "medium", "low"],
[
"climate_model_a",
"climate_model_b",
"climate_model_c",
"climate_model_d",
],
[
("Temperature", "K"),
("Ocean Heat Uptake", "J"),
("Effective Radiative Forcing", "W / m^2"),
("Warming rate", "K / yr"),
],
)
],
names=["scenario", "variant", "climate_model", "variable", "unit"],
),
)
df_many_timeseries
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| scenario | variant | climate_model | variable | unit | |||
| scenario_a | high | climate_model_a | Temperature | K | 0 | 1 | 2 |
| Ocean Heat Uptake | J | 3 | 4 | 5 | |||
| Effective Radiative Forcing | W / m^2 | 6 | 7 | 8 | |||
| Warming rate | K / yr | 9 | 10 | 11 | |||
| climate_model_b | Temperature | K | 12 | 13 | 14 | ||
| ... | ... | ... | ... | ... | ... | ... | ... |
| scenario_e | low | climate_model_c | Warming rate | K / yr | 705 | 706 | 707 |
| climate_model_d | Temperature | K | 708 | 709 | 710 | ||
| Ocean Heat Uptake | J | 711 | 712 | 713 | |||
| Effective Radiative Forcing | W / m^2 | 714 | 715 | 716 | |||
| Warming rate | K / yr | 717 | 718 | 719 |
240 rows × 3 columns
# For example, we might want to group by climate model and scenario on saving
db.save(df_many_timeseries, groupby=["climate_model", "scenario"])
Parallelisation and progress bars¶
We provide a variety of ways to control parallelisation during operations as well as the display of progress bars. The simplest way to control these is via the arguments shown below.
# Turn on default progress bars
# (note that these don't show up in the rendered docs,
# but will if you actually use the package in a notebook)
db.load(progress=True)
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| K | Temperature | medium | 48 | 49 | 50 | ||
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_d | scenario_e | K / yr | Warming rate | medium | 669 | 670 | 671 |
| K | Temperature | low | 708 | 709 | 710 | ||
| J | Ocean Heat Uptake | low | 711 | 712 | 713 | ||
| W / m^2 | Effective Radiative Forcing | low | 714 | 715 | 716 | ||
| K / yr | Warming rate | low | 717 | 718 | 719 |
240 rows × 3 columns
# Turn on default parallelisation
db.load(max_workers=4)
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_b | K | Temperature | high | 144 | 145 | 146 |
| J | Ocean Heat Uptake | high | 147 | 148 | 149 | ||
| W / m^2 | Effective Radiative Forcing | high | 150 | 151 | 152 | ||
| K / yr | Warming rate | high | 153 | 154 | 155 | ||
| K | Temperature | medium | 192 | 193 | 194 | ||
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_d | scenario_e | K / yr | Warming rate | medium | 669 | 670 | 671 |
| K | Temperature | low | 708 | 709 | 710 | ||
| J | Ocean Heat Uptake | low | 711 | 712 | 713 | ||
| W / m^2 | Effective Radiative Forcing | low | 714 | 715 | 716 | ||
| K / yr | Warming rate | low | 717 | 718 | 719 |
240 rows × 3 columns
# Turn on default progress bars and parallelisation
# (again, note that the progress bars don't show up in the rendered docs,
# but will if you actually use the package in a notebook)
db.load(progress=True, max_workers=4)
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| K | Temperature | medium | 48 | 49 | 50 | ||
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_d | scenario_c | K / yr | Warming rate | medium | 381 | 382 | 383 |
| K | Temperature | low | 420 | 421 | 422 | ||
| J | Ocean Heat Uptake | low | 423 | 424 | 425 | ||
| W / m^2 | Effective Radiative Forcing | low | 426 | 427 | 428 | ||
| K / yr | Warming rate | low | 429 | 430 | 431 |
240 rows × 3 columns
If you want fine-grained control over the behaviour,
then the operations support receiving ParallelOpConfig instances
to control the parallelisation and progress for different operations.
An example is shown below.
with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
loaded = db.load(
parallel_op_config=ParallelOpConfig(
progress_results=partial(
tqdm.tqdm, desc="Custom description and non-HTML bar"
),
executor=executor,
progress_parallel_submission=partial(
tqdm.tqdm, desc="Custom submission description and non-HTML bar"
),
)
)
loaded
Custom submission description and non-HTML bar: 0%| | 0/20 [00:00<?, ?it/s]
Custom submission description and non-HTML bar: 100%|██████████| 20/20 [00:00<00:00, 1448.48it/s]
Custom description and non-HTML bar: 0%| | 0/20 [00:00<?, ?it/s]
Custom description and non-HTML bar: 100%|██████████| 20/20 [00:00<00:00, 908.04it/s]
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| K | Temperature | medium | 48 | 49 | 50 | ||
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_d | scenario_e | K / yr | Warming rate | medium | 669 | 670 | 671 |
| K | Temperature | low | 708 | 709 | 710 | ||
| J | Ocean Heat Uptake | low | 711 | 712 | 713 | ||
| W / m^2 | Effective Radiative Forcing | low | 714 | 715 | 716 | ||
| K / yr | Warming rate | low | 717 | 718 | 719 |
240 rows × 3 columns
Overwriting data¶
By default, it is not possible to overwrite data that is already in the database. This ensures that you don't accidentally overwrite.
# Trying to load an empty database raises an error
try:
db.save(df_many_timeseries)
except AlreadyInDBError:
traceback.print_exc(limit=0, chain=False)
pandas_openscm.db.openscm_db.AlreadyInDBError: The following rows are already in the database:
scenario variant climate_model variable unit
0 scenario_a high climate_model_a Temperature K
1 scenario_a high climate_model_a Ocean Heat Uptake J
2 scenario_a high climate_model_a Effective Radiative Forcing W / m^2
3 scenario_a high climate_model_a Warming rate K / yr
4 scenario_a high climate_model_b Temperature K
.. ... ... ... ... ...
235 scenario_e low climate_model_c Warming rate K / yr
236 scenario_e low climate_model_d Temperature K
237 scenario_e low climate_model_d Ocean Heat Uptake J
238 scenario_e low climate_model_d Effective Radiative Forcing W / m^2
239 scenario_e low climate_model_d Warming rate K / yr
[240 rows x 5 columns]
If you are sure, you can overwrite data as shown below. As the progress bars show, this happens in two steps:
- the new data is written to disk
- the data that it overwrites is removed
# A helper to ensure our parallel bars
# show in the rendered docs.
# If you're actually using the package,
# you can just use `progress=True` for the majority of use cases.
save_pbars = dict(
parallel_op_config_save=ParallelOpConfig(partial(tqdm.tqdm, desc="File saving")),
parallel_op_config_delete=ParallelOpConfig(
partial(tqdm.tqdm, desc="File deletion")
),
parallel_op_config_rewrite=ParallelOpConfig(
partial(tqdm.tqdm, desc="File re-writing")
),
)
load_pbar = dict(
parallel_op_config=ParallelOpConfig(partial(tqdm.tqdm, desc="File loading")),
)
db.save(df_many_timeseries, allow_overwrite=True, **save_pbars)
File saving: 0%| | 0/3 [00:00<?, ?it/s]
File saving: 100%|██████████| 3/3 [00:00<00:00, 706.43it/s]
File deletion: 0%| | 0/20 [00:00<?, ?it/s]
File deletion: 100%|██████████| 20/20 [00:00<00:00, 18737.12it/s]
Overwriting works whether you are doing a full overwrite or a partial overwrite. However, if you are doing a partial overwrite, the operation can be quite expensive. The reason is that you have to load the written data and re-write the data that is not being overwritten. As a result, considering your data grouping carefully will make a big difference in performance if you expected to be doing lots of overwrites.
# Save into a single file
db.save(df_many_timeseries, allow_overwrite=True, **save_pbars)
File saving: 0%| | 0/3 [00:00<?, ?it/s]
File saving: 100%|██████████| 3/3 [00:00<00:00, 697.27it/s]
File deletion: 0%| | 0/1 [00:00<?, ?it/s]
File deletion: 100%|██████████| 1/1 [00:00<00:00, 7384.34it/s]
# Now overwrite only a selection of the data.
# This will force the existing file to be re-written
# so that the data we wish to keep is not lost,
# but we do get rid of the data we are overwriting.
db.save(df_many_timeseries.iloc[:20, :], allow_overwrite=True, **save_pbars)
/home/docs/.asdf/installs/python/3.11.14/lib/python3.11/site-packages/pandas_openscm/db/openscm_db.py:740: UserWarning: Overwriting the data will require re-writing. This may be slow. If that is an issue, the way to solve it is to update your workflow to ensure that you are not overwriting data or are only overwriting entire files. warnings.warn(msg)
File re-writing: 0%| | 0/1 [00:00<?, ?it/s]
File re-writing: 100%|██████████| 1/1 [00:00<00:00, 199.85it/s]
File saving: 0%| | 0/3 [00:00<?, ?it/s]
File saving: 100%|██████████| 3/3 [00:00<00:00, 721.83it/s]
File deletion: 0%| | 0/1 [00:00<?, ?it/s]
File deletion: 100%|██████████| 1/1 [00:00<00:00, 7884.03it/s]
# You can disable the warning as shown below
db.save(
df_many_timeseries.iloc[-20:, :],
allow_overwrite=True,
warn_on_partial_overwrite=False,
**save_pbars,
)
File re-writing: 0%| | 0/1 [00:00<?, ?it/s]
File re-writing: 100%|██████████| 1/1 [00:00<00:00, 185.93it/s]
File saving: 0%| | 0/3 [00:00<?, ?it/s]
File saving: 100%|██████████| 3/3 [00:00<00:00, 782.37it/s]
File deletion: 0%| | 0/1 [00:00<?, ?it/s]
File deletion: 100%|██████████| 1/1 [00:00<00:00, 8701.88it/s]
More on grouping¶
Now that you have seen some more features, it is easier to explain why grouping is so important. By default, when we save, we save the data as a single file.
# Before continuing, clear out the database
db.delete()
# The progress bar shows that three files are being saved, these are:
# 1. the data (as a single file)
# 2. the index
# 3. the file map
db.save(df_many_timeseries, **save_pbars)
File saving: 0%| | 0/3 [00:00<?, ?it/s]
File saving: 100%|██████████| 3/3 [00:00<00:00, 753.83it/s]
If we save the data like this, then we have to read all of the data every time, even if we only want to get a subset of it.
# All the data is in one file, which is read in its entirety,
# even though we only want some of the data.
db.load(pix.isin(variant="low"), **load_pbar)
File loading: 0%| | 0/1 [00:00<?, ?it/s]
File loading: 100%|██████████| 1/1 [00:00<00:00, 659.27it/s]
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | low | 96 | 97 | 98 |
| J | Ocean Heat Uptake | low | 99 | 100 | 101 | ||
| W / m^2 | Effective Radiative Forcing | low | 102 | 103 | 104 | ||
| K / yr | Warming rate | low | 105 | 106 | 107 | ||
| climate_model_b | scenario_a | K | Temperature | low | 108 | 109 | 110 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | low | 705 | 706 | 707 |
| climate_model_d | scenario_e | K | Temperature | low | 708 | 709 | 710 |
| J | Ocean Heat Uptake | low | 711 | 712 | 713 | ||
| W / m^2 | Effective Radiative Forcing | low | 714 | 715 | 716 | ||
| K / yr | Warming rate | low | 717 | 718 | 719 |
80 rows × 3 columns
If you know how you are likely to want to access your data, you can give yourself much more fine-grained control.
# For example, if we know that the variants are our group of interest,
# we should save the database like that.
# As the progress bars show, the data is now grouped into more files
# (3, in fact, with the other two save operations being for the index and file map).
db.save(df_many_timeseries, groupby=["variant"], allow_overwrite=True, **save_pbars)
File saving: 0%| | 0/5 [00:00<?, ?it/s]
File saving: 100%|██████████| 5/5 [00:00<00:00, 846.00it/s]
File deletion: 0%| | 0/1 [00:00<?, ?it/s]
File deletion: 100%|██████████| 1/1 [00:00<00:00, 5433.04it/s]
# If we only want to load one variant, now we only load data for that variant,
# not all of the data.
db.load(pix.isin(variant="low"), **load_pbar)
File loading: 0%| | 0/1 [00:00<?, ?it/s]
File loading: 100%|██████████| 1/1 [00:00<00:00, 572.91it/s]
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | low | 96 | 97 | 98 |
| J | Ocean Heat Uptake | low | 99 | 100 | 101 | ||
| W / m^2 | Effective Radiative Forcing | low | 102 | 103 | 104 | ||
| K / yr | Warming rate | low | 105 | 106 | 107 | ||
| climate_model_b | scenario_a | K | Temperature | low | 108 | 109 | 110 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | low | 705 | 706 | 707 |
| climate_model_d | scenario_e | K | Temperature | low | 708 | 709 | 710 |
| J | Ocean Heat Uptake | low | 711 | 712 | 713 | ||
| W / m^2 | Effective Radiative Forcing | low | 714 | 715 | 716 | ||
| K / yr | Warming rate | low | 717 | 718 | 719 |
80 rows × 3 columns
# This is clearer if we load data for two variants, but not all.
# The progress bars show that two files are being loaded.
db.load(pix.isin(variant=["low", "high"]), **load_pbar)
File loading: 0%| | 0/2 [00:00<?, ?it/s]
File loading: 100%|██████████| 2/2 [00:00<00:00, 748.92it/s]
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| climate_model_b | scenario_a | K | Temperature | high | 12 | 13 | 14 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | low | 705 | 706 | 707 |
| climate_model_d | scenario_e | K | Temperature | low | 708 | 709 | 710 |
| J | Ocean Heat Uptake | low | 711 | 712 | 713 | ||
| W / m^2 | Effective Radiative Forcing | low | 714 | 715 | 716 | ||
| K / yr | Warming rate | low | 717 | 718 | 719 |
160 rows × 3 columns
# If we instead want to load all the temperature data for a given climate model,
# because of the way the data is grouped,
# we have to load all the files
# (you can see this by looking at the progress bar output).
# This would be a good example where,
# if loading a single variable-climate model combination
# at a time is of interest,
# the data should be saved with a grouping that supports
# that access pattern more directly.
db.load(pix.isin(climate_model="climate_model_c", variable="Temperature"), **load_pbar)
File loading: 0%| | 0/3 [00:00<?, ?it/s]
File loading: 100%|██████████| 3/3 [00:00<00:00, 931.38it/s]
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_c | scenario_a | K | Temperature | high | 24 | 25 | 26 |
| scenario_b | K | Temperature | high | 168 | 169 | 170 | |
| scenario_c | K | Temperature | high | 312 | 313 | 314 | |
| scenario_d | K | Temperature | high | 456 | 457 | 458 | |
| scenario_e | K | Temperature | high | 600 | 601 | 602 | |
| scenario_a | K | Temperature | low | 120 | 121 | 122 | |
| scenario_b | K | Temperature | low | 264 | 265 | 266 | |
| scenario_c | K | Temperature | low | 408 | 409 | 410 | |
| scenario_d | K | Temperature | low | 552 | 553 | 554 | |
| scenario_e | K | Temperature | low | 696 | 697 | 698 | |
| scenario_a | K | Temperature | medium | 72 | 73 | 74 | |
| scenario_b | K | Temperature | medium | 216 | 217 | 218 | |
| scenario_c | K | Temperature | medium | 360 | 361 | 362 | |
| scenario_d | K | Temperature | medium | 504 | 505 | 506 | |
| scenario_e | K | Temperature | medium | 648 | 649 | 650 |
The tradeoff with grouping the data can be that reading more files takes more time (although sometimes even then not thanks to parallelisation). It's up to you to make the grouping choice that suits your access pattern.
Locking¶
The database supports locking.
This ensures that e.g. only one process can write to the database.
Locking is handled via the index_file_lock attribute.
You can set this at initialisation if you wish.
If it is not supplied, a default value is used.
The lock is an attribute of the instance. In other words, every database instance has its own lock. We use filelock by default. This is a powerful locking library, but it is important to understand its model.
Of most importance is the fact that the locks are recursive. This means that repeated calls to acquire the same lock will not block.
with db.index_file_lock.acquire():
# Even though we just acquired the lock,
# we can acquire it again because of the recursive behaviour.
db.index_file_lock.acquire()
However, if we now try and acquire a lock for a database which is working on the same directory, we will find that we can't.
# Get another db instance
# (the issue doesn't arise with the current instance because of the recursive locking)
db_other_view = OpenSCMDB(
db_dir=db.db_dir, # Pointing at the same directory
backend_data=DATA_BACKENDS.get_instance("csv"),
backend_index=INDEX_BACKENDS.get_instance("csv"),
)
# The default timeout is inifinity,
# so the below would just block forever by default.
# We make this more sensible here.
db_other_view.index_file_lock.timeout = 0.5
# We can't acquire the lock from this other db instance
try:
db_other_view.index_file_lock.acquire()
except filelock.Timeout:
traceback.print_exc(limit=0, chain=False)
filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.
The locking is helpful to avoid putting the database in a corrupt state. By default, the lock is acquired for all operations. You can override this by providing a different lock to the method.
# The lock is currently being held
db.index_file_lock.is_locked
True
# Despite this, the recursive locks
# mean we can still load data with the db instance
# that is holding the lock
db.load()
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| climate_model_b | scenario_a | K | Temperature | high | 12 | 13 | 14 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | medium | 657 | 658 | 659 |
| climate_model_d | scenario_e | K | Temperature | medium | 660 | 661 | 662 |
| J | Ocean Heat Uptake | medium | 663 | 664 | 665 | ||
| W / m^2 | Effective Radiative Forcing | medium | 666 | 667 | 668 | ||
| K / yr | Warming rate | medium | 669 | 670 | 671 |
240 rows × 3 columns
# However, we cannot load data from the other instance
try:
db_other_view.load()
except filelock.Timeout:
traceback.print_exc(limit=0, chain=False)
filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.
# Unless we override the lock
db_other_view.load(
# Providing a nullcontext bypasses the lock.
# Obviously, only do this if you know what you're doing.
index_file_lock=contextlib.nullcontext()
)
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| climate_model_b | scenario_a | K | Temperature | high | 12 | 13 | 14 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | medium | 657 | 658 | 659 |
| climate_model_d | scenario_e | K | Temperature | medium | 660 | 661 | 662 |
| J | Ocean Heat Uptake | medium | 663 | 664 | 665 | ||
| W / m^2 | Effective Radiative Forcing | medium | 666 | 667 | 668 | ||
| K / yr | Warming rate | medium | 669 | 670 | 671 |
240 rows × 3 columns
For more information on the different ways to acquire the lock, see the filelock docs. If you are interested in locking, we really recommend reading the docs. The locking is very powerful, but quite subtle so you really have to understand the implementation to get the most out of it.
# Release the lock fully before moving on
# (if you want to know why this is needed,
# read the filelock docs)
db.index_file_lock.release()
db.index_file_lock.is_locked
False
OpenSCMDBReader¶
If you just want to read data,
then we provide a class optimised to this use case, OpenSCMDBReader.
This holds the index in memory,
so it does not need to be read from disk every time we wish to load data.
The easiest way to create a reader is from an existing OpenSCMDB instance.
reader = db.create_reader()
# the index is in memory
reader.db_index
| file_id | |||||
|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |
| climate_model_a | scenario_a | K | Temperature | high | 1 |
| J | Ocean Heat Uptake | high | 1 | ||
| W / m^2 | Effective Radiative Forcing | high | 1 | ||
| K / yr | Warming rate | high | 1 | ||
| climate_model_b | scenario_a | K | Temperature | high | 1 |
| ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | medium | 3 |
| climate_model_d | scenario_e | K | Temperature | medium | 3 |
| J | Ocean Heat Uptake | medium | 3 | ||
| W / m^2 | Effective Radiative Forcing | medium | 3 | ||
| K / yr | Warming rate | medium | 3 |
240 rows × 1 columns
The reader's load method is basically identical to that of OpenSCMDB,
it just has to do less work because it doesn't need to read the index.
# Having the index in memory can make data reading faster
# (in the case of a large index).
reader.load(pix.isin(unit="K", variant="medium") & pix.ismatch(scenario="*_c"))
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_c | K | Temperature | medium | 336 | 337 | 338 |
| climate_model_b | scenario_c | K | Temperature | medium | 348 | 349 | 350 |
| climate_model_c | scenario_c | K | Temperature | medium | 360 | 361 | 362 |
| climate_model_d | scenario_c | K | Temperature | medium | 372 | 373 | 374 |
By default, the reader is given its own lock. This allows us to use the reader to lock the database.
reader.lock.acquire()
<filelock._api.AcquireReturnProxy at 0x792d124471d0>
# Now we can't perform operations with any other view into the db,
# so we can gurantee our reader's safety.
try:
db_other_view.load()
except filelock.Timeout:
traceback.print_exc(limit=0, chain=False)
filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.
# If we release the lock, we can load again
reader.lock.release()
db_other_view.load()
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| climate_model_b | scenario_a | K | Temperature | high | 12 | 13 | 14 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | medium | 657 | 658 | 659 |
| climate_model_d | scenario_e | K | Temperature | medium | 660 | 661 | 662 |
| J | Ocean Heat Uptake | medium | 663 | 664 | 665 | ||
| W / m^2 | Effective Radiative Forcing | medium | 666 | 667 | 668 | ||
| K / yr | Warming rate | medium | 669 | 670 | 671 |
240 rows × 3 columns
The reader can also be used as a context manager, in which the lock is automatically acquired and released.
with reader:
# The lock is held by the reader here
# so we can load data via the reader
reader.load()
# but we can't e.g. save data via another view into the database
try:
db_other_view.save(df_timeseries_like)
except filelock.Timeout:
traceback.print_exc(limit=0, chain=False)
# Outside the context block, the lock is released by the reader
# so we can use other views for operations again.
db_other_view.load()
filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.
| 2010 | 2015 | 2025 | |||||
|---|---|---|---|---|---|---|---|
| climate_model | scenario | unit | variable | variant | |||
| climate_model_a | scenario_a | K | Temperature | high | 0 | 1 | 2 |
| J | Ocean Heat Uptake | high | 3 | 4 | 5 | ||
| W / m^2 | Effective Radiative Forcing | high | 6 | 7 | 8 | ||
| K / yr | Warming rate | high | 9 | 10 | 11 | ||
| climate_model_b | scenario_a | K | Temperature | high | 12 | 13 | 14 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| climate_model_c | scenario_e | K / yr | Warming rate | medium | 657 | 658 | 659 |
| climate_model_d | scenario_e | K | Temperature | medium | 660 | 661 | 662 |
| J | Ocean Heat Uptake | medium | 663 | 664 | 665 | ||
| W / m^2 | Effective Radiative Forcing | medium | 666 | 667 | 668 | ||
| K / yr | Warming rate | medium | 669 | 670 | 671 |
240 rows × 3 columns
If you aren't worried about the reader having a lock, you can simply disable it when creating the reader.
reader_no_lock = db.create_reader(lock=False)
reader_no_lock.lock is None
True
Summary¶
We believe that OpenSCMDB provides a powerful way for saving timeseries data. However, it is in a work progress. If there is a feature you would like or a bug or anything else, please raise an issue.