How to use OpenSCMDB¶

In this notebook we describe how to use the OpenSCMDB class. This class is useful for writing data to and reading data from a database of files.

Imports¶

In [1]:

Copied!





import concurrent.futures
import contextlib
import itertools
import tarfile
import tempfile
import traceback
from functools import partial
from pathlib import Path

import filelock
import numpy as np
import pandas as pd
import pandas_indexing as pix
import tqdm

from pandas_openscm.db import (
    DATA_BACKENDS,
    INDEX_BACKENDS,
    AlreadyInDBError,
    EmptyDBError,
    OpenSCMDB,
)
from pandas_openscm.parallelisation import ParallelOpConfig
import concurrent.futures
import contextlib
import itertools
import tarfile
import tempfile
import traceback
from functools import partial
from pathlib import Path

import filelock
import numpy as np
import pandas as pd
import pandas_indexing as pix
import tqdm

from pandas_openscm.db import (
    DATA_BACKENDS,
    INDEX_BACKENDS,
    AlreadyInDBError,
    EmptyDBError,
    OpenSCMDB,
)
from pandas_openscm.parallelisation import ParallelOpConfig

Basics¶

The database is file based. We simply write files of data to a directory and keep track of where the data is using an index file. As a result, every OpenSCMDB instance needs to know which directory to write into.

In addition, the instance needs to know which file format, i.e. backend, to use to write and read the data and the index. The available data and index backends are below. You can, of course, write your own backend that matches the required interface.

In [2]:

Copied!

DATA_BACKENDS.options
DATA_BACKENDS.options

Out[2]:

(('csv', pandas_openscm.db.csv.CSVDataBackend),
 ('feather', pandas_openscm.db.feather.FeatherDataBackend),
 ('in_memory', pandas_openscm.db.in_memory.InMemoryDataBackend),
 ('netCDF', pandas_openscm.db.netcdf.netCDFDataBackend))

In [3]:

Copied!

INDEX_BACKENDS.options
INDEX_BACKENDS.options

Out[3]:

(('csv', pandas_openscm.db.csv.CSVIndexBackend),
 ('feather', pandas_openscm.db.feather.FeatherIndexBackend),
 ('in_memory', pandas_openscm.db.in_memory.InMemoryIndexBackend),
 ('netCDF', pandas_openscm.db.netcdf.netCDFIndexBackend))

Knowing the directory in which to work and the back-ends to use, we can initialise a database.

In [4]:

Copied!





db = OpenSCMDB(
    db_dir=Path(tempfile.mkdtemp()),
    backend_data=DATA_BACKENDS.get_instance("csv"),
    backend_index=INDEX_BACKENDS.get_instance("csv"),
)
db
db = OpenSCMDB(
    db_dir=Path(tempfile.mkdtemp()),
    backend_data=DATA_BACKENDS.get_instance("csv"),
    backend_index=INDEX_BACKENDS.get_instance("csv"),
)
db

Out[4]:

OpenSCMDB(backend_data=CSVDataBackend(ext='.csv'), backend_index=CSVIndexBackend(ext='.csv'), db_dir=PosixPath('/tmp/tmppxl3pdmo'), index_file_lock=<filelock._unix.UnixFileLock object at 0x792d12633e90>)

Empty databases¶

If you try and operate on an empty database, you will either get an error or nothing will happen.

In [5]:

Copied!

# Deleting an empty database does nothing
db.delete()
# Deleting an empty database does nothing
db.delete()

In [6]:

Copied!





# Trying to load an empty database raises an error
try:
    db.load()
except EmptyDBError:
    traceback.print_exc(limit=0, chain=False)
# Trying to load an empty database raises an error
try:
    db.load()
except EmptyDBError:
    traceback.print_exc(limit=0, chain=False)

pandas_openscm.db.openscm_db.EmptyDBError: The database is empty: db=OpenSCMDB(backend_data=CSVDataBackend(ext='.csv'), backend_index=CSVIndexBackend(ext='.csv'), db_dir=PosixPath('/tmp/tmppxl3pdmo'), index_file_lock=<filelock._unix.UnixFileLock object at 0x792d12633e90>)

Saving data¶

If we have some data, we can save it to the database. In theory, this will work with any data type. However, it is optimised to timeseries data i.e. data that has a multi-index with metadata values for each timeseries, columns which define the time axis and values for each timeseries.

In [7]:

Copied!





df_timeseries_like = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    columns=[2010, 2015, 2025],
    index=pd.MultiIndex.from_tuples(
        [
            ("scenario_a", "climate_model_a", "Temperature", "K"),
            ("scenario_b", "climate_model_a", "Temperature", "K"),
            ("scenario_b", "climate_model_b", "Temperature", "K"),
            ("scenario_b", "climate_model_b", "Ocean Heat Uptake", "J"),
        ],
        names=["scenario", "climate_model", "variable", "unit"],
    ),
)
df_timeseries_like
df_timeseries_like = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    columns=[2010, 2015, 2025],
    index=pd.MultiIndex.from_tuples(
        [
            ("scenario_a", "climate_model_a", "Temperature", "K"),
            ("scenario_b", "climate_model_a", "Temperature", "K"),
            ("scenario_b", "climate_model_b", "Temperature", "K"),
            ("scenario_b", "climate_model_b", "Ocean Heat Uptake", "J"),
        ],
        names=["scenario", "climate_model", "variable", "unit"],
    ),
)
df_timeseries_like

Out[7]:

				2010	2015	2025
scenario	climate_model	variable	unit
scenario_a	climate_model_a	Temperature	K	0	1	2
scenario_b	climate_model_a	Temperature	K	3	4	5
	climate_model_b	Temperature	K	6	7	8
	climate_model_b	Ocean Heat Uptake	J	9	10	11

In [8]:

Copied!

# Saving is then trivial
db.save(df_timeseries_like)
# Saving is then trivial
db.save(df_timeseries_like)

Loading data¶

Metadata¶

The first thing we might want to load is the metadata in our database.

In [9]:

Copied!





# Loading metadata is trivial.
# A MultiIndex is returned by default.
metadata = db.load_metadata()
metadata
# Loading metadata is trivial.
# A MultiIndex is returned by default.
metadata = db.load_metadata()
metadata

Out[9]:

MultiIndex([('climate_model_a', 'scenario_a', 'K',       'Temperature'),
            ('climate_model_a', 'scenario_b', 'K',       'Temperature'),
            ('climate_model_b', 'scenario_b', 'K',       'Temperature'),
            ('climate_model_b', 'scenario_b', 'J', 'Ocean Heat Uptake')],
           names=['climate_model', 'scenario', 'unit', 'variable'])

In [10]:

Copied!

# The MultiIndex can be trivially turned into a DataFrame if you wish
metadata.to_frame(index=False)
# The MultiIndex can be trivially turned into a DataFrame if you wish
metadata.to_frame(index=False)

Out[10]:

	climate_model	scenario	unit	variable
0	climate_model_a	scenario_a	K	Temperature
1	climate_model_a	scenario_b	K	Temperature
2	climate_model_b	scenario_b	K	Temperature
3	climate_model_b	scenario_b	J	Ocean Heat Uptake

Data¶

Then we can load data.

In [11]:

Copied!

# By default, we just load everything
db.load()
# By default, we just load everything
db.load()

Out[11]:

				2010	2015	2025
climate_model	scenario	unit	variable
climate_model_a	scenario_a	K	Temperature	0	1	2
climate_model_a	scenario_b	K	Temperature	3	4	5
climate_model_b	scenario_b	K	Temperature	6	7	8
climate_model_b	scenario_b	J	Ocean Heat Uptake	9	10	11

Sub-selecting¶

If we don't want all the data, we can sub-select.

In [12]:

Copied!

# For example, this can be done using the previously loaded metadata
db.load(metadata[:2])
# For example, this can be done using the previously loaded metadata
db.load(metadata[:2])

Out[12]:

				2010	2015	2025
climate_model	scenario	unit	variable
climate_model_a	scenario_a	K	Temperature	0	1	2
climate_model_a	scenario_b	K	Temperature	3	4	5

When combined with pandas-indexing, this provides particularly powerful functionality.

In [13]:

Copied!

db.load(pix.isin(scenario="scenario_b"))
db.load(pix.isin(scenario="scenario_b"))

Out[13]:

				2010	2015	2025
climate_model	scenario	unit	variable
climate_model_a	scenario_b	K	Temperature	3	4	5
climate_model_b	scenario_b	K	Temperature	6	7	8
climate_model_b	scenario_b	J	Ocean Heat Uptake	9	10	11

In [14]:

Copied!

db.load(pix.ismatch(variable="Ocean*"))
db.load(pix.ismatch(variable="Ocean*"))

Out[14]:

				2010	2015	2025
climate_model	scenario	unit	variable
climate_model_b	scenario_b	J	Ocean Heat Uptake	9	10	11

Deleting¶

If we wish, we can delete our database with delete.

In [15]:

Copied!

# The database is not empty to start
db.is_empty
# The database is not empty to start
db.is_empty

Out[15]:

False

In [16]:

Copied!

db.delete()
# Now the database is empty
db.is_empty
db.delete()
# Now the database is empty
db.is_empty

Out[16]:

True

Summary so far¶

The functionality shown up to here is the key functionality. If all you need to do is basic saving and loading of data, this is all you need. If you are looking to support more complex use cases (which you probably are, otherwise you wouldn't be here), read on.

Advanced topics¶

If you need to share a database, you can zip it and pass it to someone else.

We start by putting some data in a database.

In [17]:

Copied!

top_level_dir = Path(tempfile.mkdtemp())
top_level_dir = Path(tempfile.mkdtemp())

In [18]:

Copied!





db_start = OpenSCMDB(
    db_dir=top_level_dir / "start",
    backend_data=DATA_BACKENDS.get_instance("csv"),
    backend_index=INDEX_BACKENDS.get_instance("csv"),
)
db_start.save(df_timeseries_like)
db_start = OpenSCMDB(
    db_dir=top_level_dir / "start",
    backend_data=DATA_BACKENDS.get_instance("csv"),
    backend_index=INDEX_BACKENDS.get_instance("csv"),
)
db_start.save(df_timeseries_like)

Then we create a gzipped tar archive of our database.

In [19]:

Copied!

gzipped = top_level_dir / "db_archive.tar.gz"
db_start.to_gzipped_tar_archive(gzipped)
gzipped = top_level_dir / "db_archive.tar.gz"
db_start.to_gzipped_tar_archive(gzipped)

Out[19]:

PosixPath('/tmp/tmp9lqbi1y_/db_archive.tar.gz')

To demonstrate that this does not rely on the original data, we delete the original database.

In [20]:

Copied!

db_start.delete()
db_start.delete()

We can inspect the tar file's contents.

In [21]:

Copied!

with tarfile.open(gzipped) as tar:
    print(f"{tar.getmembers()=}")
with tarfile.open(gzipped) as tar:
    print(f"{tar.getmembers()=}")

tar.getmembers()=[<TarInfo 'db' at 0x792d127bb280>, <TarInfo 'db/0.csv' at 0x792d127bb100>, <TarInfo 'db/filemap.csv' at 0x792d127bb340>, <TarInfo 'db/index.csv' at 0x792d127ba680>]

A new database can be initialised from the gzipped tar archive.

In [22]:

Copied!





db_moved = OpenSCMDB.from_gzipped_tar_archive(
    gzipped,
    db_dir=top_level_dir / "moved",
)
db_moved
db_moved = OpenSCMDB.from_gzipped_tar_archive(
    gzipped,
    db_dir=top_level_dir / "moved",
)
db_moved

Out[22]:

OpenSCMDB(backend_data=CSVDataBackend(ext='.csv'), backend_index=CSVIndexBackend(ext='.csv'), db_dir=PosixPath('/tmp/tmp9lqbi1y_/moved'), index_file_lock=<filelock._unix.UnixFileLock object at 0x792d126a8850>)

As above, we remove the archive to demonstrate that there is no reliance on it for the following operations.

In [23]:

Copied!

gzipped.unlink()
gzipped.unlink()

You can then use this database like normal, but now from the new location (whether on your machine or someone else's).

In [24]:

Copied!

db_moved.load()
db_moved.load()

Out[24]:

				2010	2015	2025
climate_model	scenario	unit	variable
climate_model_a	scenario_a	K	Temperature	0	1	2
climate_model_a	scenario_b	K	Temperature	3	4	5
climate_model_b	scenario_b	K	Temperature	6	7	8
climate_model_b	scenario_b	J	Ocean Heat Uptake	9	10	11

In [25]:

Copied!

db_moved.load(pix.isin(unit="J"))
db_moved.load(pix.isin(unit="J"))

Out[25]:

				2010	2015	2025
climate_model	scenario	unit	variable
climate_model_b	scenario_b	J	Ocean Heat Uptake	9	10	11

We clean up the files before moving onto the next demonstration.

In [26]:

Copied!

db_moved.delete()
db_moved.delete()

Grouping data¶

For many use cases, it won't make sense to save all your data in a single file. You can control how the data is grouped while saving using the arguments shown below.

In [27]:

Copied!





df_many_timeseries = pd.DataFrame(
    np.arange(5 * 3 * 4 * 4 * 3).reshape(5 * 3 * 4 * 4, 3),
    columns=[2010, 2015, 2025],
    index=pd.MultiIndex.from_tuples(
        [
            (scenario, variant, climate_model, variable, unit)
            for scenario, variant, climate_model, (variable, unit) in itertools.product(
                ["scenario_a", "scenario_b", "scenario_c", "scenario_d", "scenario_e"],
                ["high", "medium", "low"],
                [
                    "climate_model_a",
                    "climate_model_b",
                    "climate_model_c",
                    "climate_model_d",
                ],
                [
                    ("Temperature", "K"),
                    ("Ocean Heat Uptake", "J"),
                    ("Effective Radiative Forcing", "W / m^2"),
                    ("Warming rate", "K / yr"),
                ],
            )
        ],
        names=["scenario", "variant", "climate_model", "variable", "unit"],
    ),
)
df_many_timeseries
df_many_timeseries = pd.DataFrame(
    np.arange(5 * 3 * 4 * 4 * 3).reshape(5 * 3 * 4 * 4, 3),
    columns=[2010, 2015, 2025],
    index=pd.MultiIndex.from_tuples(
        [
            (scenario, variant, climate_model, variable, unit)
            for scenario, variant, climate_model, (variable, unit) in itertools.product(
                ["scenario_a", "scenario_b", "scenario_c", "scenario_d", "scenario_e"],
                ["high", "medium", "low"],
                [
                    "climate_model_a",
                    "climate_model_b",
                    "climate_model_c",
                    "climate_model_d",
                ],
                [
                    ("Temperature", "K"),
                    ("Ocean Heat Uptake", "J"),
                    ("Effective Radiative Forcing", "W / m^2"),
                    ("Warming rate", "K / yr"),
                ],
            )
        ],
        names=["scenario", "variant", "climate_model", "variable", "unit"],
    ),
)
df_many_timeseries

Out[27]:

					2010	2015	2025
scenario	variant	climate_model	variable	unit
scenario_a	high	climate_model_a	Temperature	K	0	1	2
			Ocean Heat Uptake	J	3	4	5
			Effective Radiative Forcing	W / m^2	6	7	8
			Warming rate	K / yr	9	10	11
		climate_model_b	Temperature	K	12	13	14
...	...	...	...	...	...	...	...
scenario_e	low	climate_model_c	Warming rate	K / yr	705	706	707
		climate_model_d	Temperature	K	708	709	710
			Ocean Heat Uptake	J	711	712	713
			Effective Radiative Forcing	W / m^2	714	715	716
			Warming rate	K / yr	717	718	719

240 rows × 3 columns

In [28]:

Copied!

# For example, we might want to group by climate model and scenario on saving
db.save(df_many_timeseries, groupby=["climate_model", "scenario"])
# For example, we might want to group by climate model and scenario on saving
db.save(df_many_timeseries, groupby=["climate_model", "scenario"])

Parallelisation and progress bars¶

We provide a variety of ways to control parallelisation during operations as well as the display of progress bars. The simplest way to control these is via the arguments shown below.

In [29]:

Copied!





# Turn on default progress bars
# (note that these don't show up in the rendered docs,
# but will if you actually use the package in a notebook)
db.load(progress=True)
# Turn on default progress bars
# (note that these don't show up in the rendered docs,
# but will if you actually use the package in a notebook)
db.load(progress=True)

Out[29]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
		K	Temperature	medium	48	49	50
...	...	...	...	...	...	...	...
climate_model_d	scenario_e	K / yr	Warming rate	medium	669	670	671
		K	Temperature	low	708	709	710
		J	Ocean Heat Uptake	low	711	712	713
		W / m^2	Effective Radiative Forcing	low	714	715	716
		K / yr	Warming rate	low	717	718	719

240 rows × 3 columns

In [30]:

Copied!

# Turn on default parallelisation
db.load(max_workers=4)
# Turn on default parallelisation
db.load(max_workers=4)

Out[30]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_b	K	Temperature	high	144	145	146
		J	Ocean Heat Uptake	high	147	148	149
		W / m^2	Effective Radiative Forcing	high	150	151	152
		K / yr	Warming rate	high	153	154	155
		K	Temperature	medium	192	193	194
...	...	...	...	...	...	...	...
climate_model_d	scenario_e	K / yr	Warming rate	medium	669	670	671
		K	Temperature	low	708	709	710
		J	Ocean Heat Uptake	low	711	712	713
		W / m^2	Effective Radiative Forcing	low	714	715	716
		K / yr	Warming rate	low	717	718	719

240 rows × 3 columns

In [31]:

Copied!





# Turn on default progress bars and parallelisation
# (again, note that the progress bars don't show up in the rendered docs,
# but will if you actually use the package in a notebook)
db.load(progress=True, max_workers=4)
# Turn on default progress bars and parallelisation
# (again, note that the progress bars don't show up in the rendered docs,
# but will if you actually use the package in a notebook)
db.load(progress=True, max_workers=4)

Out[31]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
		K	Temperature	medium	48	49	50
...	...	...	...	...	...	...	...
climate_model_d	scenario_c	K / yr	Warming rate	medium	381	382	383
		K	Temperature	low	420	421	422
		J	Ocean Heat Uptake	low	423	424	425
		W / m^2	Effective Radiative Forcing	low	426	427	428
		K / yr	Warming rate	low	429	430	431

240 rows × 3 columns

If you want fine-grained control over the behaviour, then the operations support receiving ParallelOpConfig instances to control the parallelisation and progress for different operations. An example is shown below.

In [32]:

Copied!





with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
    loaded = db.load(
        parallel_op_config=ParallelOpConfig(
            progress_results=partial(
                tqdm.tqdm, desc="Custom description and non-HTML bar"
            ),
            executor=executor,
            progress_parallel_submission=partial(
                tqdm.tqdm, desc="Custom submission description and non-HTML bar"
            ),
        )
    )

loaded
with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
    loaded = db.load(
        parallel_op_config=ParallelOpConfig(
            progress_results=partial(
                tqdm.tqdm, desc="Custom description and non-HTML bar"
            ),
            executor=executor,
            progress_parallel_submission=partial(
                tqdm.tqdm, desc="Custom submission description and non-HTML bar"
            ),
        )
    )

loaded

Custom submission description and non-HTML bar:   0%|          | 0/20 [00:00<?, ?it/s]

Custom submission description and non-HTML bar: 100%|██████████| 20/20 [00:00<00:00, 1448.48it/s]

Custom description and non-HTML bar:   0%|          | 0/20 [00:00<?, ?it/s]

Custom description and non-HTML bar: 100%|██████████| 20/20 [00:00<00:00, 908.04it/s]

Out[32]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
		K	Temperature	medium	48	49	50
...	...	...	...	...	...	...	...
climate_model_d	scenario_e	K / yr	Warming rate	medium	669	670	671
		K	Temperature	low	708	709	710
		J	Ocean Heat Uptake	low	711	712	713
		W / m^2	Effective Radiative Forcing	low	714	715	716
		K / yr	Warming rate	low	717	718	719

240 rows × 3 columns

Overwriting data¶

By default, it is not possible to overwrite data that is already in the database. This ensures that you don't accidentally overwrite.

In [33]:

Copied!





# Trying to load an empty database raises an error
try:
    db.save(df_many_timeseries)
except AlreadyInDBError:
    traceback.print_exc(limit=0, chain=False)
# Trying to load an empty database raises an error
try:
    db.save(df_many_timeseries)
except AlreadyInDBError:
    traceback.print_exc(limit=0, chain=False)

pandas_openscm.db.openscm_db.AlreadyInDBError: The following rows are already in the database:
       scenario variant    climate_model                     variable     unit
0    scenario_a    high  climate_model_a                  Temperature        K
1    scenario_a    high  climate_model_a            Ocean Heat Uptake        J
2    scenario_a    high  climate_model_a  Effective Radiative Forcing  W / m^2
3    scenario_a    high  climate_model_a                 Warming rate   K / yr
4    scenario_a    high  climate_model_b                  Temperature        K
..          ...     ...              ...                          ...      ...
235  scenario_e     low  climate_model_c                 Warming rate   K / yr
236  scenario_e     low  climate_model_d                  Temperature        K
237  scenario_e     low  climate_model_d            Ocean Heat Uptake        J
238  scenario_e     low  climate_model_d  Effective Radiative Forcing  W / m^2
239  scenario_e     low  climate_model_d                 Warming rate   K / yr

[240 rows x 5 columns]

If you are sure, you can overwrite data as shown below. As the progress bars show, this happens in two steps:

the new data is written to disk
the data that it overwrites is removed

In [34]:

Copied!





# A helper to ensure our parallel bars
# show in the rendered docs.
# If you're actually using the package,
# you can just use `progress=True` for the majority of use cases.
save_pbars = dict(
    parallel_op_config_save=ParallelOpConfig(partial(tqdm.tqdm, desc="File saving")),
    parallel_op_config_delete=ParallelOpConfig(
        partial(tqdm.tqdm, desc="File deletion")
    ),
    parallel_op_config_rewrite=ParallelOpConfig(
        partial(tqdm.tqdm, desc="File re-writing")
    ),
)
load_pbar = dict(
    parallel_op_config=ParallelOpConfig(partial(tqdm.tqdm, desc="File loading")),
)
# A helper to ensure our parallel bars
# show in the rendered docs.
# If you're actually using the package,
# you can just use `progress=True` for the majority of use cases.
save_pbars = dict(
    parallel_op_config_save=ParallelOpConfig(partial(tqdm.tqdm, desc="File saving")),
    parallel_op_config_delete=ParallelOpConfig(
        partial(tqdm.tqdm, desc="File deletion")
    ),
    parallel_op_config_rewrite=ParallelOpConfig(
        partial(tqdm.tqdm, desc="File re-writing")
    ),
)
load_pbar = dict(
    parallel_op_config=ParallelOpConfig(partial(tqdm.tqdm, desc="File loading")),
)

In [35]:

Copied!

db.save(df_many_timeseries, allow_overwrite=True, **save_pbars)
db.save(df_many_timeseries, allow_overwrite=True, **save_pbars)

File saving:   0%|          | 0/3 [00:00<?, ?it/s]

File saving: 100%|██████████| 3/3 [00:00<00:00, 706.43it/s]

File deletion:   0%|          | 0/20 [00:00<?, ?it/s]

File deletion: 100%|██████████| 20/20 [00:00<00:00, 18737.12it/s]

Overwriting works whether you are doing a full overwrite or a partial overwrite. However, if you are doing a partial overwrite, the operation can be quite expensive. The reason is that you have to load the written data and re-write the data that is not being overwritten. As a result, considering your data grouping carefully will make a big difference in performance if you expected to be doing lots of overwrites.

In [36]:

Copied!

# Save into a single file
db.save(df_many_timeseries, allow_overwrite=True, **save_pbars)
# Save into a single file
db.save(df_many_timeseries, allow_overwrite=True, **save_pbars)

File saving:   0%|          | 0/3 [00:00<?, ?it/s]

File saving: 100%|██████████| 3/3 [00:00<00:00, 697.27it/s]

File deletion:   0%|          | 0/1 [00:00<?, ?it/s]

File deletion: 100%|██████████| 1/1 [00:00<00:00, 7384.34it/s]

In [37]:

Copied!





# Now overwrite only a selection of the data.
# This will force the existing file to be re-written
# so that the data we wish to keep is not lost,
# but we do get rid of the data we are overwriting.
db.save(df_many_timeseries.iloc[:20, :], allow_overwrite=True, **save_pbars)
# Now overwrite only a selection of the data.
# This will force the existing file to be re-written
# so that the data we wish to keep is not lost,
# but we do get rid of the data we are overwriting.
db.save(df_many_timeseries.iloc[:20, :], allow_overwrite=True, **save_pbars)

/home/docs/.asdf/installs/python/3.11.14/lib/python3.11/site-packages/pandas_openscm/db/openscm_db.py:740: UserWarning: Overwriting the data will require re-writing. This may be slow. If that is an issue, the way to solve it is to update your workflow to ensure that you are not overwriting data or are only overwriting entire files.
  warnings.warn(msg)

File re-writing:   0%|          | 0/1 [00:00<?, ?it/s]

File re-writing: 100%|██████████| 1/1 [00:00<00:00, 199.85it/s]

File saving:   0%|          | 0/3 [00:00<?, ?it/s]

File saving: 100%|██████████| 3/3 [00:00<00:00, 721.83it/s]

File deletion:   0%|          | 0/1 [00:00<?, ?it/s]

File deletion: 100%|██████████| 1/1 [00:00<00:00, 7884.03it/s]

In [38]:

Copied!





# You can disable the warning as shown below
db.save(
    df_many_timeseries.iloc[-20:, :],
    allow_overwrite=True,
    warn_on_partial_overwrite=False,
    **save_pbars,
)
# You can disable the warning as shown below
db.save(
    df_many_timeseries.iloc[-20:, :],
    allow_overwrite=True,
    warn_on_partial_overwrite=False,
    **save_pbars,
)

File re-writing:   0%|          | 0/1 [00:00<?, ?it/s]

File re-writing: 100%|██████████| 1/1 [00:00<00:00, 185.93it/s]

File saving:   0%|          | 0/3 [00:00<?, ?it/s]

File saving: 100%|██████████| 3/3 [00:00<00:00, 782.37it/s]

File deletion:   0%|          | 0/1 [00:00<?, ?it/s]

File deletion: 100%|██████████| 1/1 [00:00<00:00, 8701.88it/s]

More on grouping¶

Now that you have seen some more features, it is easier to explain why grouping is so important. By default, when we save, we save the data as a single file.

In [39]:

Copied!

# Before continuing, clear out the database
db.delete()
# Before continuing, clear out the database
db.delete()

In [40]:

Copied!





# The progress bar shows that three files are being saved, these are:
# 1. the data (as a single file)
# 2. the index
# 3. the file map
db.save(df_many_timeseries, **save_pbars)
# The progress bar shows that three files are being saved, these are:
# 1. the data (as a single file)
# 2. the index
# 3. the file map
db.save(df_many_timeseries, **save_pbars)

File saving:   0%|          | 0/3 [00:00<?, ?it/s]

File saving: 100%|██████████| 3/3 [00:00<00:00, 753.83it/s]

If we save the data like this, then we have to read all of the data every time, even if we only want to get a subset of it.

In [41]:

Copied!

# All the data is in one file, which is read in its entirety,
# even though we only want some of the data.
db.load(pix.isin(variant="low"), **load_pbar)
# All the data is in one file, which is read in its entirety,
# even though we only want some of the data.
db.load(pix.isin(variant="low"), **load_pbar)

File loading:   0%|          | 0/1 [00:00<?, ?it/s]

File loading: 100%|██████████| 1/1 [00:00<00:00, 659.27it/s]

Out[41]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	low	96	97	98
		J	Ocean Heat Uptake	low	99	100	101
		W / m^2	Effective Radiative Forcing	low	102	103	104
		K / yr	Warming rate	low	105	106	107
climate_model_b	scenario_a	K	Temperature	low	108	109	110
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	low	705	706	707
climate_model_d	scenario_e	K	Temperature	low	708	709	710
		J	Ocean Heat Uptake	low	711	712	713
		W / m^2	Effective Radiative Forcing	low	714	715	716
		K / yr	Warming rate	low	717	718	719

80 rows × 3 columns

If you know how you are likely to want to access your data, you can give yourself much more fine-grained control.

In [42]:

Copied!





# For example, if we know that the variants are our group of interest,
# we should save the database like that.
# As the progress bars show, the data is now grouped into more files
# (3, in fact, with the other two save operations being for the index and file map).
db.save(df_many_timeseries, groupby=["variant"], allow_overwrite=True, **save_pbars)
# For example, if we know that the variants are our group of interest,
# we should save the database like that.
# As the progress bars show, the data is now grouped into more files
# (3, in fact, with the other two save operations being for the index and file map).
db.save(df_many_timeseries, groupby=["variant"], allow_overwrite=True, **save_pbars)

File saving:   0%|          | 0/5 [00:00<?, ?it/s]

File saving: 100%|██████████| 5/5 [00:00<00:00, 846.00it/s]

File deletion:   0%|          | 0/1 [00:00<?, ?it/s]

File deletion: 100%|██████████| 1/1 [00:00<00:00, 5433.04it/s]

In [43]:

Copied!

# If we only want to load one variant, now we only load data for that variant,
# not all of the data.
db.load(pix.isin(variant="low"), **load_pbar)
# If we only want to load one variant, now we only load data for that variant,
# not all of the data.
db.load(pix.isin(variant="low"), **load_pbar)

File loading:   0%|          | 0/1 [00:00<?, ?it/s]

File loading: 100%|██████████| 1/1 [00:00<00:00, 572.91it/s]

Out[43]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	low	96	97	98
		J	Ocean Heat Uptake	low	99	100	101
		W / m^2	Effective Radiative Forcing	low	102	103	104
		K / yr	Warming rate	low	105	106	107
climate_model_b	scenario_a	K	Temperature	low	108	109	110
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	low	705	706	707
climate_model_d	scenario_e	K	Temperature	low	708	709	710
		J	Ocean Heat Uptake	low	711	712	713
		W / m^2	Effective Radiative Forcing	low	714	715	716
		K / yr	Warming rate	low	717	718	719

80 rows × 3 columns

In [44]:

Copied!

# This is clearer if we load data for two variants, but not all.
# The progress bars show that two files are being loaded.
db.load(pix.isin(variant=["low", "high"]), **load_pbar)
# This is clearer if we load data for two variants, but not all.
# The progress bars show that two files are being loaded.
db.load(pix.isin(variant=["low", "high"]), **load_pbar)

File loading:   0%|          | 0/2 [00:00<?, ?it/s]

File loading: 100%|██████████| 2/2 [00:00<00:00, 748.92it/s]

Out[44]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
climate_model_b	scenario_a	K	Temperature	high	12	13	14
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	low	705	706	707
climate_model_d	scenario_e	K	Temperature	low	708	709	710
		J	Ocean Heat Uptake	low	711	712	713
		W / m^2	Effective Radiative Forcing	low	714	715	716
		K / yr	Warming rate	low	717	718	719

160 rows × 3 columns

In [45]:

Copied!





# If we instead want to load all the temperature data for a given climate model,
# because of the way the data is grouped,
# we have to load all the files
# (you can see this by looking at the progress bar output).
# This would be a good example where,
# if loading a single variable-climate model combination
# at a time is of interest,
# the data should be saved with a grouping that supports
# that access pattern more directly.
db.load(pix.isin(climate_model="climate_model_c", variable="Temperature"), **load_pbar)
# If we instead want to load all the temperature data for a given climate model,
# because of the way the data is grouped,
# we have to load all the files
# (you can see this by looking at the progress bar output).
# This would be a good example where,
# if loading a single variable-climate model combination
# at a time is of interest,
# the data should be saved with a grouping that supports
# that access pattern more directly.
db.load(pix.isin(climate_model="climate_model_c", variable="Temperature"), **load_pbar)

File loading:   0%|          | 0/3 [00:00<?, ?it/s]

File loading: 100%|██████████| 3/3 [00:00<00:00, 931.38it/s]

Out[45]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_c	scenario_a	K	Temperature	high	24	25	26
	scenario_b	K	Temperature	high	168	169	170
	scenario_c	K	Temperature	high	312	313	314
	scenario_d	K	Temperature	high	456	457	458
	scenario_e	K	Temperature	high	600	601	602
	scenario_a	K	Temperature	low	120	121	122
	scenario_b	K	Temperature	low	264	265	266
	scenario_c	K	Temperature	low	408	409	410
	scenario_d	K	Temperature	low	552	553	554
	scenario_e	K	Temperature	low	696	697	698
	scenario_a	K	Temperature	medium	72	73	74
	scenario_b	K	Temperature	medium	216	217	218
	scenario_c	K	Temperature	medium	360	361	362
	scenario_d	K	Temperature	medium	504	505	506
	scenario_e	K	Temperature	medium	648	649	650

The tradeoff with grouping the data can be that reading more files takes more time (although sometimes even then not thanks to parallelisation). It's up to you to make the grouping choice that suits your access pattern.

Locking¶

The database supports locking. This ensures that e.g. only one process can write to the database. Locking is handled via the index_file_lock attribute. You can set this at initialisation if you wish. If it is not supplied, a default value is used.

The lock is an attribute of the instance. In other words, every database instance has its own lock. We use filelock by default. This is a powerful locking library, but it is important to understand its model.

Of most importance is the fact that the locks are recursive. This means that repeated calls to acquire the same lock will not block.

In [46]:

Copied!





with db.index_file_lock.acquire():
    # Even though we just acquired the lock,
    # we can acquire it again because of the recursive behaviour.
    db.index_file_lock.acquire()
with db.index_file_lock.acquire():
    # Even though we just acquired the lock,
    # we can acquire it again because of the recursive behaviour.
    db.index_file_lock.acquire()

However, if we now try and acquire a lock for a database which is working on the same directory, we will find that we can't.

In [47]:

Copied!





# Get another db instance
# (the issue doesn't arise with the current instance because of the recursive locking)
db_other_view = OpenSCMDB(
    db_dir=db.db_dir,  # Pointing at the same directory
    backend_data=DATA_BACKENDS.get_instance("csv"),
    backend_index=INDEX_BACKENDS.get_instance("csv"),
)
# The default timeout is inifinity,
# so the below would just block forever by default.
# We make this more sensible here.
db_other_view.index_file_lock.timeout = 0.5

# We can't acquire the lock from this other db instance
try:
    db_other_view.index_file_lock.acquire()
except filelock.Timeout:
    traceback.print_exc(limit=0, chain=False)
# Get another db instance
# (the issue doesn't arise with the current instance because of the recursive locking)
db_other_view = OpenSCMDB(
    db_dir=db.db_dir,  # Pointing at the same directory
    backend_data=DATA_BACKENDS.get_instance("csv"),
    backend_index=INDEX_BACKENDS.get_instance("csv"),
)
# The default timeout is inifinity,
# so the below would just block forever by default.
# We make this more sensible here.
db_other_view.index_file_lock.timeout = 0.5

# We can't acquire the lock from this other db instance
try:
    db_other_view.index_file_lock.acquire()
except filelock.Timeout:
    traceback.print_exc(limit=0, chain=False)

filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.

The locking is helpful to avoid putting the database in a corrupt state. By default, the lock is acquired for all operations. You can override this by providing a different lock to the method.

In [48]:

Copied!

# The lock is currently being held
db.index_file_lock.is_locked
# The lock is currently being held
db.index_file_lock.is_locked

Out[48]:

True

In [49]:

Copied!





# Despite this, the recursive locks
# mean we can still load data with the db instance
# that is holding the lock
db.load()
# Despite this, the recursive locks
# mean we can still load data with the db instance
# that is holding the lock
db.load()

Out[49]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
climate_model_b	scenario_a	K	Temperature	high	12	13	14
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	medium	657	658	659
climate_model_d	scenario_e	K	Temperature	medium	660	661	662
		J	Ocean Heat Uptake	medium	663	664	665
		W / m^2	Effective Radiative Forcing	medium	666	667	668
		K / yr	Warming rate	medium	669	670	671

240 rows × 3 columns

In [50]:

Copied!





# However, we cannot load data from the other instance
try:
    db_other_view.load()
except filelock.Timeout:
    traceback.print_exc(limit=0, chain=False)
# However, we cannot load data from the other instance
try:
    db_other_view.load()
except filelock.Timeout:
    traceback.print_exc(limit=0, chain=False)

filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.

In [51]:

Copied!





# Unless we override the lock
db_other_view.load(
    # Providing a nullcontext bypasses the lock.
    # Obviously, only do this if you know what you're doing.
    index_file_lock=contextlib.nullcontext()
)
# Unless we override the lock
db_other_view.load(
    # Providing a nullcontext bypasses the lock.
    # Obviously, only do this if you know what you're doing.
    index_file_lock=contextlib.nullcontext()
)

Out[51]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
climate_model_b	scenario_a	K	Temperature	high	12	13	14
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	medium	657	658	659
climate_model_d	scenario_e	K	Temperature	medium	660	661	662
		J	Ocean Heat Uptake	medium	663	664	665
		W / m^2	Effective Radiative Forcing	medium	666	667	668
		K / yr	Warming rate	medium	669	670	671

240 rows × 3 columns

For more information on the different ways to acquire the lock, see the filelock docs. If you are interested in locking, we really recommend reading the docs. The locking is very powerful, but quite subtle so you really have to understand the implementation to get the most out of it.

In [52]:

Copied!





# Release the lock fully before moving on
# (if you want to know why this is needed,
# read the filelock docs)
db.index_file_lock.release()
db.index_file_lock.is_locked
# Release the lock fully before moving on
# (if you want to know why this is needed,
# read the filelock docs)
db.index_file_lock.release()
db.index_file_lock.is_locked

Out[52]:

False

OpenSCMDBReader¶

If you just want to read data, then we provide a class optimised to this use case, OpenSCMDBReader. This holds the index in memory, so it does not need to be read from disk every time we wish to load data.

The easiest way to create a reader is from an existing OpenSCMDB instance.

In [53]:

Copied!

reader = db.create_reader()
# the index is in memory
reader.db_index
reader = db.create_reader()
# the index is in memory
reader.db_index

Out[53]:

					file_id
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	1
		J	Ocean Heat Uptake	high	1
		W / m^2	Effective Radiative Forcing	high	1
		K / yr	Warming rate	high	1
climate_model_b	scenario_a	K	Temperature	high	1
...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	medium	3
climate_model_d	scenario_e	K	Temperature	medium	3
		J	Ocean Heat Uptake	medium	3
		W / m^2	Effective Radiative Forcing	medium	3
		K / yr	Warming rate	medium	3

240 rows × 1 columns

The reader's load method is basically identical to that of OpenSCMDB, it just has to do less work because it doesn't need to read the index.

In [54]:

Copied!

# Having the index in memory can make data reading faster
# (in the case of a large index).
reader.load(pix.isin(unit="K", variant="medium") & pix.ismatch(scenario="*_c"))
# Having the index in memory can make data reading faster
# (in the case of a large index).
reader.load(pix.isin(unit="K", variant="medium") & pix.ismatch(scenario="*_c"))

Out[54]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_c	K	Temperature	medium	336	337	338
climate_model_b	scenario_c	K	Temperature	medium	348	349	350
climate_model_c	scenario_c	K	Temperature	medium	360	361	362
climate_model_d	scenario_c	K	Temperature	medium	372	373	374

By default, the reader is given its own lock. This allows us to use the reader to lock the database.

In [55]:

Copied!

reader.lock.acquire()
reader.lock.acquire()

Out[55]:

<filelock._api.AcquireReturnProxy at 0x792d124471d0>

In [56]:

Copied!





# Now we can't perform operations with any other view into the db,
# so we can gurantee our reader's safety.
try:
    db_other_view.load()
except filelock.Timeout:
    traceback.print_exc(limit=0, chain=False)
# Now we can't perform operations with any other view into the db,
# so we can gurantee our reader's safety.
try:
    db_other_view.load()
except filelock.Timeout:
    traceback.print_exc(limit=0, chain=False)

filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.

In [57]:

Copied!

# If we release the lock, we can load again
reader.lock.release()
db_other_view.load()
# If we release the lock, we can load again
reader.lock.release()
db_other_view.load()

Out[57]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
climate_model_b	scenario_a	K	Temperature	high	12	13	14
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	medium	657	658	659
climate_model_d	scenario_e	K	Temperature	medium	660	661	662
		J	Ocean Heat Uptake	medium	663	664	665
		W / m^2	Effective Radiative Forcing	medium	666	667	668
		K / yr	Warming rate	medium	669	670	671

240 rows × 3 columns

The reader can also be used as a context manager, in which the lock is automatically acquired and released.

In [58]:

Copied!





with reader:
    # The lock is held by the reader here
    # so we can load data via the reader
    reader.load()
    # but we can't e.g. save data via another view into the database
    try:
        db_other_view.save(df_timeseries_like)
    except filelock.Timeout:
        traceback.print_exc(limit=0, chain=False)

# Outside the context block, the lock is released by the reader
# so we can use other views for operations again.
db_other_view.load()
with reader:
    # The lock is held by the reader here
    # so we can load data via the reader
    reader.load()
    # but we can't e.g. save data via another view into the database
    try:
        db_other_view.save(df_timeseries_like)
    except filelock.Timeout:
        traceback.print_exc(limit=0, chain=False)

# Outside the context block, the lock is released by the reader
# so we can use other views for operations again.
db_other_view.load()

filelock._error.Timeout: The file lock '/tmp/tmppxl3pdmo/index.csv.lock' could not be acquired.

Out[58]:

					2010	2015	2025
climate_model	scenario	unit	variable	variant
climate_model_a	scenario_a	K	Temperature	high	0	1	2
		J	Ocean Heat Uptake	high	3	4	5
		W / m^2	Effective Radiative Forcing	high	6	7	8
		K / yr	Warming rate	high	9	10	11
climate_model_b	scenario_a	K	Temperature	high	12	13	14
...	...	...	...	...	...	...	...
climate_model_c	scenario_e	K / yr	Warming rate	medium	657	658	659
climate_model_d	scenario_e	K	Temperature	medium	660	661	662
		J	Ocean Heat Uptake	medium	663	664	665
		W / m^2	Effective Radiative Forcing	medium	666	667	668
		K / yr	Warming rate	medium	669	670	671

240 rows × 3 columns

If you aren't worried about the reader having a lock, you can simply disable it when creating the reader.

In [59]:

Copied!

reader_no_lock = db.create_reader(lock=False)
reader_no_lock.lock is None
reader_no_lock = db.create_reader(lock=False)
reader_no_lock.lock is None

Out[59]:

True

Summary¶

We believe that OpenSCMDB provides a powerful way for saving timeseries data. However, it is in a work progress. If there is a feature you would like or a bug or anything else, please raise an issue.

How to use OpenSCMDB¶

Imports¶

Basics¶

Empty databases¶

Saving data¶

Loading data¶

Metadata¶

Data¶

Sub-selecting¶

Deleting¶

Summary so far¶

Advanced topics¶

Sharing the database¶

Grouping data¶

Parallelisation and progress bars¶

Overwriting data¶

More on grouping¶

Locking¶

OpenSCMDBReader¶

Summary¶