Datasets

This section describes how to define the datasets used in an experiment, for both the training and evaluation phases.

A minimal dataset definition looks like this:

datasets:
  - name: dataset_name
    path: /path/to/dataset/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2  
  - ...

The name field identifies the dataset within the experiment, and the path field is the path to the root directory of the dataset.

The split field defines how the dataset is split into training, validation and test sets. The three numbers (each between 0 and 1, both inclusive) represent the fraction of the dataset that will be used for each of the three sets.

Note

The sum of the three numbers must be at most 1.

However, the sum of the three numbers is not required to be exactly 1: for instance, if you don't want the dataset to appear in the validation set, you can set the val field to 0. The same applies to both train and test.

This is particularly useful when you want to use only part of a dataset for testing, because including it in its entirety would unbalance the test set.
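
For example, the following definition (the dataset name is made up for illustration) keeps the dataset out of training and validation entirely and uses only 10% of it for testing:

datasets:
  - name: some_large_dataset
    path: /path/to/dataset/root
    split:
        train: 0
        val: 0
        test: 0.1

Here the splits sum to 0.1, which is allowed because the sum only needs to be at most 1; the remaining 90% of the dataset is simply not used.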

The name, path and split fields are required, but you can also add other fields to the dataset definition, as described in the following sections.

Dataset loading

How a dataset is loaded is defined by its dataset loader: a class that scans the dataset root directory and returns a list of items with their respective classes, from which the whole dataset is then built.

By default, the dataset name also determines the loader that will be used. For instance, if the dataset name is morphdb, the loader used is MorphDBLoader, the default loader for the MorphDB dataset.

Note

The dataset name is case-insensitive, and all hyphens and underscores are ignored, so MORPHDB, MorphDB, morph-db and morph_db are all equivalent and will use the same MorphDBLoader.

To learn how to implement a dataset loader, see the reference for further details.

Some loaders can also accept additional parameters, which can be specified in the loader.args section of the dataset definition:

datasets:
  - name: dataset_name
    path: /path/to/dataset/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2
    loader:
        args:
            arg1: value1
            arg2: value2
            ...  
  - ...

The args field is a dictionary of arguments that will be passed to the dataset loader.
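
As a purely illustrative sketch, a loader that accepts extra parameters might be configured as follows; the arguments shown here are hypothetical and depend entirely on the loader you are using:

datasets:
  - name: dataset_name
    path: /path/to/dataset/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2
    loader:
        args:
            extensions: [".png", ".jpg"]
            recursive: true

Refer to each loader's documentation for the arguments it actually accepts.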

Sometimes, it is useful to explicitly specify the dataset loader to use, even if the dataset name would normally imply a different loader. This can be done by specifying the loader.name field in the dataset definition:

datasets:
  - name: dataset_name
    path: /path/to/dataset/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2
    loader:
        name: MyCustomLoader
        args:
            arg1: value1
            arg2: value2
            ...  
  - ...

Warning

Unlike the name field, the loader.name field is case-sensitive, and an exact match is required.
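
For instance, if you keep a copy of MorphDB registered under a custom dataset name (the name below is hypothetical), you can still load it with the stock loader by specifying it explicitly:

datasets:
  - name: my_morph_copy
    path: /path/to/dataset/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2
    loader:
        name: MorphDBLoader

Without the loader.name field, the name my_morph_copy would not map to MorphDBLoader under the default naming convention.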

Testing groups

By default, the model is evaluated on the entirety of the test set. Sometimes, however, it is useful to evaluate its performance only on a subset of the test set.

To this end, you can assign a dataset to one or more testing groups, and the metrics will be computed for each group separately.

To do so, you can specify the testing_groups field in the dataset definition:

datasets:
  - name: dataset_name
    path: /path/to/dataset/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2
    testing_groups:
        - group1
        - group2
        - ...
  - ...

Note

The testing_groups field is ignored if the dataset is not in the test set.

The model will be evaluated on each of the groups separately, and on the whole test set.
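
As an illustration (dataset and group names are invented), the following configuration yields separate metrics for the groups high_quality and print_scan, in addition to the metrics on the whole test set:

datasets:
  - name: dataset_a
    path: /path/to/dataset_a/root
    split:
        train: 0.7
        val: 0.1
        test: 0.2
    testing_groups:
        - high_quality
  - name: dataset_b
    path: /path/to/dataset_b/root
    split:
        train: 0
        val: 0
        test: 1
    testing_groups:
        - high_quality
        - print_scan

Assuming groups with the same name are shared across datasets, the high_quality group would contain the test items of both datasets, while print_scan would contain only those of dataset_b.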

Warning

Make sure that each group contains at least one item for each class; otherwise, some metrics, such as EER and BPCER@APCER, may not be computed.