Quick-start guide

We present here a recipe for a full bioacoustics ML workflow: from data pre-processing, to training, to performance assessment, and finally, to using a trained model to analyze soundscape/field recordings.

As an example, we consider the North Atlantic Right Whale (NARW) up-call dataset from the DCLDE 2013 challenge. The dataset contains 7 days of round-the-clock recordings, of which the first 4 days are earmarked as the training set and the remaining 3 days are set aside as the test set. Each audio file is 15 minutes in duration, and files from each day are organized in day-specific subdirectories. The original dataset contained annotations in the legacy Xbat format, which we converted to the RavenPro selection table format for compatibility with Koogu. A sample of the dataset, with converted annotations, can be accessed here.

You may test the code snippets below yourself, using the sample dataset. Once you have it working, you can modify the program to suit your own dataset.

The code sections below expect the training and test audio files and corresponding annotation files to be organized in a directory structure as shown below:

📁 projects
└─ 📁 NARW
   └─ 📁 data
      ├─ 📁 train_audio
      ├─ 📁 train_annotations
      ├─ 📁 test_audio
      └─ 📁 test_annotations

Imports

First, import the necessary modules and functions from the Koogu package.

from koogu.data import preprocess, feeder
from koogu.model import architectures
from koogu import train, assessments, recognize

from matplotlib import pyplot as plt           # used for plotting graphs

1. Data preparation

Point out where to fetch the training dataset from.

We also need to specify which annotation files correspond to which audio files (or, in this example, to sub-directories containing a collection of files).

# The root directories under which the training data (audio files and
# corresponding annotation files) are available.
audio_root = '/home/shyam/projects/NARW/data/train_audio'
annots_root = '/home/shyam/projects/NARW/data/train_annotations'

# Map audio files (or containing folders) to respective annotation files
audio_annot_list = [
    ['NOPP6_EST_20090328', 'NOPP6_20090328_RW_upcalls.selections.txt'],
    ['NOPP6_EST_20090329', 'NOPP6_20090329_RW_upcalls.selections.txt'],
    ['NOPP6_EST_20090330', 'NOPP6_20090330_RW_upcalls.selections.txt'],
    ['NOPP6_EST_20090331', 'NOPP6_20090331_RW_upcalls.selections.txt'],
]

Define parameters for preparing the training audio, and for converting the resulting clips to spectrograms.

data_settings = {
    # Settings for handling raw audio
    'audio_settings': {
        'clip_length': 2.0,          # clip duration, in seconds
        'clip_advance': 0.4,         # hop between successive clips, in seconds
        'desired_fs': 1000           # sampling rate, in Hz
    },

    # Settings for converting audio to a time-frequency representation
    'spec_settings': {
        'win_len': 0.128,            # spectrogram window length, in seconds
        'win_overlap_prc': 0.75,     # 75% overlap between successive windows
        'bandwidth_clip': [46, 391]  # retain only this frequency band, in Hz
    }
}

1.1. Preprocess

The preprocessing step will split up the audio files into clips (defined by data_settings['audio_settings']), match available annotations to the clips, and mark each clip to indicate if it matched one or more annotations.

We believe that the available annotations in the training set cover almost all occurrences of the target NARW up-calls in the recordings, with no (or only a small number of) missed calls. As such, we can treat all un-annotated time periods in the recordings as inputs for the negative class (by setting the parameter negative_class_label).

# Path to the directory where pre-processed data will be written.
# Directory will be created if it doesn't exist.
prepared_audio_dir = '/home/shyam/projects/NARW/prepared_data'

# Convert audio files into prepared data
clip_counts = preprocess.from_selection_table_map(
    data_settings['audio_settings'],
    audio_annot_list,
    audio_root, annots_root,
    output_root=prepared_audio_dir,
    negative_class_label='Other')

See also

Koogu supports annotations in different popular formats, besides the default RavenPro format. See koogu.data.annotations for a list of supported formats.

See also

If your project does not have annotations, but you have audio files corresponding to each species/call type organized under separate directories, you can pre-process the data using from_top_level_dirs() instead of from_selection_table_map().
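
For reference, a minimal sketch of that alternative workflow might look like the following. The call mirrors the from_selection_table_map() example above, and the class subdirectory names shown are hypothetical; consult the function's API documentation for the exact parameters.

# A sketch (not from the sample dataset): assumes audio_root contains one
# subdirectory per class, e.g. 'upcall' and 'other' (hypothetical names),
# each holding the audio files of that class.
clip_counts = preprocess.from_top_level_dirs(
    data_settings['audio_settings'],
    ['upcall', 'other'],                       # class-specific subdirectories
    audio_root,
    output_root=prepared_audio_dir)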

You can check how many clips were generated for each class:

# Display counts of how many inputs we got per class
for label, count in clip_counts.items():
    print(f'{label:<10s}: {count:d}')

1.2. Feeder setup

Now, we define a feeder that efficiently feeds all the pre-processed clips, in batches, to the training/validation pipeline. The feeder also transforms the audio clips into spectrograms.

Typically, model training is performed on computers having one or more GPUs. While the GPUs consume data at extreme speeds during training, it is imperative that the mechanism to feed the training data doesn’t keep the GPUs waiting for inputs. The feeders provided in Koogu utilize all available CPU cores to ensure that GPU utilization remains high during training.

data_feeder = feeder.SpectralDataFeeder(
    prepared_audio_dir,                        # where the prepared clips are at
    data_settings['audio_settings']['desired_fs'],
    data_settings['spec_settings'],
    validation_split=0.15,                     # set aside 15% for validation
    max_clips_per_class=20000                  # use up to 20k inputs per class
)

The sample dataset considered here contains a large number of annotated calls covering a reasonable range of input variations. As such, we do not employ any data augmentation techniques in this example. However, you could easily add some of the pre-canned data augmentations when you adapt this example to work with your own dataset.

2. Training

First, describe the architecture of the model that is to be used. With Koogu, you do not need to write lots of code to build custom models; simply choose an existing/available architecture (e.g., ConvNet, DenseNet) and specify how you'd want it customized.

In this example, we use a light-weight custom DenseNet architecture.

model = architectures.DenseNet(
    [4, 4, 4],                                 # 3 dense-blocks, 4 layers each
    preproc=[ ('Conv2D', {'filters': 16}) ],   # Add a 16-filter pre-conv layer
    dense_layers=[32]                          # End with a 32-node dense layer
)

The training process, along with hyperparameter and regularization settings, can be controlled by setting appropriate values in the Python dictionary that is passed to train(). See the function API documentation for all available options.

# Settings that control the training process
training_settings = {
    'batch_size': 64,
    'epochs': 50,                              # run for 50 epochs

    # Start with a learning rate of 0.01, and drop it to a tenth of its value,
    # successively, at epochs 20 & 40.
    'learning_rate': 0.01,
    'lr_change_at_epochs': [20, 40],
    'lr_update_factors': [1.0, 1e-1, 1e-2],    # up to 20, beyond 20, beyond 40

    'dropout_rate': 0.05                       # Helps model generalize better
}

# Path to the directory where model files will be written
model_dir = '/home/shyam/projects/NARW/models/my_first_model'

# Perform training
history = train(
    data_feeder,
    model_dir,
    data_settings,
    model,
    training_settings
)

You can visualize how well the training progressed by plotting the contents of the returned history variable.

# Plot training & validation history
fig, ax = plt.subplots(2, sharex=True, figsize=(12, 9))
ax[0].plot(
    history['train_epochs'], history['binary_accuracy'], 'r',
    history['eval_epochs'], history['val_binary_accuracy'], 'g')
ax[0].set_ylabel('Accuracy')
ax[1].plot(
    history['train_epochs'], history['loss'], 'r',
    history['eval_epochs'], history['val_loss'], 'g')
ax[1].set_yscale('log')
ax[1].set_xlabel('Epoch')
ax[1].set_ylabel('Loss')
plt.show()

You may tune the training parameters above and repeat the training step until the training and validation accuracy/loss reach desired levels.

3. Performance assessment

3.1. Run on test dataset

If you have a test dataset available for assessing performance, you can easily run the trained model on that dataset. Simply point out where to fetch the test dataset from.

Similar to how the training annotations were presented (by associating annotation files with audio files), we also need to specify which test annotation files correspond to which test audio files (or, in this example, to sub-directories containing collections of test files).

# The root directories under which the test data (audio files and
# corresponding annotation files) are available.
test_audio_root = '/home/shyam/projects/NARW/data/test_audio'
test_annots_root = '/home/shyam/projects/NARW/data/test_annotations'

# Map audio files to corresponding annotation files
test_audio_annot_list = [
    ['NOPP6_EST_20090401', 'NOPP6_20090401_RW_upcalls.selections.txt'],
    ['NOPP6_EST_20090402', 'NOPP6_20090402_RW_upcalls.selections.txt'],
    ['NOPP6_EST_20090403', 'NOPP6_20090403_RW_upcalls.selections.txt'],
]

Now, apply the trained model to this test dataset. During testing, it is useful to save the raw per-clip recognition scores, which can subsequently be analyzed to assess the model's recognition performance.

# Directory in which raw detection scores will be saved
raw_detections_root = '/home/shyam/projects/NARW/test_audio_raw_detections'

# Run the model (detector/classifier)
recognize(
    model_dir,
    test_audio_root,
    raw_detections_dir=raw_detections_root,
    batch_size=64,     # Increasing this may improve speed if there's enough RAM
    recursive=True,    # Process subdirectories also
    show_progress=True
)

The recognize() function supports many customizations. See function API documentation for more details.

3.2. Determine performance

Now, compute performance metrics.

# Initialize a metric object with the above info
metric = assessments.PrecisionRecall(
    test_audio_annot_list,
    raw_detections_root, test_annots_root)
# The metric supports several options (including setting explicit thresholds).
# Refer to class documentation for more details.

# Run the assessments and gather results
per_class_pr, overall_pr = metric.assess()

Then, visualize the assessments.

# Plot PR curves.
for class_name, pr in per_class_pr.items():
    print('-----', class_name, '-----')
    plt.plot(pr['recall'], pr['precision'], 'rd-')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.grid()
    plt.show()

# Similarly, you could plot the contents of 'overall_pr' too

By analyzing the precision-recall curve, you can pick an operational threshold that yields the desired precision vs. recall trade-off.
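
As a sketch of that selection logic, the snippet below finds the highest-recall operating point that still meets a minimum precision requirement. It assumes the per-class results also expose the evaluated thresholds (the 'thresholds' key and the 'NARW' class label here are hypothetical; refer to the PrecisionRecall class documentation for the actual structure, and to the class names printed above).

import numpy as np

# Hypothetical class label; substitute one of the class names printed above
pr = per_class_pr['NARW']
precision = np.asarray(pr['precision'])
recall = np.asarray(pr['recall'])

# Highest-recall point with precision of at least 0.90
candidates = np.flatnonzero(precision >= 0.90)
if candidates.size > 0:
    best = candidates[np.argmax(recall[candidates])]
    chosen_threshold = pr['thresholds'][best]  # assumed key; see class docs
    print(f'Chosen threshold: {chosen_threshold} '
          f'(P={precision[best]:.2f}, R={recall[best]:.2f})')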

4. Use the trained model

Once you are settled on a choice of detection threshold that yields a suitable precision-recall trade-off, you can apply the trained model on new recordings.

Koogu supports two ways of using a trained model.

4.1. Batch processing

In most common applications, one would want to be able to batch process large collections of audio files with a trained model.

In this mode, automatic recognition results are written out in RavenPro selection table format after applying an algorithm to group together detections of the same class across contiguous clips.

# Path to directory containing audio files (may contain subdirectories too)
field_recordings_root = '/home/shyam/projects/NARW/field_recordings'
field_rec_detections_root = '/home/shyam/projects/NARW/field_rec_detections'

chosen_threshold = 0.75

recognize(
    model_dir,
    field_recordings_root,
    output_dir=field_rec_detections_root,
    threshold=chosen_threshold,
    reject_class='Other',                      # Only output target class dets
    #clip_advance=0.5,                         # Can use different clip advance
    batch_size=64,                             # Can go higher on good computers
    num_fetch_threads=4,                       # Parallel-process for speed
    recursive=True,                            # Process subdirectories also
    show_progress=True
)

The recognize() function supports many customizations. See function API documentation for more details.

4.2. Custom processing

Sometimes, one may need to process audio data that is not available in the form of audio files (or is in unsupported formats). For example, one may want to apply a trained model to live-stream acoustic feeds. Koogu facilitates such use of a trained model via an additional interface in which you implement the task of preparing the data (breaking it up into clips) in the format that the model expects. Then, simply pass the clips to analyze_clips().

from koogu.model import TrainedModel
from koogu.inference import analyze_clips

# Load the trained model
trained_model = TrainedModel(model_dir)

# Read in the audio samples from a file (using one of SoundFile, AudioRead,
# scipy.io.wavfile, etc.), or buffer-in from a live stream.

# As with the model trained in the above example, you may need to resample the
# new data to 1 kHz, and then break them up into clips of length 2 s to match
# the trained model's input size.

not_end = True

while not_end:

    my_clips = ...
    # say we got 6 clips, making it a 6 x 2000 numpy array

    # Run detections and get per-clip scores for each class
    scores, processing_time = analyze_clips(trained_model, my_clips)
    # Given 6 clips, we get 'scores' to be a 6 x 2 array

    # ... do something with the results
    ...
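
As an illustration of that clip-preparation step, here is a minimal sketch (not part of Koogu) that loads a finished recording with the soundfile and scipy packages, resamples it to 1 kHz, and frames it into 2 s clips advancing by 0.4 s, to match the model trained above. The helper name and parameter defaults are illustrative only; adjust them to your own model's settings.

import numpy as np
import soundfile as sf                     # any audio I/O package will do
from scipy.signal import resample_poly

def file_to_clips(path, target_fs=1000, clip_len_s=2.0, clip_advance_s=0.4):
    # Illustrative helper (not part of Koogu): load a recording, resample it
    # to the model's sampling rate, and frame it into fixed-length clips.
    # Assumes the recording is at least one clip long; any trailing partial
    # clip is dropped.
    samples, fs = sf.read(path, dtype='float32')
    if samples.ndim > 1:                   # mix down multi-channel recordings
        samples = samples.mean(axis=1)
    if fs != target_fs:                    # resample to match the trained model
        samples = resample_poly(samples, target_fs, fs)

    clip_len = int(round(clip_len_s * target_fs))
    advance = int(round(clip_advance_s * target_fs))
    starts = range(0, len(samples) - clip_len + 1, advance)
    return np.stack([samples[s:s + clip_len] for s in starts])  # N x 2000

# e.g., my_clips = file_to_clips('/path/to/a/recording.wav')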