Data transformation

Certain data transformations that are unavailable in TensorFlow/Keras are implemented as custom Keras layers in Koogu.

class koogu.data.tf_transformations.Audio2Spectral(*args: Any, **kwargs: Any)

Layer for converting waveforms into time-frequency representations.

Parameters:
  • fs – Sampling frequency (in Hz) of the data along the last dimension of inputs.

  • spec_settings

    A Python dictionary describing the settings to be used for producing spectrograms. Supported keys in the dictionary include:

    • win_len: (required) Length of the analysis window (in seconds)

    • win_overlap_prc: (required) Fraction of the analysis window by which successive analysis windows overlap. A 50% overlap (i.e., 0.50) is commonly used.

    • nfft_equals_win_len: (optional; boolean) If True (default), NFFT will equal the number of samples resulting from win_len. If False, NFFT will be set to the next power of 2 that is ≥ the number of samples resulting from win_len.

    • tf_rep_type: (optional) A string specifying the transformation output. ‘spec’ results in a linear scale spectrogram. ‘spec_db’ (default) results in a logarithmic scale (dB) spectrogram.

    • eps: (default: 1e-10) A small positive quantity added to avoid computing log(0.0).

    • bandwidth_clip: (optional; 2-element list/tuple) If specified, the generated spectrogram will be clipped along the frequency axis to only include components in the specified bandwidth.

  • eps – (optional) If specified, will override the eps value in spec_settings.

  • name – (optional; string) Name for the layer.
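To make the interplay between the spec_settings keys concrete, here is a minimal NumPy sketch that computes a spectrogram from the same kind of dictionary. It is not Koogu's implementation: the Hann window, the use of magnitudes with 20·log10 for 'spec_db', rounding win_len × fs to an integer number of samples, and the (frequency, time) output layout are all assumptions, and spectrogram_sketch is a hypothetical helper name.

```python
import numpy as np

def spectrogram_sketch(x, fs, spec_settings, eps=None):
    """Illustrate how the spec_settings keys shape a spectrogram.

    Assumptions (not taken from Koogu): Hann window, magnitude spectrum,
    20*log10 for the dB scale, output laid out as (frequency, time).
    """
    x = np.asarray(x, dtype=float)
    win_len = int(round(spec_settings['win_len'] * fs))       # window length in samples
    overlap = int(round(win_len * spec_settings['win_overlap_prc']))
    hop = win_len - overlap                                     # step between successive windows
    if spec_settings.get('nfft_equals_win_len', True):
        nfft = win_len
    else:
        nfft = 1 << (win_len - 1).bit_length()                  # next power of 2 >= win_len

    # Slice the waveform into overlapping frames, window each one, take |FFT|
    num_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] for i in range(num_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win_len), n=nfft, axis=-1)).T
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

    # Optional clipping along the frequency axis
    if 'bandwidth_clip' in spec_settings:
        lo, hi = spec_settings['bandwidth_clip']
        keep = (freqs >= lo) & (freqs <= hi)
        spec, freqs = spec[keep], freqs[keep]

    if spec_settings.get('tf_rep_type', 'spec_db') == 'spec_db':
        # An explicit eps argument overrides spec_settings['eps']
        eps = spec_settings.get('eps', 1e-10) if eps is None else eps
        spec = 20.0 * np.log10(spec + eps)
    return spec, freqs
```

For example, with fs = 1000 and win_len = 0.1 (100 samples), setting nfft_equals_win_len to False gives NFFT = 128 and therefore 65 frequency bins instead of 51.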

class koogu.data.tf_transformations.GaussianBlur(*args: Any, **kwargs: Any)

Layer for applying Gaussian blur to time-frequency representations.

Parameters:
  • sigma – Scalar value defining the width (standard deviation) of the Gaussian kernel.

  • apply_2d – (boolean; default: True) If True, smoothing is applied along both the time and frequency axes. Otherwise, smoothing is applied only along the frequency axis.
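The effect of apply_2d can be illustrated with a separable Gaussian filter in NumPy. This is a sketch under stated assumptions, not the layer's TensorFlow code: sigma is taken as the kernel's standard deviation, the kernel is truncated at ±3σ, edges are zero-padded, and the input is laid out as (frequency, time) with frequency on axis 0. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian kernel, truncated at +/- 3 sigma (assumed extent)."""
    half = int(np.ceil(3 * sigma))
    x = np.arange(-half, half + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur_sketch(spec, sigma, apply_2d=True):
    """Blur a (frequency, time) array; each axis must be at least as long as the kernel."""
    k = gaussian_kernel_1d(sigma)

    def smooth(v):
        return np.convolve(v, k, mode='same')   # zero-padded at the edges

    out = np.apply_along_axis(smooth, 0, np.asarray(spec, dtype=float))  # frequency axis (assumed axis 0)
    if apply_2d:
        out = np.apply_along_axis(smooth, 1, out)                        # time axis (assumed axis 1)
    return out
```

Applied to a single impulse, apply_2d=False spreads energy only along the frequency axis, while the default spreads it along both axes.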

class koogu.data.tf_transformations.Linear2dB(*args: Any, **kwargs: Any)

Layer for converting time-frequency representations from linear to decibel scale.

Parameters:
  • eps – Small positive value added to avoid computing log(0.0).

  • full_scale – (boolean) Whether to convert to decibels relative to full scale (dBFS).
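Why eps matters can be seen in a few lines of NumPy. The sketch assumes power-valued inputs (hence 10·log10) and does not model full_scale, since the reference level used by the layer is not described here; linear_to_db_sketch is a hypothetical name.

```python
import numpy as np

def linear_to_db_sketch(spec, eps=1e-10):
    """10*log10 conversion (assumes power-valued inputs); full_scale is not modeled."""
    return 10.0 * np.log10(np.asarray(spec, dtype=float) + eps)

spec = np.array([1.0, 0.1, 0.0])
with np.errstate(divide='ignore'):
    unguarded = 10.0 * np.log10(spec)    # the zero entry becomes -inf
guarded = linear_to_db_sketch(spec)       # the zero entry becomes 10*log10(1e-10) = -100 dB
```

A -inf value would propagate through any later arithmetic, so eps effectively sets a floor on the dB values.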

class koogu.data.tf_transformations.LoG(*args: Any, **kwargs: Any)

Layer for applying Laplacian of Gaussian (LoG) operator(s) to time-frequency representations (Madhusudhana et al. 2021).

Parameters:
  • scales_sigmas

    Must be a tuple or list of sigma values at different (usually geometrically progressing) scales. The set of sigma values beyond lowest_sigma may be determined as

    $$\sigma_k = \text{lowest\_sigma} \cdot 2^{k-1}, \quad k = 2, 3, \ldots, K, \qquad K = \left\lfloor \log_2 \frac{\text{max\_len} - 1}{(2 \cdot 3) \cdot \text{lowest\_sigma}} \right\rfloor + 1$$

    For example, if lowest_sigma is 4 and max_len is 243, the resulting set of sigma values is (4, 8, 16, 32).

  • add_offsets – If True (default is False), add a trainable offset value to LoG responses.

  • conv_filters – If not None, must be either a single integer (applicable to the outputs of all scales) or a list-like group of integers (one per scale, applicable to the outputs of the respective scales). That many trainable 3x3 filters will be created and applied to the final outputs of this layer.

  • retain_LoG – If True and conv_filters is enabled, the LoG outputs will also be included in the layer's outputs.
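The sigma formula in the scales_sigmas description can be checked with a short standard-library sketch. log_scale_sigmas is a hypothetical helper name, and max_len is used exactly as it appears in the formula.

```python
import math

def log_scale_sigmas(lowest_sigma, max_len):
    """Return lowest_sigma followed by the doubled sigma values from the formula."""
    k_max = math.floor(math.log2((max_len - 1) / ((2 * 3) * lowest_sigma))) + 1
    higher = tuple(lowest_sigma * 2 ** (k - 1) for k in range(2, k_max + 1))
    return (lowest_sigma,) + higher

sigmas = log_scale_sigmas(4, 243)   # (4, 8, 16, 32)
```

Equivalently, the formula keeps every doubling of lowest_sigma for which (2 × 3) × sigma ≤ max_len − 1. In the example, 32 qualifies (192 ≤ 242) while 64 does not (384 > 242).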

class koogu.data.tf_transformations.NormalizeAudio(*args: Any, **kwargs: Any)

Layer for applying normalization to audio. Normalization (mean subtraction followed by scaling to the range [-1.0, 1.0]) is applied by determining the mean and range along the last axis of the inputs.
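A minimal NumPy sketch of this kind of normalization follows. It assumes the scaling divides by the peak absolute value of the mean-removed signal, which guarantees the [-1.0, 1.0] range; the small eps guard for constant (e.g., silent) inputs is a hypothetical addition, as is the function name.

```python
import numpy as np

def normalize_audio_sketch(x, eps=1e-12):
    """Remove the mean and scale to [-1, 1] along the last axis."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=-1, keepdims=True)
    peak = np.abs(centered).max(axis=-1, keepdims=True)
    return centered / np.maximum(peak, eps)   # eps: hypothetical guard for constant input
```

Because statistics are taken along the last axis only, each clip in a batch of shape (B, N) is normalized independently.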

class koogu.data.tf_transformations.Spec2Img(*args: Any, **kwargs: Any)

Layer for converting time-frequency representations into images. The layer’s inputs can either be a single spectrogram (shape: H x W) or a batch of B spectrograms (shape: B x H x W).

Parameters:
  • cmap – An Nx3 array of RGB color values. Typically, N is 256. If cmap also contains alpha values (Nx4 instead of Nx3), the last channel will be discarded. For example, to specify a 'jet' colormap, you could use matplotlib.colormaps['jet'](range(256)).

  • vmin – (optional; default: None) If specified along with vmax, spectrogram values will be scaled to the range [vmin, vmax].

  • vmax – (optional; default: None) If specified along with vmin, spectrogram values will be scaled to the range [vmin, vmax].

  • img_size – (optional; default: None) If not None, must be a 2-element tuple (new H, new W) specifying the size to which the output image(s) will be resized.

  • resize_method – (optional; default: ‘bilinear’) If resizing of spectrogram(s) is enabled (via img_size), this parameter will define the method used for resizing. For available options, see TensorFlow’s tf.image.resize().

If only one of vmin or vmax is specified, it will be ignored and spectrogram values will be scaled relative to the minimum and maximum values within each spectrogram. If both vmin and vmax are specified, vmin must be < vmax.

Returns:

If img_size was None, returns a tensor of shape [H x W x 3] or [B x H x W x 3]. If img_size was specified, H and W are replaced by img_size[0] and img_size[1], respectively.
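The scaling and colormapping rules above can be sketched in NumPy. This is an illustration, not the layer's code: it assumes values are linearly scaled to [0, 1] (and clipped when vmin and vmax are given), quantized to the nearest of the N colormap entries, and looked up; resizing via tf.image.resize() is left out, and spec_to_img_sketch is a hypothetical name.

```python
import numpy as np

def spec_to_img_sketch(spec, cmap, vmin=None, vmax=None):
    """Map an (H, W) or (B, H, W) array to RGB images of shape (..., H, W, 3)."""
    spec = np.asarray(spec, dtype=float)
    cmap = np.asarray(cmap, dtype=float)[:, :3]           # discard alpha if cmap is Nx4
    if vmin is not None and vmax is not None:
        if not vmin < vmax:
            raise ValueError('vmin must be < vmax')
        lo, hi = vmin, vmax
    else:
        # Per-spectrogram min/max over the last two axes (H, W)
        lo = spec.min(axis=(-2, -1), keepdims=True)
        hi = spec.max(axis=(-2, -1), keepdims=True)
    scaled = np.clip((spec - lo) / np.maximum(hi - lo, 1e-12), 0.0, 1.0)
    idx = np.rint(scaled * (len(cmap) - 1)).astype(int)  # nearest colormap entry
    return cmap[idx]                                       # fancy indexing appends the RGB axis
```

With matplotlib installed, matplotlib.colormaps['jet'](range(256)), an Nx4 RGBA array, could be passed as cmap; the sketch drops its alpha column just as the Parameters section describes for the layer.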