Dataset Exploration
Explore and analyze your dataset to understand annotation patterns, class distributions, and data quality before proceeding with training. This helps you:
validate data quality before training
identify class imbalances that may need addressing
understand the frequency characteristics of calls so you can choose suitable spectrogram settings
spot annotation inconsistencies early in the workflow
$> # Explore and analyze a project's dataset
$> koogu-explore my_project.config
This command analyzes both training and test annotations (if available) and generates interactive HTML reports:
`training_annotations_stats.html` - Analysis of training data annotations
`test_annotations_stats.html` - Analysis of test data annotations
The reports include:
Class distribution - Number of annotations per class
Duration statistics - Distribution of per-class call durations
Bandwidth statistics - Frequency range distribution per class
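To make these three summaries concrete, here is a minimal Python sketch of how per-class counts, durations, and bandwidths could be derived from a set of annotations. It is not Koogu's implementation: the `Annotation` record and the `summarize` function are hypothetical names introduced for illustration only, and the reports themselves show full distributions rather than the simple means computed here.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class Annotation(NamedTuple):
    # Hypothetical record layout; not Koogu's internal representation.
    label: str
    start_s: float   # annotation start time, in seconds
    end_s: float     # annotation end time, in seconds
    low_hz: float    # lower frequency bound, in Hz
    high_hz: float   # upper frequency bound, in Hz


def summarize(annotations):
    """Return per-class count, mean duration (s), and mean bandwidth (Hz)."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann.label].append(ann)

    summary = {}
    for label, anns in grouped.items():
        durations = [a.end_s - a.start_s for a in anns]
        bandwidths = [a.high_hz - a.low_hz for a in anns]
        summary[label] = {
            "count": len(anns),
            "mean_duration_s": mean(durations),
            "mean_bandwidth_hz": mean(bandwidths),
            "min_freq_hz": min(a.low_hz for a in anns),
            "max_freq_hz": max(a.high_hz for a in anns),
        }
    return summary


# Made-up annotations, purely for demonstration.
example = [
    Annotation("whistle", 0.0, 1.0, 4000.0, 12000.0),
    Annotation("whistle", 2.0, 5.0, 5000.0, 11000.0),
    Annotation("click", 1.0, 1.5, 20000.0, 40000.0),
]
stats = summarize(example)
for label, values in stats.items():
    print(label, values)
```

A large gap between the counts of two classes in such a summary is the kind of imbalance the class distribution report is meant to surface.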
If [data.audio] and [data.spec] are set in the config file, it also prints out information about the model inputs that would result from applying the defined transformations.
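As an illustration of the kind of model-input information involved, the sketch below estimates the shape of a spectrogram computed from a fixed-length audio clip. The parameter names (`clip_len_s`, `fs_hz`, `win_len_s`, `overlap`) are hypothetical and may not match the keys Koogu expects in the config file. The arithmetic assumes a plain short-time Fourier transform with no padding and no frequency cropping, so the values that koogu-explore prints may differ.

```python
def spectrogram_shape(clip_len_s, fs_hz, win_len_s, overlap):
    """Estimate (frequency bins, time frames) for an unpadded STFT.

    All parameter names are hypothetical, chosen for this sketch only.
    `overlap` is the fraction of each window shared with the next one.
    """
    n_samples = int(round(clip_len_s * fs_hz))
    win_samples = int(round(win_len_s * fs_hz))
    hop = win_samples - int(round(overlap * win_samples))
    if hop <= 0:
        raise ValueError("overlap must leave a hop of at least one sample")
    if win_samples > n_samples:
        raise ValueError("window is longer than the clip")

    # Number of complete windows that fit in the clip.
    n_frames = 1 + (n_samples - win_samples) // hop
    # One-sided spectrum with the FFT size equal to the window length.
    n_bins = win_samples // 2 + 1
    return n_bins, n_frames


# A 2 s clip at 1 kHz, 128 ms windows, 75% overlap.
shape = spectrogram_shape(clip_len_s=2.0, fs_hz=1000, win_len_s=0.128, overlap=0.75)
print(shape)
```

Running this kind of calculation before training makes it easier to see how a change in window length or overlap alters the size of the inputs the model will receive.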
Parameters
Positional arguments
<CONFIG FILE/ANNOTATIONS DIR>
If running within the scope of a Koogu project, specify the path to the project config file (information about training and test annotations will be read from that config file). Otherwise, specify the path to the directory containing the set of annotation files you wish to analyze (also see the additional options under Direct access).
Direct access
These options apply only when running outside the scope of a Koogu project, i.e., when directly accessing annotation files by specifying <ANNOTATIONS DIR>. If <CONFIG FILE> was specified, all of the options below are ignored.
--reader
Set this based on the format of your annotation files.
--filelist
Path to a text file containing one-per-line entries of the annotation files to analyze. If not specified, all discoverable files under <ANNOTATIONS DIR> will be analyzed.
--output
Analysis outputs (HTML) will be written to this file, if specified. Otherwise, outputs will be written to the file annotation_stats.html in the current directory.
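If you want to restrict the analysis to a subset of files, the text file passed to --filelist can be generated with a short script. The sketch below is one possible approach, under two assumptions that are not confirmed by this page: that entries are paths relative to <ANNOTATIONS DIR>, and that your annotation files share a common suffix (`.txt` is only a placeholder). The function name `write_filelist` is hypothetical.

```python
from pathlib import Path


def write_filelist(annotations_dir, filelist_path, suffix=".txt"):
    """Write one annotation file path per line and return the entries.

    Assumes entries should be relative to `annotations_dir`; adjust if
    your setup needs absolute paths. Keep `filelist_path` outside
    `annotations_dir` so the list does not end up listing itself.
    """
    root = Path(annotations_dir)
    entries = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(f"*{suffix}")
        if p.is_file()
    )
    Path(filelist_path).write_text("".join(f"{entry}\n" for entry in entries))
    return entries
```

You can edit the resulting file by hand to drop entries before passing it to the command.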
Process control
--threads NUM
Number of threads to spawn for parallel execution.
Default: as many threads as there are CPUs