Unsupervised Word Discovery: Prominence Segmentation with Clustering and ES-KMeans+

Simon Malan, Benjamin van Niekerk, Herman Kamper

arXiv paper: https://arxiv.org/abs/2409.14486
Code: PromSegClus, ES-KMeans+

Abstract: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster. Project webpage: https://s-malan.github.io/prom-seg-clus.

Prominence-Based Word Segmentation

First, we look at the word segmentation step used in both Prom. Seg. Clus. and ES-KMeans+. The method was first introduced by Pasad et al. in the paper "What do self-supervised speech models know about words?" (https://arxiv.org/abs/2307.00162).

[Image: PromWordSeg]
Fig. 1: An example of word boundaries from the prominence-based approach of Pasad et al. The red line is the dissimilarity curve between adjacent frames, which is smoothed to produce the white line. The crosses are the predicted boundaries; the black vertical lines are the ground-truth boundaries.

Figure 1 shows an example of the prominence-based approach. The code for this step can be found in our GitHub repository: https://github.com/s-malan/prom-wordseg. The boundaries can be tuned using the parameters (distance, window_size, prominence) described in the README; a minimal sketch of the idea is given below. These boundaries serve as input to the lexicon-building step described next.
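To make the idea concrete, here is a rough sketch of prominence-based boundary detection. This is not the prom-wordseg implementation; the function name and default values are illustrative, but the three parameters play the same roles as distance, window_size, and prominence in the README.

import numpy as np
from scipy.signal import find_peaks
from scipy.spatial.distance import cosine

def prominence_boundaries(features, window_size=5, prominence=0.1, distance=5):
    """Sketch: predict word boundaries as prominent peaks in the
    dissimilarity curve between adjacent feature frames (Fig. 1)."""
    # Cosine dissimilarity between adjacent frames (the red line in Fig. 1).
    dissim = np.array([cosine(features[i], features[i + 1])
                       for i in range(len(features) - 1)])
    # Moving-average smoothing (the white line in Fig. 1).
    kernel = np.ones(window_size) / window_size
    smoothed = np.convolve(dissim, kernel, mode="same")
    # Sufficiently prominent, well-separated peaks become the boundaries.
    peaks, _ = find_peaks(smoothed, prominence=prominence, distance=distance)
    return peaks  # frame indices of the predicted word boundaries

# Stand-in features (T frames x D dims), e.g. from a self-supervised model:
boundaries = prominence_boundaries(np.random.randn(200, 768))

Intuitively, prominence trades off over- against under-segmentation, while distance enforces a minimum gap between boundaries.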


Building a Lexicon

Using the prominence-based word boundaries, we can build a lexicon by simply clustering the resulting segments (as in Prom. Seg. Clus.), or we can use dynamic programming to sub-select boundaries that are then used to build the lexicon (as in ES-KMeans+).

[Image: LexiconBuild]
Fig. 2: Our lexicon-building step. After extracting frame-level features (a), PCA dimensionality reduction is applied (b). For each segment from the prominence-based approach (Fig. 1), an averaged embedding is obtained (c). These embeddings are K-means clustered (d) to form a lexicon.

Figure 2 shows the lexicon-building step, which is used by both Prom. Seg. Clus. and ES-KMeans+. Prom. Seg. Clus. directly builds a lexicon as in Figure 2 using the prominence-based boundaries of Figure 1; the full codebase can be found in our GitHub repository: https://github.com/s-malan/prom-seg-clus. ES-KMeans+ also builds a lexicon as in Figure 2, but after extracting the prominence-based word boundaries of Figure 1, it iteratively clusters and re-segments each utterance to select a near-optimal subset of word boundaries with a corresponding lexicon. The full codebase for ES-KMeans+ can be found in our GitHub repository: https://github.com/s-malan/es-kmeans-plus.
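As a rough illustration of the pipeline in Figure 2 (not the actual prom-seg-clus code; the function name and the n_components/n_clusters values are placeholders, not the paper's settings):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def build_lexicon(features, boundaries, n_components=100, n_clusters=5000):
    """Sketch of Fig. 2: (b) PCA-reduce the frame features, (c) average
    them per segment, (d) K-means cluster the segment embeddings."""
    # (b) PCA dimensionality reduction on the frame-level features.
    reduced = PCA(n_components=n_components).fit_transform(features)
    edges = [0, *list(boundaries), len(reduced)]
    # (c) One averaged embedding per predicted segment.
    embeddings = np.stack([reduced[s:e].mean(axis=0)
                           for s, e in zip(edges[:-1], edges[1:]) if e > s])
    # (d) Each K-means cluster ID acts as a word type in the lexicon.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    return kmeans.labels_, kmeans.cluster_centers_

In practice, the PCA and K-means models are fitted on segments pooled over the whole corpus, not a single utterance.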


Evaluation

When using Track 2 of the ZeroSpeech Challenge datasets (downloaded from https://download.zerospeech.com), the ZeroSpeech toolkit (https://github.com/zerospeech/benchmarks) can be used for evaluation. For LibriSpeech or Buckeye, use our evaluation scripts, found in our GitHub repository: https://github.com/s-malan/evaluation.
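To illustrate what such an evaluation computes, below is a generic word-boundary precision/recall/F1 with a tolerance window (20 ms is a common choice). This is a hedged sketch of the standard metric, not necessarily the exact logic in the evaluation repository.

def boundary_scores(ref, hyp, tol=0.02):
    """Boundary precision/recall/F1 with a tolerance window (seconds),
    using greedy one-to-one matching of predicted to reference boundaries."""
    matched, hits = set(), 0
    for b in sorted(hyp):
        # Closest unmatched reference boundary within the tolerance, if any.
        closest = min((r for r in ref if r not in matched and abs(r - b) <= tol),
                      key=lambda r: abs(r - b), default=None)
        if closest is not None:
            matched.add(closest)
            hits += 1
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

print(boundary_scores(ref=[0.0, 0.45, 0.90, 1.30], hyp=[0.01, 0.50, 0.91]))
# two of three predictions match within 20 ms -> approx. (0.67, 0.50, 0.57)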

Results
Table I: Performance (%) of prominence segmentation with clustering (using English HuBERT features) and of other state-of-the-art word segmentation and lexicon-building methods on Track 2 of the ZeroSpeech Challenge.

Table I shows our results on the five-language ZeroSpeech Challenge. For more results and discussion, please refer to our paper.


Resynthesis

Resynthesis is not covered in the paper, but it provides insight into how intelligible the systems are. The basic premise is to recreate target (held-out) utterances using the learned lexicon: each word-like segment in the original utterance is replaced with another segment from the same lexicon cluster. This gives an audible sense of how well our systems have "learned" the language. We provide one example with a target utterance and its resynthesized versions.
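A minimal sketch of this premise (a hypothetical helper, assuming waveform segments have already been cut at the predicted boundaries and grouped by cluster ID):

import numpy as np

def resynthesize(segments, cluster_ids, lexicon, seed=0):
    """Sketch: rebuild an utterance by swapping each word-like waveform
    segment for another segment drawn from the same lexicon cluster.

    segments:    list of 1-D waveform arrays for the target utterance
    cluster_ids: lexicon cluster ID assigned to each segment
    lexicon:     dict mapping cluster ID -> list of candidate waveforms
    """
    rng = np.random.default_rng(seed)
    out = []
    for segment, cid in zip(segments, cluster_ids):
        candidates = lexicon.get(cid, [])
        if candidates:  # replace with a randomly chosen same-cluster exemplar
            segment = candidates[rng.integers(len(candidates))]
        out.append(segment)  # otherwise fall back to the original audio
    return np.concatenate(out)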

Resynthesis Example
[Audio players: Original/Target · Prom. Seg. Clus. · ES-KMeans+]