Brief Review — CyTex: Transforming speech to textured images for speech emotion recognition

CyTex, Transforms Speech Signal into Image

Sik-Ho Tsang
4 min read · May 30, 2024

CyTex: Transforming speech to textured images for speech emotion recognition
CyTex, by The University of Newcastle, and Islamic Azad University
2022 Elsevier J. Speech Communication, Over 30 Citations (Sik-Ho Tsang @ Medium)

Sound Classification / Audio Tagging / Sound Event Detection (SED)
2015 [ESC-50, ESC-10, ESC-US] 2017 [AudioSet / Audio Set] [M3, M5, M11, M18, M34-res (DaiNet)] [Sample-Level DCNN (LeeNet)] 2021 [Audio Spectrogram Transformer (AST)]
==== My Other Paper Readings Are Also Over Here ====

  • A new speech-to-image transform, CyTex, is proposed that maps the raw speech signal directly to a textured image by using calculations based on the fundamental frequency of each speech frame.
  • The textured RGB images resulting from the CyTex transform are then classified using standard deep neural network models, e.g. ResNet, to recognise different classes of emotion.

Outline

  1. CyTex
  2. Modified ResNet
  3. Results

1. CyTex

1.1. Fundamental Frequency

In the CyTex transform, each period of the speech signal lies in one row of the image, so that consecutive periods of speech form consecutive rows of the output image.

  • The speech signal is analysed according to its fundamental periods, and each period forms a row of the output image. The fundamental period of a frame, 𝑇𝑛, is the reciprocal of its pitch (fundamental) frequency, F0n:
    𝑇𝑛 = 1 / F0n
  • where 𝑛 denotes the frame number.
  • To calculate the fundamental frequency F0n for the 𝑛th frame, the librosa library is used, which tracks pitch on the thresholded parabolically-interpolated STFT.
  • This library requires three input parameters: the frame length, the minimum limit for the pitch and the maximum limit for the pitch. These were set to 10 ms, 40 Hz, and 600 Hz, respectively. This pitch range allows the utterances of people of all ages, genders, and nationalities to be taken into account.
  • In the CyTex image construction procedure, the longest period of the speech signal determines the image width. Assuming a minimum pitch of 40 Hz with a sampling rate of 16 kHz, the longest period, 𝑇𝑚𝑎𝑥, and the maximum number of samples within it, 𝑆𝑇𝑚𝑎𝑥, can be calculated as follows:
    𝑇𝑚𝑎𝑥 = 1 / 40 Hz = 25 ms
    𝑆𝑇𝑚𝑎𝑥 = 𝑇𝑚𝑎𝑥 × 16 kHz = 400 samples
  • Thus, each row of the image should contain 400 pixels. For smaller periods, zero-padding was applied.
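The row construction described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `cytex_channel` and its argument `f0_per_frame` are hypothetical, the per-frame pitch values are assumed to be already available (for example from a pitch tracker), frames with no detected pitch are assumed to fall back to the 40 Hz lower limit, and each row is assumed to start where the previous period ended.

```python
import numpy as np

SR = 16_000                   # sampling rate in Hz (from the text)
F_MIN, F_MAX = 40.0, 600.0    # pitch search range in Hz (from the text)
FRAME_LEN = SR // 100         # 10 ms frames -> 160 samples
ROW_LEN = int(SR / F_MIN)     # longest period: 16000 / 40 = 400 samples


def cytex_channel(signal, f0_per_frame):
    """Stack consecutive pitch periods of `signal` as zero-padded rows."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) == 0:
        return np.zeros((0, ROW_LEN), dtype=np.float32)

    # Assumption: frames with no detected pitch (NaN or <= 0) use F_MIN.
    f0 = np.nan_to_num(np.asarray(f0_per_frame, dtype=np.float64), nan=0.0)
    f0 = np.where(f0 > 0, np.clip(f0, F_MIN, F_MAX), F_MIN)

    rows, pos = [], 0
    while pos < len(signal):
        frame = min(pos // FRAME_LEN, len(f0) - 1)  # frame containing `pos`
        period = int(round(SR / f0[frame]))         # T_n = 1 / F0_n, in samples
        cycle = signal[pos:pos + period]
        row = np.zeros(ROW_LEN, dtype=np.float32)
        row[:len(cycle)] = cycle                    # zero-pad shorter periods
        rows.append(row)
        pos += period                               # next row = next period
    return np.stack(rows)
```

With a constant 100 Hz pitch, each period spans 160 samples, so one second of audio would give 100 rows of 400 pixels, with the last 240 pixels of every row left at zero.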

The steps above create only one channel of the image.

1.2. Gradients

After creating the original image, horizontal and vertical gradients are calculated and used as the second and third channels of an RGB image.

  • The horizontal gradients provide valuable information about the dynamics of speech samples.
  • The vertical gradients capture the dynamical information between speech cycles.

Altogether, we have a 3-channel image, which can easily be fed into a pretrained convolutional neural network (CNN) model.
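The channel stacking can be illustrated with a short sketch. The review does not say which gradient operator is used, so the central differences from `np.gradient` below are an assumption, as is the hypothetical function name `cytex_rgb`. Any scaling of the channels to an 8-bit range is also left out.

```python
import numpy as np


def cytex_rgb(channel):
    """Stack the base image with its horizontal and vertical gradients."""
    channel = np.asarray(channel, dtype=np.float32)
    # np.gradient returns one array per axis: axis 0 = rows, axis 1 = columns.
    grad_rows, grad_cols = np.gradient(channel)
    # Horizontal gradient (grad_cols): change along a row, i.e. between
    # neighbouring samples inside one speech cycle.
    # Vertical gradient (grad_rows): change down a column, i.e. between
    # consecutive speech cycles.
    return np.stack([channel, grad_cols, grad_rows], axis=-1)
```

The result has shape (rows, 400, 3) for a 400-pixel-wide CyTex channel, which matches the height × width × channels layout most image pipelines expect before converting to a CNN input tensor.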

1.3. Analysis

Feature Analysis

Two datasets are analyzed. In both, the highest mean and variance values are obtained for the anger and happiness emotions, and the lowest for sadness.

CyTex Image Examples
  • CyTex images are also visualized above.

2. Modified ResNet

Modified ResNet

As shown in Fig. 5, a modified ResNet152 architecture is used to discriminate between different classes of emotion on CyTex images.

  • The last fully connected layer is replaced with a sequence of dropout and linear fully connected layers, where a batch normalisation layer is inserted after each linear fully connected layer.
  • The weights of the first two blocks of the model were frozen, and only the weights of the remaining blocks were trained.
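The two modifications above might look like this in PyTorch. This is only a sketch: the helper names `make_head` and `adapt_backbone`, the hidden widths, and the dropout rate are hypothetical, since the review does not give them. No activation is placed between the linear layers because the text does not mention one, and mapping "the first two blocks" onto the `layer1` and `layer2` attributes is an assumption.

```python
import torch
from torch import nn


def make_head(in_features, num_classes, hidden=(512, 128), p_drop=0.5):
    """Dropout -> Linear -> BatchNorm blocks, ending in `num_classes` outputs.

    `hidden` and `p_drop` are illustrative values, not taken from the paper.
    """
    layers, width = [], in_features
    for out_width in (*hidden, num_classes):
        layers += [
            nn.Dropout(p_drop),
            nn.Linear(width, out_width),
            nn.BatchNorm1d(out_width),  # batch norm after each linear layer
        ]
        width = out_width
    return nn.Sequential(*layers)


def adapt_backbone(model, num_classes, frozen=("layer1", "layer2")):
    """Freeze the named child blocks and replace `model.fc` with a new head."""
    for name in frozen:
        for param in getattr(model, name).parameters():
            param.requires_grad = False  # excluded from training
    model.fc = make_head(model.fc.in_features, num_classes)
    return model
```

With torchvision, one would presumably pass `torchvision.models.resnet152()` (optionally with pretrained weights) as `model`, since that class exposes `layer1` to `layer4` and a final `fc` layer.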

3. Results

The described ResNet152 model was selected for the rest of the experiments because it showed the best overall results in the pilot tests.

SOTA Comparisons

Second-to-last rows of the two tables: Leaving the proposed approach aside, the highest recognition rates were achieved by models that used the raw speech data and Mel spectrogram images.

Last rows of the two tables: Using the proposed CyTex transform leads to a better recognition rate than the state-of-the-art methods.
