Research Article

The Journal of the Acoustical Society of Korea. 30 September 2025. 516-523
https://doi.org/10.7776/ASK.2025.44.5.516


I. Introduction

The explosive growth of the digital music industry over the past decade has led to the release of vast amounts of new music every year.[1] This surge has driven significant interest in technologies that can automatically analyze and organize large-scale music collections.[2] One such field is Music Information Retrieval (MIR), which aims to extract high-level semantic information from audio signals, including genre, artist identity, tempo, and mood. These extracted representations serve as the foundation for various downstream tasks such as classification, recommendation, and music generation.[3,4]

Deep Neural Networks (DNNs) have shown remarkable success across many domains due to their strong capacity for data-driven feature learning. In MIR, DNNs have become increasingly dominant, with Convolutional Neural Networks (CNNs) in particular proving to be well-suited for learning from raw or spectrogram-based audio inputs.[5,6,7,8] This suitability arises from the hierarchical nature of music: short-term features like pitch, rhythm, and timbre accumulate over time to form higher-level patterns such as melody, emotional tone, or structure.[9] CNNs inherently exploit this compositionality by progressively aggregating local features into global representations through stacked convolutional layers.

Motivated by these characteristics, many prior works have designed CNN-based MIR models that capture either local or global musical traits. However, recent findings suggest that effective music representations are not confined to deep layers alone. Previous works demonstrated that shallow-layer features can also carry rich musical information, highlighting the importance of incorporating multiple levels of abstraction in MIR models.[10,11] Building on these insights, we identify two key considerations for designing an effective MIR system: (i) the ability to capture features across a wide range of time scales, and (ii) the ability to utilize representations from different processing depths.

To address both, we propose the Multi-Route Neural Network (MRNet), a CNN-based architecture specifically designed to extract music representations at various temporal resolutions and abstraction levels. MRNet is constructed by stacking Res2Blocks[12] with varying dilation rates along the time axis, allowing each block to specialize in a different temporal context. We further introduce the multi-route Res2Block (MRBlock), an enhanced module that splits the feature extraction path into three branches. Each branch is processed at a different depth, enabling parallel extraction of low-, mid-, and high-level features from a shared input.

We conduct evaluations of MRNet using four widely used datasets: GTZAN,[13] FMA Small, FMA Large,[14] and Melon Playlist,[15] with an emphasis on music classification tasks such as genre prediction. Through these experiments, we demonstrate that MRNet outperforms conventional CNN-based architectures and effectively learns hierarchical, multi-scale representations suitable for MIR.

II. Related Works

CNNs have been widely adopted in MIR due to their ability to model hierarchical patterns in time–frequency representations.[5,11] Many prior studies have explored CNN-based architectures tailored to various MIR tasks. For example, CNNs have been applied to timbre classification,[8] music tagging,[7] and genre classification across multiple datasets.[6,10,16] CNN-based models have also shown promise in music recommendation systems by learning user preference aligned representations from audio content.[9]

A key reason for CNNs’ popularity in MIR is their effectiveness in capturing both local and global acoustic features through progressive convolutional layers.[5] This aligns well with the hierarchical nature of music, where short-term elements such as pitch or rhythm combine over time to form long-term patterns like melody and mood.[2,4] However, many existing CNN architectures treat deep-layer features as the primary source of semantic information, often overlooking the potential of shallow-layer representations.

Recent work by Liu et al.[10] challenged this assumption by showing that shallow features can also carry discriminative information in MIR tasks. Their findings highlight the need for architectures that integrate multi-level processing and adapt to varying temporal resolutions.

III. Proposed Method

Our goal is to design a neural architecture capable of learning music representations across both diverse temporal ranges and multiple abstraction levels. To this end, we propose the Multi-Route Neural Network (MRNet), a CNN-based framework that captures features from low, mid, and high processing depths while also modeling temporal information at multiple resolutions.

3.1 Overall Architecture

Fig. 1 presents the overall structure of MRNet. MRNet comprises three MRBlocks, five convolutional layers, and a classifier. A detailed description of MRBlock, a variant of the Res2Block, is provided in Fig. 2 and the following subsection. To capture temporal structures of varying durations, the MRBlocks are stacked with increasing dilation rates (2, 3, and 4), allowing each block to specialize in a different temporal context. We explored various configurations and numbers of stacked blocks and found that this design consistently yielded the best performance across our experiments.
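To illustrate the effect of the increasing dilation rates, the following sketch stacks three 1-D convolutions with dilations 2, 3, and 4; the kernel size of 3 and the channel width of 64 are placeholder assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Three stacked 1-D convolutions with the dilation rates used in MRNet (2, 3, 4).
# Kernel size and channel width are assumptions chosen only for this example.
blocks = nn.Sequential(
    nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding=2),
    nn.Conv1d(64, 64, kernel_size=3, dilation=3, padding=3),
    nn.Conv1d(64, 64, kernel_size=3, dilation=4, padding=4),
)

x = torch.randn(1, 64, 200)   # (batch, channels, time)
y = blocks(x)                 # same temporal length, wider temporal context
print(y.shape)                # torch.Size([1, 64, 200])
# With kernel size 3, the stacked dilations widen the receptive field to
# 1 + 2*(2 + 3 + 4) = 19 frames per output step.
```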

Fig. 1. (Color available online) The overall architecture of the proposed Multi-Route Neural Network. The feature map Xconv is constructed by concatenating the outputs o2(1), o2(2), and o2(3).

Fig. 2. (Color available online) Description of our proposed multi-route Res2Block. k denotes the kernel size of the convolution. The blue-highlighted region indicates the original Res2Block.

Given an input feature map, the first convolutional layer produces the input to the first MRBlock. For each MRBlock, the input is either the initial convolution output (when i=1) or the element-wise sum of the previous MRBlock’s second output o2(i-1) and its own input:

(1)
X_m^{(i)} = \begin{cases} \mathrm{Conv1D}\left(X_{\mathrm{input}}\right), & i = 1 \\ X_m^{(i-1)} + o_2^{(i-1)}, & i > 1 \end{cases}

Each MRBlock produces three outputs o1(i), o2(i), and o3(i). These are generated by three independent SE layers without hierarchical dependency, and the indices are assigned sequentially for notational clarity. The input of the MRBlock, X, is split into segments x_i, each of which is processed with a residual convolution to produce y_i. The resulting y_i features are further transformed into z_i and passed through SE modules to generate the final outputs o1(i), o2(i), and o3(i). These outputs are used to extract representations at different depths: low-level features (L1, L2, L3 in Fig. 1) are taken directly from o1(i) without further processing, while the high-level feature (H) is computed by concatenating o2(1), o2(2), and o2(3) and applying an additional convolution layer:

(2)
X_{\mathrm{conv}} = \mathrm{concat}\left(o_2^{(1)}, o_2^{(2)}, o_2^{(3)}\right)
(3)
H = \mathrm{Conv1D}\left(X_{\mathrm{conv}}\right)

Mid-level features (M1, M2, M3) are generated by adding the global context H to each o3(1), o3(2), o3(3) and passing the result through a separate convolution:

(4)
M_i = \mathrm{Conv1D}\left(o_3^{(i)} + H\right)

This architectural design allows MRNet to extract low-, mid-, and high-level features in parallel, offering diverse representations for downstream classification.
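To make the routing in Eqs. (1)-(4) concrete, the sketch below shows one possible PyTorch realization of the MRNet forward pass. It is a minimal illustration under our reading of the text: `mrblock_cls` is a placeholder for any block returning the three outputs o1, o2, o3 of shape (batch, channels, time), and the stem kernel size and channel width are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MRNetSketch(nn.Module):
    """Illustrative forward pass for Eqs. (1)-(4); not the reference implementation."""

    def __init__(self, mrblock_cls, channels=256):
        super().__init__()
        self.stem = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        # Dilation rates 2, 3, 4 follow Section 3.1; mrblock_cls is assumed to
        # return three tensors (o1, o2, o3), each of shape (B, C, T).
        self.blocks = nn.ModuleList(
            mrblock_cls(channels, dilation=d) for d in (2, 3, 4)
        )
        self.high_conv = nn.Conv1d(3 * channels, channels, kernel_size=1)   # Eq. (3)
        self.mid_convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=1) for _ in range(3)  # Eq. (4)
        )

    def forward(self, x):
        x = self.stem(x)                          # Eq. (1), i = 1
        lows, o2s, o3s = [], [], []
        for block in self.blocks:
            o1, o2, o3 = block(x)
            lows.append(o1)                       # L1, L2, L3
            o2s.append(o2)
            o3s.append(o3)
            x = x + o2                            # Eq. (1), i > 1
        high = self.high_conv(torch.cat(o2s, dim=1))                       # Eqs. (2)-(3)
        mids = [conv(o3 + high) for conv, o3 in zip(self.mid_convs, o3s)]  # Eq. (4)
        return lows + mids + [high]               # the seven routes fed to the classifier
```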

3.2 Multi-Route Res2Block (MRBlock)

MRBlock is the core component that enables MRNet to extract multi-depth features. Inspired by findings that shallow-layer features can be effective in MIR tasks, MRBlock explicitly separates feature processing into three branches.

As illustrated in Fig. 2, the input feature map is first passed through a convolution layer and split into four segments (y1, ... , y4) along the channel axis. Each segment is processed independently and then recombined. The output is subsequently divided into three segments corresponding to o1(i), o2(i) and o3(i)​, and each is individually refined using a Squeeze-and-Excitation (SE) mechanism. This split-aggregate-refine process allows each output to emphasize a different level of representation while sharing the same temporal scope.
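The following is a hedged sketch of the split-aggregate-refine flow described above: a Res2Net-style split into four segments with hierarchical residual convolutions, recombination, and three SE-refined outputs. The kernel sizes, the SE reduction ratio, and the exact way the recombined features are divided into the three outputs are assumptions, since the paper does not fix these details.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation over channels; the reduction ratio of 8 is an assumption."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, T)
        w = self.fc(x.mean(dim=2))                 # squeeze along time
        return x * w.unsqueeze(-1)                 # channel-wise re-weighting


class MRBlockSketch(nn.Module):
    """Illustrative MRBlock: four-way Res2Net split, recombination, three SE outputs."""

    def __init__(self, channels, dilation=2):
        super().__init__()
        width = channels // 4
        self.pre = nn.Conv1d(channels, channels, kernel_size=1)
        # One dilated conv per segment except the first, as in the original Res2Net.
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size=3, dilation=dilation, padding=dilation)
            for _ in range(3)
        )
        self.post = nn.Conv1d(channels, 3 * channels, kernel_size=1)
        self.ses = nn.ModuleList(SE(channels) for _ in range(3))

    def forward(self, x):                          # x: (B, C, T), C divisible by 4
        segs = torch.chunk(self.pre(x), 4, dim=1)  # split into four segments
        outs, prev = [segs[0]], None
        for conv, seg in zip(self.convs, segs[1:]):
            prev = conv(seg if prev is None else seg + prev)   # hierarchical residual path
            outs.append(prev)
        merged = self.post(torch.cat(outs, dim=1))             # aggregate
        o1, o2, o3 = torch.chunk(merged, 3, dim=1)             # three routes
        return tuple(se(o) for se, o in zip(self.ses, (o1, o2, o3)))   # refine with SE
```

Under these assumptions, such a block could be plugged into the earlier sketch as, e.g., MRNetSketch(MRBlockSketch, channels=256).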

3.3 Classifier

MIR tasks vary in nature, and different target types (e.g., genre, mood) may rely more heavily on features from specific abstraction levels. To accommodate this, MRNet includes a classifier that adaptively weights and integrates the extracted features (L1, ... , L3, M1, ... , M3, H).

As shown in Table 1, each of the seven features is first passed through an Attentive Statistics Pooling (ASP) layer, which summarizes the temporal sequence into a fixed-length vector using attention-weighted mean and standard deviation. These vectors are then scaled by a learnable weight vector W ∈ ℝ^{7×1} and passed through a linear classifier. This design enables the network to learn task-dependent importance across multiple feature depths.

Table 1.

Detailed structure of the classifier. ASP denotes Attentive Statistics Pooling, which is applied for global pooling along the temporal axis. BN refers to 1D batch normalization.

Name        Layer                  Input          Output
Classifier  ASP × 7                (7 × C × T)    (7 × C)
            W                      (7 × C)        (7 × C)
            BN & Linear            (7C)           # of class
ASP         Conv (k = 1, s = 1)    (C × T)        (C/16 × T)
            Conv (k = 1, s = 1)    (C/16 × T)     (C × T)
            Softmax                (C × T)        (C × T)
            Statistics pool        (C × T)        (C × 2)
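The sketch below mirrors the shapes listed in Table 1: the attention branch of the ASP maps C channels to C/16 and back, applies a softmax along time, and pools an attention-weighted mean and standard deviation, after which a learnable seven-element vector W scales the pooled routes before batch normalization and the linear layer. The tanh non-linearity, the flattened dimension (2C per route, from the mean/std concatenation), and the clamping constant are assumptions where Table 1 leaves details open.

```python
import torch
import torch.nn as nn

class ASP(nn.Module):
    """Attentive Statistics Pooling shaped after Table 1 (C -> C/16 -> C, softmax over time)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, channels // 16, kernel_size=1),
            nn.Tanh(),                                   # common choice, assumed here
            nn.Conv1d(channels // 16, channels, kernel_size=1),
        )

    def forward(self, x):                                # x: (B, C, T)
        w = torch.softmax(self.attn(x), dim=2)           # attention weights over time
        mean = (x * w).sum(dim=2)
        std = ((x.pow(2) * w).sum(dim=2) - mean.pow(2)).clamp(min=1e-6).sqrt()
        return torch.cat([mean, std], dim=1)             # (B, 2C)


class ClassifierSketch(nn.Module):
    """Seven ASP heads, a learnable per-route scale W, then BN and a linear classifier."""
    def __init__(self, channels, num_classes, num_routes=7):
        super().__init__()
        self.pools = nn.ModuleList(ASP(channels) for _ in range(num_routes))
        self.route_weight = nn.Parameter(torch.ones(num_routes))     # W in Section 3.3
        self.bn = nn.BatchNorm1d(num_routes * 2 * channels)
        self.fc = nn.Linear(num_routes * 2 * channels, num_classes)

    def forward(self, routes):                           # list of seven (B, C, T) tensors
        pooled = [w * pool(r) for w, pool, r in zip(self.route_weight, self.pools, routes)]
        return self.fc(self.bn(torch.cat(pooled, dim=1)))
```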

IV. Experiment

4.1 Dataset

We evaluated MRNet on four publicly available music datasets commonly used in music information retrieval tasks. GTZAN[13] contains 1,000 audio tracks categorized into 10 distinct genres, each with a fixed duration of 30 seconds, and is widely used as a benchmark for genre classification. We also used both the small and large subsets of the Free Music Archive (FMA)[14] dataset. The small subset consists of 8,000 samples evenly distributed across 8 genre categories. The large subset contains 106,574 tracks labeled with 161 genre categories, with a highly imbalanced class distribution. The Melon Playlist[15] dataset includes 649,091 songs accompanied by metadata such as artist, album, and genre. For our experiments, we clustered tracks based on their genre annotations to create a genre classification task. Note that only spectrograms were available in this dataset; raw audio waveforms were not provided. For dataset splits, we followed the official training and evaluation protocols provided with the FMA datasets. For the GTZAN and Melon datasets, we applied 10-fold cross-validation following common practice in prior research. To facilitate reproducibility, we have also released the implementation of our K-fold splitting procedure on GitHub.1)
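The exact split generation is released in the linked repository; the snippet below is only a generic illustration of a 10-fold protocol with scikit-learn, and the stratification choice, random seed, and placeholder track list are our assumptions.

```python
from sklearn.model_selection import StratifiedKFold

def make_folds(paths, labels, n_splits=10, seed=42):
    """Yield (fold, train_idx, test_idx) tuples for a stratified K-fold protocol."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    yield from enumerate(skf.split(paths, labels))

# Toy usage with placeholder tracks: 10 genres, 10 tracks each.
paths = [f"track_{i}.wav" for i in range(100)]
labels = [i % 10 for i in range(100)]
for fold, (train_idx, test_idx) in make_folds(paths, labels):
    print(fold, len(train_idx), len(test_idx))   # 90 training / 10 test tracks per fold
```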

4.2 Experiment setting

We used classification accuracy as the primary evaluation metric across all datasets. For the FMA Large and Melon datasets, which contain highly imbalanced genre distributions, we additionally report the macro-averaged F1-score to better reflect performance across classes.
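For reference, the macro-averaged F1-score computes the F1-score per class and averages the results without weighting by class frequency, so rare genres count as much as common ones; with scikit-learn this is a single call (the toy labels below are only for illustration).

```python
from sklearn.metrics import f1_score

# Toy genre labels; average="macro" gives every class equal weight,
# which is why it complements accuracy on the imbalanced datasets.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
print(f1_score(y_true, y_pred, average="macro"))
```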

To obtain input features, we employed WavLM[17] as a pretrained feature extractor for the GTZAN and FMA datasets. WavLM has demonstrated strong performance in various audio-related tasks, outperforming traditional hand-crafted features. Each input waveform was transformed into frame-level embeddings using the base version of WavLM.

For the Melon dataset, since raw waveforms were unavailable, we extracted 48-dimensional Mel-spectrograms directly from the provided data. WavLM was not used for this dataset. During training, we used the AdamW optimizer with a weight decay of 10⁻⁴. The learning rate was initialized at 10⁻³ and decayed to 5 × 10⁻⁴ following a cosine annealing schedule. Models were trained for 1,000 epochs with a mini-batch size of 48. All experiments were conducted on two NVIDIA A5000 GPUs. The full experimental code is available on GitHub.
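A minimal sketch of the optimization setup described above (AdamW with weight decay 10⁻⁴, cosine annealing from 10⁻³ to 5 × 10⁻⁴ over 1,000 epochs, mini-batches of 48). The placeholder linear model, the dummy batch, and stepping the scheduler once per epoch are assumptions made only to keep the example runnable.

```python
import torch

# Placeholder model and data; the optimizer and schedule follow Section 4.2.
model = torch.nn.Linear(768, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(1000):
    # In the real setup each epoch iterates mini-batches of 48 feature sequences;
    # a single dummy batch stands in here.
    x, y = torch.randn(48, 768), torch.randint(0, 10, (48,))
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # cosine decay of the learning rate, applied per epoch (assumed)
```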

4.3 Results

Comparison with Previous Works. To assess the competitiveness of the proposed MRNet, we conducted music genre classification experiments across four benchmark datasets. The results are summarized in Table 2. In addition to accuracy, we report macro F1-scores for the FMA Large and Melon datasets, which exhibit significant class imbalance. MRNet achieved the best performance on all four datasets, recording classification accuracies of 94.5 %, 56.6 %, 63.2 %, and 71.3 % on GTZAN, FMA Small, FMA Large, and Melon, respectively. The gains on FMA Large and Melon are particularly notable, suggesting that MRNet is well-suited to datasets with large-scale, fine-grained class structures. These results demonstrate MRNet’s superior generalization capability and robust clustering of musical characteristics, even in complex and diverse music collections.

Table 2.

Experimental results across various datasets. Macro F1-scores are reported only for unbalanced datasets to complement the accuracy metric.

Models                 Dataset        Samples    Accuracy (%)    Macro-F1
MoER[18]               GTZAN          1,000      86.4            N/A
BBNN[10]                                         93.9            N/A
Siddiquee et al.[19]                             90.0            N/A
MRNet (ours)                                     94.5            N/A
MoER[18]               FMA (Small)    8,000      55.9            N/A
BBNN[10]*                                        54.8            N/A
LFCNet[16]                                       55.1            N/A
MRNet (ours)                                     56.6            N/A
BBNN[10]*              FMA (Large)    106,574    53.9            0.34
LFCNet[16]*                                      52.7            0.35
MRNet (ours)                                     63.2            0.38
ResNet34               Melon          649,091    63.6            0.36
SE-ResNet34                                      64.1            0.38
BBNN[10]*                                        60.2            0.36
LFCNet[16]*                                      64.7            0.41
MRNet (ours)                                     71.3            0.55

* indicates results from our implementation.

Ablation Study. To validate the contribution of the multi-route architecture, we conducted a route ablation study using the FMA Small dataset. Table 3 shows the performance degradation when individual routes were disabled.

Table 3.

Results of the route ablation experiments on the FMA Small dataset.

L1    L2    L3    M1    M2    M3    H     Acc (%)
                                          56.6
×                                         53.9
      ×                                   51.9
            ×                             54.5
                  ×                       54.8
                        ×                 54.6
                              ×           53.0
                                    ×     53.0

The second row in the table presents the baseline performance of the full MRNet model, which utilizes all seven feature paths. Rows 3 to 9 depict the results when one of the feature branches (e.g., L1, M2, H, etc.) was removed during training and evaluation. In all configurations, performance dropped below the original 56.6 % accuracy of the full model. These findings confirm that each route contributes meaningfully to the model’s discriminative power and that the full multi-path design is essential for optimal performance.

Feature utilization by task. MRNet extracts seven feature representations: three from low-level (L1, ..., L3), three from mid-level (M1, ..., M3), and one from high-level (H) branches. Depending on the task, the importance of each feature type may vary. For example, mood classification may rely more on long-term global features, while genre classification might benefit from short-term local patterns.

To explore this, we examined the learned weights of the feature scaling vector W introduced in Section 3.3. Fig. 3 illustrates the distribution of W values after training MRNet for both genre and mood classification tasks.

Fig. 3. (Color available online) Task-specific distribution of the learned feature scaling vector W for genre and mood classification.

The analysis shows that low-level features, especially L1 and L2, were most influential in genre classification, supporting prior findings that shallow features are highly effective for this task. In contrast, mood classification placed more weight on L3, M3, and H features, which have deeper and broader temporal receptive fields. These results confirm that MRNet dynamically adjusts its feature emphasis according to the target MIR objective, enhancing task-specific performance.

Although the contribution of M1, M2, M3 and H appeared relatively weak in the FMA Small dataset, Table 3 shows that removing mid-level features consistently degraded performance, indicating that these branches are not redundant and do contribute to the overall effectiveness of MRNet.

V. Conclusions

In this paper, we introduced MRNet, a novel convolutional architecture tailored for MIR. MRNet is designed to capture musical representations across multiple time scales and processing depths by employing distinct feature extraction routes. Leveraging a stack of dilated Res2Blocks and the proposed multi-route Res2Block (MRBlock), MRNet effectively extracts low-, mid-, and high-level features in parallel.

Through extensive evaluations on four benchmark datasets including GTZAN, FMA Small, FMA Large, and Melon Playlist, we demonstrated that MRNet consistently outperforms previous CNN-based approaches in genre classification tasks. In addition, our analysis of the learned feature scaling weights revealed that MRNet can dynamically prioritize different types of features depending on the specific MIR objective, such as genre or mood classification.

These results highlight MRNet’s ability to learn robust, hierarchical representations that are both flexible and task-adaptive. As part of future work, we plan to extend MRNet to other MIR tasks beyond classification, including playlist recommendation, emotion prediction, and music generation, to further explore its generalization and compositional capabilities. Furthermore, we intend to analyze the differences between misclassified and correctly classified tracks to better understand the limitations of MRNet.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2023R1A2C1005744).

Notes

1) https://github.com/Jungwoo4021/MRNet

References

[1] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech Audio Process. 10, 293-302 (2002). DOI: 10.1109/TSA.2002.800560

[2] A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, “Content-based music information retrieval: Current directions and future challenges,” Proc. IEEE, 96, 668-696 (2008). DOI: 10.1109/JPROC.2008.916370

[3] D. Bogdanov, N. Wack, E. Gómez Gutiérrez, S. Gulati, P. H. Boyer, O. Mayor, G. R. Trepat, J. Salamon, J. R. Z. González, and X. Serra, “Essentia: An audio analysis library for music information retrieval,” Proc. 14th ISMIR, 493-498 (2013).

[4] A. Van Den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” Proc. NIPS, 26 (2013).

[5] K. Choi, G. Fazekas, M. Sandler, and K. Cho, “Convolutional recurrent neural networks for music classification,” Proc. ICASSP, 2392-2396 (2017). DOI: 10.1109/ICASSP.2017.7952585

[6] Y. Xu and W. Zhou, “A deep music genres classification model based on CNN with Squeeze & Excitation Block,” Proc. IEEE APSIPA ASC, 332-338 (2020).

[7] A. Ferraro, D. Bogdanov, X. S. Jay, H. Jeon, and J. Yoon, “How low can you go? Reducing frequency and time resolution in current CNN architectures for music auto-tagging,” Proc. 28th EUSIPCO, 131-135 (2021). DOI: 10.23919/Eusipco47968.2020.9287769

[8] D. Kim, T. T. Sung, Y. S. Cho, G. Lee, and B. C. Sohn, “A single predominant instrument recognition of polyphonic music using CNN-based timbre analysis,” Int. J. Eng. Technol. 7, 590-595 (2018). DOI: 10.14419/ijet.v7i3.34.19388

[9] S. Joshi, T. Jain, and N. Nair, “Emotion based music recommendation system using LSTM-CNN architecture,” Proc. IEEE ICCCNT, 1-6 (2021). DOI: 10.1109/ICCCNT51525.2021.9579813

[10] C. Liu, L. Feng, G. Liu, H. Wang, and S. Liu, “Bottom-up broadcast neural network for music genre classification,” Multimed. Tools Appl. 80, 7313-7331 (2021). DOI: 10.1007/s11042-020-09643-6

[11] J. Lee and J. Nam, “Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging,” IEEE Signal Process. Lett. 24, 1208-1212 (2017). DOI: 10.1109/LSP.2017.2713830

[12] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2Net: A new multi-scale backbone architecture,” IEEE Trans. Pattern Anal. Mach. Intell. 43, 652-662 (2019). DOI: 10.1109/TPAMI.2019.2938758

[13] I. Ikhsan, L. Novamizanti, and I. N. A. Ramatryana, “Automatic musical genre classification of audio using Hidden Markov Model,” Proc. 2nd ICoICT, 397-402 (2014). DOI: 10.1109/ICoICT.2014.6914095

[14] FMA: A Dataset for Music Analysis, https://arxiv.org/abs/1612.01840, (Last viewed September 16, 2025).

[15] A. Ferraro, Y. Kim, S. Lee, B. Kim, N. Jo, S. Lim, S. Lim, J. Jang, S. Kim, and X. Serra, “Melon playlist dataset: A public dataset for audio-based playlist generation and music tagging,” Proc. IEEE ICASSP, 536-540 (2021). DOI: 10.1109/ICASSP39728.2021.9413552

[16] S.-H. Cho, Y. Park, and J. Lee, “Effective music genre classification using late fusion convolutional neural network with multiple spectral features,” Proc. IEEE ICCE-Asia, 1-4 (2022). DOI: 10.1109/ICCE-Asia57006.2022.9954732

[17] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE J. Sel. Top. Signal Process. 16, 1505-1518 (2022). DOI: 10.1109/JSTSP.2022.3188113

[18] Y. Yi, K.-Y. Chen, and H.-Y. Gu, “Mixture of CNN experts from multiple acoustic feature domain for music genre classification,” Proc. IEEE APSIPA ASC, 1250-1255 (2019). DOI: 10.1109/APSIPAASC47483.2019.9023314

[19] Md. N. A. Siddiquee, Md. A. Hossain, and F. Wahida, “An effective machine learning approach for music genre classification with mel spectrograms and KNN,” Proc. IEEE IC3S, 1-4 (2023). DOI: 10.1109/IC3S57698.2023.10169397