<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>DiTSinger</title>
<link rel="stylesheet" href="css/style.css">
</head>
<body>
<div class="container">
<!-- ===== Paper Header ===== -->
<div class="paper-header">
<h1 class="paper-title">DiTSinger</h1>
<p class="paper-subtitle">
DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
</p>
<div class="paper-authors">
<p class="author-line">
Zongcai Du<sup>*</sup>, Guilin Deng<sup>*</sup>, Xiaofeng Guo<sup>*</sup>,
Xin Gao, Linke Li
</p>
<p class="author-line">
Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
</p>
<p class="author-note">
<sup>*</sup> Equal contribution.
</p>
<p class="author-affiliation">
Migu Music, China Mobile
</p>
</div>
</div>
<!-- ===== End Paper Header ===== -->
<!-- Abstract -->
<section class="abstract-section">
<h2 class="section-title">Abstract</h2>
<p class="abstract-text">
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates
strong expressiveness but remains limited by data scarcity and model scalability.
We introduce a two-stage pipeline: a compact seed set of human-sung recordings
is constructed by pairing fixed melodies with diverse LLM-generated lyrics,
and melody-specific models are trained to synthesize over 500 hours of
high-quality Chinese singing data. Building on this corpus, we propose
<b>DiTSinger</b>, a Diffusion Transformer with RoPE and qk-norm, systematically
scaled in depth, width, and resolution for enhanced fidelity.
Furthermore, we design an implicit alignment mechanism that obviates
phoneme-level duration labels by constraining phoneme-to-acoustic attention
within character-level spans, thereby improving robustness under noisy or
uncertain alignments. Extensive experiments validate that our approach enables
scalable, alignment-free, and high-fidelity SVS.
</p>
</section>
<!-- Data Pipeline -->
<section>
<h2>Data Construction Pipeline</h2>
<img
src="assets/figures/datapipeline.png"
alt="Data Construction Pipeline"
class="paper-figure"
>
</section>
<!-- Architecture -->
<section>
<h2>Model Architecture</h2>
<img
src="assets/figures/architecture.png"
alt="Model Architecture"
class="paper-figure"
>
</section>
<!-- Scaling Results -->
<section>
<h2>Scaling Results</h2>
<div class="scaling-container">
<div class="scaling-item">
<h3>Model Scaling</h3>
<img
src="assets/figures/model_scaling.png"
alt="Model Scaling"
class="paper-figure"
>
</div>
<div class="scaling-item">
<h3>Data Scaling</h3>
<img
src="assets/figures/data_scaling.png"
alt="Data Scaling"
class="paper-figure"
>
</div>
</div>
</section>
<!-- Demos -->
<section>
<h2>Singing Voice Synthesis Demos</h2>
<p>
All demo songs were <b>unseen during training</b> and use only
<b>character-level timestamps</b>.
</p>
<div id="demo-table"></div>
</section>
</div>
<script src="load_demos.js"></script>
</body>
</html>