<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>DiTSinger</title>
  <link rel="stylesheet" href="css/style.css">
</head>

<body>

<div class="container">

  <!-- ===== Paper Header ===== -->
  <div class="paper-header">

    <h1 class="paper-title">DiTSinger</h1>

    <p class="paper-subtitle">
      DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
    </p>

    <div class="paper-authors">
      <p class="author-line">
        Zongcai Du<sup>*</sup>, Guilin Deng<sup>*</sup>, Xiaofeng Guo<sup>*</sup>,
        Xin Gao, Linke Li
      </p>
      <p class="author-line">
        Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
      </p>
      <p class="author-note">
        <sup>*</sup> Equal contribution.
      </p>
      <p class="author-affiliation">
        Migu Music, China Mobile
      </p>
    </div>

  </div>
  <!-- ===== End Paper Header ===== -->

  <!-- Abstract -->
  <section class="abstract-section">
    <h2 class="section-title">Abstract</h2>
    <p class="abstract-text">
      Diffusion-based Singing Voice Synthesis (SVS) has recently shown strong
      expressiveness, but it remains limited by data scarcity and model scalability.
      We introduce a two-stage pipeline: a compact seed set of human-sung recordings
      is constructed by pairing fixed melodies with diverse LLM-generated lyrics,
      and melody-specific models are trained to synthesize over 500 hours of
      high-quality Chinese singing data. Building on this corpus, we propose
      <b>DiTSinger</b>, a Diffusion Transformer with RoPE and qk-norm, systematically
      scaled in depth, width, and resolution for enhanced fidelity.
      Furthermore, we design an implicit alignment mechanism that obviates
      phoneme-level duration labels by constraining phoneme-to-acoustic attention
      within character-level spans, thereby improving robustness under noisy or
      uncertain alignments. Extensive experiments validate that our approach enables
      scalable, alignment-free, and high-fidelity SVS.
    </p>
  </section>

  <!-- Data Pipeline -->
  <section>
    <h2>Data Construction Pipeline</h2>
    <img
      src="assets/figures/datapipeline.png"
      alt="Data Construction Pipeline"
      class="paper-figure"
    >
  </section>

  <!-- Architecture -->
  <section>
    <h2>Model Architecture</h2>
    <img
      src="assets/figures/architecture.png"
      alt="Model Architecture"
      class="paper-figure"
    >
  </section>

  <!-- Scaling Results -->
  <section>
    <h2>Scaling Results</h2>

    <div class="scaling-container">
      <div class="scaling-item">
        <h3>Model Scaling</h3>
        <img
          src="assets/figures/model_scaling.png"
          alt="Model Scaling"
          class="paper-figure"
        >
      </div>

      <div class="scaling-item">
        <h3>Data Scaling</h3>
        <img
          src="assets/figures/data_scaling.png"
          alt="Data Scaling"
          class="paper-figure"
        >
      </div>
    </div>
  </section>

  <!-- Demos -->
  <section>
    <h2>Singing Voice Synthesis Demos</h2>
    <p>
      All demo songs were <b>unseen during training</b>, and synthesis uses only
      <b>character-level timestamps</b>, with no phoneme-level duration labels.
    </p>

    <div id="demo-table"></div>
  </section>

</div>

<script src="load_demos.js"></script>
</body>
</html>