124 Commits
v1.5.0 ... main

Author SHA1 Message Date
Vanka0051
42a73394e9 Merge pull request #537 from pandalee99/perf/gumbel_softmax_sampler
feat(sampler): enhance with greedy sampling mode
2025-11-07 15:32:31 +08:00
Vanka0051
a963ca3d0b Merge branch 'main' into perf/gumbel_softmax_sampler 2025-11-07 15:32:22 +08:00
Vanka0051
44b5ffcb5d Merge pull request #536 from storyicon/support_batch
feat(accel): add batch support
2025-11-07 15:29:53 +08:00
PAN
a3b884ff6f feat: gumbel_softmax_sampler
Signed-off-by: PAN <1162953505@qq.com>
2025-11-06 13:12:11 +08:00
storyicon
3c360273da feat: add batch support
Signed-off-by: storyicon <storyicon@foxmail.com>
2025-11-06 05:00:28 +00:00
Vanka0051
1d5d079aaa Merge pull request #517 from storyicon/gpt2_accel
feat: achieve inference acceleration for the gpt2 stage (3.79×)
2025-10-30 16:14:46 +08:00
Vanka0051
5d67f6271b Merge branch 'main' into gpt2_accel 2025-10-30 16:14:34 +08:00
Vanka0051
e42480ced8 Merge pull request #516 from storyicon/s2mel_accel
feat: achieve inference acceleration for the s2mel stage (1.61×)
2025-10-30 15:55:25 +08:00
storyicon
c1ef4148af feat: achieve inference acceleration for the gpt2 stage
Signed-off-by: storyicon <storyicon@foxmail.com>
2025-10-24 08:15:00 +00:00
storyicon
31e7e855e2 feat: optimize s2mel stage
Signed-off-by: storyicon <storyicon@foxmail.com>
2025-10-24 07:30:20 +00:00
wangyining02
bde7d0bdf0 doc: update chat groups 2025-10-10 14:01:36 +08:00
nanaoto
db5b39bb6a Merge pull request #461 from Yttrin/main
fix: Empty generator -> IndexError problem on non-streaming infer()
2025-10-02 01:09:44 +08:00
Yt Zhong
750d9d9d15 fix: Empty generator -> IndexError problem on non-streaming infer() 2025-10-01 03:25:28 +08:00
Yt Zhong
b0c6ab8a93 Simple streaming return implementation, lower latency for the first sound. (#417)
* Add stream_return switch to get wavs from yield

* Add more_segment_before arg for more segmenting.

more_segment_before is an int; for token_index < more_segment_before, more segmenting will be applied.
0: no effect; 80 is recommended for better first-wav-latency

* Uncomment silence insertion

* fix: rename quick streaming tokens argument

* fix: rename quick streaming tokens argument

* fix: Add a wrapper for the yield function. It will not return a generator under normal (non-streaming) conditions.
2025-09-30 14:05:39 +08:00
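The last bullet above describes a wrapper so that infer() only hands a generator to the caller when streaming is requested. Below is a minimal, self-contained sketch of that wrapper pattern; apart from the stream_return flag named in the commit, the function and parameter names are illustrative, not the repository's exact API.

```python
from typing import Iterator, List, Union

def _generate_chunks(text: str) -> Iterator[bytes]:
    # Stand-in for per-segment synthesis; each yield represents one wav chunk.
    for segment in text.split("."):
        if segment.strip():
            yield segment.strip().encode("utf-8")

def infer(text: str, stream_return: bool = False) -> Union[Iterator[bytes], List[bytes]]:
    gen = _generate_chunks(text)
    if stream_return:
        return gen          # caller consumes chunks as they arrive (lower first-sound latency)
    return list(gen)        # normal call: no generator is exposed to the caller

if __name__ == "__main__":
    for chunk in infer("Hello there. How are you.", stream_return=True):
        print("got chunk:", chunk)
```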
nanaoto
2ca41d738f Merge pull request #441 from coezbek/patch-3
Feat: Warn if input text contains UNK tokens
2025-09-29 16:15:48 +08:00
Christopher Özbek
34be9bfb14 feat: Warn if input text contains UNK tokens
Added warnings for unknown tokens in input text.
2025-09-27 09:08:18 +02:00
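A hedged sketch of the idea behind #441: check whether the tokenizer would map input characters to UNK tokens and warn the user. The vocabulary below is a stand-in; the project's own tokenizer and vocab files are not reproduced here.

```python
import warnings

VOCAB = set("abcdefghijklmnopqrstuvwxyz ,.!?'")  # illustrative vocabulary only

def warn_on_unk(text: str) -> None:
    unknown = sorted({ch for ch in text.lower() if ch not in VOCAB})
    if unknown:
        warnings.warn(
            f"Input text contains characters outside the vocabulary "
            f"(they would become UNK tokens): {unknown}"
        )

warn_on_unk("Smörgåsbord costs €5!")  # warns about '5', 'å', 'ö', '€'
```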
root
84a5ef97b8 update. 2025-09-23 17:52:52 +08:00
nanaoto
c7602c1f59 Merge pull request #397 from Arcitec/indextts2-arc
IndexTTS2 Documentation Update
2025-09-23 15:58:55 +08:00
Arcitec
5471d8256f docs: Install HuggingFace CLI with high-speed download feature
The Xet storage method uses de-duplication and chunked downloads to speed up transfers in some situations:

https://pypi.org/project/hf-xet/

But most importantly, installing the Xet support gets rid of some annoying HuggingFace CLI messages about missing the feature.
2025-09-19 21:39:52 +02:00
Arcitec
ae5653986c chore: Note why package build isolation was disabled for DeepSpeed 2025-09-19 02:35:27 +02:00
Arcitec
cc9c6b6cfe docs: Clarify that UV handles Python and the environment creation
- Some users have been confused and were manually creating and activating Python venvs, which is not good since it can lead to the wrong Python version or dependency conflicts.

- Therefore, we add more detailed guidance to explain that `uv` manages the whole environment, the Python version, all dependencies and automatic environment activation.

- A few users were also confused about where `uv tool` installs binaries, but instead of explaining that in depth, we now add a link to the documentation page which explains how it works, and also instruct users to carefully read the `uv tool` output since it tells them how to add the installation to the system's path.
2025-09-18 20:28:11 +02:00
kj863257rc
64cb31a6c3 Update infer_v2.py: solve the problem of persistent cache buildup (#382)
* Update infer_v2.py

clear old cache

* Update infer_v2.py: solve the problem of persistent cache buildup

clear old cache
2025-09-18 13:59:45 +08:00
nanaoto
9e391a920a Merge pull request #354 from Arcitec/indextts2-arc
IndexTTS2 Maintenance Patches
2025-09-18 13:58:23 +08:00
Arcitec
c24d73ea44 chore: Small dependency updates 2025-09-17 21:55:44 +02:00
Arcitec
ec368de932 fix(webui): Experimental checkbox bugfixes and add visual warning label
- We can't use the original "Show experimental features" checkbox implementation, because it *deeply* breaks Gradio.

- Gradio's `gr.Examples()` API binds itself to the original state of the user interface. Gradio crashes and causes various bugs if we try to change the available UI controls later.

- Instead, we must use `gr.Dataset()` which acts like a custom input/output control and doesn't directly bind itself to the target control. We must also provide a secret, hidden "all mode choices" component so that it knows the names of all "control modes" that are possible in examples.

- We now also have a very visible warning label in the user interface, to clearly mark the experimental features.

- Bugs fixed:

* The code was unable to toggle the visibility of Experimental demos in the Examples list. It was not possible with Examples (since it's a wrapper around Dataset, but Examples contains its own internal state/copy of all data). Instead, we use a Dataset and manipulate its list directly.

* Gradio crashes with a `gradio.exceptions.Error` exception if you try to load an example that tries to use an experimental feature if we have removed its UI element. This is because Examples binds to the original user interface and *remembers* the list of choices, and it *cannot* dynamically select something that did not exist when the `gr.Examples()` was initially created. This problem is fixed by switching to `gr.Dataset()`.

* Furthermore, Gradio's `gr.Examples()` handler actually remembers and caches the list of UI options. So every time we load an example, it rewrites the "Emotion Control Mode" selection menu to only show the options that were available when the Examples table was created. This means that even if we keep the "Show experimental features" checkbox, Gradio itself will erase the experimental mode from the Control Mode selection menu every time the user loads an example. There are no callbacks or "update" functions to allow us to override this automatic Gradio behavior. But by switching to `gr.Dataset()`, we completely avoid this deep binding.

* The "Show experimental features" checkbox is no longer tied to a column in the examples-table, to avoid fighting between Gradio's example table trying to set the mode, and the experimental checkbox being toggled and also trying to set the mode.

* Lastly, the "Show experimental features" checkbox now remembers and restores the user's current mode selection when toggling the checkbox, instead of constantly resetting to the default mode ("same as voice reference"), to make the UI more convenient for users.
2025-09-17 21:54:48 +02:00
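A rough sketch of the gr.Dataset approach described above: the checkbox handler returns an updated Dataset whose sample list includes or excludes the experimental rows, which gr.Examples cannot do because it binds to the UI state at creation time. The component names, example rows, and exact update pattern are assumptions and may need adjustment for a given Gradio version.

```python
import gradio as gr

ALL_EXAMPLES = [
    ["Hello world", "Same as voice reference"],
    ["An angry line", "Emotion vectors (experimental)"],
]

def visible_rows(show_experimental: bool):
    return [row for row in ALL_EXAMPLES
            if show_experimental or "experimental" not in row[1]]

def on_toggle(show_experimental: bool):
    # Returning a Dataset with new samples updates the existing component.
    return gr.Dataset(samples=visible_rows(show_experimental))

with gr.Blocks() as demo:
    text = gr.Textbox(label="Text")
    mode = gr.Textbox(label="Emotion control mode")  # stand-in for the real mode selector
    show_exp = gr.Checkbox(label="Show experimental features", value=False)
    examples = gr.Dataset(components=[text, mode], samples=visible_rows(False))
    show_exp.change(on_toggle, inputs=show_exp, outputs=examples)

demo.launch()
```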
Arcitec
c5f9a31127 fix(webui): Make the Emotion Control Weight slider visible again
- The emotion weight is always applied in every mode except "Same as voice reference", so we must make the slider visible so that users can control the value. Otherwise it would silently apply the last-set value without the user knowing, which is very confusing.

- Furthermore, having the slider even on the Emotion Vectors page is *very* useful, because it allows users to rapidly change the total strength of the current emotion vectors without having to manually/carefully move every individual emotion slider.
2025-09-17 19:56:07 +02:00
Arcitec
e185fa1ce7 fix(webui): Make the Advanced Settings visible by default again
- The Advanced Settings contains some very advanced features which users shouldn't tweak, but it also contains important insight into segmentation generations, and the "max tokens per generation segment" feature which users must tweak if they have low VRAM.

- Therefore it's very important that users notice the "Advanced Settings" section so that they can read the VRAM help text and reduce the segment length if they have VRAM issues. So let's make the advanced category visible by default again until a better solution is determined.
2025-09-17 19:56:07 +02:00
Arcitec
c266910cc6 refactor(webui): Remove repeated code in Examples loader 2025-09-17 19:56:07 +02:00
Arcitec
8aa8064a53 feat: Add reusable Emotion Vector normalization helper
- The WebUI was secretly squashing all emotion vectors and re-scaling them. It's a good idea for user friendliness, but it makes it harder to learn what values will work in Python when using the WebUI for testing.

- Instead, let's move the normalization code into IndexTTS2 as a helper function which is used by Gradio and can be used from other people's code too.

- The emotion bias (which reduces the influence of certain emotions) has also been converted into an optional feature, which can be turned off if such biasing isn't wanted. And all biasing values have been re-scaled to use 1.0 as the reference, to avoid scaling relative to 0.8 (which previously meant that it applied double scaling).
2025-09-17 19:56:07 +02:00
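A hedged sketch of what such a normalization helper could look like: optional per-emotion bias factors expressed relative to 1.0, followed by proportional re-scaling when the combined strength exceeds a cap. The function name, the cap value, and the bias handling are assumptions, not the project's exact implementation.

```python
from typing import List, Optional

def normalize_emotion_vector(vec: List[float],
                             bias: Optional[List[float]] = None,
                             apply_bias: bool = True,
                             max_total: float = 0.8) -> List[float]:
    if apply_bias and bias is not None:
        vec = [v * b for v, b in zip(vec, bias)]   # bias factors relative to 1.0 (1.0 = no change)
    total = sum(vec)
    if total > max_total:                          # squash only when the combined strength is too high
        vec = [v * (max_total / total) for v in vec]
    return vec

# Example: a strong mix gets scaled down proportionally instead of silently inside the WebUI.
print(normalize_emotion_vector([0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.0]))
```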
Arcitec
1520d0689b fix(webui): New default emo_alpha recommendation instead of scaling
- Silently scaling the value internally is confusing for users. They may be tuning their settings via the Web UI before putting the same values into their Python code, and would then get a different result since the Web UI "lies" about the slider values.

- Instead, let's remove the silent scaling, and just change the default weight to a better recommendation.
2025-09-17 19:56:07 +02:00
Arcitec
ef097101b7 fix(webui): Add support for Gradio 5.45.0 and higher
- We were using ".select" to detect when tabs are changed, but Gradio has modified behavior in 5.45.0 to only trigger from user clicks. They now require that we use ".change" to detect tab changes from code. This fix makes the Examples work when loading on new Gradio versions.
2025-09-17 19:56:07 +02:00
index-tts
cb5c98011f Merge pull request #378 from index-tts/tts2dev
update Contributors
2025-09-17 11:39:05 +08:00
shujingchen
d50340aa5b update Contributors 2025-09-17 11:37:20 +08:00
index-tts
12ee39996f Merge pull request #375 from index-tts/tts2dev
update Contributors
2025-09-16 20:22:52 +08:00
shujingchen
a37d808923 update Contributors 2025-09-16 20:20:50 +08:00
index-tts
02c1e5a234 Merge pull request #374 from index-tts/tts2dev
Update contributors
2025-09-16 19:45:47 +08:00
shujingchen
901a5a4111 update Contributors 2025-09-16 19:43:32 +08:00
shujingchen
1361244010 update Contributors 2025-09-16 19:38:33 +08:00
shujingchen
c2482142d6 Merge remote-tracking branch 'origin/main' into tts2dev 2025-09-16 19:28:59 +08:00
shujingchen
3e416dc598 update Contributors 2025-09-16 19:28:09 +08:00
index-tts
70aa801b25 Merge pull request #372 from index-tts/tts2dev
update readme
2025-09-16 15:55:13 +08:00
shujingchen
58f8a9d2b1 Merge remote-tracking branch 'origin/main' into tts2dev 2025-09-16 15:53:38 +08:00
shujingchen
e3595faec1 add Contributors in Bilibili 2025-09-16 15:51:46 +08:00
shujingchen
ef86774658 update Official Statement 2025-09-16 14:21:02 +08:00
shujingchen
de949be82a update Official Statement 2025-09-16 14:18:49 +08:00
index-tts
45d8d13f0b Merge pull request #368 from index-tts/tts2dev
Include usage notes for Pinyin
2025-09-16 13:22:22 +08:00
shujingchen
961dcc23f4 add pinyin.vocab 2025-09-16 13:18:55 +08:00
shujingchen
be4af061f1 update 2025-09-16 13:13:21 +08:00
shujingchen
10c1fcd3ad add tips: pinyin usage 2025-09-16 13:10:40 +08:00
shujingchen
7b4f0880d9 update modelscope demo page link 2025-09-16 11:31:15 +08:00
shujingchen
aad61c2afc Merge remote-tracking branch 'origin/main' into tts2dev 2025-09-16 11:25:54 +08:00
nanaoto
a058502865 Add Docker publish workflow configuration 2025-09-15 17:47:08 +08:00
nanaoto
ee23371296 Merge pull request #338 from yrom/fix/preload-bigvgan-cuda
Correct the import path of BigVGAN's custom cuda kernel
2025-09-15 16:27:40 +08:00
nanaoto
009428b62d Merge pull request #347 from index-tts/cut_audio
feat: Trim overly long input audio to 15s to reduce RAM/VRAM overflows
2025-09-12 16:48:14 +08:00
nanaoto
0828dcb098 feat: Trim overly long input audio to 15s to reduce RAM/VRAM overflows 2025-09-12 16:45:37 +08:00
shujingchen
6118d0ecf9 update modelscope demo page link 2025-09-12 16:20:37 +08:00
nanaoto
48a71aff6d Merge pull request #345 from index-tts/webui_update
feat: Normalize parameters to the recommended ranges to improve the user experience
2025-09-12 14:23:24 +08:00
nanaoto
af2b06e061 feat: Normalize parameters to the recommended ranges to improve the user experience 2025-09-12 14:20:04 +08:00
LGZwr
2cfc76ad9c fix: Fix the error caused by overly long sample audio by trimming the audio. 2025-09-12 14:08:46 +08:00
Arcitec
d777b8a029 docs: Add FP16 usage advice for faster inference 2025-09-12 14:06:30 +08:00
Yrom
e409c4a19b fix(infer_v2): Correct the import path of BigVGAN's custom cuda kernel 2025-09-11 16:55:18 +08:00
nanaoto
8336824c71 Merge pull request #325 from Arcitec/indextts2-arc
IndexTTS2 New Features & Maintenance Patches
2025-09-11 12:55:38 +08:00
Arcitec
85ba55a1d3 docs: Document the DeepSpeed performance effects 2025-09-11 06:37:03 +02:00
Arcitec
f041d8eb64 fix(webui): Fix unintentional empty spacing between control groups 2025-09-11 06:08:08 +02:00
Arcitec
3b5b6bca85 docs: Document the new emo_alpha feature for text-to-emotion mode 2025-09-11 05:42:39 +02:00
Arcitec
d899770313 feat(webui): Implement emotion weighting for vectors and text modes
- This is a major new feature, which now allows for much more natural speech generation by lowering the influence of the emotion vector/text control modes.

- It is particularly useful for the "emotion text description" control mode, where a strength of 0.6 or lower is useful to get much more natural speech.
2025-09-11 04:25:26 +02:00
Arcitec
9668064377 feat: Implement emo_alpha scaling of emotion vectors and emotion text
- Added support for `emo_alpha` scaling of emotion vectors and emotion text inputs.

- This is a major new feature, which now allows for much more natural speech generation by lowering the influence of the emotion vector/text control modes.

- It is particularly useful for the "emotion text description" control mode, where a strength of 0.6 or lower is useful to get much more natural speech. Before this feature, it was not possible to make natural speech with that mode, because QwenEmotion assigns emotion scores to the text from 0.0-1.0, and that score was used directly as an emotion vector. This meant that the text mode always used very high strengths. Now, the user can adjust the strength of the emotions to get very natural results.

- Refactored `IndexTTS2.infer()` variable initialization logic to avoid repetition and ensure cleaner code paths.
2025-09-11 04:24:47 +02:00
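A minimal sketch of the assumed emo_alpha semantics: the emotion vector, whether set directly or produced by the text-to-emotion model, is scaled by emo_alpha before conditioning synthesis.

```python
def apply_emo_alpha(emo_vector, emo_alpha=0.6):
    """Scale every emotion component by emo_alpha (1.0 = unchanged)."""
    return [v * emo_alpha for v in emo_vector]

# QwenEmotion-style scores near 1.0 become much milder at emo_alpha=0.6,
# which the commit above reports gives more natural speech in text mode.
print(apply_emo_alpha([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], emo_alpha=0.6))
```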
Arcitec
555e146fb4 feat(webui): Implement speech synthesis progress bar 2025-09-11 04:17:02 +02:00
Arcitec
55095de317 chore: Lock Gradio version due to bug in 5.45.0
Their new 5.45.0 release today breaks the ability to load examples. We have to lock the last working version of Gradio.
2025-09-11 04:16:46 +02:00
Arcitec
39a035d106 feat: Extend GPU Check utility to support more GPUs
- Refactored to a unified device listing function.

- Now checks every supported hardware acceleration device type and lists the devices for all of them, to give a deeper system analysis.

- Added Intel XPU support.

- Improved AMD ROCm support.

- Improved Apple MPS support.
2025-09-11 04:16:27 +02:00
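A hedged sketch of a unified device-listing check along the lines described above, using only PyTorch calls that exist on recent releases (the hasattr guard covers builds without XPU support). This is not the repository's actual tool.

```python
import torch

def list_accel_devices():
    devices = []
    if torch.cuda.is_available():  # NVIDIA CUDA; AMD ROCm builds also report through torch.cuda
        for i in range(torch.cuda.device_count()):
            devices.append(("cuda", i, torch.cuda.get_device_name(i)))
    if hasattr(torch, "xpu") and torch.xpu.is_available():  # Intel XPU (newer PyTorch builds)
        for i in range(torch.xpu.device_count()):
            devices.append(("xpu", i, torch.xpu.get_device_name(i)))
    if torch.backends.mps.is_available():  # Apple Silicon
        devices.append(("mps", 0, "Apple MPS"))
    return devices

if __name__ == "__main__":
    for kind, idx, name in list_accel_devices() or [("cpu", 0, "CPU only")]:
        print(f"{kind}:{idx} -> {name}")
```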
Arcitec
6113567e94 fix(cli): More robust device priority checks 2025-09-11 04:16:27 +02:00
Arcitec
c3d7ab4adc docs: Add usage note regarding random sampling 2025-09-11 04:15:58 +02:00
Arcitec
30848efd45 docs: Add Alibaba's high-bandwidth PyPI mirror for China 2025-09-11 04:15:58 +02:00
Arcitec
752df30549 chore: Move docs to new directory 2025-09-11 04:15:58 +02:00
Arcitec
f0badb13af feat(webui)!: Easier DeepSpeed launch argument 2025-09-11 04:15:58 +02:00
nanaoto
97d06383da Merge pull request #327 from index-tts/doc_zh
Chinese documentation
2025-09-11 00:36:33 +08:00
nanaoto
ce2f71aae5 fix: Chinese README title display issue 2025-09-11 00:32:42 +08:00
nanaoto
5e257cc909 doc: add Chinese README 2025-09-11 00:29:33 +08:00
DDXDB
e83df4e427 feat(cli): Support XPU (#322)
* Support XPU

* Support XPU
2025-09-10 22:35:06 +08:00
nanaoto
242604d27e Merge pull request #324 from Arcitec/indextts2-arc
docs: Remove redundant "python" command instruction
2025-09-10 22:32:46 +08:00
Arcitec
3236fa496a docs: Remove redundant "python" command instruction 2025-09-10 16:28:21 +02:00
nanaoto
4ba37b5736 Merge pull request #323 from Arcitec/indextts2-arc
IndexTTS2 Documentation Update
2025-09-10 22:22:04 +08:00
Arcitec
429c06c787 docs: Add quick uv installation technique
- Makes the "uv" installation page link larger.

- Adds the quickest and easiest installation technique for convenience.
2025-09-10 16:19:51 +02:00
nanaoto
831fc4f5bd Merge pull request #311 from Arcitec/indextts2-arc
IndexTTS2 Maintenance Patches
2025-09-10 14:57:24 +08:00
Arcitec
6c768073e9 docs: Add a stronger warning about unsupported installation methods
- Several users have unfortunately disregarded the `uv` instructions and ended up with broken `conda` / `pip` installations. We require `uv` for a reason: It's the *only* way to guarantee an exact, well-tested installation environment.

- The warning is now clearly highlighted, with a deeper explanation about why `uv` is required, so that everyone can enjoy IndexTTS without hassle!
2025-09-10 06:04:00 +02:00
Arcitec
936e6ac4dd feat: DeepSpeed is now an optional dependency which can be disabled
- Windows users sometimes have trouble installing DeepSpeed. Therefore, this feature has been converted into an optional installation flag, which can be skipped if it's causing issues on your system.

- Improved documentation to mention how to enable/disable DeepSpeed and all the other speed-related features (such as compiled CUDA kernels).
2025-09-10 02:28:28 +02:00
Arcitec
05a8ae45e5 fix: Don't load DeepSpeed if use_deepspeed is False
- A recent change made DeepSpeed optional (off by default), but the code was still trying to load DeepSpeed even when `use_deepspeed = False`. This means users would still have a big startup slowdown and a lot of error messages if their DeepSpeed module isn't working (usually because it's not able to compile itself on their machines).

- We now only load DeepSpeed if the user requested it.

- Translated the DeepSpeed error message to English, since all other errors in the same function were already English.
2025-09-09 18:20:28 +02:00
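A small sketch of the guard this commit describes: import DeepSpeed only when the user asked for it, and fall back quietly if the import or kernel build fails. The names are illustrative.

```python
def maybe_load_deepspeed(use_deepspeed: bool):
    if not use_deepspeed:
        return None                      # skip the slow import and any compile errors entirely
    try:
        import deepspeed                 # attempted only when explicitly requested
        return deepspeed
    except Exception as exc:             # e.g. the JIT kernel build failed on this machine
        print(f"DeepSpeed was requested but failed to load, falling back to the default path: {exc}")
        return None
```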
Arcitec
7aca90ba6c refactor: Simplify use_cuda_kernel check 2025-09-09 18:11:12 +02:00
Arcitec
57f1c11d4a chore: Tag custom license identifier in pyproject.toml 2025-09-09 18:11:12 +02:00
nanaoto
ec530fb0a7 add deepspeed cmd option (#307) 2025-09-09 20:35:54 +08:00
仙舟龙脉研究所
5d6a18a776 Add support for Intel GPUs (#298) 2025-09-09 20:31:50 +08:00
十字鱼
055a23a12b Add startup parameters for cuda_kernel (#302)
* Update webui.py

* Update infer_v2.py
2025-09-09 20:31:05 +08:00
index-tts
32f111d906 Merge pull request #303 from index-tts/tts2dev
Add a HuggingFace experience entry
2025-09-09 19:36:39 +08:00
shujingchen
726eb19ca7 update link 2025-09-09 19:34:56 +08:00
shujingchen
0a98f3082c IndexTTS-2-Demo page link 2025-09-09 19:31:49 +08:00
nanaoto
0dc66d2621 Update License (#300)
* (WIP) test license for IndexTTS-2

* remove old license for v1.5 & rename latest license

* Update license descriptions in code

---------

Co-authored-by: wangyining02 <wangyining02@bilibili.com>
2025-09-09 16:34:34 +08:00
wangyining02
4d66ccbba4 Update license for IndexTTS-2 2025-09-09 15:59:23 +08:00
Johnny Arcitec
cdcc62ae22 IndexTTS2 Release Preparation, Part 2 (#291)
* fix: Configure "uv" build system to use CUDA on supported platforms

- Linux builds of PyTorch always have CUDA acceleration built-in, but Windows only has it if we request a CUDA build.

- The built-in CUDA on Linux uses old libraries and can be slow.

- We now request PyTorch built for the most modern CUDA Toolkit on Linux + Windows, to solve both problems.

- Mac uses PyTorch without CUDA support, since it doesn't exist on that platform.

- Other dependencies have received new releases and are included in this fix too:

* click was downgraded because the author revoked 8.2.2 due to a bug.

* wetext received a new release now.

* fix: Use PyPI as the hashing reference in "uv" lockfile

- PyPI is the most trustworthy source for package hashes. We need to remove the custom mirror from the config, otherwise that mirror always becomes the default lockfile/package source, which leads to user trust issues and package impersonation risks.

- Regional mirrors should be added by users during installation instead, via the `uv sync --default-index` flag. Documented with example for Chinese mirror.

- When users add `--default-index`, "uv" will try to discover the exact same packages via the mirror to improve download speeds, but automatically uses PyPI if the mirror didn't have the files or if the mirror's file hashes were incorrect. Thus ensuring that users always have the correct package files.

* docs: Improve README for IndexTTS2 release!

- "Abstract" separated into paragraphs for easier readability.

- Clearer document structure and many grammatical improvements.

- More emojis, to make it easier to find sections when scrolling through the page!

- Added missing instructions:

* Needing `git-lfs` to clone the code.
* Needing CUDA Toolkit to install the dependencies.
* How to install the `hf` or `modelscope` CLI tools to download the models.

- Made our web demo the first section within "quickstart", to give users a quick, fun demo to start experimenting with.

- Fixed a bug in the "PYTHONPATH" recommendation. It must be enclosed in quotes `""`, otherwise the new path would break on systems that had spaces in their original path.

- Improved all Python code-example descriptions to make them much easier to understand.

- Clearly marked the IndexTTS1 legacy section as "legacy" to avoid confusion.

- Removed outdated Windows "conda/pip" instruction which is no longer relevant since we use "uv" now.

* refactor(webui): Remove unused imports

The old IndexTTS1 module and ModelScope were being loaded even though we don't need them. They also have a lot of dependencies, which slowed down loading and could even cause some conflicts.

* feat!: Remove obsolete build system (setup.py)

BREAKING CHANGE: The `setup.py` file has been removed.

Users should now use the new `pyproject.toml` based "uv" build system for installing and developing the project.

* feat: Add support for installing IndexTTS as a CLI tool

- We now support installing as a CLI tool via "uv".

- Uses the modern "hatchling" as the package / CLI build system.

- The `cli.py` code is currently outdated (doesn't support IndexTTS2). Marking as a TODO.

* chore: Add authors and classifiers metadata to pyproject.toml

* feat: Faster installs by making WebUI dependencies optional

* refactor!: Rename "sentences" to "segments" for clarity

- When we are splitting text into generation chunks, we are *not* creating "sentences". We are creating "segments". Because a *sentence* must always end with punctuation (".!?" etc). A *segment* can be a small fragment of a sentence, without any punctuation, so it's not accurate (and was very misleading) to use the word "sentences".

- All variables, function calls and strings have been carefully analyzed and renamed.

- This change will be part of user-facing code via a new feature, which is why the change was applied to the entire codebase.

- This change also helps future code contributors understand the code.

- All affected features are fully tested and work correctly after this refactoring.

- The `is_fp16` parameter has also been renamed to `use_fp16` since the previous name could confuse people ("is" implies an automatic check, "use" implies a user decision to enable/disable FP16).

- `cli.py`'s "--fp16" default value has been set to False, exactly like the web UI.

- `webui.py`'s "--is_fp16" flag has been changed to "--fp16" for easier usage and consistency with the CLI program, and the help-description has been improved.

* feat(webui): Set "max tokens per generation segment" via CLI flag

- The "Max tokens per generation segment" is a critical setting, as it directly impacts VRAM usage. Since the optimal value varies significantly based on a user's GPU, it is a frequent point of adjustment to prevent out-of-memory issues.

- This change allows the default value to be set via a CLI flag. Users can now conveniently start the web UI with the correct setting for their system, eliminating the need to manually reconfigure the value on every restart.

- The `webui.py -h` help text has also been enhanced to automatically display the default values for all CLI settings.

* refactor(i18n): Improve clarity of all web UI translation strings

* feat(webui): Use main text as emotion guidance when description is empty

If the user selects "text-to-emotion" control, but leaves the emotion description empty, we now automatically use the main text prompt instead.

This ensures that web users can enjoy every feature of IndexTTS2, including the ability to automatically guess the emotion from the main text prompt.

* feat: Add PyTorch GPU acceleration diagnostic tool

* chore: Use NVIDIA CUDA Toolkit v12.8

Downgrade from CUDA 12.9 to 12.8 to simplify user installation, since version 12.8 is very popular.

* docs: Simplify "uv run" command examples

The "uv run" command can take a `.py` file as direct argument and automatically understands that it should run via python.
2025-09-09 12:51:45 +08:00
nanaoto
3fe385af69 Update README.md 2025-09-09 00:02:00 +08:00
wangyining02
95e703e86c Merge branch 'Arcitec-indextts2-arc' 2025-09-08 23:27:55 +08:00
Arcitec
45b2f1f3eb chore: Add metadata to pyproject.toml 2025-09-08 17:23:23 +02:00
Arcitec
dcdb0614bf fix: Use WeTextProcessing on Linux, and wetext on other platforms 2025-09-08 17:04:19 +02:00
Arcitec
17359d3582 refactor: Change build system to "uv" to ensure robust installation
- Remove legacy "requirements.txt" support, since "pip" is too basic and very often leads to broken installations.

- Require "uv" installation method to ensure user success.
2025-09-08 16:56:44 +02:00
Arcitec
dfe1bd41d8 refactor: Remove code duplication in DeepSpeed dependency check 2025-09-08 16:17:17 +02:00
Arcitec
a6a955d2aa fix: Add support for melancholic emotion in text-to-emotion vectors
- The "低落" (melancholic) emotion will always be mapped to "悲伤" (sad) by QwenEmotion's text analysis. It doesn't know the difference between those emotions even if the user writes the exact words.

- Since the words and their meanings are so similar, it might not be possible to train QwenEmotion to learn the difference.

- As a workaround, we perform input text analysis and look for words that mean "melancholic", and swap the "sad" detection result, to make the melancholic/low-energy speech emotion work correctly for users via text-to-emotion.
2025-09-08 16:14:38 +02:00
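An illustrative sketch of that workaround: when the input text asks for a melancholic mood, move QwenEmotion's "sad" score into the "melancholic" slot. The keyword list and dictionary keys are assumptions.

```python
MELANCHOLY_HINTS = ("melancholic", "melancholy", "低落")

def fix_melancholic(text: str, emotions: dict) -> dict:
    if any(hint in text.lower() for hint in MELANCHOLY_HINTS):
        emotions = dict(emotions)                      # don't mutate the caller's dict
        sad = emotions.get("sad", 0.0)
        emotions["melancholic"] = max(emotions.get("melancholic", 0.0), sad)
        emotions["sad"] = 0.0                          # the "sad" score was really a melancholy request
    return emotions

print(fix_melancholic("please sound melancholic",
                      {"sad": 0.9, "melancholic": 0.0, "calm": 0.1}))
```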
Arcitec
58ad225fb4 fix: Fast and robust text-to-emotion algorithm
- The new algorithm is now very fast and uses less memory, since it doesn't chain multiple `.replace()` calls or create a bunch of temporary strings and temporary dictionaries and lists anymore.

- Parses the JSON output from the QwenEmotion model directly instead of trying to manually parse it. If JSON parsing fails, it falls back to a fast and highly-accurate RegEx search which finds all key-value pairs.

- The desired emotion vector order is now stored as a static class attribute instead of being created from scratch on every call.

- The emotion dictionary creation has been completely rewritten to use a clear algorithm which takes the QwenEmotion answers, builds a new dictionary using `self.desired_vector_order`, maps each key's name to their English translations, fetches the values from QwenEmotion's answers or 0.0 if no value was given by QE, and clamps the values to the min/max ranges.

- The `backup_dict` is now removed, since it was error-prone and fragile. It could grow out of sync with the code if not carefully maintained to keep the correct order and labels.

- To handle the "fallback" dictionary creation, we now automatically scan the final emotion vectors, and if none of them are above 0.0 (meaning we didn't detect any emotions in the input text), we give the final vectors a "calm: 1.0" value. This means that we never have to worry about the fallback dictionary's correctness.

- The previous algorithm had multiple bugs. This rewrite fixes a serious vector order bug: The old algorithm built the dictionary via the found keys, and only checked if there's 8 keys in QwenEmotion's response, but it didn't check that the keys were valid. When building the final emotion dict, it skipped any values if they were not found in QE's response. Meaning that if the QE response only contained 4 of the 8 expected emotion vector labels, those would all be added at the start of the new dictionary as the "first 4 dict slots". After that, it looped through the "backup_dict" and appended any missing values at the end. This resulted in a final emotion dictionary with the wrong order for the emotion vectors. The new code always produces the correct emotion vector order.

- Discovered another bug in the text-to-emotion handling for the "melancholic" emotion, which has never worked for Chinese or English at all. It will be fixed in an upcoming patch.
2025-09-08 16:14:38 +02:00
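A rough sketch of the parsing flow this commit describes, under assumed label names and ranges: parse the QwenEmotion JSON answer, fall back to a key-value regex if that fails, rebuild the dictionary in a fixed vector order with clamped values, and default to calm when nothing was detected.

```python
import json
import re

DESIRED_VECTOR_ORDER = ["happy", "angry", "sad", "afraid",
                        "disgusted", "melancholic", "surprised", "calm"]

def parse_emotions(model_answer: str) -> dict:
    try:
        raw = json.loads(model_answer)
    except json.JSONDecodeError:
        # Fallback: pull "key": number pairs even out of malformed output.
        raw = {k: v for k, v in re.findall(r'"([^"]+)"\s*:\s*([0-9.]+)', model_answer)}
    emotions = {k: min(max(float(raw.get(k, 0.0)), 0.0), 1.0)   # fixed order, clamped to [0, 1]
                for k in DESIRED_VECTOR_ORDER}
    if not any(v > 0.0 for v in emotions.values()):
        emotions["calm"] = 1.0                                  # fallback when no emotion was detected
    return emotions

print(parse_emotions('{"happy": 0.2, "angry": 0.7, "bogus": 1}'))
```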
Arcitec
feba501013 fix: Fix internal text-to-emotion vector labels
- The order of the `convert_dict` now matches the desired order of the emotion vectors, for clarity.

- Internal text labels now match the updated English translations.

- This (and the previous commit) also fixes a bug: The previous, inaccurate Emotion translations meant that QwenEmotion could not understand words such as "low" at all (no emotion mapping), and it always mapped "hate" to "angry". With the fixed translations, QwenEmotion now correctly maps text-to-emotions from English inputs when users input the words that they've been taught by the user interface.
2025-09-08 16:14:38 +02:00
Arcitec
cb0e07f982 refactor(i18n): More accurate emotion translations to improve clarity
- Matches training data better.

- Easier to understand the purpose of the sliders.

- Consistent use of adjectives so that the user interface looks nicer.

- Reasoning:

* Hate -> Disgusted: The original Chinese word expresses disgust, not hatred. Hatred is an angry emotion and was confusing since there's already an Angry slider.

* Low -> Melancholic: The original Chinese word talks about feeling melancholic / down low, a state of slow speech and subdued emotions, which is not the same as sadness or depression. The word "Low" is very confusing for this emotion. The most accurate word for the emotion is "melancholic".

* Neutral -> Calm: The original Chinese word describes a state of being at peace and tranquility. It's not a neutral, non-emotional state. It's a state of being calm and relaxed.
2025-09-08 16:14:38 +02:00
Arcitec
1845e60aa5 refactor(i18n): Improve description of generation segmentation 2025-09-08 16:14:38 +02:00
Arcitec
5f0b0a9f9c feat(i18n): Add missing UI translation strings 2025-09-08 16:14:38 +02:00
Arcitec
55b7d32149 fix: Fix character encoding in examples 2025-09-08 16:14:38 +02:00
Arcitec
d5cdb5eb3c fix: Suppress pandas PyArrow future dependency warning
Moving the import of pandas *after* we've suppressed FutureWarning, to hide a big and useless warning saying that "Pandas 3.x will require PyArrow".
2025-09-08 16:14:38 +02:00
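A sketch of the import-ordering fix described above: the warning filter has to be installed before pandas is imported, because the FutureWarning is emitted at import time and cannot be suppressed afterwards.

```python
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)  # must run before the pandas import
import pandas as pd  # noqa: E402  (deliberately imported after the filter)

print(pd.__version__)
```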
Arcitec
3e64c4ac11 fix: Update pandas to fix Gradio errors
Gradio requires Pandas >= 2.2.0, otherwise it will throw errors in some situations (such as when GPU is Out of Memory).
2025-09-08 16:14:38 +02:00
Arcitec
5ffb84b427 fix: Improve .gitignore and re-add config file
- Improves organization and removes extra junk files.

- Unignores *.yaml files such as config.yaml from the /checkpoints/ directory since we need that file.
2025-09-08 16:14:38 +02:00
kemuriririn
050a4c821e Dev kemurin (#287)
* update deps for windows

* update reqs & README

* update README.md

* update README.md

* use wetext to replace WeTextProcessing on windows

---------

Co-authored-by: wangyining02 <wangyining02@bilibili.com>
2025-09-08 22:14:34 +08:00
十字鱼
9d4776b082 Use without deepspeed (#280)
Use without deepspeed
2025-09-08 22:09:26 +08:00
kemuriririn
474ec9b6cf Dev kemurin (#284)
* update deps for windows

* update reqs & README

* update README.md

* update README.md

---------

Co-authored-by: wangyining02 <wangyining02@bilibili.com>
2025-09-08 21:53:27 +08:00
kemuriririn
fd0a77d390 update deps for windows (#282)
Co-authored-by: wangyining02 <wangyining02@bilibili.com>
2025-09-08 20:59:31 +08:00
kemuriririn
c1a5e39716 Indextts2 (#278)
* indextts2

* update lfs for audio files

* fix pypi source & add python version

---------

Co-authored-by: wangyining02 <wangyining02@bilibili.com>
2025-09-08 18:42:02 +08:00
index-tts
92d50a6ba0 Merge pull request #277 from index-tts/tts2dev
Tts2dev fix video
2025-09-08 17:55:53 +08:00
shujingchen
f61d128893 update 2025-09-08 17:55:14 +08:00
shujingchen
3355074853 remove video 2025-09-08 17:52:14 +08:00
shujingchen
2c88c9731f add download link& update video 2025-09-08 17:51:20 +08:00
kemuriririn
72c09ec0b7 Indextts2 (#276)
* indextts2

* update lfs for audio files

---------

Co-authored-by: wangyining02 <wangyining02@bilibili.com>
2025-09-08 17:36:39 +08:00
207 changed files with 53993 additions and 920 deletions

.gitattributes (new file, 15 lines)

@@ -0,0 +1,15 @@
examples/voice_02.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_04.wav filter=lfs diff=lfs merge=lfs -text
examples/emo_sad.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_03.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_06.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_08.wav filter=lfs diff=lfs merge=lfs -text
tests/sample_prompt.wav filter=lfs diff=lfs merge=lfs -text
examples/emo_hate.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_01.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_05.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_09.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_10.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_12.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_07.wav filter=lfs diff=lfs merge=lfs -text
examples/voice_11.wav filter=lfs diff=lfs merge=lfs -text

.github/workflows/docker-publish.yml (new file, 44 lines)

@@ -0,0 +1,44 @@
name: Build and Publish Docker Image

on:
  workflow_dispatch:

jobs:
  build-amd64:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        include:
          - cuda_version: 11.8
            torch_version: 2.4.1
            tag_prefix: pytorch2.4.1-cuda11.8
          - cuda_version: 12.8
            torch_version: 2.8.0
            tag_prefix: pytorch2.8.0-cuda12.8
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Extract Docker Meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: nanaoto/index-tts
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build Docker Image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./Dockerfile
          push: false
          platforms: linux/amd64
          build-args: |
            CUDA_VERSION=${{ matrix.cuda_version }}
            TORCH_VERSION=${{ matrix.torch_version }}
          tags: |
            nanaoto/index-tts:${{ matrix.tag_prefix }}-${{ steps.meta.outputs.tags }}-amd64
            nanaoto/index-tts:latest-${{ matrix.tag_prefix }}-amd64

.gitignore (38 changed lines)

@@ -1,15 +1,31 @@
venv/
__pycache__
*.egg-info
*.DS_Store
# Development Tools.
.mypy_cache/
.ruff_cache/
__pycache__/
.idea/
.vscode/
checkpoints/*.pth
checkpoints/*.vocab
checkpoints/*.model
checkpoints/.cache
outputs/
build/
# Environments.
.venv*/
venv*/
conda_env*/
# Python Bytecode.
*.py[cod]
# Distribution/Packaging.
/build/
/dist/
*.egg-info/
.venv
.pypirc
# Operating System Junk.
*.DS_Store
Thumbs.db
desktop.ini
# IndexTTS.
/cache/
/checkpoints/*
!/checkpoints/*.yaml
/outputs/

.python-version (new file, 1 line)

@@ -0,0 +1 @@
3.10

(removed file, 65 lines)

@@ -1,65 +0,0 @@
bilibili Index-TTS Model License Agreement
Version 1.0, March 17, 2025
Copyright (c) 2025 bilibili Index
Part I: Preamble
Large generative models are being widely adopted and used, but there are also concerns about their potential misuse, whether due to technical limitations or ethical considerations. This License aims to promote open and responsible downstream use of the accompanying model.
Therefore, You and bilibili Index now agree as follows:
1. Definitions
"License" means the terms and conditions for use, reproduction, and distribution defined in this document.
"Data" means the collection of information and/or content extracted from the dataset used with the Model, including the data used to train, pre-train, or otherwise evaluate the Model. The Data is not licensed under this License.
"Output" means the results of operating the Model, as embodied in the informational content produced thereby.
"Model" means any accompanying machine-learning foundational component (including checkpoints), consisting of learned weights and parameters (including optimizer states).
"Derivatives of the Model" means all modifications of the Model released by bilibili Index under this License, works based on the Model, or any other model created or initialized by transferring the patterns of the Model's weights, parameters, activations, or Output to another model so that the other model performs similarly to this Model, including but not limited to distillation methods that use intermediate data representations, or methods that use synthetic data generated by the Model to train another model.
"Supplementary Materials" means the accompanying source code and scripts used to define, run, load, benchmark, or evaluate the Model and, if any, the accompanying documentation, tutorials, examples, etc. used to prepare data for training or evaluation.
"Distribution" means transmitting, copying, publishing, or otherwise sharing the Model or Derivatives of the Model with a third party, including providing the Model as a hosted service by electronic or other remote means, such as API- or web-based access.
"bilibili Index" or "we" means Shanghai Kuanyu Digital Technology Co., Ltd. or any of its affiliates.
"You" (or "Your") means an individual or legal entity exercising the permissions granted by this License and/or using the Model for any purpose and in any field of use, including use of the Model in end-use applications such as chatbots, translators, etc.
"Third Party" means an individual or legal entity that is not under common control with bilibili Index or You.
"Commercial Use" means use of the bilibili Index-TTS model to, directly or indirectly, operate, promote, or generate revenue for an entity or individual, or for any other for-profit purpose.
Part II: License and License Restrictions
Subject to the terms and conditions of this License Agreement, the Licensor hereby grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license. You may use this license for non-commercial purposes only. The Licensor asserts no rights over the Output You generate with the bilibili Index-TTS model or over Derivatives of the Model obtained from the bilibili Index-TTS model, but You must satisfy the following license restrictions:
1. You may not use, copy, modify, merge, publish, distribute, reproduce, or create derivatives of all or part of the bilibili Index-TTS model for any military or illegal purpose. You agree to strictly comply with the use restrictions listed in Appendix A of this Agreement when using the model licensed by bilibili Index or Derivatives of the Model.
2. If You plan to use the bilibili Index-TTS model or Derivatives of the Model for Commercial Use, You must register with the Licensor in advance through the contact information provided in the supplementary provisions of this Agreement and obtain the Licensor's written authorization.
3. Your use and modification of the bilibili Index-TTS model (including use of the Output of the bilibili Index-TTS model or of Derivatives of the Model obtained from it) must not violate the laws and regulations of any country, in particular those of the People's Republic of China, and must not infringe the lawful rights and interests of any third party, including but not limited to personality rights such as portrait, reputation, and privacy rights; intellectual property rights such as copyright, patents, and trade secrets; or other property rights.
4. You must provide any third-party user of the bilibili Index-TTS model or of Derivatives of the Model with the source of the bilibili Index-TTS model and a copy of this Agreement.
5. If You modify the bilibili Index-TTS model to obtain a Derivative of the Model, You must describe the modifications in a prominent manner; such modifications must not violate the license restrictions of this Agreement, nor may they permit, assist, or otherwise enable a third party to violate the license restrictions of this Agreement.
Part III: Intellectual Property
1. Ownership of the bilibili Index-TTS model and its related intellectual property rights belongs solely to the Licensor.
2. Under no circumstances may You, without the Licensor's prior written consent, use any of the Licensor's trademarks, service marks, trade names, domain names, website names, or other distinctive brand features (collectively, "Marks"), including but not limited to expressly or implicitly representing Yourself as the "Licensor". Without the Licensor's prior written consent, You may not display, use, or apply to register as trademarks or domain names the aforementioned Marks, alone or in combination, nor may You expressly or implicitly represent to others that You have the right to display, use, or otherwise deal with these Marks. You bear full legal liability for any loss caused to the Licensor or others by Your use of the Licensor's Marks in violation of this Agreement.
3. Within the scope of the license, You may modify the bilibili Index-TTS model to obtain Derivatives of the Model, and You may claim intellectual property rights in the parts of such Derivatives that embody Your creative work.
Part IV: Disclaimer and Limitation of Liability
1. Under no circumstances shall the Licensor be liable for any direct, indirect, or incidental consequences, or any other loss or damage, arising out of or related to Your use of the bilibili Index-TTS model under this Agreement. If the Licensor suffers losses as a result, You shall fully compensate the Licensor.
2. The model parameters in the Model are merely an example; if You need to satisfy other requirements, You must train the model Yourself and comply with the license agreements of the corresponding datasets. You are responsible for the intellectual-property risks involved in the Output of the bilibili Index-TTS model and in Derivatives of the Model, and for any direct, indirect, or incidental consequences, or other loss or damage, related thereto.
3. Although the Licensor strives to maintain the compliance and accuracy of the data at every stage of training the bilibili Index-TTS model, due to the scale of the bilibili Index-TTS model and the randomness inherent in its probabilistic nature, the accuracy of its output cannot be guaranteed, and the bilibili Index-TTS model can be misled. The Licensor therefore declares that it assumes no responsibility for data-security issues or reputational risks caused by Your use of the bilibili Index-TTS model and its source code, or for any risks and liabilities arising from the bilibili Index-TTS model being misled, misused, disseminated, or improperly exploited.
4. Loss or damage in this Agreement includes but is not limited to the following (whether such loss or damage is unforeseeable, foreseeable, known, or otherwise): (i) loss of revenue; (ii) loss of actual or anticipated profits; (iii) loss of use of money; (iv) loss of anticipated savings; (v) loss of business; (vi) loss of opportunity; (vii) loss of goodwill or reputation; (viii) loss of use of software; or (ix) any indirect, incidental, special, or consequential loss or damage.
5. Unless otherwise required by applicable law or agreed in writing by the Licensor, the Licensor licenses the bilibili Index-TTS model "as is". With respect to the bilibili Index-TTS model under this Agreement, the Licensor gives no warranties, express or implied, including but not limited to: any warranty or condition of title, any warranty or condition of merchantability, any warranty or condition of fitness for a particular purpose, any warranty of any kind, past, present, or future, that the bilibili Index-TTS model is non-infringing, and any warranty arising from any course of dealing or usage of trade (such as proposals, specifications, or samples). You alone bear the risks and consequences of exploiting the bilibili Index-TTS model through use, reproduction, redistribution, or otherwise.
6. You fully acknowledge, understand, and agree that the bilibili Index-TTS model may contain personal information. You undertake to process personal information in compliance with all applicable laws and regulations, in particular the Personal Information Protection Law of the People's Republic of China. Note that the Licensor's authorization for You to use the bilibili Index-TTS model does not mean that You have obtained a lawful basis for processing the relevant personal information. As an independent personal-information processor, You must ensure that, when processing any personal information that may be contained in the bilibili Index-TTS model, You fully comply with the requirements of the relevant laws and regulations, including but not limited to obtaining the authorization and consent of the personal-information subjects, and You are willing to bear alone any risks and consequences that may arise therefrom.
7. You fully understand and agree that the Licensor has the right, in its reasonable judgment, to deal with conduct that violates relevant laws and regulations or the provisions of this Agreement, to take appropriate legal action against Your illegal or non-compliant conduct, and to preserve relevant information and report it to the competent authorities in accordance with laws and regulations; You alone bear all legal liability arising therefrom.
Part V: Brand Exposure and Prominent Attribution
1. You agree and understand that if You release, under an open-source license in a domestic or international open-source community, a Derivative of the Model developed on the basis of the bilibili Index-TTS model, You must prominently state in that community that the Derivative was developed on the basis of the bilibili Index-TTS model, with attribution including but not limited to "bilibili Index" and other elements of the brands related to the bilibili Index-TTS model.
2. You agree and understand that if You enter a Derivative of the Model developed from the bilibili Index-TTS model in any ranking activity held by any organization or individual at home or abroad, including but not limited to rankings of model performance, accuracy, algorithms, compute, or any other dimension, You must prominently state in the model description that the Derivative was developed on the basis of the bilibili Index-TTS model, with attribution including but not limited to "bilibili Index Inside" and other elements of the brands related to the bilibili Index-TTS model.
Part VI: Miscellaneous
1. To the extent permitted by laws and regulations, the Licensor has the final right of interpretation of the terms of this Agreement.
2. The formation, validity, interpretation, performance, modification, and termination of this Agreement, the use of the bilibili Index-TTS model, and the resolution of disputes are governed by the laws of the mainland of the People's Republic of China (for the purposes of this Agreement only, excluding Hong Kong, Macau, and Taiwan), excluding its conflict-of-law rules.
3. Any dispute arising from the use of the bilibili Index-TTS model shall first be resolved through friendly negotiation between the parties. If negotiation fails, a lawsuit shall be filed with the people's court at the Licensor's place of domicile.
4. If the English version of this Agreement conflicts with the Chinese version in interpretation, the Chinese version prevails.
5. If You wish to use the bilibili Index-TTS model or its derivatives for Commercial Use subject to the license conditions and restrictions of this Agreement, please contact the Licensor as follows to register and apply for the Licensor's written authorization. Contact email: xuanwu@bilibili.com
Appendix A: Use Restrictions
You agree not to use the Model or Derivatives of the Model for the following purposes or in the following ways:
In any way that violates any applicable national or international law or regulation, or that infringes the lawful rights and interests of any third party;
For any military purpose;
In any way to exploit, harm, or attempt to exploit or harm minors;
To generate or disseminate verifiably false information and/or content with the intent of harming others;
To generate or disseminate inappropriate content that is restricted by applicable regulatory requirements;
To generate or disseminate personally identifiable information without proper authorization or for unreasonable uses;
To defame, disparage, or otherwise harass others;
For fully automated decision-making that adversely affects an individual's legal rights, or that creates or modifies binding, enforceable obligations;
For any purpose of discriminating against or harming individuals or groups on the basis of online or offline social behavior or of known or predicted personal or personality characteristics;
To exploit any vulnerability related to the age or to the social, physical, or mental characteristics of a specific group of individuals in order to materially distort the behavior of individuals belonging to that group in a manner that causes or is likely to cause them physical or psychological harm;
For any purpose intended to, or having the effect of, discriminating against individuals or groups on the basis of legally protected characteristics or categories.

LICENSE (238 changed lines)

@@ -1,201 +1,57 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
bilibili Model Use License Agreement
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
By clicking “I agree” to this bilibili Model Use License Agreement (“this Agreement”) , or by otherwise using any portion or element of the Model or any Derivative Work, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. If you do not agree to this Agreement, you must immediately cease all use and permanently delete the Model and any Derivative Works.
1. Definitions.
1. Definitions
1.1 “This Agreement”: means the bilibili Model Use License Agreement, including all of its terms and conditions.
1.2 “We”, “us”, or “our”: means bilibili , the original right-holder of the Model.
1.3 “You”: means any natural person or legal entity exercising rights granted by this Agreement and/or using the Model for any purpose and in any field of use.
1.4 “Model”: means the artificial-intelligence model named “bilibili indextts2”, including but not limited to model weights and final code, in each case only to the extent that such components are published by us at https://github.com/index-tts/index-tts.
1.5 “Derivative Work”: means any derivative of the Model, including without limitation:
(i) any modification of the Model, model outputs, or their derivatives;
(ii) any work based on the Model, model outputs, or their derivatives;
(iii) any other machine learning model which is created by re-training, fine-tuning, quantizing, LoRA, parameter-efficient fine-tuning, or any other method involving incremental weights or merged checkpoints, in each case based on the Model, model outputs, or their derivatives.
1.6 “Use”: means downloading, copying, training, modifying, creating Derivative Works, distributing, publishing, running, fine-tuning, publicly displaying, communicating to the public, or otherwise exploiting the Model or any Derivative Work.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
2. Scope of License and Restrictions
2.1 Subject to the terms and conditions of this Agreement, we grant you a worldwide, non-exclusive, non-transferable, royalty-free limited license to Use the Model or any Derivative Work based on the intellectual properties or other rights owned by Us embodied in the Model or any Derivative Work.
2.2 If You intend to Use, or have already Used, the Model or any Derivative Work, and either (i) your or any of your Affiliates' products or services had more than 100 million monthly active users in the immediately preceding calendar month, or (ii) your or any of your Affiliates' annual revenue in the immediately preceding calendar year exceeded RMB 1 billion, You must request a separate license from us, which We may grant to You in our sole discretion. You are not authorized to exercise any of the rights under this Agreement unless and until We have expressly granted You such rights in writing.
2.3 This Agreement is an open-source license for the Model in which we possess intellectual properties and other rights. It governs your Use of the Model only and does not limit any rights that we have regarding the Model.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
3. Disclaimer and Risk Allocation
3.1 The Model and any outputs generated thereby are provided “AS IS,” without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, absence of errors or omissions, continuity, accuracy, reliability, or stability. You are solely responsible for determining the appropriateness of using or redistributing the Model and assume all risks associated with exercising any rights granted under this Agreement.
3.2 You shall bear sole responsibility for any infringement, illegality, breach of contract, damages, fines, regulatory investigations, or other liabilities (including, without limitation, infringement of third-party patents, copyrights, trademarks, trade secrets, personality rights, data-protection rights, or any other rights) arising out of or related to your Use of the Model or any outputs generated thereby. We assume no joint, several, supplementary, or advance payment liability.
3.3 Under no circumstances shall we be liable to you or any third party for any direct, indirect, incidental, special, punitive, or consequential damages (including, without limitation, loss of data, business interruption, or loss of profits) arising out of or related to the Use of the Model, even if we have been advised of the possibility of such damages.
3.4 Additional Obligations for You and Downstream Recipients
a) You must ensure that any downstream recipient of the Model or any Derivative Work that you distribute complies with this Agreement, and you must impose appropriate contractual terms on such downstream recipients. If any downstream recipient breaches this Agreement, you shall be responsible for the consequences thereof.
b) You must retain all original copyright notices and a copy of this Agreement in every copy of the Model or any Derivative Work that you Use.
c) You may not Use the bilibili indextts2 or any Derivative Work to improve any AI model, except for the bilibili indextts2 itself, its Derivative Works, or non-commercial AI models.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
4. Compliance Obligations
4.1 Usage Restrictions
a) If you distribute a Derivative Work, you must clearly state in the distribution page or accompanying documentation: “Any modifications made to the original model in this Derivative Work are not endorsed, warranted, or guaranteed by the original right-holder of the original model, and the original right-holder disclaims all liability related to this Derivative Work.”
b) If your Use of the Model or any Derivative Work incorporates any third-party data or weights, you must obtain all necessary authorizations on your own and bear full responsibility for compliance.
c) You may not Use the Model or any Derivative Work for any purpose that violates the laws or regulatory requirements of the jurisdiction where the outputs and/or the Model are generated or used (including, without limitation, generating false information, discriminatory content, or content that infringes privacy).
d) If the Model or any Derivative Work is capable of generating content, you must ensure that such content does not violate the laws or regulatory requirements of the applicable jurisdiction (including, without limitation, generating false information, discriminatory content, or content that infringes privacy).
4.2 Prohibited High-Risk Use
You must ensure that the Model and any Derivative Work are not deployed, directly or indirectly, in high-risk scenarios such as medical diagnosis, autonomous driving, military applications, critical-infrastructure control, large-scale biometric surveillance, or automated decision-making (e.g., credit or employment evaluations). If you insist on such deployment, you must independently complete all compliance obligations under applicable laws and regulations (including but not limited to GDPR, CCPA, HIPAA, export-control laws, and AI-specific regulations), and we shall bear no liability for any consequences arising therefrom.
4.3 Infringement Liability
Should any third party raise claims against you with respect to any Derivative Work you develop or your Use of the Model or any Derivative Work, you shall bear full and independent responsibility for defending against and resolving such claims. If your actions cause us to incur any third-party claims, administrative penalties, or other losses, you shall indemnify us for all losses we thereby suffer, including but not limited to attorney fees, litigation costs, damages, and fines, and shall take all necessary measures to eliminate any adverse impact on us.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
5. Reserved Rights
5.1 We reserve the right to revoke the license granted to you under this Agreement in the event of your breach. Upon revocation, you must immediately cease all Use and permanently delete all copies of the Model and any Derivative Work. Sections 3 and 6 of this Agreement shall survive termination of this Agreement under this circumstance.
5.2 Nothing in this Agreement grants you any right to use our trade names, trademarks, service marks, or product names, except as reasonably and customarily required to describe the origin of the Model or any Derivative Work—such as reproducing the content of a NOTICE file under Section 3.4 of this Agreement.
5.3 If you or any of your Affiliates institutes or participates in any legal proceeding (including any cross-claim or counterclaim in a lawsuit) against us or any of our Affiliates, alleging that the Model or any output or any portion thereof infringes any intellectual property or other rights that you own or control, all licenses granted to you under this Agreement shall terminate automatically as of the date such proceeding is filed.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
6. Governing Law and Dispute Resolution
6.1 This Agreement shall be governed by and construed in accordance with the laws of the People's Republic of China.
6.2 In the event of any dispute arising out of or in connection with this Agreement, the parties shall first attempt to resolve such dispute through friendly negotiation. If negotiation fails, the dispute shall be submitted to the Shanghai Arbitration Commission for arbitration in accordance with its then-effective arbitration rules. The arbitration award shall be final and binding on both parties. The prevailing party shall be entitled to recover reasonable costs, including notarization and investigation fees, arbitration costs, attorneys' fees, and travel expenses.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
7. Severability
If any provision of this Agreement is held to be invalid or unenforceable, the remaining provisions shall remain in full force and effect. The invalid or unenforceable provision shall be replaced with a valid and enforceable provision that, to the maximum extent permitted by law, most closely reflects the original intent of the invalid or unenforceable provision.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
8. Version Updates
We may release new versions of the AI Model Use License Agreement. Any new version will apply only to Uses occurring after the date of its release. If you obtained the Model under an earlier version, the new version will not have retroactive effect; nevertheless, you are encouraged to adopt the new version voluntarily.
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
9. Language Version
In the event of any discrepancy or conflict between the English-language version set forth above and the Chinese-language version of this bilibili Model Use License Agreement, the Chinese-language version shall prevail for all purposes and shall govern the rights and obligations of the parties.

52
LICENSE_ZH.txt Normal file

@@ -0,0 +1,52 @@
bilibili模型使用许可协议
若您点击同意《bilibili模型使用许可协议》“本协议”或使用我方模型或衍生品的任何部分或元素即视为您已确认并接受本协议内容本协议立即生效。若您不同意本协议应立即停止使用并删除模型及衍生品。
1.定义
1.1 本协议指《bilibili 模型使用许可协议》,包括本协议所规定的所有条款和条件。
1.2 我方指bilibili即模型的原始权利人。
1.3 您:指行使本许可协议授予的权利和/或使用“模型”的自然人或法人实体。
1.4 模型指名为“bilibili indextts2”的AI模型包括模型权重、最终代码等组件具体范围以我方在https://github.com/index-tts/index-tts发布的组件为限。
1.5 衍生品指模型的衍生品包括但不限于i对模型、模型输出及其衍生品的修改ii基于模型、模型输出及其衍生品的创作iii对模型、模型输出及其衍生品再训练、微调、量化、LoRA、参数高效微调、以任何增量权重或合并的检查点等方式创建的任何模型。
1.6 使用:指通过下载、复制、训练、修改、创作衍生品、分发、发布、运行、微调、公开展示、传播或以其他方式利用本模型或其衍生品的行为。
2. 许可范围和限制
2.1 根据本协议的条款与条件,基于对模型或其衍生品中包含的我方拥有的任何知识产权和其他权利,我方特此授予您一项全球范围、非独占、不可转让、免费的使用许可。
2.2若您拟使用或者已使用我方模型或其衍生品如果您或者您的关联方提供的产品或服务在前一自然月的月活跃用户数超过1亿或者如果您或者您的关联方在上一自然年的年收入超过1亿人民币的您必须向我方申请该模型或其衍生品的商业许可我方可自行决定是否授予您该许可。您无权行使本协议项下的任何权利除非我方另行明确授予您该等许可。
2.3 本协议作为我方享有知识产权和其他权利的模型的开源许可协议,仅约束您对我方模型的使用行为,并不限制我方对该模型享有的任何权利。
3. 免责声明与风险约定
3.1 模型及其任何输出均“按原样”提供,我方及其关联方不提供任何形式的明示或暗示的保证,包括但不限于适销性、特定用途适用性、不侵权、没有错误或疏漏、持续性、准确性、可靠性、稳定性的保证。您需自行负责判断使用或再分发本作品的适当性,并承担行使本许可证所授予权限相关的所有风险。
3.2 您因使用模型或利用其输出内容而产生的任何侵权、违法、违约、赔偿、罚款、监管调查或其他法律责任(包括但不限于侵犯第三方专利、版权、商标、商业秘密、人格权、数据保护权等),均由您独自承担。我方不承担任何连带责任、补充责任或垫付责任。
3.3 在任何情况下,我方对因使用本模型而产生的任何直接、间接、附带、特殊、惩罚性或后果性损失(包括但不限于数据丢失、业务中断、利润损失等)不承担责任,即使我方已被告知该等损失的可能性。
3.4 对您和下游用户的其他约束
a)您应确保下游用户在使用您发布的本模型或您基于本模型开发的衍生品时,同样遵守本协议的相关规定,并通过合适的协议或条款对下游用户进行约束。若下游用户违反本协议规定,您需承担相应责任。
b)您需在您使用的本模型或您基于本模型开发的衍生品的所有副本中保留原始版权声明及本使用许可协议。
c您不得使用bilibili indextts2或其衍生品来改进任何AI模型bilibili indextts2或其衍生品、非商业用途的AI模型除外
4. 合规义务
4.1使用限制
a) 若您发布模型的衍生品,必须在发布页面或附随文档中清晰声明“该衍生品对原模型所作的任何改动与原模型原始权利人无关,原始权利人对该衍生品不背书、不担保、不承担责任”。
b) 若您使用模型或模型衍生品的过程中引入任何第三方数据或权重,您须自行取得合法授权并承担全部合规责任。
c) 不得将模型及模型衍生品用于违反输出地/使用地法律或监管要求的用途(包括但不限于生成虚假信息、歧视性内容、侵犯隐私等)。
d) 若模型或模型衍生品具备生成内容功能,您须确保其输出内容不违反输出地/使用地法律或监管要求的用途(包括但不限于生成虚假信息、歧视性内容、侵犯隐私等)。
4.2 禁止高风险场景
您须自行确保不在医疗诊断、自动驾驶、军事、关键基础设施控制、大规模生物识别监控、自动化决策(如信贷、就业评估)等高风险场景直接部署本模型及其衍生品。若您坚持部署,应自行完成符合适用法规(包括 GDPR、CCPA、HIPAA、出口管制、AI 特定法规等)的全部合规要求,我方对因此产生的任何后果概不负责。
4.3 侵权责任
如第三方就您开发的模型衍生品或您使用模型或其衍生品等行为主张权利,您应独立承担全部责任。若因您的行为导致我方遭受任何第三方索赔、行政处罚或其他损失,您应负责赔偿我方因此遭受的全部损失,包括但不限于律师费、诉讼费、赔偿金、罚款等,并采取一切必要措施消除对我方的负面影响。
5. 保留权利
5.1我方保留在您违反协议的情况下撤销本协议对您授权之权利。协议撤销后您必须立即删除并停止使用材料。在本协议终止后本协议第3条、第6条仍然有效。
5.2 本许可证不授予使用我方的商号、商标、服务标记或产品名称的权限除非在合理且惯例性地描述模型或衍生品的来源例如本许可证3.4的规定,以及复制 NOTICE 文件内容时需要使用。
5.3 若您或您的关联方对我方或我方任何关联实体提起诉讼或其他程序(包括诉讼中的交叉索赔或反诉),主张模型或其任何输出结果或其任何部分侵犯了您拥有或可许可的知识产权或其他权利,则本协议授予您的所有许可自该诉讼或程序提起之日起终止。
6. 法律适用与争议解决
6.1 本协议适用中华人民共和国法律法规。
6.2 在本协议履行中,若发生争议,双方应本着友好协商的原则解决问题;如协商不成,双方均应将争议提交至上海仲裁委员会根据其仲裁规则进行仲裁,仲裁是一裁终局的,对双方均有约束力。由仲裁败诉方承担本次仲裁产生的公证调查费、仲裁费、律师费、差旅费等实际产生费用。
7. 可分割性
若本协议任何条款被认定为无效或不可执行,不影响其余条款之效力;无效部分应在法律允许的最大范围内按最接近原意的有效条款替代。
8. 协议版本更新
我方可发布新版 AI模型使用许可协议。新版仅适用于发布后新产生的使用行为若您已按旧版获取模型新版协议并无溯及力但鼓励您主动更新。

View File

@@ -1,3 +1,3 @@
global-exclude *~ *.py[cod]
include indextts/BigVGAN/alias_free_activation/cuda/*.cu indextts/BigVGAN/alias_free_activation/cuda/*.cpp
include indextts/BigVGAN/alias_free_activation/cuda/*.h
include *.cu *.cpp
include *.h *.hpp

550
README.md

@@ -1,247 +1,479 @@
<div align="center">
<img src='assets/index_icon.png' width="250"/>
</div>
<div align="center">
<a href="docs/README_zh.md" style="font-size: 24px">简体中文</a> |
<a href="README.md" style="font-size: 24px">English</a>
</div>
<h2><center>IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System</center></h2>
## 👉🏻 IndexTTS2 👈🏻
<p align="center">
<a href='https://arxiv.org/abs/2502.05512'><img src='https://img.shields.io/badge/ArXiv-2502.05512-red'></a>
<center><h3>IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech</h3></center>
## 👉🏻 IndexTTS 👈🏻
[![IndexTTS2](assets/IndexTTS2_banner.png)](assets/IndexTTS2_banner.png)
<div align="center">
<a href='https://arxiv.org/abs/2506.21619'>
<img src='https://img.shields.io/badge/ArXiv-2506.21619-red?logo=arxiv'/>
</a>
<br/>
<a href='https://github.com/index-tts/index-tts'>
<img src='https://img.shields.io/badge/GitHub-Code-orange?logo=github'/>
</a>
<a href='https://index-tts.github.io/index-tts2.github.io/'>
<img src='https://img.shields.io/badge/GitHub-Demo-orange?logo=github'/>
</a>
<br/>
<a href='https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo'>
<img src='https://img.shields.io/badge/HuggingFace-Demo-blue?logo=huggingface'/>
</a>
<a href='https://huggingface.co/IndexTeam/IndexTTS-2'>
<img src='https://img.shields.io/badge/HuggingFace-Model-blue?logo=huggingface' />
</a>
<br/>
<a href='https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo'>
<img src='https://img.shields.io/badge/ModelScope-Demo-purple?logo=modelscope'/>
</a>
<a href='https://modelscope.cn/models/IndexTeam/IndexTTS-2'>
<img src='https://img.shields.io/badge/ModelScope-Model-purple?logo=modelscope'/>
</a>
</div>
### Abstract
Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing.
This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control.
The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt.
Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. In the zero-shot setting, the model can accurately reconstruct the target timbre (from the timbre prompt) while perfectly reproducing the specified emotional tone (from the style prompt).
To enhance speech clarity in highly emotional expressions, we incorporate GPT latent representations and design a novel three-stage training paradigm to improve the stability of the generated speech. Additionally, to lower the barrier for emotional control, we designed a soft instruction mechanism based on text descriptions by fine-tuning Qwen3, effectively guiding the generation of speech with the desired emotional orientation.
Finally, experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity. Audio samples are available at: <a href="https://index-tts.github.io/index-tts2.github.io/">IndexTTS2 demo page</a>.
**Tips:** Please contact the authors for more detailed information. For commercial usage and cooperation, please contact <u>indexspeech@bilibili.com</u>.
### Feel IndexTTS2
<div align="center">
**IndexTTS2: The Future of Voice, Now Generating**
[![IndexTTS2 Demo](assets/IndexTTS2-video-pic.png)](https://www.bilibili.com/video/BV136a9zqEk5)
*Click the image to watch the IndexTTS2 introduction video.*
</div>
[[HuggingFace Demo]](https://huggingface.co/spaces/IndexTeam/IndexTTS) [[ModelScope Demo]](https://modelscope.cn/studios/IndexTeam/IndexTTS-Demo) \
[[Paper]](https://arxiv.org/abs/2502.05512) [[Demos]](https://index-tts.github.io)
**IndexTTS** is a GPT-style text-to-speech (TTS) model mainly based on XTTS and Tortoise. It is capable of correcting the pronunciation of Chinese characters using pinyin and controlling pauses at any position through punctuation marks. We enhanced multiple modules of the system, including the improvement of speaker condition feature representation, and the integration of BigVGAN2 to optimize audio quality. Trained on tens of thousands of hours of data, our system achieves state-of-the-art performance, outperforming current popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS.
<span style="font-size:16px;">
Experience **IndexTTS**: Please contact <u>xuanwu@bilibili.com</u> for more detailed information. </span>
### Contact
QQ Group (No. 2): 1048202584 \
QQ Groups: 663272642 (No. 4), 1013410623 (No. 5) \
Discord: https://discord.gg/uT32E7KDmy \
Recruitment (résumés): indexspeech@bilibili.com \
Email: indexspeech@bilibili.com \
You are welcome to join our community! 🌏 \
Everyone is welcome to join the discussion!
> [!CAUTION]
> Thank you for your support of the bilibili indextts project!
> Please note that the **only official channel** maintained by the core team is: [https://github.com/index-tts/index-tts](https://github.com/index-tts/index-tts).
> ***Any other websites or services are not official***, and we cannot guarantee their security, accuracy, or timeliness.
> For the latest updates, please always refer to this official repository.
## 📣 Updates
- `2025/05/14` 🔥🔥 We released **IndexTTS-1.5**, significantly improving the model's stability and its performance in the English language.
- `2025/03/25` 🔥 We release IndexTTS-1.0 model parameters and inference code.
- `2025/02/12` 🔥 We submitted our paper on arXiv, and released our demos and test sets.
- `2025/09/08` 🔥🔥🔥 We release **IndexTTS-2** to the world!
- The first autoregressive TTS model with precise synthesis duration control, supporting both controllable and uncontrollable modes. <i>This functionality is not yet enabled in this release.</i>
- The model achieves highly expressive emotional speech synthesis, with emotion-controllable capabilities enabled through multiple input modalities.
- `2025/05/14` 🔥🔥 We release **IndexTTS-1.5**, significantly improving the model's stability and its performance in the English language.
- `2025/03/25` 🔥 We release **IndexTTS-1.0** with model weights and inference code.
- `2025/02/12` 🔥 We submitted our paper to arXiv, and released our demos and test sets.
## 🖥️ Method
The overview of IndexTTS is shown as follows.
## 🖥️ Neural Network Architecture
Architectural overview of IndexTTS2, our state-of-the-art speech model:
<picture>
<img src="assets/IndexTTS.png" width="800"/>
<img src="assets/IndexTTS2.png" width="800"/>
</picture>
The main improvements and contributions are summarized as follows:
- In Chinese scenarios, we have introduced a character-pinyin hybrid modeling approach. This allows for quick correction of mispronounced characters.
- **IndexTTS** incorporates a conformer conditioning encoder and a BigVGAN2-based speech-code decoder. This improves training stability, voice timbre similarity, and sound quality.
- We release all test sets here, including those for polysyllabic words, subjective and objective test sets.
The key contributions of **IndexTTS2** are summarized as follows:
- We propose a duration adaptation scheme for autoregressive TTS models. IndexTTS2 is the first autoregressive zero-shot TTS model to combine precise duration control with natural duration generation, and the method is scalable for any autoregressive large-scale TTS model.
- The emotional and speaker-related features are decoupled from the prompts, and a feature fusion strategy is designed to maintain semantic fluency and pronunciation clarity during emotionally rich expressions. Furthermore, a tool was developed for emotion control, utilizing natural language descriptions for the benefit of users.
- To address the lack of highly expressive speech data, we propose an effective training strategy, significantly enhancing the emotional expressiveness of zero-shot TTS to a state-of-the-art (SOTA) level.
- We will publicly release the code and pre-trained weights to facilitate future research and practical applications.
## Model Download
| 🤗**HuggingFace** | **ModelScope** |
| **HuggingFace** | **ModelScope** |
|----------------------------------------------------------|----------------------------------------------------------|
| [😁 IndexTTS-2](https://huggingface.co/IndexTeam/IndexTTS-2) | [IndexTTS-2](https://modelscope.cn/models/IndexTeam/IndexTTS-2) |
| [IndexTTS-1.5](https://huggingface.co/IndexTeam/IndexTTS-1.5) | [IndexTTS-1.5](https://modelscope.cn/models/IndexTeam/IndexTTS-1.5) |
| [IndexTTS](https://huggingface.co/IndexTeam/Index-TTS) | [IndexTTS](https://modelscope.cn/models/IndexTeam/Index-TTS) |
| [😁IndexTTS-1.5](https://huggingface.co/IndexTeam/IndexTTS-1.5) | [IndexTTS-1.5](https://modelscope.cn/models/IndexTeam/IndexTTS-1.5) |
## 📑 Evaluation
**Word Error Rate (WER) Results for IndexTTS and Baseline Models on the** [**seed-test**](https://github.com/BytedanceSpeech/seed-tts-eval)
| **WER** | **test_zh** | **test_en** | **test_hard** |
|:----------------------:|:-----------:|:-----------:|:-------------:|
| **Human** | 1.26 | 2.14 | - |
| **SeedTTS** | 1.002 | 1.945 | **6.243** |
| **CosyVoice 2** | 1.45 | 2.57 | 6.83 |
| **F5TTS** | 1.56 | 1.83 | 8.67 |
| **FireRedTTS** | 1.51 | 3.82 | 17.45 |
| **MaskGCT** | 2.27 | 2.62 | 10.27 |
| **Spark-TTS** | 1.2 | 1.98 | - |
| **MegaTTS 3** | 1.36 | 1.82 | - |
| **IndexTTS** | 0.937 | 1.936 | 6.831 |
| **IndexTTS-1.5** | **0.821** | **1.606** | 6.565 |
**Word Error Rate (WER) Results for IndexTTS and Baseline Models on other open-source test sets**
| **Model** | **aishell1_test** | **commonvoice_20_test_zh** | **commonvoice_20_test_en** | **librispeech_test_clean** | **avg** |
|:---------------:|:-----------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------:|
| **Human** | 2.0 | 9.5 | 10.0 | 2.4 | 5.1 |
| **CosyVoice 2** | 1.8 | 9.1 | 7.3 | 4.9 | 5.9 |
| **F5TTS** | 3.9 | 11.7 | 5.4 | 7.8 | 8.2 |
| **Fishspeech** | 2.4 | 11.4 | 8.8 | 8.0 | 8.3 |
| **FireRedTTS** | 2.2 | 11.0 | 16.3 | 5.7 | 7.7 |
| **XTTS** | 3.0 | 11.4 | 7.1 | 3.5 | 6.0 |
| **IndexTTS** | 1.3 | 7.0 | 5.3 | 2.1 | 3.7 |
| **IndexTTS-1.5** | **1.2** | **6.8** | **3.9** | **1.7** | **3.1** |
**Speaker Similarity (SS) Results for IndexTTS and Baseline Models**
| **Model** | **aishell1_test** | **commonvoice_20_test_zh** | **commonvoice_20_test_en** | **librispeech_test_clean** | **avg** |
|:---------------:|:-----------------:|:--------------------------:|:--------------------------:|:--------------------------:|:---------:|
| **Human** | 0.846 | 0.809 | 0.820 | 0.858 | 0.836 |
| **CosyVoice 2** | **0.796** | 0.743 | 0.742 | **0.837** | **0.788** |
| **F5TTS** | 0.743 | **0.747** | 0.746 | 0.828 | 0.779 |
| **Fishspeech** | 0.488 | 0.552 | 0.622 | 0.701 | 0.612 |
| **FireRedTTS** | 0.579 | 0.593 | 0.587 | 0.698 | 0.631 |
| **XTTS** | 0.573 | 0.586 | 0.648 | 0.761 | 0.663 |
| **IndexTTS** | 0.744 | 0.742 | **0.758** | 0.823 | 0.776 |
| **IndexTTS-1.5** | 0.741 | 0.722 | 0.753 | 0.819 | 0.771 |
**MOS Scores for Zero-Shot Cloned Voice**
| **Model** | **Prosody** | **Timbre** | **Quality** | **AVG** |
|-----------------|:-----------:|:----------:|:-----------:|:---------:|
| **CosyVoice 2** | 3.67 | 4.05 | 3.73 | 3.81 |
| **F5TTS** | 3.56 | 3.88 | 3.56 | 3.66 |
| **Fishspeech** | 3.40 | 3.63 | 3.69 | 3.57 |
| **FireRedTTS** | 3.79 | 3.72 | 3.60 | 3.70 |
| **XTTS** | 3.23 | 2.99 | 3.10 | 3.11 |
| **IndexTTS** | **3.79** | **4.20** | **4.05** | **4.01** |
## Usage Instructions
### Environment Setup
1. Download this repository:
```bash
git clone https://github.com/index-tts/index-tts.git
```
2. Install dependencies:
Create a new conda environment and install dependencies:
### ⚙️ Environment Setup
1. Ensure that you have both [git](https://git-scm.com/downloads)
and [git-lfs](https://git-lfs.com/) on your system.
The Git-LFS plugin must also be enabled on your current user account:
```bash
conda create -n index-tts python=3.10
conda activate index-tts
apt-get install ffmpeg
# or use conda to install ffmpeg
conda install -c conda-forge ffmpeg
git lfs install
```
Install [PyTorch](https://pytorch.org/get-started/locally/), e.g.:
2. Download this repository:
```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull # download large repository files
```
> [!NOTE]
> If you are using Windows you may encounter [an error](https://github.com/index-tts/index-tts/issues/61) when installing `pynini`:
`ERROR: Failed building wheel for pynini`
> In this case, please install `pynini` via `conda`:
3. Install the [uv package manager](https://docs.astral.sh/uv/getting-started/installation/).
It is *required* for a reliable, modern installation environment.
> [!TIP]
> **Quick & Easy Installation Method:**
>
> There are many convenient ways to install the `uv` command on your computer.
> Please check the link above to see all options. Alternatively, if you want
> a very quick and easy method, you can install it as follows:
>
> ```bash
> # after conda activate index-tts
> conda install -c conda-forge pynini==2.1.6
> pip install WeTextProcessing --no-deps
> pip install -U uv
> ```
Install `IndexTTS` as a package:
```bash
cd index-tts
pip install -e .
```
> [!WARNING]
> We **only** support the `uv` installation method. Other tools, such as `conda`
> or `pip`, don't provide any guarantees that they will install the correct
> dependency versions. You will almost certainly have *random bugs, error messages,*
> ***missing GPU acceleration**, and various other problems* if you don't use `uv`.
> Please *do not report any issues* if you use non-standard installations, since
> almost all such issues are invalid.
>
> Furthermore, `uv` is [up to 115x faster](https://github.com/astral-sh/uv/blob/main/BENCHMARKS.md)
> than `pip`, which is another *great* reason to embrace the new industry-standard
> for Python project management.
3. Download models:
4. Install required dependencies:
Download by `huggingface-cli`:
We use `uv` to manage the project's dependency environment. The following command
will *automatically* create a `.venv` directory inside the project and then install
the correct versions of Python and all required dependencies:
```bash
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints
uv sync --all-extras
```
Recommended for users in China: if the download is slow, you can use a mirror:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```
Or by `wget`:
If the download is slow, please try a *local mirror*, for example any of these
local mirrors in China (choose one mirror from the list below):
```bash
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_discriminator.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_generator.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bpe.model -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/dvae.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/gpt.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/unigram_12000.vocab -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/config.yaml -P checkpoints
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"
uv sync --all-extras --default-index "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
```
> [!TIP]
> **Available Extra Features:**
>
> - `--all-extras`: Automatically adds *every* extra feature listed below. You can
> remove this flag if you want to customize your installation choices.
> - `--extra webui`: Adds WebUI support (recommended).
> - `--extra deepspeed`: Adds DeepSpeed support (may speed up inference on some
> systems).
> [!IMPORTANT]
> **Important (Windows):** The DeepSpeed library may be difficult to install for
> some Windows users. You can skip it by removing the `--all-extras` flag. If you
> want any of the other extra features above, you can manually add their specific
> feature flags instead.
>
> **Important (Linux/Windows):** If you see an error about CUDA during the installation,
> please ensure that you have installed NVIDIA's [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit)
> version **12.8** (or newer) on your system.
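For example, a hedged sketch of a selective install that keeps the WebUI but skips DeepSpeed (using only the extras listed in the tip above):
```bash
# Install the project with the WebUI extra only, skipping DeepSpeed (useful on Windows):
uv sync --extra webui
```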
5. Download the required models via [uv tool](https://docs.astral.sh/uv/guides/tools/#installing-tools):
Download via `huggingface-cli`:
```bash
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
```
Or download via `modelscope`:
```bash
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
```
> [!IMPORTANT]
> If the commands above aren't available, please carefully read the `uv tool`
> output. It will tell you how to add the tools to your system's path.
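As a sketch, assuming a reasonably recent `uv` release that provides this subcommand, you can also ask `uv` to update your shell configuration so its tool directory is on the `PATH`:
```bash
# Add uv's tool executable directory to your shell's PATH, then restart the shell:
uv tool update-shell
```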
> [!NOTE]
> If you prefer to use the `IndexTTS-1.0` model, please replace `IndexTeam/IndexTTS-1.5` with `IndexTeam/IndexTTS` in the above commands.
> In addition to the above models, some small models will also be automatically
> downloaded when the project is run for the first time. If your network environment
> has slow access to HuggingFace, it is recommended to execute the following
> command before running the code:
>
> ```bash
> export HF_ENDPOINT="https://hf-mirror.com"
> ```
4. Run test script:
#### 🖥️ Checking PyTorch GPU Acceleration
If you need to diagnose your environment to see which GPUs are detected,
you can use our included utility to check your system:
```bash
# Please put your prompt audio in 'test_data' and rename it to 'input.wav'
python indextts/infer.py
uv run tools/gpu_check.py
```
5. Use as command line tool:
### 🔥 IndexTTS2 Quickstart
#### 🌐 Web Demo
```bash
# Make sure pytorch has been installed before running this command
indextts "大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了" \
--voice reference_voice.wav \
--model_dir checkpoints \
--config checkpoints/config.yaml \
--output output.wav
```
Use `--help` to see more options.
```bash
indextts --help
```
#### Web Demo
```bash
pip install -e ".[webui]" --no-build-isolation
python webui.py
# use another model version:
python webui.py --model_dir IndexTTS-1.5
uv run webui.py
```
Open your browser and visit `http://127.0.0.1:7860` to see the demo.
You can also adjust the settings to enable features such as FP16 inference (lower
VRAM usage), DeepSpeed acceleration, compiled CUDA kernels for speed, etc. All
available options can be seen via the following command:
```bash
uv run webui.py -h
```
Have fun!
> [!IMPORTANT]
> It can be very helpful to use **FP16** (half-precision) inference. It is faster
> and uses less VRAM, with a very small quality loss.
>
> **DeepSpeed** *may* also speed up inference on some systems, but it could also
> make it slower. The performance impact is highly dependent on your specific
> hardware, drivers and operating system. Please try with and without it,
> to discover what works best on your personal system.
>
> Lastly, be aware that *all* `uv` commands will **automatically activate** the correct
> per-project virtual environments. Do *not* manually activate any environments
> before running `uv` commands, since that could lead to dependency conflicts!
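These options map onto constructor flags of the Python API described in the next section. A minimal sketch of enabling them (the best combination depends on your hardware, as noted above):
```python
from indextts.infer_v2 import IndexTTS2

# Sketch: FP16 and compiled CUDA kernels enabled; toggle DeepSpeed yourself
# and benchmark, since it can help or hurt depending on the system.
tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=True,         # lower VRAM usage, small quality loss
    use_cuda_kernel=True,  # compiled CUDA kernels for speed
    use_deepspeed=False,   # try True on your own hardware
)
```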
#### 📝 Using IndexTTS2 in Python
To run scripts, you *must* use the `uv run <file.py>` command to ensure that
the code runs inside your current "uv" environment. It *may* sometimes also be
necessary to add the current directory to your `PYTHONPATH`, to help it find
the IndexTTS modules.
Example of running a script via `uv`:
```bash
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
```
Here are several examples of how to use IndexTTS2 in your own scripts:
1. Synthesize new speech with a single reference audio file (voice cloning):
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "Translate for me, what is a surprise!"
tts.infer(spk_audio_prompt='examples/voice_01.wav', text=text, output_path="gen.wav", verbose=True)
```
2. Using a separate, emotional reference audio file to condition the speech synthesis:
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "酒楼丧尽天良,开始借机竞拍房间,哎,一群蠢货。"
tts.infer(spk_audio_prompt='examples/voice_07.wav', text=text, output_path="gen.wav", emo_audio_prompt="examples/emo_sad.wav", verbose=True)
```
3. When an emotional reference audio file is specified, you can optionally set
the `emo_alpha` to adjust how much it affects the output.
Valid range is `0.0 - 1.0`, and the default value is `1.0` (100%):
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "酒楼丧尽天良,开始借机竞拍房间,哎,一群蠢货。"
tts.infer(spk_audio_prompt='examples/voice_07.wav', text=text, output_path="gen.wav", emo_audio_prompt="examples/emo_sad.wav", emo_alpha=0.9, verbose=True)
```
4. It's also possible to omit the emotional reference audio and instead provide
an 8-float list specifying the intensity of each emotion, in the following order:
`[happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]`.
You can additionally use the `use_random` parameter to introduce stochasticity
during inference; the default is `False`, and setting it to `True` enables
randomness:
> [!NOTE]
> Enabling random sampling will reduce the voice cloning fidelity of the speech
> synthesis.
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "哇塞!这个爆率也太高了!欧皇附体了!"
tts.infer(spk_audio_prompt='examples/voice_10.wav', text=text, output_path="gen.wav", emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0], use_random=False, verbose=True)
```
5. Alternatively, you can enable `use_emo_text` to guide the emotions based on
your provided `text` script. Your text script will then automatically
be converted into emotion vectors.
It's recommended to use `emo_alpha` around 0.6 (or lower) when using the text
emotion modes, for more natural sounding speech.
You can introduce randomness with `use_random` (default: `False`;
`True` enables randomness):
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "快躲起来!是他要来了!他要来抓我们了!"
tts.infer(spk_audio_prompt='examples/voice_12.wav', text=text, output_path="gen.wav", emo_alpha=0.6, use_emo_text=True, use_random=False, verbose=True)
```
6. It's also possible to directly provide a specific text emotion description
via the `emo_text` parameter. Your emotion text will then automatically be
converted into emotion vectors. This gives you separate control of the text
script and the text emotion description:
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "快躲起来!是他要来了!他要来抓我们了!"
emo_text = "你吓死我了!你是鬼吗?"
tts.infer(spk_audio_prompt='examples/voice_12.wav', text=text, output_path="gen.wav", emo_alpha=0.6, use_emo_text=True, emo_text=emo_text, use_random=False, verbose=True)
```
> [!TIP]
> **Pinyin Usage Notes:**
>
> IndexTTS2 still supports mixed modeling of Chinese characters and Pinyin.
> When you need precise pronunciation control, please provide text with specific Pinyin annotations to activate the Pinyin control feature.
> Note that Pinyin control does not work for every possible consonant-vowel combination; only valid Chinese Pinyin cases are supported.
> For the full list of valid entries, please refer to `checkpoints/pinyin.vocab`.
>
> Example:
> ```
> 之前你做DE5很好所以这一次也DEI3做DE2很好才XING2如果这次目标完成得不错的话我们就直接打DI1去银行取钱。
> ```
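As a sketch, the pinyin-annotated sentence from the tip above can be passed to `tts.infer()` exactly like plain text, reusing the voice-cloning setup from example 1 (the reference audio path is only illustrative):
```python
from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
# Mixed Chinese-character/pinyin text; annotated syllables must be valid entries in checkpoints/pinyin.vocab.
text = "之前你做DE5很好所以这一次也DEI3做DE2很好才XING2如果这次目标完成得不错的话我们就直接打DI1去银行取钱。"
tts.infer(spk_audio_prompt='examples/voice_01.wav', text=text, output_path="gen_pinyin.wav", verbose=True)
```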
### Legacy: IndexTTS1 User Guide
You can also use our previous IndexTTS1 model by importing a different module:
#### Sample Code
```python
from indextts.infer import IndexTTS
tts = IndexTTS(model_dir="checkpoints",cfg_path="checkpoints/config.yaml")
voice="reference_voice.wav"
text="大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了比如说现在正在说话的其实是B站为我现场复刻的数字分身简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能可以访问 bilibili studio相信我你们也会吃惊的。"
tts.infer(voice, text, output_path)
voice = "examples/voice_07.wav"
text = "大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了比如说现在正在说话的其实是B站为我现场复刻的数字分身简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能可以访问 bilibili studio相信我你们也会吃惊的。"
tts.infer(voice, text, 'gen.wav')
```
## Acknowledge
For more detailed information, see [README_INDEXTTS_1_5](archive/README_INDEXTTS_1_5.md),
or visit the IndexTTS1 repository at <a href="https://github.com/index-tts/index-tts/tree/v1.5.0">index-tts:v1.5.0</a>.
## Our Releases and Demos
### IndexTTS2: [[Paper]](https://arxiv.org/abs/2506.21619); [[Demo]](https://index-tts.github.io/index-tts2.github.io/); [[ModelScope]](https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo); [[HuggingFace]](https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo)
### IndexTTS1: [[Paper]](https://arxiv.org/abs/2502.05512); [[Demo]](https://index-tts.github.io/); [[ModelScope]](https://modelscope.cn/studios/IndexTeam/IndexTTS-Demo); [[HuggingFace]](https://huggingface.co/spaces/IndexTeam/IndexTTS)
## Acknowledgements
1. [tortoise-tts](https://github.com/neonbjb/tortoise-tts)
2. [XTTSv2](https://github.com/coqui-ai/TTS)
3. [BigVGAN](https://github.com/NVIDIA/BigVGAN)
4. [wenet](https://github.com/wenet-e2e/wenet/tree/main)
5. [icefall](https://github.com/k2-fsa/icefall)
6. [maskgct](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)
7. [seed-vc](https://github.com/Plachtaa/seed-vc)
## Contributors in Bilibili
We sincerely thank colleagues from different roles at Bilibili, whose combined efforts made the IndexTTS series possible.
### Core Authors
- **Wei Deng** - Core author; Initiated the IndexTTS project, led the development of the IndexTTS1 data pipeline, model architecture design and training, as well as iterative optimization of the IndexTTS series of models, focusing on fundamental capability building and performance optimization.
- **Siyi Zhou** - Core author; in IndexTTS2, led model architecture design and training pipeline optimization, focusing on key features such as multilingual and emotional synthesis.
- **Jingchen Shu** - Core author; worked on overall architecture design, cross-lingual modeling solutions, and training strategy optimization, driving model iteration.
- **Xun Zhou** - Core author; worked on cross-lingual data processing and experiments, explored multilingual training strategies, and contributed to audio quality improvement and stability evaluation.
- **Jinchao Wang** - Core author; worked on model development and deployment, building the inference framework and supporting system integration.
- **Yiquan Zhou** - Core author; contributed to model experiments and validation, and proposed and implemented text-based emotion control.
- **Yi He** - Core author; contributed to model experiments and validation.
- **Lu Wang** - Core author; worked on data processing and model evaluation, supporting model training and performance verification.
### Technical Contributors
- **Yining Wang** - Supporting contributor; contributed to open-source code implementation and maintenance, supporting feature adaptation and community release.
- **Yong Wu** - Supporting contributor; worked on data processing and experimental support, ensuring data quality and efficiency for model training and iteration.
- **Yaqin Huang** - Supporting contributor; contributed to systematic model evaluation and effect tracking, providing feedback to support iterative improvements.
- **Yunhan Xu** - Supporting contributor; provided guidance in recording and data collection, while also offering feedback from a product and operations perspective to improve usability and practical application.
- **Yuelang Sun** - Supporting contributor; provided professional support in audio recording and data collection, ensuring high-quality data for model training and evaluation.
- **Yihuang Liang** - Supporting contributor; worked on systematic model evaluation and project promotion, helping IndexTTS expand its reach and engagement.
### Technical Guidance
- **Huyang Sun** - Provided strong support for the IndexTTS project, ensuring strategic alignment and resource backing.
- **Bin Xia** - Contributed to the review, optimization, and follow-up of technical solutions, focusing on ensuring model effectiveness.
## 📚 Citation
🌟 If you find our work helpful, please leave us a star and cite our paper.
IndexTTS2:
```
@article{zhou2025indextts2,
  title={IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech},
  author={Siyi Zhou and Yiquan Zhou and Yi He and Xun Zhou and Jinchao Wang and Wei Deng and Jingchen Shu},
  journal={arXiv preprint arXiv:2506.21619},
  year={2025}
}
```
IndexTTS:
```
@article{deng2025indextts,
  title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},
  author={Wei Deng and Siyi Zhou and Jingchen Shu and Jinchao Wang and Lu Wang},
  journal={arXiv preprint arXiv:2502.05512},
  year={2025},
  doi={10.48550/arXiv.2502.05512},
  url={https://arxiv.org/abs/2502.05512}
}
```

View File

@@ -0,0 +1,247 @@
<div align="center">
<img src='assets/index_icon.png' width="250"/>
</div>
<h2><center>IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System</center></h2>
<p align="center">
<a href='https://arxiv.org/abs/2502.05512'><img src='https://img.shields.io/badge/ArXiv-2502.05512-red'></a>
## 👉🏻 IndexTTS 👈🏻
[[HuggingFace Demo]](https://huggingface.co/spaces/IndexTeam/IndexTTS) [[ModelScope Demo]](https://modelscope.cn/studios/IndexTeam/IndexTTS-Demo) \
[[Paper]](https://arxiv.org/abs/2502.05512) [[Demos]](https://index-tts.github.io)
**IndexTTS** is a GPT-style text-to-speech (TTS) model mainly based on XTTS and Tortoise. It is capable of correcting the pronunciation of Chinese characters using pinyin and controlling pauses at any position through punctuation marks. We enhanced multiple modules of the system, including the improvement of speaker condition feature representation, and the integration of BigVGAN2 to optimize audio quality. Trained on tens of thousands of hours of data, our system achieves state-of-the-art performance, outperforming current popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS.
<span style="font-size:16px;">
Experience **IndexTTS**: Please contact <u>xuanwu@bilibili.com</u> for more detailed information. </span>
### Contact
QQ Group (No. 2): 1048202584 \
Discord: https://discord.gg/uT32E7KDmy \
Recruitment (résumés): indexspeech@bilibili.com \
Everyone is welcome to join the discussion!
## 📣 Updates
- `2025/05/14` 🔥🔥 We released **IndexTTS-1.5**, significantly improving the model's stability and its performance in the English language.
- `2025/03/25` 🔥 We release IndexTTS-1.0 model parameters and inference code.
- `2025/02/12` 🔥 We submitted our paper on arXiv, and released our demos and test sets.
## 🖥️ Method
The overview of IndexTTS is shown as follows.
<picture>
<img src="assets/IndexTTS.png" width="800"/>
</picture>
The main improvements and contributions are summarized as follows:
- In Chinese scenarios, we have introduced a character-pinyin hybrid modeling approach. This allows for quick correction of mispronounced characters.
- **IndexTTS** incorporates a conformer conditioning encoder and a BigVGAN2-based speech-code decoder. This improves training stability, voice timbre similarity, and sound quality.
- We release all test sets here, including those for polysyllabic words, subjective and objective test sets.
## Model Download
| 🤗**HuggingFace** | **ModelScope** |
|----------------------------------------------------------|----------------------------------------------------------|
| [IndexTTS](https://huggingface.co/IndexTeam/Index-TTS) | [IndexTTS](https://modelscope.cn/models/IndexTeam/Index-TTS) |
| [😁IndexTTS-1.5](https://huggingface.co/IndexTeam/IndexTTS-1.5) | [IndexTTS-1.5](https://modelscope.cn/models/IndexTeam/IndexTTS-1.5) |
## 📑 Evaluation
**Word Error Rate (WER) Results for IndexTTS and Baseline Models on the** [**seed-test**](https://github.com/BytedanceSpeech/seed-tts-eval)
| **WER** | **test_zh** | **test_en** | **test_hard** |
|:----------------------:|:-----------:|:-----------:|:-------------:|
| **Human** | 1.26 | 2.14 | - |
| **SeedTTS** | 1.002 | 1.945 | **6.243** |
| **CosyVoice 2** | 1.45 | 2.57 | 6.83 |
| **F5TTS** | 1.56 | 1.83 | 8.67 |
| **FireRedTTS** | 1.51 | 3.82 | 17.45 |
| **MaskGCT** | 2.27 | 2.62 | 10.27 |
| **Spark-TTS** | 1.2 | 1.98 | - |
| **MegaTTS 3** | 1.36 | 1.82 | - |
| **IndexTTS** | 0.937 | 1.936 | 6.831 |
| **IndexTTS-1.5** | **0.821** | **1.606** | 6.565 |
**Word Error Rate (WER) Results for IndexTTS and Baseline Models on other open-source test sets**
| **Model** | **aishell1_test** | **commonvoice_20_test_zh** | **commonvoice_20_test_en** | **librispeech_test_clean** | **avg** |
|:---------------:|:-----------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------:|
| **Human** | 2.0 | 9.5 | 10.0 | 2.4 | 5.1 |
| **CosyVoice 2** | 1.8 | 9.1 | 7.3 | 4.9 | 5.9 |
| **F5TTS** | 3.9 | 11.7 | 5.4 | 7.8 | 8.2 |
| **Fishspeech** | 2.4 | 11.4 | 8.8 | 8.0 | 8.3 |
| **FireRedTTS** | 2.2 | 11.0 | 16.3 | 5.7 | 7.7 |
| **XTTS** | 3.0 | 11.4 | 7.1 | 3.5 | 6.0 |
| **IndexTTS** | 1.3 | 7.0 | 5.3 | 2.1 | 3.7 |
| **IndexTTS-1.5** | **1.2** | **6.8** | **3.9** | **1.7** | **3.1** |
**Speaker Similarity (SS) Results for IndexTTS and Baseline Models**
| **Model** | **aishell1_test** | **commonvoice_20_test_zh** | **commonvoice_20_test_en** | **librispeech_test_clean** | **avg** |
|:---------------:|:-----------------:|:--------------------------:|:--------------------------:|:--------------------------:|:---------:|
| **Human** | 0.846 | 0.809 | 0.820 | 0.858 | 0.836 |
| **CosyVoice 2** | **0.796** | 0.743 | 0.742 | **0.837** | **0.788** |
| **F5TTS** | 0.743 | **0.747** | 0.746 | 0.828 | 0.779 |
| **Fishspeech** | 0.488 | 0.552 | 0.622 | 0.701 | 0.612 |
| **FireRedTTS** | 0.579 | 0.593 | 0.587 | 0.698 | 0.631 |
| **XTTS** | 0.573 | 0.586 | 0.648 | 0.761 | 0.663 |
| **IndexTTS** | 0.744 | 0.742 | **0.758** | 0.823 | 0.776 |
| **IndexTTS-1.5** | 0.741 | 0.722 | 0.753 | 0.819 | 0.771 |
**MOS Scores for Zero-Shot Cloned Voice**
| **Model** | **Prosody** | **Timbre** | **Quality** | **AVG** |
|-----------------|:-----------:|:----------:|:-----------:|:---------:|
| **CosyVoice 2** | 3.67 | 4.05 | 3.73 | 3.81 |
| **F5TTS** | 3.56 | 3.88 | 3.56 | 3.66 |
| **Fishspeech** | 3.40 | 3.63 | 3.69 | 3.57 |
| **FireRedTTS** | 3.79 | 3.72 | 3.60 | 3.70 |
| **XTTS** | 3.23 | 2.99 | 3.10 | 3.11 |
| **IndexTTS** | **3.79** | **4.20** | **4.05** | **4.01** |
## Usage Instructions
### Environment Setup
1. Download this repository:
```bash
git clone https://github.com/index-tts/index-tts.git
```
2. Install dependencies:
Create a new conda environment and install dependencies:
```bash
conda create -n index-tts python=3.10
conda activate index-tts
apt-get install ffmpeg
# or use conda to install ffmpeg
conda install -c conda-forge ffmpeg
```
Install [PyTorch](https://pytorch.org/get-started/locally/), e.g.:
```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
```
> [!NOTE]
> If you are using Windows you may encounter [an error](https://github.com/index-tts/index-tts/issues/61) when installing `pynini`:
`ERROR: Failed building wheel for pynini`
> In this case, please install `pynini` via `conda`:
> ```bash
> # after conda activate index-tts
> conda install -c conda-forge pynini==2.1.6
> pip install WeTextProcessing --no-deps
> ```
Install `IndexTTS` as a package:
```bash
cd index-tts
pip install -e .
```
3. Download models:
Download by `huggingface-cli`:
```bash
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints
```
Recommended for users in China: if the download is slow, you can use a mirror:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```
Or by `wget`:
```bash
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_discriminator.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_generator.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bpe.model -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/dvae.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/gpt.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/unigram_12000.vocab -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/config.yaml -P checkpoints
```
> [!NOTE]
> If you prefer to use the `IndexTTS-1.0` model, please replace `IndexTeam/IndexTTS-1.5` with `IndexTeam/IndexTTS` in the above commands.
4. Run test script:
```bash
# Please put your prompt audio in 'test_data' and rename it to 'input.wav'
python indextts/infer.py
```
5. Use as command line tool:
```bash
# Make sure pytorch has been installed before running this command
indextts "大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了" \
--voice reference_voice.wav \
--model_dir checkpoints \
--config checkpoints/config.yaml \
--output output.wav
```
Use `--help` to see more options.
```bash
indextts --help
```
#### Web Demo
```bash
pip install -e ".[webui]" --no-build-isolation
python webui.py
# use another model version:
python webui.py --model_dir IndexTTS-1.5
```
Open your browser and visit `http://127.0.0.1:7860` to see the demo.
#### Sample Code
```python
from indextts.infer import IndexTTS
tts = IndexTTS(model_dir="checkpoints",cfg_path="checkpoints/config.yaml")
voice="reference_voice.wav"
text="大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了比如说现在正在说话的其实是B站为我现场复刻的数字分身简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能可以访问 bilibili studio相信我你们也会吃惊的。"
tts.infer(voice, text, output_path)
```
## Acknowledgements
1. [tortoise-tts](https://github.com/neonbjb/tortoise-tts)
2. [XTTSv2](https://github.com/coqui-ai/TTS)
3. [BigVGAN](https://github.com/NVIDIA/BigVGAN)
4. [wenet](https://github.com/wenet-e2e/wenet/tree/main)
5. [icefall](https://github.com/k2-fsa/icefall)
## 📚 Citation
🌟 If you find our work helpful, please leave us a star and cite our paper.
```
@article{deng2025indextts,
  title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},
  author={Wei Deng and Siyi Zhou and Jingchen Shu and Jinchao Wang and Lu Wang},
  journal={arXiv preprint arXiv:2502.05512},
  year={2025}
}
```

Binary file not shown (new image, 528 KiB).
BIN assets/IndexTTS2.mp4 (new binary file).
BIN assets/IndexTTS2.png (new image, 57 KiB).
BIN assets/IndexTTS2_banner.png (new image, 2.9 MiB).

View File

@@ -12,14 +12,13 @@ dataset:
normalize: false
gpt:
model_dim: 1024
max_mel_tokens: 605
max_text_tokens: 402
heads: 16
model_dim: 1280
max_mel_tokens: 1815
max_text_tokens: 600
heads: 20
use_mel_codes_as_input: true
mel_length_compression: 1024
layers: 20
activation_function: "gelu_pytorch_tanh"
layers: 24
number_text_tokens: 12000
number_mel_codes: 8194
start_mel_token: 8192
@@ -35,79 +34,87 @@ gpt:
num_blocks: 6
input_layer: "conv2d2"
perceiver_mult: 2
emo_condition_module:
output_size: 512
linear_units: 1024
attention_heads: 4
num_blocks: 4
input_layer: "conv2d2"
perceiver_mult: 2
vqvae:
channels: 100
num_tokens: 8192
hidden_dim: 512
num_resnet_blocks: 3
codebook_dim: 512
num_layers: 2
positional_dims: 1
kernel_size: 3
smooth_l1_loss: true
use_transposed_convs: false
semantic_codec:
codebook_size: 8192
hidden_size: 1024
codebook_dim: 8
vocos_dim: 384
vocos_intermediate_dim: 2048
vocos_num_layers: 12
bigvgan:
adam_b1: 0.8
adam_b2: 0.99
lr_decay: 0.999998
seed: 1234
s2mel:
preprocess_params:
sr: 22050
spect_params:
n_fft: 1024
win_length: 1024
hop_length: 256
n_mels: 80
fmin: 0
fmax: "None"
resblock: "1"
upsample_rates: [4,4,4,4,2,2]
upsample_kernel_sizes: [8,8,4,4,4,4]
upsample_initial_channel: 1536
resblock_kernel_sizes: [3,7,11]
resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
feat_upsample: false
speaker_embedding_dim: 512
cond_d_vector_in_each_upsampling_layer: true
dit_type: "DiT"
reg_loss_type: "l1"
style_encoder:
dim: 192
length_regulator:
channels: 512
is_discrete: false
in_channels: 1024
content_codebook_size: 2048
sampling_ratios: [1, 1, 1, 1]
vector_quantize: false
n_codebooks: 1
quantizer_dropout: 0.0
f0_condition: false
n_f0_bins: 512
DiT:
hidden_dim: 512
num_heads: 8
depth: 13
class_dropout_prob: 0.1
block_size: 8192
in_channels: 80
style_condition: true
final_layer_type: 'wavenet'
target: 'mel'
content_dim: 512
content_codebook_size: 1024
content_type: 'discrete'
f0_condition: false
n_f0_bins: 512
content_codebooks: 1
is_causal: false
long_skip_connection: true
zero_prompt_speech_token: false
time_as_token: false
style_as_token: false
uvit_skip_connection: true
add_resblock_in_transformer: false
wavenet:
hidden_dim: 512
num_layers: 8
kernel_size: 5
dilation_rate: 1
p_dropout: 0.2
style_condition: true
gpt_dim: 1024
activation: "snakebeta"
snake_logscale: true
use_cqtd_instead_of_mrd: true
cqtd_filters: 128
cqtd_max_filters: 1024
cqtd_filters_scale: 1
cqtd_dilations: [1, 2, 4]
cqtd_hop_lengths: [512, 256, 256]
cqtd_n_octaves: [9, 9, 9]
cqtd_bins_per_octaves: [24, 36, 48]
resolutions: [[1024, 120, 600], [2048, 240, 1200], [512, 50, 240]]
mpd_reshapes: [2, 3, 5, 7, 11]
use_spectral_norm: false
discriminator_channel_mult: 1
use_multiscale_melloss: true
lambda_melloss: 15
clip_grad_norm: 1000
segment_size: 16384
num_mels: 100
num_freq: 1025
n_fft: 1024
hop_size: 256
win_size: 1024
sampling_rate: 24000
fmin: 0
fmax: null
fmax_for_loss: null
mel_type: "pytorch"
num_workers: 2
dist_config:
dist_backend: "nccl"
dist_url: "tcp://localhost:54321"
world_size: 1
dvae_checkpoint: dvae.pth
gpt_checkpoint: gpt.pth
bigvgan_checkpoint: bigvgan_generator.pth
w2v_stat: wav2vec2bert_stats.pt
s2mel_checkpoint: s2mel.pth
emo_matrix: feat2.pt
spk_matrix: feat1.pt
emo_num: [3, 17, 2, 8, 4, 5, 10, 24]
qwen_emo_path: qwen0.6bemo4-merge/
vocoder:
type: "bigvgan"
name: "nvidia/bigvgan_v2_22khz_80band_256x"
version: 2.0

1728
checkpoints/pinyin.vocab Normal file

File diff suppressed because it is too large.

399
docs/README_zh.md Normal file

@@ -0,0 +1,399 @@
<div align="center">
<img src='../assets/index_icon.png' width="250"/>
</div>
<div align="center">
<a href="README_zh.md" style="font-size: 24px">简体中文</a> |
<a href="../README.md" style="font-size: 24px">English</a>
</div>
## 👉🏻 IndexTTS2 👈🏻
<center><h3>IndexTTS2情感表达与时长可控的自回归零样本语音合成突破</h3></center>
[![IndexTTS2](../assets/IndexTTS2_banner.png)](../assets/IndexTTS2_banner.png)
<div align="center">
<a href='https://arxiv.org/abs/2506.21619'>
<img src='https://img.shields.io/badge/ArXiv-2506.21619-red?logo=arxiv'/>
</a>
<br/>
<a href='https://github.com/index-tts/index-tts'>
<img src='https://img.shields.io/badge/GitHub-Code-orange?logo=github'/>
</a>
<a href='https://index-tts.github.io/index-tts2.github.io/'>
<img src='https://img.shields.io/badge/GitHub-Demo-orange?logo=github'/>
</a>
<br/>
<a href='https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo'>
<img src='https://img.shields.io/badge/HuggingFace-Demo-blue?logo=huggingface'/>
</a>
<a href='https://huggingface.co/IndexTeam/IndexTTS-2'>
<img src='https://img.shields.io/badge/HuggingFace-Model-blue?logo=huggingface' />
</a>
<br/>
<a href='https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo'>
<img src='https://img.shields.io/badge/ModelScope-Demo-purple?logo=modelscope'/>
</a>
<a href='https://modelscope.cn/models/IndexTeam/IndexTTS-2'>
<img src='https://img.shields.io/badge/ModelScope-Model-purple?logo=modelscope'/>
</a>
</div>
### 摘要
现有自回归大规模文本转语音TTS模型在语音自然度方面具有优势但其逐token生成机制难以精确控制合成语音的时长。这在需要严格视音频同步的应用如视频配音中成为显著限制。
本文提出了IndexTTS2创新性地提出了一种通用且适用于自回归模型的语音时长控制方法。
该方法支持两种生成模式一种可显式指定生成token数量以精确控制语音时长另一种则自由自回归生成语音同时忠实还原输入提示的韵律特征。
此外IndexTTS2实现了情感表达与说话人身份的解耦可独立控制音色和情感。在零样本设置下模型能准确复刻目标音色来自音色提示同时完美还原指定的情感语调来自风格提示
为提升高情感表达下的语音清晰度我们引入GPT潜在表示并设计了三阶段训练范式提升生成语音的稳定性。为降低情感控制门槛我们基于文本描述微调Qwen3设计了软指令机制有效引导语音生成所需情感。
多数据集实验结果表明IndexTTS2在词错误率、说话人相似度和情感保真度方面均超越现有零样本TTS模型。音频样例见<a href="https://index-tts.github.io/index-tts2.github.io/">IndexTTS2演示页面</a>。
**Tips:** 如需更多信息请联系作者。商业合作请联系 <u>indexspeech@bilibili.com</u>。
### IndexTTS2体验
<div align="center">
**IndexTTS2语音未来现已生成**
[![IndexTTS2 Demo](../assets/IndexTTS2-video-pic.png)](https://www.bilibili.com/video/BV136a9zqEk5)
*点击图片观看IndexTTS2介绍视频*
</div>
### 联系方式
QQ群:663272642(4群) 1013410623(5群) \
Discord:https://discord.gg/uT32E7KDmy \
邮箱:indexspeech@bilibili.com \
欢迎加入我们的社区!🌏 \
欢迎大家交流讨论!
> [!CAUTION]
> 感谢大家对bilibili indextts项目的支持与关注
> 请注意,目前由核心团队直接维护的**官方渠道仅有**: [https://github.com/index-tts/index-tts](https://github.com/index-tts/index-tts).
> ***其他任何网站或服务均非官方提供***,我们对其内容及安全性、准确性和及时性不作任何担保。
> 为了保障您的权益建议通过上述官方渠道获取bilibili indextts项目的最新进展与更新。
## 📣 更新日志
- `2025/09/08` 🔥🔥🔥 IndexTTS-2全球发布
- 首个支持精确合成时长控制的自回归TTS模型支持可控与非可控模式。<i>本版本暂未开放该功能。</i>
- 模型实现高度情感表达的语音合成,支持多模态情感控制。
- `2025/05/14` 🔥🔥 IndexTTS-1.5发布,显著提升模型稳定性及英文表现。
- `2025/03/25` 🔥 IndexTTS-1.0发布,开放模型权重与推理代码。
- `2025/02/12` 🔥 论文提交arXiv发布演示与测试集。
## 🖥️ 神经网络架构
IndexTTS2架构总览
<picture>
<img src="../assets/IndexTTS2.png" width="800"/>
</picture>
主要创新点:
- 提出自回归TTS模型的时长自适应方案。IndexTTS2是首个将精确时长控制与自然时长生成结合的自回归零样本TTS模型方法可扩展至任意自回归大模型。
- 情感与说话人特征从提示中解耦,设计特征融合策略,在高情感表达下保持语义流畅与发音清晰,并开发了基于自然语言描述的情感控制工具。
- 针对高表达性语音数据缺乏提出高效训练策略显著提升零样本TTS情感表达至SOTA水平。
- 代码与预训练权重将公开,促进后续研究与应用。
## 模型下载
| **HuggingFace** | **ModelScope** |
|----------------------------------------------------------|----------------------------------------------------------|
| [😁 IndexTTS-2](https://huggingface.co/IndexTeam/IndexTTS-2) | [IndexTTS-2](https://modelscope.cn/models/IndexTeam/IndexTTS-2) |
| [IndexTTS-1.5](https://huggingface.co/IndexTeam/IndexTTS-1.5) | [IndexTTS-1.5](https://modelscope.cn/models/IndexTeam/IndexTTS-1.5) |
| [IndexTTS](https://huggingface.co/IndexTeam/Index-TTS) | [IndexTTS](https://modelscope.cn/models/IndexTeam/Index-TTS) |
## 使用说明
### ⚙️ 环境配置
1. 请确保已安装 [git](https://git-scm.com/downloads) 和 [git-lfs](https://git-lfs.com/)。
在仓库中启用Git-LFS
```bash
git lfs install
```
2. 下载代码:
```bash
git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull # 下载大文件
```
3. 安装 [uv 包管理器](https://docs.astral.sh/uv/getting-started/installation/)。
*必须*使用uv保证依赖环境可靠。
> [!TIP]
> **快速安装方法:**
>
> uv安装方式多样详见官网。也可快速安装
>
> ```bash
> pip install -U uv
> ```
> [!WARNING]
> 本文档仅支持uv安装。其他工具如conda/pip无法保证依赖正确可能导致*偶发bug、报错、GPU加速失效*等问题。
>
> uv比pip快[115倍](https://github.com/astral-sh/uv/blob/main/BENCHMARKS.md),强烈推荐。
4. 安装依赖:
使用uv安装依赖时会创建虚拟环境将所有依赖安装到`.venv`目录:
```bash
uv sync --all-extras
```
如中国大陆地区用户下载缓慢,可选用国内镜像:
```bash
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"
uv sync --all-extras --default-index "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
```
> [!TIP]
> **可选功能:**
>
> - `--all-extras`:安装全部可选功能。可去除自定义。
> - `--extra webui`安装WebUI支持推荐
> - `--extra deepspeed`安装DeepSpeed加速。
> [!IMPORTANT]
> **Windows注意** DeepSpeed在部分Windows环境较难安装可去除`--all-extras`。
>
> **Linux/Windows注意** 如遇CUDA相关报错请确保已安装NVIDIA [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) 12.8及以上。
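A hedged sketch of a selective install that keeps the WebUI but skips DeepSpeed (using only the extras listed above):
```bash
# Install only the webui extra, skipping DeepSpeed (useful on Windows):
uv sync --extra webui
```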
5. 下载模型:
HuggingFace下载
```bash
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
```
ModelScope下载
```bash
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
```
> [!NOTE]
> 项目首次运行还会自动下载部分小模型。如网络访问HuggingFace较慢建议提前设置
>
> ```bash
> export HF_ENDPOINT="https://hf-mirror.com"
> ```
#### 🖥️ PyTorch GPU 加速检测
可运行脚本检测机器是否有GPU以及是否安装了GPU版本的PyTorch。如PyTorch版本不对可能使用CPU启动推理会非常慢
```bash
uv run tools/gpu_check.py
```
### 🔥 IndexTTS2快速体验
#### 🌐 Web演示
```bash
uv run webui.py
```
浏览器访问 `http://127.0.0.1:7860` 查看演示。
可通过命令行参数开启FP16推理降低显存占用、DeepSpeed加速、CUDA内核编译加速等。可运行以下命令查看所有选项
```bash
uv run webui.py -h
```
祝使用愉快!
#### 📝 Python脚本调用
`uv run <file.py>`保证程序在uv创建的虚拟环境下运行。部分情况需要指定`PYTHONPATH`
示例:
```bash
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
```
Here are some examples of calling IndexTTS2 from a script:
1. Single reference audio (voice cloning):
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "Translate for me, what is a surprise!"
tts.infer(spk_audio_prompt='examples/voice_01.wav', text=text, output_path="gen.wav", verbose=True)
```
2. Using a separate emotion reference audio:
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "酒楼丧尽天良,开始借机竞拍房间,哎,一群蠢货。"
tts.infer(spk_audio_prompt='examples/voice_07.wav', text=text, output_path="gen.wav", emo_audio_prompt="examples/emo_sad.wav", verbose=True)
```
3. The weight of the emotion reference audio can be adjusted (`emo_alpha`, range 0.0-1.0, default 1.0):
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "酒楼丧尽天良,开始借机竞拍房间,哎,一群蠢货。"
tts.infer(spk_audio_prompt='examples/voice_07.wav', text=text, output_path="gen.wav", emo_audio_prompt="examples/emo_sad.wav", emo_alpha=0.9, verbose=True)
```
4. An 8-dimensional emotion vector `[happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]` can be specified directly; `use_random` enables random emotion sampling (default False):
> [!NOTE]
> Enabling random sampling reduces how faithfully the timbre is reproduced.
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "哇塞!这个爆率也太高了!欧皇附体了!"
tts.infer(spk_audio_prompt='examples/voice_10.wav', text=text, output_path="gen.wav", emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0], use_random=False, verbose=True)
```
5. `use_emo_text` derives the emotion vector automatically from the text; `use_random` enables random emotion sampling:
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "快躲起来!是他要来了!他要来抓我们了!"
tts.infer(spk_audio_prompt='examples/voice_12.wav', text=text, output_path="gen.wav", emo_alpha=0.6, use_emo_text=True, use_random=False, verbose=True)
```
6. An emotion text description (`emo_text`) can be specified directly, decoupling the spoken text from the emotion control:
```python
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)
text = "快躲起来!是他要来了!他要来抓我们了!"
emo_text = "你吓死我了!你是鬼吗?"
tts.infer(spk_audio_prompt='examples/voice_12.wav', text=text, output_path="gen.wav", emo_alpha=0.6, use_emo_text=True, emo_text=emo_text, use_random=False, verbose=True)
```
> [!TIP]
> **Notes on pinyin usage:**
>
> IndexTTS2 still supports mixed modeling of Chinese characters and pinyin.
> If you need precise pronunciation control, include the specific pinyin annotations in the input text to trigger pinyin control.
> Note that pinyin control does not work for arbitrary initial/final (consonant/vowel) combinations; only pronunciations that are valid Chinese pinyin are kept.
> See the `checkpoints/pinyin.vocab` file in the project for the exact set of valid pinyin.
>
> Example:
> ```
> 之前你做DE5很好所以这一次也DEI3做DE2很好才XING2如果这次目标完成得不错的话我们就直接打DI1去银行取钱。
> ```
### Legacy IndexTTS1 User Guide
If you need the older IndexTTS1.5 model, you can import the legacy module:
```python
from indextts.infer import IndexTTS
tts = IndexTTS(model_dir="checkpoints",cfg_path="checkpoints/config.yaml")
voice = "examples/voice_07.wav"
text = "大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了比如说现在正在说话的其实是B站为我现场复刻的数字分身简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能可以访问 bilibili studio相信我你们也会吃惊的。"
tts.infer(voice, text, 'gen.wav')
```
For details, see [README_INDEXTTS_1_5](archive/README_INDEXTTS_1_5.md), or visit <a href="https://github.com/index-tts/index-tts/tree/v1.5.0">index-tts:v1.5.0</a>.
## Demos
### IndexTTS2: [[Paper]](https://arxiv.org/abs/2506.21619); [[Demo]](https://index-tts.github.io/index-tts2.github.io/); [[ModelScope]](https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo); [[HuggingFace]](https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo)
### IndexTTS1: [[Paper]](https://arxiv.org/abs/2502.05512); [[Demo]](https://index-tts.github.io/); [[ModelScope]](https://modelscope.cn/studios/IndexTeam/IndexTTS-Demo); [[HuggingFace]](https://huggingface.co/spaces/IndexTeam/IndexTTS)
## Acknowledgements
1. [tortoise-tts](https://github.com/neonbjb/tortoise-tts)
2. [XTTSv2](https://github.com/coqui-ai/TTS)
3. [BigVGAN](https://github.com/NVIDIA/BigVGAN)
4. [wenet](https://github.com/wenet-e2e/wenet/tree/main)
5. [icefall](https://github.com/k2-fsa/icefall)
6. [maskgct](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)
7. [seed-vc](https://github.com/Plachtaa/seed-vc)
## Bilibili Contributors
We sincerely thank our colleagues at Bilibili; the IndexTTS series exists thanks to everyone's joint effort.
### Core Authors
- **Siyi Zhou** Core author; led model architecture design and training-pipeline optimization for IndexTTS2, driving key features such as multilingual and multi-emotion synthesis.
- **Wei Deng** Core author; led model architecture design and the training pipeline for IndexTTS1, responsible for building its foundational capabilities and performance optimization.
- **Jingchen Shu** Core author; responsible for the overall architecture design, the cross-lingual modeling scheme, and training-strategy optimization, driving model iteration.
- **Xun Zhou** Core author; responsible for cross-lingual data processing and experiments, exploring multilingual training strategies, and contributing to audio-quality improvements and stability evaluation.
- **Jinchao Wang** Core author; responsible for model development and deployment, building the inference framework and supporting production rollout.
- **Yiquan Zhou** Core author; participated in model experiments and validation, and proposed and implemented text-based emotion control.
- **Yi He** Core author; participated in model experiments and validation.
- **Lu Wang** Core author; responsible for data processing and model evaluation, supporting model training and performance validation.
### Technical Contributors
- **Yining Wang** Technical contributor; responsible for implementing and maintaining the open-source code, supporting feature adaptation and community releases.
- **Yong Wu** Technical contributor; participated in data processing and experiment support, safeguarding data quality and iteration efficiency for model training.
- **Yaqin Huang** Technical contributor; participated in systematic model evaluation and result tracking, providing feedback to support iterative improvement.
- **Yunhan Xu** Technical contributor; provided guidance on recording and data collection, and offered improvement suggestions from a product and operations perspective, improving the model's usability and real-world effectiveness.
- **Yuelang Sun** Technical contributor; provided professional support for audio recording and data collection, ensuring the high-quality data needed for model training and evaluation.
- **Yihuang Liang** Technical contributor; participated in systematic model evaluation and project promotion, helping the IndexTTS project grow its reach and user engagement.
### Technical Advisors
- **Huyang Sun** Gave strong support to the IndexTTS project, ensuring its strategic direction and resources.
- **Bin Xia** Participated in reviewing, optimizing, and following up on the technical plan, with a focus on safeguarding model quality.
## 📚 Citation
🌟 If this project helps you, please star the repository and cite our papers.
IndexTTS2:
```
@article{zhou2025indextts2,
title={IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech},
author={Siyi Zhou and Yiquan Zhou and Yi He and Xun Zhou and Jinchao Wang and Wei Deng and Jingchen Shu},
journal={arXiv preprint arXiv:2506.21619},
year={2025}
}
```
IndexTTS:
```
@article{deng2025indextts,
title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},
author={Wei Deng and Siyi Zhou and Jingchen Shu and Jinchao Wang and Lu Wang},
journal={arXiv preprint arXiv:2502.05512},
year={2025},
doi={10.48550/arXiv.2502.05512},
url={https://arxiv.org/abs/2502.05512}
}
```

12
examples/cases.jsonl Normal file
View File

@@ -0,0 +1,12 @@
{"prompt_audio":"voice_01.wav","text":"Translate for me, what is a surprise!","emo_mode":0}
{"prompt_audio":"voice_02.wav","text":"The palace is strict, no false rumors, Lady Qi!","emo_mode":0}
{"prompt_audio":"voice_03.wav","text":"这个呀,就是我们精心制作准备的纪念品,大家可以看到这个色泽和这个材质啊,哎呀多么的光彩照人。","emo_mode":0}
{"prompt_audio":"voice_04.wav","text":"你就需要我这种专业人士的帮助,就像手无缚鸡之力的人进入雪山狩猎,一定需要最老练的猎人指导。","emo_mode":0}
{"prompt_audio":"voice_05.wav","text":"在真正的日本剑道中,格斗过程极其短暂,常常短至半秒,最长也不超过两秒,利剑相击的转瞬间,已有一方倒在血泊中。但在这电光石火的对决之前,双方都要以一个石雕般凝固的姿势站定,长时间的逼视对方,这一过程可能长达十分钟!","emo_mode":0}
{"prompt_audio":"voice_06.wav","text":"今天呢,咱们开一部新书,叫《赛博朋克二零七七》。这词儿我听着都新鲜。这赛博朋克啊,简单理解就是“高科技,低生活”。这一听,我就明白了,于老师就爱用那高科技的东西,手机都得拿脚纹开,大冬天为了解锁脱得一丝不挂,冻得跟王八蛋似的。","emo_mode":0}
{"prompt_audio":"voice_07.wav","emo_audio":"emo_sad.wav","emo_weight":0.65,"emo_mode":1,"text":"酒楼丧尽天良,开始借机竞拍房间,哎,一群蠢货。"}
{"prompt_audio":"voice_08.wav","emo_audio":"emo_hate.wav","emo_weight":0.65,"emo_mode":1,"text":"你看看你,对我还有没有一点父子之间的信任了。"}
{"prompt_audio":"voice_09.wav","emo_weight": 0.8,"emo_mode":2,"emo_vec_3":0.8,"text":"对不起嘛!我的记性真的不太好,但是和你在一起的事情,我都会努力记住的~"}
{"prompt_audio":"voice_10.wav","emo_weight": 0.8,"emo_mode":2,"emo_vec_7":1.0,"text":"哇塞!这个爆率也太高了!欧皇附体了!"}
{"prompt_audio":"voice_11.wav","emo_mode":3,"emo_text":"极度悲伤","text":"这些年的时光终究是错付了... "}
{"prompt_audio":"voice_12.wav","emo_mode":3,"emo_text":"You scared me to death! What are you, a ghost?","text":"快躲起来!是他要来了!他要来抓我们了!"}

3
examples/emo_hate.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:89e6e7eee1a28303776e9cf43971e9505529bd0e669f5fcf47f4d1370f9187c4
size 145368

3
examples/emo_sad.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f7d3e5bf2b7bca6458f9e6d7a5ce073c41eb4418895e7df2f994e5a0c96c064a
size 842016

3
examples/voice_01.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e33e6ee0107a1dd58e1d66dd90c13df3d55a8683047cc3d7ea206dad84ed3fc8
size 478050

3
examples/voice_02.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8fe2dd1dbd54ef85a073fbc4c8fc0198f8d4523cc3320a600de0e347a3d8b491
size 574074

3
examples/voice_03.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:50e8b632efd794418919e2d33c8c2aab9189a57f4d21ef55020413be9f2b292a
size 616814

3
examples/voice_04.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2a3d2536245f45fd5e1eef046dd768ae7b72a0dba3ec3f370f145862fe64b3b2
size 681084

3
examples/voice_05.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eefb7f4a29a8b36f08d5cc1014ea947dbe9f7bef348f07c40263058e604a98eb
size 1482796

3
examples/voice_06.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2d85800fe261d106c3274fa792cbb952458c4b0b2e1b908340a8cd0d63c73a30
size 299052

3
examples/voice_07.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bcb10f84e63c3fdbfe99ac4184ca403b46a6d20b50540732713d48c4c95375ce
size 591894

3
examples/voice_08.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2e2c5f4859999b1ada95ee801d50c3c72879147269a4ed99e385fd917dae5c6f
size 426812

3
examples/voice_09.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8702467b9b3c83a16bead578e131c4388b3ef82aeff861bd336e622a9ae8a511
size 1798188

3
examples/voice_10.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:39c2db8b395e4c6ea1122ec7463b5f7bd7dd7d7302f3255780e4c529a9ae9985
size 1942242

3
examples/voice_11.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:82730e38498413d4371a76e841cd91fa2f74843b79ad3b606d45ad8a7b7a736c
size 1520734

3
examples/voice_12.wav Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d67bd4f51773677d5902409813b9bb4c1d59b8243c74fc104553b80b49edd22b
size 778626

View File

@@ -2,10 +2,9 @@
# Licensed under the MIT license.
import os
import sys
import pathlib
import subprocess
import platform
from torch.utils import cpp_extension
"""
@@ -46,45 +45,7 @@ def chinese_path_compile_support(sources, buildpath):
def load(force_rebuild=False):
import torch
if not torch.cuda.is_available():
raise RuntimeError("Please install PyTorch with CUDA support to use the anti_alias_activation_cuda extension.")
try:
from indextts.BigVGAN.alias_free_activation.cuda import anti_alias_activation_cuda
if not force_rebuild:
return anti_alias_activation_cuda
except ImportError:
anti_alias_activation_cuda = None
module_name = "anti_alias_activation_cuda"
# Build path
srcpath = pathlib.Path(__file__).parent.absolute()
buildpath = srcpath / "build"
_create_build_dir(buildpath)
filepath = buildpath / f"{module_name}{cpp_extension.LIB_EXT}"
if not force_rebuild and os.path.exists(filepath):
import importlib.util
import importlib.abc
# If the file exists, we can load it directly
spec = importlib.util.spec_from_file_location(module_name, filepath)
if spec is not None:
module = importlib.util.module_from_spec(spec)
assert isinstance(spec.loader, importlib.abc.Loader)
spec.loader.exec_module(module)
return module
if platform.system() == "Windows" and "MINGW64" in os.environ.get("MSYSTEM", ""):
# Building the CUDA extension under MinGW-w64 (e.g. Git Bash) may hang or fail
# https://github.com/index-tts/index-tts/issues/172#issuecomment-2914995096
print("Warning: Detected running in MinGW-w64 (e.g., Git Bash). CUDA extension build is not supported in this environment.", file=sys.stderr)
raise RuntimeError(
"Please use Command Prompt (cmd) or PowerShell to compile the anti_alias_activation_cuda extension."
)
if not cpp_extension.CUDA_HOME:
raise RuntimeError(cpp_extension.CUDA_NOT_FOUND_MESSAGE)
cpp_extension.verify_ninja_availability()
def load():
# Check if cuda 11 is installed for compute capability 8.0
cc_flag = []
_, bare_metal_major, _ = _get_cuda_bare_metal_version(cpp_extension.CUDA_HOME)
@@ -92,18 +53,24 @@ def load(force_rebuild=False):
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")
# Build path
srcpath = pathlib.Path(__file__).parent.absolute()
buildpath = srcpath / "build"
_create_build_dir(buildpath)
# Helper function to build the kernels.
def _cpp_extention_load_helper(name, sources, extra_cuda_flags):
is_windows = cpp_extension.IS_WINDOWS
return cpp_extension.load(
name=name,
sources=sources,
build_directory=buildpath,
extra_cflags=[
"-O3" if not is_windows else "/O2",
"-O3",
],
extra_cuda_cflags=[
"-O3",
"-gencode",
"arch=compute_70,code=sm_70",
"--use_fast_math",
]
+ extra_cuda_flags
@@ -134,9 +101,8 @@ def load(force_rebuild=False):
def _get_cuda_bare_metal_version(cuda_dir):
nvcc = os.path.join(cuda_dir, 'bin', 'nvcc')
raw_output = subprocess.check_output(
[nvcc, "-V"], universal_newlines=True
[cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
)
output = raw_output.split()
release_idx = output.index("release") + 1
@@ -149,8 +115,7 @@ def _get_cuda_bare_metal_version(cuda_dir):
def _create_build_dir(buildpath):
try:
if not os.path.isdir(buildpath):
os.mkdir(buildpath)
os.mkdir(buildpath)
except OSError:
if not os.path.isdir(buildpath):
print(f"Creation of the build directory {buildpath} failed")

View File

@@ -0,0 +1,9 @@
from .accel_engine import AccelInferenceEngine # noqa: F401
from .attention import ( # noqa: F401
Attention,
get_forward_context,
reset_forward_context,
set_forward_context,
)
from .gpt2_accel import GPT2AccelAttention, GPT2AccelModel # noqa: F401
from .kv_manager import KVCacheManager, Seq # noqa: F401

View File

@@ -0,0 +1,659 @@
import sys
from typing import List, Optional
import torch
from torch import nn
from .attention import (
ForwardContext,
get_forward_context,
reset_forward_context,
set_forward_context,
)
from .kv_manager import KVCacheManager, Seq
class Sampler(nn.Module):
def __init__(self):
super().__init__()
@torch.compile
def forward(self, logits: torch.Tensor, temperatures: torch.Tensor):
temperatures = temperatures.to(logits.device).clamp(min=1e-8)
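# Rows with temperature below 1e-5 are treated as greedy decoding and take the plain argmax below.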
greedy_mask = temperatures < 1e-5
temp_for_scaling = torch.where(greedy_mask, 1.0, temperatures)
scaled_logits = logits / temp_for_scaling.unsqueeze(-1)
probs = torch.softmax(scaled_logits, dim=-1, dtype=torch.float32)
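# Exponential-race sampling: dividing the probabilities by i.i.d. Exp(1) noise and taking the
# argmax draws from the categorical distribution (equivalent to the Gumbel-max trick).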
q = torch.empty_like(probs)
q.exponential_()
sampled_tokens = probs.div_(q).argmax(dim=-1)
greedy_tokens = logits.argmax(dim=-1)
return torch.where(greedy_mask, greedy_tokens, sampled_tokens)
class AccelInferenceEngine:
def __init__(
self,
model,
lm_head,
num_layers: int,
num_heads: int,
head_dim: int,
block_size: int = 256,
num_blocks: int = 128,
use_cuda_graph: bool = True,
):
"""
Args:
model: The GPT transformer model (should have accel attention)
lm_head: Language model head for generating logits
num_layers: Number of transformer layers
num_heads: Number of attention heads
head_dim: Dimension per head
block_size: KV cache block size
num_blocks: Total number of KV cache blocks
use_cuda_graph: Whether to use CUDA Graph for decode optimization
"""
self.model = model
self.lm_head = lm_head
self.block_size = block_size
self.num_blocks = num_blocks
self.use_cuda_graph = use_cuda_graph and torch.cuda.is_available()
self.hidden_size = (
model.config.hidden_size
if hasattr(model, "config")
else head_dim * num_heads
)
self.kv_manager = KVCacheManager(
num_layers=num_layers,
num_heads=num_heads,
head_dim=head_dim,
block_size=block_size,
num_blocks=num_blocks,
dtype=torch.float16, # Force fp16 for FlashAttention
)
self.kv_manager.wire_kv_cache_to_model(model)
self.sampler = Sampler()
self.current_sequences = []
self.graphs = {}
self.graph_vars = None
self.graph_pool = None
self.graph_captured = False
def _prepare_prefill(self, requests: List[Seq]):
input_ids = []
positions = []
cu_seqlens_q = [0]
cu_seqlens_k = [0]
max_seqlen_q = 0
max_seqlen_k = 0
slot_mapping = []
for req in requests:
seqlen = len(req)
input_ids.extend(req[req.num_cached_tokens :])
positions.extend(list(range(req.num_cached_tokens, seqlen)))
seqlen_q = seqlen - req.num_cached_tokens
seqlen_k = seqlen
cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)
cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen_k)
max_seqlen_q = max(seqlen_q, max_seqlen_q)
max_seqlen_k = max(seqlen_k, max_seqlen_k)
if req.block_table:
num_cached = req.num_cached_tokens
num_total = len(req)
for token_idx in range(num_cached, num_total):
block_idx = token_idx // self.block_size
block_offset = token_idx % self.block_size
block_id = req.block_table[block_idx]
slot_idx = block_id * self.block_size + block_offset
slot_mapping.append(slot_idx)
input_ids = torch.tensor(input_ids, dtype=torch.int64, pin_memory=True).cuda(
non_blocking=True
)
positions = torch.tensor(positions, dtype=torch.int64, pin_memory=True).cuda(
non_blocking=True
)
cu_seqlens_q = torch.tensor(
cu_seqlens_q, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
cu_seqlens_k = torch.tensor(
cu_seqlens_k, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
slot_mapping = torch.tensor(
slot_mapping, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
block_tables = None
if cu_seqlens_k[-1] > cu_seqlens_q[-1]:
max_len = max(len(req.block_table) for req in requests)
block_tables_list = []
for req in requests:
table = req.block_table + [-1] * (max_len - len(req.block_table))
block_tables_list.append(table)
block_tables = torch.tensor(
block_tables_list, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
set_forward_context(
True,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
slot_mapping,
None,
block_tables,
)
return input_ids, positions
def _prepare_decode(self, requests: List[Seq]):
if not requests:
raise RuntimeError("FATAL: No requests provided to _prepare_decode!")
input_ids = []
positions = []
slot_mapping = []
context_lens = []
for req in requests:
input_ids.append(req.last_token)
pos = len(req) - 1
if hasattr(self, "_tts_mode") and self._tts_mode:
pos = pos - (self._tts_prompt_len - 1)
positions.append(pos)
context_lens.append(len(req))
slot_mapping.append(
req.block_table[-1] * self.block_size + req.last_block_num_tokens - 1
)
input_ids = torch.tensor(input_ids, dtype=torch.int64, pin_memory=True).cuda(
non_blocking=True
)
positions = torch.tensor(positions, dtype=torch.int64, pin_memory=True).cuda(
non_blocking=True
)
slot_mapping = torch.tensor(
slot_mapping, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
context_lens = torch.tensor(
context_lens, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
max_len = max(len(req.block_table) for req in requests)
block_tables_list = []
for req in requests:
table = req.block_table + [-1] * (max_len - len(req.block_table))
block_tables_list.append(table)
block_tables = torch.tensor(
block_tables_list, dtype=torch.int32, pin_memory=True
).cuda(non_blocking=True)
assert block_tables.dim() == 2, (
f"block_tables must be 2D, got shape {block_tables.shape}"
)
assert block_tables.size(0) == len(requests), (
f"block_tables batch size mismatch: {block_tables.size(0)} vs {len(requests)}"
)
set_forward_context(
False,
slot_mapping=slot_mapping,
context_lens=context_lens,
block_tables=block_tables,
)
return input_ids, positions
def _prepare_sample(self, requests: List[Seq], temperature: float):
temperatures = [temperature] * len(requests)
temperatures = torch.tensor(
temperatures, dtype=torch.float32, pin_memory=True
).cuda(non_blocking=True)
return temperatures
def _capture_cuda_graphs(self, tts_mel_embedding=None, tts_text_pos_embedding=None):
print("Capturing CUDA graphs for decode optimization...")
max_bs = 8 # Support up to batch size 8
max_num_blocks = (2048 + self.block_size - 1) // self.block_size
model_dtype = next(self.model.parameters()).dtype
input_ids = torch.ones(max_bs, dtype=torch.int64, device="cuda")
positions = torch.ones(max_bs, dtype=torch.int64, device="cuda")
slot_mapping = torch.zeros(max_bs, dtype=torch.int32, device="cuda")
context_lens = torch.zeros(max_bs, dtype=torch.int32, device="cuda")
block_tables = torch.zeros(
max_bs, max_num_blocks, dtype=torch.int32, device="cuda"
)
outputs = torch.zeros(
max_bs, self.hidden_size, dtype=model_dtype, device="cuda"
)
inputs_embeds_buffer = torch.zeros(
max_bs, self.hidden_size, dtype=model_dtype, device="cuda"
)
self.graph_bs = [1, 2, 4, 8]
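# One CUDA graph is captured per supported decode batch size; at decode time the smallest
# captured size that fits the actual batch is replayed (see _run_decode_with_graph).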
use_tts = tts_mel_embedding is not None and tts_text_pos_embedding is not None
for bs in reversed(self.graph_bs):
graph = torch.cuda.CUDAGraph()
slot_mapping[:bs] = torch.arange(bs, dtype=torch.int32, device="cuda")
context_lens[:bs] = bs + 1
block_tables[:bs, :] = 0
set_forward_context(
False,
slot_mapping=slot_mapping[:bs],
context_lens=context_lens[:bs],
block_tables=block_tables[:bs],
)
# warmup
if use_tts:
assert tts_mel_embedding is not None
assert tts_text_pos_embedding is not None
emb = tts_mel_embedding(input_ids[:bs])
pos_clamped = torch.clamp(positions[:bs], min=0)
pos_emb = tts_text_pos_embedding.emb(pos_clamped)
inputs_embeds_buffer[:bs] = emb + pos_emb
out = self.model(
inputs_embeds=inputs_embeds_buffer[:bs].unsqueeze(1),
return_dict=True,
).last_hidden_state
else:
out = self.model(
input_ids=input_ids[:bs].unsqueeze(1), return_dict=True
).last_hidden_state
outputs[:bs] = out.squeeze(1) if out.dim() == 3 else out
with torch.cuda.graph(graph, self.graph_pool):
if use_tts:
assert tts_mel_embedding is not None
assert tts_text_pos_embedding is not None
emb = tts_mel_embedding(input_ids[:bs])
pos_clamped = torch.clamp(positions[:bs], min=0)
pos_emb = tts_text_pos_embedding.emb(pos_clamped)
inputs_embeds_buffer[:bs] = emb + pos_emb
out = self.model(
inputs_embeds=inputs_embeds_buffer[:bs].unsqueeze(1),
return_dict=True,
).last_hidden_state
else:
out = self.model(
input_ids=input_ids[:bs].unsqueeze(1), return_dict=True
).last_hidden_state
outputs[:bs] = out.squeeze(1) if out.dim() == 3 else out
if self.graph_pool is None:
self.graph_pool = graph.pool()
self.graphs[bs] = graph
torch.cuda.synchronize()
reset_forward_context()
self.graph_vars = {
"input_ids": input_ids,
"positions": positions,
"slot_mapping": slot_mapping,
"context_lens": context_lens,
"block_tables": block_tables,
"outputs": outputs,
"inputs_embeds": inputs_embeds_buffer,
}
print(f"CUDA graphs captured for batch sizes: {self.graph_bs}")
def _run_decode_with_graph(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
context: ForwardContext,
tts_mel_embedding: Optional[torch.nn.Module] = None,
tts_text_pos_embedding: Optional[torch.nn.Module] = None,
) -> torch.Tensor:
bs = input_ids.size(0)
use_tts_embedding = hasattr(self, "_tts_mode") and self._tts_mode
if not self.use_cuda_graph or not self.graphs:
if use_tts_embedding:
assert tts_mel_embedding is not None
assert tts_text_pos_embedding is not None
inputs_embeds = tts_mel_embedding(input_ids)
pos_clamped = torch.clamp(positions, min=0)
pos_emb = tts_text_pos_embedding.emb(pos_clamped)
inputs_embeds = inputs_embeds + pos_emb
out = self.model(
inputs_embeds=inputs_embeds.unsqueeze(1), return_dict=True
).last_hidden_state
else:
out = self.model(
input_ids=input_ids.unsqueeze(1), return_dict=True
).last_hidden_state
return out.squeeze(1) if out.dim() == 3 else out
graph_bs = next((x for x in self.graph_bs if x >= bs), None)
if graph_bs is None:
if use_tts_embedding:
assert tts_mel_embedding is not None
assert tts_text_pos_embedding is not None
inputs_embeds = tts_mel_embedding(input_ids)
pos_clamped = torch.clamp(positions, min=0)
pos_emb = tts_text_pos_embedding.emb(pos_clamped)
inputs_embeds = inputs_embeds + pos_emb
out = self.model(
inputs_embeds=inputs_embeds.unsqueeze(1), return_dict=True
).last_hidden_state
else:
out = self.model(
input_ids=input_ids.unsqueeze(1), return_dict=True
).last_hidden_state
return out.squeeze(1) if out.dim() == 3 else out
graph = self.graphs[graph_bs]
graph_vars = self.graph_vars
if graph_vars is None:
raise RuntimeError("Graph variables not initialized")
graph_vars["input_ids"][:bs] = input_ids
graph_vars["positions"][:bs] = positions
graph_vars["slot_mapping"].fill_(-1)
graph_vars["slot_mapping"][:bs] = context.slot_mapping
graph_vars["context_lens"].zero_()
graph_vars["context_lens"][:bs] = context.context_lens
graph_vars["block_tables"][:bs, :].fill_(-1)
graph_vars["block_tables"][:bs, : context.block_tables.size(1)] = (
context.block_tables
)
graph.replay()
return graph_vars["outputs"][:bs]
def generate(
self,
input_ids: torch.Tensor,
max_new_tokens: int = 100,
temperature: float = 1.0,
top_k: int = 50,
top_p: float = 1.0,
stop_tokens: Optional[List[int]] = None,
attention_mask: Optional[torch.Tensor] = None,
tts_embeddings: Optional[
torch.Tensor
] = None, # TTS: [pad][cond][text] embeddings (87 tokens, NO start_mel)
tts_mel_embedding: Optional[torch.nn.Module] = None, # TTS: mel_embedding layer
tts_text_pos_embedding: Optional[
torch.nn.Module
] = None, # TTS: text_pos_embedding layer
) -> torch.Tensor:
"""
Generate tokens.
Args:
input_ids: Input token IDs [batch_size, seq_len]
max_new_tokens: Maximum number of tokens to generate
temperature: Sampling temperature
top_k: Top-k sampling
top_p: Nucleus sampling threshold
stop_tokens: List of token IDs that stop generation
Returns:
Generated token IDs [batch_size, total_len]
"""
batch_size = input_ids.size(0)
device = input_ids.device
self._tts_mode = tts_embeddings is not None
self._tts_prompt_len = input_ids.size(1) if self._tts_mode else 0
if self.use_cuda_graph and not self.graph_captured:
print(
f"[CAPTURE] use_cuda_graph={self.use_cuda_graph}, graph_captured={self.graph_captured}",
file=sys.stderr,
flush=True,
)
self._capture_cuda_graphs(
tts_mel_embedding=tts_mel_embedding,
tts_text_pos_embedding=tts_text_pos_embedding,
)
self.graph_captured = True
print(
f"[CAPTURE] Completed! graphs={list(self.graphs.keys())}",
file=sys.stderr,
flush=True,
)
if tts_embeddings is not None:
actual_seq_len = tts_embeddings.size(1) + 1 # embeddings + start_mel_token
else:
actual_seq_len = input_ids.size(1)
is_varlen_batch = (
tts_embeddings is not None
and attention_mask is not None
and batch_size > 1
and (attention_mask.sum(dim=1) != attention_mask.size(1)).any()
)
if is_varlen_batch:
seq_lens = [attention_mask[i].sum().item() for i in range(batch_size)]
else:
seq_lens = [actual_seq_len] * batch_size
sequences = []
for i in range(batch_size):
seq_len = seq_lens[i]
token_ids = [1] * seq_len
if tts_embeddings is not None and seq_len > 0:
token_ids[-1] = input_ids[i, -1].item() if input_ids.size(1) > 0 else 1
else:
token_ids = input_ids[i].tolist()
req = Seq(token_ids)
self.kv_manager.allocate(req)
sequences.append(req)
self.current_sequences = sequences
prefill_ids, prefill_pos = self._prepare_prefill(sequences)
if (
tts_embeddings is not None
and tts_mel_embedding is not None
and tts_text_pos_embedding is not None
):
start_token_id = input_ids[0, -1] if input_ids.size(1) > 0 else 8192
start_emb = tts_mel_embedding(
torch.tensor([[start_token_id]], device="cuda")
) # [1, 1, hidden_dim]
start_pos = torch.tensor(
[[tts_embeddings.size(1)]], device="cuda", dtype=torch.long
)
pos_emb = tts_text_pos_embedding.emb(start_pos)
start_emb = start_emb + pos_emb
start_emb = start_emb.repeat(batch_size, 1, 1)
if is_varlen_batch:
valid_embeddings = []
for i in range(batch_size):
emb_len = seq_lens[i] - 1
padding_len = tts_embeddings.size(1) - emb_len
valid_emb = tts_embeddings[i, padding_len:].unsqueeze(
0
) # [1, emb_len, hidden_dim]
valid_embeddings.append(
torch.cat([valid_emb, start_emb[i : i + 1]], dim=1)
)
full_embeddings = torch.cat(
valid_embeddings, dim=1
) # [1, total_tokens, hidden_dim]
else:
full_embeddings = torch.cat(
[tts_embeddings, start_emb], dim=1
) # [batch_size, seq_len, hidden_dim]
model_dtype = next(self.model.parameters()).dtype
if full_embeddings.dtype != model_dtype:
full_embeddings = full_embeddings.to(model_dtype)
hidden_states = self.model(
inputs_embeds=full_embeddings, return_dict=True
).last_hidden_state
else:
hidden_states = self.model(
input_ids=input_ids, attention_mask=attention_mask, return_dict=True
).last_hidden_state
if is_varlen_batch:
context = get_forward_context()
cu_seqlens = context.cu_seqlens_q.cpu().tolist()
last_hidden = torch.stack(
[hidden_states[0, cu_seqlens[i + 1] - 1] for i in range(batch_size)]
)
else:
last_hidden = hidden_states[:, -1, :] # [batch_size, hidden_size]
reset_forward_context()
if self.lm_head is not None:
if last_hidden.dtype != next(self.lm_head.parameters()).dtype:
last_hidden = last_hidden.to(next(self.lm_head.parameters()).dtype)
logits = self.lm_head(last_hidden) # [batch_size, vocab_size]
else:
logits = self.model.compute_logits(last_hidden) # [batch_size, vocab_size]
temperatures = self._prepare_sample(sequences, temperature)
if temperature > 0:
first_token = self.sampler(logits, temperatures)
else:
first_token = torch.argmax(logits, dim=-1)
first_token_list = first_token.tolist()
generated_tokens = [[] for _ in range(batch_size)]
is_finished = [False] * batch_size
for i, token_id in enumerate(first_token_list):
if stop_tokens and token_id in stop_tokens:
is_finished[i] = True
else:
generated_tokens[i].append(token_id)
sequences[i].append_token(token_id)
self.kv_manager.append_to_seq(sequences[i])
if all(is_finished):
for req in sequences:
self.kv_manager.remove_seq(req)
self.current_sequences = []
output_ids = []
for i in range(batch_size):
full_sequence = input_ids[i].tolist() + generated_tokens[i]
output_ids.append(full_sequence)
output = torch.tensor(output_ids, dtype=torch.long, device=device)
return output
remaining_tokens = max_new_tokens - 1
for step in range(remaining_tokens):
decode_ids, decode_pos = self._prepare_decode(sequences)
context = get_forward_context()
hidden_states = self._run_decode_with_graph(
decode_ids,
decode_pos,
context,
tts_mel_embedding=tts_mel_embedding,
tts_text_pos_embedding=tts_text_pos_embedding,
)
# Get logits
if self.lm_head is not None:
logits = self.lm_head(hidden_states) # [batch_size, vocab_size]
else:
logits = self.model.compute_logits(
hidden_states
) # [batch_size, vocab_size]
reset_forward_context()
temperatures = self._prepare_sample(sequences, temperature)
if temperature > 0:
next_token = self.sampler(logits, temperatures)
else:
next_token = torch.argmax(logits, dim=-1)
next_token_list = next_token.tolist()
for i, token_id in enumerate(next_token_list):
if is_finished[i]:
continue
elif stop_tokens and token_id in stop_tokens:
is_finished[i] = True
else:
sequences[i].append_token(token_id)
self.kv_manager.append_to_seq(sequences[i])
generated_tokens[i].append(token_id)
if all(is_finished):
break
for req in sequences:
self.kv_manager.remove_seq(req)
self.current_sequences = []
pad_token = stop_tokens[0] if stop_tokens else 0
if is_varlen_batch:
max_prompt_len = attention_mask.size(1)
output_ids = []
for i in range(batch_size):
padding_len = max_prompt_len - seq_lens[i]
initial_tokens = sequences[i].token_ids[
: sequences[i].num_prompt_tokens
]
padded_prompt = [pad_token] * padding_len + initial_tokens
full_sequence = padded_prompt + generated_tokens[i]
output_ids.append(full_sequence)
else:
output_ids = [
sequences[i].token_ids[: sequences[i].num_prompt_tokens]
+ generated_tokens[i]
for i in range(batch_size)
]
max_length = max(len(seq) for seq in output_ids)
padded_output_ids = [
seq + [pad_token] * (max_length - len(seq)) for seq in output_ids
]
output = torch.tensor(padded_output_ids, dtype=torch.long, device=device)
assert output.size(0) == batch_size, (
f"Output batch size mismatch: {output.size(0)} != {batch_size}"
)
return output
class Sampler(nn.Module):
def __init__(self):
super().__init__()
@torch.compile
def forward(self, logits: torch.Tensor, temperatures: torch.Tensor):
logits = logits.float().div_(temperatures.unsqueeze(dim=1))
probs = torch.softmax(logits, dim=-1)
sample_tokens = probs.div_(
torch.empty_like(probs).exponential_(1).clamp_min_(1e-10)
).argmax(dim=-1)
return sample_tokens

154
indextts/accel/attention.py Normal file
View File

@@ -0,0 +1,154 @@
from dataclasses import dataclass
import torch
import triton
import triton.language as tl
from flash_attn import flash_attn_varlen_func, flash_attn_with_kvcache
from torch import nn
@dataclass
class ForwardContext:
is_prefill: bool = False
cu_seqlens_q: torch.Tensor | None = None
cu_seqlens_k: torch.Tensor | None = None
max_seqlen_q: int = 0
max_seqlen_k: int = 0
slot_mapping: torch.Tensor | None = None
context_lens: torch.Tensor | None = None
block_tables: torch.Tensor | None = None
_FORWARD_CONTEXT = ForwardContext()
def get_forward_context():
return _FORWARD_CONTEXT
def set_forward_context(
is_prefill,
cu_seqlens_q=None,
cu_seqlens_k=None,
max_seqlen_q=0,
max_seqlen_k=0,
slot_mapping=None,
context_lens=None,
block_tables=None,
):
global _FORWARD_CONTEXT
_FORWARD_CONTEXT = ForwardContext(
is_prefill,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
slot_mapping,
context_lens,
block_tables,
)
def reset_forward_context():
global _FORWARD_CONTEXT
_FORWARD_CONTEXT = ForwardContext()
@triton.jit
def store_kvcache_kernel(
key_ptr,
key_stride,
value_ptr,
value_stride,
k_cache_ptr,
v_cache_ptr,
slot_mapping_ptr,
D: tl.constexpr,
):
BLOCK_SIZE: tl.constexpr = 2048
idx = tl.program_id(0)
slot = tl.load(slot_mapping_ptr + idx)
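# slot == -1 marks padded entries of slot_mapping (e.g. unused CUDA-graph slots); skip them.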
if slot == -1:
return
d_offset = 0
while d_offset < D:
cur_block_size = min(BLOCK_SIZE, D - d_offset)
key_offsets = idx * key_stride + d_offset + tl.arange(0, BLOCK_SIZE)
value_offsets = idx * value_stride + d_offset + tl.arange(0, BLOCK_SIZE)
cache_offsets = slot * D + d_offset + tl.arange(0, BLOCK_SIZE)
mask = tl.arange(0, BLOCK_SIZE) < cur_block_size
key = tl.load(key_ptr + key_offsets, mask=mask, other=0.0)
value = tl.load(value_ptr + value_offsets, mask=mask, other=0.0)
tl.store(k_cache_ptr + cache_offsets, key, mask=mask)
tl.store(v_cache_ptr + cache_offsets, value, mask=mask)
d_offset += BLOCK_SIZE
def store_kvcache(
key: torch.Tensor,
value: torch.Tensor,
k_cache: torch.Tensor,
v_cache: torch.Tensor,
slot_mapping: torch.Tensor,
):
N, num_heads, head_dim = key.shape
D = num_heads * head_dim
assert key.stride(-1) == 1 and value.stride(-1) == 1
assert key.stride(1) == head_dim and value.stride(1) == head_dim
assert k_cache.stride(1) == D and v_cache.stride(1) == D
assert slot_mapping.numel() == N
store_kvcache_kernel[(N,)](
key, key.stride(0), value, value.stride(0), k_cache, v_cache, slot_mapping, D
)
class Attention(nn.Module):
def __init__(
self,
num_heads: int,
head_dim: int,
scale: float,
num_kv_heads: int,
):
super().__init__()
self.num_heads = num_heads
self.head_dim = head_dim
self.scale = scale
self.num_kv_heads = num_kv_heads
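# Placeholder caches; KVCacheManager.wire_kv_cache_to_model() replaces them with views into the paged KV cache.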
self.k_cache = self.v_cache = torch.tensor([])
def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
context = get_forward_context()
k_cache, v_cache = self.k_cache, self.v_cache
if k_cache.numel() and v_cache.numel() and context.slot_mapping is not None:
store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)
if context.is_prefill:
if context.block_tables is not None:
k, v = k_cache, v_cache
o = flash_attn_varlen_func(
q,
k,
v,
max_seqlen_q=context.max_seqlen_q,
cu_seqlens_q=context.cu_seqlens_q,
max_seqlen_k=context.max_seqlen_k,
cu_seqlens_k=context.cu_seqlens_k,
softmax_scale=self.scale,
causal=True,
block_table=context.block_tables,
)
else:
o = flash_attn_with_kvcache(
q.unsqueeze(1),
k_cache,
v_cache,
cache_seqlens=context.context_lens,
block_table=context.block_tables,
softmax_scale=self.scale,
causal=True,
)
return o

View File

@@ -0,0 +1,181 @@
import torch
import torch.nn as nn
from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions
from transformers.models.gpt2.modeling_gpt2 import Conv1D, GPT2Block, GPT2Model
from .attention import Attention
class GPT2AccelAttention(nn.Module):
def __init__(self, config, layer_idx=None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
max_positions = config.max_position_embeddings
self.register_buffer(
"bias",
torch.tril(
torch.ones((max_positions, max_positions), dtype=torch.bool)
).view(1, 1, max_positions, max_positions),
persistent=False,
)
self.register_buffer("masked_bias", torch.tensor(-1e4), persistent=False)
self.embed_dim = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.embed_dim // self.num_heads
self.split_size = self.embed_dim
if self.head_dim * self.num_heads != self.embed_dim:
raise ValueError(
f"`embed_dim` must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
f" {self.num_heads})."
)
self.scale_attn_weights = config.scale_attn_weights
self.c_attn = Conv1D(3 * self.embed_dim, self.embed_dim)
self.c_proj = Conv1D(self.embed_dim, self.embed_dim)
self.attn_dropout = nn.Dropout(config.attn_pdrop)
self.resid_dropout = nn.Dropout(config.resid_pdrop)
scale = (self.head_dim**-0.5) if self.scale_attn_weights else 1.0
self.accel_attn = Attention(
self.num_heads, self.head_dim, scale, self.num_heads
)
def forward(
self,
hidden_states: torch.Tensor,
layer_past=None,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
use_cache=False,
output_attentions=False,
past_key_value=None,
**kwargs,
):
if encoder_hidden_states is not None:
raise NotImplementedError("Cross attention not supported in accel mode")
qkv = self.c_attn(hidden_states)
query, key, value = qkv.split(self.split_size, dim=2)
# [B, T, H*D] -> [B, H, T, D]
query = self._split_heads(query, self.num_heads, self.head_dim)
key = self._split_heads(key, self.num_heads, self.head_dim)
value = self._split_heads(value, self.num_heads, self.head_dim)
# flatten to [B*T, H, D]
bsz, num_heads, seq_len, head_dim = query.shape
q_flat = query.transpose(1, 2).contiguous().view(-1, num_heads, head_dim)
k_flat = key.transpose(1, 2).contiguous().view(-1, num_heads, head_dim)
v_flat = value.transpose(1, 2).contiguous().view(-1, num_heads, head_dim)
# ensure fp16
if q_flat.device.type == "cuda" and q_flat.dtype != torch.float16:
orig_dtype = q_flat.dtype
q_flat = q_flat.to(torch.float16)
k_flat = k_flat.to(torch.float16)
v_flat = v_flat.to(torch.float16)
else:
orig_dtype = q_flat.dtype
o_flat = self.accel_attn(q_flat, k_flat, v_flat) # [B*T, H, D]
if o_flat.dtype != orig_dtype:
o_flat = o_flat.to(orig_dtype)
# Reshape back: [B*T, H, D] -> [B, H, T, D]
attn_output = o_flat.view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)
attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
attn_output = self.c_proj(attn_output)
attn_output = self.resid_dropout(attn_output)
outputs = (attn_output, None)
if output_attentions:
outputs += (None,)
return outputs
def _split_heads(self, tensor, num_heads, head_dim):
new_shape = tensor.size()[:-1] + (num_heads, head_dim)
tensor = tensor.view(new_shape)
return tensor.permute(0, 2, 1, 3) # (batch, head, seq_length, head_features)
def _merge_heads(self, tensor, num_heads, head_dim):
tensor = tensor.permute(0, 2, 1, 3).contiguous()
new_shape = tensor.size()[:-2] + (num_heads * head_dim,)
return tensor.view(new_shape)
class GPT2AccelBlock(GPT2Block):
def __init__(self, config, layer_idx=None):
super().__init__(config, layer_idx)
self.attn = GPT2AccelAttention(config, layer_idx)
class GPT2AccelModel(GPT2Model):
def __init__(self, config):
super().__init__(config)
self.h = nn.ModuleList(
[
GPT2AccelBlock(config, layer_idx=i)
for i in range(config.num_hidden_layers)
]
)
def forward(
self,
input_ids=None,
past_key_values=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
if inputs_embeds is not None:
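# Fast path used by AccelInferenceEngine: embeddings are precomputed by the caller and attention
# state (KV cache, sequence lengths) comes from the forward context, so HuggingFace's own masking
# and cache handling are bypassed.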
hidden_states = inputs_embeds
for block in self.h:
hidden_states = block(hidden_states)[0]
hidden_states = self.ln_f(hidden_states)
if return_dict:
return BaseModelOutputWithPastAndCrossAttentions(
last_hidden_state=hidden_states,
past_key_values=None,
hidden_states=None,
attentions=None,
)
return (hidden_states,)
else:
return super().forward(
input_ids=input_ids,
past_key_values=None,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=None,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
use_cache=False,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)

View File

@@ -0,0 +1,209 @@
import hashlib
import pickle
from collections import deque
from copy import copy
from typing import Dict, List, Optional, Set
import torch
class KVCacheBlock:
def __init__(self, block_id: int):
self.block_id = block_id
self.ref_cnt = 0
self._block_hash = None
self.token_ids = []
@property
def block_hash(self) -> Optional[bytes]:
return self._block_hash
def update(self, block_hash: bytes, token_ids: List[int]):
self._block_hash = block_hash
self.token_ids = token_ids
def reset(self):
self.ref_cnt = 1
self._block_hash = None
self.token_ids = []
class Seq:
def __init__(self, token_ids: List[int], block_size: int = 256):
self.token_ids = copy(token_ids)
self.last_token = token_ids[-1] if token_ids else 0
self.num_tokens = len(self.token_ids)
self.num_prompt_tokens = len(token_ids)
self.num_cached_tokens = 0
self.block_table: List[int] = []
self.block_size = block_size
def __len__(self):
return self.num_tokens
def __getitem__(self, key):
return self.token_ids[key]
@property
def num_blocks(self):
return (self.num_tokens + self.block_size - 1) // self.block_size
@property
def num_cached_blocks(self):
return self.num_cached_tokens // self.block_size
@property
def last_block_num_tokens(self):
return self.num_tokens - (self.num_blocks - 1) * self.block_size
def get_block_tokens(self, block_idx: int) -> List[int]:
assert 0 <= block_idx < self.num_blocks
start = block_idx * self.block_size
end = start + self.block_size
return self.token_ids[start:end]
def append_token(self, token_id: int):
self.token_ids.append(token_id)
self.last_token = token_id
self.num_tokens += 1
class KVCacheManager:
def __init__(
self,
num_layers: int,
num_heads: int,
head_dim: int,
block_size: int,
num_blocks: int,
dtype: torch.dtype,
):
self.num_layers = num_layers
self.num_heads = num_heads
self.head_dim = head_dim
self.block_size = block_size
self.num_blocks = num_blocks
self.dtype = dtype
self.blocks: List[KVCacheBlock] = [KVCacheBlock(i) for i in range(num_blocks)]
self.block_hash_to_id: Dict[bytes, int] = {}
self.free_block_ids: deque = deque(range(num_blocks))
self.used_block_ids: Set[int] = set()
device = "cuda" if torch.cuda.is_available() else "cpu"
cache_dtype = torch.float16 if device == "cuda" else dtype
self.kv_cache = torch.empty(
2,
num_layers,
num_blocks,
block_size,
num_heads,
head_dim,
dtype=cache_dtype,
device=device,
)
@classmethod
def compute_block_hash(
cls, token_ids: List[int], parent_hash: Optional[bytes] = None
) -> bytes:
hash_input = []
if parent_hash is not None:
hash_input.append(parent_hash)
hash_input.extend(token_ids)
input_bytes = pickle.dumps(tuple(hash_input), protocol=pickle.HIGHEST_PROTOCOL)
return hashlib.sha256(input_bytes).digest()
def _allocate_block(self, block_id: int) -> KVCacheBlock:
block = self.blocks[block_id]
assert block.ref_cnt == 0
block.reset()
self.free_block_ids.remove(block_id)
self.used_block_ids.add(block_id)
return block
def _deallocate_block(self, block_id: int):
assert self.blocks[block_id].ref_cnt == 0
self.used_block_ids.remove(block_id)
self.free_block_ids.append(block_id)
def allocate(self, sequence: Seq):
assert not sequence.block_table, "Sequence already has allocated blocks"
parent_hash = None
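# Block hashes are chained through parent_hash, so identical prompt prefixes map to the same
# cache blocks and can be reused (prefix caching).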
cache_miss = False
for i in range(sequence.num_blocks):
token_ids = sequence.get_block_tokens(i)
block_hash = (
self.compute_block_hash(token_ids, parent_hash)
if len(token_ids) == self.block_size
else None
)
block_id = self.block_hash_to_id.get(block_hash) if block_hash else None
if block_id is None or self.blocks[block_id].token_ids != token_ids:
cache_miss = True
if cache_miss:
block_id = self.free_block_ids[0]
block = self._allocate_block(block_id)
else:
sequence.num_cached_tokens += self.block_size
if block_id is not None and block_id in self.used_block_ids:
block = self.blocks[block_id]
block.ref_cnt += 1
else:
block_id = self.free_block_ids[0]
block = self._allocate_block(block_id)
if block_hash is not None:
block.update(block_hash, token_ids)
self.block_hash_to_id[block_hash] = block_id
parent_hash = block_hash
sequence.block_table.append(block_id)
def deallocate(self, sequence: Seq):
for block_id in reversed(sequence.block_table):
block = self.blocks[block_id]
block.ref_cnt -= 1
if block.ref_cnt == 0:
self._deallocate_block(block_id)
sequence.num_cached_tokens = 0
sequence.block_table.clear()
def append_to_seq(self, sequence: Seq):
block_table = sequence.block_table
last_block = self.blocks[block_table[-1]]
if len(sequence) % self.block_size == 1:
assert last_block.block_hash is not None
block_id = self.free_block_ids[0]
self._allocate_block(block_id)
block_table.append(block_id)
elif len(sequence) % self.block_size == 0:
assert last_block.block_hash is None
token_ids = sequence.get_block_tokens(sequence.num_blocks - 1)
parent_hash = (
self.blocks[block_table[-2]].block_hash
if len(block_table) > 1
else None
)
block_hash = self.compute_block_hash(token_ids, parent_hash)
last_block.update(block_hash, token_ids)
self.block_hash_to_id[block_hash] = last_block.block_id
else:
assert last_block.block_hash is None
def remove_seq(self, sequence: Seq):
self.deallocate(sequence)
def wire_kv_cache_to_model(self, model):
layer_id = 0
for module in model.modules():
if hasattr(module, "k_cache") and hasattr(module, "v_cache"):
module.k_cache = self.kv_cache[0, layer_id]
module.v_cache = self.kv_cache[1, layer_id]
layer_id += 1

View File

@@ -12,9 +12,9 @@ def main():
parser.add_argument("-o", "--output_path", type=str, default="gen.wav", help="Path to the output wav file")
parser.add_argument("-c", "--config", type=str, default="checkpoints/config.yaml", help="Path to the config file. Default is 'checkpoints/config.yaml'")
parser.add_argument("--model_dir", type=str, default="checkpoints", help="Path to the model directory. Default is 'checkpoints'")
parser.add_argument("--fp16", action="store_true", default=True, help="Use FP16 for inference if available")
parser.add_argument("--fp16", action="store_true", default=False, help="Use FP16 for inference if available")
parser.add_argument("-f", "--force", action="store_true", default=False, help="Force to overwrite the output file if it exists")
parser.add_argument("-d", "--device", type=str, default=None, help="Device to run the model on (cpu, cuda, mps)." )
parser.add_argument("-d", "--device", type=str, default=None, help="Device to run the model on (cpu, cuda, mps, xpu)." )
args = parser.parse_args()
if len(args.text.strip()) == 0:
print("ERROR: Text is empty.")
@@ -47,15 +47,18 @@ def main():
if args.device is None:
if torch.cuda.is_available():
args.device = "cuda:0"
elif torch.mps.is_available():
elif hasattr(torch, "xpu") and torch.xpu.is_available():
args.device = "xpu"
elif hasattr(torch, "mps") and torch.mps.is_available():
args.device = "mps"
else:
args.device = "cpu"
args.fp16 = False # Disable FP16 on CPU
print("WARNING: Running on CPU may be slow.")
# TODO: Add CLI support for IndexTTS2.
from indextts.infer import IndexTTS
tts = IndexTTS(cfg_path=args.config, model_dir=args.model_dir, is_fp16=args.fp16, device=args.device)
tts = IndexTTS(cfg_path=args.config, model_dir=args.model_dir, use_fp16=args.fp16, device=args.device)
tts.infer(audio_prompt=args.voice, text=args.text.strip(), output_path=output_path)
if __name__ == "__main__":

View File

@@ -3,7 +3,12 @@ import functools
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Config, GPT2PreTrainedModel, LogitsProcessorList, GenerationMixin
import transformers
from transformers import GPT2Config, LogitsProcessorList
from indextts.gpt.transformers_gpt2 import GPT2PreTrainedModel, GPT2Model
# from transformers import GPT2Config, GPT2PreTrainedModel, LogitsProcessorList
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
from transformers.utils.model_parallel_utils import (assert_device_map,
get_device_map)
@@ -37,7 +42,7 @@ class ResBlock(nn.Module):
return F.relu(self.net(x) + x)
class GPT2InferenceModel(GPT2PreTrainedModel, GenerationMixin):
class GPT2InferenceModel(GPT2PreTrainedModel):
def __init__(self, config, gpt, text_pos_emb, embeddings, norm, linear, kv_cache=False):
super().__init__(config)
# Note: the argument named `text_pos_emb` here actually represents the mel position embedding

796
indextts/gpt/model_v2.py Normal file
View File

@@ -0,0 +1,796 @@
import functools
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from transformers import GPT2Config, LogitsProcessorList
from indextts.gpt.transformers_gpt2 import GPT2PreTrainedModel, GPT2Model
# from transformers import GPT2Config, GPT2PreTrainedModel, LogitsProcessorList
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
from transformers.utils.model_parallel_utils import (assert_device_map,
get_device_map)
from indextts.gpt.conformer_encoder import ConformerEncoder
from indextts.gpt.perceiver import PerceiverResampler
from indextts.utils.arch_util import AttentionBlock
from indextts.utils.typical_sampling import TypicalLogitsWarper
def null_position_embeddings(range, dim):
return torch.zeros((range.shape[0], range.shape[1], dim), device=range.device)
class ResBlock(nn.Module):
"""
Basic residual convolutional block that uses GroupNorm.
"""
def __init__(self, chan):
super().__init__()
self.net = nn.Sequential(
nn.Conv1d(chan, chan, kernel_size=3, padding=1),
nn.GroupNorm(chan // 8, chan),
nn.ReLU(),
nn.Conv1d(chan, chan, kernel_size=3, padding=1),
nn.GroupNorm(chan // 8, chan)
)
def forward(self, x):
return F.relu(self.net(x) + x)
class GPT2InferenceModel(GPT2PreTrainedModel):
def __init__(self, config, gpt, text_pos_emb, embeddings, norm, linear, kv_cache=False):
super().__init__(config)
# Note: the argument named `text_pos_emb` here actually represents the mel position embedding
self.transformer = gpt
self.text_pos_embedding = text_pos_emb
self.embeddings = embeddings
self.final_norm = norm
self.lm_head = nn.Sequential(norm, linear)
self.kv_cache = kv_cache
# Model parallel
self.model_parallel = False
self.device_map = None
self.cached_mel_emb = None
def parallelize(self, device_map=None):
self.device_map = (
get_device_map(len(self.transformer.h), range(max(1, torch.cuda.device_count())))
if device_map is None
else device_map
)
assert_device_map(self.device_map, len(self.transformer.h))
self.transformer.parallelize(self.device_map)
self.lm_head = self.lm_head.to(self.transformer.first_device)
self.model_parallel = True
def deparallelize(self):
self.transformer.deparallelize()
self.transformer = self.transformer.to("cpu")
self.lm_head = self.lm_head.to("cpu")
self.model_parallel = False
torch.cuda.empty_cache()
if torch.backends.mps.is_available():
torch.mps.empty_cache()
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def store_mel_emb(self, mel_emb):
self.cached_mel_emb = mel_emb
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
token_type_ids = kwargs.get("token_type_ids", None) # usually None
if not self.kv_cache:
past_key_values = None
# only last token for inputs_ids if past is defined in kwargs
if past_key_values:
input_ids = input_ids[:, -1].unsqueeze(-1)
if token_type_ids is not None:
token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
attention_mask = kwargs.get("attention_mask", None)
position_ids = kwargs.get("position_ids", None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 0)
if past_key_values:
position_ids = position_ids[:, -1].unsqueeze(-1)
else:
position_ids = None
return {
"input_ids": input_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"position_ids": position_ids,
"attention_mask": attention_mask,
"token_type_ids": token_type_ids,
}
def forward(
self,
input_ids=None,
past_key_values=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
labels=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
assert self.cached_mel_emb is not None
assert inputs_embeds is None # Not supported by this inference model.
assert labels is None # Training not supported by this inference model.
return_dict = (
return_dict if return_dict is not None else self.config.use_return_dict
)
# Create embedding
mel_len = self.cached_mel_emb.shape[1]
if input_ids.shape[1] != 1:
text_inputs = input_ids[:, mel_len:]
text_emb = self.embeddings(text_inputs)
text_emb = text_emb + self.text_pos_embedding(text_emb)
if self.cached_mel_emb.shape[0] != text_emb.shape[0]:
mel_emb = self.cached_mel_emb.repeat_interleave(
text_emb.shape[0] // self.cached_mel_emb.shape[0], 0
)
else: # this outcome only occurs once per loop in most cases
mel_emb = self.cached_mel_emb
emb = torch.cat([mel_emb, text_emb], dim=1)
else:
emb = self.embeddings(input_ids)
emb = emb + self.text_pos_embedding.get_fixed_embedding(
attention_mask.shape[1] - mel_len, attention_mask.device
)
transformer_outputs = self.transformer(
inputs_embeds=emb,
past_key_values=past_key_values,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
# Set device for model parallelism
if self.model_parallel:
if torch.backends.mps.is_available():
self.to(self.transformer.first_device)
else:
torch.cuda.set_device(self.transformer.first_device)
hidden_states = hidden_states.to(self.lm_head.weight.device)
lm_logits = self.lm_head(hidden_states)
if not return_dict:
return (lm_logits,) + transformer_outputs[1:]
return CausalLMOutputWithCrossAttentions(
loss=None,
logits=lm_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
cross_attentions=transformer_outputs.cross_attentions,
)
@staticmethod
def _reorder_cache(past, beam_idx):
"""
This function is used to re-order the :obj:`past_key_values` cache if
:meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
"""
return tuple(
tuple(
past_state.index_select(0, beam_idx.to(past_state.device))
for past_state in layer_past
)
for layer_past in past
)
class ConditioningEncoder(nn.Module):
def __init__(self,
spec_dim,
embedding_dim,
attn_blocks=6,
num_attn_heads=4,
do_checkpointing=False,
mean=False):
super().__init__()
attn = []
self.init = nn.Conv1d(spec_dim, embedding_dim, kernel_size=1)
for a in range(attn_blocks):
attn.append(AttentionBlock(embedding_dim, num_attn_heads))
self.attn = nn.Sequential(*attn)
self.dim = embedding_dim
self.do_checkpointing = do_checkpointing
self.mean = mean
def forward(self, x):
h = self.init(x)
h = self.attn(h)
if self.mean:
return h.mean(dim=2)
else:
return h
# return h[:, :, 0]
class LearnedPositionEmbeddings(nn.Module):
def __init__(self, seq_len, model_dim, init=.02):
super().__init__()
self.emb = nn.Embedding(seq_len, model_dim)
# Initializing this way is standard for GPT-2
self.emb.weight.data.normal_(mean=0.0, std=init)
def forward(self, x):
sl = x.shape[1]
return self.emb(torch.arange(0, sl, device=x.device))
def get_fixed_embedding(self, ind, dev):
return self.emb(torch.tensor([ind], device=dev)).unsqueeze(0)
def build_hf_gpt_transformer(layers, model_dim, heads, max_mel_seq_len, max_text_seq_len, checkpointing):
"""
GPT-2 implemented by the HuggingFace library.
"""
from transformers import GPT2Config, GPT2Model
gpt_config = GPT2Config(vocab_size=256, # Unused.
n_positions=max_mel_seq_len + max_text_seq_len,
n_ctx=max_mel_seq_len + max_text_seq_len,
n_embd=model_dim,
n_layer=layers,
n_head=heads,
gradient_checkpointing=checkpointing,
use_cache=not checkpointing)
gpt = GPT2Model(gpt_config)
# Override the built in positional embeddings
del gpt.wpe
gpt.wpe = functools.partial(null_position_embeddings, dim=model_dim)
# Built-in token embeddings are unused.
del gpt.wte
return gpt, LearnedPositionEmbeddings(max_mel_seq_len, model_dim), LearnedPositionEmbeddings(max_text_seq_len, model_dim), \
None, None
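# Note (added for clarity): the returned GPT2Model has its built-in positional
# embedding replaced by `null_position_embeddings` (which, as the name suggests,
# contributes no positional information) and its token embedding removed, so
# callers pass `inputs_embeds` that already include the learned mel/text
# positional embeddings returned alongside it.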
class MelEncoder(nn.Module):
def __init__(self, channels, mel_channels=80, resblocks_per_reduction=2):
super().__init__()
self.channels = channels
self.encoder = nn.Sequential(nn.Conv1d(mel_channels, channels // 4, kernel_size=3, padding=1),
nn.Sequential(*[ResBlock(channels // 4) for _ in range(resblocks_per_reduction)]),
nn.Conv1d(channels // 4, channels // 2, kernel_size=3, stride=2, padding=1),
nn.GroupNorm(channels // 16, channels // 2),
nn.ReLU(),
nn.Sequential(*[ResBlock(channels // 2) for _ in range(resblocks_per_reduction)]),
nn.Conv1d(channels // 2, channels, kernel_size=3, stride=2, padding=1),
nn.GroupNorm(channels // 8, channels),
nn.ReLU(),
nn.Sequential(*[ResBlock(channels) for _ in range(resblocks_per_reduction)]),
)
self.reduction = 4
def forward(self, x):
for e in self.encoder:
x = e(x)
return x.permute(0, 2, 1)
class UnifiedVoice(nn.Module):
def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_mel_tokens=250, max_conditioning_inputs=1,
mel_length_compression=1024, number_text_tokens=256,
start_text_token=0, stop_text_token=1, number_mel_codes=8194, start_mel_token=8192, stop_mel_token=8193,
train_solo_embeddings=False, use_mel_codes_as_input=True,
checkpointing=True, types=1,
condition_num_latent=32, condition_type="perceiver", condition_module=None, emo_condition_module=None, use_accel=False):
"""
Args:
layers: Number of layers in transformer stack.
model_dim: Operating dimensions of the transformer
heads: Number of transformer heads. Must be divisible by model_dim. Recommend model_dim//64
max_text_tokens: Maximum number of text tokens that will be encountered by model.
max_mel_tokens: Maximum number of MEL tokens that will be encountered by model.
max_conditioning_inputs: Maximum number of conditioning inputs provided to the model. If (1), conditioning input can be of format (b,80,s), otherwise (b,n,80,s).
mel_length_compression: The factor between <number_input_samples> and <mel_tokens>. Used to compute MEL code padding given wav input length.
number_text_tokens:
start_text_token:
stop_text_token:
number_mel_codes:
start_mel_token:
stop_mel_token:
train_solo_embeddings:
use_mel_codes_as_input:
checkpointing:
condition_type: perceiver, gst or default encoder
"""
super().__init__()
self.number_text_tokens = number_text_tokens
self.start_text_token = start_text_token
self.stop_text_token = stop_text_token
self.number_mel_codes = number_mel_codes
self.start_mel_token = start_mel_token
self.stop_mel_token = stop_mel_token
self.layers = layers
self.heads = heads
self.max_mel_tokens = max_mel_tokens
self.max_text_tokens = max_text_tokens
self.model_dim = model_dim
self.max_conditioning_inputs = max_conditioning_inputs
self.mel_length_compression = mel_length_compression
self.condition_type = condition_type
self.cond_num = condition_num_latent
self.cond_mask_pad = nn.ConstantPad1d((self.cond_num, 0), True)
self.emo_cond_mask_pad = nn.ConstantPad1d((1, 0), True)
if condition_type == "perceiver":
self.conditioning_encoder = ConditioningEncoder(1024, model_dim, num_attn_heads=heads)
self.perceiver_encoder = PerceiverResampler(model_dim, dim_context=model_dim, num_latents=self.cond_num)
elif condition_type == "conformer_perceiver" or condition_type == "conformer_encoder":
self.conditioning_encoder = ConformerEncoder(input_size=1024,
output_size=condition_module['output_size'],
linear_units=condition_module['linear_units'],
attention_heads=condition_module['attention_heads'],
num_blocks=condition_module['num_blocks'],
input_layer=condition_module['input_layer'])
if condition_type == "conformer_perceiver":
self.perceiver_encoder = PerceiverResampler(model_dim, dim_context=condition_module['output_size'],
ff_mult=condition_module['perceiver_mult'],
heads=condition_module['attention_heads'],
num_latents=self.cond_num)
else:
self.conditioning_encoder = ConditioningEncoder(1024, model_dim, num_attn_heads=heads, mean=True)
self.emo_conditioning_encoder = ConformerEncoder(input_size=1024,
output_size=emo_condition_module['output_size'],
linear_units=emo_condition_module['linear_units'],
attention_heads=emo_condition_module['attention_heads'],
num_blocks=emo_condition_module['num_blocks'],
input_layer=emo_condition_module['input_layer'])
self.emo_perceiver_encoder = PerceiverResampler(1024, dim_context=emo_condition_module['output_size'],
ff_mult=emo_condition_module['perceiver_mult'],
heads=emo_condition_module['attention_heads'],
num_latents=1)
self.text_embedding = nn.Embedding(self.number_text_tokens * types + 1, model_dim)
self.emo_layer = nn.Linear(model_dim, model_dim)
self.emovec_layer = nn.Linear(1024, model_dim)
if use_mel_codes_as_input:
self.mel_embedding = nn.Embedding(self.number_mel_codes, model_dim)
else:
self.mel_embedding = MelEncoder(model_dim, resblocks_per_reduction=1)
self.gpt, self.mel_pos_embedding, self.text_pos_embedding, self.mel_layer_pos_embedding, self.text_layer_pos_embedding = \
build_hf_gpt_transformer(layers, model_dim, heads, self.max_mel_tokens + 2 + self.max_conditioning_inputs,
self.max_text_tokens + 2, checkpointing)
if train_solo_embeddings:
self.mel_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True)
self.text_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True)
else:
self.mel_solo_embedding = 0
self.text_solo_embedding = 0
self.final_norm = nn.LayerNorm(model_dim)
self.text_head = nn.Linear(model_dim, self.number_text_tokens * types + 1)
self.mel_head = nn.Linear(model_dim, self.number_mel_codes)
self.speed_emb = nn.Embedding(2, model_dim)
self.speed_emb.weight.data.normal_(mean=0.0, std=0.0)
# Initialize the embeddings per the GPT-2 scheme
embeddings = [self.text_embedding]
if use_mel_codes_as_input:
embeddings.append(self.mel_embedding)
for module in embeddings:
module.weight.data.normal_(mean=0.0, std=.02)
self.use_accel = use_accel
self.accel_engine = None # Will be initialized in post_init_gpt2_config
def post_init_gpt2_config(self, use_deepspeed=False, kv_cache=False, half=False):
seq_length = self.max_mel_tokens + self.max_text_tokens + 2
gpt_config = GPT2Config(
vocab_size=self.number_mel_codes,
n_positions=seq_length,
n_ctx=seq_length,
n_embd=self.model_dim,
n_layer=self.layers,
n_head=self.heads,
gradient_checkpointing=False,
use_cache=True,
)
if self.use_accel and torch.cuda.is_available():
# Check if flash attention is available
try:
import flash_attn
except ImportError:
raise ImportError("flash_attn is required for acceleration but not installed. Please install from https://github.com/Dao-AILab/flash-attention/releases/")
from indextts.accel import GPT2AccelModel, AccelInferenceEngine
# Create accel model
accel_gpt = GPT2AccelModel(gpt_config)
accel_gpt.load_state_dict(self.gpt.state_dict(), strict=False)
if half:
accel_gpt = accel_gpt.half().cuda()
else:
accel_gpt = accel_gpt.cuda()
accel_gpt.eval()
lm_head_with_norm = nn.Sequential(self.final_norm, self.mel_head)
self.accel_engine = AccelInferenceEngine(
model=accel_gpt,
lm_head=lm_head_with_norm,
num_layers=self.layers,
num_heads=self.heads,
head_dim=self.model_dim // self.heads,
block_size=256,
num_blocks=16, # Reduce to save memory (16*256 = 4096 tokens capacity)
use_cuda_graph=True,
)
print("acceleration engine initialized")
self.inference_model = GPT2InferenceModel(
gpt_config,
self.gpt,
self.mel_pos_embedding,
self.mel_embedding,
self.final_norm,
self.mel_head,
kv_cache=kv_cache,
)
if use_deepspeed and half and torch.cuda.is_available():
import deepspeed
self.ds_engine = deepspeed.init_inference(model=self.inference_model,
mp_size=1,
replace_with_kernel_inject=True,
dtype=torch.float16)
self.inference_model = self.ds_engine.module.eval()
elif use_deepspeed and torch.cuda.is_available():
import deepspeed
self.ds_engine = deepspeed.init_inference(model=self.inference_model,
mp_size=1,
replace_with_kernel_inject=True,
dtype=torch.float32)
self.inference_model = self.ds_engine.module.eval()
else:
self.inference_model = self.inference_model.eval()
# self.inference_model = PrunedGPT2InferenceModel(gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head)
self.gpt.wte = self.mel_embedding
def build_aligned_inputs_and_targets(self, input, start_token, stop_token):
inp = F.pad(input, (1, 0), value=start_token)
tar = F.pad(input, (0, 1), value=stop_token)
return inp, tar
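# Worked example (illustrative): for a row [t1, t2, t3] with start/stop tokens S/E,
# inp = [S, t1, t2, t3] is what the model sees and tar = [t1, t2, t3, E] is what it
# is trained to predict, i.e. the inputs shifted left by one with a stop token appended.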
def set_mel_padding(self, mel_input_tokens, mel_lengths):
"""
Given mel tokens that are derived from a padded audio clip and the actual lengths of each batch element in
that audio clip, reformats the tokens with STOP_MEL_TOKEN in place of the zero padding. This is required
preformatting to create a working TTS model.
"""
for b in range(len(mel_lengths)):
# Due to the convolutional nature of how these tokens are generated,
# it would be best if the model predicts a token past the actual last token.
actual_end = mel_lengths[b]
if actual_end < mel_input_tokens.shape[-1]:
mel_input_tokens[b, actual_end:] = self.stop_mel_token
return mel_input_tokens
def set_text_padding(self, text_input_tokens, text_lengths):
"""
Given mel tokens that are derived from a padded audio clip and the actual lengths of each batch element in
that audio clip, reformats the tokens with STOP_MEL_TOKEN in place of the zero padding. This is required
preformatting to create a working TTS model.
"""
for b in range(len(text_lengths)):
# Due to the convolutional nature of how these tokens are generated,
# it would be best if the model predicts a token past the actual last token.
actual_end = text_lengths[b]
if actual_end < text_input_tokens.shape[-1]:
text_input_tokens[b, actual_end:] = self.stop_text_token
return text_input_tokens
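# Example (illustrative): with stop_text_token=1, text_lengths=[3] and a padded row
# [7, 8, 9, 0, 0], the row becomes [7, 8, 9, 1, 1]; everything past the true length
# is overwritten in place.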
def get_logits(self, speech_conditioning_inputs, first_inputs, first_head, second_inputs=None, second_head=None, get_attns=False, return_latent=False):
if second_inputs is not None:
emb = torch.cat([speech_conditioning_inputs, first_inputs, second_inputs], dim=1)
else:
emb = torch.cat([speech_conditioning_inputs, first_inputs], dim=1)
gpt_out = self.gpt(inputs_embeds=emb, return_dict=True, output_attentions=get_attns)
if get_attns:
return gpt_out.attentions
offset = speech_conditioning_inputs.shape[1]
enc = gpt_out.last_hidden_state[:, offset:]
enc = self.final_norm(enc)
if return_latent:
return enc[:, :first_inputs.shape[1]], enc[:, -second_inputs.shape[1]:]
first_logits = enc[:, :first_inputs.shape[1]]
first_logits = first_head(first_logits)
first_logits = first_logits.permute(0, 2, 1)
if second_inputs is not None:
second_logits = enc[:, -second_inputs.shape[1]:]
second_logits = second_head(second_logits)
second_logits = second_logits.permute(0, 2, 1)
return first_logits, second_logits
else:
return first_logits
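# Shape note (added for clarity): `emb` is [cond | first | second] along the time
# axis; the conditioning prefix is cut off before the heads are applied, and each
# head's output is permuted to (b, vocab, seq), the layout expected by
# F.cross_entropy-style losses.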
def get_conditioning(self, speech_conditioning_input, cond_mel_lengths=None):
if self.condition_type == "perceiver":
if speech_conditioning_input.ndim == 4:
speech_conditioning_input = speech_conditioning_input.squeeze(1)
speech_conditioning_input = self.conditioning_encoder(speech_conditioning_input) # (b, d, s)
conds = self.perceiver_encoder(speech_conditioning_input.transpose(1, 2)) # (b, 32, d)
elif self.condition_type == "conformer_perceiver":
speech_conditioning_input, mask = self.conditioning_encoder(speech_conditioning_input.transpose(1, 2),
cond_mel_lengths) # (b, s, d), (b, 1, s)
if self.condition_type == "conformer_perceiver":
# conds_mask = torch.cat([torch.ones((mask.shape[0], self.cond_num), dtype=torch.bool), mask.squeeze(1)], dim=1)
conds_mask = self.cond_mask_pad(mask.squeeze(1))
conds = self.perceiver_encoder(speech_conditioning_input, conds_mask) # (b, 32, d)
elif self.condition_type == "gst":
if speech_conditioning_input.ndim == 4:
speech_conditioning_input = speech_conditioning_input.squeeze(1)
conds = self.gst_encoder(speech_conditioning_input.transpose(1, 2)) # (b, 1, d)
else:
speech_conditioning_input = (
speech_conditioning_input.unsqueeze(1)
if len(speech_conditioning_input.shape) == 3
else speech_conditioning_input
)
conds = []
for j in range(speech_conditioning_input.shape[1]):
conds.append(self.conditioning_encoder(speech_conditioning_input[:, j]))
conds = torch.stack(conds, dim=1)
conds = conds.mean(dim=1)
conds = conds.unsqueeze(1)
return conds
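# Shape note (added for clarity): with the default condition_num_latent=32, the
# "perceiver" and "conformer_perceiver" paths return (b, 32, model_dim), one latent
# per Perceiver query, while the fallback encoder path averages each clip down to a
# single (b, 1, model_dim) vector.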
def get_emo_conditioning(self, speech_conditioning_input, cond_mel_lengths=None):
speech_conditioning_input, mask = self.emo_conditioning_encoder(speech_conditioning_input.transpose(1, 2),
cond_mel_lengths) # (b, s, d), (b, 1, s)
conds_mask = self.emo_cond_mask_pad(mask.squeeze(1))
conds = self.emo_perceiver_encoder(speech_conditioning_input, conds_mask) # (b, 1, d)
return conds.squeeze(1)
def forward(self, speech_conditioning_latent, text_inputs, text_lengths, mel_codes, mel_codes_lengths, emo_speech_conditioning_latent,
cond_mel_lengths=None, emo_cond_mel_lengths=None, emo_vec=None, use_speed=None, do_spk_cond=False):
"""
Forward pass that uses both text and voice in either text conditioning mode or voice conditioning mode
speech_conditioning_input: MEL float tensor, (b,1024)
text_inputs: long tensor, (b,t)
text_lengths: long tensor, (b,)
mel_inputs: long tensor, (b,m)
wav_lengths: long tensor, (b,)
If return_attentions is specified, only logits are returned.
If return_latent is specified, loss & logits are not computed or returned. Only the predicted latents are returned.
"""
if do_spk_cond:
speech_conditioning_latent = self.get_conditioning(speech_conditioning_latent.transpose(1,2), cond_mel_lengths)
else:
speech_conditioning_latent = speech_conditioning_latent
if emo_vec is None:
emo_vec_syn_ori = self.get_emo_conditioning(emo_speech_conditioning_latent.transpose(1,2), emo_cond_mel_lengths)
emo_vec_syn = self.emovec_layer(emo_vec_syn_ori)
emo_vec = self.emo_layer(emo_vec_syn)
text_inputs = self.set_text_padding(text_inputs, text_lengths)
text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token)
mel_codes = self.set_mel_padding(mel_codes, mel_codes_lengths)
mel_codes = F.pad(mel_codes, (0, 1), value=self.stop_mel_token)
duration_emb = self.speed_emb(torch.zeros_like(use_speed))
duration_emb_half = self.speed_emb(torch.ones_like(use_speed))
conds = torch.cat((speech_conditioning_latent + emo_vec.unsqueeze(1), duration_emb_half.unsqueeze(1), duration_emb.unsqueeze(1)), 1)
text_inputs, text_targets = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token)
text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs)
mel_codes, mel_targets = self.build_aligned_inputs_and_targets(mel_codes, self.start_mel_token, self.stop_mel_token)
mel_emb = self.mel_embedding(mel_codes)
mel_emb = mel_emb + self.mel_pos_embedding(mel_codes)
text_logits, mel_logits = self.get_logits(conds, text_emb, self.text_head, mel_emb, self.mel_head, get_attns=False, return_latent=True)
return mel_logits[:, :-2] # Despite the name, these are not logits. Strip off the two tokens added by this forward pass.
def prepare_gpt_inputs(
self,
conditional_latents: torch.Tensor,
text_inputs: torch.Tensor,
):
"""
Prepare the inputs for the GPT2InferenceModel to generate.
Args:
conditional_latents: (b, 32, dim) audio conditioning embedding from `get_conditioning()`
text_inputs: (b, L)
Returns:
input_ids: (b, s+1) the input ids for the GPT2InferenceModel.generate()
inputs_embeds: (b, s+1, dim) the input embeddings for the GPT2InferenceModel.forward()
attention_mask: (b, s+1) the attention mask for the GPT2InferenceModel.generate()
"""
b, L = text_inputs.shape[:2]
device = text_inputs.device
single_cond = conditional_latents.ndim == 3 and conditional_latents.shape[0] == 1
if not single_cond:
assert conditional_latents.shape[0] == b, f"batch size mismatch: {conditional_latents.shape[0]} vs {b}"
batched_mel_emb = []
attention_masks = []
target_len = conditional_latents.shape[1] + L + 2
for i in range(b):
valid_mask = (text_inputs[i] != self.stop_text_token) & (text_inputs[i] != self.start_text_token)
text_input = text_inputs[i][valid_mask]
text_input = F.pad(text_input, (1, 0), value=self.start_text_token)
text_input = F.pad(text_input, (0, 1), value=self.stop_text_token)
text_input_pos = torch.arange(0, text_input.size(-1), device=device)
text_emb = self.text_embedding(text_input) + self.text_pos_embedding.emb(text_input_pos)
# concatenate [conditional latents][text embeddings]
conds_text_emb = [
conditional_latents.squeeze(0) if single_cond else conditional_latents[i],
text_emb,
]
# +1 for the start_mel_token
attention_mask = torch.ones(target_len+1, dtype=torch.long, device=device)
# check whether this text input needs left padding
padding: int = L + 2 - text_input.size(-1)
# pad left of [cond][text] -> [pad][cond][text]
if padding > 0:
pad = torch.zeros((padding, conditional_latents.size(-1)), dtype=text_emb.dtype, device=device) # [p, dim]
conds_text_emb.insert(0, pad)
attention_mask[:padding] = 0
mel_emb = torch.cat(conds_text_emb) #[s, dim]
assert mel_emb.shape[0] == target_len, f"mel_emb.shape: {mel_emb.shape}, target_len: {target_len}"
batched_mel_emb.append(mel_emb)
attention_masks.append(attention_mask)
# [b, s, dim]
batched_mel_emb = torch.stack(batched_mel_emb, dim=0)
# [b, s+1]
attention_mask = torch.stack(attention_masks, dim=0)
# [b, s+1]
fake_inputs = torch.ones(
(
batched_mel_emb.shape[0],
batched_mel_emb.shape[1] + 1, # +1 for the start_mel_token
),
dtype=torch.long,
device=device,
)
fake_inputs[:, -1] = self.start_mel_token
return fake_inputs, batched_mel_emb, attention_mask
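# Layout note (added for clarity): every batch row is left-padded so that
# [pad][cond][text] ends at the same position; `fake_inputs` is a dummy id tensor of
# the same width plus one trailing start_mel_token, which is what generate() actually
# consumes, while the real content is injected via store_mel_emb(inputs_embeds).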
def inference_speech(self, speech_condition, text_inputs, emo_speech_condition=None, cond_lengths=None, emo_cond_lengths=None, emo_vec=None, use_speed=False, input_tokens=None, num_return_sequences=1,
max_generate_length=None, typical_sampling=False, typical_mass=.9, **hf_generate_kwargs):
"""
Args:
speech_condition: (b, d, frames) or (d, frames)
text_inputs: (b, L)
cond_lengths: lengths of the conditioning mel spectrograms in shape (b,) or (1,)
input_tokens: additional tokens for generation in shape (b, s) or (s,)
max_generate_length: limit the number of generated tokens
hf_generate_kwargs: kwargs for `GPT2InferenceModel.generate(**hf_generate_kwargs)`
"""
if speech_condition.ndim == 2:
speech_condition = speech_condition.unsqueeze(0)
if emo_speech_condition is None:
emo_speech_condition = speech_condition
if cond_lengths is None:
cond_lengths = torch.tensor([speech_condition.shape[-1]], device=speech_condition.device)
if emo_cond_lengths is None:
emo_cond_lengths = torch.tensor([emo_speech_condition.shape[-1]], device=speech_condition.device)
speech_conditioning_latent = self.get_conditioning(speech_condition.transpose(1,2), cond_lengths)
if emo_vec is None:
print('compute emo vec')
emo_vec = self.get_emo_conditioning(emo_speech_condition.transpose(1,2), emo_cond_lengths)
emo_vec = self.emovec_layer(emo_vec)
emo_vec = self.emo_layer(emo_vec)
else:
print('Use the specified emotion vector')
tmp = torch.zeros(text_inputs.size(0)).to(text_inputs.device)
duration_emb = self.speed_emb(torch.zeros_like(tmp).long())
duration_emb_half = self.speed_emb(torch.ones_like(tmp).long())
conds_latent = torch.cat((speech_conditioning_latent + emo_vec.unsqueeze(1), duration_emb_half.unsqueeze(1), duration_emb.unsqueeze(1)), 1)
input_ids, inputs_embeds, attention_mask = self.prepare_gpt_inputs(conds_latent, text_inputs)
self.inference_model.store_mel_emb(inputs_embeds)
if input_tokens is None:
inputs = input_ids
else:
if input_tokens.ndim == 1:
input_tokens = input_tokens.unsqueeze(0)
assert num_return_sequences % input_tokens.shape[0] == 0, \
"The num_return_sequences must be divisible by the batch number of input_tokens"
assert num_return_sequences % text_inputs.shape[0] == 0, \
"The num_return_sequences must be divisible by the batch number of text_inputs"
b = num_return_sequences // input_ids.shape[0]
if b > 1:
input_ids = input_ids.repeat(b, 1)
attention_mask = attention_mask.repeat(b, 1)
input_tokens = input_tokens.repeat(num_return_sequences // input_tokens.shape[0], 1)
inputs = torch.cat([input_ids, input_tokens], dim=1)
attention_mask = F.pad(attention_mask, (0, input_tokens.shape[1]), value=1)
trunc_index = inputs.shape[1]
logits_processor = LogitsProcessorList()
if typical_sampling:
# employ custom typical sampling
if not (typical_mass > 0.0 and typical_mass < 1.0):
raise ValueError(f"`typical_mass` has to be a float > 0 and < 1, but is {typical_mass}")
min_tokens_to_keep = 2 if hf_generate_kwargs.get("num_beams", 1) > 1 else 1
logits_processor.append(TypicalLogitsWarper(mass=typical_mass, min_tokens_to_keep=min_tokens_to_keep))
max_length = (trunc_index + self.max_mel_tokens - 1) if max_generate_length is None else trunc_index + max_generate_length
# Use accel engine if available (single sequence only)
if self.accel_engine is not None and num_return_sequences == 1:
output = self.accel_engine.generate(
inputs, # fake input_ids (all 1s + start_mel_token)
max_new_tokens=max_length - trunc_index,
attention_mask=attention_mask,
temperature=hf_generate_kwargs.get('temperature', 1),
stop_tokens=[self.stop_mel_token],
tts_embeddings=inputs_embeds, # [pad][cond][text] embeddings (without the start_mel_token)
tts_mel_embedding=self.inference_model.embeddings, # mel_embedding layer
tts_text_pos_embedding=self.inference_model.text_pos_embedding, # text_pos_embedding layer
)
else:
output = self.inference_model.generate(inputs,
bos_token_id=self.start_mel_token, pad_token_id=self.stop_mel_token,
eos_token_id=self.stop_mel_token, attention_mask=attention_mask,
max_length=max_length, logits_processor=logits_processor,
num_return_sequences=num_return_sequences,
**hf_generate_kwargs)
if isinstance(output, torch.Tensor):
return output[:, trunc_index:], speech_conditioning_latent
# GenerateOutput
output.sequences = output.sequences[:, trunc_index:]
return output, speech_conditioning_latent
def get_emovec(self, emo_speech_conditioning_latent, emo_cond_lengths):
emo_vec_syn_ori = self.get_emo_conditioning(emo_speech_conditioning_latent.transpose(1,2), emo_cond_lengths)
emo_vec_syn = self.emovec_layer(emo_vec_syn_ori)
emo_vec = self.emo_layer(emo_vec_syn)
return emo_vec
def merge_emovec(self, speech_conditioning_latent, emo_speech_conditioning_latent, cond_lengths, emo_cond_lengths, alpha = 1.0):
emo_vec = self.get_emovec(emo_speech_conditioning_latent, emo_cond_lengths)
base_vec = self.get_emovec(speech_conditioning_latent, cond_lengths)
out = base_vec + alpha * (emo_vec - base_vec)
return out
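# Minimal usage sketch (illustrative; mirrors how infer_v2.py drives this class, and
# names such as `cfg`, `cond_mel` and `text_token_ids` are assumptions, not part of
# this file):
#
#   gpt = UnifiedVoice(**cfg.gpt)
#   gpt.post_init_gpt2_config(use_deepspeed=False, kv_cache=True, half=False)
#   codes, cond_latent = gpt.inference_speech(cond_mel, text_token_ids,
#                                             max_generate_length=600)
#
# `cond_mel` is a (b, d, frames) conditioning feature tensor, `text_token_ids` a
# (b, L) LongTensor of BPE ids, and `codes` holds the generated mel tokens with the
# prompt positions already trimmed off.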

File diff suppressed because it is too large

File diff suppressed because it is too large

indextts/gpt/transformers_gpt2.py (executable file)

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -1,8 +1,9 @@
import os
import sys
os.environ['HF_HUB_CACHE'] = './checkpoints/hf_cache'
import time
from subprocess import CalledProcessError
from typing import Dict, List, Tuple
from typing import Dict, List
import torch
import torchaudio
@@ -25,37 +26,42 @@ from indextts.utils.front import TextNormalizer, TextTokenizer
class IndexTTS:
def __init__(
self, cfg_path="checkpoints/config.yaml", model_dir="checkpoints", is_fp16=True, device=None, use_cuda_kernel=None,
self, cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=True, device=None,
use_cuda_kernel=None,
):
"""
Args:
cfg_path (str): path to the config file.
model_dir (str): path to the model directory.
is_fp16 (bool): whether to use fp16.
use_fp16 (bool): whether to use fp16.
device (str): device to use (e.g., 'cuda:0', 'cpu'). If None, it will be set automatically based on the availability of CUDA or MPS.
use_cuda_kernel (None | bool): whether to use BigVGan custom fused activation CUDA kernel, only for CUDA device.
"""
if device is not None:
self.device = device
self.is_fp16 = False if device == "cpu" else is_fp16
self.use_fp16 = False if device == "cpu" else use_fp16
self.use_cuda_kernel = use_cuda_kernel is not None and use_cuda_kernel and device.startswith("cuda")
elif torch.cuda.is_available():
self.device = "cuda:0"
self.is_fp16 = is_fp16
self.use_fp16 = use_fp16
self.use_cuda_kernel = use_cuda_kernel is None or use_cuda_kernel
elif hasattr(torch, "xpu") and torch.xpu.is_available():
self.device = "xpu"
self.use_fp16 = use_fp16
self.use_cuda_kernel = False
elif hasattr(torch, "mps") and torch.backends.mps.is_available():
self.device = "mps"
self.is_fp16 = False # float16 on MPS has more overhead than float32
self.use_fp16 = False # float16 on MPS has more overhead than float32
self.use_cuda_kernel = False
else:
self.device = "cpu"
self.is_fp16 = False
self.use_fp16 = False
self.use_cuda_kernel = False
print(">> Be patient, it may take a while to run in CPU mode.")
self.cfg = OmegaConf.load(cfg_path)
self.model_dir = model_dir
self.dtype = torch.float16 if self.is_fp16 else None
self.dtype = torch.float16 if self.use_fp16 else None
self.stop_mel_token = self.cfg.gpt.stop_mel_token
# Comment-off to load the VQ-VAE model for debugging tokenizer
@@ -66,7 +72,7 @@ class IndexTTS:
# self.dvae_path = os.path.join(self.model_dir, self.cfg.dvae_checkpoint)
# load_checkpoint(self.dvae, self.dvae_path)
# self.dvae = self.dvae.to(self.device)
# if self.is_fp16:
# if self.use_fp16:
# self.dvae.eval().half()
# else:
# self.dvae.eval()
@@ -75,12 +81,12 @@ class IndexTTS:
self.gpt_path = os.path.join(self.model_dir, self.cfg.gpt_checkpoint)
load_checkpoint(self.gpt, self.gpt_path)
self.gpt = self.gpt.to(self.device)
if self.is_fp16:
if self.use_fp16:
self.gpt.eval().half()
else:
self.gpt.eval()
print(">> GPT weights restored from:", self.gpt_path)
if self.is_fp16:
if self.use_fp16:
try:
import deepspeed
@@ -88,24 +94,20 @@ class IndexTTS:
except (ImportError, OSError, CalledProcessError) as e:
use_deepspeed = False
print(f">> DeepSpeed加载失败回退到标准推理: {e}")
print("See more details https://www.deepspeed.ai/tutorials/advanced-install/")
self.gpt.post_init_gpt2_config(use_deepspeed=use_deepspeed, kv_cache=True, half=True)
else:
self.gpt.post_init_gpt2_config(use_deepspeed=False, kv_cache=True, half=False)
self.gpt.post_init_gpt2_config(use_deepspeed=False, kv_cache=False, half=False)
if self.use_cuda_kernel:
# preload the CUDA kernel for BigVGAN
try:
from indextts.BigVGAN.alias_free_activation.cuda import load as anti_alias_activation_loader
anti_alias_activation_cuda = anti_alias_activation_loader.load()
from indextts.BigVGAN.alias_free_activation.cuda import load
anti_alias_activation_cuda = load.load()
print(">> Preload custom CUDA kernel for BigVGAN", anti_alias_activation_cuda)
except Exception as e:
print(">> Failed to load custom CUDA kernel for BigVGAN. Falling back to torch.", e, file=sys.stderr)
print(" Reinstall with `pip install -e . --no-deps --no-build-isolation` to prebuild `anti_alias_activation_cuda` kernel.", file=sys.stderr)
print(
"See more details: https://github.com/index-tts/index-tts/issues/164#issuecomment-2903453206", file=sys.stderr
)
except:
print(">> Failed to load custom CUDA kernel for BigVGAN. Falling back to torch.")
self.use_cuda_kernel = False
self.bigvgan = Generator(self.cfg.bigvgan, use_cuda_kernel=self.use_cuda_kernel)
self.bigvgan_path = os.path.join(self.model_dir, self.cfg.bigvgan_checkpoint)
@@ -153,7 +155,8 @@ class IndexTTS:
ncode_idx = []
n = 0
for k in range(len_):
assert code[k] != self.stop_mel_token, f"stop_mel_token {self.stop_mel_token} should be shrinked here"
assert code[
k] != self.stop_mel_token, f"stop_mel_token {self.stop_mel_token} should be shrinked here"
if code[k] != silent_token:
ncode_idx.append(k)
n = 0
@@ -185,17 +188,17 @@ class IndexTTS:
code_lens = torch.tensor(code_lens, dtype=torch.long, device=device)
return codes, code_lens
def bucket_sentences(self, sentences, bucket_max_size=4) -> List[List[Dict]]:
def bucket_segments(self, segments, bucket_max_size=4) -> List[List[Dict]]:
"""
Sentence data bucketing.
if ``bucket_max_size=1``, return all sentences in one bucket.
Segment data bucketing.
if ``bucket_max_size=1``, return all segments in one bucket.
"""
outputs: List[Dict] = []
for idx, sent in enumerate(sentences):
for idx, sent in enumerate(segments):
outputs.append({"idx": idx, "sent": sent, "len": len(sent)})
if len(outputs) > bucket_max_size:
# split sentences into buckets by sentence length
# split segments into buckets by segment length
buckets: List[List[Dict]] = []
factor = 1.5
last_bucket = None
@@ -204,7 +207,7 @@ class IndexTTS:
for sent in sorted(outputs, key=lambda x: x["len"]):
current_sent_len = sent["len"]
if current_sent_len == 0:
print(">> skip empty sentence")
print(">> skip empty segment")
continue
if last_bucket is None \
or current_sent_len >= int(last_bucket_sent_len_median * factor) \
@@ -214,11 +217,11 @@ class IndexTTS:
last_bucket = buckets[-1]
last_bucket_sent_len_median = current_sent_len
else:
# current bucket can hold more sentences
last_bucket.append(sent) # sorted
# current bucket can hold more segments
last_bucket.append(sent) # sorted
mid = len(last_bucket) // 2
last_bucket_sent_len_median = last_bucket[mid]["len"]
last_bucket=None
last_bucket = None
# merge all buckets with size 1
out_buckets: List[List[Dict]] = []
only_ones: List[Dict] = []
@@ -238,7 +241,8 @@ class IndexTTS:
break
# combined all remaining sized 1 buckets
if len(only_ones) > 0:
out_buckets.extend([only_ones[i:i+bucket_max_size] for i in range(0, len(only_ones), bucket_max_size)])
out_buckets.extend(
[only_ones[i:i + bucket_max_size] for i in range(0, len(only_ones), bucket_max_size)])
return out_buckets
return [outputs]
@@ -247,7 +251,8 @@ class IndexTTS:
# For version 1.5 and above, pad directly on the right with stop_text_token, up to the maximum length
# [1, N] -> [N,]
tokens = [t.squeeze(0) for t in tokens]
return pad_sequence(tokens, batch_first=True, padding_value=self.cfg.gpt.stop_text_token, padding_side="right")
return pad_sequence(tokens, batch_first=True, padding_value=self.cfg.gpt.stop_text_token,
padding_side="right")
max_len = max(t.size(1) for t in tokens)
outputs = []
for tensor in tokens:
@@ -275,19 +280,20 @@ class IndexTTS:
self.gr_progress(value, desc=desc)
# Fast inference: for "long multi-segment text" this achieves at least a 2-10x speedup. First modified by sunnyboxs 2025-04-16
def infer_fast(self, audio_prompt, text, output_path, verbose=False, max_text_tokens_per_sentence=100, sentences_bucket_max_size=4, **generation_kwargs):
def infer_fast(self, audio_prompt, text, output_path, verbose=False, max_text_tokens_per_segment=100,
segments_bucket_max_size=4, **generation_kwargs):
"""
Args:
``max_text_tokens_per_sentence``: maximum number of text tokens per segment, default ``100``; adjust according to your GPU hardware
``max_text_tokens_per_segment``: maximum number of text tokens per segment, default ``100``; adjust according to your GPU hardware
- smaller: more batches, *faster* inference, higher memory use, quality may suffer
- larger: fewer batches, *slower* inference; memory use and quality are closer to non-fast inference
``sentences_bucket_max_size``: maximum bucket capacity for segment bucketing, default ``4``; adjust according to GPU memory
``segments_bucket_max_size``: maximum bucket capacity for segment bucketing, default ``4``; adjust according to GPU memory
- larger: fewer buckets, more batches, *faster* inference, higher memory use, quality may suffer
- smaller: more buckets, fewer batches, *slower* inference; memory use and quality are closer to non-fast inference
"""
print(">> start fast inference...")
self._set_gr_progress(0, "start fast inference...")
print(">> starting fast inference...")
self._set_gr_progress(0, "starting fast inference...")
if verbose:
print(f"origin text:{text}")
start_time = time.perf_counter()
@@ -299,6 +305,15 @@ class IndexTTS:
if audio.shape[0] > 1:
audio = audio[0].unsqueeze(0)
audio = torchaudio.transforms.Resample(sr, 24000)(audio)
max_audio_length_seconds = 50
max_audio_samples = int(max_audio_length_seconds * 24000)
if audio.shape[1] > max_audio_samples:
if verbose:
print(f"Audio too long ({audio.shape[1]} samples), truncating to {max_audio_samples} samples")
audio = audio[:, :max_audio_samples]
cond_mel = MelSpectrogramFeatures()(audio).to(self.device)
cond_mel_frame = cond_mel.shape[-1]
if verbose:
@@ -317,12 +332,13 @@ class IndexTTS:
# text_tokens
text_tokens_list = self.tokenizer.tokenize(text)
sentences = self.tokenizer.split_sentences(text_tokens_list, max_tokens_per_sentence=max_text_tokens_per_sentence)
segments = self.tokenizer.split_segments(text_tokens_list,
max_text_tokens_per_segment=max_text_tokens_per_segment)
if verbose:
print(">> text token count:", len(text_tokens_list))
print(" splited sentences count:", len(sentences))
print(" max_text_tokens_per_sentence:", max_text_tokens_per_sentence)
print(*sentences, sep="\n")
print(" segments count:", len(segments))
print(" max_text_tokens_per_segment:", max_text_tokens_per_segment)
print(*segments, sep="\n")
do_sample = generation_kwargs.pop("do_sample", True)
top_p = generation_kwargs.pop("top_p", 0.8)
top_k = generation_kwargs.pop("top_k", 30)
@@ -343,17 +359,17 @@ class IndexTTS:
# text processing
all_text_tokens: List[List[torch.Tensor]] = []
self._set_gr_progress(0.1, "text processing...")
bucket_max_size = sentences_bucket_max_size if self.device != "cpu" else 1
all_sentences = self.bucket_sentences(sentences, bucket_max_size=bucket_max_size)
bucket_count = len(all_sentences)
bucket_max_size = segments_bucket_max_size if self.device != "cpu" else 1
all_segments = self.bucket_segments(segments, bucket_max_size=bucket_max_size)
bucket_count = len(all_segments)
if verbose:
print(">> sentences bucket_count:", bucket_count,
"bucket sizes:", [(len(s), [t["idx"] for t in s]) for s in all_sentences],
print(">> segments bucket_count:", bucket_count,
"bucket sizes:", [(len(s), [t["idx"] for t in s]) for s in all_segments],
"bucket_max_size:", bucket_max_size)
for sentences in all_sentences:
for segments in all_segments:
temp_tokens: List[torch.Tensor] = []
all_text_tokens.append(temp_tokens)
for item in sentences:
for item in segments:
sent = item["sent"]
text_tokens = self.tokenizer.convert_tokens_to_ids(sent)
text_tokens = torch.tensor(text_tokens, dtype=torch.int32, device=self.device).unsqueeze(0)
@@ -362,12 +378,11 @@ class IndexTTS:
print(f"text_tokens shape: {text_tokens.shape}, text_tokens type: {text_tokens.dtype}")
# debug tokenizer
text_token_syms = self.tokenizer.convert_ids_to_tokens(text_tokens[0].tolist())
print("text_token_syms is same as sentence tokens", text_token_syms == sent)
print("text_token_syms is same as segment tokens", text_token_syms == sent)
temp_tokens.append(text_tokens)
# Sequential processing of bucketing data
all_batch_num = sum(len(s) for s in all_sentences)
all_batch_num = sum(len(s) for s in all_segments)
all_batch_codes = []
processed_num = 0
for item_tokens in all_text_tokens:
@@ -378,38 +393,40 @@ class IndexTTS:
batch_text_tokens = item_tokens[0]
processed_num += batch_num
# gpt speech
self._set_gr_progress(0.2 + 0.3 * processed_num/all_batch_num, f"gpt inference speech... {processed_num}/{all_batch_num}")
self._set_gr_progress(0.2 + 0.3 * processed_num / all_batch_num,
f"gpt speech inference {processed_num}/{all_batch_num}...")
m_start_time = time.perf_counter()
with torch.no_grad():
with torch.amp.autocast(batch_text_tokens.device.type, enabled=self.dtype is not None, dtype=self.dtype):
with torch.amp.autocast(batch_text_tokens.device.type, enabled=self.dtype is not None,
dtype=self.dtype):
temp_codes = self.gpt.inference_speech(auto_conditioning, batch_text_tokens,
cond_mel_lengths=cond_mel_lengths,
# text_lengths=text_len,
do_sample=do_sample,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=autoregressive_batch_size,
length_penalty=length_penalty,
num_beams=num_beams,
repetition_penalty=repetition_penalty,
max_generate_length=max_mel_tokens,
**generation_kwargs)
cond_mel_lengths=cond_mel_lengths,
# text_lengths=text_len,
do_sample=do_sample,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=autoregressive_batch_size,
length_penalty=length_penalty,
num_beams=num_beams,
repetition_penalty=repetition_penalty,
max_generate_length=max_mel_tokens,
**generation_kwargs)
all_batch_codes.append(temp_codes)
gpt_gen_time += time.perf_counter() - m_start_time
# gpt latent
self._set_gr_progress(0.5, "gpt inference latents...")
self._set_gr_progress(0.5, "gpt latents inference...")
all_idxs = []
all_latents = []
has_warned = False
for batch_codes, batch_tokens, batch_sentences in zip(all_batch_codes, all_text_tokens, all_sentences):
for batch_codes, batch_tokens, batch_segments in zip(all_batch_codes, all_text_tokens, all_segments):
for i in range(batch_codes.shape[0]):
codes = batch_codes[i] # [x]
if not has_warned and codes[-1] != self.stop_mel_token:
warnings.warn(
f"WARN: generation stopped due to exceeding `max_mel_tokens` ({max_mel_tokens}). "
f"Consider reducing `max_text_tokens_per_sentence`({max_text_tokens_per_sentence}) or increasing `max_mel_tokens`.",
f"Consider reducing `max_text_tokens_per_segment`({max_text_tokens_per_segment}) or increasing `max_mel_tokens`.",
category=RuntimeWarning
)
has_warned = True
@@ -423,31 +440,32 @@ class IndexTTS:
print(codes)
print("code_lens:", code_lens)
text_tokens = batch_tokens[i]
all_idxs.append(batch_sentences[i]["idx"])
all_idxs.append(batch_segments[i]["idx"])
m_start_time = time.perf_counter()
with torch.no_grad():
with torch.amp.autocast(text_tokens.device.type, enabled=self.dtype is not None, dtype=self.dtype):
latent = \
self.gpt(auto_conditioning, text_tokens,
torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
code_lens*self.gpt.mel_length_compression,
cond_mel_lengths=torch.tensor([auto_conditioning.shape[-1]], device=text_tokens.device),
return_latent=True, clip_inputs=False)
torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
code_lens * self.gpt.mel_length_compression,
cond_mel_lengths=torch.tensor([auto_conditioning.shape[-1]],
device=text_tokens.device),
return_latent=True, clip_inputs=False)
gpt_forward_time += time.perf_counter() - m_start_time
all_latents.append(latent)
del all_batch_codes, all_text_tokens, all_sentences
del all_batch_codes, all_text_tokens, all_segments
# bigvgan chunk
chunk_size = 2
all_latents = [all_latents[all_idxs.index(i)] for i in range(len(all_latents))]
if verbose:
print(">> all_latents:", len(all_latents))
print(" latents length:", [l.shape[1] for l in all_latents])
chunk_latents = [all_latents[i : i + chunk_size] for i in range(0, len(all_latents), chunk_size)]
chunk_latents = [all_latents[i: i + chunk_size] for i in range(0, len(all_latents), chunk_size)]
chunk_length = len(chunk_latents)
latent_length = len(all_latents)
# bigvgan chunk decode
self._set_gr_progress(0.7, "bigvgan decode...")
self._set_gr_progress(0.7, "bigvgan decoding...")
tqdm_progress = tqdm(total=latent_length, desc="bigvgan")
for items in chunk_latents:
tqdm_progress.update(len(items))
@@ -460,7 +478,7 @@ class IndexTTS:
wav = wav.squeeze(1)
pass
wav = torch.clamp(32767 * wav, -32767.0, 32767.0)
wavs.append(wav.cpu()) # to cpu before saving
wavs.append(wav.cpu()) # to cpu before saving
# clear cache
tqdm_progress.close() # make sure the progress bar is closed
@@ -469,7 +487,7 @@ class IndexTTS:
self.torch_empty_cache()
# wav audio output
self._set_gr_progress(0.9, "save audio...")
self._set_gr_progress(0.9, "saving audio...")
wav = torch.cat(wavs, dim=1)
wav_length = wav.shape[-1] / sampling_rate
print(f">> Reference audio length: {cond_mel_frame * 256 / sampling_rate:.2f} seconds")
@@ -479,7 +497,8 @@ class IndexTTS:
print(f">> Total fast inference time: {end_time - start_time:.2f} seconds")
print(f">> Generated audio length: {wav_length:.2f} seconds")
print(f">> [fast] bigvgan chunk_length: {chunk_length}")
print(f">> [fast] batch_num: {all_batch_num} bucket_max_size: {bucket_max_size}", f"bucket_count: {bucket_count}" if bucket_max_size > 1 else "")
print(f">> [fast] batch_num: {all_batch_num} bucket_max_size: {bucket_max_size}",
f"bucket_count: {bucket_count}" if bucket_max_size > 1 else "")
print(f">> [fast] RTF: {(end_time - start_time) / wav_length:.4f}")
# save audio
@@ -497,9 +516,10 @@ class IndexTTS:
return (sampling_rate, wav_data)
# Original (non-fast) inference mode
def infer(self, audio_prompt, text, output_path, verbose=False, max_text_tokens_per_sentence=120, **generation_kwargs):
print(">> start inference...")
self._set_gr_progress(0, "start inference...")
def infer(self, audio_prompt, text, output_path, verbose=False, max_text_tokens_per_segment=120,
**generation_kwargs):
print(">> starting inference...")
self._set_gr_progress(0, "starting inference...")
if verbose:
print(f"origin text:{text}")
start_time = time.perf_counter()
@@ -526,12 +546,12 @@ class IndexTTS:
self._set_gr_progress(0.1, "text processing...")
auto_conditioning = cond_mel
text_tokens_list = self.tokenizer.tokenize(text)
sentences = self.tokenizer.split_sentences(text_tokens_list, max_text_tokens_per_sentence)
segments = self.tokenizer.split_segments(text_tokens_list, max_text_tokens_per_segment)
if verbose:
print("text token count:", len(text_tokens_list))
print("sentences count:", len(sentences))
print("max_text_tokens_per_sentence:", max_text_tokens_per_sentence)
print(*sentences, sep="\n")
print("segments count:", len(segments))
print("max_text_tokens_per_segment:", max_text_tokens_per_segment)
print(*segments, sep="\n")
do_sample = generation_kwargs.pop("do_sample", True)
top_p = generation_kwargs.pop("top_p", 0.8)
top_k = generation_kwargs.pop("top_k", 30)
@@ -550,7 +570,7 @@ class IndexTTS:
bigvgan_time = 0
progress = 0
has_warned = False
for sent in sentences:
for sent in segments:
text_tokens = self.tokenizer.convert_tokens_to_ids(sent)
text_tokens = torch.tensor(text_tokens, dtype=torch.int32, device=self.device).unsqueeze(0)
# text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary.
@@ -561,35 +581,36 @@ class IndexTTS:
print(f"text_tokens shape: {text_tokens.shape}, text_tokens type: {text_tokens.dtype}")
# debug tokenizer
text_token_syms = self.tokenizer.convert_ids_to_tokens(text_tokens[0].tolist())
print("text_token_syms is same as sentence tokens", text_token_syms == sent)
print("text_token_syms is same as segment tokens", text_token_syms == sent)
# text_len = torch.IntTensor([text_tokens.size(1)], device=text_tokens.device)
# print(text_len)
progress += 1
self._set_gr_progress(0.2 + 0.4 * (progress-1) / len(sentences), f"gpt inference latent... {progress}/{len(sentences)}")
self._set_gr_progress(0.2 + 0.4 * (progress - 1) / len(segments),
f"gpt latents inference {progress}/{len(segments)}...")
m_start_time = time.perf_counter()
with torch.no_grad():
with torch.amp.autocast(text_tokens.device.type, enabled=self.dtype is not None, dtype=self.dtype):
codes = self.gpt.inference_speech(auto_conditioning, text_tokens,
cond_mel_lengths=torch.tensor([auto_conditioning.shape[-1]],
device=text_tokens.device),
# text_lengths=text_len,
do_sample=do_sample,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=autoregressive_batch_size,
length_penalty=length_penalty,
num_beams=num_beams,
repetition_penalty=repetition_penalty,
max_generate_length=max_mel_tokens,
**generation_kwargs)
cond_mel_lengths=torch.tensor([auto_conditioning.shape[-1]],
device=text_tokens.device),
# text_lengths=text_len,
do_sample=do_sample,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=autoregressive_batch_size,
length_penalty=length_penalty,
num_beams=num_beams,
repetition_penalty=repetition_penalty,
max_generate_length=max_mel_tokens,
**generation_kwargs)
gpt_gen_time += time.perf_counter() - m_start_time
if not has_warned and (codes[:, -1] != self.stop_mel_token).any():
warnings.warn(
f"WARN: generation stopped due to exceeding `max_mel_tokens` ({max_mel_tokens}). "
f"Input text tokens: {text_tokens.shape[1]}. "
f"Consider reducing `max_text_tokens_per_sentence`({max_text_tokens_per_sentence}) or increasing `max_mel_tokens`.",
f"Consider reducing `max_text_tokens_per_segment`({max_text_tokens_per_segment}) or increasing `max_mel_tokens`.",
category=RuntimeWarning
)
has_warned = True
@@ -607,16 +628,18 @@ class IndexTTS:
print(codes, type(codes))
print(f"fix codes shape: {codes.shape}, codes type: {codes.dtype}")
print(f"code len: {code_lens}")
self._set_gr_progress(0.2 + 0.4 * progress / len(sentences), f"gpt inference speech... {progress}/{len(sentences)}")
self._set_gr_progress(0.2 + 0.4 * progress / len(segments),
f"gpt speech inference {progress}/{len(segments)}...")
m_start_time = time.perf_counter()
# latent, text_lens_out, code_lens_out = \
with torch.amp.autocast(text_tokens.device.type, enabled=self.dtype is not None, dtype=self.dtype):
latent = \
self.gpt(auto_conditioning, text_tokens,
torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
code_lens*self.gpt.mel_length_compression,
cond_mel_lengths=torch.tensor([auto_conditioning.shape[-1]], device=text_tokens.device),
return_latent=True, clip_inputs=False)
torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
code_lens * self.gpt.mel_length_compression,
cond_mel_lengths=torch.tensor([auto_conditioning.shape[-1]],
device=text_tokens.device),
return_latent=True, clip_inputs=False)
gpt_forward_time += time.perf_counter() - m_start_time
m_start_time = time.perf_counter()
@@ -630,7 +653,7 @@ class IndexTTS:
# wavs.append(wav[:, :-512])
wavs.append(wav.cpu()) # to cpu before saving
end_time = time.perf_counter()
self._set_gr_progress(0.9, "save audio...")
self._set_gr_progress(0.9, "saving audio...")
wav = torch.cat(wavs, dim=1)
wav_length = wav.shape[-1] / sampling_rate
print(f">> Reference audio length: {cond_mel_frame * 256 / sampling_rate:.2f} seconds")
@@ -659,12 +682,9 @@ class IndexTTS:
wav_data = wav_data.numpy().T
return (sampling_rate, wav_data)
if __name__ == "__main__":
prompt_wav="test_data/input.wav"
#text="晕 XUAN4 是 一 种 GAN3 觉"
#text='大家好我现在正在bilibili 体验 ai 科技说实话来之前我绝对想不到AI技术已经发展到这样匪夷所思的地步了'
text="There is a vehicle arriving in dock number 7?"
prompt_wav = "examples/voice_01.wav"
text = '欢迎大家来体验indextts2并给予我们意见与反馈谢谢大家。'
tts = IndexTTS(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", is_fp16=True, use_cuda_kernel=False)
tts = IndexTTS(cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_cuda_kernel=False)
tts.infer(audio_prompt=prompt_wav, text=text, output_path="gen.wav", verbose=True)

indextts/infer_v2.py (new file)

@@ -0,0 +1,845 @@
import os
from subprocess import CalledProcessError
os.environ['HF_HUB_CACHE'] = './checkpoints/hf_cache'
import json
import re
import time
import librosa
import torch
import torchaudio
from torch.nn.utils.rnn import pad_sequence
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
from omegaconf import OmegaConf
from indextts.gpt.model_v2 import UnifiedVoice
from indextts.utils.maskgct_utils import build_semantic_model, build_semantic_codec
from indextts.utils.checkpoint import load_checkpoint
from indextts.utils.front import TextNormalizer, TextTokenizer
from indextts.s2mel.modules.commons import load_checkpoint2, MyModel
from indextts.s2mel.modules.bigvgan import bigvgan
from indextts.s2mel.modules.campplus.DTDNN import CAMPPlus
from indextts.s2mel.modules.audio import mel_spectrogram
from transformers import AutoTokenizer
from modelscope import AutoModelForCausalLM
from huggingface_hub import hf_hub_download
import safetensors
from transformers import SeamlessM4TFeatureExtractor
import random
import torch.nn.functional as F
class IndexTTS2:
def __init__(
self, cfg_path="checkpoints/config.yaml", model_dir="checkpoints", use_fp16=False, device=None,
use_cuda_kernel=None,use_deepspeed=False, use_accel=False, use_torch_compile=False
):
"""
Args:
cfg_path (str): path to the config file.
model_dir (str): path to the model directory.
use_fp16 (bool): whether to use fp16.
device (str): device to use (e.g., 'cuda:0', 'cpu'). If None, it will be set automatically based on the availability of CUDA or MPS.
use_cuda_kernel (None | bool): whether to use BigVGan custom fused activation CUDA kernel, only for CUDA device.
use_deepspeed (bool): whether to use DeepSpeed or not.
use_accel (bool): whether to use acceleration engine for GPT2 or not.
use_torch_compile (bool): whether to use torch.compile for optimization or not.
"""
if device is not None:
self.device = device
self.use_fp16 = False if device == "cpu" else use_fp16
self.use_cuda_kernel = use_cuda_kernel is not None and use_cuda_kernel and device.startswith("cuda")
elif torch.cuda.is_available():
self.device = "cuda:0"
self.use_fp16 = use_fp16
self.use_cuda_kernel = use_cuda_kernel is None or use_cuda_kernel
elif hasattr(torch, "xpu") and torch.xpu.is_available():
self.device = "xpu"
self.use_fp16 = use_fp16
self.use_cuda_kernel = False
elif hasattr(torch, "mps") and torch.backends.mps.is_available():
self.device = "mps"
self.use_fp16 = False # float16 on MPS has more overhead than float32
self.use_cuda_kernel = False
else:
self.device = "cpu"
self.use_fp16 = False
self.use_cuda_kernel = False
print(">> Be patient, it may take a while to run in CPU mode.")
self.cfg = OmegaConf.load(cfg_path)
self.model_dir = model_dir
self.dtype = torch.float16 if self.use_fp16 else None
self.stop_mel_token = self.cfg.gpt.stop_mel_token
self.use_accel = use_accel
self.use_torch_compile = use_torch_compile
self.qwen_emo = QwenEmotion(os.path.join(self.model_dir, self.cfg.qwen_emo_path))
self.gpt = UnifiedVoice(**self.cfg.gpt, use_accel=self.use_accel)
self.gpt_path = os.path.join(self.model_dir, self.cfg.gpt_checkpoint)
load_checkpoint(self.gpt, self.gpt_path)
self.gpt = self.gpt.to(self.device)
if self.use_fp16:
self.gpt.eval().half()
else:
self.gpt.eval()
print(">> GPT weights restored from:", self.gpt_path)
if use_deepspeed:
try:
import deepspeed
except (ImportError, OSError, CalledProcessError) as e:
use_deepspeed = False
print(f">> Failed to load DeepSpeed. Falling back to normal inference. Error: {e}")
self.gpt.post_init_gpt2_config(use_deepspeed=use_deepspeed, kv_cache=True, half=self.use_fp16)
if self.use_cuda_kernel:
# preload the CUDA kernel for BigVGAN
try:
from indextts.s2mel.modules.bigvgan.alias_free_activation.cuda import activation1d
print(">> Preload custom CUDA kernel for BigVGAN", activation1d.anti_alias_activation_cuda)
except Exception as e:
print(">> Failed to load custom CUDA kernel for BigVGAN. Falling back to torch.")
print(f"{e!r}")
self.use_cuda_kernel = False
self.extract_features = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
self.semantic_model, self.semantic_mean, self.semantic_std = build_semantic_model(
os.path.join(self.model_dir, self.cfg.w2v_stat))
self.semantic_model = self.semantic_model.to(self.device)
self.semantic_model.eval()
self.semantic_mean = self.semantic_mean.to(self.device)
self.semantic_std = self.semantic_std.to(self.device)
semantic_codec = build_semantic_codec(self.cfg.semantic_codec)
semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
self.semantic_codec = semantic_codec.to(self.device)
self.semantic_codec.eval()
print('>> semantic_codec weights restored from: {}'.format(semantic_code_ckpt))
s2mel_path = os.path.join(self.model_dir, self.cfg.s2mel_checkpoint)
s2mel = MyModel(self.cfg.s2mel, use_gpt_latent=True)
s2mel, _, _, _ = load_checkpoint2(
s2mel,
None,
s2mel_path,
load_only_params=True,
ignore_modules=[],
is_distributed=False,
)
self.s2mel = s2mel.to(self.device)
self.s2mel.models['cfm'].estimator.setup_caches(max_batch_size=1, max_seq_length=8192)
# Enable torch.compile optimization if requested
if self.use_torch_compile:
print(">> Enabling torch.compile optimization")
self.s2mel.enable_torch_compile()
print(">> torch.compile optimization enabled successfully")
self.s2mel.eval()
print(">> s2mel weights restored from:", s2mel_path)
# load campplus_model
campplus_ckpt_path = hf_hub_download(
"funasr/campplus", filename="campplus_cn_common.bin"
)
campplus_model = CAMPPlus(feat_dim=80, embedding_size=192)
campplus_model.load_state_dict(torch.load(campplus_ckpt_path, map_location="cpu"))
self.campplus_model = campplus_model.to(self.device)
self.campplus_model.eval()
print(">> campplus_model weights restored from:", campplus_ckpt_path)
bigvgan_name = self.cfg.vocoder.name
self.bigvgan = bigvgan.BigVGAN.from_pretrained(bigvgan_name, use_cuda_kernel=self.use_cuda_kernel)
self.bigvgan = self.bigvgan.to(self.device)
self.bigvgan.remove_weight_norm()
self.bigvgan.eval()
print(">> bigvgan weights restored from:", bigvgan_name)
self.bpe_path = os.path.join(self.model_dir, self.cfg.dataset["bpe_model"])
self.normalizer = TextNormalizer()
self.normalizer.load()
print(">> TextNormalizer loaded")
self.tokenizer = TextTokenizer(self.bpe_path, self.normalizer)
print(">> bpe model loaded from:", self.bpe_path)
emo_matrix = torch.load(os.path.join(self.model_dir, self.cfg.emo_matrix))
self.emo_matrix = emo_matrix.to(self.device)
self.emo_num = list(self.cfg.emo_num)
spk_matrix = torch.load(os.path.join(self.model_dir, self.cfg.spk_matrix))
self.spk_matrix = spk_matrix.to(self.device)
self.emo_matrix = torch.split(self.emo_matrix, self.emo_num)
self.spk_matrix = torch.split(self.spk_matrix, self.emo_num)
mel_fn_args = {
"n_fft": self.cfg.s2mel['preprocess_params']['spect_params']['n_fft'],
"win_size": self.cfg.s2mel['preprocess_params']['spect_params']['win_length'],
"hop_size": self.cfg.s2mel['preprocess_params']['spect_params']['hop_length'],
"num_mels": self.cfg.s2mel['preprocess_params']['spect_params']['n_mels'],
"sampling_rate": self.cfg.s2mel["preprocess_params"]["sr"],
"fmin": self.cfg.s2mel['preprocess_params']['spect_params'].get('fmin', 0),
"fmax": None if self.cfg.s2mel['preprocess_params']['spect_params'].get('fmax', "None") == "None" else 8000,
"center": False
}
self.mel_fn = lambda x: mel_spectrogram(x, **mel_fn_args)
# Cache for the reference audio:
self.cache_spk_cond = None
self.cache_s2mel_style = None
self.cache_s2mel_prompt = None
self.cache_spk_audio_prompt = None
self.cache_emo_cond = None
self.cache_emo_audio_prompt = None
self.cache_mel = None
# Progress callback reference (optional)
self.gr_progress = None
self.model_version = self.cfg.version if hasattr(self.cfg, "version") else None
@torch.no_grad()
def get_emb(self, input_features, attention_mask):
vq_emb = self.semantic_model(
input_features=input_features,
attention_mask=attention_mask,
output_hidden_states=True,
)
feat = vq_emb.hidden_states[17] # (B, T, C)
feat = (feat - self.semantic_mean) / self.semantic_std
return feat
def remove_long_silence(self, codes: torch.Tensor, silent_token=52, max_consecutive=30):
"""
Shrink special tokens (silent_token and stop_mel_token) in codes
codes: [B, T]
"""
code_lens = []
codes_list = []
device = codes.device
dtype = codes.dtype
isfix = False
for i in range(0, codes.shape[0]):
code = codes[i]
if not torch.any(code == self.stop_mel_token).item():
len_ = code.size(0)
else:
stop_mel_idx = (code == self.stop_mel_token).nonzero(as_tuple=False)
len_ = stop_mel_idx[0].item() if len(stop_mel_idx) > 0 else code.size(0)
count = torch.sum(code == silent_token).item()
if count > max_consecutive:
# code = code.cpu().tolist()
ncode_idx = []
n = 0
for k in range(len_):
assert code[
k] != self.stop_mel_token, f"stop_mel_token {self.stop_mel_token} should be shrinked here"
if code[k] != silent_token:
ncode_idx.append(k)
n = 0
elif code[k] == silent_token and n < 10:
ncode_idx.append(k)
n += 1
# if (k == 0 and code[k] == 52) or (code[k] == 52 and code[k-1] == 52):
# n += 1
# new code
len_ = len(ncode_idx)
codes_list.append(code[ncode_idx])
isfix = True
else:
# shrink to len_
codes_list.append(code[:len_])
code_lens.append(len_)
if isfix:
if len(codes_list) > 1:
codes = pad_sequence(codes_list, batch_first=True, padding_value=self.stop_mel_token)
else:
codes = codes_list[0].unsqueeze(0)
else:
# unchanged
pass
# clip codes to max length
max_len = max(code_lens)
if max_len < codes.shape[1]:
codes = codes[:, :max_len]
code_lens = torch.tensor(code_lens, dtype=torch.long, device=device)
return codes, code_lens
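# Behaviour note (added for clarity): a row is only rewritten when it contains more
# than `max_consecutive` silent tokens in total; in that case each run of silent
# tokens is capped at 10 in a row, the batch is re-padded with stop_mel_token, and
# the per-row valid lengths are returned alongside the (possibly shortened) codes.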
def interval_silence(self, wavs, sampling_rate=22050, interval_silence=200):
"""
Silence tensor to be inserted between generated segments.
"""
if not wavs or interval_silence <= 0:
return wavs
# get channel_size
channel_size = wavs[0].size(0)
# get silence tensor
sil_dur = int(sampling_rate * interval_silence / 1000.0)
return torch.zeros(channel_size, sil_dur)
def insert_interval_silence(self, wavs, sampling_rate=22050, interval_silence=200):
"""
Insert silences between generated segments.
wavs: List[torch.tensor]
"""
if not wavs or interval_silence <= 0:
return wavs
# get channel_size
channel_size = wavs[0].size(0)
# get silence tensor
sil_dur = int(sampling_rate * interval_silence / 1000.0)
sil_tensor = torch.zeros(channel_size, sil_dur)
wavs_list = []
for i, wav in enumerate(wavs):
wavs_list.append(wav)
if i < len(wavs) - 1:
wavs_list.append(sil_tensor)
return wavs_list
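# Example (illustrative): at sampling_rate=22050 and interval_silence=200 ms, a zero
# tensor of shape (channels, 4410) is placed between consecutive segments, so three
# segments come back as a list of five tensors.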
def _set_gr_progress(self, value, desc):
if self.gr_progress is not None:
self.gr_progress(value, desc=desc)
def _load_and_cut_audio(self,audio_path,max_audio_length_seconds,verbose=False,sr=None):
if not sr:
audio, sr = librosa.load(audio_path)
else:
audio, _ = librosa.load(audio_path,sr=sr)
audio = torch.tensor(audio).unsqueeze(0)
max_audio_samples = int(max_audio_length_seconds * sr)
if audio.shape[1] > max_audio_samples:
if verbose:
print(f"Audio too long ({audio.shape[1]} samples), truncating to {max_audio_samples} samples")
audio = audio[:, :max_audio_samples]
return audio, sr
def normalize_emo_vec(self, emo_vector, apply_bias=True):
# apply biased emotion factors for better user experience,
# by de-emphasizing emotions that can cause strange results
if apply_bias:
# [happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]
emo_bias = [0.9375, 0.875, 1.0, 1.0, 0.9375, 0.9375, 0.6875, 0.5625]
emo_vector = [vec * bias for vec, bias in zip(emo_vector, emo_bias)]
# the total emotion sum must be 0.8 or less
emo_sum = sum(emo_vector)
if emo_sum > 0.8:
scale_factor = 0.8 / emo_sum
emo_vector = [vec * scale_factor for vec in emo_vector]
return emo_vector
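# Worked example (illustrative only): a pure "happy" request [1.0, 0, 0, 0, 0, 0, 0, 0]
# is first biased to [0.9375, 0, ...]; since 0.9375 > 0.8, it is rescaled by
# 0.8 / 0.9375 ≈ 0.8533, giving a final happy weight of 0.8.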
# Original (non-streaming) inference mode
def infer(self, spk_audio_prompt, text, output_path,
emo_audio_prompt=None, emo_alpha=1.0,
emo_vector=None,
use_emo_text=False, emo_text=None, use_random=False, interval_silence=200,
verbose=False, max_text_tokens_per_segment=120, stream_return=False, more_segment_before=0, **generation_kwargs):
if stream_return:
return self.infer_generator(
spk_audio_prompt, text, output_path,
emo_audio_prompt, emo_alpha,
emo_vector,
use_emo_text, emo_text, use_random, interval_silence,
verbose, max_text_tokens_per_segment, stream_return, more_segment_before, **generation_kwargs
)
else:
try:
return list(self.infer_generator(
spk_audio_prompt, text, output_path,
emo_audio_prompt, emo_alpha,
emo_vector,
use_emo_text, emo_text, use_random, interval_silence,
verbose, max_text_tokens_per_segment, stream_return, more_segment_before, **generation_kwargs
))[0]
except IndexError:
return None
def infer_generator(self, spk_audio_prompt, text, output_path,
emo_audio_prompt=None, emo_alpha=1.0,
emo_vector=None,
use_emo_text=False, emo_text=None, use_random=False, interval_silence=200,
verbose=False, max_text_tokens_per_segment=120, stream_return=False, quick_streaming_tokens=0, **generation_kwargs):
print(">> starting inference...")
self._set_gr_progress(0, "starting inference...")
if verbose:
print(f"origin text:{text}, spk_audio_prompt:{spk_audio_prompt}, "
f"emo_audio_prompt:{emo_audio_prompt}, emo_alpha:{emo_alpha}, "
f"emo_vector:{emo_vector}, use_emo_text:{use_emo_text}, "
f"emo_text:{emo_text}")
start_time = time.perf_counter()
if use_emo_text or emo_vector is not None:
# we're using a text or emotion vector guidance; so we must remove
# "emotion reference voice", to ensure we use correct emotion mixing!
emo_audio_prompt = None
if use_emo_text:
# automatically generate emotion vectors from text prompt
if emo_text is None:
emo_text = text # use main text prompt
emo_dict = self.qwen_emo.inference(emo_text)
print(f"detected emotion vectors from text: {emo_dict}")
# convert ordered dict to list of vectors; the order is VERY important!
emo_vector = list(emo_dict.values())
if emo_vector is not None:
# we have emotion vectors; they can't be blended via alpha mixing
# in the main inference process later, so we must pre-calculate
# their new strengths here based on the alpha instead!
emo_vector_scale = max(0.0, min(1.0, emo_alpha))
if emo_vector_scale != 1.0:
# scale each vector and truncate to 4 decimals (for nicer printing)
emo_vector = [int(x * emo_vector_scale * 10000) / 10000 for x in emo_vector]
print(f"scaled emotion vectors to {emo_vector_scale}x: {emo_vector}")
if emo_audio_prompt is None:
# we are not using any external "emotion reference voice"; use
# speaker's voice as the main emotion reference audio.
emo_audio_prompt = spk_audio_prompt
# must always use alpha=1.0 when we don't have an external reference voice
emo_alpha = 1.0
# Re-extract speaker conditioning only when the reference audio changes, to speed up repeated calls
if self.cache_spk_cond is None or self.cache_spk_audio_prompt != spk_audio_prompt:
if self.cache_spk_cond is not None:
self.cache_spk_cond = None
self.cache_s2mel_style = None
self.cache_s2mel_prompt = None
self.cache_mel = None
torch.cuda.empty_cache()
audio, sr = self._load_and_cut_audio(spk_audio_prompt, 15, verbose)
audio_22k = torchaudio.transforms.Resample(sr, 22050)(audio)
audio_16k = torchaudio.transforms.Resample(sr, 16000)(audio)
inputs = self.extract_features(audio_16k, sampling_rate=16000, return_tensors="pt")
input_features = inputs["input_features"]
attention_mask = inputs["attention_mask"]
input_features = input_features.to(self.device)
attention_mask = attention_mask.to(self.device)
spk_cond_emb = self.get_emb(input_features, attention_mask)
_, S_ref = self.semantic_codec.quantize(spk_cond_emb)
ref_mel = self.mel_fn(audio_22k.to(spk_cond_emb.device).float())
ref_target_lengths = torch.LongTensor([ref_mel.size(2)]).to(ref_mel.device)
feat = torchaudio.compliance.kaldi.fbank(audio_16k.to(ref_mel.device),
num_mel_bins=80,
dither=0,
sample_frequency=16000)
feat = feat - feat.mean(dim=0, keepdim=True)  # mean-normalized fbank features of the reference audio, shape [T, 80]
style = self.campplus_model(feat.unsqueeze(0))  # global style embedding of the reference audio, shape [1, 192]
prompt_condition = self.s2mel.models['length_regulator'](S_ref,
ylens=ref_target_lengths,
n_quantizers=3,
f0=None)[0]
self.cache_spk_cond = spk_cond_emb
self.cache_s2mel_style = style
self.cache_s2mel_prompt = prompt_condition
self.cache_spk_audio_prompt = spk_audio_prompt
self.cache_mel = ref_mel
else:
style = self.cache_s2mel_style
prompt_condition = self.cache_s2mel_prompt
spk_cond_emb = self.cache_spk_cond
ref_mel = self.cache_mel
if emo_vector is not None:
weight_vector = torch.tensor(emo_vector, device=self.device)
if use_random:
random_index = [random.randint(0, x - 1) for x in self.emo_num]
else:
random_index = [find_most_similar_cosine(style, tmp) for tmp in self.spk_matrix]
emo_matrix = [tmp[index].unsqueeze(0) for index, tmp in zip(random_index, self.emo_matrix)]
emo_matrix = torch.cat(emo_matrix, 0)
emovec_mat = weight_vector.unsqueeze(1) * emo_matrix
emovec_mat = torch.sum(emovec_mat, 0)
emovec_mat = emovec_mat.unsqueeze(0)
if self.cache_emo_cond is None or self.cache_emo_audio_prompt != emo_audio_prompt:
if self.cache_emo_cond is not None:
self.cache_emo_cond = None
torch.cuda.empty_cache()
emo_audio, _ = self._load_and_cut_audio(emo_audio_prompt, 15, verbose, sr=16000)
emo_inputs = self.extract_features(emo_audio, sampling_rate=16000, return_tensors="pt")
emo_input_features = emo_inputs["input_features"]
emo_attention_mask = emo_inputs["attention_mask"]
emo_input_features = emo_input_features.to(self.device)
emo_attention_mask = emo_attention_mask.to(self.device)
emo_cond_emb = self.get_emb(emo_input_features, emo_attention_mask)
self.cache_emo_cond = emo_cond_emb
self.cache_emo_audio_prompt = emo_audio_prompt
else:
emo_cond_emb = self.cache_emo_cond
self._set_gr_progress(0.1, "text processing...")
text_tokens_list = self.tokenizer.tokenize(text)
segments = self.tokenizer.split_segments(text_tokens_list, max_text_tokens_per_segment, quick_streaming_tokens=quick_streaming_tokens)
segments_count = len(segments)
text_token_ids = self.tokenizer.convert_tokens_to_ids(text_tokens_list)
if self.tokenizer.unk_token_id in text_token_ids:
print(f" >> Warning: input text contains {text_token_ids.count(self.tokenizer.unk_token_id)} unknown tokens (id={self.tokenizer.unk_token_id}):")
print( " Tokens which can't be encoded: ", [t for t, id in zip(text_tokens_list, text_token_ids) if id == self.tokenizer.unk_token_id])
print(f" Consider updating the BPE model or modifying the text to avoid unknown tokens.")
if verbose:
print("text_tokens_list:", text_tokens_list)
print("segments count:", segments_count)
print("max_text_tokens_per_segment:", max_text_tokens_per_segment)
print(*segments, sep="\n")
do_sample = generation_kwargs.pop("do_sample", True)
top_p = generation_kwargs.pop("top_p", 0.8)
top_k = generation_kwargs.pop("top_k", 30)
temperature = generation_kwargs.pop("temperature", 0.8)
autoregressive_batch_size = 1
length_penalty = generation_kwargs.pop("length_penalty", 0.0)
num_beams = generation_kwargs.pop("num_beams", 3)
repetition_penalty = generation_kwargs.pop("repetition_penalty", 10.0)
max_mel_tokens = generation_kwargs.pop("max_mel_tokens", 1500)
sampling_rate = 22050
wavs = []
gpt_gen_time = 0
gpt_forward_time = 0
s2mel_time = 0
bigvgan_time = 0
has_warned = False
silence = None # for stream_return
for seg_idx, sent in enumerate(segments):
self._set_gr_progress(0.2 + 0.7 * seg_idx / segments_count,
f"speech synthesis {seg_idx + 1}/{segments_count}...")
text_tokens = self.tokenizer.convert_tokens_to_ids(sent)
text_tokens = torch.tensor(text_tokens, dtype=torch.int32, device=self.device).unsqueeze(0)
if verbose:
print(text_tokens)
print(f"text_tokens shape: {text_tokens.shape}, text_tokens type: {text_tokens.dtype}")
# debug tokenizer
text_token_syms = self.tokenizer.convert_ids_to_tokens(text_tokens[0].tolist())
print("text_token_syms is same as segment tokens", text_token_syms == sent)
m_start_time = time.perf_counter()
with torch.no_grad():
with torch.amp.autocast(text_tokens.device.type, enabled=self.dtype is not None, dtype=self.dtype):
emovec = self.gpt.merge_emovec(
spk_cond_emb,
emo_cond_emb,
torch.tensor([spk_cond_emb.shape[-1]], device=text_tokens.device),
torch.tensor([emo_cond_emb.shape[-1]], device=text_tokens.device),
alpha=emo_alpha
)
if emo_vector is not None:
emovec = emovec_mat + (1 - torch.sum(weight_vector)) * emovec
# emovec = emovec_mat
codes, speech_conditioning_latent = self.gpt.inference_speech(
spk_cond_emb,
text_tokens,
emo_cond_emb,
cond_lengths=torch.tensor([spk_cond_emb.shape[-1]], device=text_tokens.device),
emo_cond_lengths=torch.tensor([emo_cond_emb.shape[-1]], device=text_tokens.device),
emo_vec=emovec,
do_sample=True,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=autoregressive_batch_size,
length_penalty=length_penalty,
num_beams=num_beams,
repetition_penalty=repetition_penalty,
max_generate_length=max_mel_tokens,
**generation_kwargs
)
gpt_gen_time += time.perf_counter() - m_start_time
if not has_warned and (codes[:, -1] != self.stop_mel_token).any():
warnings.warn(
f"WARN: generation stopped due to exceeding `max_mel_tokens` ({max_mel_tokens}). "
f"Input text tokens: {text_tokens.shape[1]}. "
f"Consider reducing `max_text_tokens_per_segment`({max_text_tokens_per_segment}) or increasing `max_mel_tokens`.",
category=RuntimeWarning
)
has_warned = True
code_lens = torch.tensor([codes.shape[-1]], device=codes.device, dtype=codes.dtype)
# if verbose:
# print(codes, type(codes))
# print(f"codes shape: {codes.shape}, codes type: {codes.dtype}")
# print(f"code len: {code_lens}")
code_lens = []
max_code_len = 0
for code in codes:
if self.stop_mel_token not in code:
code_len = len(code)
else:
len_ = (code == self.stop_mel_token).nonzero(as_tuple=False)[0]
code_len = len_[0].item() if len_.numel() > 0 else len(code)
code_lens.append(code_len)
max_code_len = max(max_code_len, code_len)
codes = codes[:, :max_code_len]
code_lens = torch.LongTensor(code_lens)
code_lens = code_lens.to(self.device)
if verbose:
print(codes, type(codes))
print(f"fix codes shape: {codes.shape}, codes type: {codes.dtype}")
print(f"code len: {code_lens}")
m_start_time = time.perf_counter()
use_speed = torch.zeros(spk_cond_emb.size(0)).to(spk_cond_emb.device).long()
with torch.amp.autocast(text_tokens.device.type, enabled=self.dtype is not None, dtype=self.dtype):
latent = self.gpt(
speech_conditioning_latent,
text_tokens,
torch.tensor([text_tokens.shape[-1]], device=text_tokens.device),
codes,
torch.tensor([codes.shape[-1]], device=text_tokens.device),
emo_cond_emb,
cond_mel_lengths=torch.tensor([spk_cond_emb.shape[-1]], device=text_tokens.device),
emo_cond_mel_lengths=torch.tensor([emo_cond_emb.shape[-1]], device=text_tokens.device),
emo_vec=emovec,
use_speed=use_speed,
)
gpt_forward_time += time.perf_counter() - m_start_time
dtype = None
with torch.amp.autocast(text_tokens.device.type, enabled=dtype is not None, dtype=dtype):
m_start_time = time.perf_counter()
diffusion_steps = 25
inference_cfg_rate = 0.7
latent = self.s2mel.models['gpt_layer'](latent)
S_infer = self.semantic_codec.quantizer.vq2emb(codes.unsqueeze(1))
S_infer = S_infer.transpose(1, 2)
S_infer = S_infer + latent
target_lengths = (code_lens * 1.72).long()
cond = self.s2mel.models['length_regulator'](S_infer,
ylens=target_lengths,
n_quantizers=3,
f0=None)[0]
cat_condition = torch.cat([prompt_condition, cond], dim=1)
vc_target = self.s2mel.models['cfm'].inference(cat_condition,
torch.LongTensor([cat_condition.size(1)]).to(
cond.device),
ref_mel, style, None, diffusion_steps,
inference_cfg_rate=inference_cfg_rate)
vc_target = vc_target[:, :, ref_mel.size(-1):]
s2mel_time += time.perf_counter() - m_start_time
m_start_time = time.perf_counter()
wav = self.bigvgan(vc_target.float()).squeeze().unsqueeze(0)
print(wav.shape)
bigvgan_time += time.perf_counter() - m_start_time
wav = wav.squeeze(1)
wav = torch.clamp(32767 * wav, -32767.0, 32767.0)
if verbose:
print(f"wav shape: {wav.shape}", "min:", wav.min(), "max:", wav.max())
# wavs.append(wav[:, :-512])
wavs.append(wav.cpu()) # to cpu before saving
if stream_return:
yield wav.cpu()
if silence is None:
silence = self.interval_silence(wavs, sampling_rate=sampling_rate, interval_silence=interval_silence)
yield silence
end_time = time.perf_counter()
self._set_gr_progress(0.9, "saving audio...")
wavs = self.insert_interval_silence(wavs, sampling_rate=sampling_rate, interval_silence=interval_silence)
wav = torch.cat(wavs, dim=1)
wav_length = wav.shape[-1] / sampling_rate
print(f">> gpt_gen_time: {gpt_gen_time:.2f} seconds")
print(f">> gpt_forward_time: {gpt_forward_time:.2f} seconds")
print(f">> s2mel_time: {s2mel_time:.2f} seconds")
print(f">> bigvgan_time: {bigvgan_time:.2f} seconds")
print(f">> Total inference time: {end_time - start_time:.2f} seconds")
print(f">> Generated audio length: {wav_length:.2f} seconds")
print(f">> RTF: {(end_time - start_time) / wav_length:.4f}")
# save audio
wav = wav.cpu() # to cpu
if output_path:
# save the audio directly to the requested path
if os.path.isfile(output_path):
os.remove(output_path)
print(">> remove old wav file:", output_path)
if os.path.dirname(output_path) != "":
os.makedirs(os.path.dirname(output_path), exist_ok=True)
torchaudio.save(output_path, wav.type(torch.int16), sampling_rate)
print(">> wav file saved to:", output_path)
if stream_return:
return None
yield output_path
else:
if stream_return:
return None
# return in the (sampling_rate, ndarray) format expected by Gradio
wav_data = wav.type(torch.int16)
wav_data = wav_data.numpy().T
yield (sampling_rate, wav_data)
def find_most_similar_cosine(query_vector, matrix):
query_vector = query_vector.float()
matrix = matrix.float()
similarities = F.cosine_similarity(query_vector, matrix, dim=1)
most_similar_index = torch.argmax(similarities)
return most_similar_index
class QwenEmotion:
def __init__(self, model_dir):
self.model_dir = model_dir
self.tokenizer = AutoTokenizer.from_pretrained(self.model_dir)
self.model = AutoModelForCausalLM.from_pretrained(
self.model_dir,
torch_dtype="float16", # "auto"
device_map="auto"
)
self.prompt = "文本情感分类"
self.cn_key_to_en = {
"高兴": "happy",
"愤怒": "angry",
"悲伤": "sad",
"恐惧": "afraid",
"反感": "disgusted",
# TODO: the "低落" (melancholic) emotion will always be mapped to
# "悲伤" (sad) by QwenEmotion's text analysis. it doesn't know the
# difference between those emotions even if user writes exact words.
# SEE: `self.melancholic_words` for current workaround.
"低落": "melancholic",
"惊讶": "surprised",
"自然": "calm",
}
self.desired_vector_order = ["高兴", "愤怒", "悲伤", "恐惧", "反感", "低落", "惊讶", "自然"]
self.melancholic_words = {
# emotion text phrases that will force QwenEmotion's "悲伤" (sad) detection
# to become "低落" (melancholic) instead, to fix limitations mentioned above.
"低落",
"melancholy",
"melancholic",
"depression",
"depressed",
"gloomy",
}
self.max_score = 1.2
self.min_score = 0.0
def clamp_score(self, value):
return max(self.min_score, min(self.max_score, value))
def convert(self, content):
# generate emotion vector dictionary:
# - insert values in desired order (Python 3.7+ `dict` remembers insertion order)
# - convert Chinese keys to English
# - clamp all values to the allowed min/max range
# - use 0.0 for any values that were missing in `content`
emotion_dict = {
self.cn_key_to_en[cn_key]: self.clamp_score(content.get(cn_key, 0.0))
for cn_key in self.desired_vector_order
}
# default to a calm/neutral voice if all emotion vectors were empty
if all(val <= 0.0 for val in emotion_dict.values()):
print(">> no emotions detected; using default calm/neutral voice")
emotion_dict["calm"] = 1.0
return emotion_dict
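# Illustrative sketch (hypothetical input, not from the original code):
# convert({"高兴": 0.9, "悲伤": 1.5}) yields
# {"happy": 0.9, "angry": 0.0, "sad": 1.2, "afraid": 0.0, "disgusted": 0.0,
#  "melancholic": 0.0, "surprised": 0.0, "calm": 0.0}
# since missing keys default to 0.0 and values are clamped to [0.0, 1.2].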
def inference(self, text_input):
start = time.time()
messages = [
{"role": "system", "content": f"{self.prompt}"},
{"role": "user", "content": f"{text_input}"}
]
text = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False,
)
model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)
# conduct text completion
generated_ids = self.model.generate(
**model_inputs,
max_new_tokens=32768,
pad_token_id=self.tokenizer.eos_token_id
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
content = self.tokenizer.decode(output_ids[index:], skip_special_tokens=True)
# decode the JSON emotion detections as a dictionary
try:
content = json.loads(content)
except json.decoder.JSONDecodeError:
# invalid JSON; fallback to manual string parsing
# print(">> parsing QwenEmotion response", content)
content = {
m.group(1): float(m.group(2))
for m in re.finditer(r'([^\s":.,]+?)"?\s*:\s*([\d.]+)', content)
}
# print(">> dict result", content)
# workaround for QwenEmotion's inability to distinguish "悲伤" (sad) vs "低落" (melancholic).
# if we detect any of the IndexTTS "melancholic" words, we swap those vectors
# to encode the "sad" emotion as "melancholic" (instead of sadness).
text_input_lower = text_input.lower()
if any(word in text_input_lower for word in self.melancholic_words):
# print(">> before vec swap", content)
content["悲伤"], content["低落"] = content.get("低落", 0.0), content.get("悲伤", 0.0)
# print(">> after vec swap", content)
return self.convert(content)
if __name__ == "__main__":
prompt_wav = "examples/voice_01.wav"
text = '欢迎大家来体验indextts2并给予我们意见与反馈谢谢大家。'
tts = IndexTTS2(
cfg_path="checkpoints/config.yaml",
model_dir="checkpoints",
use_cuda_kernel=False,
use_torch_compile=True
)
tts.infer(spk_audio_prompt=prompt_wav, text=text, output_path="gen.wav", verbose=True)
char_size = 5
import string
time_buckets = []
for i in range(10):
text = ''.join(random.choices(string.ascii_letters, k=char_size))
start_time = time.time()
tts.infer(spk_audio_prompt=prompt_wav, text=text, output_path="gen.wav", verbose=True)
time_buckets.append(time.time() - start_time)
print(time_buckets)
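# A minimal streaming sketch (illustrative; assumes the same `tts` instance and prompt_wav
# as above, and that the caller decides what to do with each chunk):
# for chunk in tts.infer(spk_audio_prompt=prompt_wav, text=text,
#                        output_path=None, stream_return=True):
#     pass  # each yielded tensor is a waveform segment (or the inter-segment silence)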


@@ -0,0 +1,16 @@
__version__ = "1.0.0"
# preserved here for legacy reasons
__model_version__ = "latest"
import audiotools
audiotools.ml.BaseModel.INTERN += ["dac.**"]
audiotools.ml.BaseModel.EXTERN += ["einops"]
from . import nn
from . import model
from . import utils
from .model import DAC
from .model import DACFile


@@ -0,0 +1,36 @@
import sys
import argbind
from dac.utils import download
from dac.utils.decode import decode
from dac.utils.encode import encode
STAGES = ["encode", "decode", "download"]
def run(stage: str):
"""Run stages.
Parameters
----------
stage : str
Stage to run
"""
if stage not in STAGES:
raise ValueError(f"Unknown command: {stage}. Allowed commands are {STAGES}")
stage_fn = globals()[stage]
if stage == "download":
stage_fn()
return
stage_fn()
if __name__ == "__main__":
group = sys.argv.pop(1)
args = argbind.parse_args(group=group)
with argbind.scope(args):
run(group)


@@ -0,0 +1,4 @@
from .base import CodecMixin
from .base import DACFile
from .dac import DAC
from .discriminator import Discriminator


@@ -0,0 +1,294 @@
import math
from dataclasses import dataclass
from pathlib import Path
from typing import Union
import numpy as np
import torch
import tqdm
from audiotools import AudioSignal
from torch import nn
SUPPORTED_VERSIONS = ["1.0.0"]
@dataclass
class DACFile:
codes: torch.Tensor
# Metadata
chunk_length: int
original_length: int
input_db: float
channels: int
sample_rate: int
padding: bool
dac_version: str
def save(self, path):
artifacts = {
"codes": self.codes.numpy().astype(np.uint16),
"metadata": {
"input_db": self.input_db.numpy().astype(np.float32),
"original_length": self.original_length,
"sample_rate": self.sample_rate,
"chunk_length": self.chunk_length,
"channels": self.channels,
"padding": self.padding,
"dac_version": SUPPORTED_VERSIONS[-1],
},
}
path = Path(path).with_suffix(".dac")
with open(path, "wb") as f:
np.save(f, artifacts)
return path
@classmethod
def load(cls, path):
artifacts = np.load(path, allow_pickle=True)[()]
codes = torch.from_numpy(artifacts["codes"].astype(int))
if artifacts["metadata"].get("dac_version", None) not in SUPPORTED_VERSIONS:
raise RuntimeError(
f"Given file {path} can't be loaded with this version of descript-audio-codec."
)
return cls(codes=codes, **artifacts["metadata"])
class CodecMixin:
@property
def padding(self):
if not hasattr(self, "_padding"):
self._padding = True
return self._padding
@padding.setter
def padding(self, value):
assert isinstance(value, bool)
layers = [
l for l in self.modules() if isinstance(l, (nn.Conv1d, nn.ConvTranspose1d))
]
for layer in layers:
if value:
if hasattr(layer, "original_padding"):
layer.padding = layer.original_padding
else:
layer.original_padding = layer.padding
layer.padding = tuple(0 for _ in range(len(layer.padding)))
self._padding = value
def get_delay(self):
# Any number works here, delay is invariant to input length
l_out = self.get_output_length(0)
L = l_out
layers = []
for layer in self.modules():
if isinstance(layer, (nn.Conv1d, nn.ConvTranspose1d)):
layers.append(layer)
for layer in reversed(layers):
d = layer.dilation[0]
k = layer.kernel_size[0]
s = layer.stride[0]
if isinstance(layer, nn.ConvTranspose1d):
L = ((L - d * (k - 1) - 1) / s) + 1
elif isinstance(layer, nn.Conv1d):
L = (L - 1) * s + d * (k - 1) + 1
L = math.ceil(L)
l_in = L
return (l_in - l_out) // 2
def get_output_length(self, input_length):
L = input_length
# Calculate output length
for layer in self.modules():
if isinstance(layer, (nn.Conv1d, nn.ConvTranspose1d)):
d = layer.dilation[0]
k = layer.kernel_size[0]
s = layer.stride[0]
if isinstance(layer, nn.Conv1d):
L = ((L - d * (k - 1) - 1) / s) + 1
elif isinstance(layer, nn.ConvTranspose1d):
L = (L - 1) * s + d * (k - 1) + 1
L = math.floor(L)
return L
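# Worked example (illustrative only): for a single nn.Conv1d with kernel_size=4,
# stride=2, dilation=1 and no padding, an input of length 16 maps to
# floor((16 - 1*(4-1) - 1) / 2) + 1 = 7 output frames, matching the formula above.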
@torch.no_grad()
def compress(
self,
audio_path_or_signal: Union[str, Path, AudioSignal],
win_duration: float = 1.0,
verbose: bool = False,
normalize_db: float = -16,
n_quantizers: int = None,
) -> DACFile:
"""Processes an audio signal from a file or AudioSignal object into
discrete codes. This function processes the signal in short windows,
using constant GPU memory.
Parameters
----------
audio_path_or_signal : Union[str, Path, AudioSignal]
audio signal to reconstruct
win_duration : float, optional
window duration in seconds, by default 1.0
verbose : bool, optional
by default False
normalize_db : float, optional
normalize db, by default -16
n_quantizers : int, optional
number of quantizers to use, by default None (all quantizers)
Returns
-------
DACFile
Object containing compressed codes and metadata
required for decompression
"""
audio_signal = audio_path_or_signal
if isinstance(audio_signal, (str, Path)):
audio_signal = AudioSignal.load_from_file_with_ffmpeg(str(audio_signal))
self.eval()
original_padding = self.padding
original_device = audio_signal.device
audio_signal = audio_signal.clone()
original_sr = audio_signal.sample_rate
resample_fn = audio_signal.resample
loudness_fn = audio_signal.loudness
# If audio is very long (>= 10 hours), use the ffmpeg versions
if audio_signal.signal_duration >= 10 * 60 * 60:
resample_fn = audio_signal.ffmpeg_resample
loudness_fn = audio_signal.ffmpeg_loudness
original_length = audio_signal.signal_length
resample_fn(self.sample_rate)
input_db = loudness_fn()
if normalize_db is not None:
audio_signal.normalize(normalize_db)
audio_signal.ensure_max_of_audio()
nb, nac, nt = audio_signal.audio_data.shape
audio_signal.audio_data = audio_signal.audio_data.reshape(nb * nac, 1, nt)
win_duration = (
audio_signal.signal_duration if win_duration is None else win_duration
)
if audio_signal.signal_duration <= win_duration:
# Unchunked compression (used if signal length < win duration)
self.padding = True
n_samples = nt
hop = nt
else:
# Chunked inference
self.padding = False
# Zero-pad signal on either side by the delay
audio_signal.zero_pad(self.delay, self.delay)
n_samples = int(win_duration * self.sample_rate)
# Round n_samples to nearest hop length multiple
n_samples = int(math.ceil(n_samples / self.hop_length) * self.hop_length)
hop = self.get_output_length(n_samples)
codes = []
range_fn = range if not verbose else tqdm.trange
for i in range_fn(0, nt, hop):
x = audio_signal[..., i : i + n_samples]
x = x.zero_pad(0, max(0, n_samples - x.shape[-1]))
audio_data = x.audio_data.to(self.device)
audio_data = self.preprocess(audio_data, self.sample_rate)
_, c, _, _, _ = self.encode(audio_data, n_quantizers)
codes.append(c.to(original_device))
chunk_length = c.shape[-1]
codes = torch.cat(codes, dim=-1)
dac_file = DACFile(
codes=codes,
chunk_length=chunk_length,
original_length=original_length,
input_db=input_db,
channels=nac,
sample_rate=original_sr,
padding=self.padding,
dac_version=SUPPORTED_VERSIONS[-1],
)
if n_quantizers is not None:
codes = codes[:, :n_quantizers, :]
self.padding = original_padding
return dac_file
@torch.no_grad()
def decompress(
self,
obj: Union[str, Path, DACFile],
verbose: bool = False,
) -> AudioSignal:
"""Reconstruct audio from a given .dac file
Parameters
----------
obj : Union[str, Path, DACFile]
.dac file location or corresponding DACFile object.
verbose : bool, optional
Prints progress if True, by default False
Returns
-------
AudioSignal
Object with the reconstructed audio
"""
self.eval()
if isinstance(obj, (str, Path)):
obj = DACFile.load(obj)
original_padding = self.padding
self.padding = obj.padding
range_fn = range if not verbose else tqdm.trange
codes = obj.codes
original_device = codes.device
chunk_length = obj.chunk_length
recons = []
for i in range_fn(0, codes.shape[-1], chunk_length):
c = codes[..., i : i + chunk_length].to(self.device)
z = self.quantizer.from_codes(c)[0]
r = self.decode(z)
recons.append(r.to(original_device))
recons = torch.cat(recons, dim=-1)
recons = AudioSignal(recons, self.sample_rate)
resample_fn = recons.resample
loudness_fn = recons.loudness
# If audio is very long (>= 10 hours), use the ffmpeg versions
if recons.signal_duration >= 10 * 60 * 60:
resample_fn = recons.ffmpeg_resample
loudness_fn = recons.ffmpeg_loudness
recons.normalize(obj.input_db)
resample_fn(obj.sample_rate)
recons = recons[..., : obj.original_length]
loudness_fn()
recons.audio_data = recons.audio_data.reshape(
-1, obj.channels, obj.original_length
)
self.padding = original_padding
return recons


@@ -0,0 +1,400 @@
import math
from typing import List
from typing import Union
import numpy as np
import torch
from audiotools import AudioSignal
from audiotools.ml import BaseModel
from torch import nn
from .base import CodecMixin
from indextts.s2mel.dac.nn.layers import Snake1d
from indextts.s2mel.dac.nn.layers import WNConv1d
from indextts.s2mel.dac.nn.layers import WNConvTranspose1d
from indextts.s2mel.dac.nn.quantize import ResidualVectorQuantize
from .encodec import SConv1d, SConvTranspose1d, SLSTM
def init_weights(m):
if isinstance(m, nn.Conv1d):
nn.init.trunc_normal_(m.weight, std=0.02)
nn.init.constant_(m.bias, 0)
class ResidualUnit(nn.Module):
def __init__(self, dim: int = 16, dilation: int = 1, causal: bool = False):
super().__init__()
conv1d_type = SConv1d  # if causal else WNConv1d
pad = ((7 - 1) * dilation) // 2
self.block = nn.Sequential(
Snake1d(dim),
conv1d_type(dim, dim, kernel_size=7, dilation=dilation, padding=pad, causal=causal, norm='weight_norm'),
Snake1d(dim),
conv1d_type(dim, dim, kernel_size=1, causal=causal, norm='weight_norm'),
)
def forward(self, x):
y = self.block(x)
pad = (x.shape[-1] - y.shape[-1]) // 2
if pad > 0:
x = x[..., pad:-pad]
return x + y
class EncoderBlock(nn.Module):
def __init__(self, dim: int = 16, stride: int = 1, causal: bool = False):
super().__init__()
conv1d_type = SConv1d  # if causal else WNConv1d
self.block = nn.Sequential(
ResidualUnit(dim // 2, dilation=1, causal=causal),
ResidualUnit(dim // 2, dilation=3, causal=causal),
ResidualUnit(dim // 2, dilation=9, causal=causal),
Snake1d(dim // 2),
conv1d_type(
dim // 2,
dim,
kernel_size=2 * stride,
stride=stride,
padding=math.ceil(stride / 2),
causal=causal,
norm='weight_norm',
),
)
def forward(self, x):
return self.block(x)
class Encoder(nn.Module):
def __init__(
self,
d_model: int = 64,
strides: list = [2, 4, 8, 8],
d_latent: int = 64,
causal: bool = False,
lstm: int = 2,
):
super().__init__()
conv1d_type = SConv1d  # if causal else WNConv1d
# Create first convolution
self.block = [conv1d_type(1, d_model, kernel_size=7, padding=3, causal=causal, norm='weight_norm')]
# Create EncoderBlocks that double channels as they downsample by `stride`
for stride in strides:
d_model *= 2
self.block += [EncoderBlock(d_model, stride=stride, causal=causal)]
# Add LSTM if needed
self.use_lstm = lstm
if lstm:
self.block += [SLSTM(d_model, lstm)]
# Create last convolution
self.block += [
Snake1d(d_model),
conv1d_type(d_model, d_latent, kernel_size=3, padding=1, causal=causal, norm='weight_norm'),
]
# Wrap the block list into nn.Sequential
self.block = nn.Sequential(*self.block)
self.enc_dim = d_model
def forward(self, x):
return self.block(x)
def reset_cache(self):
# recursively find all submodules named SConv1d in self.block and use their reset_cache method
def reset_cache(m):
if isinstance(m, SConv1d) or isinstance(m, SLSTM):
m.reset_cache()
return
for child in m.children():
reset_cache(child)
reset_cache(self.block)
class DecoderBlock(nn.Module):
def __init__(self, input_dim: int = 16, output_dim: int = 8, stride: int = 1, causal: bool = False):
super().__init__()
conv1d_type = SConvTranspose1d  # if causal else WNConvTranspose1d
self.block = nn.Sequential(
Snake1d(input_dim),
conv1d_type(
input_dim,
output_dim,
kernel_size=2 * stride,
stride=stride,
padding=math.ceil(stride / 2),
causal=causal,
norm='weight_norm'
),
ResidualUnit(output_dim, dilation=1, causal=causal),
ResidualUnit(output_dim, dilation=3, causal=causal),
ResidualUnit(output_dim, dilation=9, causal=causal),
)
def forward(self, x):
return self.block(x)
class Decoder(nn.Module):
def __init__(
self,
input_channel,
channels,
rates,
d_out: int = 1,
causal: bool = False,
lstm: int = 2,
):
super().__init__()
conv1d_type = SConv1d  # if causal else WNConv1d
# Add first conv layer
layers = [conv1d_type(input_channel, channels, kernel_size=7, padding=3, causal=causal, norm='weight_norm')]
if lstm:
layers += [SLSTM(channels, num_layers=lstm)]
# Add upsampling + MRF blocks
for i, stride in enumerate(rates):
input_dim = channels // 2**i
output_dim = channels // 2 ** (i + 1)
layers += [DecoderBlock(input_dim, output_dim, stride, causal=causal)]
# Add final conv layer
layers += [
Snake1d(output_dim),
conv1d_type(output_dim, d_out, kernel_size=7, padding=3, causal=causal, norm='weight_norm'),
nn.Tanh(),
]
self.model = nn.Sequential(*layers)
def forward(self, x):
return self.model(x)
class DAC(BaseModel, CodecMixin):
def __init__(
self,
encoder_dim: int = 64,
encoder_rates: List[int] = [2, 4, 8, 8],
latent_dim: int = None,
decoder_dim: int = 1536,
decoder_rates: List[int] = [8, 8, 4, 2],
n_codebooks: int = 9,
codebook_size: int = 1024,
codebook_dim: Union[int, list] = 8,
quantizer_dropout: bool = False,
sample_rate: int = 44100,
lstm: int = 2,
causal: bool = False,
):
super().__init__()
self.encoder_dim = encoder_dim
self.encoder_rates = encoder_rates
self.decoder_dim = decoder_dim
self.decoder_rates = decoder_rates
self.sample_rate = sample_rate
if latent_dim is None:
latent_dim = encoder_dim * (2 ** len(encoder_rates))
self.latent_dim = latent_dim
self.hop_length = np.prod(encoder_rates)
self.encoder = Encoder(encoder_dim, encoder_rates, latent_dim, causal=causal, lstm=lstm)
self.n_codebooks = n_codebooks
self.codebook_size = codebook_size
self.codebook_dim = codebook_dim
self.quantizer = ResidualVectorQuantize(
input_dim=latent_dim,
n_codebooks=n_codebooks,
codebook_size=codebook_size,
codebook_dim=codebook_dim,
quantizer_dropout=quantizer_dropout,
)
self.decoder = Decoder(
latent_dim,
decoder_dim,
decoder_rates,
lstm=lstm,
causal=causal,
)
self.sample_rate = sample_rate
self.apply(init_weights)
self.delay = self.get_delay()
def preprocess(self, audio_data, sample_rate):
if sample_rate is None:
sample_rate = self.sample_rate
assert sample_rate == self.sample_rate
length = audio_data.shape[-1]
right_pad = math.ceil(length / self.hop_length) * self.hop_length - length
audio_data = nn.functional.pad(audio_data, (0, right_pad))
return audio_data
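# Illustrative sketch (assumed default values): with encoder_rates [2, 4, 8, 8] the
# hop_length is 2*4*8*8 = 512, so a 44100-sample input is right-padded by
# ceil(44100 / 512) * 512 - 44100 = 444 samples before encoding.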
def encode(
self,
audio_data: torch.Tensor,
n_quantizers: int = None,
):
"""Encode given audio data and return quantized latent codes
Parameters
----------
audio_data : Tensor[B x 1 x T]
Audio data to encode
n_quantizers : int, optional
Number of quantizers to use, by default None
If None, all quantizers are used.
Returns
-------
tuple
A tuple of (z, codes, latents, commitment_loss, codebook_loss):
"z" : Tensor[B x D x T]
Quantized continuous representation of input
"codes" : Tensor[B x N x T]
Codebook indices for each codebook
(quantized discrete representation of input)
"latents" : Tensor[B x N*D x T]
Projected latents (continuous representation of input before quantization)
"vq/commitment_loss" : Tensor[1]
Commitment loss to train encoder to predict vectors closer to codebook
entries
"vq/codebook_loss" : Tensor[1]
Codebook loss to update the codebook
"length" : int
Number of samples in input audio
"""
z = self.encoder(audio_data)
z, codes, latents, commitment_loss, codebook_loss = self.quantizer(
z, n_quantizers
)
return z, codes, latents, commitment_loss, codebook_loss
def decode(self, z: torch.Tensor):
"""Decode given latent codes and return audio data
Parameters
----------
z : Tensor[B x D x T]
Quantized continuous representation of input
length : int, optional
Number of samples in output audio, by default None
Returns
-------
dict
A dictionary with the following keys:
"audio" : Tensor[B x 1 x length]
Decoded audio data.
"""
return self.decoder(z)
def forward(
self,
audio_data: torch.Tensor,
sample_rate: int = None,
n_quantizers: int = None,
):
"""Model forward pass
Parameters
----------
audio_data : Tensor[B x 1 x T]
Audio data to encode
sample_rate : int, optional
Sample rate of audio data in Hz, by default None
If None, defaults to `self.sample_rate`
n_quantizers : int, optional
Number of quantizers to use, by default None.
If None, all quantizers are used.
Returns
-------
dict
A dictionary with the following keys:
"z" : Tensor[B x D x T]
Quantized continuous representation of input
"codes" : Tensor[B x N x T]
Codebook indices for each codebook
(quantized discrete representation of input)
"latents" : Tensor[B x N*D x T]
Projected latents (continuous representation of input before quantization)
"vq/commitment_loss" : Tensor[1]
Commitment loss to train encoder to predict vectors closer to codebook
entries
"vq/codebook_loss" : Tensor[1]
Codebook loss to update the codebook
"length" : int
Number of samples in input audio
"audio" : Tensor[B x 1 x length]
Decoded audio data.
"""
length = audio_data.shape[-1]
audio_data = self.preprocess(audio_data, sample_rate)
z, codes, latents, commitment_loss, codebook_loss = self.encode(
audio_data, n_quantizers
)
x = self.decode(z)
return {
"audio": x[..., :length],
"z": z,
"codes": codes,
"latents": latents,
"vq/commitment_loss": commitment_loss,
"vq/codebook_loss": codebook_loss,
}
if __name__ == "__main__":
import numpy as np
from functools import partial
model = DAC().to("cpu")
for n, m in model.named_modules():
o = m.extra_repr()
p = sum([np.prod(p.size()) for p in m.parameters()])
fn = lambda o, p: o + f" {p/1e6:<.3f}M params."
setattr(m, "extra_repr", partial(fn, o=o, p=p))
print(model)
print("Total # of params: ", sum([np.prod(p.size()) for p in model.parameters()]))
length = 88200 * 2
x = torch.randn(1, 1, length).to(model.device)
x.requires_grad_(True)
x.retain_grad()
# Make a forward pass
out = model(x)["audio"]
print("Input shape:", x.shape)
print("Output shape:", out.shape)
# Create gradient variable
grad = torch.zeros_like(out)
grad[:, :, grad.shape[-1] // 2] = 1
# Make a backward pass
out.backward(grad)
# Check non-zero values
gradmap = x.grad.squeeze(0)
gradmap = (gradmap != 0).sum(0) # sum across features
rf = (gradmap != 0).sum()
print(f"Receptive field: {rf.item()}")
x = AudioSignal(torch.randn(1, 1, 44100 * 60), 44100)
model.decompress(model.compress(x, verbose=True), verbose=True)


@@ -0,0 +1,228 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from audiotools import AudioSignal
from audiotools import ml
from audiotools import STFTParams
from einops import rearrange
from torch.nn.utils import weight_norm
def WNConv1d(*args, **kwargs):
act = kwargs.pop("act", True)
conv = weight_norm(nn.Conv1d(*args, **kwargs))
if not act:
return conv
return nn.Sequential(conv, nn.LeakyReLU(0.1))
def WNConv2d(*args, **kwargs):
act = kwargs.pop("act", True)
conv = weight_norm(nn.Conv2d(*args, **kwargs))
if not act:
return conv
return nn.Sequential(conv, nn.LeakyReLU(0.1))
class MPD(nn.Module):
def __init__(self, period):
super().__init__()
self.period = period
self.convs = nn.ModuleList(
[
WNConv2d(1, 32, (5, 1), (3, 1), padding=(2, 0)),
WNConv2d(32, 128, (5, 1), (3, 1), padding=(2, 0)),
WNConv2d(128, 512, (5, 1), (3, 1), padding=(2, 0)),
WNConv2d(512, 1024, (5, 1), (3, 1), padding=(2, 0)),
WNConv2d(1024, 1024, (5, 1), 1, padding=(2, 0)),
]
)
self.conv_post = WNConv2d(
1024, 1, kernel_size=(3, 1), padding=(1, 0), act=False
)
def pad_to_period(self, x):
t = x.shape[-1]
x = F.pad(x, (0, self.period - t % self.period), mode="reflect")
return x
def forward(self, x):
fmap = []
x = self.pad_to_period(x)
x = rearrange(x, "b c (l p) -> b c l p", p=self.period)
for layer in self.convs:
x = layer(x)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
return fmap
class MSD(nn.Module):
def __init__(self, rate: int = 1, sample_rate: int = 44100):
super().__init__()
self.convs = nn.ModuleList(
[
WNConv1d(1, 16, 15, 1, padding=7),
WNConv1d(16, 64, 41, 4, groups=4, padding=20),
WNConv1d(64, 256, 41, 4, groups=16, padding=20),
WNConv1d(256, 1024, 41, 4, groups=64, padding=20),
WNConv1d(1024, 1024, 41, 4, groups=256, padding=20),
WNConv1d(1024, 1024, 5, 1, padding=2),
]
)
self.conv_post = WNConv1d(1024, 1, 3, 1, padding=1, act=False)
self.sample_rate = sample_rate
self.rate = rate
def forward(self, x):
x = AudioSignal(x, self.sample_rate)
x.resample(self.sample_rate // self.rate)
x = x.audio_data
fmap = []
for l in self.convs:
x = l(x)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
return fmap
BANDS = [(0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
class MRD(nn.Module):
def __init__(
self,
window_length: int,
hop_factor: float = 0.25,
sample_rate: int = 44100,
bands: list = BANDS,
):
"""Complex multi-band spectrogram discriminator.
Parameters
----------
window_length : int
Window length of STFT.
hop_factor : float, optional
Hop factor of the STFT; the hop length is ``hop_factor * window_length``, by default 0.25.
sample_rate : int, optional
Sampling rate of audio in Hz, by default 44100
bands : list, optional
Bands to run discriminator over.
"""
super().__init__()
self.window_length = window_length
self.hop_factor = hop_factor
self.sample_rate = sample_rate
self.stft_params = STFTParams(
window_length=window_length,
hop_length=int(window_length * hop_factor),
match_stride=True,
)
n_fft = window_length // 2 + 1
bands = [(int(b[0] * n_fft), int(b[1] * n_fft)) for b in bands]
self.bands = bands
ch = 32
convs = lambda: nn.ModuleList(
[
WNConv2d(2, ch, (3, 9), (1, 1), padding=(1, 4)),
WNConv2d(ch, ch, (3, 9), (1, 2), padding=(1, 4)),
WNConv2d(ch, ch, (3, 9), (1, 2), padding=(1, 4)),
WNConv2d(ch, ch, (3, 9), (1, 2), padding=(1, 4)),
WNConv2d(ch, ch, (3, 3), (1, 1), padding=(1, 1)),
]
)
self.band_convs = nn.ModuleList([convs() for _ in range(len(self.bands))])
self.conv_post = WNConv2d(ch, 1, (3, 3), (1, 1), padding=(1, 1), act=False)
def spectrogram(self, x):
x = AudioSignal(x, self.sample_rate, stft_params=self.stft_params)
x = torch.view_as_real(x.stft())
x = rearrange(x, "b 1 f t c -> (b 1) c t f")
# Split into bands
x_bands = [x[..., b[0] : b[1]] for b in self.bands]
return x_bands
def forward(self, x):
x_bands = self.spectrogram(x)
fmap = []
x = []
for band, stack in zip(x_bands, self.band_convs):
for layer in stack:
band = layer(band)
fmap.append(band)
x.append(band)
x = torch.cat(x, dim=-1)
x = self.conv_post(x)
fmap.append(x)
return fmap
class Discriminator(nn.Module):
def __init__(
self,
rates: list = [],
periods: list = [2, 3, 5, 7, 11],
fft_sizes: list = [2048, 1024, 512],
sample_rate: int = 44100,
bands: list = BANDS,
):
"""Discriminator that combines multiple discriminators.
Parameters
----------
rates : list, optional
sampling rates (in Hz) to run MSD at, by default []
If empty, MSD is not used.
periods : list, optional
periods (of samples) to run MPD at, by default [2, 3, 5, 7, 11]
fft_sizes : list, optional
Window sizes of the FFT to run MRD at, by default [2048, 1024, 512]
sample_rate : int, optional
Sampling rate of audio in Hz, by default 44100
bands : list, optional
Bands to run MRD at, by default `BANDS`
"""
super().__init__()
discs = []
discs += [MPD(p) for p in periods]
discs += [MSD(r, sample_rate=sample_rate) for r in rates]
discs += [MRD(f, sample_rate=sample_rate, bands=bands) for f in fft_sizes]
self.discriminators = nn.ModuleList(discs)
def preprocess(self, y):
# Remove DC offset
y = y - y.mean(dim=-1, keepdims=True)
# Peak normalize the volume of input audio
y = 0.8 * y / (y.abs().max(dim=-1, keepdim=True)[0] + 1e-9)
return y
def forward(self, x):
x = self.preprocess(x)
fmaps = [d(x) for d in self.discriminators]
return fmaps
if __name__ == "__main__":
disc = Discriminator()
x = torch.zeros(1, 1, 44100)
results = disc(x)
for i, result in enumerate(results):
print(f"disc{i}")
for i, r in enumerate(result):
print(r.shape, r.mean(), r.min(), r.max())
print()


@@ -0,0 +1,320 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Convolutional layers wrappers and utilities."""
import math
import typing as tp
import warnings
import torch
from torch import nn
from torch.nn import functional as F
from torch.nn.utils import spectral_norm, weight_norm
import typing as tp
import einops
class ConvLayerNorm(nn.LayerNorm):
"""
Convolution-friendly LayerNorm that moves channels to last dimensions
before running the normalization and moves them back to original position right after.
"""
def __init__(self, normalized_shape: tp.Union[int, tp.List[int], torch.Size], **kwargs):
super().__init__(normalized_shape, **kwargs)
def forward(self, x):
x = einops.rearrange(x, 'b ... t -> b t ...')
x = super().forward(x)
x = einops.rearrange(x, 'b t ... -> b ... t')
return x
CONV_NORMALIZATIONS = frozenset(['none', 'weight_norm', 'spectral_norm',
'time_layer_norm', 'layer_norm', 'time_group_norm'])
def apply_parametrization_norm(module: nn.Module, norm: str = 'none') -> nn.Module:
assert norm in CONV_NORMALIZATIONS
if norm == 'weight_norm':
return weight_norm(module)
elif norm == 'spectral_norm':
return spectral_norm(module)
else:
# We already check was in CONV_NORMALIZATION, so any other choice
# doesn't need reparametrization.
return module
def get_norm_module(module: nn.Module, causal: bool = False, norm: str = 'none', **norm_kwargs) -> nn.Module:
"""Return the proper normalization module. If causal is True, this will ensure the returned
module is causal, or return an error if the normalization doesn't support causal evaluation.
"""
assert norm in CONV_NORMALIZATIONS
if norm == 'layer_norm':
assert isinstance(module, nn.modules.conv._ConvNd)
return ConvLayerNorm(module.out_channels, **norm_kwargs)
elif norm == 'time_group_norm':
if causal:
raise ValueError("GroupNorm doesn't support causal evaluation.")
assert isinstance(module, nn.modules.conv._ConvNd)
return nn.GroupNorm(1, module.out_channels, **norm_kwargs)
else:
return nn.Identity()
def get_extra_padding_for_conv1d(x: torch.Tensor, kernel_size: int, stride: int,
padding_total: int = 0) -> int:
"""See `pad_for_conv1d`.
"""
length = x.shape[-1]
n_frames = (length - kernel_size + padding_total) / stride + 1
ideal_length = (math.ceil(n_frames) - 1) * stride + (kernel_size - padding_total)
return ideal_length - length
def pad_for_conv1d(x: torch.Tensor, kernel_size: int, stride: int, padding_total: int = 0):
"""Pad for a convolution to make sure that the last window is full.
Extra padding is added at the end. This is required to ensure that we can rebuild
an output of the same length, as otherwise, even with padding, some time steps
might get removed.
For instance, with total padding = 4, kernel size = 4, stride = 2:
0 0 1 2 3 4 5 0 0 # (0s are padding)
1 2 3 # (output frames of a convolution, last 0 is never used)
0 0 1 2 3 4 5 0 # (output of tr. conv., but pos. 5 is going to get removed as padding)
1 2 3 4 # once you removed padding, we are missing one time step !
"""
extra_padding = get_extra_padding_for_conv1d(x, kernel_size, stride, padding_total)
return F.pad(x, (0, extra_padding))
def pad1d(x: torch.Tensor, paddings: tp.Tuple[int, int], mode: str = 'zero', value: float = 0.):
"""Tiny wrapper around F.pad, just to allow for reflect padding on small input.
If this is the case, we insert extra 0 padding to the right before the reflection happen.
"""
length = x.shape[-1]
padding_left, padding_right = paddings
assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
if mode == 'reflect':
max_pad = max(padding_left, padding_right)
extra_pad = 0
if length <= max_pad:
extra_pad = max_pad - length + 1
x = F.pad(x, (0, extra_pad))
padded = F.pad(x, paddings, mode, value)
end = padded.shape[-1] - extra_pad
return padded[..., :end]
else:
return F.pad(x, paddings, mode, value)
def unpad1d(x: torch.Tensor, paddings: tp.Tuple[int, int]):
"""Remove padding from x, handling properly zero padding. Only for 1d!"""
padding_left, padding_right = paddings
assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
assert (padding_left + padding_right) <= x.shape[-1]
end = x.shape[-1] - padding_right
return x[..., padding_left: end]
class NormConv1d(nn.Module):
"""Wrapper around Conv1d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, causal: bool = False, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.conv = apply_parametrization_norm(nn.Conv1d(*args, **kwargs), norm)
self.norm = get_norm_module(self.conv, causal, norm, **norm_kwargs)
self.norm_type = norm
def forward(self, x):
x = self.conv(x)
x = self.norm(x)
return x
class NormConv2d(nn.Module):
"""Wrapper around Conv2d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.conv = apply_parametrization_norm(nn.Conv2d(*args, **kwargs), norm)
self.norm = get_norm_module(self.conv, causal=False, norm=norm, **norm_kwargs)
self.norm_type = norm
def forward(self, x):
x = self.conv(x)
x = self.norm(x)
return x
class NormConvTranspose1d(nn.Module):
"""Wrapper around ConvTranspose1d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, causal: bool = False, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.convtr = apply_parametrization_norm(nn.ConvTranspose1d(*args, **kwargs), norm)
self.norm = get_norm_module(self.convtr, causal, norm, **norm_kwargs)
self.norm_type = norm
def forward(self, x):
x = self.convtr(x)
x = self.norm(x)
return x
class NormConvTranspose2d(nn.Module):
"""Wrapper around ConvTranspose2d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.convtr = apply_parametrization_norm(nn.ConvTranspose2d(*args, **kwargs), norm)
self.norm = get_norm_module(self.convtr, causal=False, norm=norm, **norm_kwargs)
def forward(self, x):
x = self.convtr(x)
x = self.norm(x)
return x
class SConv1d(nn.Module):
"""Conv1d with some builtin handling of asymmetric or causal padding
and normalization.
"""
def __init__(self, in_channels: int, out_channels: int,
kernel_size: int, stride: int = 1, dilation: int = 1,
groups: int = 1, bias: bool = True, causal: bool = False,
norm: str = 'none', norm_kwargs: tp.Dict[str, tp.Any] = {},
pad_mode: str = 'reflect', **kwargs):
super().__init__()
# warn user on unusual setup between dilation and stride
if stride > 1 and dilation > 1:
warnings.warn('SConv1d has been initialized with stride > 1 and dilation > 1'
f' (kernel_size={kernel_size} stride={stride}, dilation={dilation}).')
self.conv = NormConv1d(in_channels, out_channels, kernel_size, stride,
dilation=dilation, groups=groups, bias=bias, causal=causal,
norm=norm, norm_kwargs=norm_kwargs)
self.causal = causal
self.pad_mode = pad_mode
self.cache_enabled = False
def reset_cache(self):
"""Reset the cache when starting a new stream."""
self.cache = None
self.cache_enabled = True
def forward(self, x):
B, C, T = x.shape
kernel_size = self.conv.conv.kernel_size[0]
stride = self.conv.conv.stride[0]
dilation = self.conv.conv.dilation[0]
kernel_size = (kernel_size - 1) * dilation + 1 # effective kernel size with dilations
padding_total = kernel_size - stride
extra_padding = get_extra_padding_for_conv1d(x, kernel_size, stride, padding_total)
if self.causal:
# Left padding for causal
if self.cache_enabled and self.cache is not None:
# Concatenate the cache (previous inputs) with the new input for streaming
x = torch.cat([self.cache, x], dim=2)
else:
x = pad1d(x, (padding_total, extra_padding), mode=self.pad_mode)
else:
# Asymmetric padding required for odd strides
padding_right = padding_total // 2
padding_left = padding_total - padding_right
x = pad1d(x, (padding_left, padding_right + extra_padding), mode=self.pad_mode)
# Store the most recent input frames for future cache use
if self.cache_enabled:
if self.cache is None:
# Initialize cache with zeros (at the start of streaming)
self.cache = torch.zeros(B, C, kernel_size - 1, device=x.device)
# Update the cache by storing the latest input frames
if kernel_size > 1:
self.cache = x[:, :, -kernel_size + 1:].detach() # Only store the necessary frames
return self.conv(x)
class SConvTranspose1d(nn.Module):
"""ConvTranspose1d with some builtin handling of asymmetric or causal padding
and normalization.
"""
def __init__(self, in_channels: int, out_channels: int,
kernel_size: int, stride: int = 1, causal: bool = False,
norm: str = 'none', trim_right_ratio: float = 1.,
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.convtr = NormConvTranspose1d(in_channels, out_channels, kernel_size, stride,
causal=causal, norm=norm, norm_kwargs=norm_kwargs)
self.causal = causal
self.trim_right_ratio = trim_right_ratio
assert self.causal or self.trim_right_ratio == 1., \
"`trim_right_ratio` != 1.0 only makes sense for causal convolutions"
assert self.trim_right_ratio >= 0. and self.trim_right_ratio <= 1.
def forward(self, x):
kernel_size = self.convtr.convtr.kernel_size[0]
stride = self.convtr.convtr.stride[0]
padding_total = kernel_size - stride
y = self.convtr(x)
# We will only trim fixed padding. Extra padding from `pad_for_conv1d` would be
# removed at the very end, when keeping only the right length for the output,
# as removing it here would require also passing the length at the matching layer
# in the encoder.
if self.causal:
# Trim the padding on the right according to the specified ratio
# if trim_right_ratio = 1.0, trim everything from right
padding_right = math.ceil(padding_total * self.trim_right_ratio)
padding_left = padding_total - padding_right
y = unpad1d(y, (padding_left, padding_right))
else:
# Asymmetric padding required for odd strides
padding_right = padding_total // 2
padding_left = padding_total - padding_right
y = unpad1d(y, (padding_left, padding_right))
return y
class SLSTM(nn.Module):
"""
LSTM without worrying about the hidden state, nor the layout of the data.
Expects input as convolutional layout.
"""
def __init__(self, dimension: int, num_layers: int = 2, skip: bool = True):
super().__init__()
self.skip = skip
self.lstm = nn.LSTM(dimension, dimension, num_layers)
self.hidden = None
self.cache_enabled = False
def forward(self, x):
x = x.permute(2, 0, 1)
if self.training or not self.cache_enabled:
y, _ = self.lstm(x)
else:
y, self.hidden = self.lstm(x, self.hidden)
if self.skip:
y = y + x
y = y.permute(1, 2, 0)
return y
def reset_cache(self):
self.hidden = None
self.cache_enabled = True


@@ -0,0 +1,3 @@
from . import layers
from . import loss
from . import quantize


@@ -0,0 +1,33 @@
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from torch.nn.utils import weight_norm
def WNConv1d(*args, **kwargs):
return weight_norm(nn.Conv1d(*args, **kwargs))
def WNConvTranspose1d(*args, **kwargs):
return weight_norm(nn.ConvTranspose1d(*args, **kwargs))
# Scripting this brings model speed up 1.4x
@torch.jit.script
def snake(x, alpha):
shape = x.shape
x = x.reshape(shape[0], shape[1], -1)
x = x + (alpha + 1e-9).reciprocal() * torch.sin(alpha * x).pow(2)
x = x.reshape(shape)
return x
class Snake1d(nn.Module):
def __init__(self, channels):
super().__init__()
self.alpha = nn.Parameter(torch.ones(1, channels, 1))
def forward(self, x):
return snake(x, self.alpha)
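# Note (added for clarity, not in the original): Snake1d applies the snake activation
# snake(x) = x + (1 / (alpha + 1e-9)) * sin(alpha * x)^2 element-wise, with a learnable
# per-channel alpha initialized to 1.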


@@ -0,0 +1,368 @@
import typing
from typing import List
import torch
import torch.nn.functional as F
from audiotools import AudioSignal
from audiotools import STFTParams
from torch import nn
class L1Loss(nn.L1Loss):
"""L1 Loss between AudioSignals. Defaults
to comparing ``audio_data``, but any
attribute of an AudioSignal can be used.
Parameters
----------
attribute : str, optional
Attribute of signal to compare, defaults to ``audio_data``.
weight : float, optional
Weight of this loss, defaults to 1.0.
Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/distance.py
"""
def __init__(self, attribute: str = "audio_data", weight: float = 1.0, **kwargs):
self.attribute = attribute
self.weight = weight
super().__init__(**kwargs)
def forward(self, x: AudioSignal, y: AudioSignal):
"""
Parameters
----------
x : AudioSignal
Estimate AudioSignal
y : AudioSignal
Reference AudioSignal
Returns
-------
torch.Tensor
L1 loss between AudioSignal attributes.
"""
if isinstance(x, AudioSignal):
x = getattr(x, self.attribute)
y = getattr(y, self.attribute)
return super().forward(x, y)
class SISDRLoss(nn.Module):
"""
Computes the Scale-Invariant Source-to-Distortion Ratio between a batch
of estimated and reference audio signals or aligned features.
Parameters
----------
scaling : int, optional
Whether to use scale-invariant (True) or
signal-to-noise ratio (False), by default True
reduction : str, optional
How to reduce across the batch (either 'mean',
'sum', or None), by default 'mean'
zero_mean : int, optional
Zero mean the references and estimates before
computing the loss, by default True
clip_min : int, optional
The minimum possible loss value. Helps network
to not focus on making already good examples better, by default None
weight : float, optional
Weight of this loss, defaults to 1.0.
Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/distance.py
"""
def __init__(
self,
scaling: int = True,
reduction: str = "mean",
zero_mean: int = True,
clip_min: int = None,
weight: float = 1.0,
):
self.scaling = scaling
self.reduction = reduction
self.zero_mean = zero_mean
self.clip_min = clip_min
self.weight = weight
super().__init__()
def forward(self, x: AudioSignal, y: AudioSignal):
eps = 1e-8
# nb, nc, nt
if isinstance(x, AudioSignal):
references = x.audio_data
estimates = y.audio_data
else:
references = x
estimates = y
nb = references.shape[0]
references = references.reshape(nb, 1, -1).permute(0, 2, 1)
estimates = estimates.reshape(nb, 1, -1).permute(0, 2, 1)
# samples now on axis 1
if self.zero_mean:
mean_reference = references.mean(dim=1, keepdim=True)
mean_estimate = estimates.mean(dim=1, keepdim=True)
else:
mean_reference = 0
mean_estimate = 0
_references = references - mean_reference
_estimates = estimates - mean_estimate
references_projection = (_references**2).sum(dim=-2) + eps
references_on_estimates = (_estimates * _references).sum(dim=-2) + eps
scale = (
(references_on_estimates / references_projection).unsqueeze(1)
if self.scaling
else 1
)
e_true = scale * _references
e_res = _estimates - e_true
signal = (e_true**2).sum(dim=1)
noise = (e_res**2).sum(dim=1)
sdr = -10 * torch.log10(signal / noise + eps)
if self.clip_min is not None:
sdr = torch.clamp(sdr, min=self.clip_min)
if self.reduction == "mean":
sdr = sdr.mean()
elif self.reduction == "sum":
sdr = sdr.sum()
return sdr
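# Minimal usage sketch: SISDRLoss also accepts raw tensors (the else branch above),
# so random waveforms suffice for a smoke test; the first argument is treated as the
# reference and the second as the estimate.
_ref = torch.randn(2, 1, 16000)
_est = _ref + 0.1 * torch.randn_like(_ref)
_sisdr = SISDRLoss()(_ref, _est)  # scalar; more negative means a better estimate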
class MultiScaleSTFTLoss(nn.Module):
"""Computes the multi-scale STFT loss from [1].
Parameters
----------
window_lengths : List[int], optional
Length of each window of each STFT, by default [2048, 512]
loss_fn : typing.Callable, optional
How to compare each loss, by default nn.L1Loss()
clamp_eps : float, optional
Clamp on the log magnitude, below, by default 1e-5
mag_weight : float, optional
Weight of raw magnitude portion of loss, by default 1.0
log_weight : float, optional
Weight of log magnitude portion of loss, by default 1.0
pow : float, optional
Power to raise magnitude to before taking log, by default 2.0
weight : float, optional
Weight of this loss, by default 1.0
match_stride : bool, optional
Whether to match the stride of convolutional layers, by default False
References
----------
1. Engel, Jesse, Chenjie Gu, and Adam Roberts.
"DDSP: Differentiable Digital Signal Processing."
International Conference on Learning Representations. 2019.
Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/spectral.py
"""
def __init__(
self,
window_lengths: List[int] = [2048, 512],
loss_fn: typing.Callable = nn.L1Loss(),
clamp_eps: float = 1e-5,
mag_weight: float = 1.0,
log_weight: float = 1.0,
pow: float = 2.0,
weight: float = 1.0,
match_stride: bool = False,
window_type: str = None,
):
super().__init__()
self.stft_params = [
STFTParams(
window_length=w,
hop_length=w // 4,
match_stride=match_stride,
window_type=window_type,
)
for w in window_lengths
]
self.loss_fn = loss_fn
self.log_weight = log_weight
self.mag_weight = mag_weight
self.clamp_eps = clamp_eps
self.weight = weight
self.pow = pow
def forward(self, x: AudioSignal, y: AudioSignal):
"""Computes multi-scale STFT between an estimate and a reference
signal.
Parameters
----------
x : AudioSignal
Estimate signal
y : AudioSignal
Reference signal
Returns
-------
torch.Tensor
Multi-scale STFT loss.
"""
loss = 0.0
for s in self.stft_params:
x.stft(s.window_length, s.hop_length, s.window_type)
y.stft(s.window_length, s.hop_length, s.window_type)
loss += self.log_weight * self.loss_fn(
x.magnitude.clamp(self.clamp_eps).pow(self.pow).log10(),
y.magnitude.clamp(self.clamp_eps).pow(self.pow).log10(),
)
loss += self.mag_weight * self.loss_fn(x.magnitude, y.magnitude)
return loss
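# Minimal usage sketch (assumes audiotools is available): compare two AudioSignals
# built from random one-second waveforms at 44.1 kHz.
_x_sig = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
_y_sig = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
_stft_loss = MultiScaleSTFTLoss()(_x_sig, _y_sig)  # sums log- and raw-magnitude terms over both window sizes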
class MelSpectrogramLoss(nn.Module):
"""Compute distance between mel spectrograms. Can be used
in a multi-scale way.
Parameters
----------
n_mels : List[int]
Number of mels per STFT, by default [150, 80],
window_lengths : List[int], optional
Length of each window of each STFT, by default [2048, 512]
loss_fn : typing.Callable, optional
How to compare each loss, by default nn.L1Loss()
clamp_eps : float, optional
Clamp on the log magnitude, below, by default 1e-5
mag_weight : float, optional
Weight of raw magnitude portion of loss, by default 1.0
log_weight : float, optional
Weight of log magnitude portion of loss, by default 1.0
pow : float, optional
Power to raise magnitude to before taking log, by default 2.0
weight : float, optional
Weight of this loss, by default 1.0
match_stride : bool, optional
Whether to match the stride of convolutional layers, by default False
Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/spectral.py
"""
def __init__(
self,
n_mels: List[int] = [150, 80],
window_lengths: List[int] = [2048, 512],
loss_fn: typing.Callable = nn.L1Loss(),
clamp_eps: float = 1e-5,
mag_weight: float = 1.0,
log_weight: float = 1.0,
pow: float = 2.0,
weight: float = 1.0,
match_stride: bool = False,
mel_fmin: List[float] = [0.0, 0.0],
mel_fmax: List[float] = [None, None],
window_type: str = None,
):
super().__init__()
self.stft_params = [
STFTParams(
window_length=w,
hop_length=w // 4,
match_stride=match_stride,
window_type=window_type,
)
for w in window_lengths
]
self.n_mels = n_mels
self.loss_fn = loss_fn
self.clamp_eps = clamp_eps
self.log_weight = log_weight
self.mag_weight = mag_weight
self.weight = weight
self.mel_fmin = mel_fmin
self.mel_fmax = mel_fmax
self.pow = pow
def forward(self, x: AudioSignal, y: AudioSignal):
"""Computes mel loss between an estimate and a reference
signal.
Parameters
----------
x : AudioSignal
Estimate signal
y : AudioSignal
Reference signal
Returns
-------
torch.Tensor
Mel loss.
"""
loss = 0.0
for n_mels, fmin, fmax, s in zip(
self.n_mels, self.mel_fmin, self.mel_fmax, self.stft_params
):
kwargs = {
"window_length": s.window_length,
"hop_length": s.hop_length,
"window_type": s.window_type,
}
x_mels = x.mel_spectrogram(n_mels, mel_fmin=fmin, mel_fmax=fmax, **kwargs)
y_mels = y.mel_spectrogram(n_mels, mel_fmin=fmin, mel_fmax=fmax, **kwargs)
loss += self.log_weight * self.loss_fn(
x_mels.clamp(self.clamp_eps).pow(self.pow).log10(),
y_mels.clamp(self.clamp_eps).pow(self.pow).log10(),
)
loss += self.mag_weight * self.loss_fn(x_mels, y_mels)
return loss
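# Minimal usage sketch: mel distance evaluated at the two default resolutions
# (150 and 80 mel bins with 2048/512-sample windows); inputs are illustrative.
_mel_dist = MelSpectrogramLoss()(
    AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100),
    AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100),
)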
class GANLoss(nn.Module):
"""
Computes a discriminator loss, given a discriminator on
generated waveforms/spectrograms compared to ground truth
waveforms/spectrograms. Computes the loss for both the
discriminator and the generator in separate functions.
"""
def __init__(self, discriminator):
super().__init__()
self.discriminator = discriminator
def forward(self, fake, real):
d_fake = self.discriminator(fake.audio_data)
d_real = self.discriminator(real.audio_data)
return d_fake, d_real
def discriminator_loss(self, fake, real):
d_fake, d_real = self.forward(fake.clone().detach(), real)
loss_d = 0
for x_fake, x_real in zip(d_fake, d_real):
loss_d += torch.mean(x_fake[-1] ** 2)
loss_d += torch.mean((1 - x_real[-1]) ** 2)
return loss_d
def generator_loss(self, fake, real):
d_fake, d_real = self.forward(fake, real)
loss_g = 0
for x_fake in d_fake:
loss_g += torch.mean((1 - x_fake[-1]) ** 2)
loss_feature = 0
for i in range(len(d_fake)):
for j in range(len(d_fake[i]) - 1):
loss_feature += F.l1_loss(d_fake[i][j], d_real[i][j].detach())
return loss_g, loss_feature
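# Minimal usage sketch with a toy discriminator (not the repo's real multi-scale one,
# and assuming AudioSignal supports clone()/detach() as used above): the discriminator
# must return a list of sub-discriminator outputs, each a list of feature maps whose
# last entry is the score map.
_weight = torch.randn(4, 1, 15)
def _toy_discriminator(wav):
    feat = torch.tanh(F.conv1d(wav, _weight, padding=7))
    return [[feat, feat.mean(dim=1, keepdim=True)]]
_gan = GANLoss(_toy_discriminator)
_fake = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
_real = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
_d_loss = _gan.discriminator_loss(_fake, _real)
_g_loss, _feat_loss = _gan.generator_loss(_fake, _real)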

View File

@@ -0,0 +1,339 @@
from typing import Union
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from torch.nn.utils import weight_norm
from indextts.s2mel.dac.nn.layers import WNConv1d
class VectorQuantizeLegacy(nn.Module):
"""
Implementation of VQ similar to Karpathy's repo:
https://github.com/karpathy/deep-vector-quantization
removed in-out projection
"""
def __init__(self, input_dim: int, codebook_size: int):
super().__init__()
self.codebook_size = codebook_size
self.codebook = nn.Embedding(codebook_size, input_dim)
def forward(self, z, z_mask=None):
"""Quantized the input tensor using a fixed codebook and returns
the corresponding codebook vectors
Parameters
----------
z : Tensor[B x D x T]
Returns
-------
Tensor[B x D x T]
Quantized continuous representation of input
Tensor[B x T]
Codebook indices (quantized discrete representation of input)
Tensor[B x D x T]
Projected latents (continuous representation of input before quantization)
Tensor[1]
Commitment loss to train encoder to predict vectors closer to codebook
entries
Tensor[1]
Codebook loss to update the codebook
"""
z_e = z
z_q, indices = self.decode_latents(z)
if z_mask is not None:
commitment_loss = (F.mse_loss(z_e, z_q.detach(), reduction="none").mean(1) * z_mask).sum() / z_mask.sum()
codebook_loss = (F.mse_loss(z_q, z_e.detach(), reduction="none").mean(1) * z_mask).sum() / z_mask.sum()
else:
commitment_loss = F.mse_loss(z_e, z_q.detach())
codebook_loss = F.mse_loss(z_q, z_e.detach())
z_q = (
z_e + (z_q - z_e).detach()
) # noop in forward pass, straight-through gradient estimator in backward pass
return z_q, indices, z_e, commitment_loss, codebook_loss
def embed_code(self, embed_id):
return F.embedding(embed_id, self.codebook.weight)
def decode_code(self, embed_id):
return self.embed_code(embed_id).transpose(1, 2)
def decode_latents(self, latents):
encodings = rearrange(latents, "b d t -> (b t) d")
codebook = self.codebook.weight # codebook: (N x D)
# L2 normalize encodings and codebook (ViT-VQGAN)
encodings = F.normalize(encodings)
codebook = F.normalize(codebook)
# Compute euclidean distance with codebook
dist = (
encodings.pow(2).sum(1, keepdim=True)
- 2 * encodings @ codebook.t()
+ codebook.pow(2).sum(1, keepdim=True).t()
)
indices = rearrange((-dist).max(1)[1], "(b t) -> b t", b=latents.size(0))
z_q = self.decode_code(indices)
return z_q, indices
class VectorQuantize(nn.Module):
"""
Implementation of VQ similar to Karpathy's repo:
https://github.com/karpathy/deep-vector-quantization
Additionally uses following tricks from Improved VQGAN
(https://arxiv.org/pdf/2110.04627.pdf):
1. Factorized codes: Perform nearest neighbor lookup in low-dimensional space
for improved codebook usage
2. l2-normalized codes: Converts euclidean distance to cosine similarity which
improves training stability
"""
def __init__(self, input_dim: int, codebook_size: int, codebook_dim: int):
super().__init__()
self.codebook_size = codebook_size
self.codebook_dim = codebook_dim
self.in_proj = WNConv1d(input_dim, codebook_dim, kernel_size=1)
self.out_proj = WNConv1d(codebook_dim, input_dim, kernel_size=1)
self.codebook = nn.Embedding(codebook_size, codebook_dim)
def forward(self, z, z_mask=None):
"""Quantized the input tensor using a fixed codebook and returns
the corresponding codebook vectors
Parameters
----------
z : Tensor[B x D x T]
Returns
-------
Tensor[B x D x T]
Quantized continuous representation of input
Tensor[1]
Commitment loss to train encoder to predict vectors closer to codebook
entries
Tensor[1]
Codebook loss to update the codebook
Tensor[B x T]
Codebook indices (quantized discrete representation of input)
Tensor[B x D x T]
Projected latents (continuous representation of input before quantization)
"""
# Factorized codes (ViT-VQGAN) Project input into low-dimensional space
z_e = self.in_proj(z) # z_e : (B x D x T)
z_q, indices = self.decode_latents(z_e)
if z_mask is not None:
commitment_loss = (F.mse_loss(z_e, z_q.detach(), reduction="none").mean(1) * z_mask).sum() / z_mask.sum()
codebook_loss = (F.mse_loss(z_q, z_e.detach(), reduction="none").mean(1) * z_mask).sum() / z_mask.sum()
else:
commitment_loss = F.mse_loss(z_e, z_q.detach())
codebook_loss = F.mse_loss(z_q, z_e.detach())
z_q = (
z_e + (z_q - z_e).detach()
) # noop in forward pass, straight-through gradient estimator in backward pass
z_q = self.out_proj(z_q)
return z_q, commitment_loss, codebook_loss, indices, z_e
def embed_code(self, embed_id):
return F.embedding(embed_id, self.codebook.weight)
def decode_code(self, embed_id):
return self.embed_code(embed_id).transpose(1, 2)
def decode_latents(self, latents):
encodings = rearrange(latents, "b d t -> (b t) d")
codebook = self.codebook.weight # codebook: (N x D)
# L2 normalize encodings and codebook (ViT-VQGAN)
encodings = F.normalize(encodings)
codebook = F.normalize(codebook)
# Compute euclidean distance with codebook
dist = (
encodings.pow(2).sum(1, keepdim=True)
- 2 * encodings @ codebook.t()
+ codebook.pow(2).sum(1, keepdim=True).t()
)
indices = rearrange((-dist).max(1)[1], "(b t) -> b t", b=latents.size(0))
z_q = self.decode_code(indices)
return z_q, indices
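# Minimal usage sketch: project a (B, D, T) latent into the 8-dim codebook space,
# snap to the nearest code, and project back with a straight-through gradient.
_vq = VectorQuantize(input_dim=512, codebook_size=1024, codebook_dim=8)
_z_q, _commit, _codebook, _indices, _z_e = _vq(torch.randn(4, 512, 50))
# _z_q: (4, 512, 50) quantized output, _indices: (4, 50) codebook ids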
class ResidualVectorQuantize(nn.Module):
"""
Introduced in SoundStream: An end2end neural audio codec
https://arxiv.org/abs/2107.03312
"""
def __init__(
self,
input_dim: int = 512,
n_codebooks: int = 9,
codebook_size: int = 1024,
codebook_dim: Union[int, list] = 8,
quantizer_dropout: float = 0.0,
):
super().__init__()
if isinstance(codebook_dim, int):
codebook_dim = [codebook_dim for _ in range(n_codebooks)]
self.n_codebooks = n_codebooks
self.codebook_dim = codebook_dim
self.codebook_size = codebook_size
self.quantizers = nn.ModuleList(
[
VectorQuantize(input_dim, codebook_size, codebook_dim[i])
for i in range(n_codebooks)
]
)
self.quantizer_dropout = quantizer_dropout
def forward(self, z, n_quantizers: int = None):
"""Quantized the input tensor using a fixed set of `n` codebooks and returns
the corresponding codebook vectors
Parameters
----------
z : Tensor[B x D x T]
n_quantizers : int, optional
No. of quantizers to use
(n_quantizers < self.n_codebooks ex: for quantizer dropout)
Note: if `self.quantizer_dropout` is True, this argument is ignored
when in training mode, and a random number of quantizers is used.
Returns
-------
tuple
A tuple of (z, codes, latents, commitment_loss, codebook_loss):
"z" : Tensor[B x D x T]
Quantized continuous representation of input
"codes" : Tensor[B x N x T]
Codebook indices for each codebook
(quantized discrete representation of input)
"latents" : Tensor[B x N*D x T]
Projected latents (continuous representation of input before quantization)
"vq/commitment_loss" : Tensor[1]
Commitment loss to train encoder to predict vectors closer to codebook
entries
"vq/codebook_loss" : Tensor[1]
Codebook loss to update the codebook
"""
z_q = 0
residual = z
commitment_loss = 0
codebook_loss = 0
codebook_indices = []
latents = []
if n_quantizers is None:
n_quantizers = self.n_codebooks
if self.training:
n_quantizers = torch.ones((z.shape[0],)) * self.n_codebooks + 1
dropout = torch.randint(1, self.n_codebooks + 1, (z.shape[0],))
n_dropout = int(z.shape[0] * self.quantizer_dropout)
n_quantizers[:n_dropout] = dropout[:n_dropout]
n_quantizers = n_quantizers.to(z.device)
for i, quantizer in enumerate(self.quantizers):
if self.training is False and i >= n_quantizers:
break
z_q_i, commitment_loss_i, codebook_loss_i, indices_i, z_e_i = quantizer(
residual
)
# Create mask to apply quantizer dropout
mask = (
torch.full((z.shape[0],), fill_value=i, device=z.device) < n_quantizers
)
z_q = z_q + z_q_i * mask[:, None, None]
residual = residual - z_q_i
# Sum losses
commitment_loss += (commitment_loss_i * mask).mean()
codebook_loss += (codebook_loss_i * mask).mean()
codebook_indices.append(indices_i)
latents.append(z_e_i)
codes = torch.stack(codebook_indices, dim=1)
latents = torch.cat(latents, dim=1)
return z_q, codes, latents, commitment_loss, codebook_loss
def from_codes(self, codes: torch.Tensor):
"""Given the quantized codes, reconstruct the continuous representation
Parameters
----------
codes : Tensor[B x N x T]
Quantized discrete representation of input
Returns
-------
Tensor[B x D x T]
Quantized continuous representation of input
"""
z_q = 0.0
z_p = []
n_codebooks = codes.shape[1]
for i in range(n_codebooks):
z_p_i = self.quantizers[i].decode_code(codes[:, i, :])
z_p.append(z_p_i)
z_q_i = self.quantizers[i].out_proj(z_p_i)
z_q = z_q + z_q_i
return z_q, torch.cat(z_p, dim=1), codes
def from_latents(self, latents: torch.Tensor):
"""Given the unquantized latents, reconstruct the
continuous representation after quantization.
Parameters
----------
latents : Tensor[B x N x T]
Continuous representation of input after projection
Returns
-------
Tensor[B x D x T]
Quantized representation of full-projected space
Tensor[B x D x T]
Quantized representation of latent space
"""
z_q = 0
z_p = []
codes = []
dims = np.cumsum([0] + [q.codebook_dim for q in self.quantizers])
n_codebooks = np.where(dims <= latents.shape[1])[0].max(axis=0, keepdims=True)[
0
]
for i in range(n_codebooks):
j, k = dims[i], dims[i + 1]
z_p_i, codes_i = self.quantizers[i].decode_latents(latents[:, j:k, :])
z_p.append(z_p_i)
codes.append(codes_i)
z_q_i = self.quantizers[i].out_proj(z_p_i)
z_q = z_q + z_q_i
return z_q, torch.cat(z_p, dim=1), torch.stack(codes, dim=1)
if __name__ == "__main__":
rvq = ResidualVectorQuantize(quantizer_dropout=True)
x = torch.randn(16, 512, 80)
z_q, codes, latents, commitment_loss, codebook_loss = rvq(x)
print(latents.shape)

View File

@@ -0,0 +1,123 @@
from pathlib import Path
import argbind
from audiotools import ml
import indextts.s2mel.dac as dac
DAC = dac.model.DAC
Accelerator = ml.Accelerator
__MODEL_LATEST_TAGS__ = {
("44khz", "8kbps"): "0.0.1",
("24khz", "8kbps"): "0.0.4",
("16khz", "8kbps"): "0.0.5",
("44khz", "16kbps"): "1.0.0",
}
__MODEL_URLS__ = {
(
"44khz",
"0.0.1",
"8kbps",
): "https://github.com/descriptinc/descript-audio-codec/releases/download/0.0.1/weights.pth",
(
"24khz",
"0.0.4",
"8kbps",
): "https://github.com/descriptinc/descript-audio-codec/releases/download/0.0.4/weights_24khz.pth",
(
"16khz",
"0.0.5",
"8kbps",
): "https://github.com/descriptinc/descript-audio-codec/releases/download/0.0.5/weights_16khz.pth",
(
"44khz",
"1.0.0",
"16kbps",
): "https://github.com/descriptinc/descript-audio-codec/releases/download/1.0.0/weights_44khz_16kbps.pth",
}
@argbind.bind(group="download", positional=True, without_prefix=True)
def download(
model_type: str = "44khz", model_bitrate: str = "8kbps", tag: str = "latest"
):
"""
Function that downloads the weights file from URL if a local cache is not found.
Parameters
----------
model_type : str
The type of model to download. Must be one of "44khz", "24khz", or "16khz". Defaults to "44khz".
model_bitrate: str
Bitrate of the model. Must be one of "8kbps", or "16kbps". Defaults to "8kbps".
Only 44khz model supports 16kbps.
tag : str
The tag of the model to download. Defaults to "latest".
Returns
-------
Path
Directory path required to load model via audiotools.
"""
model_type = model_type.lower()
tag = tag.lower()
assert model_type in [
"44khz",
"24khz",
"16khz",
], "model_type must be one of '44khz', '24khz', or '16khz'"
assert model_bitrate in [
"8kbps",
"16kbps",
], "model_bitrate must be one of '8kbps', or '16kbps'"
if tag == "latest":
tag = __MODEL_LATEST_TAGS__[(model_type, model_bitrate)]
download_link = __MODEL_URLS__.get((model_type, tag, model_bitrate), None)
if download_link is None:
raise ValueError(
f"Could not find model with tag {tag} and model type {model_type}"
)
local_path = (
Path.home()
/ ".cache"
/ "descript"
/ "dac"
/ f"weights_{model_type}_{model_bitrate}_{tag}.pth"
)
if not local_path.exists():
local_path.parent.mkdir(parents=True, exist_ok=True)
# Download the model
import requests
response = requests.get(download_link)
if response.status_code != 200:
raise ValueError(
f"Could not download model. Received response code {response.status_code}"
)
local_path.write_bytes(response.content)
return local_path
def load_model(
model_type: str = "44khz",
model_bitrate: str = "8kbps",
tag: str = "latest",
load_path: str = None,
):
if not load_path:
load_path = download(
model_type=model_type, model_bitrate=model_bitrate, tag=tag
)
generator = DAC.load(load_path)
return generator
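# Minimal usage sketch (downloads the weights on first use, so it assumes network
# access or an already populated ~/.cache/descript/dac directory):
_dac = load_model(model_type="44khz", model_bitrate="8kbps", tag="latest")
_dac.eval()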

View File

@@ -0,0 +1,95 @@
import warnings
from pathlib import Path
import argbind
import numpy as np
import torch
from audiotools import AudioSignal
from tqdm import tqdm
from dac import DACFile
from dac.utils import load_model
warnings.filterwarnings("ignore", category=UserWarning)
@argbind.bind(group="decode", positional=True, without_prefix=True)
@torch.inference_mode()
@torch.no_grad()
def decode(
input: str,
output: str = "",
weights_path: str = "",
model_tag: str = "latest",
model_bitrate: str = "8kbps",
device: str = "cuda",
model_type: str = "44khz",
verbose: bool = False,
):
"""Decode audio from codes.
Parameters
----------
input : str
Path to input directory or file
output : str, optional
Path to output directory, by default "".
If `input` is a directory, the directory sub-tree relative to `input` is re-created in `output`.
weights_path : str, optional
Path to weights file, by default "". If not specified, the weights file will be downloaded from the internet using the
model_tag and model_type.
model_tag : str, optional
Tag of the model to use, by default "latest". Ignored if `weights_path` is specified.
model_bitrate: str
Bitrate of the model. Must be one of "8kbps", or "16kbps". Defaults to "8kbps".
device : str, optional
Device to use, by default "cuda". If "cpu", the model will be loaded on the CPU.
model_type : str, optional
The type of model to use. Must be one of "44khz", "24khz", or "16khz". Defaults to "44khz". Ignored if `weights_path` is specified.
"""
generator = load_model(
model_type=model_type,
model_bitrate=model_bitrate,
tag=model_tag,
load_path=weights_path,
)
generator.to(device)
generator.eval()
# Find all .dac files in input directory
_input = Path(input)
input_files = list(_input.glob("**/*.dac"))
# If input is a .dac file, add it to the list
if _input.suffix == ".dac":
input_files.append(_input)
# Create output directory
output = Path(output)
output.mkdir(parents=True, exist_ok=True)
for i in tqdm(range(len(input_files)), desc="Decoding files"):
# Load file
artifact = DACFile.load(input_files[i])
# Reconstruct audio from codes
recons = generator.decompress(artifact, verbose=verbose)
# Compute output path
relative_path = input_files[i].relative_to(input)
output_dir = output / relative_path.parent
if not relative_path.name:
output_dir = output
relative_path = input_files[i]
output_name = relative_path.with_suffix(".wav").name
output_path = output_dir / output_name
output_path.parent.mkdir(parents=True, exist_ok=True)
# Write to file
recons.write(output_path)
if __name__ == "__main__":
args = argbind.parse_args()
with argbind.scope(args):
decode()
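# Minimal usage sketch (paths are placeholders): calling the function directly
# decodes every .dac file under "codes/" back to .wav files under "wavs/", e.g.
#   decode("codes", output="wavs", model_type="44khz", model_bitrate="8kbps", device="cuda")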

View File

@@ -0,0 +1,94 @@
import math
import warnings
from pathlib import Path
import argbind
import numpy as np
import torch
from audiotools import AudioSignal
from audiotools.core import util
from tqdm import tqdm
from dac.utils import load_model
warnings.filterwarnings("ignore", category=UserWarning)
@argbind.bind(group="encode", positional=True, without_prefix=True)
@torch.inference_mode()
@torch.no_grad()
def encode(
input: str,
output: str = "",
weights_path: str = "",
model_tag: str = "latest",
model_bitrate: str = "8kbps",
n_quantizers: int = None,
device: str = "cuda",
model_type: str = "44khz",
win_duration: float = 5.0,
verbose: bool = False,
):
"""Encode audio files in input path to .dac format.
Parameters
----------
input : str
Path to input audio file or directory
output : str, optional
Path to output directory, by default "". If `input` is a directory, the directory sub-tree relative to `input` is re-created in `output`.
weights_path : str, optional
Path to weights file, by default "". If not specified, the weights file will be downloaded from the internet using the
model_tag and model_type.
model_tag : str, optional
Tag of the model to use, by default "latest". Ignored if `weights_path` is specified.
model_bitrate: str
Bitrate of the model. Must be one of "8kbps", or "16kbps". Defaults to "8kbps".
n_quantizers : int, optional
Number of quantizers to use, by default None. If not specified, all the quantizers will be used and the model will compress at maximum bitrate.
device : str, optional
Device to use, by default "cuda"
model_type : str, optional
The type of model to use. Must be one of "44khz", "24khz", or "16khz". Defaults to "44khz". Ignored if `weights_path` is specified.
"""
generator = load_model(
model_type=model_type,
model_bitrate=model_bitrate,
tag=model_tag,
load_path=weights_path,
)
generator.to(device)
generator.eval()
kwargs = {"n_quantizers": n_quantizers}
# Find all audio files in input path
input = Path(input)
audio_files = util.find_audio(input)
output = Path(output)
output.mkdir(parents=True, exist_ok=True)
for i in tqdm(range(len(audio_files)), desc="Encoding files"):
# Load file
signal = AudioSignal(audio_files[i])
# Encode audio to .dac format
artifact = generator.compress(signal, win_duration, verbose=verbose, **kwargs)
# Compute output path
relative_path = audio_files[i].relative_to(input)
output_dir = output / relative_path.parent
if not relative_path.name:
output_dir = output
relative_path = audio_files[i]
output_name = relative_path.with_suffix(".dac").name
output_path = output_dir / output_name
output_path.parent.mkdir(parents=True, exist_ok=True)
artifact.save(output_path)
if __name__ == "__main__":
args = argbind.parse_args()
with argbind.scope(args):
encode()
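# Minimal usage sketch (paths are placeholders): encode every audio file found
# under "wavs/" into .dac artifacts under "codes/", e.g.
#   encode("wavs", output="codes", model_type="44khz", model_bitrate="8kbps", device="cuda")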

View File

@@ -0,0 +1,12 @@
import os
from huggingface_hub import hf_hub_download
def load_custom_model_from_hf(repo_id, model_filename="pytorch_model.bin", config_filename="config.yml"):
os.makedirs("./checkpoints", exist_ok=True)
model_path = hf_hub_download(repo_id=repo_id, filename=model_filename, cache_dir="./checkpoints")
if config_filename is None:
return model_path
config_path = hf_hub_download(repo_id=repo_id, filename=config_filename, cache_dir="./checkpoints")
return model_path, config_path
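# Minimal usage sketch (repo id and filenames are placeholders): fetch a checkpoint
# and its config from the Hugging Face Hub into ./checkpoints, e.g.
#   model_path, config_path = load_custom_model_from_hf(
#       "some-org/some-model", model_filename="pytorch_model.bin", config_filename="config.yml")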

View File

@@ -0,0 +1,82 @@
import numpy as np
import torch
import torch.utils.data
from librosa.filters import mel as librosa_mel_fn
from scipy.io.wavfile import read
MAX_WAV_VALUE = 32768.0
def load_wav(full_path):
sampling_rate, data = read(full_path)
return data, sampling_rate
def dynamic_range_compression(x, C=1, clip_val=1e-5):
return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
def dynamic_range_decompression(x, C=1):
return np.exp(x) / C
def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
return torch.log(torch.clamp(x, min=clip_val) * C)
def dynamic_range_decompression_torch(x, C=1):
return torch.exp(x) / C
def spectral_normalize_torch(magnitudes):
output = dynamic_range_compression_torch(magnitudes)
return output
def spectral_de_normalize_torch(magnitudes):
output = dynamic_range_decompression_torch(magnitudes)
return output
mel_basis = {}
hann_window = {}
def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
# if torch.min(y) < -1.0:
# print("min value is ", torch.min(y))
# if torch.max(y) > 1.0:
# print("max value is ", torch.max(y))
global mel_basis, hann_window # pylint: disable=global-statement
if f"{str(sampling_rate)}_{str(fmax)}_{str(y.device)}" not in mel_basis:
mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
mel_basis[str(sampling_rate) + "_" + str(fmax) + "_" + str(y.device)] = torch.from_numpy(mel).float().to(y.device)
hann_window[str(sampling_rate) + "_" + str(y.device)] = torch.hann_window(win_size).to(y.device)
y = torch.nn.functional.pad(
y.unsqueeze(1), (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), mode="reflect"
)
y = y.squeeze(1)
spec = torch.view_as_real(
torch.stft(
y,
n_fft,
hop_length=hop_size,
win_length=win_size,
window=hann_window[str(sampling_rate) + "_" + str(y.device)],
center=center,
pad_mode="reflect",
normalized=False,
onesided=True,
return_complex=True,
)
)
spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
spec = torch.matmul(mel_basis[str(sampling_rate) + "_" + str(fmax) + "_" + str(y.device)], spec)
spec = spectral_normalize_torch(spec)
return spec
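# Minimal usage sketch with a common 22.05 kHz configuration (values are illustrative):
_wav = torch.randn(1, 22050)  # (batch, samples), roughly in [-1, 1]
_mel = mel_spectrogram(_wav, n_fft=1024, num_mels=80, sampling_rate=22050,
                       hop_size=256, win_size=1024, fmin=0, fmax=8000)
# _mel: (1, 80, frames) log-compressed mel spectrogram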

View File

@@ -0,0 +1,610 @@
import math
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from munch import Munch
import json
import argparse
from torch.nn.parallel import DistributedDataParallel as DDP
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
return False
else:
raise argparse.ArgumentTypeError("Boolean value expected.")
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
def init_weights(m, mean=0.0, std=0.01):
classname = m.__class__.__name__
if classname.find("Conv") != -1:
m.weight.data.normal_(mean, std)
def get_padding(kernel_size, dilation=1):
return int((kernel_size * dilation - dilation) / 2)
def convert_pad_shape(pad_shape):
l = pad_shape[::-1]
pad_shape = [item for sublist in l for item in sublist]
return pad_shape
def intersperse(lst, item):
result = [item] * (len(lst) * 2 + 1)
result[1::2] = lst
return result
def kl_divergence(m_p, logs_p, m_q, logs_q):
"""KL(P||Q)"""
kl = (logs_q - logs_p) - 0.5
kl += (
0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q)
)
return kl
def rand_gumbel(shape):
"""Sample from the Gumbel distribution, protect from overflows."""
uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
return -torch.log(-torch.log(uniform_samples))
def rand_gumbel_like(x):
g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
return g
def slice_segments(x, ids_str, segment_size=4):
ret = torch.zeros_like(x[:, :, :segment_size])
for i in range(x.size(0)):
idx_str = ids_str[i]
idx_end = idx_str + segment_size
ret[i] = x[i, :, idx_str:idx_end]
return ret
def slice_segments_audio(x, ids_str, segment_size=4):
ret = torch.zeros_like(x[:, :segment_size])
for i in range(x.size(0)):
idx_str = ids_str[i]
idx_end = idx_str + segment_size
ret[i] = x[i, idx_str:idx_end]
return ret
def rand_slice_segments(x, x_lengths=None, segment_size=4):
b, d, t = x.size()
if x_lengths is None:
x_lengths = t
ids_str_max = x_lengths - segment_size + 1
ids_str = ((torch.rand([b]).to(device=x.device) * ids_str_max).clip(0)).to(
dtype=torch.long
)
ret = slice_segments(x, ids_str, segment_size)
return ret, ids_str
def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4):
position = torch.arange(length, dtype=torch.float)
num_timescales = channels // 2
log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / (
num_timescales - 1
)
inv_timescales = min_timescale * torch.exp(
torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment
)
scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
signal = F.pad(signal, [0, 0, 0, channels % 2])
signal = signal.view(1, channels, length)
return signal
def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
b, channels, length = x.size()
signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
return x + signal.to(dtype=x.dtype, device=x.device)
def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
b, channels, length = x.size()
signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
def subsequent_mask(length):
mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
return mask
@torch.jit.script
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
n_channels_int = n_channels[0]
in_act = input_a + input_b
t_act = torch.tanh(in_act[:, :n_channels_int, :])
s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
acts = t_act * s_act
return acts
def convert_pad_shape(pad_shape):
l = pad_shape[::-1]
pad_shape = [item for sublist in l for item in sublist]
return pad_shape
def shift_1d(x):
x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
return x
def sequence_mask(length, max_length=None):
if max_length is None:
max_length = length.max()
x = torch.arange(max_length, dtype=length.dtype, device=length.device)
return x.unsqueeze(0) < length.unsqueeze(1)
def avg_with_mask(x, mask):
assert mask.dtype == torch.float, "Mask should be float"
if mask.ndim == 2:
mask = mask.unsqueeze(1)
if mask.shape[1] == 1:
mask = mask.expand_as(x)
return (x * mask).sum() / mask.sum()
def generate_path(duration, mask):
"""
duration: [b, 1, t_x]
mask: [b, 1, t_y, t_x]
"""
device = duration.device
b, _, t_y, t_x = mask.shape
cum_duration = torch.cumsum(duration, -1)
cum_duration_flat = cum_duration.view(b * t_x)
path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
path = path.view(b, t_x, t_y)
path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
path = path.unsqueeze(1).transpose(2, 3) * mask
return path
def clip_grad_value_(parameters, clip_value, norm_type=2):
if isinstance(parameters, torch.Tensor):
parameters = [parameters]
parameters = list(filter(lambda p: p.grad is not None, parameters))
norm_type = float(norm_type)
if clip_value is not None:
clip_value = float(clip_value)
total_norm = 0
for p in parameters:
param_norm = p.grad.data.norm(norm_type)
total_norm += param_norm.item() ** norm_type
if clip_value is not None:
p.grad.data.clamp_(min=-clip_value, max=clip_value)
total_norm = total_norm ** (1.0 / norm_type)
return total_norm
def log_norm(x, mean=-4, std=4, dim=2):
"""
normalized log mel -> mel -> norm -> log(norm)
"""
x = torch.log(torch.exp(x * std + mean).norm(dim=dim))
return x
def load_F0_models(path):
# load F0 model
from .JDC.model import JDCNet
F0_model = JDCNet(num_class=1, seq_len=192)
params = torch.load(path, map_location="cpu")["net"]
F0_model.load_state_dict(params)
_ = F0_model.train()
return F0_model
def modify_w2v_forward(self, output_layer=15):
"""
Replaces the forward method of the w2v encoder so it returns the output of an intermediate layer.
:param self:
:param output_layer: index of the last encoder layer to run (its output is returned)
:return: the patched forward function
"""
from transformers.modeling_outputs import BaseModelOutput
def forward(
hidden_states,
attention_mask=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
conv_attention_mask = attention_mask
if attention_mask is not None:
# make sure padded tokens output 0
hidden_states = hidden_states.masked_fill(
~attention_mask.bool().unsqueeze(-1), 0.0
)
# extend attention_mask
attention_mask = 1.0 - attention_mask[:, None, None, :].to(
dtype=hidden_states.dtype
)
attention_mask = attention_mask * torch.finfo(hidden_states.dtype).min
attention_mask = attention_mask.expand(
attention_mask.shape[0],
1,
attention_mask.shape[-1],
attention_mask.shape[-1],
)
hidden_states = self.dropout(hidden_states)
if self.embed_positions is not None:
relative_position_embeddings = self.embed_positions(hidden_states)
else:
relative_position_embeddings = None
deepspeed_zero3_is_enabled = False
for i, layer in enumerate(self.layers):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
# add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
dropout_probability = torch.rand([])
skip_the_layer = (
True
if self.training and (dropout_probability < self.config.layerdrop)
else False
)
if not skip_the_layer or deepspeed_zero3_is_enabled:
# under deepspeed zero3 all gpus must run in sync
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
layer.__call__,
hidden_states,
attention_mask,
relative_position_embeddings,
output_attentions,
conv_attention_mask,
)
else:
layer_outputs = layer(
hidden_states,
attention_mask=attention_mask,
relative_position_embeddings=relative_position_embeddings,
output_attentions=output_attentions,
conv_attention_mask=conv_attention_mask,
)
hidden_states = layer_outputs[0]
if skip_the_layer:
layer_outputs = (None, None)
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
if i == output_layer - 1:
break
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(
v
for v in [hidden_states, all_hidden_states, all_self_attentions]
if v is not None
)
return BaseModelOutput(
last_hidden_state=hidden_states,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
)
return forward
MATPLOTLIB_FLAG = False
def plot_spectrogram_to_numpy(spectrogram):
global MATPLOTLIB_FLAG
if not MATPLOTLIB_FLAG:
import matplotlib
import logging
matplotlib.use("Agg")
MATPLOTLIB_FLAG = True
mpl_logger = logging.getLogger("matplotlib")
mpl_logger.setLevel(logging.WARNING)
import matplotlib.pylab as plt
import numpy as np
fig, ax = plt.subplots(figsize=(10, 2))
im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
plt.colorbar(im, ax=ax)
plt.xlabel("Frames")
plt.ylabel("Channels")
plt.tight_layout()
fig.canvas.draw()
data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
plt.close()
return data
def normalize_f0(f0_sequence):
# Remove unvoiced frames (replace with -1)
voiced_indices = np.where(f0_sequence > 0)[0]
f0_voiced = f0_sequence[voiced_indices]
# Convert to log scale
log_f0 = np.log2(f0_voiced)
# Calculate mean and standard deviation
mean_f0 = np.mean(log_f0)
std_f0 = np.std(log_f0)
# Normalize the F0 sequence
normalized_f0 = (log_f0 - mean_f0) / std_f0
# Create the normalized F0 sequence with unvoiced frames
normalized_sequence = np.zeros_like(f0_sequence)
normalized_sequence[voiced_indices] = normalized_f0
normalized_sequence[f0_sequence <= 0] = -1 # Assign -1 to unvoiced frames
return normalized_sequence
class MyModel(nn.Module):
def __init__(self,args):
super(MyModel, self).__init__()
from modules.flow_matching import CFM
from modules.length_regulator import InterpolateRegulator
length_regulator = InterpolateRegulator(
channels=args.length_regulator.channels,
sampling_ratios=args.length_regulator.sampling_ratios,
is_discrete=args.length_regulator.is_discrete,
in_channels=args.length_regulator.in_channels if hasattr(args.length_regulator, "in_channels") else None,
vector_quantize=args.length_regulator.vector_quantize if hasattr(args.length_regulator, "vector_quantize") else False,
codebook_size=args.length_regulator.content_codebook_size,
n_codebooks=args.length_regulator.n_codebooks if hasattr(args.length_regulator, "n_codebooks") else 1,
quantizer_dropout=args.length_regulator.quantizer_dropout if hasattr(args.length_regulator, "quantizer_dropout") else 0.0,
f0_condition=args.length_regulator.f0_condition if hasattr(args.length_regulator, "f0_condition") else False,
n_f0_bins=args.length_regulator.n_f0_bins if hasattr(args.length_regulator, "n_f0_bins") else 512,
)
self.models = nn.ModuleDict({
'cfm': CFM(args),
'length_regulator': length_regulator
})
def forward(self, x, target_lengths, prompt_len, cond, y):
x = self.models['cfm'](x, target_lengths, prompt_len, cond, y)
return x
def forward2(self, S_ori,target_lengths,F0_ori):
x = self.models['length_regulator'](S_ori, ylens=target_lengths, f0=F0_ori)
return x
def build_model(args, stage="DiT"):
if stage == "DiT":
from modules.flow_matching import CFM
from modules.length_regulator import InterpolateRegulator
length_regulator = InterpolateRegulator(
channels=args.length_regulator.channels,
sampling_ratios=args.length_regulator.sampling_ratios,
is_discrete=args.length_regulator.is_discrete,
in_channels=args.length_regulator.in_channels if hasattr(args.length_regulator, "in_channels") else None,
vector_quantize=args.length_regulator.vector_quantize if hasattr(args.length_regulator, "vector_quantize") else False,
codebook_size=args.length_regulator.content_codebook_size,
n_codebooks=args.length_regulator.n_codebooks if hasattr(args.length_regulator, "n_codebooks") else 1,
quantizer_dropout=args.length_regulator.quantizer_dropout if hasattr(args.length_regulator, "quantizer_dropout") else 0.0,
f0_condition=args.length_regulator.f0_condition if hasattr(args.length_regulator, "f0_condition") else False,
n_f0_bins=args.length_regulator.n_f0_bins if hasattr(args.length_regulator, "n_f0_bins") else 512,
)
cfm = CFM(args)
nets = Munch(
cfm=cfm,
length_regulator=length_regulator,
)
elif stage == 'codec':
from dac.model.dac import Encoder
from modules.quantize import (
FAquantizer,
)
encoder = Encoder(
d_model=args.DAC.encoder_dim,
strides=args.DAC.encoder_rates,
d_latent=1024,
causal=args.causal,
lstm=args.lstm,
)
quantizer = FAquantizer(
in_dim=1024,
n_p_codebooks=1,
n_c_codebooks=args.n_c_codebooks,
n_t_codebooks=2,
n_r_codebooks=3,
codebook_size=1024,
codebook_dim=8,
quantizer_dropout=0.5,
causal=args.causal,
separate_prosody_encoder=args.separate_prosody_encoder,
timbre_norm=args.timbre_norm,
)
nets = Munch(
encoder=encoder,
quantizer=quantizer,
)
elif stage == "mel_vocos":
from modules.vocos import Vocos
decoder = Vocos(args)
nets = Munch(
decoder=decoder,
)
else:
raise ValueError(f"Unknown stage: {stage}")
return nets
def load_checkpoint(
model,
optimizer,
path,
load_only_params=True,
ignore_modules=[],
is_distributed=False,
load_ema=False,
):
state = torch.load(path, map_location="cpu")
params = state["net"]
if load_ema and "ema" in state:
print("Loading EMA")
for key in model:
i = 0
for param_name in params[key]:
if "input_pos" in param_name:
continue
assert params[key][param_name].shape == state["ema"][key][0][i].shape
params[key][param_name] = state["ema"][key][0][i].clone()
i += 1
for key in model:
if key in params and key not in ignore_modules:
if not is_distributed:
# strip prefix of DDP (module.), create a new OrderedDict that does not contain the prefix
for k in list(params[key].keys()):
if k.startswith("module."):
params[key][k[len("module.") :]] = params[key][k]
del params[key][k]
model_state_dict = model[key].state_dict()
# keep only the key/value pairs whose shapes match the model's state dict
filtered_state_dict = {
k: v
for k, v in params[key].items()
if k in model_state_dict and v.shape == model_state_dict[k].shape
}
skipped_keys = set(params[key].keys()) - set(filtered_state_dict.keys())
if skipped_keys:
print(
f"Warning: Skipped loading some keys due to shape mismatch: {skipped_keys}"
)
print("%s loaded" % key)
model[key].load_state_dict(filtered_state_dict, strict=False)
_ = [model[key].eval() for key in model]
if not load_only_params:
epoch = state["epoch"] + 1
iters = state["iters"]
optimizer.load_state_dict(state["optimizer"])
optimizer.load_scheduler_state_dict(state["scheduler"])
else:
epoch = 0
iters = 0
return model, optimizer, epoch, iters
def load_checkpoint2(
model,
optimizer,
path,
load_only_params=True,
ignore_modules=[],
is_distributed=False,
load_ema=False,
):
state = torch.load(path, map_location="cpu")
params = state["net"]
if load_ema and "ema" in state:
print("Loading EMA")
for key in model.models:
i = 0
for param_name in params[key]:
if "input_pos" in param_name:
continue
assert params[key][param_name].shape == state["ema"][key][0][i].shape
params[key][param_name] = state["ema"][key][0][i].clone()
i += 1
for key in model.models:
if key in params and key not in ignore_modules:
if not is_distributed:
# strip prefix of DDP (module.), create a new OrderedDict that does not contain the prefix
for k in list(params[key].keys()):
if k.startswith("module."):
params[key][k[len("module.") :]] = params[key][k]
del params[key][k]
model_state_dict = model.models[key].state_dict()
# keep only the key/value pairs whose shapes match the model's state dict
filtered_state_dict = {
k: v
for k, v in params[key].items()
if k in model_state_dict and v.shape == model_state_dict[k].shape
}
skipped_keys = set(params[key].keys()) - set(filtered_state_dict.keys())
if skipped_keys:
print(
f"Warning: Skipped loading some keys due to shape mismatch: {skipped_keys}"
)
print("%s loaded" % key)
model.models[key].load_state_dict(filtered_state_dict, strict=False)
model.eval()
# _ = [model[key].eval() for key in model]
if not load_only_params:
epoch = state["epoch"] + 1
iters = state["iters"]
optimizer.load_state_dict(state["optimizer"])
optimizer.load_scheduler_state_dict(state["scheduler"])
else:
epoch = 0
iters = 0
return model, optimizer, epoch, iters
def recursive_munch(d):
if isinstance(d, dict):
return Munch((k, recursive_munch(v)) for k, v in d.items())
elif isinstance(d, list):
return [recursive_munch(v) for v in d]
else:
return d
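# Minimal sketch of two helpers above: sequence_mask builds a boolean padding mask
# from per-item lengths, and recursive_munch gives attribute access to a nested config.
_mask = sequence_mask(torch.tensor([3, 5]))  # (2, 5) bool, True inside each sequence
_cfg = recursive_munch({"model": {"dim": 256}})
assert _mask.shape == (2, 5) and _cfg.model.dim == 256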

View File

@@ -0,0 +1,258 @@
import torch
from torch import nn
import math
from modules.gpt_fast.model import ModelArgs, Transformer
# from modules.torchscript_modules.gpt_fast_model import ModelArgs, Transformer
from modules.wavenet import WN
from modules.commons import sequence_mask
from torch.nn.utils import weight_norm
def modulate(x, shift, scale):
return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
#################################################################################
# Embedding Layers for Timesteps and Class Labels #
#################################################################################
class TimestepEmbedder(nn.Module):
"""
Embeds scalar timesteps into vector representations.
"""
def __init__(self, hidden_size, frequency_embedding_size=256):
super().__init__()
self.mlp = nn.Sequential(
nn.Linear(frequency_embedding_size, hidden_size, bias=True),
nn.SiLU(),
nn.Linear(hidden_size, hidden_size, bias=True),
)
self.frequency_embedding_size = frequency_embedding_size
self.max_period = 10000
self.scale = 1000
half = frequency_embedding_size // 2
freqs = torch.exp(
-math.log(self.max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
)
self.register_buffer("freqs", freqs)
def timestep_embedding(self, t):
"""
Create sinusoidal timestep embeddings.
:param t: a 1-D Tensor of N indices, one per batch element.
These may be fractional.
:param dim: the dimension of the output.
:param max_period: controls the minimum frequency of the embeddings.
:return: an (N, D) Tensor of positional embeddings.
"""
# https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
args = self.scale * t[:, None].float() * self.freqs[None]
embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
if self.frequency_embedding_size % 2:
embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
return embedding
def forward(self, t):
t_freq = self.timestep_embedding(t)
t_emb = self.mlp(t_freq)
return t_emb
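# Minimal usage sketch: embed a batch of (possibly fractional) diffusion timesteps
# into fixed-size vectors via sinusoidal features followed by an MLP.
_t_emb = TimestepEmbedder(hidden_size=512)(torch.rand(4))  # (4, 512)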
class StyleEmbedder(nn.Module):
"""
Embeds class labels into vector representations. Also handles label dropout for classifier-free guidance.
"""
def __init__(self, input_size, hidden_size, dropout_prob):
super().__init__()
use_cfg_embedding = dropout_prob > 0
self.embedding_table = nn.Embedding(int(use_cfg_embedding), hidden_size)
self.style_in = weight_norm(nn.Linear(input_size, hidden_size, bias=True))
self.input_size = input_size
self.dropout_prob = dropout_prob
def forward(self, labels, train, force_drop_ids=None):
use_dropout = self.dropout_prob > 0
if (train and use_dropout) or (force_drop_ids is not None):
labels = self.token_drop(labels, force_drop_ids)
else:
labels = self.style_in(labels)
embeddings = labels
return embeddings
class FinalLayer(nn.Module):
"""
The final layer of DiT.
"""
def __init__(self, hidden_size, patch_size, out_channels):
super().__init__()
self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
self.linear = weight_norm(nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True))
self.adaLN_modulation = nn.Sequential(
nn.SiLU(),
nn.Linear(hidden_size, 2 * hidden_size, bias=True)
)
def forward(self, x, c):
shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
x = modulate(self.norm_final(x), shift, scale)
x = self.linear(x)
return x
class DiT(torch.nn.Module):
def __init__(
self,
args
):
super(DiT, self).__init__()
self.time_as_token = args.DiT.time_as_token if hasattr(args.DiT, 'time_as_token') else False
self.style_as_token = args.DiT.style_as_token if hasattr(args.DiT, 'style_as_token') else False
self.uvit_skip_connection = args.DiT.uvit_skip_connection if hasattr(args.DiT, 'uvit_skip_connection') else False
model_args = ModelArgs(
block_size=16384,#args.DiT.block_size,
n_layer=args.DiT.depth,
n_head=args.DiT.num_heads,
dim=args.DiT.hidden_dim,
head_dim=args.DiT.hidden_dim // args.DiT.num_heads,
vocab_size=1024,
uvit_skip_connection=self.uvit_skip_connection,
time_as_token=self.time_as_token,
)
self.transformer = Transformer(model_args)
self.in_channels = args.DiT.in_channels
self.out_channels = args.DiT.in_channels
self.num_heads = args.DiT.num_heads
self.x_embedder = weight_norm(nn.Linear(args.DiT.in_channels, args.DiT.hidden_dim, bias=True))
self.content_type = args.DiT.content_type # 'discrete' or 'continuous'
self.content_codebook_size = args.DiT.content_codebook_size # for discrete content
self.content_dim = args.DiT.content_dim # for continuous content
self.cond_embedder = nn.Embedding(args.DiT.content_codebook_size, args.DiT.hidden_dim) # discrete content
self.cond_projection = nn.Linear(args.DiT.content_dim, args.DiT.hidden_dim, bias=True) # continuous content
self.is_causal = args.DiT.is_causal
self.t_embedder = TimestepEmbedder(args.DiT.hidden_dim)
# self.style_embedder1 = weight_norm(nn.Linear(1024, args.DiT.hidden_dim, bias=True))
# self.style_embedder2 = weight_norm(nn.Linear(1024, args.style_encoder.dim, bias=True))
input_pos = torch.arange(16384)
self.register_buffer("input_pos", input_pos)
self.final_layer_type = args.DiT.final_layer_type # mlp or wavenet
if self.final_layer_type == 'wavenet':
self.t_embedder2 = TimestepEmbedder(args.wavenet.hidden_dim)
self.conv1 = nn.Linear(args.DiT.hidden_dim, args.wavenet.hidden_dim)
self.conv2 = nn.Conv1d(args.wavenet.hidden_dim, args.DiT.in_channels, 1)
self.wavenet = WN(hidden_channels=args.wavenet.hidden_dim,
kernel_size=args.wavenet.kernel_size,
dilation_rate=args.wavenet.dilation_rate,
n_layers=args.wavenet.num_layers,
gin_channels=args.wavenet.hidden_dim,
p_dropout=args.wavenet.p_dropout,
causal=False)
self.final_layer = FinalLayer(args.wavenet.hidden_dim, 1, args.wavenet.hidden_dim)
self.res_projection = nn.Linear(args.DiT.hidden_dim,
args.wavenet.hidden_dim)  # residual connection from transformer output to final output
self.wavenet_style_condition = args.wavenet.style_condition
assert args.DiT.style_condition == args.wavenet.style_condition
else:
self.final_mlp = nn.Sequential(
nn.Linear(args.DiT.hidden_dim, args.DiT.hidden_dim),
nn.SiLU(),
nn.Linear(args.DiT.hidden_dim, args.DiT.in_channels),
)
self.transformer_style_condition = args.DiT.style_condition
self.class_dropout_prob = args.DiT.class_dropout_prob
self.content_mask_embedder = nn.Embedding(1, args.DiT.hidden_dim)
self.long_skip_connection = args.DiT.long_skip_connection
self.skip_linear = nn.Linear(args.DiT.hidden_dim + args.DiT.in_channels, args.DiT.hidden_dim)
self.cond_x_merge_linear = nn.Linear(args.DiT.hidden_dim + args.DiT.in_channels * 2 +
args.style_encoder.dim * self.transformer_style_condition * (not self.style_as_token),
args.DiT.hidden_dim)
if self.style_as_token:
self.style_in = nn.Linear(args.style_encoder.dim, args.DiT.hidden_dim)
def setup_caches(self, max_batch_size, max_seq_length):
self.transformer.setup_caches(max_batch_size, max_seq_length, use_kv_cache=False)
def forward(self, x, prompt_x, x_lens, t, style, cond, mask_content=False):
"""
x (torch.Tensor): random noise
prompt_x (torch.Tensor): reference mel + zero mel
shape: (batch_size, 80, 795+1068)
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
t (torch.Tensor): random timestep
shape: (batch_size)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
cond (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
"""
class_dropout = False
if self.training and torch.rand(1) < self.class_dropout_prob:
class_dropout = True
if not self.training and mask_content:
class_dropout = True
# cond_in_module = self.cond_embedder if self.content_type == 'discrete' else self.cond_projection
cond_in_module = self.cond_projection
B, _, T = x.size()
t1 = self.t_embedder(t) # (N, D) # t1 [2, 512]
cond = cond_in_module(cond) # cond [2,1863,512]->[2,1863,512]
x = x.transpose(1, 2) # [2,1863,80]
prompt_x = prompt_x.transpose(1, 2) # [2,1863,80]
x_in = torch.cat([x, prompt_x, cond], dim=-1) # 80+80+512=672 [2, 1863, 672]
if self.transformer_style_condition and not self.style_as_token: # True and True
x_in = torch.cat([x_in, style[:, None, :].repeat(1, T, 1)], dim=-1) #[2, 1863, 864]
if class_dropout: #False
x_in[..., self.in_channels:] = x_in[..., self.in_channels:] * 0  # zero out everything after the first in_channels (80) dims
x_in = self.cond_x_merge_linear(x_in) # (N, T, D) [2, 1863, 512]
if self.style_as_token: # False
style = self.style_in(style)
style = torch.zeros_like(style) if class_dropout else style
x_in = torch.cat([style.unsqueeze(1), x_in], dim=1)
if self.time_as_token: # False
x_in = torch.cat([t1.unsqueeze(1), x_in], dim=1)
x_mask = sequence_mask(x_lens + self.style_as_token + self.time_as_token).to(x.device).unsqueeze(1)  # torch.Size([1, 1, 1863])
input_pos = self.input_pos[:x_in.size(1)]  # (T,), values 0..T-1
x_mask_expanded = x_mask[:, None, :].repeat(1, 1, x_in.size(1), 1) if not self.is_causal else None  # torch.Size([1, 1, 1863, 1863])
x_res = self.transformer(x_in, t1.unsqueeze(1), input_pos, x_mask_expanded) # [2, 1863, 512]
x_res = x_res[:, 1:] if self.time_as_token else x_res
x_res = x_res[:, 1:] if self.style_as_token else x_res
if self.long_skip_connection: #True
x_res = self.skip_linear(torch.cat([x_res, x], dim=-1))
if self.final_layer_type == 'wavenet':
x = self.conv1(x_res)
x = x.transpose(1, 2)
t2 = self.t_embedder2(t)
x = self.wavenet(x, x_mask, g=t2.unsqueeze(2)).transpose(1, 2) + self.res_projection(
x_res) # long residual connection
x = self.final_layer(x, t1).transpose(1, 2)
x = self.conv2(x)
else:
x = self.final_mlp(x_res)
x = x.transpose(1, 2)
# x [2,80,1863]
return x

View File

@@ -0,0 +1,171 @@
from abc import ABC
import torch
import torch.nn.functional as F
from modules.diffusion_transformer import DiT
from modules.commons import sequence_mask
from tqdm import tqdm
class BASECFM(torch.nn.Module, ABC):
def __init__(
self,
args,
):
super().__init__()
self.sigma_min = 1e-6
self.estimator = None
self.in_channels = args.DiT.in_channels
self.criterion = torch.nn.MSELoss() if args.reg_loss_type == "l2" else torch.nn.L1Loss()
if hasattr(args.DiT, 'zero_prompt_speech_token'):
self.zero_prompt_speech_token = args.DiT.zero_prompt_speech_token
else:
self.zero_prompt_speech_token = False
@torch.inference_mode()
def inference(self, mu, x_lens, prompt, style, f0, n_timesteps, temperature=1.0, inference_cfg_rate=0.5):
"""Forward diffusion
Args:
mu (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
prompt (torch.Tensor): reference mel
shape: (batch_size, 80, 795)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
f0: None
n_timesteps (int): number of diffusion steps
temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
Returns:
sample: generated mel-spectrogram
shape: (batch_size, 80, mel_timesteps)
"""
B, T = mu.size(0), mu.size(1)
z = torch.randn([B, self.in_channels, T], device=mu.device) * temperature
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device)
# t_span = t_span + (-1) * (torch.cos(torch.pi / 2 * t_span) - 1 + t_span)
return self.solve_euler(z, x_lens, prompt, mu, style, f0, t_span, inference_cfg_rate)
def solve_euler(self, x, x_lens, prompt, mu, style, f0, t_span, inference_cfg_rate=0.5):
"""
Fixed euler solver for ODEs.
Args:
x (torch.Tensor): random noise
t_span (torch.Tensor): n_timesteps interpolated
shape: (n_timesteps + 1,)
mu (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
prompt (torch.Tensor): reference mel
shape: (batch_size, 80, 795)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
"""
t, _, _ = t_span[0], t_span[-1], t_span[1] - t_span[0]
# I am storing this because I can later plot it by putting a debugger here and saving it to a file
# Or in future might add like a return_all_steps flag
sol = []
# apply prompt
prompt_len = prompt.size(-1)
prompt_x = torch.zeros_like(x)
prompt_x[..., :prompt_len] = prompt[..., :prompt_len]
x[..., :prompt_len] = 0
if self.zero_prompt_speech_token:
mu[..., :prompt_len] = 0
for step in tqdm(range(1, len(t_span))):
dt = t_span[step] - t_span[step - 1]
if inference_cfg_rate > 0:
# Stack original and CFG (null) inputs for batched processing
stacked_prompt_x = torch.cat([prompt_x, torch.zeros_like(prompt_x)], dim=0)
stacked_style = torch.cat([style, torch.zeros_like(style)], dim=0)
stacked_mu = torch.cat([mu, torch.zeros_like(mu)], dim=0)
stacked_x = torch.cat([x, x], dim=0)
stacked_t = torch.cat([t.unsqueeze(0), t.unsqueeze(0)], dim=0)
# Perform a single forward pass for both original and CFG inputs
stacked_dphi_dt = self.estimator(
stacked_x, stacked_prompt_x, x_lens, stacked_t, stacked_style, stacked_mu,
)
# Split the output back into the original and CFG components
dphi_dt, cfg_dphi_dt = stacked_dphi_dt.chunk(2, dim=0)
# Apply CFG formula
dphi_dt = (1.0 + inference_cfg_rate) * dphi_dt - inference_cfg_rate * cfg_dphi_dt
else:
dphi_dt = self.estimator(x, prompt_x, x_lens, t.unsqueeze(0), style, mu)
x = x + dt * dphi_dt
t = t + dt
sol.append(x)
if step < len(t_span) - 1:
dt = t_span[step + 1] - t
x[:, :, :prompt_len] = 0
return sol[-1]
def forward(self, x1, x_lens, prompt_lens, mu, style):
"""Computes diffusion loss
Args:
mu (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
x1 (torch.Tensor): target mel-spectrogram
shape: (batch_size, 80, mel_timesteps)
x_lens (torch.Tensor): number of valid mel frames per item
shape: (batch_size,)
prompt (torch.Tensor): reference mel
shape: (batch_size, 80, 795)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
Returns:
loss: conditional flow matching loss
y: conditional flow
shape: (batch_size, n_feats, mel_timesteps)
"""
b, _, t = x1.shape
# random timestep
t = torch.rand([b, 1, 1], device=mu.device, dtype=x1.dtype)
# sample noise p(x_0)
z = torch.randn_like(x1)
y = (1 - (1 - self.sigma_min) * t) * z + t * x1
u = x1 - (1 - self.sigma_min) * z
prompt = torch.zeros_like(x1)
for bib in range(b):
prompt[bib, :, :prompt_lens[bib]] = x1[bib, :, :prompt_lens[bib]]
# the range covered by the prompt is set to 0
y[bib, :, :prompt_lens[bib]] = 0
if self.zero_prompt_speech_token:
mu[bib, :, :prompt_lens[bib]] = 0
estimator_out = self.estimator(y, prompt, x_lens, t.squeeze(1).squeeze(1), style, mu, prompt_lens)
loss = 0
for bib in range(b):
loss += self.criterion(estimator_out[bib, :, prompt_lens[bib]:x_lens[bib]], u[bib, :, prompt_lens[bib]:x_lens[bib]])
loss /= b
return loss, estimator_out + (1 - self.sigma_min) * z
class CFM(BASECFM):
def __init__(self, args):
super().__init__(
args
)
if args.dit_type == "DiT":
self.estimator = DiT(args)
else:
raise NotImplementedError(f"Unknown diffusion type {args.dit_type}")

View File

@@ -0,0 +1,141 @@
from typing import Tuple
import torch
import torch.nn as nn
from torch.nn import functional as F
from modules.commons import sequence_mask
import numpy as np
from dac.nn.quantize import VectorQuantize
# f0_bin = 256
f0_max = 1100.0
f0_min = 50.0
f0_mel_min = 1127 * np.log(1 + f0_min / 700)
f0_mel_max = 1127 * np.log(1 + f0_max / 700)
def f0_to_coarse(f0, f0_bin):
f0_mel = 1127 * (1 + f0 / 700).log()
a = (f0_bin - 2) / (f0_mel_max - f0_mel_min)
b = f0_mel_min * a - 1.
f0_mel = torch.where(f0_mel > 0, f0_mel * a - b, f0_mel)
# torch.clip_(f0_mel, min=1., max=float(f0_bin - 1))
f0_coarse = torch.round(f0_mel).long()
f0_coarse = f0_coarse * (f0_coarse > 0)
f0_coarse = f0_coarse + ((f0_coarse < 1) * 1)
f0_coarse = f0_coarse * (f0_coarse < f0_bin)
f0_coarse = f0_coarse + ((f0_coarse >= f0_bin) * (f0_bin - 1))
return f0_coarse
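# A quick illustrative sketch (assuming n_f0_bins=512, the default used by InterpolateRegulator below):
#   f0 = torch.tensor([[110.0, 220.0, 440.0]])   # pitch contour in Hz, shape (B, T)
#   bins = f0_to_coarse(f0, 512)                  # long tensor of mel-scaled bin indices in [1, 511]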
class InterpolateRegulator(nn.Module):
def __init__(
self,
channels: int,
sampling_ratios: Tuple,
is_discrete: bool = False,
in_channels: int = None, # only applies to continuous input
vector_quantize: bool = False, # whether to use vector quantization, only applies to continuous input
codebook_size: int = 1024, # for discrete only
out_channels: int = None,
groups: int = 1,
n_codebooks: int = 1, # number of codebooks
quantizer_dropout: float = 0.0, # dropout for quantizer
f0_condition: bool = False,
n_f0_bins: int = 512,
):
super().__init__()
self.sampling_ratios = sampling_ratios
out_channels = out_channels or channels
model = nn.ModuleList([])
if len(sampling_ratios) > 0:
self.interpolate = True
for _ in sampling_ratios:
module = nn.Conv1d(channels, channels, 3, 1, 1)
norm = nn.GroupNorm(groups, channels)
act = nn.Mish()
model.extend([module, norm, act])
else:
self.interpolate = False
model.append(
nn.Conv1d(channels, out_channels, 1, 1)
)
self.model = nn.Sequential(*model)
self.embedding = nn.Embedding(codebook_size, channels)
self.is_discrete = is_discrete
self.mask_token = nn.Parameter(torch.zeros(1, channels))
self.n_codebooks = n_codebooks
if n_codebooks > 1:
self.extra_codebooks = nn.ModuleList([
nn.Embedding(codebook_size, channels) for _ in range(n_codebooks - 1)
])
self.extra_codebook_mask_tokens = nn.ParameterList([
nn.Parameter(torch.zeros(1, channels)) for _ in range(n_codebooks - 1)
])
self.quantizer_dropout = quantizer_dropout
if f0_condition:
self.f0_embedding = nn.Embedding(n_f0_bins, channels)
self.f0_condition = f0_condition
self.n_f0_bins = n_f0_bins
self.f0_bins = torch.arange(2, 1024, 1024 // n_f0_bins)
self.f0_mask = nn.Parameter(torch.zeros(1, channels))
else:
self.f0_condition = False
if not is_discrete:
self.content_in_proj = nn.Linear(in_channels, channels)
if vector_quantize:
self.vq = VectorQuantize(channels, codebook_size, 8)
def forward(self, x, ylens=None, n_quantizers=None, f0=None):
# apply token drop
if self.training:
n_quantizers = torch.ones((x.shape[0],)) * self.n_codebooks
dropout = torch.randint(1, self.n_codebooks + 1, (x.shape[0],))
n_dropout = int(x.shape[0] * self.quantizer_dropout)
n_quantizers[:n_dropout] = dropout[:n_dropout]
n_quantizers = n_quantizers.to(x.device)
# decide whether to drop for each sample in batch
else:
n_quantizers = torch.ones((x.shape[0],), device=x.device) * (self.n_codebooks if n_quantizers is None else n_quantizers)
if self.is_discrete:
if self.n_codebooks > 1:
assert len(x.size()) == 3
x_emb = self.embedding(x[:, 0])
for i, emb in enumerate(self.extra_codebooks):
x_emb = x_emb + (n_quantizers > i+1)[..., None, None] * emb(x[:, i+1])
# add mask token if not using this codebook
# x_emb = x_emb + (n_quantizers <= i+1)[..., None, None] * self.extra_codebook_mask_tokens[i]
x = x_emb
elif self.n_codebooks == 1:
if len(x.size()) == 2:
x = self.embedding(x)
else:
x = self.embedding(x[:, 0])
else:
x = self.content_in_proj(x)
# x in (B, T, D)
mask = sequence_mask(ylens).unsqueeze(-1)
if self.interpolate:
x = F.interpolate(x.transpose(1, 2).contiguous(), size=ylens.max(), mode='nearest')
else:
x = x.transpose(1, 2).contiguous()
mask = mask[:, :x.size(2), :]
ylens = ylens.clamp(max=x.size(2)).long()
if self.f0_condition:
if f0 is None:
x = x + self.f0_mask.unsqueeze(-1)
else:
#quantized_f0 = torch.bucketize(f0, self.f0_bins.to(f0.device)) # (N, T)
quantized_f0 = f0_to_coarse(f0, self.n_f0_bins)
quantized_f0 = quantized_f0.clamp(0, self.n_f0_bins - 1).long()
f0_emb = self.f0_embedding(quantized_f0)
f0_emb = F.interpolate(f0_emb.transpose(1, 2).contiguous(), size=ylens.max(), mode='nearest')
x = x + f0_emb
out = self.model(x).transpose(1, 2).contiguous()
if hasattr(self, 'vq'):
out_q, commitment_loss, codebook_loss, codes, out, = self.vq(out.transpose(1, 2))
out_q = out_q.transpose(1, 2)
return out_q * mask, ylens, codes, commitment_loss, codebook_loss
olens = ylens
return out * mask, olens, None, None, None
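
A hypothetical usage sketch for the regulator above (shapes are illustrative and assume the class definition plus its imports are in scope; the interpolation branch is not exercised here):

import torch

reg = InterpolateRegulator(channels=256, sampling_ratios=(), is_discrete=True, codebook_size=1024).eval()
tokens = torch.randint(0, 1024, (2, 50))      # (B, T) discrete speech tokens
ylens = torch.tensor([50, 42])                # valid lengths per item
with torch.no_grad():
    out, olens, _, _, _ = reg(tokens, ylens=ylens)
print(out.shape)                              # (2, 50, 256); frames beyond each length are zero-masked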

View File

@@ -0,0 +1,5 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
from .filter import *
from .resample import *
from .act import *

View File

@@ -0,0 +1,29 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
import torch.nn as nn
from .resample import UpSample1d, DownSample1d
class Activation1d(nn.Module):
def __init__(
self,
activation,
up_ratio: int = 2,
down_ratio: int = 2,
up_kernel_size: int = 12,
down_kernel_size: int = 12,
):
super().__init__()
self.up_ratio = up_ratio
self.down_ratio = down_ratio
self.act = activation
self.upsample = UpSample1d(up_ratio, up_kernel_size)
self.downsample = DownSample1d(down_ratio, down_kernel_size)
# x: [B,C,T]
def forward(self, x):
x = self.upsample(x)
x = self.act(x)
x = self.downsample(x)
return x

View File

@@ -0,0 +1,96 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
if "sinc" in dir(torch):
sinc = torch.sinc
else:
# This code is adopted from adefossez's julius.core.sinc under the MIT License
# https://adefossez.github.io/julius/julius/core.html
def sinc(x: torch.Tensor):
"""
Implementation of sinc, i.e. sin(pi * x) / (pi * x)
__Warning__: Different to julius.sinc, the input is multiplied by `pi`!
"""
return torch.where(
x == 0,
torch.tensor(1.0, device=x.device, dtype=x.dtype),
torch.sin(math.pi * x) / math.pi / x,
)
# This code is adopted from adefossez's julius.lowpass.LowPassFilters under the MIT License
# https://adefossez.github.io/julius/julius/lowpass.html
def kaiser_sinc_filter1d(
cutoff, half_width, kernel_size
): # return filter [1,1,kernel_size]
even = kernel_size % 2 == 0
half_size = kernel_size // 2
# For kaiser window
delta_f = 4 * half_width
A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
if A > 50.0:
beta = 0.1102 * (A - 8.7)
elif A >= 21.0:
beta = 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21.0)
else:
beta = 0.0
window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
# ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
if even:
time = torch.arange(-half_size, half_size) + 0.5
else:
time = torch.arange(kernel_size) - half_size
if cutoff == 0:
filter_ = torch.zeros_like(time)
else:
filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
# Normalize filter to have sum = 1, otherwise we will have a small leakage
# of the constant component in the input signal.
filter_ /= filter_.sum()
filter = filter_.view(1, 1, kernel_size)
return filter
class LowPassFilter1d(nn.Module):
def __init__(
self,
cutoff=0.5,
half_width=0.6,
stride: int = 1,
padding: bool = True,
padding_mode: str = "replicate",
kernel_size: int = 12,
):
# kernel_size should be even number for stylegan3 setup,
# in this implementation, odd number is also possible.
super().__init__()
if cutoff < -0.0:
raise ValueError("Minimum cutoff must be larger than zero.")
if cutoff > 0.5:
raise ValueError("A cutoff above 0.5 does not make sense.")
self.kernel_size = kernel_size
self.even = kernel_size % 2 == 0
self.pad_left = kernel_size // 2 - int(self.even)
self.pad_right = kernel_size // 2
self.stride = stride
self.padding = padding
self.padding_mode = padding_mode
filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
self.register_buffer("filter", filter)
# input [B, C, T]
def forward(self, x):
_, C, _ = x.shape
if self.padding:
x = F.pad(x, (self.pad_left, self.pad_right), mode=self.padding_mode)
out = F.conv1d(x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
return out
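
A hypothetical usage sketch (assumes the definitions above are in scope): with replicate padding the filtered signal keeps its original length.

import math
import torch

lpf = LowPassFilter1d(cutoff=0.25, half_width=0.3, kernel_size=12)
x = torch.sin(torch.linspace(0, 20 * math.pi, 1024)).view(1, 1, -1)  # [B, C, T] test tone
x = x + 0.1 * torch.randn_like(x)                                    # add broadband noise
y = lpf(x)
print(x.shape, y.shape)                                              # both torch.Size([1, 1, 1024])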

View File

@@ -0,0 +1,57 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
import torch.nn as nn
from torch.nn import functional as F
from .filter import LowPassFilter1d
from .filter import kaiser_sinc_filter1d
class UpSample1d(nn.Module):
def __init__(self, ratio=2, kernel_size=None):
super().__init__()
self.ratio = ratio
self.kernel_size = (
int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
)
self.stride = ratio
self.pad = self.kernel_size // ratio - 1
self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
self.pad_right = (
self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
)
filter = kaiser_sinc_filter1d(
cutoff=0.5 / ratio, half_width=0.6 / ratio, kernel_size=self.kernel_size
)
self.register_buffer("filter", filter)
# x: [B, C, T]
def forward(self, x):
_, C, _ = x.shape
x = F.pad(x, (self.pad, self.pad), mode="replicate")
x = self.ratio * F.conv_transpose1d(
x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C
)
x = x[..., self.pad_left : -self.pad_right]
return x
class DownSample1d(nn.Module):
def __init__(self, ratio=2, kernel_size=None):
super().__init__()
self.ratio = ratio
self.kernel_size = (
int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
)
self.lowpass = LowPassFilter1d(
cutoff=0.5 / ratio,
half_width=0.6 / ratio,
stride=ratio,
kernel_size=self.kernel_size,
)
def forward(self, x):
xx = self.lowpass(x)
return xx
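
A hypothetical round-trip sketch (assumes the definitions above are in scope): upsampling by 2 and then downsampling by 2 restores the original length.

import torch

up = UpSample1d(ratio=2)
down = DownSample1d(ratio=2)
x = torch.randn(1, 4, 256)      # [B, C, T]
y = down(up(x))                 # 256 -> 512 -> 256 samples
print(up(x).shape, y.shape)     # torch.Size([1, 4, 512]) torch.Size([1, 4, 256])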

View File

@@ -0,0 +1,82 @@
import numpy as np
import torch
import torch.utils.data
from librosa.filters import mel as librosa_mel_fn
from scipy.io.wavfile import read
MAX_WAV_VALUE = 32768.0
def load_wav(full_path):
sampling_rate, data = read(full_path)
return data, sampling_rate
def dynamic_range_compression(x, C=1, clip_val=1e-5):
return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
def dynamic_range_decompression(x, C=1):
return np.exp(x) / C
def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
return torch.log(torch.clamp(x, min=clip_val) * C)
def dynamic_range_decompression_torch(x, C=1):
return torch.exp(x) / C
def spectral_normalize_torch(magnitudes):
output = dynamic_range_compression_torch(magnitudes)
return output
def spectral_de_normalize_torch(magnitudes):
output = dynamic_range_decompression_torch(magnitudes)
return output
mel_basis = {}
hann_window = {}
def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
# if torch.min(y) < -1.0:
# print("min value is ", torch.min(y))
# if torch.max(y) > 1.0:
# print("max value is ", torch.max(y))
global mel_basis, hann_window # pylint: disable=global-statement
if f"{str(sampling_rate)}_{str(fmax)}_{str(y.device)}" not in mel_basis:
mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
mel_basis[str(sampling_rate) + "_" + str(fmax) + "_" + str(y.device)] = torch.from_numpy(mel).float().to(y.device)
hann_window[str(sampling_rate) + "_" + str(y.device)] = torch.hann_window(win_size).to(y.device)
y = torch.nn.functional.pad(
y.unsqueeze(1), (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), mode="reflect"
)
y = y.squeeze(1)
spec = torch.view_as_real(
torch.stft(
y,
n_fft,
hop_length=hop_size,
win_length=win_size,
window=hann_window[str(sampling_rate) + "_" + str(y.device)],
center=center,
pad_mode="reflect",
normalized=False,
onesided=True,
return_complex=True,
)
)
spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
spec = torch.matmul(mel_basis[str(sampling_rate) + "_" + str(fmax) + "_" + str(y.device)], spec)
spec = spectral_normalize_torch(spec)
return spec
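
A hypothetical usage sketch for the helper above; the STFT/mel parameters mirror the 22.05 kHz BigVGAN config that appears later in this diff and are otherwise assumptions:

import torch

y = torch.randn(1, 22050)                          # one second of audio
mel = mel_spectrogram(y, n_fft=1024, num_mels=80, sampling_rate=22050,
                      hop_size=256, win_size=1024, fmin=0, fmax=None)
print(mel.shape)                                   # torch.Size([1, 80, 86]) log-mel frames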

View File

@@ -0,0 +1,120 @@
# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
# LICENSE is in incl_licenses directory.
import torch
from torch import nn, sin, pow
from torch.nn import Parameter
class Snake(nn.Module):
'''
Implementation of a sine-based periodic activation function
Shape:
- Input: (B, C, T)
- Output: (B, C, T), same shape as the input
Parameters:
- alpha - trainable parameter
References:
- This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
https://arxiv.org/abs/2006.08195
Examples:
>>> a1 = snake(256)
>>> x = torch.randn(256)
>>> x = a1(x)
'''
def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
'''
Initialization.
INPUT:
- in_features: shape of the input
- alpha: trainable parameter
alpha is initialized to 1 by default, higher values = higher-frequency.
alpha will be trained along with the rest of your model.
'''
super(Snake, self).__init__()
self.in_features = in_features
# initialize alpha
self.alpha_logscale = alpha_logscale
if self.alpha_logscale: # log scale alphas initialized to zeros
self.alpha = Parameter(torch.zeros(in_features) * alpha)
else: # linear scale alphas initialized to ones
self.alpha = Parameter(torch.ones(in_features) * alpha)
self.alpha.requires_grad = alpha_trainable
self.no_div_by_zero = 0.000000001
def forward(self, x):
'''
Forward pass of the function.
Applies the function to the input elementwise.
Snake = x + 1/a * sin^2 (xa)
'''
alpha = self.alpha.unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
if self.alpha_logscale:
alpha = torch.exp(alpha)
x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
return x
class SnakeBeta(nn.Module):
'''
A modified Snake function which uses separate parameters for the magnitude of the periodic components
Shape:
- Input: (B, C, T)
- Output: (B, C, T), same shape as the input
Parameters:
- alpha - trainable parameter that controls frequency
- beta - trainable parameter that controls magnitude
References:
- This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
https://arxiv.org/abs/2006.08195
Examples:
>>> a1 = snakebeta(256)
>>> x = torch.randn(256)
>>> x = a1(x)
'''
def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
'''
Initialization.
INPUT:
- in_features: shape of the input
- alpha - trainable parameter that controls frequency
- beta - trainable parameter that controls magnitude
alpha is initialized to 1 by default, higher values = higher-frequency.
beta is initialized to 1 by default, higher values = higher-magnitude.
alpha will be trained along with the rest of your model.
'''
super(SnakeBeta, self).__init__()
self.in_features = in_features
# initialize alpha
self.alpha_logscale = alpha_logscale
if self.alpha_logscale: # log scale alphas initialized to zeros
self.alpha = Parameter(torch.zeros(in_features) * alpha)
self.beta = Parameter(torch.zeros(in_features) * alpha)
else: # linear scale alphas initialized to ones
self.alpha = Parameter(torch.ones(in_features) * alpha)
self.beta = Parameter(torch.ones(in_features) * alpha)
self.alpha.requires_grad = alpha_trainable
self.beta.requires_grad = alpha_trainable
self.no_div_by_zero = 0.000000001
def forward(self, x):
'''
Forward pass of the function.
Applies the function to the input elementwise.
SnakeBeta = x + 1/b * sin^2 (xa)
'''
alpha = self.alpha.unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
beta = self.beta.unsqueeze(0).unsqueeze(-1)
if self.alpha_logscale:
alpha = torch.exp(alpha)
beta = torch.exp(beta)
x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
return x
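
A small numerical check (an illustrative sketch, assuming the definitions above are in scope): with a fixed alpha of 1, Snake reduces to x + sin²(x).

import torch

snake = Snake(in_features=4, alpha=1.0)
x = torch.randn(2, 4, 8)                            # (B, C, T)
ref = x + torch.sin(x) ** 2 / (1.0 + 1e-9)          # formula from forward(), alpha = 1
print(torch.allclose(snake(x), ref))                # True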

View File

@@ -0,0 +1,77 @@
# Copyright (c) 2024 NVIDIA CORPORATION.
# Licensed under the MIT license.
import torch
import torch.nn as nn
from ..torch.resample import UpSample1d, DownSample1d
# load fused CUDA kernel: this enables importing anti_alias_activation_cuda
from ..cuda import load
anti_alias_activation_cuda = load.load()
class FusedAntiAliasActivation(torch.autograd.Function):
"""
Assumes filter size 12, replication padding on upsampling/downsampling, and logscale alpha/beta parameters as inputs.
The hyperparameters are hard-coded in the kernel to maximize speed.
NOTE: The fused kernel is incorrect for Activation1d with different hyperparameters.
"""
@staticmethod
def forward(ctx, inputs, up_ftr, down_ftr, alpha, beta):
activation_results = anti_alias_activation_cuda.forward(
inputs, up_ftr, down_ftr, alpha, beta
)
return activation_results
@staticmethod
def backward(ctx, output_grads):
raise NotImplementedError
return output_grads, None, None
class Activation1d(nn.Module):
def __init__(
self,
activation,
up_ratio: int = 2,
down_ratio: int = 2,
up_kernel_size: int = 12,
down_kernel_size: int = 12,
fused: bool = True,
):
super().__init__()
self.up_ratio = up_ratio
self.down_ratio = down_ratio
self.act = activation
self.upsample = UpSample1d(up_ratio, up_kernel_size)
self.downsample = DownSample1d(down_ratio, down_kernel_size)
self.fused = fused # Whether to use fused CUDA kernel or not
def forward(self, x):
if not self.fused:
x = self.upsample(x)
x = self.act(x)
x = self.downsample(x)
return x
else:
if self.act.__class__.__name__ == "Snake":
beta = self.act.alpha.data # Snake uses same params for alpha and beta
else:
beta = (
self.act.beta.data
) # Snakebeta uses different params for alpha and beta
alpha = self.act.alpha.data
if (
not self.act.alpha_logscale
): # Exp baked into cuda kernel, cancel it out with a log
alpha = torch.log(alpha)
beta = torch.log(beta)
x = FusedAntiAliasActivation.apply(
x, self.upsample.filter, self.downsample.lowpass.filter, alpha, beta
)
return x
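
A hypothetical parity check between the fused and plain paths; it requires a CUDA device with nvcc and ninja available so the extension can compile, and it assumes the Snake activation from the activations module shown earlier is importable here:

import torch

snake = Snake(64, alpha_logscale=True).cuda()              # log-scale alpha/beta, as the fused kernel expects
fused_act = Activation1d(activation=snake, fused=True).cuda()
plain_act = Activation1d(activation=snake, fused=False).cuda()
x = torch.randn(1, 64, 4096, device="cuda")
print((fused_act(x) - plain_act(x)).abs().max())           # expected to be small if the kernel built correctly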

View File

@@ -0,0 +1,23 @@
/* coding=utf-8
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <torch/extension.h>
extern "C" torch::Tensor fwd_cuda(torch::Tensor const &input, torch::Tensor const &up_filter, torch::Tensor const &down_filter, torch::Tensor const &alpha, torch::Tensor const &beta);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("forward", &fwd_cuda, "Anti-Alias Activation forward (CUDA)");
}

View File

@@ -0,0 +1,246 @@
/* coding=utf-8
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <ATen/ATen.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_profiler_api.h>
#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>
#include "type_shim.h"
#include <assert.h>
#include <cfloat>
#include <limits>
#include <stdint.h>
#include <c10/macros/Macros.h>
namespace
{
// Hard-coded hyperparameters
// WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and
constexpr int ELEMENTS_PER_LDG_STG = 1; //(WARP_ITERATIONS < 4) ? 1 : 4;
constexpr int BUFFER_SIZE = 32;
constexpr int FILTER_SIZE = 12;
constexpr int HALF_FILTER_SIZE = 6;
constexpr int UPSAMPLE_REPLICATION_PAD = 5; // 5 on each side, matching torch impl
constexpr int DOWNSAMPLE_REPLICATION_PAD_LEFT = 5; // matching torch impl
constexpr int DOWNSAMPLE_REPLICATION_PAD_RIGHT = 6; // matching torch impl
template <typename input_t, typename output_t, typename acc_t>
__global__ void anti_alias_activation_forward(
output_t *dst,
const input_t *src,
const input_t *up_ftr,
const input_t *down_ftr,
const input_t *alpha,
const input_t *beta,
int batch_size,
int channels,
int seq_len)
{
// Up and downsample filters
input_t up_filter[FILTER_SIZE];
input_t down_filter[FILTER_SIZE];
// Load data from global memory including extra indices reserved for replication paddings
input_t elements[2 * FILTER_SIZE + 2 * BUFFER_SIZE + 2 * UPSAMPLE_REPLICATION_PAD] = {0};
input_t intermediates[2 * FILTER_SIZE + 2 * BUFFER_SIZE + DOWNSAMPLE_REPLICATION_PAD_LEFT + DOWNSAMPLE_REPLICATION_PAD_RIGHT] = {0};
// Output stores downsampled output before writing to dst
output_t output[BUFFER_SIZE];
// blockDim/threadIdx = (128, 1, 1)
// gridDim/blockIdx = (seq_blocks, channels, batches)
int block_offset = (blockIdx.x * 128 * BUFFER_SIZE + seq_len * (blockIdx.y + gridDim.y * blockIdx.z));
int local_offset = threadIdx.x * BUFFER_SIZE;
int seq_offset = blockIdx.x * 128 * BUFFER_SIZE + local_offset;
// intermediates have double the seq_len
int intermediate_local_offset = threadIdx.x * BUFFER_SIZE * 2;
int intermediate_seq_offset = blockIdx.x * 128 * BUFFER_SIZE * 2 + intermediate_local_offset;
// Get values needed for replication padding before moving pointer
const input_t *right_most_pntr = src + (seq_len * (blockIdx.y + gridDim.y * blockIdx.z));
input_t seq_left_most_value = right_most_pntr[0];
input_t seq_right_most_value = right_most_pntr[seq_len - 1];
// Move src and dst pointers
src += block_offset + local_offset;
dst += block_offset + local_offset;
// Alpha and beta values for snake activations. exp() is applied by default
alpha = alpha + blockIdx.y;
input_t alpha_val = expf(alpha[0]);
beta = beta + blockIdx.y;
input_t beta_val = expf(beta[0]);
#pragma unroll
for (int it = 0; it < FILTER_SIZE; it += 1)
{
up_filter[it] = up_ftr[it];
down_filter[it] = down_ftr[it];
}
// Apply replication padding for upsampling, matching torch impl
#pragma unroll
for (int it = -HALF_FILTER_SIZE; it < BUFFER_SIZE + HALF_FILTER_SIZE; it += 1)
{
int element_index = seq_offset + it; // index for element
if ((element_index < 0) && (element_index >= -UPSAMPLE_REPLICATION_PAD))
{
elements[2 * (HALF_FILTER_SIZE + it)] = 2 * seq_left_most_value;
}
if ((element_index >= seq_len) && (element_index < seq_len + UPSAMPLE_REPLICATION_PAD))
{
elements[2 * (HALF_FILTER_SIZE + it)] = 2 * seq_right_most_value;
}
if ((element_index >= 0) && (element_index < seq_len))
{
elements[2 * (HALF_FILTER_SIZE + it)] = 2 * src[it];
}
}
// Apply upsampling strided convolution and write to intermediates. It reserves DOWNSAMPLE_REPLICATION_PAD_LEFT for replication padding of the downsampling conv later
#pragma unroll
for (int it = 0; it < (2 * BUFFER_SIZE + 2 * FILTER_SIZE); it += 1)
{
input_t acc = 0.0;
int element_index = intermediate_seq_offset + it; // index for intermediate
#pragma unroll
for (int f_idx = 0; f_idx < FILTER_SIZE; f_idx += 1)
{
if ((element_index + f_idx) >= 0)
{
acc += up_filter[f_idx] * elements[it + f_idx];
}
}
intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] = acc;
}
// Apply activation function. It reserves DOWNSAMPLE_REPLICATION_PAD_LEFT and DOWNSAMPLE_REPLICATION_PAD_RIGHT for replication padding of the downsampling conv later
double no_div_by_zero = 0.000000001;
#pragma unroll
for (int it = 0; it < 2 * BUFFER_SIZE + 2 * FILTER_SIZE; it += 1)
{
intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] += (1.0 / (beta_val + no_div_by_zero)) * sinf(intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] * alpha_val) * sinf(intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] * alpha_val);
}
// Apply replication padding before downsampling conv from intermediates
#pragma unroll
for (int it = 0; it < DOWNSAMPLE_REPLICATION_PAD_LEFT; it += 1)
{
intermediates[it] = intermediates[DOWNSAMPLE_REPLICATION_PAD_LEFT];
}
#pragma unroll
for (int it = DOWNSAMPLE_REPLICATION_PAD_LEFT + 2 * BUFFER_SIZE + 2 * FILTER_SIZE; it < DOWNSAMPLE_REPLICATION_PAD_LEFT + 2 * BUFFER_SIZE + 2 * FILTER_SIZE + DOWNSAMPLE_REPLICATION_PAD_RIGHT; it += 1)
{
intermediates[it] = intermediates[DOWNSAMPLE_REPLICATION_PAD_LEFT + 2 * BUFFER_SIZE + 2 * FILTER_SIZE - 1];
}
// Apply downsample strided convolution (assuming stride=2) from intermediates
#pragma unroll
for (int it = 0; it < BUFFER_SIZE; it += 1)
{
input_t acc = 0.0;
#pragma unroll
for (int f_idx = 0; f_idx < FILTER_SIZE; f_idx += 1)
{
// Add constant DOWNSAMPLE_REPLICATION_PAD_RIGHT to match torch implementation
acc += down_filter[f_idx] * intermediates[it * 2 + f_idx + DOWNSAMPLE_REPLICATION_PAD_RIGHT];
}
output[it] = acc;
}
// Write output to dst
#pragma unroll
for (int it = 0; it < BUFFER_SIZE; it += ELEMENTS_PER_LDG_STG)
{
int element_index = seq_offset + it;
if (element_index < seq_len)
{
dst[it] = output[it];
}
}
}
template <typename input_t, typename output_t, typename acc_t>
void dispatch_anti_alias_activation_forward(
output_t *dst,
const input_t *src,
const input_t *up_ftr,
const input_t *down_ftr,
const input_t *alpha,
const input_t *beta,
int batch_size,
int channels,
int seq_len)
{
if (seq_len == 0)
{
return;
}
else
{
// Use 128 threads per block to maximize GPU utilization
constexpr int threads_per_block = 128;
constexpr int seq_len_per_block = 4096;
int blocks_per_seq_len = (seq_len + seq_len_per_block - 1) / seq_len_per_block;
dim3 blocks(blocks_per_seq_len, channels, batch_size);
dim3 threads(threads_per_block, 1, 1);
anti_alias_activation_forward<input_t, output_t, acc_t>
<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(dst, src, up_ftr, down_ftr, alpha, beta, batch_size, channels, seq_len);
}
}
}
extern "C" torch::Tensor fwd_cuda(torch::Tensor const &input, torch::Tensor const &up_filter, torch::Tensor const &down_filter, torch::Tensor const &alpha, torch::Tensor const &beta)
{
// Input is a 3d tensor with dimensions [batches, channels, seq_len]
const int batches = input.size(0);
const int channels = input.size(1);
const int seq_len = input.size(2);
// Output
auto act_options = input.options().requires_grad(false);
torch::Tensor anti_alias_activation_results =
torch::empty({batches, channels, seq_len}, act_options);
void *input_ptr = static_cast<void *>(input.data_ptr());
void *up_filter_ptr = static_cast<void *>(up_filter.data_ptr());
void *down_filter_ptr = static_cast<void *>(down_filter.data_ptr());
void *alpha_ptr = static_cast<void *>(alpha.data_ptr());
void *beta_ptr = static_cast<void *>(beta.data_ptr());
void *anti_alias_activation_results_ptr = static_cast<void *>(anti_alias_activation_results.data_ptr());
DISPATCH_FLOAT_HALF_AND_BFLOAT(
input.scalar_type(),
"dispatch anti alias activation_forward",
dispatch_anti_alias_activation_forward<scalar_t, scalar_t, float>(
reinterpret_cast<scalar_t *>(anti_alias_activation_results_ptr),
reinterpret_cast<const scalar_t *>(input_ptr),
reinterpret_cast<const scalar_t *>(up_filter_ptr),
reinterpret_cast<const scalar_t *>(down_filter_ptr),
reinterpret_cast<const scalar_t *>(alpha_ptr),
reinterpret_cast<const scalar_t *>(beta_ptr),
batches,
channels,
seq_len););
return anti_alias_activation_results;
}

View File

@@ -0,0 +1,29 @@
/* coding=utf-8
* Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* This code is copied from NVIDIA apex:
* https://github.com/NVIDIA/apex
* with minor changes. */
#ifndef TORCH_CHECK
#define TORCH_CHECK AT_CHECK
#endif
#ifdef VERSION_GE_1_3
#define DATA_PTR data_ptr
#else
#define DATA_PTR data
#endif

View File

@@ -0,0 +1,86 @@
# Copyright (c) 2024 NVIDIA CORPORATION.
# Licensed under the MIT license.
import os
import pathlib
import subprocess
from torch.utils import cpp_extension
"""
Setting this variable to a list can generate different compilation commands (with a different order of architectures) and trigger recompilation of the fused kernels.
Set it to an empty string to avoid recompilation, and assign the arch flags explicitly in extra_cuda_cflags below.
"""
os.environ["TORCH_CUDA_ARCH_LIST"] = ""
def load():
# Check if cuda 11 is installed for compute capability 8.0
cc_flag = []
_, bare_metal_major, _ = _get_cuda_bare_metal_version(cpp_extension.CUDA_HOME)
if int(bare_metal_major) >= 11:
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")
# Build path
srcpath = pathlib.Path(__file__).parent.absolute()
buildpath = srcpath / "build"
_create_build_dir(buildpath)
# Helper function to build the kernels.
def _cpp_extention_load_helper(name, sources, extra_cuda_flags):
return cpp_extension.load(
name=name,
sources=sources,
build_directory=buildpath,
extra_cflags=[
"-O3",
],
extra_cuda_cflags=[
"-O3",
"-gencode",
"arch=compute_70,code=sm_70",
"--use_fast_math",
]
+ extra_cuda_flags
+ cc_flag,
verbose=True,
)
extra_cuda_flags = [
"-U__CUDA_NO_HALF_OPERATORS__",
"-U__CUDA_NO_HALF_CONVERSIONS__",
"--expt-relaxed-constexpr",
"--expt-extended-lambda",
]
sources = [
srcpath / "anti_alias_activation.cpp",
srcpath / "anti_alias_activation_cuda.cu",
]
anti_alias_activation_cuda = _cpp_extention_load_helper(
"anti_alias_activation_cuda", sources, extra_cuda_flags
)
return anti_alias_activation_cuda
def _get_cuda_bare_metal_version(cuda_dir):
raw_output = subprocess.check_output(
[cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
)
output = raw_output.split()
release_idx = output.index("release") + 1
release = output[release_idx].split(".")
bare_metal_major = release[0]
bare_metal_minor = release[1][0]
return raw_output, bare_metal_major, bare_metal_minor
def _create_build_dir(buildpath):
try:
os.mkdir(buildpath)
except OSError:
if not os.path.isdir(buildpath):
print(f"Creation of the build directory {buildpath} failed")

View File

@@ -0,0 +1,92 @@
/* coding=utf-8
* Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <ATen/ATen.h>
#include "compat.h"
#define DISPATCH_FLOAT_HALF_AND_BFLOAT(TYPE, NAME, ...) \
switch (TYPE) \
{ \
case at::ScalarType::Float: \
{ \
using scalar_t = float; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Half: \
{ \
using scalar_t = at::Half; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::BFloat16: \
{ \
using scalar_t = at::BFloat16; \
__VA_ARGS__; \
break; \
} \
default: \
AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
#define DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES(TYPEIN, TYPEOUT, NAME, ...) \
switch (TYPEIN) \
{ \
case at::ScalarType::Float: \
{ \
using scalar_t_in = float; \
switch (TYPEOUT) \
{ \
case at::ScalarType::Float: \
{ \
using scalar_t_out = float; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Half: \
{ \
using scalar_t_out = at::Half; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::BFloat16: \
{ \
using scalar_t_out = at::BFloat16; \
__VA_ARGS__; \
break; \
} \
default: \
AT_ERROR(#NAME, " not implemented for '", toString(TYPEOUT), "'"); \
} \
break; \
} \
case at::ScalarType::Half: \
{ \
using scalar_t_in = at::Half; \
using scalar_t_out = at::Half; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::BFloat16: \
{ \
using scalar_t_in = at::BFloat16; \
using scalar_t_out = at::BFloat16; \
__VA_ARGS__; \
break; \
} \
default: \
AT_ERROR(#NAME, " not implemented for '", toString(TYPEIN), "'"); \
}

View File

@@ -0,0 +1,6 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
from .filter import *
from .resample import *
from .act import *

View File

@@ -0,0 +1,30 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
import torch.nn as nn
from .resample import UpSample1d, DownSample1d
class Activation1d(nn.Module):
def __init__(
self,
activation,
up_ratio: int = 2,
down_ratio: int = 2,
up_kernel_size: int = 12,
down_kernel_size: int = 12,
):
super().__init__()
self.up_ratio = up_ratio
self.down_ratio = down_ratio
self.act = activation
self.upsample = UpSample1d(up_ratio, up_kernel_size)
self.downsample = DownSample1d(down_ratio, down_kernel_size)
# x: [B,C,T]
def forward(self, x):
x = self.upsample(x)
x = self.act(x)
x = self.downsample(x)
return x

View File

@@ -0,0 +1,101 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
if "sinc" in dir(torch):
sinc = torch.sinc
else:
# This code is adopted from adefossez's julius.core.sinc under the MIT License
# https://adefossez.github.io/julius/julius/core.html
# LICENSE is in incl_licenses directory.
def sinc(x: torch.Tensor):
"""
Implementation of sinc, i.e. sin(pi * x) / (pi * x)
__Warning__: Different to julius.sinc, the input is multiplied by `pi`!
"""
return torch.where(
x == 0,
torch.tensor(1.0, device=x.device, dtype=x.dtype),
torch.sin(math.pi * x) / math.pi / x,
)
# This code is adopted from adefossez's julius.lowpass.LowPassFilters under the MIT License
# https://adefossez.github.io/julius/julius/lowpass.html
# LICENSE is in incl_licenses directory.
def kaiser_sinc_filter1d(
cutoff, half_width, kernel_size
): # return filter [1,1,kernel_size]
even = kernel_size % 2 == 0
half_size = kernel_size // 2
# For kaiser window
delta_f = 4 * half_width
A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
if A > 50.0:
beta = 0.1102 * (A - 8.7)
elif A >= 21.0:
beta = 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21.0)
else:
beta = 0.0
window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
# ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
if even:
time = torch.arange(-half_size, half_size) + 0.5
else:
time = torch.arange(kernel_size) - half_size
if cutoff == 0:
filter_ = torch.zeros_like(time)
else:
filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
"""
Normalize filter to have sum = 1, otherwise we will have a small leakage of the constant component in the input signal.
"""
filter_ /= filter_.sum()
filter = filter_.view(1, 1, kernel_size)
return filter
class LowPassFilter1d(nn.Module):
def __init__(
self,
cutoff=0.5,
half_width=0.6,
stride: int = 1,
padding: bool = True,
padding_mode: str = "replicate",
kernel_size: int = 12,
):
"""
kernel_size should be even number for stylegan3 setup, in this implementation, odd number is also possible.
"""
super().__init__()
if cutoff < -0.0:
raise ValueError("Minimum cutoff must be larger than zero.")
if cutoff > 0.5:
raise ValueError("A cutoff above 0.5 does not make sense.")
self.kernel_size = kernel_size
self.even = kernel_size % 2 == 0
self.pad_left = kernel_size // 2 - int(self.even)
self.pad_right = kernel_size // 2
self.stride = stride
self.padding = padding
self.padding_mode = padding_mode
filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
self.register_buffer("filter", filter)
# Input [B, C, T]
def forward(self, x):
_, C, _ = x.shape
if self.padding:
x = F.pad(x, (self.pad_left, self.pad_right), mode=self.padding_mode)
out = F.conv1d(x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
return out

View File

@@ -0,0 +1,58 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
import torch.nn as nn
from torch.nn import functional as F
from .filter import LowPassFilter1d
from .filter import kaiser_sinc_filter1d
class UpSample1d(nn.Module):
def __init__(self, ratio=2, kernel_size=None):
super().__init__()
self.ratio = ratio
self.kernel_size = (
int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
)
self.stride = ratio
self.pad = self.kernel_size // ratio - 1
self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
self.pad_right = (
self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
)
filter = kaiser_sinc_filter1d(
cutoff=0.5 / ratio, half_width=0.6 / ratio, kernel_size=self.kernel_size
)
self.register_buffer("filter", filter)
# x: [B, C, T]
def forward(self, x):
_, C, _ = x.shape
x = F.pad(x, (self.pad, self.pad), mode="replicate")
x = self.ratio * F.conv_transpose1d(
x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C
)
x = x[..., self.pad_left : -self.pad_right]
return x
class DownSample1d(nn.Module):
def __init__(self, ratio=2, kernel_size=None):
super().__init__()
self.ratio = ratio
self.kernel_size = (
int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
)
self.lowpass = LowPassFilter1d(
cutoff=0.5 / ratio,
half_width=0.6 / ratio,
stride=ratio,
kernel_size=self.kernel_size,
)
def forward(self, x):
xx = self.lowpass(x)
return xx

View File

@@ -0,0 +1,492 @@
# Copyright (c) 2024 NVIDIA CORPORATION.
# Licensed under the MIT license.
# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.
import os
import json
from pathlib import Path
from typing import Optional, Union, Dict
import torch
import torch.nn as nn
from torch.nn import Conv1d, ConvTranspose1d
from torch.nn.utils import weight_norm, remove_weight_norm
from . import activations
from .utils import init_weights, get_padding
from .alias_free_activation.torch.act import Activation1d as TorchActivation1d
from .env import AttrDict
from huggingface_hub import PyTorchModelHubMixin, hf_hub_download
def load_hparams_from_json(path) -> AttrDict:
with open(path) as f:
data = f.read()
return AttrDict(json.loads(data))
class AMPBlock1(torch.nn.Module):
"""
AMPBlock applies Snake / SnakeBeta activation functions with trainable parameters that control periodicity, defined for each layer.
AMPBlock1 has additional self.convs2 that contains additional Conv1d layers with a fixed dilation=1 followed by each layer in self.convs1
Args:
h (AttrDict): Hyperparameters.
channels (int): Number of convolution channels.
kernel_size (int): Size of the convolution kernel. Default is 3.
dilation (tuple): Dilation rates for the convolutions. Each dilation layer has two convolutions. Default is (1, 3, 5).
activation (str): Activation function type. Should be either 'snake' or 'snakebeta'. Default is None.
"""
def __init__(
self,
h: AttrDict,
channels: int,
kernel_size: int = 3,
dilation: tuple = (1, 3, 5),
activation: str = None,
):
super().__init__()
self.h = h
self.convs1 = nn.ModuleList(
[
weight_norm(
Conv1d(
channels,
channels,
kernel_size,
stride=1,
dilation=d,
padding=get_padding(kernel_size, d),
)
)
for d in dilation
]
)
self.convs1.apply(init_weights)
self.convs2 = nn.ModuleList(
[
weight_norm(
Conv1d(
channels,
channels,
kernel_size,
stride=1,
dilation=1,
padding=get_padding(kernel_size, 1),
)
)
for _ in range(len(dilation))
]
)
self.convs2.apply(init_weights)
self.num_layers = len(self.convs1) + len(
self.convs2
) # Total number of conv layers
# Select which Activation1d, lazy-load cuda version to ensure backward compatibility
if self.h.get("use_cuda_kernel", False):
from .alias_free_activation.cuda.activation1d import (
Activation1d as CudaActivation1d,
)
Activation1d = CudaActivation1d
else:
Activation1d = TorchActivation1d
# Activation functions
if activation == "snake":
self.activations = nn.ModuleList(
[
Activation1d(
activation=activations.Snake(
channels, alpha_logscale=h.snake_logscale
)
)
for _ in range(self.num_layers)
]
)
elif activation == "snakebeta":
self.activations = nn.ModuleList(
[
Activation1d(
activation=activations.SnakeBeta(
channels, alpha_logscale=h.snake_logscale
)
)
for _ in range(self.num_layers)
]
)
else:
raise NotImplementedError(
"activation incorrectly specified. check the config file and look for 'activation'."
)
def forward(self, x):
acts1, acts2 = self.activations[::2], self.activations[1::2]
for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
xt = a1(x)
xt = c1(xt)
xt = a2(xt)
xt = c2(xt)
x = xt + x
return x
def remove_weight_norm(self):
for l in self.convs1:
remove_weight_norm(l)
for l in self.convs2:
remove_weight_norm(l)
class AMPBlock2(torch.nn.Module):
"""
AMPBlock applies Snake / SnakeBeta activation functions with trainable parameters that control periodicity, defined for each layer.
Unlike AMPBlock1, AMPBlock2 does not contain extra Conv1d layers with fixed dilation=1
Args:
h (AttrDict): Hyperparameters.
channels (int): Number of convolution channels.
kernel_size (int): Size of the convolution kernel. Default is 3.
dilation (tuple): Dilation rates for the convolutions. Each dilation layer has two convolutions. Default is (1, 3, 5).
activation (str): Activation function type. Should be either 'snake' or 'snakebeta'. Default is None.
"""
def __init__(
self,
h: AttrDict,
channels: int,
kernel_size: int = 3,
dilation: tuple = (1, 3, 5),
activation: str = None,
):
super().__init__()
self.h = h
self.convs = nn.ModuleList(
[
weight_norm(
Conv1d(
channels,
channels,
kernel_size,
stride=1,
dilation=d,
padding=get_padding(kernel_size, d),
)
)
for d in dilation
]
)
self.convs.apply(init_weights)
self.num_layers = len(self.convs) # Total number of conv layers
# Select which Activation1d, lazy-load cuda version to ensure backward compatibility
if self.h.get("use_cuda_kernel", False):
from .alias_free_activation.cuda.activation1d import (
Activation1d as CudaActivation1d,
)
Activation1d = CudaActivation1d
else:
Activation1d = TorchActivation1d
# Activation functions
if activation == "snake":
self.activations = nn.ModuleList(
[
Activation1d(
activation=activations.Snake(
channels, alpha_logscale=h.snake_logscale
)
)
for _ in range(self.num_layers)
]
)
elif activation == "snakebeta":
self.activations = nn.ModuleList(
[
Activation1d(
activation=activations.SnakeBeta(
channels, alpha_logscale=h.snake_logscale
)
)
for _ in range(self.num_layers)
]
)
else:
raise NotImplementedError(
"activation incorrectly specified. check the config file and look for 'activation'."
)
def forward(self, x):
for c, a in zip(self.convs, self.activations):
xt = a(x)
xt = c(xt)
x = xt + x
return x
def remove_weight_norm(self):
for l in self.convs:
remove_weight_norm(l)
class BigVGAN(
torch.nn.Module,
PyTorchModelHubMixin,
library_name="bigvgan",
repo_url="https://github.com/NVIDIA/BigVGAN",
docs_url="https://github.com/NVIDIA/BigVGAN/blob/main/README.md",
pipeline_tag="audio-to-audio",
license="mit",
tags=["neural-vocoder", "audio-generation", "arxiv:2206.04658"],
):
"""
BigVGAN is a neural vocoder model that applies anti-aliased periodic activation for residual blocks (resblocks).
New in BigVGAN-v2: it can optionally use optimized CUDA kernels for AMP (anti-aliased multi-periodicity) blocks.
Args:
h (AttrDict): Hyperparameters.
use_cuda_kernel (bool): If set to True, loads optimized CUDA kernels for AMP. This should be used for inference only, as training is not supported with CUDA kernels.
Note:
- The `use_cuda_kernel` parameter should be used for inference only, as training with CUDA kernels is not supported.
- Ensure that the activation function is correctly specified in the hyperparameters (h.activation).
"""
def __init__(self, h: AttrDict, use_cuda_kernel: bool = False):
super().__init__()
self.h = h
self.h["use_cuda_kernel"] = use_cuda_kernel
# Select which Activation1d, lazy-load cuda version to ensure backward compatibility
if self.h.get("use_cuda_kernel", False):
from .alias_free_activation.cuda.activation1d import (
Activation1d as CudaActivation1d,
)
Activation1d = CudaActivation1d
else:
Activation1d = TorchActivation1d
self.num_kernels = len(h.resblock_kernel_sizes)
self.num_upsamples = len(h.upsample_rates)
# Pre-conv
self.conv_pre = weight_norm(
Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3)
)
# Define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
if h.resblock == "1":
resblock_class = AMPBlock1
elif h.resblock == "2":
resblock_class = AMPBlock2
else:
raise ValueError(
f"Incorrect resblock class specified in hyperparameters. Got {h.resblock}"
)
# Transposed conv-based upsamplers; these do not apply anti-aliasing
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
self.ups.append(
nn.ModuleList(
[
weight_norm(
ConvTranspose1d(
h.upsample_initial_channel // (2 ** i),
h.upsample_initial_channel // (2 ** (i + 1)),
k,
u,
padding=(k - u) // 2,
)
)
]
)
)
# Residual blocks using anti-aliased multi-periodicity composition modules (AMP)
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = h.upsample_initial_channel // (2 ** (i + 1))
for j, (k, d) in enumerate(
zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)
):
self.resblocks.append(
resblock_class(h, ch, k, d, activation=h.activation)
)
# Post-conv
activation_post = (
activations.Snake(ch, alpha_logscale=h.snake_logscale)
if h.activation == "snake"
else (
activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
if h.activation == "snakebeta"
else None
)
)
if activation_post is None:
raise NotImplementedError(
"activation incorrectly specified. check the config file and look for 'activation'."
)
self.activation_post = Activation1d(activation=activation_post)
# Whether to use bias for the final conv_post. Default to True for backward compatibility
self.use_bias_at_final = h.get("use_bias_at_final", True)
self.conv_post = weight_norm(
Conv1d(ch, 1, 7, 1, padding=3, bias=self.use_bias_at_final)
)
# Weight initialization
for i in range(len(self.ups)):
self.ups[i].apply(init_weights)
self.conv_post.apply(init_weights)
# Final tanh activation. Defaults to True for backward compatibility
self.use_tanh_at_final = h.get("use_tanh_at_final", True)
def forward(self, x):
# Pre-conv
x = self.conv_pre(x)
for i in range(self.num_upsamples):
# Upsampling
for i_up in range(len(self.ups[i])):
x = self.ups[i][i_up](x)
# AMP blocks
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
# Post-conv
x = self.activation_post(x)
x = self.conv_post(x)
# Final tanh activation
if self.use_tanh_at_final:
x = torch.tanh(x)
else:
x = torch.clamp(x, min=-1.0, max=1.0) # Bound the output to [-1, 1]
return x
def remove_weight_norm(self):
try:
print("Removing weight norm...")
for l in self.ups:
for l_i in l:
remove_weight_norm(l_i)
for l in self.resblocks:
l.remove_weight_norm()
remove_weight_norm(self.conv_pre)
remove_weight_norm(self.conv_post)
except ValueError:
print("[INFO] Model already removed weight norm. Skipping!")
pass
# Additional methods for huggingface_hub support
def _save_pretrained(self, save_directory: Path) -> None:
"""Save weights and config.json from a Pytorch model to a local directory."""
model_path = save_directory / "bigvgan_generator.pt"
torch.save({"generator": self.state_dict()}, model_path)
config_path = save_directory / "config.json"
with open(config_path, "w") as config_file:
json.dump(self.h, config_file, indent=4)
@classmethod
def _from_pretrained(
cls,
*,
model_id: str,
revision: str,
cache_dir: str,
force_download: bool,
proxies: Optional[Dict],
resume_download: bool,
local_files_only: bool,
token: Union[str, bool, None],
map_location: str = "cpu", # Additional argument
strict: bool = False, # Additional argument
use_cuda_kernel: bool = False,
**model_kwargs,
):
"""Load Pytorch pretrained weights and return the loaded model."""
# Download and load hyperparameters (h) used by BigVGAN
if os.path.isdir(model_id):
print("Loading config.json from local directory")
config_file = os.path.join(model_id, "config.json")
else:
config_file = hf_hub_download(
repo_id=model_id,
filename="config.json",
revision=revision,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
token=token,
local_files_only=local_files_only,
)
h = load_hparams_from_json(config_file)
# instantiate BigVGAN using h
if use_cuda_kernel:
print(
f"[WARNING] You have specified use_cuda_kernel=True during BigVGAN.from_pretrained(). Only inference is supported (training is not implemented)!"
)
print(
f"[WARNING] You need nvcc and ninja installed in your system that matches your PyTorch build is using to build the kernel. If not, the model will fail to initialize or generate incorrect waveform!"
)
print(
f"[WARNING] For detail, see the official GitHub repository: https://github.com/NVIDIA/BigVGAN?tab=readme-ov-file#using-custom-cuda-kernel-for-synthesis"
)
model = cls(h, use_cuda_kernel=use_cuda_kernel)
# Download and load pretrained generator weight
if os.path.isdir(model_id):
print("Loading weights from local directory")
model_file = os.path.join(model_id, "bigvgan_generator.pt")
else:
print(f"Loading weights from {model_id}")
model_file = hf_hub_download(
repo_id=model_id,
filename="bigvgan_generator.pt",
revision=revision,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
token=token,
local_files_only=local_files_only,
)
checkpoint_dict = torch.load(model_file, map_location=map_location)
try:
model.load_state_dict(checkpoint_dict["generator"])
except RuntimeError:
print(
f"[INFO] the pretrained checkpoint does not contain weight norm. Loading the checkpoint after removing weight norm!"
)
model.remove_weight_norm()
model.load_state_dict(checkpoint_dict["generator"])
return model
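
A hypothetical inference sketch (assumes the definitions above are in scope and that a config.json like the one shown in the next file is on disk; the weights here are random, so only the output shape is meaningful):

import torch

h = load_hparams_from_json("config.json")        # e.g. the 22.05 kHz / 80-mel config below
model = BigVGAN(h, use_cuda_kernel=False).eval()
mel = torch.randn(1, h.num_mels, 100)            # (B, num_mels, frames)
with torch.no_grad():
    wav = model(mel)
print(wav.shape)                                 # (1, 1, 25600): 100 frames x 256 total upsampling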

View File

@@ -0,0 +1,63 @@
{
"resblock": "1",
"num_gpus": 0,
"batch_size": 32,
"learning_rate": 0.0001,
"adam_b1": 0.8,
"adam_b2": 0.99,
"lr_decay": 0.9999996,
"seed": 1234,
"upsample_rates": [4,4,2,2,2,2],
"upsample_kernel_sizes": [8,8,4,4,4,4],
"upsample_initial_channel": 1536,
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"use_tanh_at_final": false,
"use_bias_at_final": false,
"activation": "snakebeta",
"snake_logscale": true,
"use_cqtd_instead_of_mrd": true,
"cqtd_filters": 128,
"cqtd_max_filters": 1024,
"cqtd_filters_scale": 1,
"cqtd_dilations": [1, 2, 4],
"cqtd_hop_lengths": [512, 256, 256],
"cqtd_n_octaves": [9, 9, 9],
"cqtd_bins_per_octaves": [24, 36, 48],
"mpd_reshapes": [2, 3, 5, 7, 11],
"use_spectral_norm": false,
"discriminator_channel_mult": 1,
"use_multiscale_melloss": true,
"lambda_melloss": 15,
"clip_grad_norm": 500,
"segment_size": 65536,
"num_mels": 80,
"num_freq": 1025,
"n_fft": 1024,
"hop_size": 256,
"win_size": 1024,
"sampling_rate": 22050,
"fmin": 0,
"fmax": null,
"fmax_for_loss": null,
"normalize_volume": true,
"num_workers": 4,
"dist_config": {
"dist_backend": "nccl",
"dist_url": "tcp://localhost:54321",
"world_size": 1
}
}

View File

@@ -0,0 +1,18 @@
# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.
import os
import shutil
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
def build_env(config, config_name, path):
t_path = os.path.join(path, config_name)
if config != t_path:
os.makedirs(path, exist_ok=True)
shutil.copyfile(config, os.path.join(path, config_name))
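
AttrDict simply mirrors dict keys as attributes, which is how the hyperparameter JSON above is consumed; a one-line sketch:

h = AttrDict({"num_mels": 80, "sampling_rate": 22050})
print(h.num_mels, h["sampling_rate"])   # 80 22050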

View File

@@ -0,0 +1,354 @@
# Copyright (c) 2024 NVIDIA CORPORATION.
# Licensed under the MIT license.
# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.
import math
import os
import random
import torch
import torch.utils.data
import numpy as np
from librosa.util import normalize
from scipy.io.wavfile import read
from librosa.filters import mel as librosa_mel_fn
import pathlib
from tqdm import tqdm
MAX_WAV_VALUE = 32767.0  # NOTE: 32768.0 - 1 to prevent int16 overflow (which can cause popping sounds in corner cases)
def load_wav(full_path, sr_target):
sampling_rate, data = read(full_path)
if sampling_rate != sr_target:
raise RuntimeError(
f"Sampling rate of the file {full_path} is {sampling_rate} Hz, but the model requires {sr_target} Hz"
)
return data, sampling_rate
def dynamic_range_compression(x, C=1, clip_val=1e-5):
return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
def dynamic_range_decompression(x, C=1):
return np.exp(x) / C
def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
return torch.log(torch.clamp(x, min=clip_val) * C)
def dynamic_range_decompression_torch(x, C=1):
return torch.exp(x) / C
def spectral_normalize_torch(magnitudes):
return dynamic_range_compression_torch(magnitudes)
def spectral_de_normalize_torch(magnitudes):
return dynamic_range_decompression_torch(magnitudes)
mel_basis_cache = {}
hann_window_cache = {}
def mel_spectrogram(
y: torch.Tensor,
n_fft: int,
num_mels: int,
sampling_rate: int,
hop_size: int,
win_size: int,
fmin: int,
fmax: int = None,
center: bool = False,
) -> torch.Tensor:
"""
Calculate the mel spectrogram of an input signal.
This function uses slaney norm for the librosa mel filterbank (using librosa.filters.mel) and uses Hann window for STFT (using torch.stft).
Args:
y (torch.Tensor): Input signal.
n_fft (int): FFT size.
num_mels (int): Number of mel bins.
sampling_rate (int): Sampling rate of the input signal.
hop_size (int): Hop size for STFT.
win_size (int): Window size for STFT.
fmin (int): Minimum frequency for mel filterbank.
fmax (int): Maximum frequency for mel filterbank. If None, defaults to half the sampling rate (fmax = sr / 2.0) inside librosa_mel_fn
center (bool): Whether to pad the input to center the frames. Default is False.
Returns:
torch.Tensor: Mel spectrogram.
"""
if torch.min(y) < -1.0:
print(f"[WARNING] Min value of input waveform signal is {torch.min(y)}")
if torch.max(y) > 1.0:
print(f"[WARNING] Max value of input waveform signal is {torch.max(y)}")
device = y.device
key = f"{n_fft}_{num_mels}_{sampling_rate}_{hop_size}_{win_size}_{fmin}_{fmax}_{device}"
if key not in mel_basis_cache:
mel = librosa_mel_fn(
sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
)
mel_basis_cache[key] = torch.from_numpy(mel).float().to(device)
hann_window_cache[key] = torch.hann_window(win_size).to(device)
mel_basis = mel_basis_cache[key]
hann_window = hann_window_cache[key]
padding = (n_fft - hop_size) // 2
y = torch.nn.functional.pad(
y.unsqueeze(1), (padding, padding), mode="reflect"
).squeeze(1)
spec = torch.stft(
y,
n_fft,
hop_length=hop_size,
win_length=win_size,
window=hann_window,
center=center,
pad_mode="reflect",
normalized=False,
onesided=True,
return_complex=True,
)
spec = torch.sqrt(torch.view_as_real(spec).pow(2).sum(-1) + 1e-9)
mel_spec = torch.matmul(mel_basis, spec)
mel_spec = spectral_normalize_torch(mel_spec)
return mel_spec
def get_mel_spectrogram(wav, h):
"""
Generate mel spectrogram from a waveform using given hyperparameters.
Args:
wav (torch.Tensor): Input waveform.
h: Hyperparameters object with attributes n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax.
Returns:
torch.Tensor: Mel spectrogram.
"""
return mel_spectrogram(
wav,
h.n_fft,
h.num_mels,
h.sampling_rate,
h.hop_size,
h.win_size,
h.fmin,
h.fmax,
)
def get_dataset_filelist(a):
training_files = []
validation_files = []
list_unseen_validation_files = []
with open(a.input_training_file, "r", encoding="utf-8") as fi:
training_files = [
os.path.join(a.input_wavs_dir, x.split("|")[0] + ".wav")
for x in fi.read().split("\n")
if len(x) > 0
]
print(f"first training file: {training_files[0]}")
with open(a.input_validation_file, "r", encoding="utf-8") as fi:
validation_files = [
os.path.join(a.input_wavs_dir, x.split("|")[0] + ".wav")
for x in fi.read().split("\n")
if len(x) > 0
]
print(f"first validation file: {validation_files[0]}")
for i in range(len(a.list_input_unseen_validation_file)):
with open(a.list_input_unseen_validation_file[i], "r", encoding="utf-8") as fi:
unseen_validation_files = [
os.path.join(a.list_input_unseen_wavs_dir[i], x.split("|")[0] + ".wav")
for x in fi.read().split("\n")
if len(x) > 0
]
print(
f"first unseen {i}th validation fileset: {unseen_validation_files[0]}"
)
list_unseen_validation_files.append(unseen_validation_files)
return training_files, validation_files, list_unseen_validation_files
class MelDataset(torch.utils.data.Dataset):
def __init__(
self,
training_files,
hparams,
segment_size,
n_fft,
num_mels,
hop_size,
win_size,
sampling_rate,
fmin,
fmax,
split=True,
shuffle=True,
n_cache_reuse=1,
device=None,
fmax_loss=None,
fine_tuning=False,
base_mels_path=None,
is_seen=True,
):
self.audio_files = training_files
random.seed(1234)
if shuffle:
random.shuffle(self.audio_files)
self.hparams = hparams
self.is_seen = is_seen
if self.is_seen:
self.name = pathlib.Path(self.audio_files[0]).parts[0]
else:
self.name = "-".join(pathlib.Path(self.audio_files[0]).parts[:2]).strip("/")
self.segment_size = segment_size
self.sampling_rate = sampling_rate
self.split = split
self.n_fft = n_fft
self.num_mels = num_mels
self.hop_size = hop_size
self.win_size = win_size
self.fmin = fmin
self.fmax = fmax
self.fmax_loss = fmax_loss
self.cached_wav = None
self.n_cache_reuse = n_cache_reuse
self._cache_ref_count = 0
self.device = device
self.fine_tuning = fine_tuning
self.base_mels_path = base_mels_path
print("[INFO] checking dataset integrity...")
for i in tqdm(range(len(self.audio_files))):
assert os.path.exists(
self.audio_files[i]
), f"{self.audio_files[i]} not found"
def __getitem__(self, index):
filename = self.audio_files[index]
if self._cache_ref_count == 0:
audio, sampling_rate = load_wav(filename, self.sampling_rate)
audio = audio / MAX_WAV_VALUE
if not self.fine_tuning:
audio = normalize(audio) * 0.95
self.cached_wav = audio
if sampling_rate != self.sampling_rate:
raise ValueError(
f"{sampling_rate} SR doesn't match target {self.sampling_rate} SR"
)
self._cache_ref_count = self.n_cache_reuse
else:
audio = self.cached_wav
self._cache_ref_count -= 1
audio = torch.FloatTensor(audio)
audio = audio.unsqueeze(0)
if not self.fine_tuning:
if self.split:
if audio.size(1) >= self.segment_size:
max_audio_start = audio.size(1) - self.segment_size
audio_start = random.randint(0, max_audio_start)
audio = audio[:, audio_start : audio_start + self.segment_size]
else:
audio = torch.nn.functional.pad(
audio, (0, self.segment_size - audio.size(1)), "constant"
)
mel = mel_spectrogram(
audio,
self.n_fft,
self.num_mels,
self.sampling_rate,
self.hop_size,
self.win_size,
self.fmin,
self.fmax,
center=False,
)
else: # Validation step
# Match audio length to self.hop_size * n for evaluation
if (audio.size(1) % self.hop_size) != 0:
audio = audio[:, : -(audio.size(1) % self.hop_size)]
mel = mel_spectrogram(
audio,
self.n_fft,
self.num_mels,
self.sampling_rate,
self.hop_size,
self.win_size,
self.fmin,
self.fmax,
center=False,
)
assert (
audio.shape[1] == mel.shape[2] * self.hop_size
), f"audio shape {audio.shape} mel shape {mel.shape}"
else:
mel = np.load(
os.path.join(
self.base_mels_path,
os.path.splitext(os.path.split(filename)[-1])[0] + ".npy",
)
)
mel = torch.from_numpy(mel)
if len(mel.shape) < 3:
mel = mel.unsqueeze(0)
if self.split:
frames_per_seg = math.ceil(self.segment_size / self.hop_size)
if audio.size(1) >= self.segment_size:
mel_start = random.randint(0, mel.size(2) - frames_per_seg - 1)
mel = mel[:, :, mel_start : mel_start + frames_per_seg]
audio = audio[
:,
mel_start
* self.hop_size : (mel_start + frames_per_seg)
* self.hop_size,
]
else:
mel = torch.nn.functional.pad(
mel, (0, frames_per_seg - mel.size(2)), "constant"
)
audio = torch.nn.functional.pad(
audio, (0, self.segment_size - audio.size(1)), "constant"
)
mel_loss = mel_spectrogram(
audio,
self.n_fft,
self.num_mels,
self.sampling_rate,
self.hop_size,
self.win_size,
self.fmin,
self.fmax_loss,
center=False,
)
return (mel.squeeze(), audio.squeeze(0), filename, mel_loss.squeeze())
def __len__(self):
return len(self.audio_files)
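
A minimal sketch of how the mel helpers above can be exercised end to end. The hyperparameter values are illustrative assumptions, not the repository's shipped config, and `AttrDict` is the small dict subclass defined earlier in this changeset:

# Illustrative only: these hyperparameters are assumptions, not the shipped config.
import torch

h = AttrDict(
    n_fft=1024, num_mels=80, sampling_rate=22050,
    hop_size=256, win_size=1024, fmin=0, fmax=8000,
)
wav = torch.randn(1, 22050).clamp(-1.0, 1.0)  # one second of fake audio in [-1, 1]
mel = get_mel_spectrogram(wav, h)             # -> (1, num_mels, n_frames)
print(mel.shape)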

View File

@@ -0,0 +1,99 @@
# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.
import glob
import os
import matplotlib
import torch
from torch.nn.utils import weight_norm
matplotlib.use("Agg")
import matplotlib.pylab as plt
from .meldataset import MAX_WAV_VALUE
from scipy.io.wavfile import write
def plot_spectrogram(spectrogram):
fig, ax = plt.subplots(figsize=(10, 2))
im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
plt.colorbar(im, ax=ax)
fig.canvas.draw()
plt.close()
return fig
def plot_spectrogram_clipped(spectrogram, clip_max=2.0):
fig, ax = plt.subplots(figsize=(10, 2))
im = ax.imshow(
spectrogram,
aspect="auto",
origin="lower",
interpolation="none",
vmin=1e-6,
vmax=clip_max,
)
plt.colorbar(im, ax=ax)
fig.canvas.draw()
plt.close()
return fig
def init_weights(m, mean=0.0, std=0.01):
classname = m.__class__.__name__
if classname.find("Conv") != -1:
m.weight.data.normal_(mean, std)
def apply_weight_norm(m):
classname = m.__class__.__name__
if classname.find("Conv") != -1:
weight_norm(m)
def get_padding(kernel_size, dilation=1):
return int((kernel_size * dilation - dilation) / 2)
def load_checkpoint(filepath, device):
assert os.path.isfile(filepath)
print(f"Loading '{filepath}'")
checkpoint_dict = torch.load(filepath, map_location=device)
print("Complete.")
return checkpoint_dict
def save_checkpoint(filepath, obj):
print(f"Saving checkpoint to {filepath}")
torch.save(obj, filepath)
print("Complete.")
def scan_checkpoint(cp_dir, prefix, renamed_file=None):
# Fallback to original scanning logic first
pattern = os.path.join(cp_dir, prefix + "????????")
cp_list = glob.glob(pattern)
if len(cp_list) > 0:
last_checkpoint_path = sorted(cp_list)[-1]
print(f"[INFO] Resuming from checkpoint: '{last_checkpoint_path}'")
return last_checkpoint_path
# If no pattern-based checkpoints are found, check for renamed file
if renamed_file:
renamed_path = os.path.join(cp_dir, renamed_file)
if os.path.isfile(renamed_path):
print(f"[INFO] Resuming from renamed checkpoint: '{renamed_file}'")
return renamed_path
return None
def save_audio(audio, path, sr):
# wav: torch with 1d shape
audio = audio * MAX_WAV_VALUE
audio = audio.cpu().numpy().astype("int16")
write(path, sr, audio)
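
A hedged sketch of how `scan_checkpoint` and `load_checkpoint` compose when resuming training; the directory name, file prefix, and renamed filename below are made-up placeholders for illustration:

# Hypothetical resume flow; "exp/run", "g_" and "generator.pt" are assumed names.
cp_dir = "exp/run"
cp_path = scan_checkpoint(cp_dir, "g_", renamed_file="generator.pt")
if cp_path is not None:
    state_dict = load_checkpoint(cp_path, device="cpu")
else:
    state_dict = None  # no checkpoint found: start training from scratch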

View File

@@ -0,0 +1,115 @@
# Copyright 3D-Speaker (https://github.com/alibaba-damo-academy/3D-Speaker). All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
from collections import OrderedDict
import torch
from torch import nn
import torch.nn.functional as F
from indextts.s2mel.modules.campplus.layers import DenseLayer, StatsPool, TDNNLayer, CAMDenseTDNNBlock, TransitLayer, BasicResBlock, get_nonlinear
class FCM(nn.Module):
def __init__(self,
block=BasicResBlock,
num_blocks=[2, 2],
m_channels=32,
feat_dim=80):
super(FCM, self).__init__()
self.in_planes = m_channels
self.conv1 = nn.Conv2d(1, m_channels, kernel_size=3, stride=1, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(m_channels)
self.layer1 = self._make_layer(block, m_channels, num_blocks[0], stride=2)
self.layer2 = self._make_layer(block, m_channels, num_blocks[1], stride=2)
self.conv2 = nn.Conv2d(m_channels, m_channels, kernel_size=3, stride=(2, 1), padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(m_channels)
self.out_channels = m_channels * (feat_dim // 8)
def _make_layer(self, block, planes, num_blocks, stride):
strides = [stride] + [1] * (num_blocks - 1)
layers = []
for stride in strides:
layers.append(block(self.in_planes, planes, stride))
self.in_planes = planes * block.expansion
return nn.Sequential(*layers)
def forward(self, x):
x = x.unsqueeze(1)
out = F.relu(self.bn1(self.conv1(x)))
out = self.layer1(out)
out = self.layer2(out)
out = F.relu(self.bn2(self.conv2(out)))
shape = out.shape
out = out.reshape(shape[0], shape[1]*shape[2], shape[3])
return out
class CAMPPlus(nn.Module):
def __init__(self,
feat_dim=80,
embedding_size=512,
growth_rate=32,
bn_size=4,
init_channels=128,
config_str='batchnorm-relu',
memory_efficient=True):
super(CAMPPlus, self).__init__()
self.head = FCM(feat_dim=feat_dim)
channels = self.head.out_channels
self.xvector = nn.Sequential(
OrderedDict([
('tdnn',
TDNNLayer(channels,
init_channels,
5,
stride=2,
dilation=1,
padding=-1,
config_str=config_str)),
]))
channels = init_channels
for i, (num_layers, kernel_size,
dilation) in enumerate(zip((12, 24, 16), (3, 3, 3), (1, 2, 2))):
block = CAMDenseTDNNBlock(num_layers=num_layers,
in_channels=channels,
out_channels=growth_rate,
bn_channels=bn_size * growth_rate,
kernel_size=kernel_size,
dilation=dilation,
config_str=config_str,
memory_efficient=memory_efficient)
self.xvector.add_module('block%d' % (i + 1), block)
channels = channels + num_layers * growth_rate
self.xvector.add_module(
'transit%d' % (i + 1),
TransitLayer(channels,
channels // 2,
bias=False,
config_str=config_str))
channels //= 2
self.xvector.add_module(
'out_nonlinear', get_nonlinear(config_str, channels))
self.xvector.add_module('stats', StatsPool())
self.xvector.add_module(
'dense',
DenseLayer(channels * 2, embedding_size, config_str='batchnorm_'))
for m in self.modules():
if isinstance(m, (nn.Conv1d, nn.Linear)):
nn.init.kaiming_normal_(m.weight.data)
if m.bias is not None:
nn.init.zeros_(m.bias)
def forward(self, x):
x = x.permute(0, 2, 1) # (B,T,F) => (B,F,T)
x = self.head(x)
x = self.xvector(x)
return x
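
A small smoke test for the speaker encoder above, assuming 80-dimensional fbank features laid out as (batch, frames, feat_dim):

import torch

model = CAMPPlus(feat_dim=80, embedding_size=512)
model.eval()
with torch.no_grad():
    feats = torch.randn(2, 200, 80)  # (B, T, F): two utterances, 200 frames each
    emb = model(feats)               # (2, 512) speaker embeddings
print(emb.shape)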

View File

@@ -0,0 +1,70 @@
# Copyright 3D-Speaker (https://github.com/alibaba-damo-academy/3D-Speaker). All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
import torch
import torch.nn as nn
import torch.nn.functional as F
from modules.campplus.layers import DenseLayer
class CosineClassifier(nn.Module):
def __init__(
self,
input_dim,
num_blocks=0,
inter_dim=512,
out_neurons=1000,
):
super().__init__()
self.blocks = nn.ModuleList()
for index in range(num_blocks):
self.blocks.append(
DenseLayer(input_dim, inter_dim, config_str='batchnorm')
)
input_dim = inter_dim
self.weight = nn.Parameter(
torch.FloatTensor(out_neurons, input_dim)
)
nn.init.xavier_uniform_(self.weight)
def forward(self, x):
# x: [B, dim]
for layer in self.blocks:
x = layer(x)
# normalized
x = F.linear(F.normalize(x), F.normalize(self.weight))
return x
class LinearClassifier(nn.Module):
def __init__(
self,
input_dim,
num_blocks=0,
inter_dim=512,
out_neurons=1000,
):
super().__init__()
self.blocks = nn.ModuleList()
self.nonlinear = nn.ReLU(inplace=True)
for index in range(num_blocks):
self.blocks.append(
DenseLayer(input_dim, inter_dim, bias=True)
)
input_dim = inter_dim
self.linear = nn.Linear(input_dim, out_neurons, bias=True)
def forward(self, x):
# x: [B, dim]
x = self.nonlinear(x)
for layer in self.blocks:
x = layer(x)
x = self.linear(x)
return x
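
Both heads consume fixed-size embeddings; a short sketch of the cosine head, whose outputs are cosine similarities (dimensions below are illustrative):

import torch

clf = CosineClassifier(input_dim=512, num_blocks=0, out_neurons=1000)
logits = clf(torch.randn(4, 512))  # (4, 1000); each entry is a cosine similarity in [-1, 1]
print(logits.shape)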

View File

@@ -0,0 +1,253 @@
# Copyright 3D-Speaker (https://github.com/alibaba-damo-academy/3D-Speaker). All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
import torch
import torch.nn.functional as F
import torch.utils.checkpoint as cp
from torch import nn
def get_nonlinear(config_str, channels):
nonlinear = nn.Sequential()
for name in config_str.split('-'):
if name == 'relu':
nonlinear.add_module('relu', nn.ReLU(inplace=True))
elif name == 'prelu':
nonlinear.add_module('prelu', nn.PReLU(channels))
elif name == 'batchnorm':
nonlinear.add_module('batchnorm', nn.BatchNorm1d(channels))
elif name == 'batchnorm_':
nonlinear.add_module('batchnorm',
nn.BatchNorm1d(channels, affine=False))
else:
raise ValueError('Unexpected module ({}).'.format(name))
return nonlinear
def statistics_pooling(x, dim=-1, keepdim=False, unbiased=True, eps=1e-2):
mean = x.mean(dim=dim)
std = x.std(dim=dim, unbiased=unbiased)
stats = torch.cat([mean, std], dim=-1)
if keepdim:
stats = stats.unsqueeze(dim=dim)
return stats
class StatsPool(nn.Module):
def forward(self, x):
return statistics_pooling(x)
class TDNNLayer(nn.Module):
def __init__(self,
in_channels,
out_channels,
kernel_size,
stride=1,
padding=0,
dilation=1,
bias=False,
config_str='batchnorm-relu'):
super(TDNNLayer, self).__init__()
if padding < 0:
assert kernel_size % 2 == 1, 'Expect equal paddings, but got even kernel size ({})'.format(
kernel_size)
padding = (kernel_size - 1) // 2 * dilation
self.linear = nn.Conv1d(in_channels,
out_channels,
kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
bias=bias)
self.nonlinear = get_nonlinear(config_str, out_channels)
def forward(self, x):
x = self.linear(x)
x = self.nonlinear(x)
return x
class CAMLayer(nn.Module):
def __init__(self,
bn_channels,
out_channels,
kernel_size,
stride,
padding,
dilation,
bias,
reduction=2):
super(CAMLayer, self).__init__()
self.linear_local = nn.Conv1d(bn_channels,
out_channels,
kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
bias=bias)
self.linear1 = nn.Conv1d(bn_channels, bn_channels // reduction, 1)
self.relu = nn.ReLU(inplace=True)
self.linear2 = nn.Conv1d(bn_channels // reduction, out_channels, 1)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
y = self.linear_local(x)
context = x.mean(-1, keepdim=True)+self.seg_pooling(x)
context = self.relu(self.linear1(context))
m = self.sigmoid(self.linear2(context))
return y*m
def seg_pooling(self, x, seg_len=100, stype='avg'):
if stype == 'avg':
seg = F.avg_pool1d(x, kernel_size=seg_len, stride=seg_len, ceil_mode=True)
elif stype == 'max':
seg = F.max_pool1d(x, kernel_size=seg_len, stride=seg_len, ceil_mode=True)
else:
raise ValueError('Wrong segment pooling type.')
shape = seg.shape
seg = seg.unsqueeze(-1).expand(*shape, seg_len).reshape(*shape[:-1], -1)
seg = seg[..., :x.shape[-1]]
return seg
class CAMDenseTDNNLayer(nn.Module):
def __init__(self,
in_channels,
out_channels,
bn_channels,
kernel_size,
stride=1,
dilation=1,
bias=False,
config_str='batchnorm-relu',
memory_efficient=False):
super(CAMDenseTDNNLayer, self).__init__()
assert kernel_size % 2 == 1, 'Expect equal paddings, but got even kernel size ({})'.format(
kernel_size)
padding = (kernel_size - 1) // 2 * dilation
self.memory_efficient = memory_efficient
self.nonlinear1 = get_nonlinear(config_str, in_channels)
self.linear1 = nn.Conv1d(in_channels, bn_channels, 1, bias=False)
self.nonlinear2 = get_nonlinear(config_str, bn_channels)
self.cam_layer = CAMLayer(bn_channels,
out_channels,
kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
bias=bias)
def bn_function(self, x):
return self.linear1(self.nonlinear1(x))
def forward(self, x):
if self.training and self.memory_efficient:
x = cp.checkpoint(self.bn_function, x)
else:
x = self.bn_function(x)
x = self.cam_layer(self.nonlinear2(x))
return x
class CAMDenseTDNNBlock(nn.ModuleList):
def __init__(self,
num_layers,
in_channels,
out_channels,
bn_channels,
kernel_size,
stride=1,
dilation=1,
bias=False,
config_str='batchnorm-relu',
memory_efficient=False):
super(CAMDenseTDNNBlock, self).__init__()
for i in range(num_layers):
layer = CAMDenseTDNNLayer(in_channels=in_channels + i * out_channels,
out_channels=out_channels,
bn_channels=bn_channels,
kernel_size=kernel_size,
stride=stride,
dilation=dilation,
bias=bias,
config_str=config_str,
memory_efficient=memory_efficient)
self.add_module('tdnnd%d' % (i + 1), layer)
def forward(self, x):
for layer in self:
x = torch.cat([x, layer(x)], dim=1)
return x
class TransitLayer(nn.Module):
def __init__(self,
in_channels,
out_channels,
bias=True,
config_str='batchnorm-relu'):
super(TransitLayer, self).__init__()
self.nonlinear = get_nonlinear(config_str, in_channels)
self.linear = nn.Conv1d(in_channels, out_channels, 1, bias=bias)
def forward(self, x):
x = self.nonlinear(x)
x = self.linear(x)
return x
class DenseLayer(nn.Module):
def __init__(self,
in_channels,
out_channels,
bias=False,
config_str='batchnorm-relu'):
super(DenseLayer, self).__init__()
self.linear = nn.Conv1d(in_channels, out_channels, 1, bias=bias)
self.nonlinear = get_nonlinear(config_str, out_channels)
def forward(self, x):
if len(x.shape) == 2:
x = self.linear(x.unsqueeze(dim=-1)).squeeze(dim=-1)
else:
x = self.linear(x)
x = self.nonlinear(x)
return x
class BasicResBlock(nn.Module):
expansion = 1
def __init__(self, in_planes, planes, stride=1):
super(BasicResBlock, self).__init__()
self.conv1 = nn.Conv2d(in_planes,
planes,
kernel_size=3,
stride=(stride, 1),
padding=1,
bias=False)
self.bn1 = nn.BatchNorm2d(planes)
self.conv2 = nn.Conv2d(planes,
planes,
kernel_size=3,
stride=1,
padding=1,
bias=False)
self.bn2 = nn.BatchNorm2d(planes)
self.shortcut = nn.Sequential()
if stride != 1 or in_planes != self.expansion * planes:
self.shortcut = nn.Sequential(
nn.Conv2d(in_planes,
self.expansion * planes,
kernel_size=1,
stride=(stride, 1),
bias=False),
nn.BatchNorm2d(self.expansion * planes))
def forward(self, x):
out = F.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
out += self.shortcut(x)
out = F.relu(out)
return out
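
A quick check of the TDNN building block defined above: with the `padding=-1` convention it derives symmetric padding from the kernel size, so a stride-1 layer preserves the frame count (sizes below are illustrative):

import torch

layer = TDNNLayer(in_channels=80, out_channels=128, kernel_size=5,
                  padding=-1, config_str='batchnorm-relu')
layer.eval()
out = layer(torch.randn(2, 80, 200))  # padding=(5-1)//2 keeps the 200 frames
print(out.shape)                      # torch.Size([2, 128, 200])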

View File

@@ -0,0 +1,643 @@
import math
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from munch import Munch
import json
import argparse
from torch.nn.parallel import DistributedDataParallel as DDP
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
return False
else:
raise argparse.ArgumentTypeError("Boolean value expected.")
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
def init_weights(m, mean=0.0, std=0.01):
classname = m.__class__.__name__
if classname.find("Conv") != -1:
m.weight.data.normal_(mean, std)
def get_padding(kernel_size, dilation=1):
return int((kernel_size * dilation - dilation) / 2)
def convert_pad_shape(pad_shape):
l = pad_shape[::-1]
pad_shape = [item for sublist in l for item in sublist]
return pad_shape
def intersperse(lst, item):
result = [item] * (len(lst) * 2 + 1)
result[1::2] = lst
return result
def kl_divergence(m_p, logs_p, m_q, logs_q):
"""KL(P||Q)"""
kl = (logs_q - logs_p) - 0.5
kl += (
0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q)
)
return kl
def rand_gumbel(shape):
"""Sample from the Gumbel distribution, protect from overflows."""
uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
return -torch.log(-torch.log(uniform_samples))
def rand_gumbel_like(x):
g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
return g
def slice_segments(x, ids_str, segment_size=4):
ret = torch.zeros_like(x[:, :, :segment_size])
for i in range(x.size(0)):
idx_str = ids_str[i]
idx_end = idx_str + segment_size
ret[i] = x[i, :, idx_str:idx_end]
return ret
def slice_segments_audio(x, ids_str, segment_size=4):
ret = torch.zeros_like(x[:, :segment_size])
for i in range(x.size(0)):
idx_str = ids_str[i]
idx_end = idx_str + segment_size
ret[i] = x[i, idx_str:idx_end]
return ret
def rand_slice_segments(x, x_lengths=None, segment_size=4):
b, d, t = x.size()
if x_lengths is None:
x_lengths = t
ids_str_max = x_lengths - segment_size + 1
ids_str = ((torch.rand([b]).to(device=x.device) * ids_str_max).clip(0)).to(
dtype=torch.long
)
ret = slice_segments(x, ids_str, segment_size)
return ret, ids_str
def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4):
position = torch.arange(length, dtype=torch.float)
num_timescales = channels // 2
log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / (
num_timescales - 1
)
inv_timescales = min_timescale * torch.exp(
torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment
)
scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
signal = F.pad(signal, [0, 0, 0, channels % 2])
signal = signal.view(1, channels, length)
return signal
def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
b, channels, length = x.size()
signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
return x + signal.to(dtype=x.dtype, device=x.device)
def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
b, channels, length = x.size()
signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
def subsequent_mask(length):
mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
return mask
@torch.jit.script
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
n_channels_int = n_channels[0]
in_act = input_a + input_b
# use torch.split to avoid dynamic slicing
t_act_part, s_act_part = torch.split(in_act, n_channels_int, dim=1)
t_act = torch.tanh(t_act_part)
s_act = torch.sigmoid(s_act_part)
acts = t_act * s_act
return acts
def convert_pad_shape(pad_shape):
l = pad_shape[::-1]
pad_shape = [item for sublist in l for item in sublist]
return pad_shape
def shift_1d(x):
x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
return x
def sequence_mask(length, max_length=None):
if max_length is None:
max_length = length.max()
x = torch.arange(max_length, dtype=length.dtype, device=length.device)
return x.unsqueeze(0) < length.unsqueeze(1)
def avg_with_mask(x, mask):
assert mask.dtype == torch.float, "Mask should be float"
if mask.ndim == 2:
mask = mask.unsqueeze(1)
if mask.shape[1] == 1:
mask = mask.expand_as(x)
return (x * mask).sum() / mask.sum()
def generate_path(duration, mask):
"""
duration: [b, 1, t_x]
mask: [b, 1, t_y, t_x]
"""
device = duration.device
b, _, t_y, t_x = mask.shape
cum_duration = torch.cumsum(duration, -1)
cum_duration_flat = cum_duration.view(b * t_x)
path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
path = path.view(b, t_x, t_y)
path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
path = path.unsqueeze(1).transpose(2, 3) * mask
return path
def clip_grad_value_(parameters, clip_value, norm_type=2):
if isinstance(parameters, torch.Tensor):
parameters = [parameters]
parameters = list(filter(lambda p: p.grad is not None, parameters))
norm_type = float(norm_type)
if clip_value is not None:
clip_value = float(clip_value)
total_norm = 0
for p in parameters:
param_norm = p.grad.data.norm(norm_type)
total_norm += param_norm.item() ** norm_type
if clip_value is not None:
p.grad.data.clamp_(min=-clip_value, max=clip_value)
total_norm = total_norm ** (1.0 / norm_type)
return total_norm
def log_norm(x, mean=-4, std=4, dim=2):
"""
normalized log mel -> mel -> norm -> log(norm)
"""
x = torch.log(torch.exp(x * std + mean).norm(dim=dim))
return x
def load_F0_models(path):
# load F0 model
from .JDC.model import JDCNet
F0_model = JDCNet(num_class=1, seq_len=192)
params = torch.load(path, map_location="cpu")["net"]
F0_model.load_state_dict(params)
_ = F0_model.train()
return F0_model
def modify_w2v_forward(self, output_layer=15):
"""
change forward method of w2v encoder to get its intermediate layer output
:param self:
:param layer:
:return:
"""
from transformers.modeling_outputs import BaseModelOutput
def forward(
hidden_states,
attention_mask=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
conv_attention_mask = attention_mask
if attention_mask is not None:
# make sure padded tokens output 0
hidden_states = hidden_states.masked_fill(
~attention_mask.bool().unsqueeze(-1), 0.0
)
# extend attention_mask
attention_mask = 1.0 - attention_mask[:, None, None, :].to(
dtype=hidden_states.dtype
)
attention_mask = attention_mask * torch.finfo(hidden_states.dtype).min
attention_mask = attention_mask.expand(
attention_mask.shape[0],
1,
attention_mask.shape[-1],
attention_mask.shape[-1],
)
hidden_states = self.dropout(hidden_states)
if self.embed_positions is not None:
relative_position_embeddings = self.embed_positions(hidden_states)
else:
relative_position_embeddings = None
deepspeed_zero3_is_enabled = False
for i, layer in enumerate(self.layers):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
# add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
dropout_probability = torch.rand([])
skip_the_layer = (
True
if self.training and (dropout_probability < self.config.layerdrop)
else False
)
if not skip_the_layer or deepspeed_zero3_is_enabled:
# under deepspeed zero3 all gpus must run in sync
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
layer.__call__,
hidden_states,
attention_mask,
relative_position_embeddings,
output_attentions,
conv_attention_mask,
)
else:
layer_outputs = layer(
hidden_states,
attention_mask=attention_mask,
relative_position_embeddings=relative_position_embeddings,
output_attentions=output_attentions,
conv_attention_mask=conv_attention_mask,
)
hidden_states = layer_outputs[0]
if skip_the_layer:
layer_outputs = (None, None)
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
if i == output_layer - 1:
break
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(
v
for v in [hidden_states, all_hidden_states, all_self_attentions]
if v is not None
)
return BaseModelOutput(
last_hidden_state=hidden_states,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
)
return forward
MATPLOTLIB_FLAG = False
def plot_spectrogram_to_numpy(spectrogram):
global MATPLOTLIB_FLAG
if not MATPLOTLIB_FLAG:
import matplotlib
import logging
matplotlib.use("Agg")
MATPLOTLIB_FLAG = True
mpl_logger = logging.getLogger("matplotlib")
mpl_logger.setLevel(logging.WARNING)
import matplotlib.pylab as plt
import numpy as np
fig, ax = plt.subplots(figsize=(10, 2))
im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
plt.colorbar(im, ax=ax)
plt.xlabel("Frames")
plt.ylabel("Channels")
plt.tight_layout()
fig.canvas.draw()
data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep="")
data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
plt.close()
return data
def normalize_f0(f0_sequence):
# Remove unvoiced frames (replace with -1)
voiced_indices = np.where(f0_sequence > 0)[0]
f0_voiced = f0_sequence[voiced_indices]
# Convert to log scale
log_f0 = np.log2(f0_voiced)
# Calculate mean and standard deviation
mean_f0 = np.mean(log_f0)
std_f0 = np.std(log_f0)
# Normalize the F0 sequence
normalized_f0 = (log_f0 - mean_f0) / std_f0
# Create the normalized F0 sequence with unvoiced frames
normalized_sequence = np.zeros_like(f0_sequence)
normalized_sequence[voiced_indices] = normalized_f0
normalized_sequence[f0_sequence <= 0] = -1 # Assign -1 to unvoiced frames
return normalized_sequence
class MyModel(nn.Module):
def __init__(self,args, use_emovec=False, use_gpt_latent=False):
super(MyModel, self).__init__()
from indextts.s2mel.modules.flow_matching import CFM
from indextts.s2mel.modules.length_regulator import InterpolateRegulator
length_regulator = InterpolateRegulator(
channels=args.length_regulator.channels,
sampling_ratios=args.length_regulator.sampling_ratios,
is_discrete=args.length_regulator.is_discrete,
in_channels=args.length_regulator.in_channels if hasattr(args.length_regulator, "in_channels") else None,
vector_quantize=args.length_regulator.vector_quantize if hasattr(args.length_regulator, "vector_quantize") else False,
codebook_size=args.length_regulator.content_codebook_size,
n_codebooks=args.length_regulator.n_codebooks if hasattr(args.length_regulator, "n_codebooks") else 1,
quantizer_dropout=args.length_regulator.quantizer_dropout if hasattr(args.length_regulator, "quantizer_dropout") else 0.0,
f0_condition=args.length_regulator.f0_condition if hasattr(args.length_regulator, "f0_condition") else False,
n_f0_bins=args.length_regulator.n_f0_bins if hasattr(args.length_regulator, "n_f0_bins") else 512,
)
if use_gpt_latent:
self.models = nn.ModuleDict({
'cfm': CFM(args),
'length_regulator': length_regulator,
'gpt_layer': torch.nn.Sequential(torch.nn.Linear(1280, 256), torch.nn.Linear(256, 128), torch.nn.Linear(128, 1024))
})
else:
self.models = nn.ModuleDict({
'cfm': CFM(args),
'length_regulator': length_regulator
})
def forward(self, x, target_lengths, prompt_len, cond, y):
x = self.models['cfm'](x, target_lengths, prompt_len, cond, y)
return x
def forward2(self, S_ori,target_lengths,F0_ori):
x = self.models['length_regulator'](S_ori, ylens=target_lengths, f0=F0_ori)
return x
def forward_emovec(self, x):
x = self.models['emo_layer'](x)
return x
def forward_emo_encoder(self, x):
x = self.models['emo_encoder'](x)
return x
def forward_gpt(self,x):
x = self.models['gpt_layer'](x)
return x
def enable_torch_compile(self):
"""Enable torch.compile optimization.
This method applies torch.compile to the model for significant
performance improvements during inference.
"""
if 'cfm' in self.models:
self.models['cfm'].enable_torch_compile()
def build_model(args, stage="DiT"):
if stage == "DiT":
from modules.flow_matching import CFM
from modules.length_regulator import InterpolateRegulator
length_regulator = InterpolateRegulator(
channels=args.length_regulator.channels,
sampling_ratios=args.length_regulator.sampling_ratios,
is_discrete=args.length_regulator.is_discrete,
in_channels=args.length_regulator.in_channels if hasattr(args.length_regulator, "in_channels") else None,
vector_quantize=args.length_regulator.vector_quantize if hasattr(args.length_regulator, "vector_quantize") else False,
codebook_size=args.length_regulator.content_codebook_size,
n_codebooks=args.length_regulator.n_codebooks if hasattr(args.length_regulator, "n_codebooks") else 1,
quantizer_dropout=args.length_regulator.quantizer_dropout if hasattr(args.length_regulator, "quantizer_dropout") else 0.0,
f0_condition=args.length_regulator.f0_condition if hasattr(args.length_regulator, "f0_condition") else False,
n_f0_bins=args.length_regulator.n_f0_bins if hasattr(args.length_regulator, "n_f0_bins") else 512,
)
cfm = CFM(args)
nets = Munch(
cfm=cfm,
length_regulator=length_regulator,
)
elif stage == 'codec':
from dac.model.dac import Encoder
from modules.quantize import (
FAquantizer,
)
encoder = Encoder(
d_model=args.DAC.encoder_dim,
strides=args.DAC.encoder_rates,
d_latent=1024,
causal=args.causal,
lstm=args.lstm,
)
quantizer = FAquantizer(
in_dim=1024,
n_p_codebooks=1,
n_c_codebooks=args.n_c_codebooks,
n_t_codebooks=2,
n_r_codebooks=3,
codebook_size=1024,
codebook_dim=8,
quantizer_dropout=0.5,
causal=args.causal,
separate_prosody_encoder=args.separate_prosody_encoder,
timbre_norm=args.timbre_norm,
)
nets = Munch(
encoder=encoder,
quantizer=quantizer,
)
elif stage == "mel_vocos":
from modules.vocos import Vocos
decoder = Vocos(args)
nets = Munch(
decoder=decoder,
)
else:
raise ValueError(f"Unknown stage: {stage}")
return nets
def load_checkpoint(
model,
optimizer,
path,
load_only_params=True,
ignore_modules=[],
is_distributed=False,
load_ema=False,
):
state = torch.load(path, map_location="cpu")
params = state["net"]
if load_ema and "ema" in state:
print("Loading EMA")
for key in model:
i = 0
for param_name in params[key]:
if "input_pos" in param_name:
continue
assert params[key][param_name].shape == state["ema"][key][0][i].shape
params[key][param_name] = state["ema"][key][0][i].clone()
i += 1
for key in model:
if key in params and key not in ignore_modules:
if not is_distributed:
# strip prefix of DDP (module.), create a new OrderedDict that does not contain the prefix
for k in list(params[key].keys()):
if k.startswith("module."):
params[key][k[len("module.") :]] = params[key][k]
del params[key][k]
model_state_dict = model[key].state_dict()
# keep only key/value pairs whose shapes match the model's state dict
filtered_state_dict = {
k: v
for k, v in params[key].items()
if k in model_state_dict and v.shape == model_state_dict[k].shape
}
skipped_keys = set(params[key].keys()) - set(filtered_state_dict.keys())
if skipped_keys:
print(
f"Warning: Skipped loading some keys due to shape mismatch: {skipped_keys}"
)
print("%s loaded" % key)
model[key].load_state_dict(filtered_state_dict, strict=False)
_ = [model[key].eval() for key in model]
if not load_only_params:
epoch = state["epoch"] + 1
iters = state["iters"]
optimizer.load_state_dict(state["optimizer"])
optimizer.load_scheduler_state_dict(state["scheduler"])
else:
epoch = 0
iters = 0
return model, optimizer, epoch, iters
def load_checkpoint2(
model,
optimizer,
path,
load_only_params=True,
ignore_modules=[],
is_distributed=False,
load_ema=False,
):
state = torch.load(path, map_location="cpu")
params = state["net"]
if load_ema and "ema" in state:
print("Loading EMA")
for key in model.models:
i = 0
for param_name in params[key]:
if "input_pos" in param_name:
continue
assert params[key][param_name].shape == state["ema"][key][0][i].shape
params[key][param_name] = state["ema"][key][0][i].clone()
i += 1
for key in model.models:
if key in params and key not in ignore_modules:
if not is_distributed:
# strip prefix of DDP (module.), create a new OrderedDict that does not contain the prefix
for k in list(params[key].keys()):
if k.startswith("module."):
params[key][k[len("module.") :]] = params[key][k]
del params[key][k]
model_state_dict = model.models[key].state_dict()
# keep only key/value pairs whose shapes match the model's state dict
filtered_state_dict = {
k: v
for k, v in params[key].items()
if k in model_state_dict and v.shape == model_state_dict[k].shape
}
skipped_keys = set(params[key].keys()) - set(filtered_state_dict.keys())
if skipped_keys:
print(
f"Warning: Skipped loading some keys due to shape mismatch: {skipped_keys}"
)
print("%s loaded" % key)
model.models[key].load_state_dict(filtered_state_dict, strict=False)
model.eval()
# _ = [model[key].eval() for key in model]
if not load_only_params:
epoch = state["epoch"] + 1
iters = state["iters"]
optimizer.load_state_dict(state["optimizer"])
optimizer.load_scheduler_state_dict(state["scheduler"])
else:
epoch = 0
iters = 0
return model, optimizer, epoch, iters
def recursive_munch(d):
if isinstance(d, dict):
return Munch((k, recursive_munch(v)) for k, v in d.items())
elif isinstance(d, list):
return [recursive_munch(v) for v in d]
else:
return d
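
Among the helpers above, `generate_path` expands per-token durations into a hard monotonic alignment; a tiny worked example with made-up durations shows the expected layout:

import torch

dur = torch.tensor([[[2., 1., 3.]]])             # (b=1, 1, t_x=3) token durations
t_y = int(dur.sum())                             # 6 output frames
attn_mask = torch.ones(1, 1, t_y, dur.size(-1))  # (b, 1, t_y, t_x)
path = generate_path(dur, attn_mask)
print(path[0, 0].long())
# tensor([[1, 0, 0],
#         [1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1],
#         [0, 0, 1],
#         [0, 0, 1]])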

View File

@@ -0,0 +1,257 @@
import torch
from torch import nn
import math
from indextts.s2mel.modules.gpt_fast.model import ModelArgs, Transformer
from indextts.s2mel.modules.wavenet import WN
from indextts.s2mel.modules.commons import sequence_mask
from torch.nn.utils import weight_norm
def modulate(x, shift, scale):
return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
#################################################################################
# Embedding Layers for Timesteps and Class Labels #
#################################################################################
class TimestepEmbedder(nn.Module):
"""
Embeds scalar timesteps into vector representations.
"""
def __init__(self, hidden_size, frequency_embedding_size=256):
super().__init__()
self.mlp = nn.Sequential(
nn.Linear(frequency_embedding_size, hidden_size, bias=True),
nn.SiLU(),
nn.Linear(hidden_size, hidden_size, bias=True),
)
self.frequency_embedding_size = frequency_embedding_size
self.max_period = 10000
self.scale = 1000
half = frequency_embedding_size // 2
freqs = torch.exp(
-math.log(self.max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
)
self.register_buffer("freqs", freqs)
def timestep_embedding(self, t):
"""
Create sinusoidal timestep embeddings.
:param t: a 1-D Tensor of N indices, one per batch element.
These may be fractional.
:param dim: the dimension of the output.
:param max_period: controls the minimum frequency of the embeddings.
:return: an (N, D) Tensor of positional embeddings.
"""
# https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
args = self.scale * t[:, None].float() * self.freqs[None]
embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
if self.frequency_embedding_size % 2:
embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
return embedding
def forward(self, t):
t_freq = self.timestep_embedding(t)
t_emb = self.mlp(t_freq)
return t_emb
class StyleEmbedder(nn.Module):
"""
Embeds class labels into vector representations. Also handles label dropout for classifier-free guidance.
"""
def __init__(self, input_size, hidden_size, dropout_prob):
super().__init__()
use_cfg_embedding = dropout_prob > 0
self.embedding_table = nn.Embedding(int(use_cfg_embedding), hidden_size)
self.style_in = weight_norm(nn.Linear(input_size, hidden_size, bias=True))
self.input_size = input_size
self.dropout_prob = dropout_prob
def forward(self, labels, train, force_drop_ids=None):
use_dropout = self.dropout_prob > 0
if (train and use_dropout) or (force_drop_ids is not None):
labels = self.token_drop(labels, force_drop_ids)
else:
labels = self.style_in(labels)
embeddings = labels
return embeddings
class FinalLayer(nn.Module):
"""
The final layer of DiT.
"""
def __init__(self, hidden_size, patch_size, out_channels):
super().__init__()
self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
self.linear = weight_norm(nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True))
self.adaLN_modulation = nn.Sequential(
nn.SiLU(),
nn.Linear(hidden_size, 2 * hidden_size, bias=True)
)
def forward(self, x, c):
shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
x = modulate(self.norm_final(x), shift, scale)
x = self.linear(x)
return x
class DiT(torch.nn.Module):
def __init__(
self,
args
):
super(DiT, self).__init__()
self.time_as_token = args.DiT.time_as_token if hasattr(args.DiT, 'time_as_token') else False
self.style_as_token = args.DiT.style_as_token if hasattr(args.DiT, 'style_as_token') else False
self.uvit_skip_connection = args.DiT.uvit_skip_connection if hasattr(args.DiT, 'uvit_skip_connection') else False
model_args = ModelArgs(
block_size=16384,#args.DiT.block_size,
n_layer=args.DiT.depth,
n_head=args.DiT.num_heads,
dim=args.DiT.hidden_dim,
head_dim=args.DiT.hidden_dim // args.DiT.num_heads,
vocab_size=1024,
uvit_skip_connection=self.uvit_skip_connection,
time_as_token=self.time_as_token,
)
self.transformer = Transformer(model_args)
self.in_channels = args.DiT.in_channels
self.out_channels = args.DiT.in_channels
self.num_heads = args.DiT.num_heads
self.x_embedder = weight_norm(nn.Linear(args.DiT.in_channels, args.DiT.hidden_dim, bias=True))
self.content_type = args.DiT.content_type # 'discrete' or 'continuous'
self.content_codebook_size = args.DiT.content_codebook_size # for discrete content
self.content_dim = args.DiT.content_dim # for continuous content
self.cond_embedder = nn.Embedding(args.DiT.content_codebook_size, args.DiT.hidden_dim) # discrete content
self.cond_projection = nn.Linear(args.DiT.content_dim, args.DiT.hidden_dim, bias=True) # continuous content
self.is_causal = args.DiT.is_causal
self.t_embedder = TimestepEmbedder(args.DiT.hidden_dim)
# self.style_embedder1 = weight_norm(nn.Linear(1024, args.DiT.hidden_dim, bias=True))
# self.style_embedder2 = weight_norm(nn.Linear(1024, args.style_encoder.dim, bias=True))
input_pos = torch.arange(16384)
self.register_buffer("input_pos", input_pos)
self.final_layer_type = args.DiT.final_layer_type # mlp or wavenet
if self.final_layer_type == 'wavenet':
self.t_embedder2 = TimestepEmbedder(args.wavenet.hidden_dim)
self.conv1 = nn.Linear(args.DiT.hidden_dim, args.wavenet.hidden_dim)
self.conv2 = nn.Conv1d(args.wavenet.hidden_dim, args.DiT.in_channels, 1)
self.wavenet = WN(hidden_channels=args.wavenet.hidden_dim,
kernel_size=args.wavenet.kernel_size,
dilation_rate=args.wavenet.dilation_rate,
n_layers=args.wavenet.num_layers,
gin_channels=args.wavenet.hidden_dim,
p_dropout=args.wavenet.p_dropout,
causal=False)
self.final_layer = FinalLayer(args.wavenet.hidden_dim, 1, args.wavenet.hidden_dim)
self.res_projection = nn.Linear(args.DiT.hidden_dim,
args.wavenet.hidden_dim)  # residual connection from transformer output to final output
self.wavenet_style_condition = args.wavenet.style_condition
assert args.DiT.style_condition == args.wavenet.style_condition
else:
self.final_mlp = nn.Sequential(
nn.Linear(args.DiT.hidden_dim, args.DiT.hidden_dim),
nn.SiLU(),
nn.Linear(args.DiT.hidden_dim, args.DiT.in_channels),
)
self.transformer_style_condition = args.DiT.style_condition
self.class_dropout_prob = args.DiT.class_dropout_prob
self.content_mask_embedder = nn.Embedding(1, args.DiT.hidden_dim)
self.long_skip_connection = args.DiT.long_skip_connection
self.skip_linear = nn.Linear(args.DiT.hidden_dim + args.DiT.in_channels, args.DiT.hidden_dim)
self.cond_x_merge_linear = nn.Linear(args.DiT.hidden_dim + args.DiT.in_channels * 2 +
args.style_encoder.dim * self.transformer_style_condition * (not self.style_as_token),
args.DiT.hidden_dim)
if self.style_as_token:
self.style_in = nn.Linear(args.style_encoder.dim, args.DiT.hidden_dim)
def setup_caches(self, max_batch_size, max_seq_length):
self.transformer.setup_caches(max_batch_size, max_seq_length, use_kv_cache=False)
def forward(self, x, prompt_x, x_lens, t, style, cond, mask_content=False):
"""
x (torch.Tensor): random noise
prompt_x (torch.Tensor): reference mel + zero mel
shape: (batch_size, 80, 795+1068)
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
t (torch.Tensor): diffusion timestep
shape: (batch_size)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
cond (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
"""
class_dropout = False
if self.training and torch.rand(1) < self.class_dropout_prob:
class_dropout = True
if not self.training and mask_content:
class_dropout = True
# cond_in_module = self.cond_embedder if self.content_type == 'discrete' else self.cond_projection
cond_in_module = self.cond_projection
B, _, T = x.size()
t1 = self.t_embedder(t) # (N, D) # t1 [2, 512]
cond = cond_in_module(cond) # cond [2,1863,512]->[2,1863,512]
x = x.transpose(1, 2) # [2,1863,80]
prompt_x = prompt_x.transpose(1, 2) # [2,1863,80]
x_in = torch.cat([x, prompt_x, cond], dim=-1) # 80+80+512=672 [2, 1863, 672]
if self.transformer_style_condition and not self.style_as_token: # True and True
x_in = torch.cat([x_in, style[:, None, :].repeat(1, T, 1)], dim=-1) #[2, 1863, 864]
if class_dropout: #False
x_in[..., self.in_channels:] = x_in[..., self.in_channels:] * 0  # zero out everything after the first in_channels (80) dims
x_in = self.cond_x_merge_linear(x_in) # (N, T, D) [2, 1863, 512]
if self.style_as_token: # False
style = self.style_in(style)
style = torch.zeros_like(style) if class_dropout else style
x_in = torch.cat([style.unsqueeze(1), x_in], dim=1)
if self.time_as_token: # False
x_in = torch.cat([t1.unsqueeze(1), x_in], dim=1)
x_mask = sequence_mask(x_lens + self.style_as_token + self.time_as_token, max_length=x_in.size(1)).to(x.device).unsqueeze(1)  # e.g. torch.Size([1, 1, 1863]), all True
input_pos = self.input_pos[:x_in.size(1)]  # (T,), positions 0..T-1
x_mask_expanded = x_mask[:, None, :].repeat(1, 1, x_in.size(1), 1) if not self.is_causal else None  # e.g. torch.Size([1, 1, 1863, 1863])
x_res = self.transformer(x_in, t1.unsqueeze(1), input_pos, x_mask_expanded) # [2, 1863, 512]
x_res = x_res[:, 1:] if self.time_as_token else x_res
x_res = x_res[:, 1:] if self.style_as_token else x_res
if self.long_skip_connection: #True
x_res = self.skip_linear(torch.cat([x_res, x], dim=-1))
if self.final_layer_type == 'wavenet':
x = self.conv1(x_res)
x = x.transpose(1, 2)
t2 = self.t_embedder2(t)
x = self.wavenet(x, x_mask, g=t2.unsqueeze(2)).transpose(1, 2) + self.res_projection(
x_res) # long residual connection
x = self.final_layer(x, t1).transpose(1, 2)
x = self.conv2(x)
else:
x = self.final_mlp(x_res)
x = x.transpose(1, 2)
# x [2,80,1863]
return x
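
`TimestepEmbedder` above maps a scalar diffusion time to a sinusoidal feature vector followed by an MLP; a minimal sketch with illustrative sizes:

import torch

emb = TimestepEmbedder(hidden_size=512, frequency_embedding_size=256)
t = torch.rand(2)   # two scalar timesteps in [0, 1)
out = emb(t)        # (2, 512): cos/sin features of scale*t passed through the MLP
print(out.shape)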

View File

@@ -0,0 +1,292 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Convolutional layers wrappers and utilities."""
import math
import typing as tp
import warnings
import torch
from torch import nn
from torch.nn import functional as F
from torch.nn.utils import spectral_norm, weight_norm
import typing as tp
import einops
class ConvLayerNorm(nn.LayerNorm):
"""
Convolution-friendly LayerNorm that moves channels to last dimensions
before running the normalization and moves them back to original position right after.
"""
def __init__(self, normalized_shape: tp.Union[int, tp.List[int], torch.Size], **kwargs):
super().__init__(normalized_shape, **kwargs)
def forward(self, x):
x = einops.rearrange(x, 'b ... t -> b t ...')
x = super().forward(x)
x = einops.rearrange(x, 'b t ... -> b ... t')
return x
CONV_NORMALIZATIONS = frozenset(['none', 'weight_norm', 'spectral_norm',
'time_layer_norm', 'layer_norm', 'time_group_norm'])
def apply_parametrization_norm(module: nn.Module, norm: str = 'none') -> nn.Module:
assert norm in CONV_NORMALIZATIONS
if norm == 'weight_norm':
return weight_norm(module)
elif norm == 'spectral_norm':
return spectral_norm(module)
else:
# We already checked that norm is in CONV_NORMALIZATIONS, so any other choice
# doesn't need reparametrization.
return module
def get_norm_module(module: nn.Module, causal: bool = False, norm: str = 'none', **norm_kwargs) -> nn.Module:
"""Return the proper normalization module. If causal is True, this will ensure the returned
module is causal, or return an error if the normalization doesn't support causal evaluation.
"""
assert norm in CONV_NORMALIZATIONS
if norm == 'layer_norm':
assert isinstance(module, nn.modules.conv._ConvNd)
return ConvLayerNorm(module.out_channels, **norm_kwargs)
elif norm == 'time_group_norm':
if causal:
raise ValueError("GroupNorm doesn't support causal evaluation.")
assert isinstance(module, nn.modules.conv._ConvNd)
return nn.GroupNorm(1, module.out_channels, **norm_kwargs)
else:
return nn.Identity()
def get_extra_padding_for_conv1d(x: torch.Tensor, kernel_size: int, stride: int,
padding_total: int = 0) -> int:
"""See `pad_for_conv1d`.
"""
length = x.shape[-1]
n_frames = (length - kernel_size + padding_total) / stride + 1
ideal_length = (math.ceil(n_frames) - 1) * stride + (kernel_size - padding_total)
return ideal_length - length
def pad_for_conv1d(x: torch.Tensor, kernel_size: int, stride: int, padding_total: int = 0):
"""Pad for a convolution to make sure that the last window is full.
Extra padding is added at the end. This is required to ensure that we can rebuild
an output of the same length, as otherwise, even with padding, some time steps
might get removed.
For instance, with total padding = 4, kernel size = 4, stride = 2:
0 0 1 2 3 4 5 0 0 # (0s are padding)
1 2 3 # (output frames of a convolution, last 0 is never used)
0 0 1 2 3 4 5 0 # (output of tr. conv., but pos. 5 is going to get removed as padding)
1 2 3 4 # once you removed padding, we are missing one time step !
"""
extra_padding = get_extra_padding_for_conv1d(x, kernel_size, stride, padding_total)
return F.pad(x, (0, extra_padding))
def pad1d(x: torch.Tensor, paddings: tp.Tuple[int, int], mode: str = 'zero', value: float = 0.):
"""Tiny wrapper around F.pad, just to allow for reflect padding on small input.
If this is the case, we insert extra 0 padding to the right before the reflection happens.
"""
length = x.shape[-1]
padding_left, padding_right = paddings
assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
if mode == 'reflect':
max_pad = max(padding_left, padding_right)
extra_pad = 0
if length <= max_pad:
extra_pad = max_pad - length + 1
x = F.pad(x, (0, extra_pad))
padded = F.pad(x, paddings, mode, value)
end = padded.shape[-1] - extra_pad
return padded[..., :end]
else:
return F.pad(x, paddings, mode, value)
def unpad1d(x: torch.Tensor, paddings: tp.Tuple[int, int]):
"""Remove padding from x, handling properly zero padding. Only for 1d!"""
padding_left, padding_right = paddings
assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
assert (padding_left + padding_right) <= x.shape[-1]
end = x.shape[-1] - padding_right
return x[..., padding_left: end]
class NormConv1d(nn.Module):
"""Wrapper around Conv1d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, causal: bool = False, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.conv = apply_parametrization_norm(nn.Conv1d(*args, **kwargs), norm)
self.norm = get_norm_module(self.conv, causal, norm, **norm_kwargs)
self.norm_type = norm
def forward(self, x):
x = self.conv(x)
x = self.norm(x)
return x
class NormConv2d(nn.Module):
"""Wrapper around Conv2d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.conv = apply_parametrization_norm(nn.Conv2d(*args, **kwargs), norm)
self.norm = get_norm_module(self.conv, causal=False, norm=norm, **norm_kwargs)
self.norm_type = norm
def forward(self, x):
x = self.conv(x)
x = self.norm(x)
return x
class NormConvTranspose1d(nn.Module):
"""Wrapper around ConvTranspose1d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, causal: bool = False, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.convtr = apply_parametrization_norm(nn.ConvTranspose1d(*args, **kwargs), norm)
self.norm = get_norm_module(self.convtr, causal, norm, **norm_kwargs)
self.norm_type = norm
def forward(self, x):
x = self.convtr(x)
x = self.norm(x)
return x
class NormConvTranspose2d(nn.Module):
"""Wrapper around ConvTranspose2d and normalization applied to this conv
to provide a uniform interface across normalization approaches.
"""
def __init__(self, *args, norm: str = 'none',
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.convtr = apply_parametrization_norm(nn.ConvTranspose2d(*args, **kwargs), norm)
self.norm = get_norm_module(self.convtr, causal=False, norm=norm, **norm_kwargs)
def forward(self, x):
x = self.convtr(x)
x = self.norm(x)
return x
class SConv1d(nn.Module):
"""Conv1d with some builtin handling of asymmetric or causal padding
and normalization.
"""
def __init__(self, in_channels: int, out_channels: int,
kernel_size: int, stride: int = 1, dilation: int = 1,
groups: int = 1, bias: bool = True, causal: bool = False,
norm: str = 'none', norm_kwargs: tp.Dict[str, tp.Any] = {},
pad_mode: str = 'reflect', **kwargs):
super().__init__()
# warn user on unusual setup between dilation and stride
if stride > 1 and dilation > 1:
warnings.warn('SConv1d has been initialized with stride > 1 and dilation > 1'
f' (kernel_size={kernel_size} stride={stride}, dilation={dilation}).')
self.conv = NormConv1d(in_channels, out_channels, kernel_size, stride,
dilation=dilation, groups=groups, bias=bias, causal=causal,
norm=norm, norm_kwargs=norm_kwargs)
self.causal = causal
self.pad_mode = pad_mode
def forward(self, x):
B, C, T = x.shape
kernel_size = self.conv.conv.kernel_size[0]
stride = self.conv.conv.stride[0]
dilation = self.conv.conv.dilation[0]
kernel_size = (kernel_size - 1) * dilation + 1 # effective kernel size with dilations
padding_total = kernel_size - stride
extra_padding = get_extra_padding_for_conv1d(x, kernel_size, stride, padding_total)
if self.causal:
# Left padding for causal
x = pad1d(x, (padding_total, extra_padding), mode=self.pad_mode)
else:
# Asymmetric padding required for odd strides
padding_right = padding_total // 2
padding_left = padding_total - padding_right
x = pad1d(x, (padding_left, padding_right + extra_padding), mode=self.pad_mode)
return self.conv(x)
class SConvTranspose1d(nn.Module):
"""ConvTranspose1d with some builtin handling of asymmetric or causal padding
and normalization.
"""
def __init__(self, in_channels: int, out_channels: int,
kernel_size: int, stride: int = 1, causal: bool = False,
norm: str = 'none', trim_right_ratio: float = 1.,
norm_kwargs: tp.Dict[str, tp.Any] = {}, **kwargs):
super().__init__()
self.convtr = NormConvTranspose1d(in_channels, out_channels, kernel_size, stride,
causal=causal, norm=norm, norm_kwargs=norm_kwargs)
self.causal = causal
self.trim_right_ratio = trim_right_ratio
assert self.causal or self.trim_right_ratio == 1., \
"`trim_right_ratio` != 1.0 only makes sense for causal convolutions"
assert self.trim_right_ratio >= 0. and self.trim_right_ratio <= 1.
def forward(self, x):
kernel_size = self.convtr.convtr.kernel_size[0]
stride = self.convtr.convtr.stride[0]
padding_total = kernel_size - stride
y = self.convtr(x)
# We will only trim fixed padding. Extra padding from `pad_for_conv1d` would be
# removed at the very end, when keeping only the right length for the output,
# as removing it here would require also passing the length at the matching layer
# in the encoder.
if self.causal:
# Trim the padding on the right according to the specified ratio
# if trim_right_ratio = 1.0, trim everything from right
padding_right = math.ceil(padding_total * self.trim_right_ratio)
padding_left = padding_total - padding_right
y = unpad1d(y, (padding_left, padding_right))
else:
# Asymmetric padding required for odd strides
padding_right = padding_total // 2
padding_left = padding_total - padding_right
y = unpad1d(y, (padding_left, padding_right))
return y
class SLSTM(nn.Module):
"""
LSTM without worrying about the hidden state, nor the layout of the data.
Expects input as convolutional layout.
"""
def __init__(self, dimension: int, num_layers: int = 2, skip: bool = True):
super().__init__()
self.skip = skip
self.lstm = nn.LSTM(dimension, dimension, num_layers)
self.hidden = None
def forward(self, x):
x = x.permute(2, 0, 1)
if self.training:
y, _ = self.lstm(x)
else:
y, self.hidden = self.lstm(x, self.hidden)
if self.skip:
y = y + x
y = y.permute(1, 2, 0)
return y
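
`SConv1d` above folds the padding bookkeeping into the module; with `causal=True` all padding goes to the left, so a stride-1 convolution preserves the sequence length. A small sketch with assumed sizes:

import torch

conv = SConv1d(in_channels=1, out_channels=8, kernel_size=7,
               stride=1, causal=True, norm='weight_norm')
y = conv(torch.randn(1, 1, 50))  # 6 frames of left padding, so the output stays (1, 8, 50)
print(y.shape)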

View File

@@ -0,0 +1,186 @@
from abc import ABC
import torch
import torch.nn.functional as F
from indextts.s2mel.modules.diffusion_transformer import DiT
from indextts.s2mel.modules.commons import sequence_mask
from tqdm import tqdm
class BASECFM(torch.nn.Module, ABC):
def __init__(
self,
args,
):
super().__init__()
self.sigma_min = 1e-6
self.estimator = None
self.in_channels = args.DiT.in_channels
self.criterion = torch.nn.MSELoss() if args.reg_loss_type == "l2" else torch.nn.L1Loss()
if hasattr(args.DiT, 'zero_prompt_speech_token'):
self.zero_prompt_speech_token = args.DiT.zero_prompt_speech_token
else:
self.zero_prompt_speech_token = False
@torch.inference_mode()
def inference(self, mu, x_lens, prompt, style, f0, n_timesteps, temperature=1.0, inference_cfg_rate=0.5):
"""Forward diffusion
Args:
mu (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
prompt (torch.Tensor): reference mel
shape: (batch_size, 80, 795)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
f0: None
n_timesteps (int): number of diffusion steps
temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
Returns:
sample: generated mel-spectrogram
shape: (batch_size, 80, mel_timesteps)
"""
B, T = mu.size(0), mu.size(1)
z = torch.randn([B, self.in_channels, T], device=mu.device) * temperature
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device)
# t_span = t_span + (-1) * (torch.cos(torch.pi / 2 * t_span) - 1 + t_span)
return self.solve_euler(z, x_lens, prompt, mu, style, f0, t_span, inference_cfg_rate)
def solve_euler(self, x, x_lens, prompt, mu, style, f0, t_span, inference_cfg_rate=0.5):
"""
Fixed-step Euler solver for the flow ODE.
Args:
x (torch.Tensor): random noise
t_span (torch.Tensor): n_timesteps interpolated
shape: (n_timesteps + 1,)
mu (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
prompt (torch.Tensor): reference mel
shape: (batch_size, 80, 795)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
"""
t, _, _ = t_span[0], t_span[-1], t_span[1] - t_span[0]
# Intermediate solutions are stored so they can be inspected (e.g. from a debugger);
# a return_all_steps flag may be added in the future.
sol = []
# apply prompt
prompt_len = prompt.size(-1)
prompt_x = torch.zeros_like(x)
prompt_x[..., :prompt_len] = prompt[..., :prompt_len]
x[..., :prompt_len] = 0
if self.zero_prompt_speech_token:
mu[..., :prompt_len] = 0
for step in tqdm(range(1, len(t_span))):
dt = t_span[step] - t_span[step - 1]
if inference_cfg_rate > 0:
# Stack original and CFG (null) inputs for batched processing
stacked_prompt_x = torch.cat([prompt_x, torch.zeros_like(prompt_x)], dim=0)
stacked_style = torch.cat([style, torch.zeros_like(style)], dim=0)
stacked_mu = torch.cat([mu, torch.zeros_like(mu)], dim=0)
stacked_x = torch.cat([x, x], dim=0)
stacked_t = torch.cat([t.unsqueeze(0), t.unsqueeze(0)], dim=0)
# Perform a single forward pass for both original and CFG inputs
stacked_dphi_dt = self.estimator(
stacked_x, stacked_prompt_x, x_lens, stacked_t, stacked_style, stacked_mu,
)
# Split the output back into the original and CFG components
dphi_dt, cfg_dphi_dt = stacked_dphi_dt.chunk(2, dim=0)
# Apply CFG formula
dphi_dt = (1.0 + inference_cfg_rate) * dphi_dt - inference_cfg_rate * cfg_dphi_dt
else:
dphi_dt = self.estimator(x, prompt_x, x_lens, t.unsqueeze(0), style, mu)
x = x + dt * dphi_dt
t = t + dt
sol.append(x)
if step < len(t_span) - 1:
dt = t_span[step + 1] - t
x[:, :, :prompt_len] = 0
return sol[-1]
def forward(self, x1, x_lens, prompt_lens, mu, style):
"""Computes diffusion loss
Args:
mu (torch.Tensor): semantic info of reference audio and altered audio
shape: (batch_size, mel_timesteps(795+1069), 512)
x1: mel
x_lens (torch.Tensor): mel frames output
shape: (batch_size, mel_timesteps)
prompt (torch.Tensor): reference mel
shape: (batch_size, 80, 795)
style (torch.Tensor): reference global style
shape: (batch_size, 192)
Returns:
loss: conditional flow matching loss
y: conditional flow
shape: (batch_size, n_feats, mel_timesteps)
"""
b, _, t = x1.shape
# random timestep
t = torch.rand([b, 1, 1], device=mu.device, dtype=x1.dtype)
# sample noise p(x_0)
z = torch.randn_like(x1)
y = (1 - (1 - self.sigma_min) * t) * z + t * x1
u = x1 - (1 - self.sigma_min) * z
prompt = torch.zeros_like(x1)
for bib in range(b):
prompt[bib, :, :prompt_lens[bib]] = x1[bib, :, :prompt_lens[bib]]
# the range covered by the prompt is set to 0
y[bib, :, :prompt_lens[bib]] = 0
if self.zero_prompt_speech_token:
mu[bib, :, :prompt_lens[bib]] = 0
estimator_out = self.estimator(y, prompt, x_lens, t.squeeze(1).squeeze(1), style, mu, prompt_lens)
loss = 0
for bib in range(b):
loss += self.criterion(estimator_out[bib, :, prompt_lens[bib]:x_lens[bib]], u[bib, :, prompt_lens[bib]:x_lens[bib]])
loss /= b
return loss, estimator_out + (1 - self.sigma_min) * z
class CFM(BASECFM):
def __init__(self, args):
super().__init__(
args
)
if args.dit_type == "DiT":
self.estimator = DiT(args)
else:
raise NotImplementedError(f"Unknown diffusion type {args.dit_type}")
def enable_torch_compile(self):
"""Enable torch.compile optimization for the estimator model.
This method applies torch.compile to the estimator (DiT model) for significant
performance improvements during inference. It also configures distributed
training optimizations if applicable.
"""
if torch.distributed.is_initialized():
torch._inductor.config.reorder_for_compute_comm_overlap = True
self.estimator = torch.compile(
self.estimator,
fullgraph=True,
dynamic=True,
)
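# --- Illustrative sketch (not part of the original file) ---
# One Euler step with the classifier-free guidance (CFG) mixing used in solve_euler above.
# `cond_dphi_dt` / `uncond_dphi_dt` stand in for the estimator outputs on the conditional
# and null (zeroed) inputs; names and shapes are placeholders for illustration only.
def _euler_cfg_step(x, dt, cond_dphi_dt, uncond_dphi_dt, inference_cfg_rate=0.5):
    # amplify the conditional velocity and subtract the unconditional one ...
    dphi_dt = (1.0 + inference_cfg_rate) * cond_dphi_dt - inference_cfg_rate * uncond_dphi_dt
    # ... then take one explicit Euler step along the flow
    return x + dt * dphi_dt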

View File

@@ -0,0 +1,360 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
from dataclasses import dataclass
from typing import Optional
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import functional as F
def find_multiple(n: int, k: int) -> int:
if n % k == 0:
return n
return n + k - (n % k)
class AdaptiveLayerNorm(nn.Module):
r"""Adaptive Layer Normalization"""
def __init__(self, d_model, norm) -> None:
super(AdaptiveLayerNorm, self).__init__()
self.project_layer = nn.Linear(d_model, 2 * d_model)
self.norm = norm
self.d_model = d_model
self.eps = self.norm.eps
def forward(self, input: Tensor, embedding: Tensor = None) -> Tensor:
if embedding is None:
return self.norm(input)
weight, bias = torch.split(
self.project_layer(embedding),
split_size_or_sections=self.d_model,
dim=-1,
)
return weight * self.norm(input) + bias
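# --- Illustrative sketch (not part of the original file) ---
# AdaptiveLayerNorm above turns a conditioning embedding into a per-sample scale and
# shift for the normalized input. Sizes below are hypothetical.
def _adaln_demo():
    d_model = 8
    adaln = AdaptiveLayerNorm(d_model, nn.LayerNorm(d_model))
    x = torch.randn(2, 5, d_model)                  # (batch, seq, dim)
    cond = torch.randn(2, 1, d_model)               # conditioning embedding, broadcast over seq
    return adaln(x, cond).shape                     # torch.Size([2, 5, 8])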
@dataclass
class ModelArgs:
block_size: int = 2048
vocab_size: int = 32000
n_layer: int = 32
n_head: int = 32
dim: int = 4096
intermediate_size: int = None
n_local_heads: int = -1
head_dim: int = 64
rope_base: float = 10000
norm_eps: float = 1e-5
has_cross_attention: bool = False
context_dim: int = 0
uvit_skip_connection: bool = False
time_as_token: bool = False
def __post_init__(self):
if self.n_local_heads == -1:
self.n_local_heads = self.n_head
if self.intermediate_size is None:
hidden_dim = 4 * self.dim
n_hidden = int(2 * hidden_dim / 3)
self.intermediate_size = find_multiple(n_hidden, 256)
# self.head_dim = self.dim // self.n_head
@classmethod
def from_name(cls, name: str):
if name in transformer_configs:
return cls(**transformer_configs[name])
# fuzzy search
config = [config for config in transformer_configs if config.lower() in str(name).lower()]
# We may have two or more configs matched (e.g. "7B" and "Mistral-7B"). Find the best config match:
# take the longer name (as it has more symbols matched)
if len(config) > 1:
config.sort(key=len, reverse=True)
assert len(config[0]) != len(config[1]), name # make sure only one 'best' match
return cls(**transformer_configs[config[0]])
transformer_configs = {
"CodeLlama-7b-Python-hf": dict(block_size=16384, vocab_size=32000, n_layer=32, dim=4096, rope_base=1000000),
"7B": dict(n_layer=32, n_head=32, dim=4096),
"13B": dict(n_layer=40, n_head=40, dim=5120),
"30B": dict(n_layer=60, n_head=52, dim=6656),
"34B": dict(n_layer=48, n_head=64, dim=8192, vocab_size=32000, n_local_heads=8, intermediate_size=22016,
rope_base=1000000), # CodeLlama-34B-Python-hf
"70B": dict(n_layer=80, n_head=64, dim=8192, n_local_heads=8, intermediate_size=28672),
"Mistral-7B": dict(n_layer=32, n_head=32, n_local_heads=8, dim=4096, intermediate_size=14336, vocab_size=32000),
"stories15M": dict(n_layer=6, n_head=6, dim=288),
"stories110M": dict(n_layer=12, n_head=12, dim=768),
"llama-3-8b": dict(block_size=8192, n_layer=32, n_head=32, n_local_heads=8, dim=4096, intermediate_size=14336,
vocab_size=128256, rope_base=500000),
"llama-3-70b": dict(block_size=8192, n_layer=80, n_head=64, n_local_heads=8, dim=8192, intermediate_size=28672,
vocab_size=128256, rope_base=500000),
}
class KVCache(nn.Module):
def __init__(self, max_batch_size, max_seq_length, n_heads, head_dim, dtype=torch.bfloat16):
super().__init__()
cache_shape = (max_batch_size, n_heads, max_seq_length, head_dim)
self.register_buffer('k_cache', torch.zeros(cache_shape, dtype=dtype))
self.register_buffer('v_cache', torch.zeros(cache_shape, dtype=dtype))
def update(self, input_pos, k_val, v_val):
# input_pos: [S], k_val: [B, H, S, D]
assert input_pos.shape[0] == k_val.shape[2]
k_out = self.k_cache
v_out = self.v_cache
k_out[:, :, input_pos] = k_val
v_out[:, :, input_pos] = v_val
return k_out, v_out
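# --- Illustrative sketch (not part of the original file) ---
# KVCache.update above writes the new keys/values at the given positions and returns
# the full cached tensors. Sizes are hypothetical.
def _kv_cache_demo():
    cache = KVCache(max_batch_size=1, max_seq_length=8, n_heads=2, head_dim=4, dtype=torch.float32)
    k_new = torch.ones(1, 2, 1, 4)                  # (B, H, S=1, D): one new token
    v_new = torch.ones(1, 2, 1, 4)
    k_all, v_all = cache.update(torch.tensor([3]), k_new, v_new)   # write at position 3
    return k_all.shape                              # torch.Size([1, 2, 8, 4])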
class Transformer(nn.Module):
def __init__(self, config: ModelArgs) -> None:
super().__init__()
self.config = config
self.layers = nn.ModuleList(TransformerBlock(config) for _ in range(config.n_layer))
self.norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
self.freqs_cis: Optional[Tensor] = None
self.mask_cache: Optional[Tensor] = None
self.max_batch_size = -1
self.max_seq_length = -1
def setup_caches(self, max_batch_size, max_seq_length, use_kv_cache=True):
if self.max_seq_length >= max_seq_length and self.max_batch_size >= max_batch_size:
return
head_dim = self.config.dim // self.config.n_head
max_seq_length = find_multiple(max_seq_length, 8)
self.max_seq_length = max_seq_length
self.max_batch_size = max_batch_size
dtype = self.norm.project_layer.weight.dtype
device = self.norm.project_layer.weight.device
if not self.training and use_kv_cache:
for b in self.layers:
b.attention.kv_cache = KVCache(max_batch_size, max_seq_length, self.config.n_local_heads, head_dim, dtype).to(device)
self.freqs_cis = precompute_freqs_cis(self.config.block_size, self.config.head_dim,
self.config.rope_base, dtype).to(device)
self.causal_mask = torch.tril(torch.ones(self.max_seq_length, self.max_seq_length, dtype=torch.bool)).to(device)
self.use_kv_cache = use_kv_cache
self.uvit_skip_connection = self.config.uvit_skip_connection
if self.uvit_skip_connection:
self.layers_emit_skip = [i for i in range(self.config.n_layer) if i < self.config.n_layer // 2]
self.layers_receive_skip = [i for i in range(self.config.n_layer) if i > self.config.n_layer // 2]
else:
self.layers_emit_skip = []
self.layers_receive_skip = []
def forward(self,
x: Tensor,
c: Tensor,
input_pos: Optional[Tensor] = None,
mask: Optional[Tensor] = None,
context: Optional[Tensor] = None,
context_input_pos: Optional[Tensor] = None,
cross_attention_mask: Optional[Tensor] = None,
) -> Tensor:
assert self.freqs_cis is not None, "Caches must be initialized first"
if mask is None: # in case of non-causal model
if not self.training and self.use_kv_cache:
mask = self.causal_mask[None, None, input_pos]
else:
mask = self.causal_mask[None, None, input_pos]
mask = mask[..., input_pos]
freqs_cis = self.freqs_cis[input_pos]
if context is not None:
context_freqs_cis = self.freqs_cis[context_input_pos]
else:
context_freqs_cis = None
skip_in_x_list = []
for i, layer in enumerate(self.layers):
if self.uvit_skip_connection and i in self.layers_receive_skip:
skip_in_x = skip_in_x_list.pop(-1)
else:
skip_in_x = None
x = layer(x, c, input_pos, freqs_cis, mask, context, context_freqs_cis, cross_attention_mask, skip_in_x)
if self.uvit_skip_connection and i in self.layers_emit_skip:
skip_in_x_list.append(x)
x = self.norm(x, c)
return x
@classmethod
def from_name(cls, name: str):
return cls(ModelArgs.from_name(name))
class TransformerBlock(nn.Module):
def __init__(self, config: ModelArgs) -> None:
super().__init__()
self.attention = Attention(config)
self.feed_forward = FeedForward(config)
self.ffn_norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
self.attention_norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
if config.has_cross_attention:
self.has_cross_attention = True
self.cross_attention = Attention(config, is_cross_attention=True)
self.cross_attention_norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
else:
self.has_cross_attention = False
if config.uvit_skip_connection:
self.skip_in_linear = nn.Linear(config.dim * 2, config.dim)
self.uvit_skip_connection = True
else:
self.uvit_skip_connection = False
self.time_as_token = config.time_as_token
def forward(self,
x: Tensor,
c: Tensor,
input_pos: Tensor,
freqs_cis: Tensor,
mask: Tensor,
context: Optional[Tensor] = None,
context_freqs_cis: Optional[Tensor] = None,
cross_attention_mask: Optional[Tensor] = None,
skip_in_x: Optional[Tensor] = None,
) -> Tensor:
c = None if self.time_as_token else c
if self.uvit_skip_connection and skip_in_x is not None:
x = self.skip_in_linear(torch.cat([x, skip_in_x], dim=-1))
h = x + self.attention(self.attention_norm(x, c), freqs_cis, mask, input_pos)
if self.has_cross_attention:
h = h + self.cross_attention(self.cross_attention_norm(h, c), freqs_cis, cross_attention_mask, input_pos, context, context_freqs_cis)
out = h + self.feed_forward(self.ffn_norm(h, c))
return out
class Attention(nn.Module):
def __init__(self, config: ModelArgs, is_cross_attention: bool = False):
super().__init__()
assert config.dim % config.n_head == 0
total_head_dim = (config.n_head + 2 * config.n_local_heads) * config.head_dim
# key, query, value projections for all heads, but in a batch
if is_cross_attention:
self.wq = nn.Linear(config.dim, config.n_head * config.head_dim, bias=False)
self.wkv = nn.Linear(config.context_dim, 2 * config.n_local_heads * config.head_dim, bias=False)
else:
self.wqkv = nn.Linear(config.dim, total_head_dim, bias=False)
self.wo = nn.Linear(config.head_dim * config.n_head, config.dim, bias=False)
self.kv_cache = None
self.n_head = config.n_head
self.head_dim = config.head_dim
self.n_local_heads = config.n_local_heads
self.dim = config.dim
# self._register_load_state_dict_pre_hook(self.load_hook)
# def load_hook(self, state_dict, prefix, *args):
# if prefix + "wq.weight" in state_dict:
# wq = state_dict.pop(prefix + "wq.weight")
# wk = state_dict.pop(prefix + "wk.weight")
# wv = state_dict.pop(prefix + "wv.weight")
# state_dict[prefix + "wqkv.weight"] = torch.cat([wq, wk, wv])
def forward(self,
x: Tensor,
freqs_cis: Tensor,
mask: Tensor,
input_pos: Optional[Tensor] = None,
context: Optional[Tensor] = None,
context_freqs_cis: Optional[Tensor] = None,
) -> Tensor:
bsz, seqlen, _ = x.shape
kv_size = self.n_local_heads * self.head_dim
if context is None:
q, k, v = self.wqkv(x).split([kv_size, kv_size, kv_size], dim=-1)
context_seqlen = seqlen
else:
q = self.wq(x)
k, v = self.wkv(context).split([kv_size, kv_size], dim=-1)
context_seqlen = context.shape[1]
q = q.view(bsz, seqlen, self.n_head, self.head_dim)
k = k.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
v = v.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
q = apply_rotary_emb(q, freqs_cis)
k = apply_rotary_emb(k, context_freqs_cis if context_freqs_cis is not None else freqs_cis)
q, k, v = map(lambda x: x.transpose(1, 2), (q, k, v))
if self.kv_cache is not None:
k, v = self.kv_cache.update(input_pos, k, v)
k = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
v = v.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
y = y.transpose(1, 2).contiguous().view(bsz, seqlen, self.head_dim * self.n_head)
y = self.wo(y)
return y
class FeedForward(nn.Module):
def __init__(self, config: ModelArgs) -> None:
super().__init__()
self.w1 = nn.Linear(config.dim, config.intermediate_size, bias=False)
self.w3 = nn.Linear(config.dim, config.intermediate_size, bias=False)
self.w2 = nn.Linear(config.intermediate_size, config.dim, bias=False)
def forward(self, x: Tensor) -> Tensor:
return self.w2(F.silu(self.w1(x)) * self.w3(x))
class RMSNorm(nn.Module):
def __init__(self, dim: int, eps: float = 1e-5):
super().__init__()
self.eps = eps
self.weight = nn.Parameter(torch.ones(dim))
def _norm(self, x):
return x * torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + self.eps)
def forward(self, x: Tensor) -> Tensor:
output = self._norm(x.float()).type_as(x)
return output * self.weight
def precompute_freqs_cis(
seq_len: int, n_elem: int, base: int = 10000,
dtype: torch.dtype = torch.bfloat16
) -> Tensor:
freqs = 1.0 / (base ** (torch.arange(0, n_elem, 2)[: (n_elem // 2)].float() / n_elem))
t = torch.arange(seq_len, device=freqs.device)
freqs = torch.outer(t, freqs)
freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
cache = torch.stack([freqs_cis.real, freqs_cis.imag], dim=-1)
return cache.to(dtype=dtype)
def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2)
x_out2 = torch.stack(
[
xshaped[..., 0] * freqs_cis[..., 0] - xshaped[..., 1] * freqs_cis[..., 1],
xshaped[..., 1] * freqs_cis[..., 0] + xshaped[..., 0] * freqs_cis[..., 1],
],
-1,
)
x_out2 = x_out2.flatten(3)
return x_out2.type_as(x)
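# --- Illustrative sketch (not part of the original file) ---
# Applying the rotary-embedding helpers above to a dummy query tensor. Shapes follow
# Attention.forward: q is (bsz, seqlen, n_head, head_dim), freqs_cis is (seqlen, head_dim // 2, 2).
def _rope_demo():
    bsz, seqlen, n_head, head_dim = 2, 16, 4, 64
    q = torch.randn(bsz, seqlen, n_head, head_dim)
    freqs = precompute_freqs_cis(seq_len=128, n_elem=head_dim)     # cached table (128, head_dim // 2, 2)
    q_rot = apply_rotary_emb(q, freqs[:seqlen])                    # slice the current positions
    return q_rot.shape                                             # torch.Size([2, 16, 4, 64])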

View File

@@ -0,0 +1,436 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import itertools
import sys
import time
from pathlib import Path
from typing import Optional, Tuple
import torch
import torch._dynamo.config
import torch._inductor.config
def device_sync(device):
if "cuda" in device:
torch.cuda.synchronize(device)
elif ("cpu" in device) or ("mps" in device):
pass
else:
print(f"device={device} is not yet suppported")
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True # Experimental feature to reduce compilation times, will be on by default in future
default_device = 'cuda' if torch.cuda.is_available() else 'cpu'
# support running without installing as a package
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))
from model import Transformer
from tokenizer import get_tokenizer
def multinomial_sample_one_no_sync(probs_sort): # Does multinomial sampling without a cuda synchronization
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
logits = logits / max(temperature, 1e-5)
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
pivot = v.select(-1, -1).unsqueeze(-1)
logits = torch.where(logits < pivot, -float("Inf"), logits)
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs
def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
probs = logits_to_probs(logits[0, -1], temperature, top_k)
idx_next = multinomial_sample_one_no_sync(probs)
return idx_next, probs
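# --- Illustrative sketch (not part of the original file) ---
# Top-k + temperature sampling with the helpers above, on dummy logits. Values are hypothetical.
def _sampling_demo():
    logits = torch.randn(1, 1, 32000)                              # (batch, seq, vocab)
    next_token, probs = sample(logits, temperature=0.8, top_k=200)
    return next_token.shape, probs.shape                           # torch.Size([1]), torch.Size([32000])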
def prefill(model: Transformer, x: torch.Tensor, input_pos: torch.Tensor, **sampling_kwargs) -> torch.Tensor:
# input_pos: [B, S]
logits = model(x, input_pos)
return sample(logits, **sampling_kwargs)[0]
def decode_one_token(model: Transformer, x: torch.Tensor, input_pos: torch.Tensor, **sampling_kwargs) -> Tuple[torch.Tensor, torch.Tensor]:
# input_pos: [B, 1]
assert input_pos.shape[-1] == 1
logits = model(x, input_pos)
return sample(logits, **sampling_kwargs)
def decode_n_tokens(model: Transformer, cur_token: torch.Tensor, input_pos: torch.Tensor, num_new_tokens: int, callback=lambda _: _, **sampling_kwargs):
new_tokens, new_probs = [], []
for i in range(num_new_tokens):
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True): # Actually better for Inductor to codegen attention here
next_token, next_prob = decode_one_token(
model, cur_token, input_pos, **sampling_kwargs
)
input_pos += 1
new_tokens.append(next_token.clone())
callback(new_tokens[-1])
new_probs.append(next_prob.clone())
cur_token = next_token.view(1, -1)
return new_tokens, new_probs
def model_forward(model, x, input_pos):
return model(x, input_pos)
def speculative_decode(
model: Transformer,
draft_model: Transformer,
cur_token: torch.Tensor,
input_pos: int,
speculate_k: int,
**sampling_kwargs
) -> torch.Tensor:
# draft model inference sequentially
device = cur_token.device
orig_input_pos = torch.tensor([input_pos], dtype=torch.int64, device=cur_token.device)
draft_tokens, draft_probs = decode_n_tokens(draft_model, cur_token.view(1, -1), orig_input_pos.clone(), speculate_k, **sampling_kwargs)
draft_tokens = torch.cat(draft_tokens)
# parallel inference on target model using draft tokens
target_logits = model_forward(
model,
torch.cat([cur_token.view(1), draft_tokens]).view(1, -1),
torch.arange(input_pos, input_pos + speculate_k + 1, device=cur_token.device)
)
target_probs = logits_to_probs(target_logits[0], **sampling_kwargs)
draft_probs = torch.stack(draft_probs)
# q: target prob, p: draft prob
# q >= p: always accept draft token
# q < p: q/p prob to accept draft token
p = draft_probs[torch.arange(0, speculate_k, device=device), draft_tokens]
q = target_probs[torch.arange(0, speculate_k, device=device), draft_tokens]
accept_draft_prob = torch.minimum(torch.ones(()), q[:speculate_k]/ p)
rejected_locations = (torch.rand_like(accept_draft_prob) > accept_draft_prob).nonzero()
if rejected_locations.shape[0] == 0: # All draft tokens have been accepted
accept_length = speculate_k + 1
last_token = multinomial_sample_one_no_sync(target_probs[-1])
# fill last token into draft model
model_forward(
draft_model,
draft_tokens[-1].view(1, -1),
orig_input_pos + speculate_k,
)
return torch.cat([draft_tokens, last_token])
else:
accept_length = rejected_locations[0].item()
p = draft_probs[accept_length]
q = target_probs[accept_length]
new = q - p
new = torch.where(new > 0, new, 0.0)
new = new / new.sum()
next_token = multinomial_sample_one_no_sync(new)
return torch.cat([draft_tokens[:accept_length], next_token])
@torch.no_grad()
def generate(
model: Transformer,
prompt: torch.Tensor,
max_new_tokens: int,
*,
interactive: bool,
draft_model: Transformer,
speculate_k: Optional[int] = 8,
callback = lambda x: x,
**sampling_kwargs
) -> torch.Tensor:
"""
Takes a conditioning sequence (prompt) as input and continues to generate as many tokens as requested.
"""
is_speculative = draft_model is not None
# create an empty tensor of the expected final shape and fill in the current tokens
T = prompt.size(0)
T_new = T + max_new_tokens
if interactive:
max_seq_length = 350
else:
max_seq_length = min(T_new, model.config.block_size)
device, dtype = prompt.device, prompt.dtype
max_seq_length = max_seq_length + speculate_k + 1 if is_speculative else max_seq_length
with torch.device(device):
model.setup_caches(max_batch_size=1, max_seq_length=max_seq_length)
if is_speculative and draft_model is not model:
draft_model.setup_caches(max_batch_size=1, max_seq_length=max_seq_length)
# create an empty tensor of the expected final shape and fill in the current tokens
empty = torch.empty(T_new, dtype=dtype, device=device)
empty[:T] = prompt
seq = empty
input_pos = torch.arange(0, T, device=device)
next_token = prefill(model, prompt.view(1, -1), input_pos, **sampling_kwargs).clone()
if is_speculative:
prefill(draft_model, prompt.view(1, -1), input_pos, **sampling_kwargs)
seq[T] = next_token
input_pos = torch.tensor([T], device=device, dtype=torch.int)
accept_counts = [0] * (speculate_k + 1)
if is_speculative:
input_pos = input_pos.item() # for speculative decoding easier to keep on host
while input_pos < T_new - 1:
cur_token = next_token.view(())
next_tokens = speculative_decode(
model, draft_model, cur_token, input_pos, speculate_k, **sampling_kwargs
)
accept_counts[len(next_tokens) - 1] += 1
num_added = min(T_new - input_pos - 1, len(next_tokens))
seq[input_pos + 1 : input_pos + num_added + 1] = next_tokens[: num_added]
for i in next_tokens[: num_added]:
callback(i)
input_pos = input_pos + num_added
next_token = next_tokens[-1]
else:
generated_tokens, _ = decode_n_tokens(model, next_token.view(1, -1), input_pos, max_new_tokens - 1, callback=callback, **sampling_kwargs)
seq[T + 1:] = torch.cat(generated_tokens)
generate_stats = {
'accept_counts': accept_counts
}
return seq, generate_stats
def encode_tokens(tokenizer, string, bos=True, device=default_device):
tokens = tokenizer.encode(string)
if bos:
tokens = [tokenizer.bos_id()] + tokens
return torch.tensor(tokens, dtype=torch.int, device=device)
def _load_model(checkpoint_path, device, precision, use_tp):
use_cuda = 'cuda' in device
with torch.device('meta'):
model = Transformer.from_name(checkpoint_path.parent.name)
if "int8" in str(checkpoint_path):
print("Using int8 weight-only quantization!")
from quantize import WeightOnlyInt8QuantHandler
simple_quantizer = WeightOnlyInt8QuantHandler(model)
model = simple_quantizer.convert_for_runtime()
if "int4" in str(checkpoint_path):
print("Using int4 weight-only quantization!")
path_comps = checkpoint_path.name.split(".")
groupsize = int(path_comps[-2][1:])
from quantize import WeightOnlyInt4QuantHandler
simple_quantizer = WeightOnlyInt4QuantHandler(model, groupsize)
model = simple_quantizer.convert_for_runtime()
checkpoint = torch.load(str(checkpoint_path), mmap=True, weights_only=True)
if "model" in checkpoint and "stories" in str(checkpoint_path):
checkpoint = checkpoint["model"]
model.load_state_dict(checkpoint, assign=True)
if use_tp:
from tp import apply_tp
print("Applying tensor parallel to model ...")
apply_tp(model)
model = model.to(device=device, dtype=precision)
return model.eval()
def _get_model_size(model):
model_size = 0
for name, child in model.named_children():
if not isinstance(child, torch.nn.Embedding):
model_size += sum(
[
p.numel() * p.dtype.itemsize
for p in itertools.chain(child.parameters(), child.buffers())
]
)
return model_size
B_INST, E_INST = "[INST]", "[/INST]"
def main(
prompt: str = "Hello, my name is",
interactive: bool = False,
num_samples: int = 5,
max_new_tokens: int = 100,
top_k: int = 200,
temperature: float = 0.8,
checkpoint_path: Path = Path("checkpoints/meta-Transformer/Transformer-2-7b-chat-hf/model.pth"),
compile: bool = True,
compile_prefill: bool = False,
profile: Optional[Path] = None,
draft_checkpoint_path: Optional[Path] = None,
speculate_k: int = 5,
device=default_device,
) -> None:
"""Generates text samples based on a pre-trained Transformer model and tokenizer.
"""
assert checkpoint_path.is_file(), checkpoint_path
tokenizer_path = checkpoint_path.parent / "tokenizer.model"
assert tokenizer_path.is_file(), str(tokenizer_path)
global print
from tp import maybe_init_dist
rank = maybe_init_dist()
use_tp = rank is not None
if use_tp:
if rank != 0:
# only print on rank 0
print = lambda *args, **kwargs: None
print(f"Using device={device}")
precision = torch.bfloat16
is_speculative = draft_checkpoint_path is not None
is_chat = "chat" in str(checkpoint_path)
print("Loading model ...")
t0 = time.time()
model = _load_model(checkpoint_path, device, precision, use_tp)
if is_speculative:
draft_model = _load_model(draft_checkpoint_path, device, precision, use_tp)
else:
draft_model = None
device_sync(device=device) # MKG
print(f"Time to load model: {time.time() - t0:.02f} seconds")
tokenizer = get_tokenizer(tokenizer_path, checkpoint_path)
encoded = encode_tokens(tokenizer, prompt, bos=True, device=device)
prompt_length = encoded.size(0)
torch.manual_seed(1234)
model_size = _get_model_size(model)
if compile:
if is_speculative and use_tp: # and ("cuda" in device):
torch._inductor.config.triton.cudagraph_trees = False # Bug with cudagraph trees in this case
if is_speculative:
global model_forward, logits_to_probs
model_forward = torch.compile(model_forward, mode="reduce-overhead", fullgraph=True)
global decode_one_token, prefill
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
# Uncomment to squeeze more perf out of prefill
if compile_prefill:
prefill = torch.compile(prefill, fullgraph=True, dynamic=True)
aggregate_metrics = {
'tokens_per_sec': [],
'accept_counts': [],
}
start = -1 if compile else 0
for i in range(start, num_samples):
device_sync(device=device) # MKG
if i >= 0 and interactive:
prompt = input("What is your prompt? ")
if is_chat:
prompt = f"{B_INST} {prompt.strip()} {E_INST}"
encoded = encode_tokens(tokenizer, prompt, bos=True, device=device)
if interactive and i >= 0:
buffer = []
period_id = tokenizer.encode('.')[0]
done_generating = False
def callback(x):
nonlocal done_generating
if done_generating:
return
buffer.append(tokenizer.decode([period_id] + x.tolist())[1:])
if x.item() == tokenizer.eos_id():
done_generating = True
if len(buffer) == 4 or done_generating:
print(''.join(buffer), end='', flush=True)
buffer.clear()
# print(, end='', flush=True)
else:
callback = lambda x : x
t0 = time.perf_counter()
import contextlib
if (i != num_samples - 1 or not profile) or (use_tp and rank != 0):
prof = contextlib.nullcontext()
else:
torch.profiler._utils._init_for_cuda_graphs()
prof = torch.profiler.profile()
with prof:
y, metrics = generate(
model,
encoded,
max_new_tokens,
draft_model=draft_model,
speculate_k=speculate_k,
interactive=interactive,
callback=callback,
temperature=temperature,
top_k=top_k,
)
aggregate_metrics['accept_counts'].append(metrics['accept_counts'])
if i == -1:
print(f"Compilation time: {time.perf_counter() - t0:.2f} seconds")
continue
if hasattr(prof, "export_chrome_trace"):
if use_tp:
prof.export_chrome_trace(f"{profile}_rank_{rank}.json")
else:
prof.export_chrome_trace(f"{profile}.json")
device_sync(device=device) # MKG
t = time.perf_counter() - t0
if not interactive:
print(tokenizer.decode(y.tolist()))
else:
print()
tokens_generated = y.size(0) - prompt_length
tokens_sec = tokens_generated / t
aggregate_metrics['tokens_per_sec'].append(tokens_sec)
print(f"Time for inference {i + 1}: {t:.02f} sec total, {tokens_sec:.02f} tokens/sec")
print(f"Bandwidth achieved: {model_size * tokens_sec / 1e9:.02f} GB/s")
print("==========")
if is_speculative:
counts_aggregated = [sum(i) for i in zip(*aggregate_metrics['accept_counts'])]
acceptance_probs = [i/sum(counts_aggregated) for i in counts_aggregated]
print(f"Acceptance probs: {acceptance_probs}")
print(f"Mean Accepted: {sum([idx * i for idx, i in enumerate(counts_aggregated)])/sum(counts_aggregated)}")
print(f"Average tokens/sec: {torch.mean(torch.tensor(aggregate_metrics['tokens_per_sec'])).item():.2f}")
print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB")
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser(description='Your CLI description.')
parser.add_argument('--prompt', type=str, default="Hello, my name is", help='Input prompt.')
parser.add_argument('--interactive', action='store_true', help='Whether to launch in interactive mode')
parser.add_argument('--num_samples', type=int, default=5, help='Number of samples.')
parser.add_argument('--max_new_tokens', type=int, default=200, help='Maximum number of new tokens.')
parser.add_argument('--top_k', type=int, default=200, help='Top-k for sampling.')
parser.add_argument('--temperature', type=float, default=0.8, help='Temperature for sampling.')
parser.add_argument('--checkpoint_path', type=Path, default=Path("checkpoints/meta-Transformer/Transformer-2-7b-chat-hf/model.pth"), help='Model checkpoint path.')
parser.add_argument('--compile', action='store_true', help='Whether to compile the model.')
parser.add_argument('--compile_prefill', action='store_true', help='Whether to compile the prefill (improves prefill perf, but higher compile times)')
parser.add_argument('--profile', type=Path, default=None, help='Profile path.')
parser.add_argument('--speculate_k', type=int, default=5, help='Speculative execution depth.')
parser.add_argument('--draft_checkpoint_path', type=Path, default=None, help='Draft checkpoint path.')
parser.add_argument('--device', type=str, default=default_device, help='Device to use')
args = parser.parse_args()
main(
args.prompt, args.interactive, args.num_samples, args.max_new_tokens, args.top_k,
args.temperature, args.checkpoint_path, args.compile, args.compile_prefill, args.profile, args.draft_checkpoint_path,
args.speculate_k, args.device
)
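# --- Illustrative sketch (not part of the original file) ---
# The acceptance rule used in speculative_decode above, on made-up probabilities:
# a draft token is always kept when the target prob q >= the draft prob p, and kept
# with probability q / p otherwise.
def _acceptance_demo():
    p = torch.tensor([0.50, 0.20, 0.10])            # draft-model probs (hypothetical)
    q = torch.tensor([0.60, 0.10, 0.10])            # target-model probs (hypothetical)
    return torch.minimum(torch.ones(()), q / p)     # tensor([1.0000, 0.5000, 1.0000])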

View File

@@ -0,0 +1,360 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
from dataclasses import dataclass
from typing import Optional
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import functional as F
def find_multiple(n: int, k: int) -> int:
if n % k == 0:
return n
return n + k - (n % k)
class AdaptiveLayerNorm(nn.Module):
r"""Adaptive Layer Normalization"""
def __init__(self, d_model, norm) -> None:
super(AdaptiveLayerNorm, self).__init__()
self.project_layer = nn.Linear(d_model, 2 * d_model)
self.norm = norm
self.d_model = d_model
self.eps = self.norm.eps
def forward(self, input: Tensor, embedding: Tensor = None) -> Tensor:
if embedding is None:
return self.norm(input)
weight, bias = torch.split(
self.project_layer(embedding),
split_size_or_sections=self.d_model,
dim=-1,
)
return weight * self.norm(input) + bias
@dataclass
class ModelArgs:
block_size: int = 2048
vocab_size: int = 32000
n_layer: int = 32
n_head: int = 32
dim: int = 4096
intermediate_size: int = None
n_local_heads: int = -1
head_dim: int = 64
rope_base: float = 10000
norm_eps: float = 1e-5
has_cross_attention: bool = False
context_dim: int = 0
uvit_skip_connection: bool = False
time_as_token: bool = False
def __post_init__(self):
if self.n_local_heads == -1:
self.n_local_heads = self.n_head
if self.intermediate_size is None:
hidden_dim = 4 * self.dim
n_hidden = int(2 * hidden_dim / 3)
self.intermediate_size = find_multiple(n_hidden, 256)
# self.head_dim = self.dim // self.n_head
@classmethod
def from_name(cls, name: str):
if name in transformer_configs:
return cls(**transformer_configs[name])
# fuzzy search
config = [config for config in transformer_configs if config.lower() in str(name).lower()]
# We may have two or more configs matched (e.g. "7B" and "Mistral-7B"). Find the best config match:
# take the longer name (as it has more symbols matched)
if len(config) > 1:
config.sort(key=len, reverse=True)
assert len(config[0]) != len(config[1]), name # make sure only one 'best' match
return cls(**transformer_configs[config[0]])
transformer_configs = {
"CodeLlama-7b-Python-hf": dict(block_size=16384, vocab_size=32000, n_layer=32, dim=4096, rope_base=1000000),
"7B": dict(n_layer=32, n_head=32, dim=4096),
"13B": dict(n_layer=40, n_head=40, dim=5120),
"30B": dict(n_layer=60, n_head=52, dim=6656),
"34B": dict(n_layer=48, n_head=64, dim=8192, vocab_size=32000, n_local_heads=8, intermediate_size=22016,
rope_base=1000000), # CodeLlama-34B-Python-hf
"70B": dict(n_layer=80, n_head=64, dim=8192, n_local_heads=8, intermediate_size=28672),
"Mistral-7B": dict(n_layer=32, n_head=32, n_local_heads=8, dim=4096, intermediate_size=14336, vocab_size=32000),
"stories15M": dict(n_layer=6, n_head=6, dim=288),
"stories110M": dict(n_layer=12, n_head=12, dim=768),
"llama-3-8b": dict(block_size=8192, n_layer=32, n_head=32, n_local_heads=8, dim=4096, intermediate_size=14336,
vocab_size=128256, rope_base=500000),
"llama-3-70b": dict(block_size=8192, n_layer=80, n_head=64, n_local_heads=8, dim=8192, intermediate_size=28672,
vocab_size=128256, rope_base=500000),
}
class KVCache(nn.Module):
def __init__(self, max_batch_size, max_seq_length, n_heads, head_dim, dtype=torch.bfloat16):
super().__init__()
cache_shape = (max_batch_size, n_heads, max_seq_length, head_dim)
self.register_buffer('k_cache', torch.zeros(cache_shape, dtype=dtype))
self.register_buffer('v_cache', torch.zeros(cache_shape, dtype=dtype))
def update(self, input_pos, k_val, v_val):
# input_pos: [S], k_val: [B, H, S, D]
assert input_pos.shape[0] == k_val.shape[2]
k_out = self.k_cache
v_out = self.v_cache
k_out[:, :, input_pos] = k_val
v_out[:, :, input_pos] = v_val
return k_out, v_out
class Transformer(nn.Module):
def __init__(self, config: ModelArgs) -> None:
super().__init__()
self.config = config
self.layers = nn.ModuleList(TransformerBlock(config) for _ in range(config.n_layer))
self.norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
self.freqs_cis: Optional[Tensor] = None
self.mask_cache: Optional[Tensor] = None
self.max_batch_size = -1
self.max_seq_length = -1
def setup_caches(self, max_batch_size, max_seq_length, use_kv_cache=True):
if self.max_seq_length >= max_seq_length and self.max_batch_size >= max_batch_size:
return
head_dim = self.config.dim // self.config.n_head
max_seq_length = find_multiple(max_seq_length, 8)
self.max_seq_length = max_seq_length
self.max_batch_size = max_batch_size
dtype = self.norm.project_layer.weight.dtype
device = self.norm.project_layer.weight.device
if not self.training and use_kv_cache:
for b in self.layers:
b.attention.kv_cache = KVCache(max_batch_size, max_seq_length, self.config.n_local_heads, head_dim, dtype).to(device)
self.freqs_cis = precompute_freqs_cis(self.config.block_size, self.config.head_dim,
self.config.rope_base, dtype).to(device)
self.causal_mask = torch.tril(torch.ones(self.max_seq_length, self.max_seq_length, dtype=torch.bool)).to(device)
self.use_kv_cache = use_kv_cache
self.uvit_skip_connection = self.config.uvit_skip_connection
if self.uvit_skip_connection:
self.layers_emit_skip = [i for i in range(self.config.n_layer) if i < self.config.n_layer // 2]
self.layers_receive_skip = [i for i in range(self.config.n_layer) if i > self.config.n_layer // 2]
else:
self.layers_emit_skip = []
self.layers_receive_skip = []
def forward(self,
x: Tensor,
c: Tensor,
input_pos: Optional[Tensor] = None,
mask: Optional[Tensor] = None,
context: Optional[Tensor] = None,
context_input_pos: Optional[Tensor] = None,
cross_attention_mask: Optional[Tensor] = None,
) -> Tensor:
assert self.freqs_cis is not None, "Caches must be initialized first"
if mask is None: # in case of non-causal model
if not self.training and self.use_kv_cache:
mask = self.causal_mask[None, None, input_pos]
else:
mask = self.causal_mask[None, None, input_pos]
mask = mask[..., input_pos]
freqs_cis = self.freqs_cis[input_pos]
if context is not None:
context_freqs_cis = self.freqs_cis[context_input_pos]
else:
context_freqs_cis = None
skip_in_x_list = []
for i, layer in enumerate(self.layers):
if self.uvit_skip_connection and i in self.layers_receive_skip:
skip_in_x = skip_in_x_list.pop(-1)
else:
skip_in_x = None
x = layer(x, c, input_pos, freqs_cis, mask, context, context_freqs_cis, cross_attention_mask, skip_in_x)
if self.uvit_skip_connection and i in self.layers_emit_skip:
skip_in_x_list.append(x)
x = self.norm(x, c)
return x
@classmethod
def from_name(cls, name: str):
return cls(ModelArgs.from_name(name))
class TransformerBlock(nn.Module):
def __init__(self, config: ModelArgs) -> None:
super().__init__()
self.attention = Attention(config)
self.feed_forward = FeedForward(config)
self.ffn_norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
self.attention_norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
if config.has_cross_attention:
self.has_cross_attention = True
self.cross_attention = Attention(config, is_cross_attention=True)
self.cross_attention_norm = AdaptiveLayerNorm(config.dim, RMSNorm(config.dim, eps=config.norm_eps))
else:
self.has_cross_attention = False
if config.uvit_skip_connection:
self.skip_in_linear = nn.Linear(config.dim * 2, config.dim)
self.uvit_skip_connection = True
else:
self.uvit_skip_connection = False
self.time_as_token = config.time_as_token
def forward(self,
x: Tensor,
c: Tensor,
input_pos: Tensor,
freqs_cis: Tensor,
mask: Tensor,
context: Optional[Tensor] = None,
context_freqs_cis: Optional[Tensor] = None,
cross_attention_mask: Optional[Tensor] = None,
skip_in_x: Optional[Tensor] = None,
) -> Tensor:
c = None if self.time_as_token else c
if self.uvit_skip_connection and skip_in_x is not None:
x = self.skip_in_linear(torch.cat([x, skip_in_x], dim=-1))
h = x + self.attention(self.attention_norm(x, c), freqs_cis, mask, input_pos)
if self.has_cross_attention:
h = h + self.cross_attention(self.cross_attention_norm(h, c), freqs_cis, cross_attention_mask, input_pos, context, context_freqs_cis)
out = h + self.feed_forward(self.ffn_norm(h, c))
return out
class Attention(nn.Module):
def __init__(self, config: ModelArgs, is_cross_attention: bool = False):
super().__init__()
assert config.dim % config.n_head == 0
total_head_dim = (config.n_head + 2 * config.n_local_heads) * config.head_dim
# key, query, value projections for all heads, but in a batch
if is_cross_attention:
self.wq = nn.Linear(config.dim, config.n_head * config.head_dim, bias=False)
self.wkv = nn.Linear(config.context_dim, 2 * config.n_local_heads * config.head_dim, bias=False)
else:
self.wqkv = nn.Linear(config.dim, total_head_dim, bias=False)
self.wo = nn.Linear(config.head_dim * config.n_head, config.dim, bias=False)
self.kv_cache = None
self.n_head = config.n_head
self.head_dim = config.head_dim
self.n_local_heads = config.n_local_heads
self.dim = config.dim
# self._register_load_state_dict_pre_hook(self.load_hook)
# def load_hook(self, state_dict, prefix, *args):
# if prefix + "wq.weight" in state_dict:
# wq = state_dict.pop(prefix + "wq.weight")
# wk = state_dict.pop(prefix + "wk.weight")
# wv = state_dict.pop(prefix + "wv.weight")
# state_dict[prefix + "wqkv.weight"] = torch.cat([wq, wk, wv])
def forward(self,
x: Tensor,
freqs_cis: Tensor,
mask: Tensor,
input_pos: Optional[Tensor] = None,
context: Optional[Tensor] = None,
context_freqs_cis: Optional[Tensor] = None,
) -> Tensor:
bsz, seqlen, _ = x.shape
kv_size = self.n_local_heads * self.head_dim
if context is None:
q, k, v = self.wqkv(x).split([kv_size, kv_size, kv_size], dim=-1)
context_seqlen = seqlen
else:
q = self.wq(x)
k, v = self.wkv(context).split([kv_size, kv_size], dim=-1)
context_seqlen = context.shape[1]
q = q.view(bsz, seqlen, self.n_head, self.head_dim)
k = k.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
v = v.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
q = apply_rotary_emb(q, freqs_cis)
k = apply_rotary_emb(k, context_freqs_cis if context_freqs_cis is not None else freqs_cis)
q, k, v = map(lambda x: x.transpose(1, 2), (q, k, v))
if self.kv_cache is not None:
k, v = self.kv_cache.update(input_pos, k, v)
k = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
v = v.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
y = y.transpose(1, 2).contiguous().view(bsz, seqlen, self.head_dim * self.n_head)
y = self.wo(y)
return y
class FeedForward(nn.Module):
def __init__(self, config: ModelArgs) -> None:
super().__init__()
self.w1 = nn.Linear(config.dim, config.intermediate_size, bias=False)
self.w3 = nn.Linear(config.dim, config.intermediate_size, bias=False)
self.w2 = nn.Linear(config.intermediate_size, config.dim, bias=False)
def forward(self, x: Tensor) -> Tensor:
return self.w2(F.silu(self.w1(x)) * self.w3(x))
class RMSNorm(nn.Module):
def __init__(self, dim: int, eps: float = 1e-5):
super().__init__()
self.eps = eps
self.weight = nn.Parameter(torch.ones(dim))
def _norm(self, x):
return x * torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + self.eps)
def forward(self, x: Tensor) -> Tensor:
output = self._norm(x.float()).type_as(x)
return output * self.weight
def precompute_freqs_cis(
seq_len: int, n_elem: int, base: int = 10000,
dtype: torch.dtype = torch.bfloat16
) -> Tensor:
freqs = 1.0 / (base ** (torch.arange(0, n_elem, 2)[: (n_elem // 2)].float() / n_elem))
t = torch.arange(seq_len, device=freqs.device)
freqs = torch.outer(t, freqs)
freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
cache = torch.stack([freqs_cis.real, freqs_cis.imag], dim=-1)
return cache.to(dtype=dtype)
def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2)
x_out2 = torch.stack(
[
xshaped[..., 0] * freqs_cis[..., 0] - xshaped[..., 1] * freqs_cis[..., 1],
xshaped[..., 1] * freqs_cis[..., 0] + xshaped[..., 0] * freqs_cis[..., 1],
],
-1,
)
x_out2 = x_out2.flatten(3)
return x_out2.type_as(x)

View File

@@ -0,0 +1,622 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import time
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
from tokenizer import get_tokenizer
try:
from GPTQ import GenericGPTQRunner, InputRecorder
from eval import get_task_dict, evaluate, lm_eval
except ImportError:
pass
from model import Transformer
##### Quantization Primitives ######
def dynamically_quantize_per_channel(x, quant_min, quant_max, target_dtype):
# assumes symmetric quantization
# assumes axis == 0
# assumes dense memory format
# TODO(future): relax ^ as needed
# default setup for affine quantization of activations
eps = torch.finfo(torch.float32).eps
# get min and max
min_val, max_val = torch.aminmax(x, dim=1)
# calculate scales and zero_points based on min and max
# reference: https://fburl.com/code/srbiybme
min_val_neg = torch.min(min_val, torch.zeros_like(min_val))
max_val_pos = torch.max(max_val, torch.zeros_like(max_val))
device = min_val_neg.device
# reference: https://fburl.com/code/4wll53rk
max_val_pos = torch.max(-min_val_neg, max_val_pos)
scales = max_val_pos / (float(quant_max - quant_min) / 2)
# ensure scales is the same dtype as the original tensor
scales = torch.clamp(scales, min=eps).to(x.dtype)
zero_points = torch.zeros(min_val_neg.size(), dtype=torch.int64, device=device)
# quantize based on qmin/qmax/scales/zp
# reference: https://www.internalfb.com/code/fbsource/[8edc275012b1]/fbcode/caffe2/torch/ao/quantization/fx/_decomposed.py?lines=63
x_div = x / scales.unsqueeze(-1)
x_round = torch.round(x_div)
x_zp = x_round + zero_points.unsqueeze(-1)
quant = torch.clamp(x_zp, quant_min, quant_max).to(target_dtype)
return quant, scales, zero_points
def get_group_qparams(w, n_bit=4, groupsize=128):
# needed for GPTQ with padding
if groupsize > w.shape[-1]:
groupsize = w.shape[-1]
assert groupsize > 1
assert w.shape[-1] % groupsize == 0
assert w.dim() == 2
to_quant = w.reshape(-1, groupsize)
assert torch.isnan(to_quant).sum() == 0
max_val = to_quant.amax(dim=1, keepdim=True)
min_val = to_quant.amin(dim=1, keepdim=True)
max_int = 2**n_bit - 1
scales = (max_val - min_val).clamp(min=1e-6) / max_int
zeros = min_val + scales * (2 ** (n_bit - 1))
return scales.to(torch.bfloat16).reshape(w.shape[0], -1), zeros.to(
torch.bfloat16
).reshape(w.shape[0], -1)
def pack_scales_and_zeros(scales, zeros):
assert scales.shape == zeros.shape
assert scales.dtype == torch.bfloat16
assert zeros.dtype == torch.bfloat16
return (
torch.cat(
[
scales.reshape(scales.size(0), scales.size(1), 1),
zeros.reshape(zeros.size(0), zeros.size(1), 1),
],
2,
)
.transpose(0, 1)
.contiguous()
)
def unpack_scales_and_zeros(scales_and_zeros):
assert len(scales_and_zeros.shape) == 3 and scales_and_zeros.shape[2] == 2
assert scales_and_zeros.dtype == torch.float
return torch.split(scales_and_zeros.transpose(0, 1), 1, 2)
def group_quantize_tensor_from_qparams(w, scales, zeros, n_bit=4, groupsize=128):
assert groupsize > 1
# needed for GPTQ single column quantize
if groupsize > w.shape[-1] and scales.shape[-1] == 1:
groupsize = w.shape[-1]
assert w.shape[-1] % groupsize == 0
assert w.dim() == 2
to_quant = w.reshape(-1, groupsize)
assert torch.isnan(to_quant).sum() == 0
scales = scales.reshape(-1, 1)
zeros = zeros.reshape(-1, 1)
min_val = zeros - scales * (2 ** (n_bit - 1))
max_int = 2**n_bit - 1
min_int = 0
w_int32 = (
to_quant.sub(min_val)
.div(scales)
.round()
.clamp_(min_int, max_int)
.to(torch.int32)
.reshape_as(w)
)
return w_int32
def group_quantize_tensor(w, n_bit=4, groupsize=128):
scales, zeros = get_group_qparams(w, n_bit, groupsize)
w_int32 = group_quantize_tensor_from_qparams(w, scales, zeros, n_bit, groupsize)
scales_and_zeros = pack_scales_and_zeros(scales, zeros)
return w_int32, scales_and_zeros
def group_dequantize_tensor_from_qparams(
w_int32, scales, zeros, n_bit=4, groupsize=128
):
assert groupsize > 1
# needed for GPTQ single column dequantize
if groupsize > w_int32.shape[-1] and scales.shape[-1] == 1:
groupsize = w_int32.shape[-1]
assert w_int32.shape[-1] % groupsize == 0
assert w_int32.dim() == 2
w_int32_grouped = w_int32.reshape(-1, groupsize)
scales = scales.reshape(-1, 1)
zeros = zeros.reshape(-1, 1)
w_dq = (
w_int32_grouped.sub(2 ** (n_bit - 1)).mul(scales).add(zeros).reshape_as(w_int32)
)
return w_dq
def group_dequantize_tensor(w_int32, scales_and_zeros, n_bit=4, groupsize=128):
scales, zeros = unpack_scales_and_zeros(scales_and_zeros)
return group_dequantize_tensor_from_qparams(
w_int32, scales, zeros, n_bit, groupsize
)
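# --- Illustrative sketch (not part of the original file) ---
# A round trip through the int4 group-quantization helpers above on a dummy weight.
# Sizes are hypothetical; the .float() cast is needed because unpack_scales_and_zeros
# asserts float32 input.
def _group_quant_roundtrip_demo():
    w = torch.randn(8, 256, dtype=torch.bfloat16)                  # (out_features, in_features)
    w_int32, scales_and_zeros = group_quantize_tensor(w, n_bit=4, groupsize=128)
    w_dq = group_dequantize_tensor(w_int32, scales_and_zeros.float(), n_bit=4, groupsize=128)
    return (w.float() - w_dq).abs().max()                          # small quantization error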
class QuantHandler:
def __init__(self, mod):
self.mod = mod
def create_quantized_state_dict(self) -> "StateDict":
pass
def convert_for_runtime(self) -> "nn.Module":
pass
class GPTQQuantHandler(QuantHandler):
"""
This class implements a GPTQ QuantHandler that can be used to apply GPTQ to a model in concert with the GenericGPTQRunner class.
Unlike the base QuantHandler class, the user does not need to implement create_quantized_state_dict; instead, they reimplement
__init__ so that it defines the functions for the chosen quantization mode. The user is still expected to reimplement convert_for_runtime.
The following functions (which must be defined in __init__) are used to define the quantization mode for both GPTQ and
create_quantized_state_dict. Here is a description of each function.
get_qparams_func:
A function that calculates the quantization qparams for an input tensor.
Args:
weight: A 2d weight tensor with non-integer dtype.
Returns:
qparams: it can have any format but will need to be handled by the other defined functions below.
quantize_func:
A function that applies quantization to an input tensor. It should be noted
that this function needs to be able to handle quantizing the entire weight tensor, a single group,
or a single column.
Args:
weight: A 2d weight tensor with non-integer dtype.
qparams: the output from get_qparams_func
Returns:
quantized_weight: A 2d quantized weight tensor (generally with an integer dtype)
dequantize_func:
A function that dequantizes an input quantized weight tensor. It should be noted
that this function needs to be able to handle dequantizing the entire weight tensor, a single group,
or a single column.
Args:
quantized_weight: A 2d quantized weight tensor (generally with an integer dtype)
qparams: the output from get_qparams_func
Returns:
weight: A 2d weight tensor with non-integer dtype.
combine_qparams_list_func:
A function that combines several qparams into one qparam.
Args:
qparams_list: a list of qparams objects, each obtained by calling get_qparams_func
on a single group from a weight tensor
Returns:
qparams: an object of the same format as the qparams above.
skip_layer_func:
A function that determines which linear layers should be skipped during GPTQ
Args:
weight: A 2d weight tensor with non-integer dtype.
Returns:
skip: boolean indicating whether layer should be skipped
make_names_and_values_dict_func:
A function that prepares the qparams and quantized_weight and creates a dictionary indicating how they
should be inserted into the state_dict. Generally any packing of the weight and qparams should be done here.
Args:
quantized_weight: A 2d quantized weight tensor (generally with an integer dtype)
qparams: the output from get_qparams_func
Returns:
names_and_values_dict: a dictionary mapping the name of the parameters of the quantized module to the
corresponding quantized weights and qparams.
"""
def __init__(self):
assert self.mod is not None
assert self.get_qparams_func is not None
assert self.quantize_func is not None
assert self.dequantize_func is not None
assert self.combine_qparams_list_func is not None
assert self.make_names_and_values_dict_func is not None
@staticmethod
def get_inputs(model, tokenizer, calibration_tasks, calibration_limit, calibration_seq_length, pad_calibration_inputs) -> "MultiInput":
input_recorder = InputRecorder(
model,
tokenizer,
calibration_seq_length,
pad_calibration_inputs,
)
try:
lm_eval.tasks.initialize_tasks()
except Exception:
pass
task_dict = get_task_dict(calibration_tasks)
print("Obtaining GPTQ calibration inputs on: ", calibration_tasks)
evaluate(
input_recorder,
task_dict,
limit=calibration_limit,
)
inputs = input_recorder.get_recorded_inputs()
assert inputs is not None, (
f"No inputs were collected, use a task other than {calibration_tasks}, "+
f"use option pad_calibration_inputs, or decrease calibration_sequence_length (currently "+
f"{calibration_seq_length})"
)
print(f"Obtained {len(inputs[0].values)} calibration samples")
return inputs
@torch.no_grad()
def create_quantized_state_dict(
self,
tokenizer,
blocksize,
percdamp,
groupsize,
calibration_tasks,
calibration_limit,
calibration_seq_length,
pad_calibration_inputs,
) -> "StateDict":
inputs = GPTQQuantHandler.get_inputs(self.mod, tokenizer, calibration_tasks, calibration_limit, calibration_seq_length, pad_calibration_inputs)
print("Tracing model for GPTQ")
GPTQ_runner = GenericGPTQRunner(
self.mod,
inputs,
blocksize,
percdamp,
groupsize,
).configure_quantization_mode(
self.get_qparams_func,
self.quantize_func,
self.dequantize_func,
self.combine_qparams_list_func,
self.make_names_and_values_dict_func,
self.skip_layer_func
)
print("Applying GPTQ to weights")
GPTQ_runner.run()
return GPTQ_runner.get_quantized_state_dict()
def convert_for_runtime(self) -> "nn.Module":
pass
##### Weight-only int8 per-channel quantized code ######
def replace_linear_weight_only_int8_per_channel(module):
for name, child in module.named_children():
if isinstance(child, nn.Linear):
setattr(module, name, WeightOnlyInt8Linear(child.in_features, child.out_features))
else:
replace_linear_weight_only_int8_per_channel(child)
class WeightOnlyInt8QuantHandler:
def __init__(self, mod):
self.mod = mod
@torch.no_grad()
def create_quantized_state_dict(self):
cur_state_dict = self.mod.state_dict()
for fqn, mod in self.mod.named_modules():
if isinstance(mod, torch.nn.Linear):
int8_weight, scales, _ = dynamically_quantize_per_channel(mod.weight.float(), -128, 127, torch.int8)
cur_state_dict[f"{fqn}.weight"] = int8_weight
cur_state_dict[f"{fqn}.scales"] = scales.to(mod.weight.dtype)
return cur_state_dict
def convert_for_runtime(self):
replace_linear_weight_only_int8_per_channel(self.mod)
return self.mod
class WeightOnlyInt8Linear(torch.nn.Module):
__constants__ = ['in_features', 'out_features']
in_features: int
out_features: int
weight: torch.Tensor
def __init__(self, in_features: int, out_features: int, bias: bool = True,
device=None, dtype=None) -> None:
factory_kwargs = {'device': device, 'dtype': dtype}
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.register_buffer("weight", torch.empty((out_features, in_features), dtype=torch.int8))
self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))
def forward(self, input: torch.Tensor) -> torch.Tensor:
return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales
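# --- Illustrative sketch (not part of the original file) ---
# How the int8 handler above is typically driven, mirroring the _load_model path in
# generate.py. `model` is a placeholder nn.Module whose Linear layers are bias-free.
def _int8_quantize_demo(model: nn.Module) -> nn.Module:
    handler = WeightOnlyInt8QuantHandler(model)
    quantized_state_dict = handler.create_quantized_state_dict()   # int8 weights + per-channel scales
    model = handler.convert_for_runtime()                          # swap nn.Linear -> WeightOnlyInt8Linear
    model.load_state_dict(quantized_state_dict, assign=True)
    return model.eval()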
##### weight only int4 per channel groupwise quantized code ######
def prepare_int4_weight_and_scales_and_zeros(weight_bf16, groupsize, inner_k_tiles):
weight_int32, scales_and_zeros = group_quantize_tensor(
weight_bf16, n_bit=4, groupsize=groupsize
)
weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
return weight_int4pack, scales_and_zeros
def linear_forward_int4(x, weight_int4pack, scales_and_zeros, out_features, groupsize):
origin_x_size = x.size()
x = x.reshape(-1, origin_x_size[-1])
c = torch.ops.aten._weight_int4pack_mm(x, weight_int4pack, groupsize, scales_and_zeros)
new_shape = origin_x_size[:-1] + (out_features,)
c = c.reshape(new_shape)
return c
def _check_linear_int4_k(k, groupsize = 1, inner_k_tiles = 1):
return k % groupsize == 0 and k % (inner_k_tiles * 16) == 0
def replace_linear_int4(module, groupsize, inner_k_tiles, padding):
for name, child in module.named_children():
if isinstance(child, nn.Linear):
if _check_linear_int4_k(child.in_features, groupsize, inner_k_tiles):
setattr(module, name, WeightOnlyInt4Linear(
child.in_features, child.out_features, bias=False,
groupsize=groupsize, inner_k_tiles=inner_k_tiles, padding=False,
))
elif padding:
setattr(module, name, WeightOnlyInt4Linear(
child.in_features, child.out_features, bias=False,
groupsize=groupsize, inner_k_tiles=inner_k_tiles, padding=True,
))
else:
replace_linear_int4(child, groupsize, inner_k_tiles, padding)
class WeightOnlyInt4QuantHandler:
def __init__(self, mod, groupsize=128, inner_k_tiles=8, padding=True):
self.mod = mod
self.groupsize = groupsize
self.inner_k_tiles = inner_k_tiles
self.padding = padding
assert groupsize in [32, 64, 128, 256]
assert inner_k_tiles in [2, 4, 8]
@torch.no_grad()
def create_quantized_state_dict(self, use_cuda = True):
if use_cuda:
device="cuda"
else:
device="cpu"
cur_state_dict = self.mod.state_dict()
for fqn, mod in self.mod.named_modules():
if isinstance(mod, torch.nn.Linear):
assert not mod.bias
out_features = mod.out_features
in_features = mod.in_features
assert out_features % 8 == 0, "require out_features % 8 == 0"
print(f"linear: {fqn}, in={in_features}, out={out_features}")
weight = mod.weight.data
if not _check_linear_int4_k(in_features, self.groupsize, self.inner_k_tiles):
if self.padding:
from model import find_multiple
import torch.nn.functional as F
print(f"warning: {fqn} is padded to satisfy in_features % 1024 == 0")
padded_in_features = find_multiple(in_features, 1024)
weight = F.pad(weight, pad=(0, padded_in_features - in_features))
else:
print(f"warning: {fqn} is skipped, int4 requires that in_features is 32, 64, or is divisible by 1024, " +
"and that groupsize and inner_k_tiles*16 evenly divide into it")
continue
weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
weight.to(torch.bfloat16).to(device=device), self.groupsize, self.inner_k_tiles
)
cur_state_dict[f"{fqn}.weight"] = weight_int4pack.to('cpu')
cur_state_dict[f"{fqn}.scales_and_zeros"] = scales_and_zeros.to('cpu')
return cur_state_dict
def convert_for_runtime(self):
replace_linear_int4(self.mod, self.groupsize, self.inner_k_tiles, self.padding)
return self.mod
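# --- Illustrative sketch (editor's addition, not part of the original file) --
# Assumed end-to-end usage of the handler above: build the packed int4 state
# dict (packing runs on CUDA by default), swap the eligible nn.Linear modules
# for WeightOnlyInt4Linear, then load the packed tensors. `model` stands for
# any module whose Linear layers are bias-free and pass _check_linear_int4_k.
def _example_int4_handler_usage(model):
    handler = WeightOnlyInt4QuantHandler(model, groupsize=128, inner_k_tiles=8)
    quantized_state_dict = handler.create_quantized_state_dict(use_cuda=True)
    model = handler.convert_for_runtime()
    model.load_state_dict(quantized_state_dict, assign=True)
    return model.eval()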
class WeightOnlyInt4GPTQQuantHandler(GPTQQuantHandler):
def __init__(self, mod, groupsize=128, inner_k_tiles=8, padding=True):
from model import find_multiple
self.mod = mod
self.groupsize = groupsize
self.inner_k_tiles = inner_k_tiles
self.padding = padding
self.get_qparams_func = lambda w: get_group_qparams(w, 4, groupsize)
self.quantize_func = lambda w, qparams: \
group_quantize_tensor_from_qparams(w, qparams[0], qparams[1], 4, groupsize)
self.dequantize_func = lambda q, qparams: \
group_dequantize_tensor_from_qparams(q, qparams[0], qparams[1], 4, groupsize).float()
self.combine_qparams_list_func = lambda qparams_list: \
[torch.cat(x, dim=1) for x in zip(*qparams_list)]
# skip unless padding=True or it's correctly sized
self.skip_layer_func = lambda linear_weight: not (
_check_linear_int4_k(linear_weight.shape[-1], groupsize, inner_k_tiles) or padding
)
# we need to do the padding here, both for q and the qparams if necessary
def make_names_and_values_dict_func(q, qparams):
k = q.shape[1]
new_k = find_multiple(k, 1024)
# how much we need to pad the weight
delta_k = new_k - q.shape[1]
final_q = torch.ops.aten._convert_weight_to_int4pack(F.pad(q, pad=(0, delta_k)), inner_k_tiles)
scales_and_zeros = pack_scales_and_zeros(*qparams)
# how many new groups we need for padded weight
delta_groups = new_k // groupsize - scales_and_zeros.shape[0]
final_s_and_z = F.pad(scales_and_zeros, pad=(0,0,0,0,0, delta_groups), value=1)
return {"weight": final_q, "scales_and_zeros": final_s_and_z}
self.make_names_and_values_dict_func = make_names_and_values_dict_func
super().__init__()
def convert_for_runtime(self):
replace_linear_int4(self.mod, self.groupsize, self.inner_k_tiles, self.padding)
return self.mod
class WeightOnlyInt4Linear(torch.nn.Module):
__constants__ = ['in_features', 'out_features']
in_features: int
out_features: int
weight: torch.Tensor
def __init__(
self, in_features: int, out_features: int,
bias=True, device=None, dtype=None, groupsize: int = 128, inner_k_tiles: int = 8, padding: bool = True,
) -> None:
super().__init__()
self.padding = padding
if padding:
from model import find_multiple
self.origin_in_features = in_features
in_features = find_multiple(in_features, 1024)
self.in_features = in_features
self.out_features = out_features
assert not bias, "require bias=False"
self.groupsize = groupsize
self.inner_k_tiles = inner_k_tiles
assert out_features % 8 == 0, "require out_features % 8 == 0"
assert in_features % (inner_k_tiles * 16) == 0, "require in_features % (innerKTiles * 16) == 0"
self.register_buffer(
"weight",
torch.empty((out_features // 8, in_features // (inner_k_tiles * 16), 32, inner_k_tiles // 2), dtype=torch.int32)
)
self.register_buffer(
"scales_and_zeros",
torch.empty((in_features // groupsize, out_features, 2), dtype=torch.bfloat16)
)
def forward(self, input: torch.Tensor) -> torch.Tensor:
input = input.to(torch.bfloat16)
if self.padding:
import torch.nn.functional as F
input = F.pad(input, pad=(0, self.in_features - self.origin_in_features))
return linear_forward_int4(
input,
self.weight, self.scales_and_zeros, self.out_features, self.groupsize
)
def quantize(
checkpoint_path: Path = Path("checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth"),
mode: str = 'int8',
# the following argument is only used for int4 quantization.
groupsize: int = 128,
# following arguments only used for GPTQ
calibration_tasks: list = ["hellaswag"],
calibration_limit: int = 1000,
calibration_seq_length: int = 100,
pad_calibration_inputs: bool = False,
percdamp: float = .01,
blocksize: int = 128,
label: str = '',
) -> None:
assert checkpoint_path.is_file(), checkpoint_path
device = 'cpu'
precision = torch.bfloat16
print("Loading model ...")
t0 = time.time()
with torch.device('meta'):
model = Transformer.from_name(checkpoint_path.parent.name)
checkpoint = torch.load(str(checkpoint_path), mmap=True, weights_only=True)
model.load_state_dict(checkpoint, assign=True)
model = model.to(dtype=precision, device=device)
if mode == 'int8':
print("Quantizing model weights for int8 weight-only symmetric per-channel quantization")
quant_handler = WeightOnlyInt8QuantHandler(model)
quantized_state_dict = quant_handler.create_quantized_state_dict()
dir_name = checkpoint_path.parent
base_name = checkpoint_path.name
new_base_name = base_name.replace('.pth', f'{label}int8.pth')
elif mode == 'int4':
print("Quantizing model weights for int4 weight-only affine per-channel groupwise quantization")
quant_handler = WeightOnlyInt4QuantHandler(model, groupsize)
quantized_state_dict = quant_handler.create_quantized_state_dict()
dir_name = checkpoint_path.parent
base_name = checkpoint_path.name
new_base_name = base_name.replace('.pth', f"{label}int4.g{groupsize}.pth")
elif mode == 'int4-gptq':
print("Quantizing model weights for int4 weight-only affine per-channel groupwise quantization using GPTQ...")
quant_handler = WeightOnlyInt4GPTQQuantHandler(model, groupsize)
tokenizer_path = checkpoint_path.parent / "tokenizer.model"
assert tokenizer_path.is_file(), str(tokenizer_path)
tokenizer = get_tokenizer(tokenizer_path, checkpoint_path)
quantized_state_dict = quant_handler.create_quantized_state_dict(
tokenizer,
blocksize,
percdamp,
groupsize,
calibration_tasks,
calibration_limit,
calibration_seq_length,
pad_calibration_inputs
)
dir_name = checkpoint_path.parent
base_name = checkpoint_path.name
new_base_name = base_name.replace('.pth', f"{label}int4-gptq.g{groupsize}.pth")
else:
raise ValueError(f"Invalid quantization mode {mode} needs to be one of [int8, int4, int4-gpptq]")
quantize_path = dir_name / new_base_name
print(f"Writing quantized weights to {quantize_path}")
quantize_path.unlink(missing_ok=True) # remove any existing file first
torch.save(quantized_state_dict, quantize_path)
print(f"Quantization complete took {time.time() - t0:.02f} seconds")
return
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser(description='Quantize a model.')
parser.add_argument('--checkpoint_path', type=Path, default=Path("checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth"), help='Path to the model checkpoint to be quantized.')
parser.add_argument('--mode', '-q', type=str, default='int8', choices=['int8', 'int4', 'int4-gptq'], help='type of quantization to perform')
parser.add_argument('--groupsize', type=int, default=32, help='Group size for int4 quantization.')
parser.add_argument('--calibration_tasks', type=str, nargs='+', default=['wikitext'], help='tasks to do gptq calibration on, if doing gptq')
parser.add_argument('--calibration_limit', type=int, default=1000, help='number of samples to use for gptq calibration')
parser.add_argument('--calibration_seq_length', type=int, default=100, help='length of sequences to use for gptq calibration')
parser.add_argument('--pad_calibration_inputs', type=bool, default=False, help='pads sequences shorter than calibration_seq_length to that length, yielding more calibration inputs but running much slower')
parser.add_argument('--percdamp', type=float, default=.01, help='gptq percentage dampening')
parser.add_argument('--blocksize', type=int, default=128, help='blocksize for gptq')
parser.add_argument('--label', type=str, default='_', help='label to add to output filename')
args = parser.parse_args()
quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
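# --- Illustrative usage (editor's addition, not part of the original file) ---
# Assuming this file is saved as quantize.py and the checkpoint lives at the
# default path above, the two pure weight-only modes could be run as:
#
#   python quantize.py --mode int8
#   python quantize.py --mode int4 --groupsize 128
#
# With the default --label of '_', the quantized checkpoints are written next
# to the original as model_int8.pth and model_int4.g128.pth respectively; the
# GPTQ path additionally expects tokenizer.model in the checkpoint directory.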
