The original video of 霉霉 (Taylor Swift) speaking Chinese: https://www.bilibili.com/video/BV1bB4y1R7Nu/
moviepy can extract the audio track for us. Install it first:

```bash
pip3 install moviepy
```

Then write the extraction code:
```python
from moviepy.editor import AudioFileClip

my_audio_clip = AudioFileClip("e:/meimei.mp4")
my_audio_clip.write_audiofile("e:/meimei.wav")
```

This extracts the audio track to a wav file.
Next, load the extracted file with librosa to inspect it:

```python
import librosa
import numpy as np

audio, freq = librosa.load("e:/meimei.wav")
time = np.arange(0, len(audio)) / freq
print(len(audio), type(audio), freq, sep="\t")
```

The program prints:
```
python3 -u "test.py"
848384	<class 'numpy.ndarray'>	22050
```

librosa returns the sampling rate and the signal amplitude at each sample point: 848,384 samples at 22050 Hz, which is roughly 38 seconds of audio. The raw dataset file is now ready.
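As a quick sanity check, those two numbers already determine the clip duration; a minimal sketch using the values printed above:

```python
# Duration in seconds = number of samples / sampling rate.
n_samples = 848384     # sample count reported by librosa.load above
sample_rate = 22050    # Hz, librosa's default resampling rate

duration_s = n_samples / sample_rate
print(f"{duration_s:.2f} s")  # -> 38.48 s
```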
Inside the project, install the dependencies:

```bash
pip install -r requirements.txt
```

Then run the project's slicing script:
```bash
python3 audio_slicer.py
```

The script uses the slicer2 library to split the large file into small pieces:
```python
import os
import librosa      # read audio files
import soundfile    # write audio files
import yaml
from slicer2 import Slicer

# Read the dataset name from the project's config.yml
with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml = yaml.load(f, Loader=yaml.FullLoader)
model_name = configyml["dataset_path"].replace("Data\\", "")

# Load the full recording, keeping the original sample rate and channels
audio, sr = librosa.load(
    f'./Data/{model_name}/raw/{model_name}/{model_name}.wav',
    sr=None, mono=False)

slicer = Slicer(
    sr=sr,
    threshold=-40,     # silence threshold in dB
    min_length=2000,   # minimum slice length in milliseconds
    min_interval=300,
    hop_size=10,
    max_sil_kept=500
)

chunks = slicer.slice(audio)
for i, chunk in enumerate(chunks):
    if len(chunk.shape) > 1:
        chunk = chunk.T  # swap axes if the audio is stereo
    soundfile.write(
        f'./Data/{model_name}/raw/{model_name}/{model_name}_{i}.wav',
        chunk, sr)  # save the sliced audio files with soundfile

# Delete the original un-sliced file once the chunks are written
source = f'./Data/{model_name}/raw/{model_name}/{model_name}.wav'
if os.path.exists(source):
    os.remove(source)
```

Note that the min_length parameter is crucial: no slice may be shorter than 2 seconds. The unit is milliseconds, hence the value 2000. The mel spectrogram involves a windowing step, so an audio file must be at least one frame plus one window long to produce any output at all; otherwise it returns an empty result. Slicing must therefore never produce clips under 2 seconds, and such very short samples tend to be of poor quality anyway.
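The one-frame-plus-window constraint can be made concrete. Using the STFT settings from the training config later in this article (win_length=2048, hop_length=512 at 44100 Hz), a hypothetical helper (my own sketch, assuming no center padding, i.e. the worst case) counts how many complete frames a clip yields:

```python
def n_frames(n_samples: int, win_length: int = 2048, hop_length: int = 512) -> int:
    """Number of complete analysis windows that fit into the signal,
    assuming no center padding (the worst case for short clips)."""
    if n_samples < win_length:
        return 0  # shorter than a single window: no frames at all
    return 1 + (n_samples - win_length) // hop_length

sr = 44100
print(n_frames(int(0.04 * sr)))  # 40 ms clip -> 0
print(n_frames(2 * sr))          # 2 s clip  -> 169
```

A 40 ms clip does not even cover one window, so it produces nothing, while a 2-second clip yields plenty of frames.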
```
E:\work\Bert-VITS2-v202_demo\Data\meimei\raw\meimei>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
    meimei_0.wav
    meimei_1.wav
    meimei_2.wav
    meimei_3.wav
    meimei_4.wav
    meimei_5.wav
    meimei_6.wav
    meimei_7.wav
    meimei_8.wav
```

The 38-second clip has been cut into nine pieces.
Next, transcribe the slices to text:

```bash
python3 short_audio_transcribe.py --languages "CJE" --whisper_size medium
```

Transcription is handled by whisper, here running inference with the medium model. For more on whisper, see the earlier article 持续进化,快速转录,Faster-Whisper对视频进行双语字幕转录实践(Python3.10); it is not repeated here.
```
E:\work\Bert-VITS2-v202_demo\venv\lib\site-packages\whisper\timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def backtrace(trace: np.ndarray):
Data\meimei\raw
Detected language: zh
但这些歌曲没进入专辑因为想留着他们下一张专辑用
Processed: 1/31
Detected language: zh
然後下一張專輯完全不同所以他們被拋在了後面
Processed: 2/31
Detected language: zh
你總是會想起這些歌曲你會想
Processed: 3/31
Detected language: zh
会发生什么因为我希望人们能听到这个但它属于那个时刻
Processed: 4/31
Detected language: zh
所以现在我可以回去重新审视我的旧作品
Processed: 5/31
Detected language: zh
我從他們所在的地方挖掘出那些歌曲
Processed: 6/31
Detected language: zh
並聯繫了我喜歡的藝術家
Processed: 7/31
Detected language: zh
問他們是否願意和我一起演唱這首歌
Processed: 8/31
Detected language: zh
你知道Phoebe Bridgers是我最喜欢的艺术家之一
Processed: 9/31
```

whisper has transcribed the text. Next, preprocess the text and generate the files the BERT model can read:
```bash
python3 preprocess_text.py
python3 bert_gen.py
```

This produces the training-set and validation-set files:
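preprocess_text.py is also what splits the annotations into the two lists. A simplified sketch of that split (the `path|speaker|language|text` line format is Bert-VITS2's annotation convention; the real script additionally cleans the text, and the helper name here is my own):

```python
import random

def split_annotations(lines, n_val=4, seed=42):
    """Shuffle annotation lines and hold a few out for validation.
    Simplified sketch; the real preprocess_text.py also cleans text."""
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)
    return lines[n_val:], lines[:n_val]

# Annotation lines look like: "Data/meimei/.../meimei_0.wav|meimei|ZH|text"
anno = [f"meimei_{i}.wav|meimei|ZH|line {i}" for i in range(31)]
train, val = split_annotations(anno)
print(len(train), len(val))  # -> 27 4
```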
```
E:\work\Bert-VITS2-v202\Data\meimei\filelists>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
    cleaned.list
    short_character_anno.list
    train.list
    val.list
```

After checking that these look right, data preprocessing is complete.
Next, review the training configuration file:

```json
{
  "train": {
    "log_interval": 50,
    "eval_interval": 50,
    "seed": 42,
    "epochs": 200,
    "learning_rate": 0.0001,
    "betas": [0.8, 0.99],
    "eps": 1e-09,
    "batch_size": 8,
    "fp16_run": false,
    "lr_decay": 0.99995,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": false
  },
  "data": {
    "training_files": "Data/meimei/filelists/train.list",
    "validation_files": "Data/meimei/filelists/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "keqing": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [8, 8, 2, 2, 2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16, 16, 8, 2, 2],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "version": "2.0"
}
```

Turn the save interval down so that inference can be tested at any point during training.
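The interval tweak can also be scripted rather than edited by hand; a minimal sketch (the config path in the commented call follows this tutorial's layout and is an assumption):

```python
import json

def set_save_interval(path: str, interval: int = 50) -> None:
    """Lower log_interval and eval_interval in a Bert-VITS2 config.json
    so checkpoints appear often enough for mid-training inference tests."""
    with open(path, "r", encoding="utf-8") as f:
        cfg = json.load(f)
    cfg["train"]["log_interval"] = interval
    cfg["train"]["eval_interval"] = interval
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, ensure_ascii=False, indent=2)

# set_save_interval("Data/meimei/configs/config.json", 50)  # assumed path
```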
```bash
python3 train_ms.py
```

From here on, the training procedure is identical to the earlier local-training workflow based on an existing dataset; for the remaining steps see 本地训练,开箱可用,Bert-VITS2 V2.0.2版本本地基于现有数据集训练(原神刻晴), which space does not permit repeating here.
Once training has produced a usable checkpoint, start the inference API service:

```bash
python3 server_fastapi.py
```
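Once the service is up it can be called over HTTP. The endpoint path, port, and parameter names vary between Bert-VITS2 versions, so the following only sketches building such a request (the /voice path, the 7860 port, and every parameter name here are assumptions; check your server_fastapi.py and config.yml for the real API):

```python
from urllib.parse import urlencode

# All of these values are assumptions for illustration; the real
# endpoint, port and parameter names live in server_fastapi.py/config.yml.
base = "http://127.0.0.1:7860/voice"
params = {
    "text": "但这些歌曲没进入专辑",  # text to synthesize
    "speaker": "meimei",              # assumed speaker parameter
    "language": "ZH",
}
url = f"{base}?{urlencode(params)}"
print(url)
# Fetching this URL (e.g. requests.get(url)) would return the wav bytes.
```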