Can't start training #3
What kind of cluster environment are you using? @yubouf, I strongly recommend adding more documentation about …
I am using a conda environment on a local machine, so I have changed it to run.pl in cmd.sh.
Oh, I see.
After exporting CUDA_VISIBLE_DEVICES=1, here is the log:

It looks like training started but then stopped. Log:
The losses for the two epochs look good. What does …
Here it is:
Where are the hyperparameters of the model? Maybe reducing the batch size would help.
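As a minimal sketch of the suggestion above: if the training config is a simple "key: value" YAML file, the batch size can be reduced with a small script. Note that the key name `batchsize` and the helper `halve_batchsize` are assumptions for illustration, not necessarily the names used by this repository's conf/train.yaml.

```python
# Hypothetical helper: halve the batch size in a flat "key: value" config.
# The key name "batchsize" is an assumption about the config layout.
def halve_batchsize(lines, key="batchsize"):
    out = []
    for line in lines:
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            line = f"{name}: {int(value) // 2}"
        out.append(line)
    return out

conf = ["batchsize: 64", "max_epochs: 10"]
print(halve_batchsize(conf))  # ['batchsize: 32', 'max_epochs: 10']
```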
See …
Thanks for the help; I really appreciate your quick reply, @yubouf. After reducing the batch size, training completed with 29% DER.
Also, there is diarization_data containing mixed audio; what is that for? I think I am missing something. Can you shed some light on what the dataset format and structure should be for speaker diarization?
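For readers unfamiliar with the 29% DER figure mentioned above: diarization error rate counts missed speech, false-alarm speech, and speaker confusion against the total reference speech. The real metric is computed by NIST's md-eval with a collar and an optimal speaker mapping; the following is only a toy frame-level illustration of the definition, with made-up labels.

```python
def frame_der(ref, hyp):
    """Toy frame-level diarization error rate (no collar, no speaker mapping).

    ref/hyp: per-frame speaker labels, None meaning silence.
    DER = (missed speech + false alarms + confusions) / reference speech frames.
    """
    miss = fa = confusion = 0
    speech = sum(1 for r in ref if r is not None)
    for r, h in zip(ref, hyp):
        if r is not None and h is None:
            miss += 1          # speech in reference, silence in hypothesis
        elif r is None and h is not None:
            fa += 1            # silence in reference, speech in hypothesis
        elif r is not None and r != h:
            confusion += 1     # both speak, wrong speaker label
    return (miss + fa + confusion) / speech

ref = ["A", "A", None, "B", "B", "B"]
hyp = ["A", None, None, "B", "A", "B"]
print(frame_der(ref, hyp))  # 0.4  (1 miss + 1 confusion over 5 speech frames)
```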
Both. The latest network configuration is based on 'End-to-End Neural Speaker Diarization with Self-attention'.
The "mini_librispeech" model is prepared just for the code integration tests and is not related to the papers. I'm afraid the current code is not intended for inference-only use. EEND/egs/mini_librispeech/v1/run.sh Lines 106 to 117 in 9a0f211
train_clean_2 and dev_clean_2 are not the actual training and test data for our model.
In …
in …
The same goes for utt2spk.
Please elaborate on where I am wrong and what is actually in those files. Thanks!
Explanation of Kaldi's data directory: … To learn how we generate the simulated training data, see …
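To make the Kaldi data-directory discussion above concrete, here is a sketch of the usual file formats and how spk2utt is derived from utt2spk (the inversion that Kaldi's utt2spk_to_spk2utt.pl performs). The utterance and speaker IDs below are made-up examples, not files from this recipe.

```python
# Kaldi-style data files (one line per entry):
#   wav.scp  : <recording-id> <path-to-wav>
#   segments : <utt-id> <recording-id> <t-begin> <t-end>
#   utt2spk  : <utt-id> <speaker-id>
#   spk2utt  : <speaker-id> <utt-id> <utt-id> ...
utt2spk = {
    "rec1_spk1_000": "spk1",
    "rec1_spk2_000": "spk2",
    "rec1_spk1_001": "spk1",
}

# spk2utt is just utt2spk inverted, with utterances grouped per speaker:
spk2utt = {}
for utt, spk in sorted(utt2spk.items()):
    spk2utt.setdefault(spk, []).append(utt)

print(spk2utt)
# {'spk1': ['rec1_spk1_000', 'rec1_spk1_001'], 'spk2': ['rec1_spk2_000']}
```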
I already have audio recordings, so there is no need to simulate, but do I need to get the transcript?
No. You don't have to prepare the text file.
Thanks for the links.
Are tbeg (2.82) and tdur (4.27) randomly generated here? I couldn't hear any difference in the mixed audio file. The same goes with … found in …
Lastly, in spk2utt and utt2spk: which require …
Cheers!
Yes, the training data is a simulated two-speaker mixture of "mini_librispeech" utterances with randomly chosen silence intervals. Suppose you already have your own two-speaker mixtures for training data:
Then, you can generate …
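The mixture simulation described above can be sketched in a few lines. This is only a toy illustration of the idea (lay each speaker's utterances on its own timeline with random silence gaps, then sum the tracks); the actual recipe works in seconds rather than samples and also adds noise and reverberation, and the function name `mix_two_speakers` is invented for this sketch.

```python
import random

def mix_two_speakers(utts_a, utts_b, max_silence=5):
    """Toy two-speaker mixture: random silence gaps, then sum the two tracks.

    utts_a/utts_b: lists of utterances, each a plain list of float samples.
    max_silence: maximum random gap length, in samples here for simplicity.
    """
    def lay_out(utts):
        track = []
        for u in utts:
            track.extend([0.0] * random.randint(0, max_silence))  # random gap
            track.extend(u)
        return track

    a, b = lay_out(utts_a), lay_out(utts_b)
    n = max(len(a), len(b))
    a += [0.0] * (n - len(a))  # zero-pad the shorter track
    b += [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]
```

Because the gaps are random, the two speakers may overlap, which is exactly what gives the model overlapping-speech training examples.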
Hi, I got all the files and started training, but nothing is happening. There is nothing inside …
Train log:
The log indicates that …
I have no idea about those lines.
before
Can you suggest how to debug further?
When you interrupt the program with Ctrl+C, you will find a stack trace and the possible cause of the stall. I'm afraid it is hard to pinpoint the problem because it might be related to the preparation of your data. If you could share the data with me, I could run it for debugging.
I am getting these results. Can you help me with inference?
Copied my earlier comment.
The "mini_librispeech" model is prepared just for the code integration tests and is not related to the papers. I'm afraid the current code is not intended for inference-only use. EEND/egs/mini_librispeech/v1/run.sh Lines 106 to 117 in 9a0f211
OK, can you suggest how many hours of data are needed to build a good speaker diarization system? Also, can we do this without timestamps? As you know, getting audio with accurate timestamps is a difficult task.
We didn't use manual timestamps for the simulated mixtures or the two-channel recordings.
@yubouf Could you please reveal which GPU you used, so I can roughly estimate the training time in my case?
GeForce GTX 1080 Ti. |
Thank you very much.
Is there any way to train on multi-speaker recordings with the callhome recipe? I get this error when the number of speakers is more than two:
File "/home/sysadmin/EEND/eend/feature.py", line 282, in get_labeledSTFT
The model should have a fixed number of speakers, as set in the config.
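The fixed-speaker constraint can be illustrated with a toy version of the frame-level label matrix such models train on: one row per frame, one column per speaker. The helper `make_labels` below and its failure mode are invented for illustration; it is not the actual EEND code, only the same kind of mismatch that surfaces when the data contains more speakers than the configured output dimension.

```python
def make_labels(n_frames, events, num_speakers=2):
    """Toy frame-level label matrix: labels[t][s] = 1 iff speaker s is active at frame t.

    events: list of (speaker_index, start_frame, end_frame) tuples.
    Raises IndexError if a speaker index exceeds the configured num_speakers,
    mimicking what happens when the data has more speakers than the model.
    """
    labels = [[0] * num_speakers for _ in range(n_frames)]
    for spk, start, end in events:
        if spk >= num_speakers:
            raise IndexError(f"speaker {spk} exceeds num_speakers={num_speakers}")
        for t in range(start, end):
            labels[t][spk] = 1
    return labels

# Two speakers, overlapping in frame 1:
print(make_labels(4, [(0, 0, 2), (1, 1, 4)]))
# [[1, 0], [1, 1], [0, 1], [0, 1]]
```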
@008karan Hi, I have the same problem as you; could you share your solution? Thank you!
train.py -c conf/train.yaml data data exp/diarize/model/data.data.train
Started at Fri Dec 20 12:27:28 IST 2019
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
Have you solved the problem? |
I was testing the setup on the mini_librispeech data. This is the log when I started training.
Can you suggest what's going wrong?