Bias & Hate Classification using KoELECTRA and the Korean Hate Speech Dataset
| # of data | |
|---|---|
| train | 7,896 |
| validate | 471 |
| test | 974 |
- Bias (gender, others, none), Hate (hate, offensive, none)
- torch==1.5.0
- transformers==2.11.0
- soynlp==0.0.493
Joint architecture that predicts bias and hate simultaneously from the [CLS] token
- loss = bias_coef * bias_loss + hate_coef * hate_loss (bias_loss_coef and hate_loss_coef are configurable)
- See ElectraForBiasClassification in model.py
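
The weighted joint loss can be sketched in plain Python as follows. This is a minimal illustration, not the code in model.py: it assumes each head uses softmax cross-entropy (the README does not name the per-task loss), and the function names `cross_entropy` and `joint_loss` are hypothetical. The default coefficients are the ones listed in the hyperparameter table.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example.

    logits: list of raw class scores, label: index of the correct class.
    """
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def joint_loss(bias_logits, bias_label, hate_logits, hate_label,
               bias_loss_coef=0.5, hate_loss_coef=1.0):
    """Weighted sum of the two per-task losses (assumed cross-entropy)."""
    bias_loss = cross_entropy(bias_logits, bias_label)
    hate_loss = cross_entropy(hate_logits, hate_label)
    return bias_loss_coef * bias_loss + hate_loss_coef * hate_loss


# Both heads score the same [CLS] representation; each has 3 classes.
print(joint_loss([2.0, 0.1, -1.0], 0, [0.3, 1.5, -0.2], 1))
```

With these coefficients the hate loss contributes twice as much to the gradient as the bias loss.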
The comment and title are concatenated as [CLS] comment [SEP] title [SEP] and fed to the model as input.

Preprocessing applies only simple steps: removing words enclosed in brackets such as [], unifying quotation marks, removing unnecessary quotation marks, normalization, and so on.
- See the preprocess function in data_loader.py
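
To make the input layout concrete, here is a rough sketch of the cleaning and concatenation steps. It is not the `preprocess` function from data_loader.py: the helper names `clean_text` and `build_input` are hypothetical, Unicode NFKC stands in for the unspecified normalization step, and the removal of unnecessary quotation marks is left out because the README does not say which ones qualify.

```python
import re
import unicodedata


def clean_text(text):
    """Hypothetical cleaning step covering part of what the README lists."""
    text = unicodedata.normalize("NFKC", text)    # assumed normalization
    text = re.sub(r"\[[^\]]*\]", "", text)        # drop words enclosed in []
    text = re.sub("[\u201c\u201d]", '"', text)    # unify curly double quotes
    text = re.sub("[\u2018\u2019]", "'", text)    # unify curly single quotes
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


def build_input(comment, title):
    """Lay out the pair as [CLS] comment [SEP] title [SEP]."""
    return f"[CLS] {clean_text(comment)} [SEP] {clean_text(title)} [SEP]"


print(build_input("[news] \u201chello\u201d  world", "some title"))
```

The string form is only for illustration. In practice a tokenizer normally inserts the special tokens itself when it is given the comment and title as a sentence pair.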
| Parameters | |
|---|---|
| Batch Size | 16 |
| Learning Rate | 5e-5 |
| Epochs | 10 |
| Warmup Proportion | 0.1 |
| Max Seq Length | 100 |
| Bias Loss Coefficient | 0.5 |
| Hate Loss Coefficient | 1.0 |
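
As a worked example of what the warmup proportion means in optimizer steps, the arithmetic below combines the table values with the training set size. It assumes one optimizer step per batch (no gradient accumulation), a single device, and that the last partial batch is kept. None of these is stated in the README.

```python
import math

train_size = 7896          # from the dataset table
batch_size = 16
epochs = 10
warmup_proportion = 0.1

steps_per_epoch = math.ceil(train_size / batch_size)   # 494
total_steps = steps_per_epoch * epochs                  # 4940
warmup_steps = int(total_steps * warmup_proportion)     # 494

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the learning rate would be warming up for roughly the first epoch.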
Compute the weighted F1 for each category (Bias, Hate), then take the arithmetic mean
- mean_weighted_f1 = (bias_weighted_f1 + hate_weighted_f1) / 2
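
The metric can be sketched in plain Python as below. This is an illustrative reimplementation, not the repository's evaluation code: the function names are hypothetical, and "weighted" is assumed to mean per-class F1 averaged with weights proportional to each class's number of true examples, which is the usual definition.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


def mean_weighted_f1(bias_true, bias_pred, hate_true, hate_pred):
    """Arithmetic mean of the two weighted F1 scores."""
    bias_weighted_f1 = weighted_f1(bias_true, bias_pred)
    hate_weighted_f1 = weighted_f1(hate_true, hate_pred)
    return (bias_weighted_f1 + hate_weighted_f1) / 2


bias_true = ["none", "gender", "none", "others"]
bias_pred = ["none", "gender", "none", "others"]
hate_true = ["hate", "none", "none", "offensive"]
hate_pred = ["hate", "none", "hate", "offensive"]
print(mean_weighted_f1(bias_true, bias_pred, hate_true, hate_pred))  # 0.875
```

In the toy data, bias is predicted perfectly (1.0) and hate scores 0.75, so the mean is 0.875.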
The model with the highest mean_weighted_f1 on the dev dataset is saved as the final model.
$ python3 main.py --model_type koelectra-base-v2 \
--model_name_or_path monologg/koelectra-base-v2-discriminator \
--model_dir {$MODEL_DIR} \
--prediction_file prediction.csv \
--do_train

Predictions for the test file are saved in CSV format.
$ python3 main.py --model_type koelectra-base-v2 \
--model_name_or_path {$MODEL_DIR} \
--pred_dir preds \
--prediction_file prediction.csv \
--do_pred

bias,hate
none,offensive
gender,hate
none,none
others,none
...
(This is a lightly built baseline, so there is room to improve the scores.)
| (Weighted F1) | Bias F1 | Hate F1 | Mean F1 |
|---|---|---|---|
| Dev Dataset | 82.28 | 67.25 | 74.77 |