aidea

2020-01-20T03:29:03+00:00

4

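Post 1

Hello to all participants. I am the organizer in charge of this challenge. Please feel free to use the discussion board to exchange ideas and share opinions.

A special reminder: the Public Leaderboard ranking for this challenge is based on your most recent submission, whereas the Private Leaderboard ranking is based on the highest score among three submissions that you select yourself. Remember to select your submissions before the upload deadline; otherwise you will not be included in the final ranking.

You are also welcome to use external data for this challenge, but remember to post the source of any external data you use in the discussion board. Thank you.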

leaverijg

2020-02-24T11:11:32+00:00

0

Post 3

test

Daniel_Lin

2020-04-05T14:30:16+00:00

0

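Post 4

How do the organizers plan to prevent "worker intelligence" (manual labeling), such as someone listening to the audio files one by one and classifying them by hand?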

aidea

2020-04-07T07:14:07+00:00

0

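Post 5

@Daniel_Lin

To prevent manual labeling, after the private leaderboard is published we will notify the top-ranked participants and give them a dataset that has not been released. Participants must reply with their inference results within three days.

If a participant does not reply within three days, we will treat it as a withdrawal, and the next participant in line will be verified and moved up in the ranking.

Thank you!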

yuchio

2020-04-09T09:39:11+00:00

3

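Post 6

Here is a link to public external data:
https://ctext.org/
It contains the texts of Dream of the Red Chamber (紅樓夢), Romance of the Three Kingdoms (三國演義), and related works.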

peiyu12

2020-04-13T10:01:44+00:00

2

Post 7

External data
http://cls.lib.ntu.edu.tw/HLM/
http://cls.lib.ntu.edu.tw/san/
http://cls.lib.ntu.edu.tw/shz/
http://freewestjourney.blogspot.com/
http://www.bwsk.net/mj/l/luxun/lh/
https://zh.wikisource.org/wiki/%E4%BA%8C%E5%8D%81%E5%B9%B4%E7%9B%AE%E7%9D%B9%E4%B9%8B%E6%80%AA%E7%8F%BE%E7%8B%80/%E7%AC%AC001%E5%9B%9E
https://www.pbs.gov.tw/cht/index.php?code=list&ids=46&page=100

dsnj58941

2020-04-15T10:34:46+00:00

0

Post 8

External data
https://www.51shucheng.net/zh-tw/sidamingzhu/shuihuzhuan/442.html
https://www.51shucheng.net/zh-tw/sidamingzhu/sanguoyanyi/360.html
http://big5.quanben-xiaoshuo.com/n/xiyouji/1.html
https://fgc.stpi.narl.org.tw/news/newsDetail?id=4b1141306b2ab792016b9d375a1f001c
https://www.51shucheng.net/zh-tw/sidamingzhu/hongloumeng

FishFu

2020-04-15T15:17:31+00:00

0

Post 9

External data
https://scidm.nchc.org.tw/dataset/grandchallenge

wayne860810

2020-04-15T15:32:53+00:00

0

Post 10

External data
https://scidm.nchc.org.tw/dataset/grandchallenge

PDFwithData

2020-04-16T03:33:54+00:00

1

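Post 11

Judging by the results, a good model plus external data decided the ranking; that is my personal takeaway from watching the leaderboard change from start to finish. Different competition goals call for different rules, so I suggest the organizers weigh the pros and cons of allowing "external data" more carefully.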

yuchio

2020-04-17T02:38:20+00:00

0

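Post 13

I already posted a link to public external data above, but I just received an email from the organizers asking us to disclose all external data used.
Here is the full list:
Part 1: texts available on https://ctext.org/
　Water Margin (水滸傳): https://ctext.org/wiki.pl?if=gb&res=47184
　Dream of the Red Chamber (紅樓夢): https://ctext.org/hongloumeng/zh
　Journey to the West (西遊記): https://ctext.org/xiyouji/zh
　Romance of the Three Kingdoms (三國演義): https://ctext.org/sanguo-yanyi/zh
Part 2: data taken from the lists posted by other participants above
　Lu Xun's collected works, Call to Arms (吶喊): http://www.bwsk.net/mj/l/luxun/lh/
　Bizarre Happenings Eyewitnessed over Two Decades (二十年目睹之怪現狀): https://zh.wikisource.org/wiki/%E4%BA%8C%E5%8D%81%E5%B9%B4%E7%9B%AE%E7%9D%B9%E4%B9%8B%E6%80%AA%E7%8F%BE%E7%8B%80/%E7%AC%AC001%E5%9B%9E
　Police Broadcasting Service (警察廣播電臺) texts: https://www.pbs.gov.tw/cht/index.php?code=list&ids=46&page=100
That is all. I did not use the 科技大擂台 (Grand Challenge) speech data.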

Smile

2020-04-17T02:41:57+00:00

0

Post 14

The competition is over. Is there anything else we need to do?

FishFu

2020-04-17T03:51:40+00:00

0

Post 15

External resources used:
https://scidm.nchc.org.tw/dataset/grandchallenge
I used ESPnet to train a speech recognizer (acoustic model) and a text classifier.
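
The post does not say what kind of text classifier was run on the recognizer's output. As a rough illustration only, here is a minimal sketch of one generic option: a character-bigram Naive Bayes classifier over transcripts. It is not ESPnet's API and not necessarily the model used here; the class name, the label strings, and the toy transcripts are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Split a transcript into overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams, add-one smoothing."""

    def fit(self, texts, labels):
        self.gram_counts = defaultdict(Counter)  # label -> bigram counts
        self.doc_counts = Counter(labels)        # label -> number of texts
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_bigrams(text))
        self.vocab = {g for c in self.gram_counts.values() for g in c}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            # Add-one smoothing: unseen bigrams still get a small probability.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / n_docs)
            for gram in char_bigrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy transcripts (hypothetical), one per source novel.
train_texts = [
    "宋江與武松在梁山泊飲酒",
    "賈寶玉與林黛玉在大觀園賞花",
    "劉備關羽張飛桃園結義",
    "孫悟空保護唐僧西天取經",
]
train_labels = ["shuihu", "honglou", "sanguo", "xiyou"]

clf = BigramNaiveBayes().fit(train_texts, train_labels)
print(clf.predict("林黛玉在大觀園"))
print(clf.predict("孫悟空與唐僧"))
```

Character bigrams avoid the need for a Chinese word segmenter and tolerate some recognition errors, which is why they are a common first try for transcripts like these. A real system would train on far more text per class.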

Smile

2020-04-17T03:56:48+00:00

0

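Post 16

"External dataset" should mean data that has been organized with classification labels, like what 科技大擂台 provides. I did not use that.

But the challenge description says plainly that the sources are Water Margin, Dream of the Red Chamber, Romance of the Three Kingdoms, Journey to the West, and so on. So I gathered every related text I could find (I no longer remember from where; a search engine turns them up anyway) and fed them all to my program for analysis.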

yeha

2020-04-17T04:36:06+00:00

1

Post 17

As for me: no data augmentation, no Google searches, and no reference to the original texts.

yungchialee

2020-04-17T07:21:40+00:00

0

Post 18

<< Information Leak in Machine Learning >>
Sorry for the troubles I may have caused by my previous post.

I did look into 科技大擂台_測試資料集 (the Grand Challenge test dataset), where I found some .wav files, their corresponding Chinese text, and a 資料來源 (data source) field. I picked 3000 or so files whose source was obvious and labeled them quickly (without reading the full Chinese text) to enlarge my training set. The submission score jumped by roughly 10%.

Unfortunately, the same .wav file may be numbered or named differently across data subsets. By comparing the first 5 MFCC values, I managed to remove about 1000 duplicates. I then realized that I might have labeled the same .wav file twice under different names, and with different labels (fat fingers, maybe). Finally came the more worrying part: what if some of the files I picked were already in the submission list? I found 46 of them.
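
For anyone who wants to try a similar check, here is a minimal sketch of the idea, not my exact script. It assumes a feature vector has already been computed for every file (for instance with an MFCC routine from an audio library) and uses the first few values, rounded, as a fingerprint. The file names and numbers are invented for illustration.

```python
from collections import defaultdict


def fingerprint(features, n=5, decimals=3):
    """Round the first n feature values into a hashable key."""
    return tuple(round(v, decimals) for v in features[:n])


def group_duplicates(feature_table):
    """Map each fingerprint to the list of file names that share it."""
    groups = defaultdict(list)
    for name, features in feature_table.items():
        groups[fingerprint(features)].append(name)
    return groups


def deduplicate(feature_table, test_feature_table):
    """Keep one file per fingerprint; set aside files that match the test set."""
    test_keys = {fingerprint(f) for f in test_feature_table.values()}
    kept, leaked = [], []
    for key, names in group_duplicates(feature_table).items():
        first = sorted(names)[0]
        if key in test_keys:
            leaked.append(first)   # same audio as a test file: do not train on it
        else:
            kept.append(first)
    return sorted(kept), sorted(leaked)


# Hypothetical precomputed features for the extra training files.
extra = {
    "a_001.wav": [-312.4561, 98.1123, 12.5, 3.25, -7.75, 0.1],
    "b_917.wav": [-312.4561, 98.1123, 12.5, 3.25, -7.75, 0.9],  # same audio, other name
    "a_002.wav": [-280.0, 75.5, 9.0, 1.5, -2.25, 0.3],
    "a_003.wav": [-150.25, 60.0, 4.5, 0.75, -1.0, 0.2],
}
# Hypothetical features for a file in the submission (test) list.
test_set_features = {"t_0046.wav": [-150.25, 60.0, 4.5, 0.75, -1.0, 0.7]}

kept, leaked = deduplicate(extra, test_set_features)
print("kept:", kept)
print("leaked:", leaked)
```

Exact matching on rounded values only catches byte-identical or near-identical audio. Re-encoded or trimmed copies would need a tolerance-based comparison instead.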

Granted, I could have left those 46 in my enlarged training set and secured a higher score and ranking. I could even have gathered more samples to push the score higher still. But it is plain wrong to train a model this way: a given record should never appear in both the training set and the validation or test set. Having worked on many industrial projects in Taiwan over the years, I have seen many of them fail badly once moved into operation. A subtle, hard-to-spot "information leak" like the one above is often what makes a data scientist or practitioner a hero at the beginning and a failure at the end.
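
One common guard against this kind of leak is to split by group rather than by file, so that all copies of the same recording land on the same side. The sketch below is a generic illustration under that assumption; the function name, group keys, and file names are hypothetical.

```python
import random


def group_split(groups, test_fraction=0.2, seed=0):
    """Split file names so every group lands wholly in train or wholly in test.

    `groups` maps a group key (e.g. a duplicate fingerprint or a source
    recording ID) to the file names that belong to it.
    """
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train_files = [f for k in keys if k not in test_keys for f in groups[k]]
    test_files = [f for k in keys if k in test_keys for f in groups[k]]
    return train_files, test_files


# Hypothetical groups: each key stands for one underlying recording.
groups = {
    "clip_A": ["a_001.wav", "b_917.wav"],  # same audio under two names
    "clip_B": ["a_002.wav"],
    "clip_C": ["a_003.wav", "c_114.wav"],
    "clip_D": ["a_004.wav"],
    "clip_E": ["a_005.wav"],
}

train_files, test_files = group_split(groups, test_fraction=0.2)

# Sanity check: no file, and no group, appears on both sides.
assert not set(train_files) & set(test_files)
print(len(train_files), "train files,", len(test_files), "test files")
```

A random split over individual files would happily put `a_001.wav` in training and `b_917.wav` in validation, which inflates the validation score without improving the model.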

My intent with this post is not to stir up a fuss now that the contest has concluded. I simply want to raise awareness of "information leak in machine learning".

Thanks.

aidea

2020-04-17T08:32:07+00:00

0

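Post 19

To yungchialee:

Thank you for your valuable feedback. The AIdea team takes the fairness of the competition very seriously.
When we designed this challenge, we knew that the 科技大擂台 speech dataset was public, so we could neither restrict participants from using it nor guarantee that no data leakage would occur.
We therefore did our best to guard against this when preparing the data, and we will also release a verification dataset to prevent manual labeling.

Thank you for supporting the AIdea platform.

To Smile:

The AIdea team is preparing the verification dataset. Once it is ready, we will notify the top-ranked participants individually.

Thank you for supporting the competitions on the AIdea platform.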

dsnj58941

2020-04-17T09:31:25+00:00

0

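Post 20

Dear organizers,
The external data sources I listed in my earlier post are:
Water Margin (水滸傳)
https://www.51shucheng.net/zh-tw/sidamingzhu/shuihuzhuan/442.html
Romance of the Three Kingdoms (三國演義)
https://www.51shucheng.net/zh-tw/sidamingzhu/sanguoyanyi/360.html
Journey to the West (西遊記)
http://big5.quanben-xiaoshuo.com/n/xiyouji/1.html
AI speech dataset (AI語音數據資料集)
https://fgc.stpi.narl.org.tw/news/newsDetail?id=4b1141306b2ab792016b9d375a1f001c
Dream of the Red Chamber (紅樓夢)
https://www.51shucheng.net/zh-tw/sidamingzhu/hongloumeng

At first I wondered how external data could reveal the classification answers. On 4/13 I saw posts in the discussion board about using external data, and on the last day of the competition I tried using the external data from those posts to find the corresponding answers.
My final training data consisted of the original 929 test samples plus 1,599 samples from 科技大擂台 covering the four classic novels, news, and radio broadcasts.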

yungchialee

2020-04-17T09:36:14+00:00

0

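Post 21

To AIdea,

Thank you!
Deliberate manual labeling is of course improper. My point, however, was not about the fairness of the competition; it was a sincere reminder to everyone to avoid "information leakage" that can creep into a model without anyone noticing.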