
How should we view Yoav Goldberg's angry takedown of the GAN-for-NLG paper from MILA?

Date: 2017-06-11

Author: Zhihu user
Source: Zhihu

Recently, Yoav Goldberg, a senior lecturer in computer science at Bar-Ilan University in Israel, published a post criticizing the new Université de Montréal paper "Adversarial Generation of Natural Language", taking its authors to task along with what he sees as a broader unhealthy trend on arXiv. Yoav's remarks drew attention across the research community, and Yann LeCun and others quickly responded.


If you have not yet read Yoav's post carefully from start to finish, please do read it in full: nearly every paragraph rewards the time spent on it.

While it may seem that I am picking on a specific paper (and in a way I am), the broader message is that I am going against a trend in deep-learning-for-language papers, in particular papers that come from the “deep learning” community rather than the “natural language” community. There are many papers that share very similar flaws. I “chose” this one because it was getting some positive attention, and also because the authors are from a strong DL group and can (I hope) stand the heat. Also, because I find it really bad in pretty much every aspect, as I explain below.

Yoav is really only using the MILA paper as an example to criticize a shared set of flaws running through a whole line of DL4NLP papers (mostly certain text-generation models). This paper just happens to collect nearly every fault common to the genre.

This post is also an ideological action w.r.t arxiv publishing: while I agree that short publication cycles on arxiv can be better than the lengthy peer-review process we now have, there is also a rising trend of people using arxiv for flag-planting, and to circumvent the peer-review process. This is especially true for work coming from “strong” groups. Currently, there is practically no downside of posting your (often very preliminary, often incomplete) work to arxiv, only potential benefits.

Why do I care that some paper got on arxiv? Because many people take these papers seriously, especially when they come from a reputable lab like MILA. And now every work on either natural language generation or adversarial learning for text will have to cite “Rajeswar et al 2017”. And they will accumulate citations. And reputation. Despite being a really, really poor work when it comes to language generation. And people will also replicate their setup (for comparability! for science!!). And it is terrible setup. And other people, likely serious NLP researchers, will come up with a good, realistic setup on a real task, or with something more nuanced, and then asked to compare against Rajeswar et al 2017. Which is not their task, and not their setup, and irrelevant, and shouldn’t exist. But the flag was already planted.

Also in the crosshairs is the arXiv flag-planting craze: many people like to post very preliminary, frankly incomplete work to arXiv just to stake a claim. The quality of such papers has never been vetted by peer review, yet simply because they appeared first, they immediately attract a wave of follow-up work that copies their deeply flawed experimental setups. In words this answerer used recently in another answer, they "lead astray a whole crowd of the kids who follow up on them."


What follows walks through the paper itself, expanding on these two main targets of criticism from several angles.

1. Research scope

In papers of this kind, some authors deliberately overstate the scope of their work, or never clearly think through why they are doing it and what it is they are actually handling. They reach straight for grand titles like "generation of natural language" or "text generation", as if the model in the paper could really generate natural language. What was actually done amounts to no more than "A Slightly Better Trick for Adversarial Training of Short Discrete Sequences with Small Vocabularies That Somewhat Works".

Yoav bluntly calls this combination, experimenting only in highly simplified settings while still overclaiming, "disrespecting language". Along the way he also takes a swipe at the bAbI dataset, which is likewise simplified data that machine-learning researchers constructed for themselves after finding they could not handle natural language in general.

On top of that, the goal of these papers is often unclear in the first place. If the aim is merely to generate free text, one could simply take an RNN or VAE trained on the same unlabeled corpus and sample sentences from it directly; the output would be more grammatical, with far fewer restrictions on vocabulary size. Yet these works never compare against even these simplest baseline generative models.
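That "simplest baseline" is just ancestral sampling from a language model trained on unlabeled text. A minimal sketch, with an invented bigram table standing in for a trained RNN/VAE (the vocabulary and probabilities here are made up purely for illustration):

```python
import random

# Hypothetical stand-in for a trained language model: a bigram table
# mapping the previous token to a next-token distribution. In practice
# this would be an RNN or VAE fit on an unlabeled corpus.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def sample_sentence(rng, max_len=10):
    """Ancestral sampling: draw one token at a time from the model's
    next-token distribution until </s> or max_len."""
    tokens, prev = [], "<s>"
    for _ in range(max_len):
        dist = BIGRAM[prev]
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "</s>":
            break
        tokens.append(nxt)
        prev = nxt
    return " ".join(tokens)

print(sample_sentence(random.Random(0)))
```

Trivial as it is, this procedure samples fluent strings by construction, which is why Yoav insists any GAN-for-text paper must at least compare against it.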

2. Technical approach

Suppose you want to extend adversarial training to discrete sequences, using an RNN as the generator. The central technical obstacle is this: at each step the RNN outputs a multinomial distribution (a probability for every word, obtained via softmax), but when a sequence is actually generated, each step must commit to one specific word (a one-hot vector). That discrete output is not differentiable, so you cannot backpropagate through it the way you can through the generator G in an image GAN. The main contribution of the MILA paper boils down to: just feed that softmax directly to the discriminator D, since that is differentiable...

But then what discriminator D actually ends up doing is telling one-hot representations (real, discrete sentences) apart from continuous ones (the softmax outputs produced by G), which no longer has anything to do with judging whether the input is natural language. The net effect is to push the generator G toward outputs as close to one-hot as possible, in effect decreeing natural language == peaky distributions.

Do we know that the proposed model is doing more than introducing this kind of preference for spiky predictions? No, because this is never evaluated in the paper. It is not even discussed.
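The point is easy to make concrete with a minimal sketch (plain Python; the logits and the `peakiness` feature are invented for illustration): a discriminator never needs to learn anything about language, because a single language-blind statistic already separates real one-hot inputs from the generator's soft outputs perfectly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# "Real" data as D sees it: a one-hot row (one committed word per step).
real_step = [0.0, 1.0, 0.0, 0.0]

# "Fake" data as D sees it: the generator's raw softmax output.
fake_step = softmax([2.0, 1.0, 0.5, -1.0])

def peakiness(dist):
    """A trivial, language-blind feature: the max probability.
    One-hot vectors score exactly 1; soft distributions never do."""
    return max(dist)

print(peakiness(real_step))  # 1.0
print(peakiness(fake_step) < 1.0)  # True
```

Any discriminator with enough capacity will find this shortcut, so the adversarial signal chiefly rewards peakiness rather than grammaticality, which is exactly Yoav's objection.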

3. Experimental evaluation

The MILA paper uses two simple PCFGs, which the authors themselves never examined closely, to generate a "language", and then scores generation quality by the likelihood of sentences under that grammar. But everyone knows natural language plainly cannot be modeled by a PCFG, and generation probabilities under a PCFG induced from a finite corpus are no proxy for grammatical fluency either. Following earlier work, they also ran some experiments on classical Chinese poetry. Never mind that the poems are only five or seven characters per line; all of these works evaluate each line in isolation. Worse still, the evaluation is not human judgment of generation quality but merely a BLEU score.

I didn’t fully get that part, but it's funky, and very much not how BLEU should be used. They say this is the same setup that the previous GAN-for-language paper they evaluate against use for this corpus. The Penn Treebank sentences were not really evaluated, but by comparing the sample likelihood over epochs we can see that it is going down, and that one of their models achieves better scores than some GAN baseline called MLE which they don’t fully describe, but which appeared in previous crappy GAN-for-language work. ... The Chinese Poetry generation test again compares results only against the previous GAN work, and not against a proper baseline, and reports maximal BLEU numbers of 0.87. BLEU scores are usually > 10, so I’m not sure what’s going on here, but in any case their BLEU setup is weird and meaningless to begin with.
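A plausible reading of the 0.87-versus-"usually > 10" puzzle is a scale mix-up: BLEU's ingredient n-gram precisions live in [0, 1] and are conventionally reported multiplied by 100. A minimal sketch of the n=1 ingredient (plain Python; real BLEU also geometric-averages 2-4-gram precisions and applies a brevity penalty, and should be computed corpus-level against proper references):

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Modified unigram precision, the core ingredient of BLEU:
    candidate token counts are clipped by the maximum count of that
    token in any single reference, then divided by candidate length."""
    cand = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for tok, c in Counter(ref.split()).items():
            max_ref[tok] = max(max_ref[tok], c)
    clipped = sum(min(c, max_ref[tok]) for tok, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

p = unigram_precision("the cat sat", ["the cat sat down"])
print(p)        # 1.0 on the 0-1 scale ...
print(100 * p)  # ... i.e. 100 on the conventional 0-100 scale
```

Either way, as Yoav notes, no rescaling fixes the underlying problem: short isolated lines with this kind of reference setup are not something BLEU can meaningfully score.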


So what Yoav is attacking is a consistent set of problems exhibited by several text-generation works: unclear motivation, inappropriate methods, nonsense experiments, and, on top of all those flaws, posting to arXiv to harvest citations and mislead others.

As for how we "should view" all this: the indignant Yoav closes with several appeals, which are themselves the best answer to that question.

If you are a reviewer: when reviewing, respect natural language. Do not be dazzled by overclaims from flashy methods that in fact handle only drastically simplified settings. Look at what experimental evaluation was actually performed and what conclusions the results can support, not at what methods and achievements the paper claims for itself. And do not force NLP researchers working on real data to cite and compare against shoddy or obviously flawed "pioneering papers".

Note: Yoav does not object to studying natural language in simplified settings; he only asks that researchers be clear about the scope of what they are doing instead of constantly writing it up as big news. In the clarification he later posted ( https://medium.com/@yoav.goldberg/clarifications-re-adversarial-review-of-adversarial-learning-of-nat-lang-post-62acd39ebe0d ), he specifically stresses:

the toy task must be meaningful and relevant, and you have to explain why it is meaningful and relevant. And, I think it goes without saying, you should understand the toy task you are using.

If you are a paper author: respect natural language and try to learn more about it; genuinely understand whether your experimental datasets and the numbers you report can actually validate your findings. Know what you are doing, do not forget to compare against the most obvious baselines, and spell out the limitations of your work in the paper wherever possible.

the paper should be clear about the scope of the work it is actually doing. In the title, in the abstract, in the text. Incrementality is perfectly fine, but you have to clearly define your contribution, position it w.r.t existing work, and precisely state (and evaluate) your increment.

Let me add one point of my own: unless the finding is genuinely exciting, and unless you have checked repeatedly that the work has no serious flaws, do not post a paper to arXiv to plant a flag before it has been peer reviewed. Beyond possibly misleading others, you also run the risk of Yoav or some other colleague holding it up as a cautionary example.

This has run a bit long, so let me close by copying the opening sentence once more:

If you have not yet read Yoav's post carefully from start to finish, please do read it in full: nearly every paragraph rewards the time spent on it.


P.S. Beyond these issues, which are worth everyone's reflection, I took away one additional, personal lesson: in sharpness of insight and precision of expression, I still fall far short of a battle-tested veteran like Yoav. When I try to make the same points myself, they always come out thinner and less complete. Indeed, out of disappointment with some of this "pioneering work" and frustration with flag-planting, I recently leveled an entirely similar critique at the same phenomenon in another answer.




