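# Reading English: Entropy and Pain

2018-06-28

I have always wanted to read English novels fluently in the original. It is not really an economical goal, because Chinese words carry more information than English ones: the same novel can be a thin volume in Chinese and a thick brick in English. Still, I hear the flavor is different.

A large number of unfamiliar words often gets in the way of reading. Not long ago I came across Zipf's law, which says that if you sort the words of a natural language by frequency, a word's frequency is inversely proportional to its rank. I happened to be reading Thomas Hardy's *The Return of the Native* (《还乡》 in Chinese), so I used it to check the law. The novel can be downloaded from [Gutenberg](http://www.gutenberg.org/).

The log-log plot of word frequency against rank is shown below:

(Figure: log word frequency versus log rank.)

A natural idea is to look up and memorize the most frequent words first. Would that be enough for a decent reading experience? I counted the word frequencies in this novel and found the distribution to be bizarre. The novel contains 11985 distinct words, 90% of which occur fewer than 10 times, and 5883 of which occur only once. From this angle there does not seem to be much I can do.

Some of the high-frequency words:

```
(',', 9456)
('.', 7497)
('the', 7314)
('to', 4057)
('and', 3777)
('of', 3758)
("''", 3294)
('a', 3225)
('``', 3168)
('I', 2762)
('in', 2243)
('was', 2053)
('that', 1838)
('her', 1733)
('you', 1730)
('it', 1464)
('he', 1364)
('had', 1309)
('as', 1280)
('she', 1270)
```

Then I thought: reading means extracting information from text. A book is an information source, and each word is a signal it keeps emitting. If we assume the words are independent of each other, the book is a memoryless source, and information theory lets us compute the entropy of the whole book or of any paragraph or sentence. A more elaborate approach would use a higher-order Markov model, in which the current word depends on several preceding words.

Treat the whole book as a collection of sentences. By the 80/20 rule, could 20% of the sentences contain 80% of the entropy? If so, I would only need to understand that 20% first to improve my reading experience greatly.

I used the simplest model (every word occurs independently) to compute the entropy of each sentence. The novel has 9103 sentences. If I want to knock out 80% of the entropy, starting from the hardest sentences, I have to read 55% of the sentences, that is, 5022 of them. That is a long way from 80/20. It looks like English simply has to be ground through the hard way, as the saying goes: there are no shortcuts. This is a real sore spot, one that has hurt me deeply for more than twenty years.

Of course, one variable here is the reader's own baseline. If your English is good enough to read every sentence below some threshold, the experience is different.

The same method can also help you pick easier novels: start with novels whose entropy is low and work your way up gradually.

Here are a few of the hardest (highest-entropy) sentences. It looks a bit as if longer sentences simply have higher entropy, but that is not actually the case. Each row of "#" separates two sentences, and the numbers after it are the corresponding entropy and word count.

> ##############################:4.403919 word count 101
> Yeobright had, in fact, found his vocation in the career of an itinerant open-air preacher and lecturer on morally unimpeachable subjects; and from this day he laboured incessantly in that office, speaking not only in simple language on Rainbarrow and in the hamlets round, but in a more cultivated strain elsewhere--from the steps and porticoes of town-halls, from market-crosses, from conduits, on esplanades and on wharves, from the parapets of bridges, in barns and outhouses, and all other such places in the neighbouring Wessex towns and villages.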
>
> ##############################:4.621805 word count 123
> INDEMNITY - You agree to indemnify and hold the Foundation, the trademark owner, any agent or employee of the Foundation, anyone providing copies of Project Gutenberg-tm electronic works in accordance with this agreement, and any volunteers associated with the production, promotion and distribution of Project Gutenberg-tm electronic works, harmless from all liability, costs and expenses, including legal fees, that arise directly or indirectly from any of the following which you do or cause to occur: (a) distribution of this or any Project Gutenberg-tm work, (b) alteration, modification, or additions or deletions to any Project Gutenberg-tm work, and (c) any Defect you cause.
>
> ##############################:5.836207 word count 143
> A faint beat of half-seconds conjured up Thomasin rocking the cradle, a wavering hum meant that she was singing the baby to sleep, a crunching of sand as between millstones raised the picture of Humphrey's, Fairway's, or Sam's heavy feet crossing the stone floor of the kitchen; a light boyish step, and a gay tune in a high key, betokened a visit from Grandfer Cantle; a sudden break-off in the Grandfer's utterances implied the application to his lips of a mug of small beer, a bustling and slamming of doors meant starting to go to market; for Thomasin, in spite of her added scope of gentility, led a ludicrously narrow life, to the end that she might save every possible pound for her little daughter.

[Some simple code is here](https://gist.github.com/rockyzhengwu/732e25f0fada7574bb839d42d20fddc7), though it seems you need you-know-what to be able to view it.

I have a few more oddball ideas that I cannot yet explain clearly. If you have any fun ideas of your own, feel free to get in touch: zhengwu@midday.me
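
The steps described in this post (the rank-frequency check, a per-sentence score under the independent-word model, and the share of sentences needed to cover 80% of the total) can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not the code in the linked gist: the regex tokenizer, the naive sentence splitter, the lowercase folding, and every function name are illustrative choices. The score used here is the summed surprisal, -log2 p(w) over the words of a sentence, which may not be the formula behind the numbers quoted above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; punctuation is dropped (an illustrative choice)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def unigram_probs(tokens):
    """Relative frequency of each word: the memoryless-source model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Zipf's law predicts a value near -1. Needs at least two distinct words.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def sentence_information(sentence, probs):
    """Sum of -log2 p(w) over the sentence's words, in bits."""
    return sum(-math.log2(probs[w]) for w in tokenize(sentence) if w in probs)


def sentences_needed(scores, share=0.8):
    """How many sentences, taken from the highest score down,
    are needed to reach `share` of the total score."""
    target = share * sum(scores)
    running = 0.0
    for n, score in enumerate(sorted(scores, reverse=True), start=1):
        running += score
        if running >= target:
            return n
    return len(scores)
```

With the novel's text loaded into a string `text`, build `probs = unigram_probs(tokenize(text))`, score every sentence from `split_sentences(text)`, and compare `sentences_needed(scores, 0.8) / len(scores)` with the 55% figure. The exact numbers depend on the tokenizer and sentence splitter, so they are not expected to match the counts reported in the post.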