Sunday, September 22, 2013

Kindle Wordbook -- Making a Personalized Dictionary from Books You’ve Read

 

The Kindle Paperwhite is a really good e-reader. Since getting it, I have read and re-read several books with great joy and ease. As I am not a native English speaker, I rely on the Paperwhite’s dictionary function. Of course, dictionary lookup is implemented similarly in other reading software on Android and iOS. However, I found that the Paperwhite offers the best touch experience (in the appendix of this article, I compare dictionary lookup on the Paperwhite with iBooks and the Duokan reader). The Paperwhite is also more suitable than a tablet or phone for reading plain text, in terms of light, size, and battery life.

Kindle stores all highlighted text, bookmarks, and notes in a text file: My Clippings.txt. When I look up a word in the dictionary, I immediately highlight it. Highlighting does not cost an extra touch, because one touch is needed anyway to close the dictionary window.


Figure 1. My Clippings.txt opened in a text editor.
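As a rough illustration of the parsing step, here is a minimal sketch that is not taken from the pwdict code. It assumes the usual layout of My Clippings.txt: entries separated by a line of ten equals signs, each with a title line, a metadata line that contains the word "Highlight", "Note", or "Bookmark", and then the clipped text. The function names parse_clippings and highlighted_words are made up for this example.

```python
import re

SEPARATOR = "=========="  # line that separates entries in My Clippings.txt
PUNCTUATION = ".,;:!?\"'()“”‘’ "


def parse_clippings(text):
    """Split the file contents into entries with a title, a kind and a text."""
    entries = []
    for block in text.split(SEPARATOR):
        # Drop the byte-order mark, surrounding whitespace and empty lines.
        lines = [ln.strip("\ufeff \t\r") for ln in block.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # bookmarks and empty blocks carry no clipped text
        title, meta, body = lines[0], lines[1], " ".join(lines[2:])
        meta_lower = meta.lower()
        if "highlight" in meta_lower:
            kind = "highlight"
        elif "note" in meta_lower:
            kind = "note"
        else:
            kind = "other"
        entries.append({"title": title, "kind": kind, "text": body})
    return entries


def highlighted_words(entries):
    """Keep highlights that consist of a single word, lowercased, no repeats."""
    words = []
    for entry in entries:
        if entry["kind"] != "highlight":
            continue
        token = entry["text"].strip(PUNCTUATION).lower()
        if re.fullmatch(r"[a-z]+(?:[-'’][a-z]+)*", token) and token not in words:
            words.append(token)
    return words
```

Longer highlights (whole sentences) are kept by parse_clippings but skipped by highlighted_words, since only single looked-up words are of interest here.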

One problem with highlighted words is that they are just words -- they carry neither their dictionary meanings nor the context in which they were highlighted. For many words that I have looked up before, I still cannot remember their meanings when I encounter them again.

So my idea is to parse My Clippings.txt, extract the highlighted words, and then

1) find the Chinese meanings of them using an online dictionary, and

2) find all occurrences in my past readings.

For the first task, I can use the API of an online dictionary. After some searching, I found that www.wordreference.com provides a simple API. An alternative is dict.baidu.com, which can be accessed via HTTP requests.
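The lookup step could be sketched roughly as follows. This is only an illustration: the endpoint URL and the "translations" field in the response are hypothetical placeholders, not the real WordReference or Baidu interface, so they would need to be replaced with whatever the chosen service actually documents. Only the built-in urllib and json modules are used.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and response shape, for illustration only.
API_URL = "https://dictionary.example/api/enzh/{word}"


def http_get(url, timeout=10):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def lookup(word, fetch=http_get, cache=None):
    """Return a list of translations for word, or [] if the lookup fails.

    fetch is injectable so the function can be tried without network access;
    cache is an optional dict that avoids asking twice for the same word.
    """
    if cache is not None and word in cache:
        return cache[word]
    url = API_URL.format(word=urllib.parse.quote(word))
    try:
        data = json.loads(fetch(url))
    except (OSError, ValueError):
        # Network errors are OSError subclasses, bad JSON is a ValueError.
        return []
    # "translations" is an assumed field name in the assumed response.
    meanings = data.get("translations", []) if isinstance(data, dict) else []
    if cache is not None:
        cache[word] = meanings
    return meanings
```

Passing a dict as cache is worthwhile because the same word tends to be highlighted more than once, and failed lookups are deliberately not cached so they can be retried later.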

If I only wanted the Chinese meanings of the unknown words, the first task would be enough. However, by merely looking at the Chinese translations, I may easily forget some of them; and even if I remember the vague meanings of these words, I still don’t know when and how to use them. The vocabulary we use in writing and speaking is much smaller than the vocabulary we can recognize while reading. By reviewing these words in their context, I can remember them better, and, more importantly, by learning their usage in different texts by different authors, I can work out when and how they are used. For example, I got to know the word ‘remorse’ while reading Mary Shelley’s Frankenstein, and this word is now linked with various episodes and images in the novel. The word not only has a meaning, but also a more concrete feeling. Acquiring such a concrete feeling, I think, is a prerequisite for using the word confidently and correctly.
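To make the second task concrete, here is a minimal sketch of collecting every sentence in which a word occurs. It assumes the books are already available as plain text in a dict that maps titles to contents. The sentence splitter is a deliberately naive regular expression used as a stand-in for a real tokenizer, and the names split_sentences and find_contexts are invented for this example.

```python
import re

# Naive rule: a sentence ends at . ! ? or a closing quote, followed by
# whitespace and then a capital letter or an opening quote.
_SENTENCE_BREAK = re.compile(r'(?<=[.!?"”])\s+(?=[A-Z"“])')


def split_sentences(text):
    """Collapse whitespace and split the text into rough sentences."""
    text = " ".join(text.split())
    return [s for s in _SENTENCE_BREAK.split(text) if s]


def find_contexts(word, books):
    """Return (title, sentence) pairs for every sentence containing word.

    books maps a book title to its full plain text. Matching is
    case-insensitive and limited to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for title, text in books.items():
        for sentence in split_sentences(text):
            if pattern.search(sentence):
                hits.append((title, sentence))
    return hits
```

The whole-word match means that inflected forms such as plurals are not found, and the naive splitter breaks on abbreviations like "Mr.", so a library tokenizer such as NLTK’s sent_tokenize is the safer choice for real books.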

So I wrote a little Python program to do the two tasks automatically for me. The program can be downloaded at:

https://github.com/yinz/pwdict

Here is a look at my recent words in a browser:


Figure 2. View word book in a browser.

Python is very good for trying out ideas and getting things done quickly, thanks to its vast libraries and dynamic nature. In this project, I need to make HTTP requests, parse JSON text, read EPUB files, and split English text into sentences. There are Python libraries for all of these tasks. The HTTP and JSON libraries are built in. I modified the epub2txt project to extract the plain text content from EPUB files, a typical book format used across various mobile reading software. Splitting an article into individual sentences seems easy, but can be tricky in many special cases, so I used the NLTK library for this.
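For readers curious about the EPUB step, the idea can be sketched with the standard library alone. This is not the modified epub2txt code, just a simplified stand-in that relies on an EPUB file being a zip archive of XHTML pages. It reads the pages in file-name order rather than in the reading order declared by the book, which is good enough for searching sentences but not for reproducing the book. The names _TextExtractor and epub_to_text are made up for this example.

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping head, script and style content."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # keep paragraphs on separate lines

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_to_text(path_or_file):
    """Return the plain text of all HTML/XHTML pages inside an EPUB."""
    chunks = []
    with zipfile.ZipFile(path_or_file) as archive:
        # Simplification: file-name order, not the order given by the book.
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append("".join(parser.parts))
    return "\n".join(chunks)
```

The resulting text can then be handed to a sentence tokenizer to look for the highlighted words.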

Appendix. Why dictionary lookup on the Paperwhite is better

Paperwhite: one touch to open the dictionary, and one touch to go back (marking the word does not cost extra touches).

Duokan and iBooks: two touches to open the dictionary, and one touch to go back. Marking the word costs two extra touches!

dict-compare