Exploring word surprisals and authorship verification
Published under the Coding category.
Last night, I was experimenting more with word surprisals, which are calculated from the probability of a word appearing in a specified corpus (the surprisal of a word is the negative log of its probability). For my analyses, I was using a corpus of New York Times articles to calculate word surprisals, which has proven effective for my blog [^1]. I started to think about whether you could use word surprisals for authorship verification.
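To make the idea concrete, here is a minimal sketch of how a surprisal reference dictionary can be built. The corpus file name and the whitespace tokenization are stand-ins for illustration; my actual pipeline may tokenize and smooth differently.

```python
import math
from collections import Counter

def build_surprisal_dictionary(corpus_text: str) -> dict[str, float]:
    """Map each word in the corpus to its surprisal, -log2 p(word)."""
    words = corpus_text.lower().split()  # naive whitespace tokenization
    counts = Counter(words)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}

# "nyt_corpus.txt" is a placeholder path for the reference corpus.
with open("nyt_corpus.txt") as f:
    reference_surprisals = build_surprisal_dictionary(f.read())
```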
I computed the word surprisals across all of my blog posts, then the Kullback–Leibler divergence between the surprisals in each blog post and the reference surprisals calculated from the NYT corpus. KL divergence is commonly used to measure how much one probability distribution diverges from another. I then identified the posts in the 99th percentile of KL divergences, as I thought this would be a good way to find posts that have a different range of vocabulary than I usually use.
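Since KL divergence is formally defined over probability distributions, one way to implement this step is to compare each post's word frequencies with the word frequencies of the reference corpus. Below is a rough sketch of that comparison; the whitespace tokenization and the decision to skip words missing from the reference vocabulary are simplifications, not necessarily exactly what I did.

```python
import math
from collections import Counter

def word_distribution(text: str) -> dict[str, float]:
    """Normalized word frequencies for a piece of text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def kl_divergence(post_dist: dict[str, float], ref_dist: dict[str, float]) -> float:
    """D_KL(post || reference), summed over words present in both distributions."""
    return sum(
        p * math.log2(p / ref_dist[word])
        for word, p in post_dist.items()
        if word in ref_dist
    )
```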
The results, however, were more interesting than I expected! Three quarters of the posts in the 99th percentile were interviews from my coffee interview series. These posts contain other people answering questions I have asked, which means the language in the post was not mine but that of someone else. Two of the posts in the 99th percentile were on topics that I have only touched on once or twice on my blog, and were stylistically distinct from most of my posts.
In summary, my approach was:
- Calculate word surprisals for a corpus of text to create a reference dictionary;
- Look up surprisals for the words in my writing using that reference dictionary;
- Calculate KL divergences between the surprisals in each post and the reference dictionary; and
- Retrieve posts in the 99th percentile by KL divergence (a rough end-to-end sketch follows this list).
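Tying those steps together, the percentile filter might look something like this sketch, which reuses the helpers from the snippets above. The `posts` dictionary (mapping a post slug to its text) and the corpus file path are hypothetical stand-ins for my actual data.

```python
import numpy as np

# posts: dict mapping a post slug to its full text (hypothetical input)
with open("nyt_corpus.txt") as f:
    reference_dist = word_distribution(f.read())

divergences = {
    slug: kl_divergence(word_distribution(text), reference_dist)
    for slug, text in posts.items()
}

threshold = np.percentile(list(divergences.values()), 99)
outliers = [slug for slug, divergence in divergences.items() if divergence >= threshold]
```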
I wonder if this approach could be used to answer questions like "was this written by X or Y?" in historical documents. I saw on Wikipedia that there is a long history of authorship verification for texts. Wolfram Alpha has an excellent article on analyzing the provenance of the Federalist Papers whose authorship is disputed. Their approach is different from mine. Indeed, there are many approaches to authorship verification. Perhaps this would be only one factor to look at in a broader analysis.
I then started to think about how this could be used to verify whether a text was written by a human or an AI. The trouble with this approach is that posts on entirely new topics, or in different styles, may be flagged as AI-written, since the vocabulary used to express new concepts or different styles (e.g. poetry, introspective writing, discursive writing, essays on new topics) will diverge from the reference in terms of the probabilities of the words used. I do wonder if this could be used as one factor in a classifier.
[^1]: The choice of corpus does matter. I need to think more about how to evaluate whether a corpus is an appropriate basis for a surprisal reference dictionary. For instance, if my blog only contained programming posts, words that are less common in general English (e.g. "variable") may be seen as "surprising" even though, relative to the field, they are not. I need to do more research on this topic.
