Brainstorming a syntax for a word usage query language
Published under the Coding category.
This weekend I experimented with using word surprisals (given a corpus of text, how "surprising" is it that a given word appears?) for text prediction. I ended up with a neat context-aware autocomplete tool that, given a blog post, recommends how to complete a word. I made a user interface that lets you press the Tab key to accept a suggestion.
I have found numerous uses for word surprisals now:
- Computing an "index" for my personal website;
- Stylometry;
- Autocomplete; and
- Probably other things I do not presently recall.
I am pretty excited by what you can do with statistics-based NLP. If you are passionate about this subject, please reach out. I'd love to chat more with people who have spent time thinking about statistical language analysis.
One use case I was thinking of at the weekend is using surprisals to see who is more likely to talk about a given keyword. An example use is to take two artists' song lyrics and see for whom using a given word would be most surprising. The idea is that a word like "reputation" would not be too surprising for Taylor Swift, but it would be more so for Lady Gaga. I haven't tested this idea on different artists, but I think there is something there. Or you could compare across two different corpora from the same person, where each corpus represents the person in a different context. Is it more surprising for Swift to use "reputation" in an interview or in her lyrics?
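Here is a minimal Python sketch of that comparison. It assumes surprisal is computed as -log2 of a word's relative frequency in a corpus, which is the standard definition but may not be exactly what my tool does. The two corpora are made-up toy strings, not real lyrics, and the function names and the fallback for unseen words are choices made for illustration.

```python
import math
from collections import Counter


def surprisals(text: str) -> dict[str, float]:
    """Map each word in `text` to its surprisal, -log2(count / total)."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


def surprisal_of(word: str, table: dict[str, float]) -> float:
    """Look up a word's surprisal.

    Assumption: a word the corpus never uses is scored one bit above the
    rarest word that does appear. Other smoothing schemes would also work.
    """
    return table.get(word.lower(), max(table.values()) + 1.0)


# Toy stand-ins for two corpora; real lyrics or blog posts would go here.
corpus_a = "reputation reputation love story love reputation"
corpus_b = "poker face love game love romance"

table_a = surprisals(corpus_a)
table_b = surprisals(corpus_b)

# True when the word is more surprising in corpus B than in corpus A.
more_surprising_for_b = (
    surprisal_of("reputation", table_b) > surprisal_of("reputation", table_a)
)
print(more_surprising_for_b)
```

The same two functions would cover the interview-versus-lyrics case: build one table per context and compare the same word across both.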
I thought about a concise, limited query language that lets you uncover these insights. It would work like this:
James -> ["jamesg.blog.json"]
Taylor -> ["taylor.json"]
# is it more surprising that James talks about love than Swift?
James love > Taylor love? - TRUE
# is it more surprising that James talks about coffee than Swift?
James coffee > Taylor coffee? - FALSE
# declarative statement to say code herein will use the James corpus
James!
# is it more surprising that James talks about coffee than tea?
coffee > tea? - TRUE
With this language, I could find out if it is more surprising that X talks about a topic than Y. In other words, is X less likely to talk about a topic than Y?
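To make the syntax concrete, here is a rough Python sketch of an interpreter for it. Everything in it is an assumption: the language is only a brainstorm, so the grammar is inferred from the examples above, and the "- TRUE" / "- FALSE" parts are treated as output rather than input. Since the JSON file format is not defined, the hypothetical `run` function takes already-loaded tables of word surprisals keyed by file name, and the numbers below are invented so the results mirror the examples.

```python
import re

# Grammar inferred from the examples; none of this is a settled design.
BIND = re.compile(r'^(\w+)\s*->\s*\["([^"]+)"\]$')   # James -> ["file.json"]
USE = re.compile(r"^(\w+)!$")                        # James!
COMPARE = re.compile(r"^(?:(\w+)\s+)?(\w+)\s*>\s*(?:(\w+)\s+)?(\w+)\??$")


def run(program: str, files: dict[str, dict[str, float]]) -> list[bool]:
    """Evaluate each comparison in `program` and return the results in order.

    `files` maps a file name to a {word: surprisal} table, standing in for
    loading the JSON files, whose format is not specified.
    """
    corpora: dict[str, dict[str, float]] = {}
    default = None
    results = []
    for raw in program.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if m := BIND.match(line):
            corpora[m.group(1)] = files[m.group(2)]
        elif m := USE.match(line):
            default = m.group(1)
        elif m := COMPARE.match(line):
            left_name, left_word, right_name, right_word = m.groups()
            left = corpora[left_name or default]
            right = corpora[right_name or default]
            # Assumption: a word missing from a corpus is infinitely surprising.
            results.append(
                left.get(left_word, float("inf")) > right.get(right_word, float("inf"))
            )
        else:
            raise SyntaxError(f"cannot parse: {line!r}")
    return results


# Invented surprisal values, chosen only to reproduce the example results.
files = {
    "jamesg.blog.json": {"love": 9.0, "coffee": 6.0, "tea": 4.0},
    "taylor.json": {"love": 5.0, "coffee": 11.0},
}
program = """
James -> ["jamesg.blog.json"]
Taylor -> ["taylor.json"]
James love > Taylor love?
James coffee > Taylor coffee?
James!
coffee > tea?
"""
results = run(program, files)
print(results)
```

A comparison without a corpus name falls back to whichever corpus the last `Name!` statement selected, which is how the bare `coffee > tea?` query gets resolved against the James corpus.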
I don't know if there is anything here, but I am intrigued. Perhaps I will revisit this idea!