Naive SEO with Natural Language Processing
Because I've been writing for a while (over a year, if you believe it), I wanted to start asking what I wrote most about. Instead of doing this through metadata processing, I thought I might try to do it instead from the other direction: from the actual words I use, that might correlate to a search phrase. I know this is not how search engines work any longer, and the world of LLM has made how they do work even more opaque. I wanted to build an understanding of the topics I tend to focus on, and maybe, as I formalize some of my notes into actual documentation or more formal articles / essays (but even if I don't), have some ideas of what areas might deserve more fleshing out.
To do this, I wrote a basic python script to process my entire corpus of notes (including the unpublished ones) into a collection of nouns and verbs, and especially the relationship. Then I printed the top ten in each category. The results disappointed, to say the least. I write a lot about mice and use a lot of to be verbs. But this gave me some things to think and work on as I refine my writing process.
I made two changes after this: first, I created a check script for searching for to be verbs -- this one is much more naive and doesn't use NLP. I run this with the grammar and spellchecks that I run on every article now, ensuring I am adding less to be verbs to the pool. The second change I made was to add to the NLP library to print the top 100 instances of a verb, and the number of occurrences of it in the notes. The goal of this was to help me quantify focus, in the event that I want to be more intentional about some topics.
It's not perfect. For instance, books sits in the 1% spot--the least used noun in the top 100 nouns, but book is about midway, and the combination of them puts it in the top 10% of all the things talked about. However, it does present a clear picture of the context by which I frame topics. As a ready example, it will surprise no one that the word used the most--both in volume and in reference density across notes (that is, it appears in the most notes), is time.