In developing a project for my data mining class, I kept asking myself how I could objectively measure the similarity between two objects. These objects happened to be Magic: The Gathering trading cards, and I was attempting to build a card-recommendation system not unlike those used by YouTube or Netflix to recommend videos, TV shows, and movies. Essentially, I wanted the user to be able to pick any card that exists and have the system respond with a short list of the most similar cards. I knew the hard part was going to be determining what makes cards “similar.” The most important attributes of a card are entirely textual, and considering that almost 20,000 unique cards have been printed since 1993, I didn’t even know where to start. All I had was a CSV file of every card ever printed, with each card’s text attributes such as its name, its type (creature, sorcery, etc.), and the actual effects and rules of the card.
This is where machine learning engineer Christian S. Perone’s blog Terra Incognita comes in to save the day. He explains the use of tf-idf to convert a textual representation of data (like my Magic cards) into a vector space model: an objective, algebraic model of a card’s attributes. The tf-idf of a term in a document (a trading card, in my case) is the product of two statistics: the term frequency (tf) and the inverse document frequency (idf). The term frequency measures how often a term appears on a specific card, while the inverse document frequency measures how rare that term is across all cards, down-weighting common words; if a word appears frequently on a card but also appears on most other cards, it doesn’t provide us with much information. Combining the two, we can construct a matrix that contains a tf-idf measure (i.e., how important a word is) for every word on every card.
Now that we have a numerical measure of importance for every word on every card, we need to find cards whose tf-idf vectors are similar to the vector of the card selected by the user. Perone explains the popular cosine similarity method, which takes two vectors and returns a real number between 0 and 1 (for non-negative vectors like ours), where higher values mean more similar. We just have to compute the cosine similarity between our chosen card and every other card ever printed and return the most similar cards.
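The ranking step can be sketched like this, again using scikit-learn and the same invented card texts as stand-ins for real data:

```python
# Rank cards by cosine similarity to a chosen card.
# The card texts are made-up examples, not real card data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cards = [
    "Flying. When this creature enters the battlefield, draw a card.",
    "Destroy target creature. Its controller draws a card.",
    "Counter target spell unless its controller pays two mana.",
]
tfidf = TfidfVectorizer().fit_transform(cards)

chosen = 0  # index of the card the user picked
# Cosine similarity between the chosen card's vector and every card's vector
scores = cosine_similarity(tfidf[chosen], tfidf).ravel()

# Sort indices from most to least similar, dropping the chosen card itself
ranking = [i for i in scores.argsort()[::-1] if i != chosen]
print(ranking)  # most similar card first
```

A card is always maximally similar to itself (score 1.0), so it gets filtered out before presenting the recommendation list.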
Perone very clearly explains these techniques in both plain English and precise mathematical notation. The entire process has taught me an enormous amount about processing textually represented data and about formalizing human intuition of what makes two things “similar.” I know that the skills I learned through this project are going to be invaluable to me, and they’ve certainly changed the way I think about textual data.