Paperless Post allows you to send invitations to events (“RSVPs”) and greeting cards (“cards”). Examples of RSVPs include weddings, company holiday parties and baptisms, to name a few. Cards include birthday greetings and Mother’s Day well wishes. It’s critical for us as a business to understand how to classify these events so that we can make better informed business decisions. We don’t ask people what they’re creating because we want them to have the easiest possible user flow. That said, it’s helpful for us to know what people are using our service for so we can improve it.
Today we’re going to better understand how exactly we classify greeting cards and events (hereafter simply referred to as “events”) at Paperless Post. We’ll review:
- How we think about event classifications
- Complexities around text-based classification
- Ways to understand if our classifier is good or not
- Key aspects of our model
It’s important for us to know if a birthday party is for a 5-year-old or a 55-year-old. These are two different demographics which require different marketing strategies and different card designs, among other things. However, it’s also important to distinguish between birthday parties and Christmas parties. As a result, we decided on a parent/child hierarchy for classifying events. At the top (primary level) are high-level categorizations, such as birthday, wedding, personal, etc… Within each one of these lives a secondary classification – kids birthday vs adult birthday, or wedding showers vs save the date. Finally, within each secondary category exists a tertiary category. Tertiary examples are baby shower vs teenage birthday vs tween birthday, or wedding save the date vs engagement save the date.
As you might imagine, classifying at the primary level is much more straightforward than at the tertiary level. For one, there are fewer categories, so inherently fewer potential mistakes. In addition, it’s easier to distinguish between birthday and wedding. However, things get hairy at the secondary and tertiary levels, which is why we need to build a model. For the purposes of this post, we’ll focus on the tertiary classification since that’s more interesting and generalizes to secondary and primary. Before diving into the model, however, let’s try to solve this problem using simple heuristics.
Heuristics and Nuance
The crux of our classification algorithm is that it combs through the text of a user’s event and identifies key words that are used to correctly classify it. Simple enough. So to begin, let’s make a rule:
If the term Halloween exists in the text of the event, then assign that event to the tertiary category Halloween.
You can imagine making a number of these rules for other keywords – birthday, baptism, Christmas, etc… It’s almost like the text lines up one-to-one with the classification. And then we have some words like spooky and costume that may also map to the Halloween category.
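To make the rule concrete, here is a minimal sketch of a keyword heuristic. The rule table and the `classify_by_keyword` helper are hypothetical and written only for illustration; they are not our production rules.

```python
import re

# Hypothetical keyword-to-category rules, for illustration only.
KEYWORD_RULES = {
    "halloween": "Halloween",
    "spooky": "Halloween",
    "costume": "Halloween",
    "baptism": "Baptism",
    "christmas": "Christmas",
    "birthday": "Birthday",
}


def classify_by_keyword(text):
    """Return the category of the first rule keyword found in the text, else None."""
    words = re.findall(r"[a-z']+", text.lower())
    for word in words:
        if word in KEYWORD_RULES:
            return KEYWORD_RULES[word]
    return None


print(classify_by_keyword("Join us for a spooky Halloween bash"))
print(classify_by_keyword("Cocktails and snacks in honor of Roy and Jenny's big day"))
```

The first call finds a keyword right away. The second returns nothing because none of the rule words appear in the text, which is exactly the kind of miss discussed next.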
While simple, this kind of rule can be effective for discriminating words that describe certain events well. But this breaks down when we move from primary to secondary or tertiary classifications. For example, “Join us to celebrate Dylan’s 3rd birthday” and “Join us to celebrate Dylan’s 40th birthday” are different events from the perspective of our business (as indicated earlier), and yet this rule wouldn’t know how to classify them at a tertiary level. We need to expand the rule to consider age. Of course, age is a number and it’s not obvious that all numbers are ages. What if we see a number that is a time? Or an address? In addition, not all birthday invitations include the age. Things are getting complicated very quickly.
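Here is a rough sketch of what an age-aware rule might look like, and how easily it gets fooled. The ordinal pattern, the cutoff of 18 and the bucket names are all assumptions made up for this example.

```python
import re

# Matches ordinals such as "3rd" or "40th". Nothing here knows whether the
# number is an age, a date or a street number.
ORDINAL = re.compile(r"\b(\d{1,3})(?:st|nd|rd|th)\b", re.IGNORECASE)


def guess_age(text):
    """Return the first ordinal number in the text as a candidate age, else None."""
    match = ORDINAL.search(text)
    return int(match.group(1)) if match else None


def birthday_bucket(text, adult_age=18):
    """Hypothetical rule: kids vs adult birthday (adult_age is an assumed cutoff)."""
    if "birthday" not in text.lower():
        return None
    age = guess_age(text)
    if age is None:
        return "birthday (age unknown)"
    return "adult birthday" if age >= adult_age else "kids birthday"


print(birthday_bucket("Join us to celebrate Dylan's 3rd birthday"))
print(birthday_bucket("Join us to celebrate Dylan's 40th birthday"))
# The date below is mistaken for an age, so a kid's party could be mislabeled.
print(birthday_bucket("Dylan's birthday party on October 31st"))
```

The last call treats the 31 in the date as an age, and an invitation with no number at all falls into the "age unknown" bucket.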
Let’s take another example: Weddings. On its face it seems straightforward. If an event contains the word wedding then classify as wedding. “Come celebrate Roy and Jenny’s wedding” is clearly wedding related. This would work at the primary level. But what about: “Cocktails and snacks in honor of Roy and Jenny’s big day”. No mention of wedding in that one, and in fact it might not even be a wedding, but our simple rule would have missed it completely.
You can build more and more simple rules to catch extreme cases, but then you’ll be left with a very long list of rules. In addition, you’d be surprised how creative some people are in describing their events. And of course as our terminology changes, you’ll always be playing catch up. Finally, when you have many simple rules, you can end up having a big, opaque, and potentially unintuitive model, which is antithetical to a simple rules-based model. But there’s another problem too.
Precision vs Recall
Say I stick to the birthday rule from above. For all the cards that I end up classifying as a birthday, you can be very confident that they will be related to birthdays. After all, there aren’t many events that include the word birthday and are not actually birthday related! The events that I classified as birthdays and that actually are birthdays are called true positives, and the proportion of my birthday predictions that are true positives is the model’s precision. On the flip side, there are many events that I didn’t classify as birthday but that actually are birthday events. This is because they didn’t include the word birthday, which is all my simple rule uses. These events that I missed are called false negatives, and the proportion of actual birthday events that I did catch is the model’s recall. This tension between precision and recall is one of the fundamental tensions in classification problems. The stricter I am in my classification rule, the higher my precision but the lower my recall. On the flip side, the more liberal my rules are (i.e. a wide net), the worse my precision but the better my recall. No free lunch.
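As a small worked example, precision and recall for a single category can be computed from a list of true labels and a list of predicted labels. The labels below are made up to mimic a strict keyword rule; they are not real results.

```python
def precision_recall(actual, predicted, category):
    """Compute precision and recall for one category from parallel label lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == category and p == category)
    false_pos = sum(1 for a, p in pairs if a != category and p == category)
    false_neg = sum(1 for a, p in pairs if a == category and p != category)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels: four real birthdays, but the strict rule only catches two.
actual = ["birthday", "birthday", "birthday", "birthday", "wedding", "other"]
predicted = ["birthday", "birthday", "other", "other", "wedding", "other"]

precision, recall = precision_recall(actual, predicted, "birthday")
print(precision, recall)
```

Every event the rule labels as a birthday really is one (precision of 1.0), yet it finds only half of the actual birthdays (recall of 0.5).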
In addition, a simple rule means that each event can only be labeled as one category. While intuitive, what do we do with events that might be classified as multiple categories? Maybe an organization’s Winter party is 60% Christmas party, 30% Holiday party and 10% Generic Drinks and Snacks?
We can say that the usefulness of a simple heuristic is inversely related to the level of granularity that it can accurately classify. All that is to say – heuristics are good for simple categorizations. So how do we balance simplicity with complexity? We can cheat…sort of.
Text + Meta data FTW
I used a model that’s a combination of text and non-text data (why limit yourself?). As mentioned above, text data includes the text of the event, while non-text data includes things like greeting card vs. RSVP, time of the event (Mother’s Day cards and Thanksgiving dinner invites occur at distinct times of the year), how many guests are invited (organizations vs personal), etc… I then took this combination of features and used it as my input to a model that can figure out what’s meaningful and then yields a primary/secondary/tertiary classification along with a confidence score for each event. The higher the score, the more confident the prediction. I’m able to tweak my inputs in order to achieve a higher and higher confidence score and better classification accuracy.
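To give a feel for what “a combination of features” can mean, here is a minimal sketch that merges word counts with a few non-text signals into a single feature dictionary. The function name, the feature keys and the choice of signals are assumptions for illustration, not the actual feature set.

```python
import re
from collections import Counter
from datetime import date


def build_features(text, is_rsvp, event_date, guest_count):
    """Merge word counts with a few non-text signals into one feature dict."""
    features = Counter("word=" + w for w in re.findall(r"[a-z]+", text.lower()))
    features["meta=is_rsvp"] = 1 if is_rsvp else 0          # RSVP vs greeting card
    features["meta=month_%d" % event_date.month] = 1        # time of year
    features["meta=guest_count"] = guest_count              # organization vs personal
    return dict(features)


example = build_features("Spooky costume party", True, date(2024, 10, 31), 40)
print(example)
```

A model can then weigh the text and non-text columns side by side, without being told in advance which ones matter.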
For the rest of this post however, let’s stick to text based modeling as that’s the more interesting problem to solve.
How would the model make sense of the following event?
To the human eye, this is clearly a Halloween party invite. What are the text-based signals (ignore the ghost!)?
- Dracula – not many people are named this
- October and 31st – this date is a very strong signal
- Fangs – this is a rare and odd word to have in a party invitation
- 9 PM suggests that it’s likely not a kid’s birthday
What is the noise?
- Wendy and Jack are two names, presumably a couple. Where else do couples show up? Almost exclusively in wedding related events.
- Invite and you are words that many RSVP events have.
- Dance and party both show up in general parties and birthdays and a slew of other events.
These noisy terms are not helpful in discriminating between Halloween and non-Halloween, and actually might skew towards parties instead of Halloween.
Not all words are made equal
Let’s begin to move away from heuristics. We can see that different words have different importance. For example, dance is likely to be highly correlated with the event generic party, but in the context of the above card, the date October 31 is probably a more important signal that the event is Halloween and not generic party because October 31 is so rarely used in general. So clearly the importance of a word in determining the classification depends on the context of the event.
We can re-weight words based on their importance. Instead of seeing if a word exists or not (1 or 0), we apply a weight. This type of weighting scheme – where rare words are worth more – is called TFIDF (term frequency-inverse document frequency). This is a well-studied way of weighing word importance in a group of documents. It’s also a great way to ignore words like the and and. So specific words can be used as signals. We apply TFIDF to our entire set of words, across all events. This represents our first step away from simple heuristics because TFIDF determines the weights associated with each word.
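Here is a minimal sketch of one common form of TFIDF, where a word’s weight is its frequency within the event multiplied by log(N / number of events containing it). There are several variants of the formula, and the three example events are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


events = [
    "join us for the costume party",
    "join us for the birthday party",
    "join us for the wedding",
]
weights = tfidf(events)
print(weights[0])
```

Words that appear in every event, such as the, get a weight of exactly zero, while costume, which appears in only one event, gets the largest weight in the first event.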
But the presence or absence of a single word by itself isn’t enough. We need more words.
Bag of Words
Implicit in the analysis so far is that prevalence of a specific word or group of words is highly correlated with some type of category (costume is correlated with Halloween). The key insight behind our model is that words that occur more often together are more likely to be related to each other. This means that we need to count the frequency of our words (and apply TFIDF) instead of just checking if they exist or not. This intuitive, naive and surprisingly powerful approach is called a bag-of-words (BOW) model. It’s naive because it doesn’t incorporate context, just the words themselves. Yet it has surprisingly good results on classification accuracy (or precision and recall). There do exist algorithms that take context into account (e.g. word2vec) but those methods are sometimes overkill and can yield less intuitive results, depending on your problem. In addition, since you have to read through many combinations of words and cross-reference them against each other to determine context, these algorithms become computationally expensive, whereas BOW is based on simple word counts which can be calculated very quickly.
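A bag of words is little more than a word count per event. The sketch below, with made-up event text and hypothetical helper names, turns a few events into a count matrix with one row per event and one column per word.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word appears, ignoring word order entirely."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def to_matrix(texts):
    """Build a shared vocabulary and an event-by-word count matrix."""
    bags = [bag_of_words(text) for text in texts]
    vocabulary = sorted(set(word for bag in bags for word in bag))
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix


vocabulary, matrix = to_matrix(["dance party dance", "costume party"])
print(vocabulary)
print(matrix)
```

Because only the counts are kept, “dance party” and “party dance” produce the same bag, which is the loss of context described above.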
As mentioned above, the crux of our model is that certain words are associated with certain categories. The algorithm looks at all the words and the events together (forming a big matrix) and applies weights to each word when presented with a new event we want to classify.
So in the above example, the algorithm would put a lot of weight on words like October and costume and 9PM. It figures out the weights using some math and historical data that we fed into it to teach it what Halloween events look like, what weddings look like and what birthdays (kids vs. adults) look like. In this manner, we don’t need to explicitly say costume maps to Halloween. We hope that the model learns how strong this relationship is after seeing enough examples.
Learning Concepts vs. Words
So the main input to the model is a big matrix of words and events.
I highlighted so-called word/event clusters. As mentioned above, some words are more useful than others. For example, the isn’t helpful at all, so we use TFIDF to down weight it, effectively ignoring it (by almost zeroing it out).
We know that some words cluster together – turkey, pumpkin, foliage, November. These words are so related that we say they define a concept. Let’s call this concept Thanksgiving. To reveal these concepts, we can apply Singular Value Decomposition (SVD) to the big matrix of word counts. SVD is one of the most widely used techniques in all of numerical analysis. It’s a way to decompose a big matrix into smaller matrices, each one having its own meaning in the context of the problem we’re solving.
Using SVD on this matrix will essentially convert a lot of words into a few concepts. These concepts, theoretically, will map to the classifications we know – such as birthday, Thanksgiving, Halloween, Christmas, etc… Applying SVD to a term/document matrix as above is known as Latent Semantic Analysis. It’s latent because we are picking up latent (or hidden) concepts by clustering words based on how similar their documents are.
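To see the idea at a small scale, here is a sketch that applies SVD to a tiny, made-up event-by-word count matrix using NumPy. The word list, the counts and the choice of two concepts are assumptions for illustration only.

```python
import numpy as np

# Made-up count matrix: rows are events, columns are words.
words = ["turkey", "pumpkin", "november", "costume", "spooky", "october"]
counts = np.array([
    [2, 1, 1, 0, 0, 0],   # Thanksgiving-like event
    [1, 2, 1, 0, 0, 0],   # Thanksgiving-like event
    [0, 0, 0, 3, 1, 1],   # Halloween-like event
    [0, 0, 0, 1, 3, 1],   # Halloween-like event
], dtype=float)

# counts = U @ diag(S) @ Vt, with singular values S sorted largest first.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

n_concepts = 2
event_concepts = U[:, :n_concepts] * S[:n_concepts]   # each event in concept space
concept_words = Vt[:n_concepts]                       # each concept as word loadings


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine(event_concepts[0], event_concepts[1]))   # two Thanksgiving-like events
print(cosine(event_concepts[0], event_concepts[2]))   # Thanksgiving vs Halloween
```

In this toy matrix the two Thanksgiving-like events end up pointing in the same direction in concept space, while a Thanksgiving-like and a Halloween-like event are essentially unrelated. Real event text is far messier, so the concepts are rarely this clean.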
So now we have a method of figuring out concepts. But how can we actually calculate what key concept is inherent in each event? In other words, how can we figure out if the above event is Halloween or Thanksgiving? The next step in the process is to solve for the weights we apply to each word in the event.
On to the math…or not
This part of the model is quite simple. Generally speaking, we can write most machine learning problems in the form Y = X * B. In this context:
- X = Our concept/document matrix, after applying TFIDF and SVD
- Y = The classification of each event (birthday, St. Patrick’s Day, bachelor party, etc…)
- B = A vector of weights associated with each column of X (each word or, after SVD, each concept)
We know X and Y, and what we really want to know is B. Once we have B, then given a bunch of words (e.g. text from a new event), we can apply our weights to each word (via a dot-product) and output a classification (Y).
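As a sketch of the Y = X * B idea, the snippet below solves for B with ordinary least squares using NumPy and then scores a new event with a dot product. Least squares is only a stand-in chosen to keep the example short; the numbers, the two categories and the `classify` helper are made up.

```python
import numpy as np

categories = ["Halloween", "Thanksgiving"]

# Made-up X: one row per historical event, one column per concept.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.9],
    [0.0, 0.7],
])

# Y as one-hot rows: column 0 = Halloween, column 1 = Thanksgiving.
Y = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
], dtype=float)

# Least-squares solution of X @ B ~= Y (a stand-in for the real training step).
B, *_ = np.linalg.lstsq(X, Y, rcond=None)


def classify(features):
    """Score a new event against every category and return the best one."""
    scores = np.asarray(features, dtype=float) @ B
    return categories[int(np.argmax(scores))]


print(classify([0.85, 0.05]))
print(classify([0.05, 0.80]))
```

With several categories, B has one column of weights per category, and the category with the highest score wins.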
How exactly we get B isn’t all that important for this purpose. There are a number of algorithms we can use – for the record, we’re currently using an ensemble of methods which includes a Logistic Regression and Support Vector Machine. Some algorithms are better suited for this type of problem than others, but the most important factor is: does the model classify well? Does it balance precision and recall? Does it run quickly? If so, then we go ahead and productionalize it. If not, we keep making adjustments until it’s good enough for people to trust and use.
We covered a lot above, so here are the highlights:
- Accurately classifying events is useful for our business. The more granular the better – wedding brunch vs. engagement party is more useful than just wedding.
- Simple heuristics are great for some things that are obvious like Halloween but not for others like holiday party. You’ll get high precision but low recall.
- Certain words are highly correlated with certain categories. This is the crux of the bag-of-words algorithm and is the key to its simplicity and intuitiveness.
- Instead of telling the model which words to watch out for, we ask the model to tell us which are important and which are not, based on historical data.
- Using SVD, we can learn more than just word associations – we can learn concepts, which are clusters of related words. Bonus points: This type of model can think in probabilities instead of binary yes/no.
- Rare words are more important in determining classification than common words, all else being equal. The is useless. Stag is helpful.
- Context is obviously nice to have, but can be very resource intensive because you need a lot of data. Sometimes word counts alone are sufficient for a good, working model, supplemented by meta data.
- Y = X * B is a very common way to think about machine learning tasks. There are many techniques to figure out B and it’s important to understand what they do and when to use which technique. But in general, algorithm choice – while the most mathematically and computationally intensive part of the process – is not nearly as important to solving the overall task as a thoughtful, holistic model building approach.
And that in a nutshell is how I built (and continue to maintain and improve) Paperless Post’s event classifier! Feel free to drop me an email @ firstname.lastname@example.org.