
Teaching a Computer to Read:

Murad

Pen and Code

Scripted recently released a new feature called Experts, which allows us to efficiently and confidently group together expert writers in a given subject, the idea being that a business looking for experts in that field can easily find writers who are highly qualified (as both a writer and a domain expert) to write about it. Part of what determines whether a writer is a good fit for an Expert team is knowing how many pieces they’ve written about that team’s subject matter. This entailed an interesting machine learning problem…

The Problem

How can I get a computer to tell me what an article is about (provided methods such as bribery and asking politely do not work)?

To explain the problem and the proposed solution, I’m going to stay fairly high-level and use a toy example, with links to resources for further reading. One disclaimer up front: in reality you would need a dataset of far more than four samples to actually make any of this work (like a lot more…).

For the sake of this example, imagine we are engineers at a much tinier version of Scripted, with only four pieces of writing in our system, the titles of which are:

“The Perfect Panini”
“Sandwich Sorcery”
“Boston Terriers: Friend or Foe?”
“How to Tell if your Dog is a Dog and not a Cat”

Note: These four documents, in the language of natural language processing, are known as a ‘corpus’. You can think of this as a collection of written documents that we want a computer to learn from. 

As imaginary employees of tiny Scripted, we’ve noticed that many clients are looking for writers who are experts in the fields of dogs and sandwiches to produce content for them. Accordingly, we make a “Dogs” team and a “Sandwiches” team. The question now is, which documents belong on which team?

Dog vs. Sandwich

Describing Documents

Imagine you’re playing a guessing game with someone. You provide your partner with a list of animals, and a list of attributes about each animal. You describe a domesticated animal that is furry, says ‘meow,’ and is prominently featured on the internet. Based on these clues, and the information you initially supplied, the person you’re playing with guesses “cat.”

This is pretty much the same process our algorithm will use (this task is actually a very familiar one to anyone who’s dabbled in machine learning, and is known formally as classification). To put it a bit more formally and relate it to our specific problem, we want the process of putting a document on a team to eventually look like:

  1. Extracting features from the document which are known to us
  2. Using these features to describe the document to the system
  3. Having the system guess which team that document belongs to based on that description.

First things first. We need a way to numerically describe documents to a computer so that we can tell how similar two documents are to each other. You can think of this as a mapping from a document to a point in space, such that the closer two points are to each other, the more similar their corresponding documents are. Thinking of documents as vectors in this way is formally known as the Vector Space Model.

Fig. 1. A two-dimensional vector space illustrating the concept of considering document similarity in terms of Euclidean Distance. All the dots represent documents. The small red dots represent the documents most similar to the large red dot in the middle.

A common way to represent a written document as a vector is to think about it in terms of “Bag of Words” (BoW) vectors. These are vectors of word counts where each slot in the vector represents the number of times a certain word was used. The list of all the words we keep track of in these vectors is known as our vocabulary.

Let’s say that our entire vocabulary consists of only the words “dog” and “cat.” Then a BoW vector of a document with this vocabulary would be:

<# times ‘dog’ was used in doc, # times ‘cat’ was used in doc>

For example, the text `”dog cat cat dog dog”` would be represented as the vector `<3,2>`. This pattern holds for larger documents and vocabularies. Now that we have some understanding of the vector space model, we can get cracking on some code.
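Before we do, here is a minimal hand-rolled sketch of the counting, using the hypothetical two-word vocabulary above. Later on we’ll let gensim do this bookkeeping for us; this is just to make the mapping concrete.

#A tiny Bag of Words example with a two-word vocabulary
vocabulary = ['dog', 'cat']

def to_bow(text, vocabulary):
    #Return a list of word counts, one slot per vocabulary word
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

print(to_bow("dog cat cat dog dog", vocabulary))  # -> [3, 2]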

Setup

The entirety of the code covered in this tutorial can be found here. We’re going to be working in Python, and will use the following modules to make our lives way easier:

  • Numpy - The go-to Python library for fast numerical computing.
  • Scipy - A scientific computing module with loads of functionality built on top of Numpy.
  • Gensim - “Topic Modeling for Humans”
  • Scikit-Learn - Machine learning library for Python also built on top of Numpy.

I’m assuming you have pip and git installed. If that isn’t the case, you should definitely start there, as these tools are super useful for developers. With those installed, the first thing we want to do is grab a local copy of the tutorial’s source code:

git clone https://github.com/Scripted/NLP-Tutorial

Now, we need to install the dependencies. In a perfect world, running `pip install -r requirements.txt` in the NLP-Tutorial directory would take care of all of this for you. Unfortunately, Numpy and Scipy can be tricky to install. You might want to try installing each individually in the following order:

pip install numpy
pip install scipy
pip install gensim
pip install scikit-learn

To test whether your installation was successful, run `python classifier.py`. If it doesn’t give you any errors, then you’re all set! If not, just follow the links I provided above next to each dependency and follow their installation instructions.

The Code

Now that you’re all set up, let’s switch gears back to the algorithm. Recall that the first step of the process we described was to turn documents into vectors. Let’s look at how to actually do that in code. We’ll start by pulling in all our dependencies.

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
from math import sqrt
import gensim
from sklearn.svm import SVC
import os

Now that we’ve got all the resources we need, we start by loading in our corpus. If you recall, a corpus is a collection of documents we wish to learn from and work with in the context of an NLP algorithm. I’ve provided a toy corpus in the repository which is aptly named… “corpus”. This directory contains the following four documents:

dog1.txt - "dog runs and barks at dog"
dog2.txt - "the dog runs and barks at the apatosaurus"
sandwich1.txt - "a sandwich of cheese and meat and bread and cheese is supercalifragilisticexpialidocious"
sandwich2.txt - "the sandwich of meat and meat and cheese and meat"

You might astutely observe that these documents are unrealistically simple, with a very limited vocabulary and no punctuation. Yup, that’s by design. This toy corpus was meticulously designed by yours truly to demonstrate certain aspects of the process without getting mired in the gritty details of practical application.

While things like cleaning punctuation, HTML, and URLs out of your text, as well as lemmatization, are absolutely worth looking into for a “real world” application, they are low-level details that would distract us from the high-level process. Ultimately, all of these steps exist to split a block of text into words.
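For the curious, here is a rough sketch of the kind of cleanup a real-world pipeline might do before splitting on spaces. The regexes are purely illustrative (and lemmatization is omitted entirely); none of this is part of the tutorial’s code.

import re

def clean_and_split(raw_text):
    #Very rough cleanup: strip HTML tags, URLs, and punctuation,
    #lower-case everything, then split into words
    text = re.sub(r'<[^>]+>', ' ', raw_text)            #drop HTML tags
    text = re.sub(r'https?://\S+', ' ', text)           #drop URLs
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())    #drop punctuation
    return text.split()

print(clean_and_split("Check out <b>my dog</b> at http://example.com!"))
# -> ['check', 'out', 'my', 'dog', 'at']

Back to the toy corpus, where none of that is necessary: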

if __name__ == '__main__':
    #Load in corpus, remove newlines, make strings lower-case
    docs = {}
    corpus_dir = 'corpus'
    for filename in os.listdir(corpus_dir):
        path = os.path.join(corpus_dir, filename)
        doc = open(path).read().strip().lower()
        docs[filename] = doc
    names = docs.keys()

    #Remove stopwords and split on spaces
    print "n---Corpus with Stopwords Removed---"
    stop = ['the', 'of', 'a', 'at', 'is']
    preprocessed_docs = {}
    for name in names:
        text = docs[name].split()
        preprocessed = [word for word in text if word not in stop]
        preprocessed_docs[name] = preprocessed
        print name, ":", preprocessed

The output so far…

---Corpus with Stopwords Removed---
sandwich2.txt : ['sandwich', 'meat', 'and', 'meat', 'and', 'cheese', 'and', 'meat']
dog2.txt : ['dog', 'runs', 'and', 'barks', 'apatosaurus']
dog1.txt : ['dog', 'runs', 'and', 'barks', 'dog']
sandwich1.txt : ['sandwich', 'cheese', 'and', 'meat', 'and', 'bread', 'and', 'cheese', 'supercalifragilisticexpialidocious']

What we’ve done so far is just some pretty basic Python to read in the documents in our corpus and store them in a dict. All we did to split the blocks of text into lists of words was call the string’s built-in split method, which for our purposes pretty much just splits on spaces.

Preprocessing

The code also hints at another aspect of NLP called preprocessing. You can think of preprocessing as filtering out words which ultimately don’t tell us a whole lot about the text. Words such as ['the', 'of', 'a', 'at', 'is'], which appear with great frequency in the vast majority of documents we’ll ever analyze, are called stop words and should definitely be filtered out. Similarly, you might also want to filter out words which appear either extremely frequently or extremely rarely. This is where gensim comes in handy.

#Build the dictionary and filter out common/rare terms
    dct = gensim.corpora.Dictionary(preprocessed_docs.values())
    unfiltered = dct.token2id.keys()
    dct.filter_extremes(no_below=2)
    filtered = dct.token2id.keys()
    filtered_out = set(unfiltered) - set(filtered)
    print "nThe following super common/rare words were filtered out..."
    print list(filtered_out), 'n'
    print "Vocabulary after filtering..."
    print dct.token2id.keys(), 'n'

Which outputs…

The following super common/rare words were filtered out...
['and', 'apatosaurus', 'bread', 'supercalifragilisticexpialidocious'] 

Vocabulary after filtering...
['cheese', 'runs', 'sandwich', 'meat', 'barks', 'dog']

To recap, the above code filters out words from our corpus which are too common or too rare to be useful. It does so by leveraging gensim’s Dictionary class, which stores word counts for each term encountered in the corpus. After we feed it all the documents, we tell it to filter out words which are very common (defined by gensim’s default threshold) as well as words which appear in fewer than two documents (the `no_below=2` argument). This leaves us with only six words in our vocabulary: ['cheese', 'runs', 'sandwich', 'meat', 'barks', 'dog']. All words not in the vocabulary are ignored from here on out.

At this point, we’ve got all we need to start bringing bag of words vectors into the picture.

From Texts to Vectors

From this point on, we’ll be thinking about documents as vectors. That handy Dictionary class we used above for preprocessing also contains functionality to take a list of words, compute word counts for words in our vocabulary, and return the corresponding bag of words vectors.

#Build Bag of Words Vectors out of preprocessed corpus
    print "---Bag of Words Corpus---"

    bow_docs = {}
    for name in names:

        sparse = dct.doc2bow(preprocessed_docs[name])
        bow_docs[name] = sparse
        dense = vec2dense(sparse, num_terms=len(dct))
        print name, ":", dense

Note: vec2dense is just a helper function I wrote to convert sparse vectors to a more familiar dense format for display purposes.
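The helper isn’t shown in this excerpt, but gensim ships a utility (matutils.sparse2full) that does exactly this conversion, so a one-liner along these lines would do the job. The version in the repository may be written differently.

from gensim import matutils

def vec2dense(vec, num_terms):
    #Convert gensim's sparse (id, value) pairs into a plain dense list
    return list(matutils.sparse2full(vec, num_terms))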

Here are the resulting bag of words vectors

---Bag of Words Corpus---
sandwich2.txt : [1.0, 0.0, 1.0, 3.0, 0.0, 0.0]
dog2.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
dog1.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 2.0]
sandwich1.txt : [2.0, 0.0, 1.0, 1.0, 0.0, 0.0]

There we go — our documents expressed as points in 6-dimensional space. For a practical application, working with 6-dimensional data is pretty damn reasonable. However, one can imagine a scenario where our corpus is composed of hundreds of thousands of documents which sample from a much more extensive vocabulary (say, the entire English language). Even after removing stop words and exceedingly common/rare terms, we are left with bag of words vectors with upwards of 50,000 unique word-counts/dimensions to keep track of. This should make the data nerd in you uncomfortable.

The Curse of Dimensionality

There are all kinds of terrible things that happen as the dimensionality of your descriptor vectors rises. One obvious one is that as the dimensionality grows, the time and space costs of working with these vectors grow with it, sometimes dramatically.

Another issue is that as dimensionality rises, the number of samples needed to draw useful conclusions from the data also rises steeply. Put another way: with a fixed number of samples, each additional dimension becomes less useful. Finally, as the dimensionality rises, your points all tend to become roughly equidistant from each other, making it difficult to draw solid conclusions from them. The umbrella term that covers all these adverse effects of high dimensionality is “the curse of dimensionality.”

The Remedy

Fortunately, there’s a whole family of techniques called dimensionality reduction techniques which are entirely geared toward bringing down the number of dimensions in our descriptor vectors to something more reasonable. While the low level details often entail fairly advanced mathematics, the high level ideas behind the techniques are quite intuitive (also, the low-level details are often implemented for you, like in gensim).

Let’s look at those bag of words vectors again.

---Bag of Words Corpus---
sandwich2.txt : [1.0, 0.0, 1.0, 3.0, 0.0, 0.0]
dog2.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
dog1.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 2.0]
sandwich1.txt : [2.0, 0.0, 1.0, 1.0, 0.0, 0.0]

Notice that certain dimensions are highly correlated with one another. For example, in both dog1.txt and dog2.txt, the second, fifth, and sixth values all occur together. In terms of the text, this means that the terms ‘dog,’ ‘runs,’ and ‘barks’ frequently occur together. In that case, maybe rather than thinking of our documents in terms of individual word counts, we should be thinking about them in terms of topics (or groups of words that occur together).

Dimensionality reduction techniques help us do exactly that. They math-magically (it’s a technical term, I promise) express our high-dimensional data in a lower-dimensional space. The main manual step in many of them is that you need to specify how many dimensions the lower-dimensional space should have. Unfortunately, there’s no algorithm I know of that can look at your data and auto-detect the perfect dimensionality to reduce to. In most cases, there isn’t even a clear-cut answer were you to ask a human. Fortunately, this silly toy corpus isn’t most cases — it’s an especially trivial case! Arguably the best kind of case.

We look at the corpus, and we intuitively say that these documents are about either ‘dogs’ or ‘sandwiches,’ and thus, the number of dimensions in our lower dimensional space should be two. The algorithm that we use to do the dimensionality reduction in this case is called “Latent Semantic Indexing,” generally abbreviated to LSI.

Going into the math that makes LSI work is way beyond the scope of this article, so I’ll just summarize by saying it uses a technique from linear algebra called singular value decomposition to reduce the input matrix (your document vectors stacked on top of each other) to one of a lower rank (the number of dimensions you specified). If you didn’t understand any of what just happened in that last sentence, it’s totally fine, because gensim‘s done the heavy lifting for us.
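If you’d like to peek at the idea rather than take it entirely on faith, here is a rough numpy sketch of truncated SVD applied to the bag of words vectors from above, stacked into a matrix. This is not what gensim does internally (its implementation is incremental and far more sophisticated), and the signs and scaling of the result won’t match gensim’s output exactly, but the spirit is the same.

import numpy as np

#Rows are documents (sandwich2, dog2, dog1, sandwich1), columns are vocabulary terms
X = np.array([[1., 0., 1., 3., 0., 0.],
              [0., 1., 0., 0., 1., 1.],
              [0., 1., 0., 0., 1., 2.],
              [2., 0., 1., 1., 0., 0.]])

#Full SVD: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

#Keep only the top 2 singular values/vectors to get 2-D document vectors
k = 2
docs_2d = U[:, :k] * s[:k]   #each row is a document in "topic" space
print(docs_2d)

With that intuition in hand, here’s the gensim version we’ll actually use: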

#Dimensionality reduction using LSI. Go from 6D to 2D.
    print "n---LSI Model---"
    lsi_docs = {}
    num_topics = 2
    lsi_model = gensim.models.LsiModel(bow_docs.values(),
                                       num_topics=num_topics)
    for name in names:
        vec = bow_docs[name]
        sparse = lsi_model[vec]
        dense = vec2dense(sparse, num_topics)
        lsi_docs[name] = sparse
        print name, ':', dense

And here are our simplified 2d vectors…

---LSI Model---
sandwich2.txt : [3.222517, 0.0]
dog2.txt : [0.0, 1.6870012]
dog1.txt : [0.0, 2.4343436]
sandwich1.txt : [2.1483445, 0.0]

Look at that. Two-dimensional vectors which represent all the information we care about without having to consider our data in terms of high-dimensional word counts. The actual numbers in the vectors don’t matter so much as the fact that the first slot is clearly our sandwich topic while the second is our dog topic.

This is probably a good time to reiterate that training on four samples would never, ever, ever work to generate a topic model like LSI for non-trivial data. You’re also not likely to have outcomes as cut and dried as these, where one topic contributes 100% to a given vector’s magnitude while the other contributes nothing.

Document Similarity

I’ve alluded above to the fact that once documents are represented as points in space, we can tell how similar they are by how close they are to each other. Now that we’re at that point in the process, let’s go over what it means for points to be “close” to each other.

It turns out that there are many different ways of considering how close two points in space are to each other. The most natural one to us is called Euclidean distance: if you draw a straight line between two points and measure it, that length is the Euclidean distance. To see an example of Euclidean distance in action, refer to Fig. 1. The big red dot represents the document whose closest matches we’re trying to find. The smaller red dots around it are the most similar documents by the Euclidean distance metric.

There are, however, issues with Euclidean distance. In many text processing applications, we care more about the direction of a vector (the angle from the origin to the point) than its actual location. To demonstrate why we might prefer direction over magnitude, let’s consider another toy bag of words example with the following three documents:

doc1 - ['dog','dog','dog','dog','dog','dog']
doc2 - ['dog','dog']
doc3 - ['cat']

In their bag of words form:

doc1 - [6,0]
doc2 - [2,0]
doc3 - [0,1]

Intuitively, we know that in terms of subject matter, doc1 and doc2 should be more similar to each other than either is to doc3. They are both clearly 100% about dogs, while doc3 is 100% about cats. However, when we take the Euclidean distances between them, we find that the distance between doc2 and doc1 is 4, while the distance between doc2 and doc3 is ~2.24. The sheer length of the first document led us to believe that doc2 was more similar to doc3 than to doc1.

The fact that document length has nothing to do with what the document is actually about is exactly why we want to downplay the importance of vector magnitude and instead focus on direction. There are a few ways we can accomplish this.

First, we can modify the vectors themselves by dividing each number in each vector by that vector’s magnitude. In doing so, all our vectors end up with a magnitude of 1. This process is called unit vectorization because the output vectors are unit vectors.

The unit vectors make it so that the dog documents are now closest to each other:

doc1 - [1,0]
doc2 - [1,0]
doc3 - [0,1]

Another technique is to leave the vectors alone and just take the angle between them. Measuring similarity based on the angle between vectors is known as cosine distance, or cosine similarity. In our example up there, you can see the angle between doc1 and doc2 is 0° because they point in exactly the same direction. On the other hand, the angle between either dog document and doc3 is 90°. 0° is less than 90°, therefore the dog documents are once again more similar to each other than they are to the one about cats.
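A quick sketch with scipy confirms both sets of numbers. Note that scipy’s cosine function returns a cosine distance, i.e. 1 minus the cosine similarity, so 0 means “same direction” and 1 means “90° apart.”

from scipy.spatial.distance import euclidean, cosine

doc1, doc2, doc3 = [6, 0], [2, 0], [0, 1]

print(euclidean(doc2, doc1))   # 4.0   -> doc1 looks "far" by Euclidean distance
print(euclidean(doc2, doc3))   # ~2.24 -> doc3 looks "close" by Euclidean distance
print(cosine(doc2, doc1))      # 0.0   -> same direction as doc1
print(cosine(doc2, doc3))      # 1.0   -> 90 degrees away from doc3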

Both Euclidean and cosine distance are called distance metrics. A distance metric is simply the formula we provide to an algorithm that will dictate how close we consider two vectors to be to each other. With that bit of theory under our belts, let’s go back to the code.

Cosine Distance in Action

We last left off at transforming our bag of words corpus into two-dimensional topic vectors. At this point, we want to unit vectorize those topic vectors because, again, we care more about the angle of a vector than its magnitude. The reason I’m using both unit vectorization and cosine distance in this code is twofold. First, it couldn’t hurt. Second, when we move on to classification, we will be using a model that can only use Euclidean distance (as far as I know).

By unit vectorizing our corpus, we make Euclidean and cosine distance equivalent in terms of ordering. They will not return the exact same distances (one measures straight-line distance while the other measures angle). What they will do is agree that if cosine_distance(A,B) < cosine_distance(B,C), then euclidean_distance(A,B) < euclidean_distance(B,C) for all points A, B, C in our corpus. Intuitively, you can think of this as saying “if two points are closer to each other on the unit circle/sphere/hypersphere, then the angle between them is smaller.”

Here’s where the magic happens:

#Normalize LSI vectors by setting each vector to unit length
    print "n---Unit Vectorization---"

    unit_vecs = {}
    for name in names:

        vec = vec2dense(lsi_docs[name], num_topics)
        norm = sqrt(sum(num ** 2 for num in vec))
        unit_vec = [num / norm for num in vec]
        unit_vecs[name] = unit_vec
        print name, ':', unit_vec

and here’s the output:

---Unit Vectorization---
sandwich2.txt : [1.0, 0.0]
dog2.txt : [0.0, 1.0]
dog1.txt : [0.0, 1.0]
sandwich1.txt : [1.0, 0.0]

Now what we want to do is illustrate cosine distance correctly matching up documents that should be similar. Without further delay, here’s the code to do it.

#Take cosine distances between docs and show best matches
    print "n---Document Similarities---"

    index = gensim.similarities.MatrixSimilarity(lsi_docs.values())
    for i, name in enumerate(names):

        vec = lsi_docs[name]
        sims = index[vec]
        sims = sorted(enumerate(sims), key=lambda item: -item[1])

        #Similarities are a list of tuples of the form (doc #, score)
        #In order to extract the doc # we take first value in the tuple
        #Doc # is stored in tuple as numpy format, must cast to int

        if int(sims[0][0]) != i:
            match = int(sims[0][0])
        else:
            match = int(sims[1][0])

        match = names[match]
        print name, "is most similar to...", match

You might be asking yourself where the cosine distance computation happened in that code. gensim features a class called MatrixSimilarity, which is a type of index: a data structure that stores vectors efficiently so that when it comes time to make a similarity query, finding the vectors closest to our query point is much faster than brute force. The cosine similarities are computed when we query the index.

The actual query is made when we pick a document from our corpus (called vec in the code) and say `sims = index[vec]`. sims is a list of all points and their similarity to the query point (vec). We proceed to sort that list, pull out the closest match, and print it, which yields the following:

---Document Similarities---
sandwich2.txt is most similar to... sandwich1.txt
dog2.txt is most similar to... dog1.txt
dog1.txt is most similar to... dog2.txt
sandwich1.txt is most similar to... sandwich2.txt

Seems reasonable to me. At this point we’re very close to solving the problem we set out to solve.

Classification

Our progress so far answers the question “which document is this document most similar to?” The question we ultimately want to answer is “what is this document about?” There are different ways to answer that question, including keyword extraction, clustering, and all sorts of other techniques. The one that I like best involves supervised learning, where you train the algorithm on samples which have the “correct” answer provided with them.

The specific supervised learning problem we’re addressing here is called classification. You train an algorithm on labelled descriptor vectors, then ask it to label a previously unseen descriptor vector based on conclusions drawn from the training set. The way we are going to accomplish this in our case is to make use of support vector machines, a family of algorithms which define decision boundaries between classes based on labelled training data.

To give a high-level view of what exactly support vector machines (or SVMs for short) do, I’ll refer back to our points in space. For our ‘dog’ vs. ‘sandwich’ classification problem, we provide the algorithm with some training samples. These samples are documents which have gone through our whole process (BoW vector -> topic vector -> unit vector) and carry with them either a ‘dog’ label or a ‘sandwich’ label. As we provide the SVM model with these samples, it looks at these points in space and essentially draws a line between the ‘sandwich’ documents and the ‘dog’ documents. This border between “dog”-land and “sandwich”-land is known as a decision boundary. Whichever side of the line a query point falls on determines what the algorithm labels it.

The Final Step

Time to build our SVM, train it, and test it.

#We add classes to the mix by labelling dog1.txt and sandwich1.txt
    #We use these as our training set, and test on all documents.
    print "n---Classification---"

    dog1 = unit_vecs['dog1.txt']
    sandwich1 = unit_vecs['sandwich1.txt']

    train = [dog1, sandwich1]

    # The label '1' represents the 'dog' category
    # The label '2' represents the 'sandwich' category

    label_to_name = dict([(1, 'dogs'), (2, 'sandwiches')])
    labels = [1, 2]
    classifier = SVC()
    classifier.fit(train, labels)

    for name in names:

        vec = unit_vecs[name]
        label = classifier.predict([vec])[0]
        cls = label_to_name[label]
        print name, 'is a document about', cls

    print '\n'

And the end result of all of this code is…

---Classification---
sandwich2.txt is a document about sandwiches
dog2.txt is a document about dogs
dog1.txt is a document about dogs
sandwich1.txt is a document about sandwiches

Voila, we’ve successfully answered the question we set out to solve! This part of the code is where we take advantage of scikit-learn to do all the SVM-related heavy lifting for us. I constructed a training set out of two of the documents by manually labeling them ‘dog’ and ‘sandwich’; the trained classifier then correctly labeled all four documents.

Cross-Validation

As a result of the simplicity of our example, we have committed a big no-no in “testing” our algorithm: we trained and tested on what essentially ended up being the same data. In reality, your dataset would be much larger and more varied than our toy example, and you would evaluate your model by partitioning the data into a training set and a test set.

All samples in both the training and test sets are labeled. In practice, you build the model on the labeled training set, ignore the labels on the test set, feed the test samples into the model, have the model guess their labels, and finally check whether the guesses were correct. This process of evaluating your supervised learning algorithm on held-out data is commonly called cross-validation (strictly speaking, cross-validation repeats the exercise over several different train/test splits).
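On a realistically sized corpus, scikit-learn can handle the partitioning and scoring for you. Here is a minimal sketch with made-up placeholder vectors and labels standing in for a real labelled corpus (in older scikit-learn versions the import lives in sklearn.cross_validation rather than sklearn.model_selection):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

#Placeholder data: in practice these would be your unit-vectorized LSI vectors
#and their hand-assigned labels (1 = dogs, 2 = sandwiches)
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
           [0.8, 0.2], [0.2, 0.8], [1.0, 0.0], [0.0, 1.0]]
labels = [2, 2, 1, 1, 2, 1, 2, 1]

#Hold out 25% of the labelled data as a test set
train_X, test_X, train_y, test_y = train_test_split(
    vectors, labels, test_size=0.25, random_state=0)

classifier = SVC()
classifier.fit(train_X, train_y)

#The model guesses the held-out labels; score() reports how often it was right
print("Held-out accuracy: %.2f" % classifier.score(test_X, test_y))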

A Quick Reality Check

So you’ve read through the article, understand everything that’s going on, but sense there’s something missing. I’ve pointed out several times that this example has been simplified to the point of triviality for the sake of demonstrating the most important aspects of the algorithm which might otherwise get lost in the details. How do you take this framework we’ve run through and apply it to something practical?

As usual, the devil’s in the details. I can’t spell the entire process out for you (otherwise this article would be the size of a textbook), but I can list the questions you’ll need to answer in order to make any such machine learning application work. Hopefully research, experimentation, and intuition will take you the rest of the way.

First questions you should ask yourself

  • What is the problem I’m trying to solve?
  • What do I want to learn from my data? How will my findings be actionable?
  • What tools exist that can help me solve this problem?

I can’t stress enough how important these first few questions are. The answers to these questions should drive every design decision you make. It’s all too easy to get caught up in all of the awesome algorithms and techniques out there in this field, and get completely distracted from your end goal. That being said, there is obviously a place for playing around, exploring, and experimenting. Just be sure you always keep the problem you’re trying to solve in mind. Anyhow… back to the list!

Preprocessing

  • Where do I find a corpus?
  • How do I extract text from web pages in a scalable/generalizable way (if using a corpus from the web)?
  • How do I split my text into individual words and extract the roots of those words?
  • How do I build my vocabulary?
  • Which words do I filter out?

Vector Space Model

  • What sort of normalization does the problem call for?
  • Which distance metric makes sense for the problem at hand?

Dimensionality Reduction

  • How do I pick how many dimensions I ultimately want to be working with?

Machine learning

  • Am I solving a supervised or an unsupervised learning problem?
  • Which model/algorithm makes sense for the problem at hand?

Testing

  • How do I gauge how well my algorithm is doing?
  • Is it enough just to look at accuracy?
  • Should I consider precision/recall?
  • Are my classes of equal sizes, or does one dominate the other? How do I remedy this?
  • Are my training/test datasets actually representative of what my algorithm will encounter in everyday use?

If you take anything out of this article, it should be that useful applications of machine learning are all about the decisions you make. There are countless algorithms and techniques to choose from, but they are only as useful as your application of them. Identify the problem you’re solving, experiment with solutions, understand where there might be shortcomings in those solutions and remedy them when possible.



A note from commenter M.B.: the default kernel function for the sklearn SVC is rbf, which is unsuited for text classification and probably won’t give very good results on a real-world problem. Try using kernel='linear' as a parameter for SVC() for better results.
