Spam classifier example, error analysis, skewed data, precision, recall and large data sets.
1. Building a Spam Classifier
I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages and is part of my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. The notes are meant for my personal review, but I have open-sourced the repository as a lot of people found it useful.
1a. Prioritizing
- Let’s say you want to build a spam classifier
- How do you implement supervised learning?
- We can create the following
- x = features of email
- Choose 100 words indicative of spam or not spam
- In practice, look through the training set and choose the n most frequently occurring words (10,000 to 50,000)
- y = spam (1) or non-spam (0)
- Example: x_j = 1 if the j-th chosen word appears in the email, and 0 otherwise (see the sketch after this list)
- How should you spend your time to achieve a low error?
- Collect lots of data
- Develop sophisticated features based on email routing information (from email header)
- Develop sophisticated features for message body
- Should ‘discount’ and ‘discounts’ be treated as the same word?
- How about ‘deal’ and ‘Dealer’?
- Punctuation?
- Develop a sophisticated algorithm to detect deliberate misspellings, such as
- med1cine
- w4tches
- m0rtgage
- Don’t base your decisions on gut feeling!
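As a rough sketch of the feature construction above (the vocabulary and helper names here are made up for illustration, not from the course), a binary word-presence vector could be built like this:

```python
import re

# Hypothetical tiny vocabulary; in practice you would take the
# 10,000-50,000 most frequent words in the training set.
vocabulary = ["buy", "discount", "deal", "andrew", "now"]

def email_to_features(email_text, vocab):
    """Return x where x[j] = 1 if vocab[j] appears in the email, else 0."""
    words = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    return [1 if word in words else 0 for word in vocab]

x = email_to_features("Huge DISCOUNT - buy now!!!", vocabulary)
print(x)  # [1, 1, 0, 0, 1]
```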
1b. Error Analysis
- Recommended Approach
- Start with a simple algorithm that you can implement quickly, and test it on your cross-validation data
- Plot learning curves to decide if more data, more features, etc. are likely to help
- Error analysis:
- Manually examine the examples (in cross validation set) that your algorithm made errors on
- See if you spot any systematic trend in what type of examples it is making errors on
- Don’t base anything off your gut feeling!
- Error Analysis Example
- m_cv = 500
- number of cross-validation examples
- Algorithm misclassified 100 emails
- Manually examine the 100 errors and categorize them based on
- What type of email they are
- Pharma: 12
- Replica: 4
- Phishing: 53
- Other: 31
- What cues (features) you think would have helped the algorithm classify them correctly
- Deliberate misspellings: 5
- m0rtgage
- med1cine
- This indicates that deliberate misspellings account for only a small number of the errors, so it may not be worth spending much time on features for them
- Unusual email routing: 16
- Unusual punctuation: 32
- Since this accounts for a large share of the errors, it might be worthwhile to spend time developing sophisticated features for unusual punctuation
- This is why we should do a quick and dirty implementation first: it surfaces the errors and shows us where to focus our effort
- Importance of numerical evaluation
- Should discount, discounts, discounted, discounting etc. be treated as the same word?
- You can use stemming software such as the Porter Stemmer
- This would allow you to treat all those variations as the same word
- Stemming software may mistakenly treat universe and university as the same word
- Error analysis may not be helpful for deciding if this is likely to improve performance
- The only way is to try it
- We need a numerical evaluation (cross-validation error) of the algorithm’s performance with and without stemming (see the sketch at the end of this section)
- Without stemming: 5%
- With stemming: 3%
- This implies that it may be useful to implement stemming
- Another comparison: distinguishing between upper and lower case: 3.2% error
- Why is the recommended approach to perform error analysis using the cross validation data instead of the test data?
- If we develop new features by examining the test set, then we may end up choosing features that work well specifically for the test set, so J_test(θ) is no longer a good estimate of how well we generalize to new examples
- Do error analysis on the cross-validation set, not on the test set!
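As a rough illustration of the “try it and compare the numbers” idea (this assumes scikit-learn and NLTK are available, and `emails`/`labels` are hypothetical placeholders for your raw email texts and 0/1 spam labels; it is a sketch, not the course’s implementation):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize with the default analyzer, then stem every token,
    # so "discount", "discounts", "discounted" map to the same feature.
    return [stemmer.stem(token) for token in base_analyzer(doc)]

def cv_error(emails, labels, analyzer="word"):
    # Bag-of-words features + logistic regression, scored by 5-fold CV
    model = make_pipeline(CountVectorizer(analyzer=analyzer),
                          LogisticRegression(max_iter=1000))
    return 1 - cross_val_score(model, emails, labels, cv=5).mean()

# print("Without stemming:", cv_error(emails, labels))
# print("With stemming:   ", cv_error(emails, labels, analyzer=stemmed_analyzer))
```

Whichever variant gives the lower cross-validation error is the one to keep.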
2. Handling Skewed Data
2a. Error Metrics for Skewed Classes
- Consider a problem where you want to find out if someone has cancer
- y = 1, cancer
- y = 0, no cancer
- You train a logistic regression model h_θ(x) and find that you have 1% error on the test set
- 99% correct diagnosis
- But only 0.50% of patients have cancer
- This is a problem of skewed classes
- Since only 0.50% of patients have cancer, the non-learning code below would have just 0.5% error, lower than your logistic regression model, even though it ignores the features and always predicts y = 0
```matlab
function y = predictCancer(x)
  y = 0;  % ignore x; always predict "no cancer"
return
```
- Let’s say you have
- 99.2% accuracy
- 0.8% error
- If you improve your algorithm to 99.5% accuracy
- 0.5% error
- Simply predicting y = 0 (no cancer) for everyone could yield this lower error, so the apparent improvement does not necessarily mean a better classifier
- Precision/Recall
- By calculating precision and recall, we will have a better sense of how our algorithm is doing (a small computation sketch follows this list)
- Precision = true positives / (true positives + false positives): of the patients we predicted to have cancer, what fraction actually does?
- Recall = true positives / (true positives + false negatives): of the patients who actually have cancer, what fraction did we detect?
- If the classifier always predicts y = 0
- Recall = 0, since there are no true positives
- This shows that the classifier is not good
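A small sketch of computing these quantities by hand (the labels below are made up to mimic a skewed data set):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (y = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# "Always predict y = 0" on a skewed set: accuracy is 99.5%,
# yet precision and recall are both 0, exposing the useless classifier.
y_true = [0] * 199 + [1]
y_pred = [0] * 200
print(precision_recall(y_true, y_pred))  # (0.0, 0.0)
```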
2b. Trading off Precision and Recall
- If we want to avoid false positives
- We want to be more confident before predicting cancer (y = 1)
- We can increase the threshold of h_θ(x) from 0.5 to 0.7 or even 0.9
- Result
- False positives: decrease
- True positives: decrease
- Recall: decrease
- Precision: increase
- If we want to avoid false negatives
- We want to avoid missing too many cases of cancer
- We can decrease the threshold of h_θ(x) from 0.5 to 0.3
- Result
- False negatives: decrease
- True positives: increase
- Recall: increase
- Precision: decrease
- The precision-recall curve can take many different shapes, depending on the classifier
- How do we compare precision/recall numbers? Which pair is best?
- We can use an average
- At the extremes, neither classifier is good
- If we predict y = 1 all the time, the classifier is useless even though it has a high recall
- But a simple average is not a good measure, because an extreme classifier (for example, one that always predicts y = 1, giving a recall of 1 but very low precision) can end up with a higher average than a more balanced classifier that is actually better
- We should use the F score (F1 score)
- F1 Score = (2 * P * R) / (P + R)
- Remember to measure P and R on the cross-validation set and choose the threshold that maximizes the F score (see the sketch after this list)
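A minimal sketch of that threshold search (the probabilities and labels are made up; the model is assumed to output an estimate of P(y = 1 | x) for each cross-validation example):

```python
def f1(precision, recall):
    # F1 score = 2PR / (P + R), defined as 0 when both are 0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def best_threshold(probs, y_true, thresholds):
    """Return (threshold, F1) with the highest F1 on the cross-validation set."""
    best = None
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
        fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
        fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        score = f1(p, r)
        if best is None or score > best[1]:
            best = (t, score)
    return best

# Hypothetical cross-validation probabilities and labels
probs  = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
y_true = [0,   0,   1,    1,   1,   0]
print(best_threshold(probs, y_true, [0.3, 0.5, 0.7, 0.9]))  # (0.3, 0.75)
```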
3. Using Large Data Sets
- Under certain conditions, getting a lot of data and training a learning algorithm would result in very good performance
- Designing a high accuracy learning system
- Classify between confusable words
- to, two, too
- then, than
- Example sentence: “For breakfast I ate ___ eggs” (the correct word here is “two”)
- Algorithms
- Perceptron (logistic regression)
- Winnow (less popular)
- Memory-based (less popular)
- Naive Bayes (popular)
- Algorithms give roughly similar performance
- With a larger training set, the performance of all the algorithms increases (a toy sketch illustrating this trend follows this section)
- Often, it is not who has the best algorithm, but who has the most data
- Large data rationale
- Assume the features x contain sufficient information to predict y accurately (a useful test: given the input x, can a human expert confidently predict y?)
- In sum
- Low bias: use a learning algorithm with many parameters
- Low variance: use a very large training set
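A toy sketch of this trend (assuming scikit-learn; the data here is synthetic, not the confusable-words corpus from the study, so only the general pattern matters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; accuracy should generally rise as m grows
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {"logistic regression": LogisticRegression(max_iter=1000),
          "naive Bayes": GaussianNB()}

for m in [100, 1000, 10000]:
    for name, model in models.items():
        model.fit(X_train[:m], y_train[:m])
        print(f"m = {m:>5}  {name:<20} accuracy = {model.score(X_test, y_test):.3f}")
```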