How Do Regression and Classification Differ in Supervised Machine Learning?

Supervised Machine Learning is a fundamental aspect of artificial intelligence, involving the training of models on labeled datasets to generate predictions. The two primary categories of supervised learning tasks are Regression and Classification. Although they share the same foundational workflow, they have different objectives and rely on different algorithms.


Regression vs Classification

Regression in Supervised Learning

Regression is a technique for predictive modeling that aims to estimate continuous outcomes. It analyzes the relationship between independent variables (features) and a dependent variable (target). The main features of regression include:

  • The output is a continuous variable (such as temperature, sales, or stock prices).
  • It predicts numerical values based on the provided input features.

The following are the commonly used regression algorithms (a short code sketch illustrating a few of them follows the list):

  1. Linear Regression: Finds the best-fit line for the data.

    Equation:

    \[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]

    where,

    \(y\implies\) Dependent variable (e.g., house price)

    \(x_i\implies\) Independent variables (e.g., size, location)

    \(\beta_i\implies\) Coefficients/weights for the features

    \(\epsilon\implies\) Error term (residual)

    Example: Predicting house prices based on factors like size and location.


  2. Polynomial Regression: Models non-linear relationships by using polynomial terms, enhancing Linear Regression by adding higher-degree terms.

    Equation:

    \[y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \dots + \beta_nx^n + \epsilon\]

    Example: Predicting sales growth over time.


  3. Support Vector Regression (SVR): Extends support vector machines for regression tasks.

    SVR minimizes

    \[\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\max\left(0,\; |y_i - (\langle w, x_i \rangle + b)| - \epsilon\right)\]

    where,

    \(w\implies\) Weight vector

    \(x_i\implies\) Input features

    \(y_i\implies\) Target Values

    \(b\implies\) Bias term

    \(\epsilon\implies\) Width of the epsilon-tube defining the tolerance for error

    \(C\implies\) Regularization parameter controlling margin flexibility

    Example: Predicting stock prices.


  4. Decision Trees/Random Forest Regressor: Splits data into decision nodes based on feature thresholds to predict a continuous outcome, choosing the split that yields the largest reduction in variance of the target: \[\text{Variance Reduction} = \text{Variance(Parent Node)} - \sum_{i}\left(\frac{\text{Samples in Child Node i}}{\text{Samples in Parent Node}} \cdot \text{Variance(Child Node i)}\right)\]

    Random Forest averages multiple decision trees to make a final prediction.

    Example: Predicting rainfall levels.
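
To make this concrete, the sketch below fits three of the regressors above on a small synthetic dataset with scikit-learn and predicts a value for a new input; the data and hyperparameters are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one feature with a noisy linear trend (invented for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, size=100)

models = {
    "Linear Regression": LinearRegression(),
    "SVR (RBF kernel)": SVR(C=1.0, epsilon=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)                     # learn the feature-target relationship
    pred = model.predict([[7.5]])       # predict a continuous value for a new input
    print(f"{name}: predicted y for x = 7.5 -> {pred[0]:.2f}")
```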


Evaluation Metrics for Regression
  1. Mean Absolute Error (MAE): This metric assesses the average size of errors, disregarding their direction. \[MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y_i}|\]

  2. Mean Squared Error (MSE): This metric emphasizes larger errors by squaring the differences, thus imposing a greater penalty on them compared to smaller errors. \[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2\]

  3. Root Mean Squared Error (RMSE): This metric offers a value that is expressed in the same units as the target variable, facilitating interpretation. \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2}\]

  4. R-Squared (Coefficient of Determination): This statistic reflects the extent to which the model accounts for the variability in the target variable. \[R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]

  5. Adjusted R-Squared: Adjusted R-squared penalizes the addition of non-significant predictors. If new predictors improve the model, Adjusted R-squared increases; otherwise, it decreases. \[R_{adj}^{2} = 1 - (\frac{(1-R^2)(n-1)}{n-k-1})\]

    where,

    \(R^2\implies\) Coefficient of determination

    \(n\implies\) Total no. of observations

    \(k\implies\) Number of independent variables (predictors) in the model
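
To see how these formulas behave in code, the sketch below computes MAE, MSE, RMSE, R², and adjusted R² with NumPy for a handful of hypothetical predictions; the numbers (and the assumed k = 2 predictors) are made up for illustration.

```python
import numpy as np

# Hypothetical actual vs. predicted target values (invented for illustration)
y_true = np.array([250_000, 350_000, 500_000, 420_000, 310_000])
y_pred = np.array([240_000, 365_000, 480_000, 430_000, 300_000])
n, k = len(y_true), 2                   # k = assumed number of predictors

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))        # Mean Absolute Error
mse = np.mean(residuals ** 2)           # Mean Squared Error
rmse = np.sqrt(mse)                     # Root Mean Squared Error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # Adjusted R-squared

print(f"MAE={mae:.0f}  MSE={mse:.0f}  RMSE={rmse:.0f}  R2={r2:.3f}  Adj.R2={adj_r2:.3f}")
```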


Example of Regression

Suppose we want to predict housing prices based on size, location, and amenities. A dataset could be structured as follows:

House Size (sq ft) | Location Score | Amenities Score | Price ($)
-------------------|----------------|-----------------|----------
2000               | 8              | 7               | 350,000
1500               | 6              | 6               | 250,000
3000               | 9              | 8               | 500,000

Using Linear Regression, the model outputs a numerical price prediction for new input data.
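
A minimal scikit-learn version of this example might look like the sketch below; the three rows come from the table above, while the new house being priced is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: house size (sq ft), location score, amenities score (rows from the table above)
X = np.array([[2000, 8, 7],
              [1500, 6, 6],
              [3000, 9, 8]])
y = np.array([350_000, 250_000, 500_000])   # Price ($)

model = LinearRegression().fit(X, y)

# Predict the price of a new, hypothetical 2500 sq ft house
new_house = np.array([[2500, 8, 7]])
print(f"Predicted price: ${model.predict(new_house)[0]:,.0f}")
```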

Regression Visualization:

The visualization is a scatter plot of house sizes against prices, with a red regression line indicating how the price changes with size.
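
A figure like the one described above can be drawn with a few lines of matplotlib; the size and price values below are made-up sample data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up sample data: house size vs. price
sizes = np.array([1200, 1500, 1800, 2000, 2400, 3000]).reshape(-1, 1)
prices = np.array([210_000, 250_000, 290_000, 350_000, 400_000, 500_000])

trend = LinearRegression().fit(sizes, prices)

plt.scatter(sizes, prices, label="Houses")
plt.plot(sizes, trend.predict(sizes), color="red", label="Regression line")
plt.xlabel("House Size (sq ft)")
plt.ylabel("Price ($)")
plt.legend()
plt.show()
```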

Classification in Supervised Learning

Classification is a technique in predictive modeling that categorizes input data into established classes or categories. The main features of Classification within Supervised Learning include:

  • The output is a categorical variable (for instance, spam/not spam or disease/healthy).
  • It predicts discrete labels.

The following are the commonly used classification algorithms (a short code sketch follows the list):

  1. Logistic Regression: This method estimates probabilities and assigns binary classes through the sigmoid function.

    Equation:

    \[P(y = 1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n)}}\]
    • \(P(y = 1|x) \implies \) Probability of class 1 (e.g., spam)
    • Decision threshold (e.g., 0.5) determines class

    Example: Predicting whether an email is spam or not.


  2. Support Vector Machines (SVM): This algorithm identifies the hyperplane that effectively separates different classes of data.

    Equation:

    Maximizes margins between classes:

    \[\text{Maximize: } \frac{2}{\|w\|}\]

    Subject to:

    \[y_i(\langle w, x_i \rangle + b) \geq 1\]

    where,

    \(w\implies\) Weight vector

    \(b\implies\) Bias term

    \(y_i\implies\) Class label (+1 or -1)

    Example: Classifying different types of flowers.


  3. Decision Trees/Random Forest Classifier: These algorithms create tree-like structures for classification, splitting data into nodes so as to maximize information gain (computed from Entropy or the Gini Index): \[\text{Information Gain} = \text{Entropy(Parent Node)} - \sum_{i}\left(\frac{\text{Samples in Child Node i}}{\text{Samples in Parent Node}} \cdot \text{Entropy(Child Node i)}\right)\]

    The Random Forest approach enhances classification accuracy by averaging the results of multiple decision trees.

    Example: Predicting customer churn.


  4. K-Nearest Neighbors (KNN): This method classifies data points based on the majority vote from the k nearest neighbors.

    Distance metrics (e.g., Euclidean distance):

    \[d(x,x^{'}) = \sqrt{\sum_{i=1}^{n}(x_i - x_i^{'})^2}\]

    Example: Classifying handwritten digits.
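
As a quick illustration, the sketch below trains three of the classifiers above on scikit-learn's bundled Iris dataset and reports their test accuracy; the dataset and parameter choices are for demonstration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Iris: three flower classes described by four numeric features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(kernel="rbf", C=1.0),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)             # learn the class boundaries
    accuracy = clf.score(X_test, y_test)  # fraction of correctly labeled test samples
    print(f"{name}: test accuracy = {accuracy:.2f}")
```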


Evaluation Metrics for Classification
  1. Accuracy: This metric assesses the overall correctness of the classification model. \[Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Predictions}}\]

  2. Precision: This indicates the proportion of predicted positive instances that are truly positive. \[Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]

  3. Recall (Sensitivity or True Positive Rate): This evaluates the model's capability to detect all actual positive instances. \[Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]

  4. F1-Score: This represents the harmonic mean of precision and recall, providing a balance between the two metrics. \[\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

  5. Receiver Operating Characteristic (ROC) Curve and Area Under Curve (AUC): The ROC curve illustrates the relationship between the True Positive Rate and the False Positive Rate, while the AUC measures the model's overall effectiveness in classifying different categories.

  6. Log Loss (Cross-Entropy Loss): This metric penalizes the discrepancy between actual labels and predicted probabilities, punishing confident but wrong predictions most heavily. \[\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log(\hat{y_i}) + (1 - y_i)\log(1-\hat{y_i})\right)\]

  7. Confusion Matrix: This is a comprehensive table that summarizes true positives, true negatives, false positives, and false negatives, facilitating the calculation of various metrics.

  8. Specificity (True Negative Rate): This measures the model's ability to accurately identify negative instances. \[Specificity = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}\]
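
The sketch below computes several of these metrics with scikit-learn for a small set of hypothetical true and predicted labels; the label and probability vectors are invented for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, log_loss, confusion_matrix)

# Hypothetical labels and model outputs (1 = spam, 0 = not spam), invented for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_proba = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted P(spam)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Log loss :", log_loss(y_true, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```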

Example of Classification

To classify emails as either spam or not spam, we could utilize a dataset that appears as follows:

Email Length (words) | Links Count | Contains "Free"? | Spam/Not Spam
---------------------|-------------|------------------|--------------
120                  | 2           | Yes              | Spam
50                   | 0           | No               | Not Spam
200                  | 5           | Yes              | Spam

Using Logistic Regression, the model outputs the probability of an email being spam, assigning it to a class based on a threshold (e.g., 0.5).
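
A toy version of this pipeline, using only the three rows from the table above (with the Contains "Free" column encoded as 1/0), might look like the following sketch; the new email is hypothetical and the model is far too small to be a real spam filter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: email length (words), links count, contains "Free" (1 = Yes, 0 = No)
X = np.array([[120, 2, 1],
              [ 50, 0, 0],
              [200, 5, 1]])
y = np.array([1, 0, 1])                  # 1 = Spam, 0 = Not Spam

clf = LogisticRegression().fit(X, y)

# Probability of spam for a new, hypothetical email, thresholded at 0.5
new_email = np.array([[150, 3, 1]])
p_spam = clf.predict_proba(new_email)[0, 1]
print(f"P(spam) = {p_spam:.2f} -> {'Spam' if p_spam >= 0.5 else 'Not Spam'}")
```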

Classification Visualization:

The visualization is a 2D scatter plot of the two categories, "Spam" and "Not Spam," with a decision boundary (black line) separating them and shaded regions marking the classification zones.
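
A decision-boundary plot of this kind can be generated with matplotlib by evaluating a fitted classifier over a grid of points; the two synthetic clusters below stand in for "Not Spam" and "Spam" emails.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic 2-D features (email length, links count) for two made-up clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([60, 1], [20, 1], (50, 2)),      # not spam
               rng.normal([160, 4], [30, 1.5], (50, 2))])  # spam
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# Evaluate the classifier over a grid to shade the two classification zones
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.2)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel("Email Length (words)")
plt.ylabel("Links Count")
plt.title("Logistic regression decision boundary")
plt.show()
```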

Conclusion

Regression and Classification are fundamental components of supervised learning, each tailored for distinct types of problems. Regression focuses on predicting continuous outcomes, whereas Classification is concerned with categorizing data points into specific labels. Grasping the distinctions between these methods is crucial for choosing the right strategy for your machine learning projects. By gaining proficiency in these techniques and their related algorithms, you can address a diverse range of practical challenges, such as predicting sales trends or identifying fraudulent activities.





