<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Silicon Corner]]></title><description><![CDATA[The Silicon Corner]]></description><link>https://blog.pol.company</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 02:50:11 GMT</lastBuildDate><atom:link href="https://blog.pol.company/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Decision trees uncovered]]></title><description><![CDATA[If you are a computer scientist I am sure you agree with me when I say that trees are everywhere. And I mean everywhere! It is extremely common to use trees as a basic data structure to improve and define new algorithms in all sorts of domains. Machi...]]></description><link>https://blog.pol.company/decision-trees-uncovered</link><guid isPermaLink="true">https://blog.pol.company/decision-trees-uncovered</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Decision trees]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Thu, 09 Oct 2025 06:30:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/GEyXGTY2e9w/upload/87408686a45401e0fed31f15209fa800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you are a computer scientist, I am sure you will agree with me when I say that trees are everywhere. And I mean everywhere! It is extremely common to use trees as a basic data structure to improve and define new algorithms in all sorts of domains. Machine learning is no different: decision trees are among the most widely used nonparametric methods, and they can be used for both classification and regression.</p>
<p>Decision trees are hierarchical models that work by splitting the input space into smaller regions. A tree is composed of <strong>internal decision nodes</strong> and <strong>terminal leaves</strong>. Each internal decision node implements a <strong>test function</strong>: given a set of input variables, it returns a discrete outcome that selects which child node to visit next (the most common approach is the <strong>univariate tree</strong>, which tests only one variable at each node). Terminal leaves correspond to predictions: in classification the output is a class label, and in regression it is a specific numerical value. A great advantage of decision trees is that they can work with categorical values directly.</p>
<p>For example, in the following tree, we might want to classify patients that require treatment versus patients that do not. Each node makes a decision based on a simple rule, and each terminal node holds the final prediction. The gini index is a measure of how impure a node is; an impurity of 0.0 means the node has reached <strong>maximum purity</strong> and there is nothing left to split.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5safv416ocbxb77a5d68.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5safv416ocbxb77a5d68.png" alt="Alt Text" /></a></p>
<p>One of the perks of decision trees, compared to other machine learning algorithms, is that they are extremely easy to understand and have a high degree of interpretability. Just by reading the tree, you can make decisions yourself. On the other hand, decision trees are very sensitive to small variations in the training data, so it is usually recommended to apply an ensemble method such as the ones described below.</p>
<p><em>Note: In fact, a decision tree can be transformed into a series of rules that can then be used in a rule-based language such as Prolog.</em></p>
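<p><em>As a quick illustration of that note (a hedged sketch using scikit-learn, which this post does not otherwise use), a fitted tree can be exported as explicit if/else rules:</em></p>

```python
# Hypothetical example: scikit-learn can print a fitted decision tree
# as a set of human-readable rules, one per root-to-leaf path.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each nested if/else in the output corresponds to one internal test node.
print(export_text(tree, feature_names=load_iris().feature_names))
```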
<h1 id="heading-dimensionality-reduction">Dimensionality reduction</h1>
<p>The job of classification and regression trees (<strong>CART</strong>) is to predict an output based on the variables the input might have; splits <strong>near the root</strong> tend to involve the more important features, while splits <strong>near the leaves</strong> tend to correspond to less important ones. That is why decision trees are commonly used as a dimensionality reduction technique: by running the CART algorithm you get the importance of each feature for free!</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fig3jqq7bb9219t09qk1y.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fig3jqq7bb9219t09qk1y.png" alt="Alt Text" /></a></p>
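<p><em>A minimal sketch of this idea, assuming scikit-learn (my choice of library, not the post's): after fitting, each feature's importance is available directly.</em></p>

```python
# Illustrative only: per-feature importances from a fitted CART model.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Importances sum to 1; a higher value means the feature drove more
# impactful splits, which is what makes trees useful for feature selection.
for name, score in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {score:.3f}")
```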
<h1 id="heading-error-measures">Error measures</h1>
<p>As with any machine learning model, we must choose a suitable error function. In practice, any of the following error functions tends to perform well:</p>
<ul>
<li><p><strong>MSE</strong> (regression): one of the most common error functions in machine learning.</p>
</li>
<li><p><strong>Entropy</strong> (classification): entropy works by measuring the number of bits needed to encode a class code, based on its probability of occurrence.</p>
</li>
<li><p><strong>Gini index</strong> (classification): Slightly faster impurity measure than entropy, it tends to isolate the most frequent class in its own branch of the tree, while entropy produces more balanced branches.</p>
</li>
</ul>
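<p><em>The two classification measures above can be written directly from their definitions; this is an illustrative sketch, not code from the post.</em></p>

```python
import math

def entropy(p):
    """Bits needed on average to encode a class drawn from distribution p."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity of a class-probability distribution p."""
    return 1.0 - sum(pi * pi for pi in p)

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
```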
<p><em>Note: Error functions on classification trees are also called impurity measures.</em></p>
<h1 id="heading-boosting-trees">Boosting trees</h1>
<p>Decision trees are very good estimators, but sometimes they can perform poorly. Fortunately, there are many ensemble methods to boost their performance.</p>
<ul>
<li><p><strong>Random forests</strong>: a bagging/pasting method that trains multiple decision trees, each on a random subset of the training data (and typically a random subset of the features). Each tree makes a prediction, and the predictions are aggregated into a single final one.</p>
</li>
<li><p><strong>AdaBoost</strong>: a first base classifier is trained and used to make predictions. Then a second classifier is trained with more weight given to the instances the first one got wrong. This repeats until the desired number of classifiers has been trained.</p>
</li>
<li><p><strong>Stacking</strong>: instead of a simple voting mechanism between different classifiers, a blending classifier is trained on the predictions of the other classifiers rather than on the data directly.</p>
</li>
</ul>
<p><em>The following image represents a stacking ensemble:</em></p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw9zysh39d8gckg2bfbif.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw9zysh39d8gckg2bfbif.png" alt="Alt Text" /></a></p>
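<p><em>All three ensembles are available off the shelf; the sketch below assumes scikit-learn and a toy dataset, purely for illustration.</em></p>

```python
# Hypothetical example of the three ensemble methods described above.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random forest: many trees, each on a random subset, then aggregated.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# AdaBoost: each new tree focuses on the examples the previous ones missed.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
# Stacking: a blender (here logistic regression) is trained on the
# predictions of the base classifiers instead of on the data directly.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("forest", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)

for model in (forest, boost, stack):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```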
]]></content:encoded></item><item><title><![CDATA[Optimal neural networks]]></title><description><![CDATA[Like everything in this world, finding the right path to a high-end goal can become tedious if you don't have the right tools. Each objective and environment has different requirements and must be treated differently. An example of this might be trav...]]></description><link>https://blog.pol.company/optimal-neural-networks</link><guid isPermaLink="true">https://blog.pol.company/optimal-neural-networks</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[pytorch]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:56:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759960481621/ff244fce-ad03-47b1-a5ae-5e84888567b8.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Like everything in this world, finding the right path to a high-end goal can become tedious if you don't have the right tools. Each objective and environment has different requirements and must be treated differently. Take traveling as an example: using a car to go to the grocery shop might be the fastest and most comfortable way to get there. On the other hand, if we want to travel abroad, it might be a better idea to get on an airplane (unless you are one of those who love driving for hours).</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7emue40akmlpyeqe2c52.jpg"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7emue40akmlpyeqe2c52.jpg" alt="Alt Text" /></a></p>
<p>But we are not here to talk about the different types of transportation; we are here to talk about how to improve the training of your neural networks by choosing the best optimizer based on the memory it uses, its complexity, and its speed.</p>
<h1 id="heading-different-optimizers">Different optimizers</h1>
<p>Training a deep neural network can be very slow, but there are multiple ways to improve the speed of convergence. By improving the learning rules of the optimizer we can make the network learn faster (at some computational and memory cost).</p>
<h3 id="heading-simple-optimizer-sgd">Simple optimizer SGD</h3>
<p>The simplest optimizer out there is Stochastic Gradient Descent (SGD). It works by calculating the error and its gradient through backpropagation, then updating the corresponding weights, scaled by the learning rate.</p>
<p><strong>Speed</strong>: because it is the most basic implementation it is the fastest</p>
<p><strong>Memory</strong>: it is also the one that uses the least memory, since it only needs to store the gradient of each weight for backpropagation.</p>
<p><strong>Performance</strong>: it has a very slow convergence but generalizes better than most methods.</p>
<p><strong>Usage</strong>: this optimizer can be used in PyTorch by providing the model's parameters (weights) and the learning rate; the rest of the parameters are optional.  </p>
<pre><code class="lang-python">torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum=<span class="hljs-number">0</span>, dampening=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, nesterov=<span class="hljs-literal">False</span>)
</code></pre>
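<p><em>Under the hood, the update rule is just "weight minus learning rate times gradient". A pure-Python sketch on a toy one-dimensional problem (illustrative, not PyTorch internals):</em></p>

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent steps.
lr = 0.1
w = 0.0
for _ in range(100):
    grad = 2 * (w - 3)  # derivative of (w - 3)^2
    w -= lr * grad      # the learning-rate-scaled update
print(w)  # converges toward the minimum at w = 3
```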
<h3 id="heading-momentum-optimization">Momentum optimization</h3>
<p>Momentum optimization is a variant of SGD that incorporates the previous update into the current one, as if the weights had <strong>momentum</strong>. This provides a smoothing effect on training. The value of the momentum is usually between 0.5 and 1.0.</p>
<p><strong>Speed</strong>: very fast since it only has an additional multiplication.</p>
<p><strong>Memory</strong>: this optimization requires extra memory, since it needs to store the previous update for each weight.</p>
<p><strong>Performance</strong>: very useful since it provides an averaging and smooth effect in the trajectory during convergence. It promotes a faster convergence and helps roll past local optima. It almost always goes faster than SGD.</p>
<p><strong>Usage</strong>: to activate the momentum, you need to specify its value through the momentum parameter.  </p>
<pre><code class="lang-python">torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum&gt;<span class="hljs-number">0</span>, dampening=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, nesterov=<span class="hljs-literal">False</span>)
</code></pre>
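<p><em>The extra state is a single velocity per weight. A pure-Python sketch of the rule (illustrative values):</em></p>

```python
# Momentum keeps a velocity v that blends in the previous update.
lr, momentum = 0.1, 0.9
w, v = 0.0, 0.0            # v is the extra memory this method needs
for _ in range(300):
    grad = 2 * (w - 3)     # gradient of (w - 3)^2
    v = momentum * v - lr * grad
    w += v                 # the smoothed, "rolling" update
print(w)
```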
<h3 id="heading-nesterov-accelerated-gradient">Nesterov accelerated gradient</h3>
<p>A variant of momentum optimization was proposed in which, instead of measuring the gradient at the local position, we measure it a step ahead in the direction of the momentum.</p>
<p><strong>Speed</strong>: an additional sum must be done to apply the momentum to the parameter.</p>
<p><strong>Memory</strong>: no extra memory is used in this case.</p>
<p><strong>Performance</strong>: it usually works better than simple momentum since the momentum vector points towards the optimum. In general, it converges faster than the original momentum since we are promoting the movement towards a specific direction.</p>
<p><strong>Usage</strong>: to apply the use of Nesterov we must set the Nesterov flag to true and add some momentum to the optimizer.  </p>
<pre><code class="lang-python">torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum&gt;<span class="hljs-number">0</span>, dampening=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, nesterov=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-adagrad">Adagrad</h3>
<p>Adagrad stands for <strong>adaptive gradient</strong>: it works by adapting the learning rate depending on where we are located. When we are near a local minimum, Adagrad adjusts the learning rate to move faster in that direction. A benefit of using this optimizer is that we don't need to concern ourselves too much with tuning the learning rate manually. The learning rate adapts based on all the gradients seen so far during training.</p>
<p><strong>Speed</strong>: it is slower, since every update must scale each gradient by the accumulated statistics.</p>
<p><strong>Memory</strong>: it requires additional memory to store the accumulated squared gradient of each parameter.</p>
<p><strong>Performance</strong>: in general, it works well for simple quadratic problems, but it often stops too early when training neural networks, since the learning rate gets scaled down too much, thus never getting to the minimum. It is not recommended for neural networks, but it may be efficient for simpler problems.</p>
<p><strong>Usage</strong>: Adagrad can be used by providing the default parameters.  </p>
<pre><code class="lang-python">torch.optim.Adagrad(params, lr=<span class="hljs-number">0.01</span>, lr_decay=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, initial_accumulator_value=<span class="hljs-number">0</span>, eps=<span class="hljs-number">1e-10</span>)
</code></pre>
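<p><em>A pure-Python sketch of the rule (illustrative values): each step is divided by the square root of all squared gradients accumulated so far, which is why the effective learning rate only ever shrinks.</em></p>

```python
import math

lr, eps = 0.5, 1e-10
w, acc = 0.0, 0.0                            # acc is extra per-weight state
for _ in range(500):
    grad = 2 * (w - 3)
    acc += grad * grad                       # grows for the whole run...
    w -= lr * grad / (math.sqrt(acc) + eps)  # ...so steps keep shrinking
print(w)
```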
<h3 id="heading-rmsprop">RMSprop</h3>
<p>This is a variant of the Adagrad algorithm that fixes its premature-stopping issue. It does so by accumulating only the gradients from the most recent iterations.</p>
<p><strong>Speed</strong>: it is very similar to Adagrad</p>
<p><strong>Memory</strong>: it uses the same memory as Adagrad</p>
<p><strong>Performance</strong>: it converges much faster than Adagrad and does not stall before reaching a local minimum. It was used by machine learning researchers for a long time before Adam came out. It does not perform very well on very simple problems.</p>
<p><strong>Usage</strong>: you might notice there is a new hyperparameter (alpha), but the default values usually work well; this technique can also be combined with momentum.  </p>
<pre><code class="lang-python">torch.optim.RMSprop(params, lr=<span class="hljs-number">0.01</span>, alpha=<span class="hljs-number">0.99</span>, eps=<span class="hljs-number">1e-08</span>, weight_decay=<span class="hljs-number">0</span>, momentum=<span class="hljs-number">0</span>, centered=<span class="hljs-literal">False</span>)
</code></pre>
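<p><em>The only change from Adagrad is that the accumulator becomes an exponential moving average, so old gradients fade away instead of piling up. A pure-Python sketch (illustrative values):</em></p>

```python
import math

lr, alpha, eps = 0.01, 0.9, 1e-8
w, acc = 0.0, 0.0
for _ in range(2000):
    grad = 2 * (w - 3)
    acc = alpha * acc + (1 - alpha) * grad * grad  # recent steps dominate
    w -= lr * grad / (math.sqrt(acc) + eps)
print(w)  # hovers near the minimum at w = 3
```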
<h3 id="heading-adam">Adam</h3>
<p>Adam is a relatively recent gradient descent optimization method; its name stands for adaptive moment estimation. It is a mix between momentum optimization and RMSprop.</p>
<p><strong>Speed</strong>: the most expensive per step, since it combines two methods.</p>
<p><strong>Memory</strong>: the same as RMSprop</p>
<p><strong>Performance</strong>: it usually performs better than RMSprop, since it is a combination of techniques that converges faster on the training data.</p>
<p><strong>Usage</strong>: Adam works well with the default parameters; it is even recommended to leave the learning rate as it is, since the method adapts it automatically.  </p>
<pre><code class="lang-python">torch.optim.Adam(params, lr=<span class="hljs-number">0.001</span>, betas=(<span class="hljs-number">0.9</span>, <span class="hljs-number">0.999</span>), eps=<span class="hljs-number">1e-08</span>, weight_decay=<span class="hljs-number">0</span>, amsgrad=<span class="hljs-literal">False</span>)
</code></pre>
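<p><em>A pure-Python sketch of the two combined ingredients (illustrative values): a running mean of gradients (the momentum part) and a running mean of squared gradients (the RMSprop part), each bias-corrected.</em></p>

```python
import math

lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    grad = 2 * (w - 3)
    m = b1 * m + (1 - b1) * grad          # first moment  (momentum)
    v = b2 * v + (1 - b2) * grad * grad   # second moment (RMSprop)
    m_hat = m / (1 - b1 ** t)             # bias corrections compensate
    v_hat = v / (1 - b2 ** t)             # for the zero initialization
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
print(w)
```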
<h1 id="heading-summary">Summary</h1>
<p>In the end, which optimization algorithm should you use? It depends. Adaptive algorithms are very popular nowadays, but they require more <strong>computational power</strong> and, most of the time, more <strong>memory</strong>. It has been observed that plain SGD often gets better results on the validation set, as it tends to generalize better; adaptive algorithms seem to optimize the training set too much, ending up with high variance and overfitting the data. The problem with SGD is that it might take a long time to reach a minimum, so the total computational resources needed can be much higher than those of adaptive optimizers. So, in the end, if you have plenty of computing resources you should consider using SGD with momentum, as it tends to generalize better. On the other hand, if your resources, especially <strong>time</strong>, are limited, Adam is your best choice.</p>
]]></content:encoded></item><item><title><![CDATA[The Perfect Activation]]></title><description><![CDATA[It might be too bold to call an activation function perfect, given that the No Free Lunch Theorem of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, as misleading as the title can be, I will try ...]]></description><link>https://blog.pol.company/the-perfect-activation</link><guid isPermaLink="true">https://blog.pol.company/the-perfect-activation</guid><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:41:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759959690208/a3bebeef-baa5-4b24-ada6-5895c2a39cab.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It might be too bold to call an activation function perfect, given that the <strong>No Free Lunch Theorem</strong> of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, as misleading as the title can be, I will try to summarize the most widely used activation functions and describe their main differences.</p>
<h1 id="heading-linear-identity">Linear (identity)</h1>
<p>The linear activation function is essentially no activation at all.<br /><strong>Overhead:</strong> fastest, no computation at all<br /><strong>Performance:</strong> bad, since it does not enable a non-linear transformation<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Differentiable at all points</p>
</li>
<li><p>Fast execution</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>Does not provide any non-linear output.</li>
</ul>
<h1 id="heading-sigmoid">Sigmoid</h1>
<p>The Sigmoid activation function is one of the oldest. Initially designed to mimic the activations of neurons in the brain, it has been shown to perform poorly in the hidden layers of artificial neural networks; nevertheless, it is commonly used as a classifier output to transform raw outputs into class probabilities.  </p>
<p><strong>Uses:</strong> it is commonly used in the output layer of binary classification where we need a probability value between 0 and 1.<br /><strong>Overhead:</strong> very expensive because of the exponential term.<br /><strong>Performance:</strong> bad on hidden layers, mostly used on output layers<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Outputs are between 0 and 1, that means that values won't explode.</p>
</li>
<li><p>It is differentiable at every point.</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li><p>Outputs are between 0 and 1, that means outputs might saturate.</p>
</li>
<li><p>Vanishing gradients are possible.</p>
</li>
<li><p>Outputs are always positive (zero-centered functions help achieve faster convergence).</p>
</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.Sigmoid() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.sigmoid()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1mt954pdqqsoha16ear1.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1mt954pdqqsoha16ear1.png" alt="Alt Text" /></a></p>
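<p><em>A tiny numeric sketch (illustrative) of the saturation issue mentioned above: away from zero the output flattens, which starves the gradients.</em></p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))    # 0.5, the midpoint
print(sigmoid(10.0))   # very close to 1: the function has saturated
print(sigmoid(-10.0))  # very close to 0, and always positive
```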
<h1 id="heading-softmax">Softmax</h1>
<p>A generalization of the Sigmoid function to more than two classes: it transforms the outputs into a probability distribution over multiple classes. Used in multiclass classification.<br /><strong>Uses:</strong> used in the output layer of a multiclass neural network.<br /><strong>Overhead:</strong> similar to Sigmoid, but higher because there are more inputs.<br /><strong>Performance:</strong> bad on hidden layers, mostly used on output layers<br /><strong>Advantages:</strong></p>
<ul>
<li>Unlike Sigmoid, it ensures that the outputs are normalized: they sum to 1 across classes.</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>Same as Sigmoid.</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.Softmax(dim=...) 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.softmax()
</code></pre>
<h1 id="heading-hyperbolic-tangent">Hyperbolic Tangent</h1>
<p>The Tanh function has the same shape as the Sigmoid; in fact, it is a scaled and shifted version of it, and it works better in most cases.<br /><strong>Uses:</strong> generally used in hidden layers, as it outputs values between -1 and 1, producing normalized outputs and making learning faster.<br /><strong>Overhead:</strong> very expensive, since it uses exponential terms.<br /><strong>Performance:</strong> similar to Sigmoid but with some added benefits<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Outputs are between -1 and 1, that means that values won't explode.</p>
</li>
<li><p>It is differentiable at every point.</p>
</li>
<li><p>It is zero-centered, unlike Sigmoid.</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li><p>Vanishing gradients.</p>
</li>
<li><p>Gradient saturation.</p>
</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.Tanh() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.tanh()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuz2khss64owahi5vkohr.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuz2khss64owahi5vkohr.png" alt="Alt Text" /></a></p>
<h1 id="heading-relu">ReLU</h1>
<p>ReLU, also called the rectified linear unit, is one of the most commonly used activations, both for its computational efficiency and its great performance. Multiple variations have been created to improve on its flaws.<br /><strong>Uses:</strong> best used in hidden layers, as it provides better performance than tanh and Sigmoid and is computationally faster.<br /><strong>Overhead:</strong> almost none, extremely fast.<br /><strong>Performance:</strong> great performance, recommended for most cases.<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Adds non-linearity to the network.</p>
</li>
<li><p>Does not suffer from vanishing gradients (for positive inputs).</p>
</li>
<li><p>Does not saturate for positive inputs.</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li><p>It suffers from dying ReLU</p>
</li>
<li><p>Not differentiable at x = 0</p>
</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.ReLU() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.relu()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnwwr2cs5ftohlpfua94.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnwwr2cs5ftohlpfua94.png" alt="Alt Text" /></a></p>
<h1 id="heading-leaky-relu">Leaky Relu</h1>
<p>ReLU suffers from the dying ReLU problem, where neurons whose inputs are always negative output 0 and stop updating. Leaky ReLU tries to diminish this problem by replacing the 0 output with a small negative slope.<br /><strong>Uses:</strong> used in hidden layers.<br /><strong>Overhead:</strong> same as ReLU<br /><strong>Performance:</strong> great performance if the hyperparameter is chosen correctly<br /><strong>Advantages:</strong></p>
<ul>
<li>Similar to ReLU and fixes dying ReLU.</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>New hyperparameter to tune.</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.LeakyReLU(negative_slope=...) 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.layers.LeakyReLU(alpha=...)
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" alt="Alt Text" /></a></p>
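<p><em>A pure-Python sketch (illustrative) of the difference: for negative inputs ReLU outputs exactly 0, while Leaky ReLU keeps a small slope so the neuron can still learn.</em></p>

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    return x if x > 0 else negative_slope * x

print(relu(-2.0), leaky_relu(-2.0))  # 0.0 -0.02
print(relu(3.0), leaky_relu(3.0))    # 3.0 3.0 (identical for positives)
```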
<h1 id="heading-parametric-relu">Parametric ReLU</h1>
<p>Takes the same idea as Leaky ReLU, but instead of predefining the leaky hyperparameter, it is added as a parameter that must be learned.<br /><strong>Uses:</strong> used in hidden layers.<br /><strong>Overhead:</strong> a new parameter must be learned for each PReLU in the network.<br /><strong>Performance:</strong> similar to Leaky ReLU, with the slope learned from the data<br /><strong>Advantages:</strong></p>
<ul>
<li>Removes the need to tune a hyperparameter manually</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>The learned parameter is not guaranteed to be optimal, and it increases the overhead, so you might as well try a few values yourself with Leaky ReLU.</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.PReLU(num_parameters=<span class="hljs-number">1</span>) 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.layers.PReLU()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" alt="Alt Text" /></a></p>
<h1 id="heading-elu">ELU</h1>
<p>The ELU (exponential linear unit) was introduced as another alternative to fix the issues you can encounter with ReLU.<br /><strong>Uses:</strong> used in hidden layers<br /><strong>Overhead:</strong> computationally expensive, since it uses an exponential term<br /><strong>Performance:</strong> often competitive with ReLU in hidden layers, at a higher computational cost<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Similar to ReLU.</p>
</li>
<li><p>Produces negative outputs.</p>
</li>
<li><p>Bends smoothly, unlike Leaky ReLU.</p>
</li>
<li><p>Differentiable at x = 0</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>Additional hyperparameter</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.ELU() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.elu()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgc9skcsszpl77o5oxqhm.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgc9skcsszpl77o5oxqhm.png" alt="Alt Text" /></a></p>
<h1 id="heading-other-alternatives">Other alternatives</h1>
<p>There are too many activation functions to cover them all in a single post. Here are some others:</p>
<ul>
<li><p>SeLU</p>
</li>
<li><p>GeLU</p>
</li>
<li><p>CeLU</p>
</li>
<li><p>Swish</p>
</li>
<li><p>Mish</p>
</li>
<li><p>Softplus</p>
</li>
</ul>
<p><em>Note: if it ends with LU it usually comes from ReLU.</em></p>
<h1 id="heading-summary">Summary</h1>
<p>So... having so many choices, which activation should we use? As a <strong>rule of thumb</strong>, you should always try ReLU in the hidden layers first, as it has great performance with minimal computational overhead. After that (if you have enough computing power) you might want to try some of the more complex variations of ReLU or similar alternatives. I would never recommend using Sigmoid, Tanh, or Softmax for any hidden layer; Sigmoid and Softmax should be used whenever we want probability outputs for a classification task. Finally, with the current progress and research in deep learning and AI, new and better functions will surely appear, so keep an eye out.</p>
<p>Remember to <strong>try and experiment always</strong>, you never know which function will work better for a specific task.</p>
]]></content:encoded></item><item><title><![CDATA[How to make your code embarrassingly faster?]]></title><description><![CDATA[Amdahl’s law is most popular in the computer science community, named after Gene Amdahl in 1967. It is said that the more resources we add to a program the faster it goes, but how can we add more resources to the program? Imagine a program as a singl...]]></description><link>https://blog.pol.company/how-to-make-your-code-embarrassingly-faster</link><guid isPermaLink="true">https://blog.pol.company/how-to-make-your-code-embarrassingly-faster</guid><category><![CDATA[C++]]></category><category><![CDATA[openmp]]></category><category><![CDATA[Parallel Programming]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:39:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/plwud_FPvwU/upload/269dc9a37babccdd1fd7d2bc6b9d61bf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Amdahl’s law, named after Gene Amdahl, who formulated it in 1967, is well known in the computer science community. It says that the more resources we give a program, the faster it can go; but how can we add more resources to a program? Imagine a program as a single task: if we could divide it into smaller tasks, and those into smaller tasks still, we would end up with a program made of many tasks; if every task is independent of the others, we could assign each one to a different processor. Dividing work among multiple processors is called <strong>parallel computing</strong>. The size of the tasks relative to their number is called <strong>task granularity</strong>: the smaller the tasks, the finer the granularity.</p>
<p>It is easy to see that if we run many tasks concurrently we can save time, since we can do many things at once. Awesome, right? Well… not always. First of all, creating a new task has an overhead we need to take into account, and tasks may depend on one another (e.g. task A needs to finish before task B starts), so we might not be able to parallelize everything. Finally, our computer can only handle a limited number of concurrent tasks at the same time. So, if you still think parallel computing is the best thing that ever happened, continue reading and I will show you how to make your code embarrassingly parallel (despite the issues above).</p>
<h1 id="heading-openmp-introduction">OpenMP Introduction</h1>
<p>Now, there are multiple ways to parallelize your code, but I will focus on the <strong>OpenMP API</strong> because it is easy to learn and implement. Let’s look at a simple example of how you can parallelize the sum of two vectors.  </p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-title">sum</span><span class="hljs-params">(<span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-keyword">const</span>&amp; v1, <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-keyword">const</span>&amp; v2)</span></span>{

    <span class="hljs-keyword">int</span> size = v1.size(); <span class="hljs-comment">// could have also done v2.size()</span>
    <span class="hljs-function"><span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-title">output</span><span class="hljs-params">(size)</span></span>; <span class="hljs-comment">// initialize output with zeros </span>

    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel </span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp for </span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; size; ++i){
      output[i] = v1[i] + v2[i];
    }

    <span class="hljs-keyword">return</span> output;

}
</code></pre>
<p>Summing two vectors is simple: you just sum each pair of elements individually and write the result into another vector. But what are those <code>#pragma omp (something)</code> lines? Those <code>#pragma omp</code> directives are what let us parallelize the code.</p>
<ul>
<li><p><code>#pragma omp parallel</code> creates a parallel region, which all available threads execute.</p>
</li>
<li><p><code>#pragma omp for</code> divides the loop into k chunks, and each thread executes a different chunk (task). This way each thread sums a different part of the vector. The sum of two vectors is what parallel computing calls <strong>embarrassingly</strong> parallel: we can make the tasks as small as we want and they won’t have any dependencies (the sum of each pair of elements is <strong>independent</strong> of the others). What consequences does this have? We can parallelize as much as the hardware supports, and we don’t have to worry about data being shared between threads.</p>
</li>
</ul>
<h1 id="heading-how-to-avoid-data-races">How to avoid data races</h1>
<p>Imagine that instead of summing two vectors we want the sum of the elements of a single one. It is tempting to use the same OpenMP structure as before, but that would cause a <strong>data race</strong>. A data race happens when two threads access the same data at the same time; it causes <strong>inconsistency</strong> in the results, since the updates can interleave in a different order on every run. To visualize this, let's see how we could solve the issue.  </p>
<pre><code class="lang-cpp">    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; v; 
    ...
    <span class="hljs-keyword">int</span> sum = <span class="hljs-number">0</span>;
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel </span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp for </span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; v.size(); ++i){
      <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp critical </span>
      sum += v[i];
    }
</code></pre>
<p>A simple way to do this is to add a <code>#pragma omp critical</code>; this directive ensures that only one thread at a time executes the code it protects, so when a thread wants to update the sum it has to wait for the other threads to finish doing so. A more efficient way (with less overhead) is to replace <code>critical</code> with <code>atomic</code>. Either way, if you try to execute this code you’ll see that it does not go as fast as you would think, because most of the time the threads are just waiting. I wish there were a way to make it faster…  </p>
<pre><code class="lang-cpp">    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; v; 
    ...
    <span class="hljs-keyword">int</span> sum = <span class="hljs-number">0</span>;
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel </span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp for reduction(+:sum)</span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; v.size(); ++i){
      sum += v[i];
    }
</code></pre>
<p>In fact, there is: the <strong>reduction</strong> clause solves all our problems. When a thread is created, a private copy of <code>sum</code> is created and initialized to 0. Each thread updates its own copy, and in the end the copies are summed together. By now you may be wondering how to make a variable explicitly <strong>private</strong> or <strong>shared</strong>: on the <code>#pragma omp parallel</code> (or combined <code>parallel for</code>) directive you can add a <code>shared(var)</code> clause or a <code>private(var)</code> clause.</p>
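<p>The <code>atomic</code> alternative mentioned above is worth seeing too. A hedged sketch (the function name <code>sum_atomic</code> is mine; compile with <code>-fopenmp</code> to enable the pragmas, otherwise the code simply runs serially and still returns the correct result):</p>

```cpp
#include <vector>

// Same sum as before, but protected with "atomic" instead of "critical":
// atomic maps the update to a hardware atomic instruction, so it carries
// less overhead than a full critical section
int sum_atomic(std::vector<int> const& v) {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        #pragma omp atomic
        sum += v[i];
    }
    return sum;
}
```

<p>Note that <code>atomic</code> only works for simple updates like <code>sum += v[i]</code>; for arbitrary blocks of code you still need <code>critical</code>, and for this particular pattern <code>reduction</code> remains the fastest option.</p>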
<h1 id="heading-controlling-the-number-of-tasks">Controlling the number of tasks</h1>
<p>Until now we have let OpenMP decide for us which tasks are generated, and thus the number of <strong>threads</strong>; but how can we control this?  </p>
<pre><code class="lang-cpp">
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"omp.h"</span> <span class="hljs-comment">// remember to include the omp directive </span></span>

<span class="hljs-comment">// equivalent to setting the env bash var </span>
<span class="hljs-comment">// OMP_NUM_THREADS=N</span>
omp_set_num_threads(N);

<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel num_threads(N)</span>
</code></pre>
<p>The first way is to set the <strong>environment variable</strong> <code>OMP_NUM_THREADS</code>. A second method is to set it with an OpenMP function; this sets the number of threads for all parallel regions. If you want to set the number of <strong>threads</strong> for a specific region, you set it directly on that region with the <code>num_threads</code> clause. But this only controls the number of threads created; what if we want to control the number of tasks?  </p>
<pre><code class="lang-cpp">
<span class="hljs-comment">// sum a vector from index begin to index end and return the value </span>
<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">sum_vector</span><span class="hljs-params">(<span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-keyword">const</span>&amp; v, <span class="hljs-keyword">int</span> begin, <span class="hljs-keyword">int</span> end)</span></span>;

<span class="hljs-keyword">int</span> sum1, sum2; 

<span class="hljs-comment">// n is the size of the vector "v"</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp task </span>
sum1 = sum_vector(v, <span class="hljs-number">0</span>, n / <span class="hljs-number">2</span>); <span class="hljs-comment">// sum first half </span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp task </span>
sum2 = sum_vector(v, n / <span class="hljs-number">2</span>, n); <span class="hljs-comment">// sum second half </span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp taskwait </span>
<span class="hljs-keyword">int</span> sum = sum1 + sum2;
</code></pre>
<p>Based on the sum example, we can add up a vector by dividing the work into two tasks: the first sums the first half and the second sums the second half. That is what we are doing here when we create two tasks with the <strong>task pragma</strong>. Simple, right? You might have noticed something strange after the second task (<code>#pragma omp taskwait</code>): taskwait does what its name says, it waits for the current tasks to finish before continuing the execution (it works as a sort of explicit <strong>barrier</strong>). But why do we have to wait for them to finish? Because if we don’t, <code>sum1</code> and <code>sum2</code> might not yet be available for the final sum. Note also that for the tasks to actually run in parallel, this code must sit inside a parallel region, typically combined with <code>#pragma omp single</code> so that only one thread creates the tasks.</p>
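<p>A self-contained version of the task example, for reference. This is a sketch under my own assumptions: the body of <code>sum_vector</code> and the wrapper name <code>parallel_sum</code> are mine, and it needs <code>-fopenmp</code> to actually run in parallel (without it the pragmas are ignored and the code runs serially, still producing the right answer):</p>

```cpp
#include <numeric>
#include <vector>

// Sum v[begin, end) serially; each task calls this on one half
int sum_vector(std::vector<int> const& v, int begin, int end) {
    return std::accumulate(v.begin() + begin, v.begin() + end, 0);
}

int parallel_sum(std::vector<int> const& v) {
    int n = static_cast<int>(v.size());
    int sum1 = 0, sum2 = 0;
    #pragma omp parallel
    #pragma omp single        // one thread creates the tasks
    {
        #pragma omp task shared(sum1)
        sum1 = sum_vector(v, 0, n / 2);     // sum first half
        #pragma omp task shared(sum2)
        sum2 = sum_vector(v, n / 2, n);     // sum second half
        #pragma omp taskwait                // wait for both tasks
    }
    return sum1 + sum2;
}
```

<p>The <code>shared</code> clauses make explicit that both tasks write into variables visible outside the tasks.</p>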
<h1 id="heading-conclusion">Conclusion</h1>
<p>This has been a very short and limited introduction to OpenMP, since there are a lot of things I wish I could have covered, but parallel computing is a truly extensive topic. Nevertheless, I hope this introduction has opened your eyes to how you can improve your code with simple additions.</p>
]]></content:encoded></item><item><title><![CDATA[How to implement a simple lossless compression in C++]]></title><description><![CDATA[Compression algorithms are one of the most important computer science discoveries. It enables us to save data using less space and transfer it faster. Moreover, compression techniques are so enhanced that even lossy compressions give us an unnoticeab...]]></description><link>https://blog.pol.company/how-to-implement-a-simple-lossless-compression-in-c</link><guid isPermaLink="true">https://blog.pol.company/how-to-implement-a-simple-lossless-compression-in-c</guid><category><![CDATA[C++]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:33:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/-2np-ZIwMAA/upload/552414f83ae5b739b5adcff72439aafd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Compression algorithms are one of the most important discoveries in computer science. They enable us to store data in less space and transfer it faster. Moreover, compression techniques have become so refined that even lossy compression can discard data with barely noticeable loss. Nevertheless, we are not going to talk about lossy compression algorithms but about lossless ones, in particular a very famous one called Huffman encoding. You may know it because it is used in JPEG image compression. In this post we will discover the magic behind this compression algorithm, going step by step until we end up with a very simple implementation in C++.</p>
<h1 id="heading-prefix-property">Prefix property</h1>
<p>Huffman encoding is a code system based on the prefix property. To encode a text, we first collect each distinct character of the text into a set of symbols. To compress each symbol we need a function that converts a character into a code (e.g. a binary string): given a set of symbols Σ we can define a function ϕ: Σ → {0,1}+ that maps each symbol to a code, where Σ contains the distinct characters of the text to be compressed. The simplest prefix encoding would be to assign each letter its binary number, a plain ASCII-to-binary conversion; this encoding essentially maps each character to itself, and it surely does not compress at all. Prefix codes are very easy to decode, since they only need to be read once, left-to-right; this guarantees a decompression runtime complexity of O(n). A common way to represent this type of encoding is a binary tree called the <strong>prefix tree</strong>.<br />For example, let's suppose we have the following set and encoding scheme.</p>
<ul>
<li><p><strong>Symbols:</strong> Σ = {A, B, C, D}</p>
</li>
<li><p><strong>Encoding:</strong> ϕ(A) = 1, ϕ(B)=01, ϕ(C)=000, ϕ(D)=001; then we can represent it with the following tree</p>
<p>  <img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxveg9vsu3tso8jxv7s8y.png" alt="Alt Text" /></p>
</li>
</ul>
<p>As we can see in the tree, to decode a text (e.g. 00010010…) we must <strong>traverse</strong> the tree until we find a leaf (where a character is stored). If the current bit is a 0 we go left, and if it is a 1 we go right. That simple!</p>
<p>After creating the tree, it is easier to save the equivalences (code, character) in a simple table.</p>
<p>A prefix tree has the following properties:</p>
<ul>
<li><p>One leaf per symbol</p>
</li>
<li><p>Left edge labeled 0 and right edge labeled 1</p>
</li>
<li><p>Labels on the path from the root to a leaf specify the code for that leaf.</p>
</li>
</ul>
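<p>The traversal described above can be sketched as a tiny decoder. For brevity this sketch represents the prefix tree as the code table just mentioned, and greedily matches bits against it; the function name <code>decode</code> is mine, not from the post. Because the code is prefix-free, the first match is always the right one:</p>

```cpp
#include <map>
#include <string>

// Decode a bit string using a (code -> character) table built from the tree
std::string decode(std::string const& bits,
                   std::map<std::string, char> const& table) {
    std::string out, current;
    for (char bit : bits) {
        current += bit;                // walk one edge: 0 = left, 1 = right
        auto it = table.find(current);
        if (it != table.end()) {       // reached a leaf: emit its symbol
            out += it->second;
            current.clear();           // restart at the root
        }
    }
    return out;
}
```

<p>With the example scheme above ({"1" → A, "01" → B, "000" → C, "001" → D}), the input "0001" decodes to "CA".</p>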
<h1 id="heading-encoding">Encoding</h1>
<p>Okay, so what do these strange prefix trees have to do with Huffman trees? Well, it turns out Huffman trees are prefix trees, but not just any prefix trees: they are the <strong>optimal prefix trees</strong>. Given a text, an optimal prefix code is a prefix code that minimizes the total number of bits needed to encode that text; in other words, it is the encoding that makes the text smallest (fewer bits = more compression). Note that if you use a Huffman tree to compress data, you should also save the tree with which it was encoded.</p>
<p>Now, how do we find this optimal tree? Well, we need to follow the following steps.</p>
<ol>
<li><p>Find the <strong>frequencies</strong> of each character and save them in a table</p>
</li>
<li><p>For each character, we create a prefix tree consisting of only the leaf node. This node should contain the value of the character and its frequency in the text.</p>
</li>
<li><p>We should have a list of trees now, one per character. Next, we are going to select the two <strong>smallest</strong> trees, we consider a tree to be smaller to another one if its frequency is lower (in case of a tie we select the one with fewer nodes), and we are going to <strong>merge</strong> them into one; that is one of the two should become the left subtree and one the right subtree, afterward, a new parent node is created.</p>
</li>
</ol>
<p>Well, that's it: after joining all the trees you should be left with only one. If you were paying attention, you must have noticed that I didn’t specify how to select the smallest trees from the list of all trees. That is because it depends on the implementation. The fast way is to keep the trees in a MinHeap (a priority queue in C++): each insertion and deletion in the heap has O(log n) complexity, but looking up the minimum is always constant. Thus the total complexity of the encoding algorithm is O(n log n), because we must insert a new tree n times.</p>
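<p>Step 1, counting the frequencies, can be sketched in a few lines (the helper name <code>count_frequencies</code> is an assumption of mine; this builds the table the leaf trees are created from):</p>

```cpp
#include <string>
#include <unordered_map>

// Count how many times each character appears in the text:
// this is the frequency table the initial leaf trees are built from
std::unordered_map<char, int> count_frequencies(std::string const& text) {
    std::unordered_map<char, int> table;
    for (char c : text) ++table[c];
    return table;
}
```
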
<h1 id="heading-implementation">Implementation</h1>
<p>The Huffman compression algorithm is a greedy algorithm; that is, it always makes the locally optimal choice. To implement it, we can create a class called HuffmanTree.</p>
<pre><code class="lang-cpp"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HuffmanTree</span>{</span>
<span class="hljs-keyword">public</span>:

    HuffmanTree(<span class="hljs-keyword">char</span> v, <span class="hljs-keyword">int</span> w);

    HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; tree);

    HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; h1, HuffmanTree <span class="hljs-keyword">const</span>&amp; h2);

    <span class="hljs-keyword">bool</span> <span class="hljs-keyword">operator</span>&lt;(HuffmanTree <span class="hljs-keyword">const</span>&amp; other) <span class="hljs-keyword">const</span>;

<span class="hljs-keyword">private</span>:

    <span class="hljs-comment">// represents a value that will never be read;</span>
    <span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> NULL_VALUE = <span class="hljs-number">-1</span>;

    <span class="hljs-comment">// left subtree</span>
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;HuffmanTree&gt; left;

    <span class="hljs-comment">// right subtree</span>
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;HuffmanTree&gt; right;

    <span class="hljs-keyword">char</span> value; <span class="hljs-comment">// character, null if !isLeaf </span>
    <span class="hljs-keyword">int</span> weight; <span class="hljs-comment">// aka. frequency </span>
    <span class="hljs-keyword">int</span> size; <span class="hljs-comment">// aka. number of nodes </span>
    <span class="hljs-keyword">bool</span> isLeaf;
};
</code></pre>
<p>A HuffmanTree contains, as we said before, the <strong>value</strong> (character), its <strong>weight</strong> (frequency), and the <strong>size</strong> (number of nodes). Finally, it also holds pointers to the left and right subtrees; we use shared pointers to follow modern C++ <strong>smart pointer</strong> practice and avoid worrying about memory leaks.</p>
<p>You may be wondering why would we want to implement three different <strong>constructors</strong>? Well, the first one creates a new tree with a given value and weight.</p>
<pre><code class="lang-cpp">

HuffmanTree::HuffmanTree(<span class="hljs-keyword">char</span> v, <span class="hljs-keyword">int</span> w){
    value = v;
    left = <span class="hljs-literal">nullptr</span>;
    right = <span class="hljs-literal">nullptr</span>;
    weight = w;
    size = <span class="hljs-number">1</span>;
    isLeaf = <span class="hljs-literal">true</span>;
}
</code></pre>
<p>The second constructor is just a copy constructor, that creates a new one based on the old one.</p>
<pre><code class="lang-cpp">HuffmanTree::HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; tree){
    value = tree.value;
    left = tree.left;
    right = tree.right;
    weight = tree.weight;
    size = tree.size;
    isLeaf = tree.isLeaf;
}
</code></pre>
<p>Finally, we need a constructor that merges two different trees.</p>
<pre><code class="lang-cpp">HuffmanTree::HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; h1, HuffmanTree <span class="hljs-keyword">const</span>&amp; h2) {
    left = <span class="hljs-built_in">std</span>::make_shared&lt;HuffmanTree&gt;(h1);
    right =  <span class="hljs-built_in">std</span>::make_shared&lt;HuffmanTree&gt;(h2);
    size = left-&gt;size  + right-&gt;size;
    weight = left-&gt;weight + right-&gt;weight;
    isLeaf = <span class="hljs-literal">false</span>;
    value = NULL_VALUE;
}
</code></pre>
<p>The HuffmanTree class also overloads a comparison operator. Note that the comparison is deliberately inverted (the heavier tree compares as &quot;smaller&quot;): <code>std::priority_queue</code> is a max-heap by default, so inverting the comparison makes it pop the lightest tree first, which is exactly the MinHeap behavior we need.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">bool</span> HuffmanTree::<span class="hljs-keyword">operator</span>&lt;(HuffmanTree <span class="hljs-keyword">const</span>&amp; other) <span class="hljs-keyword">const</span>{
    <span class="hljs-comment">// inverted on purpose: std::priority_queue pops the "largest" element</span>
    <span class="hljs-keyword">if</span>(weight != other.weight) <span class="hljs-keyword">return</span> weight &gt; other.weight;
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">return</span> size &gt; other.size;
}
</code></pre>
<p>Finally, we need to implement the core of the algorithm. As you can see, we first create a HuffmanTree per character and then merge trees until we are left with only one.</p>
<pre><code class="lang-cpp">

    ...

    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">priority_queue</span>&lt;HuffmanTree&gt; minHeap;

    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">auto</span> <span class="hljs-keyword">const</span>&amp; letter : table){
        minHeap.push(HuffmanTree(letter.first, letter.second)); <span class="hljs-comment">// first == char, second == frequency </span>
    }

    <span class="hljs-comment">// join trees</span>
    <span class="hljs-keyword">while</span>(minHeap.size() &gt; <span class="hljs-number">1</span>){
        HuffmanTree min1 = minHeap.top();
        minHeap.pop();
        HuffmanTree min2 = minHeap.top();
        minHeap.pop();
        minHeap.push(HuffmanTree(min1, min2));
    }
</code></pre>
<p>That’s all, you have successfully implemented a Huffman tree. I hope you haven’t gotten lost along the way!</p>
<p>If you have any doubts, please leave a comment.</p>
]]></content:encoded></item><item><title><![CDATA[When accuracy is not enough...]]></title><description><![CDATA[The task of classification has existed long before the invention of machine learning. A problem that may arise when working with different algorithms is the use of an error function that determines if an algorithm is good enough, with classification ...]]></description><link>https://blog.pol.company/when-accuracy-is-not-enough</link><guid isPermaLink="true">https://blog.pol.company/when-accuracy-is-not-enough</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:29:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759958896013/19c51889-5aea-4065-870d-411c0069a128.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The task of classification has existed long before the invention of machine learning. A problem that may arise when working with different algorithms is the use of an <strong>error function</strong> that determines if an algorithm is good enough, with classification algorithms it is no different.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Folg2547hcyx52meiff0l.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Folg2547hcyx52meiff0l.png" alt="Alt Text" /></a></p>
<p>One of the most used metrics for these algorithms is the <strong>accuracy metric</strong>: based on the total number of samples and the predictions made, we return the <strong>percentage of samples</strong> that were correctly classified. But this method does not always work so well. Imagine that we have a total of 1000 samples and an algorithm called <em>DummyAlgorithm</em> that tries to classify them into two different classes (A and B). Unfortunately, DummyAlgorithm does not know anything about the data distribution; as a result, it always tells us that a given sample is of type A. Now imagine that 990 of the samples are of class A (you might see where I'm going). Even though DummyAlgorithm reaches a 99% accuracy rate, it is not a very good algorithm: it has learned nothing about the data.</p>
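<p>The DummyAlgorithm thought experiment is easy to reproduce. A toy sketch (the function name <code>constant_accuracy</code> is mine) computing the accuracy of a classifier that always emits the same prediction:</p>

```cpp
#include <vector>

// Accuracy of a classifier that always predicts the same class,
// regardless of the input
double constant_accuracy(std::vector<char> const& labels, char prediction) {
    if (labels.empty()) return 0.0;
    int correct = 0;
    for (char label : labels) {
        if (label == prediction) ++correct;
    }
    return static_cast<double>(correct) / labels.size();
}
```

<p>On a dataset with 990 samples of class A and 10 of class B, always predicting A scores 99% accuracy while carrying zero information about the data.</p>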
<p>In this post we'll learn how to complement the accuracy metric with other machine learning strategies that do take the problem described above into account, and we'll see some methods to avoid it.</p>
<h1 id="heading-definitions">Definitions</h1>
<p>Before going any further, let's define some basic concepts.</p>
<p><strong>Accuracy:</strong> metric that returns the percentage of correctly classified samples in a dataset</p>
<p><strong>True Positives:</strong> samples that were correctly classified with their respective positive class</p>
<p><strong>True Negatives:</strong> samples that were correctly classified with their respective negative class</p>
<p><strong>False Positives:</strong> samples that were classified as positives but were negatives</p>
<p><strong>False Negatives:</strong> samples that were classified as negatives but were positives</p>
<p><strong>Precision:</strong> accuracy of the positive predictions (TP / (TP + FP))</p>
<p><strong>Recall:</strong> ratio of positive instances that are correctly classified (TP / (TP + FN))</p>
<p><em>Note: when we talk about positives/negatives, we are talking about a specific class</em></p>
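<p>The two formulas above translate directly into code (a minimal sketch, function names mine):</p>

```cpp
// Precision and recall from the confusion counts defined above
// (tp = true positives, fp = false positives, fn = false negatives)
double precision(int tp, int fp) {
    return static_cast<double>(tp) / (tp + fp);
}

double recall(int tp, int fn) {
    return static_cast<double>(tp) / (tp + fn);
}
```

<p>For instance, a classifier with 8 true positives and 2 false positives has a precision of 0.8.</p>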
<h1 id="heading-confusion-matrix">Confusion Matrix</h1>
<p>The confusion matrix creates a division for each of the four possible categorizations. It can be used in multiclass classification. In the following example we are making a binary classification that classifies red dots among other colors.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh7seoqf0t45tjtw12hmz.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh7seoqf0t45tjtw12hmz.png" alt="Alt Text" /></a></p>
<h1 id="heading-precision-vs-recall-tradeoff">Precision vs recall tradeoff</h1>
<p>As with other metrics, there is a decision to make: whether the classifier should favor better precision or better recall. <em>Sometimes you care more about precision than you care about recall</em>. For example, if you wish to detect safe-for-work videos on a social network, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision). On the other hand, suppose you train a classifier to detect shoplifters: it is probably better for the classifier to have as much recall as possible (the security system will get some false alerts, but almost all shoplifters will get caught).</p>
<p>Based on this tradeoff we can define a curve called the <strong>precision/recall curve</strong></p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft16pwipspe97wd6jei4t.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft16pwipspe97wd6jei4t.png" alt="Alt Text" /></a></p>
<h1 id="heading-roc-curve">ROC curve</h1>
<p>The ROC curve (receiver operating characteristic curve) is a very common tool used with binary classifiers. It is very similar to the precision/recall curve, but it plots the <strong>true positive rate</strong> against the <strong>false positive rate</strong>. One way to compare classifiers is to measure the <strong>area under the curve</strong> (AUC). A perfect classifier will have an AUC equal to 1, while a purely random classifier will have a ROC AUC of 0.5.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbo9nz5shg2bs9nttsogk.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbo9nz5shg2bs9nttsogk.png" alt="Alt Text" /></a></p>
<p>As the ROC curve and the precision/recall curve are very similar, it might be difficult to choose between them. A common approach is to use the precision/recall curve whenever the positive class is rare and when you care more about the false positives than the false negatives, and the ROC curve otherwise.</p>
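<p>The AUC has a handy probabilistic reading: it equals the chance that a randomly chosen positive sample is scored above a randomly chosen negative one. A brute-force sketch based on that reading (the function name <code>roc_auc</code> is mine; real libraries compute this far more efficiently from the sorted scores):</p>

```cpp
#include <vector>

// AUC via its probabilistic interpretation: the fraction of
// (positive, negative) pairs where the positive is scored higher
// (ties count as half; O(P*N), fine for an illustration)
double roc_auc(std::vector<double> const& pos_scores,
               std::vector<double> const& neg_scores) {
    double wins = 0.0;
    for (double p : pos_scores) {
        for (double n : neg_scores) {
            if (p > n) wins += 1.0;
            else if (p == n) wins += 0.5;
        }
    }
    return wins / (pos_scores.size() * neg_scores.size());
}
```

<p>A classifier that perfectly separates the classes scores 1.0; one that cannot tell them apart scores 0.5, matching the values quoted above.</p>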
<h1 id="heading-solutions">Solutions</h1>
<p>The accuracy problem essentially happens when the data the model is being tested with is unbalanced. To solve this issue there are several approaches.</p>
<ul>
<li><p>If you have a lot of training data you can discard some of it to create a more balanced dataset. Since your model might generalize worse with less data, this approach should only be used in special cases.</p>
</li>
<li><p>Use a data augmentation technique to increase the data available.</p>
</li>
<li><p>Use a resampling technique, in which you make the training data bigger by reusing the same data (e.g. oversampling the minority class); useful if the data augmentation approach is too complicated.</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>